Context Engineering for AI Agents: What 9,649 Experiments Reveal About Feeding LLMs

A rigorous empirical study demolishes common assumptions about how to structure context for LLM agents operating on structured data.
The Core Insight

Every AI agent needs context. But how should you format that context? YAML? JSON? Markdown? Should you embed schemas directly or let agents retrieve files?
A new paper by Damon McMillan cuts through the noise with hard data: 9,649 experiments across 11 models, 4 formats, and schemas ranging from 10 to 10,000 tables. Using SQL generation as a proxy for programmatic agent operations, the study challenges nearly every “best practice” floating around in agent engineering circles.
The headline result: format doesn’t significantly affect aggregate accuracy (p=0.484). That debate about YAML vs JSON vs Markdown? Statistically irrelevant for most use cases.
But wait — there’s nuance. Individual models, particularly open-source ones, show format-specific sensitivities. What works for Claude might tank on Llama. The universal “best format” doesn’t exist.
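To make the “format” variable concrete, here is a minimal sketch of one invented table schema rendered in three of the formats the study compares; the schema and the serialization code are illustrative assumptions, not material from the paper.

```python
# Minimal sketch: one invented table schema rendered in three formats.
# Requires PyYAML (pip install pyyaml).
import json
import yaml

schema = {
    "table": "orders",
    "columns": [
        {"name": "order_id", "type": "INTEGER", "pk": True},
        {"name": "customer_id", "type": "INTEGER"},
        {"name": "total", "type": "DECIMAL(10,2)"},
    ],
}

as_json = json.dumps(schema, indent=2)
as_yaml = yaml.safe_dump(schema, sort_keys=False)
as_markdown = "### orders\n" + "\n".join(
    f"- {c['name']}: {c['type']}" + (" (PK)" if c.get("pk") else "")
    for c in schema["columns"]
)

# Same information, different shapes once an LLM tokenizes it.
for label, text in [("JSON", as_json), ("YAML", as_yaml), ("Markdown", as_markdown)]:
    print(f"--- {label} ({len(text)} chars) ---\n{text}")
```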
Why This Matters

Architecture choice is model-dependent. File-based context retrieval — where agents pull in relevant schema files rather than receiving everything inline — improves accuracy for frontier models (Claude, GPT, Gemini) by +2.7% (p=0.029). But for open-source models, the aggregate effect is -7.7% with substantial variation by specific model.
Translation: that slick file-native agent architecture you built for GPT-4 might actively harm performance when you swap in Mistral or Llama.
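As a rough illustration of the two architectures, inline context versus file-based retrieval, here is a hedged sketch. The directory layout, tool definition, and function names are assumptions for illustration, not the paper's actual setup.

```python
# Sketch of the two context architectures, assuming a local directory of
# schema files and an OpenAI-style tool/function-calling interface.
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # hypothetical layout: schemas/<table>.sql

def inline_context() -> str:
    """Inline architecture: concatenate every schema file into the prompt."""
    return "\n\n".join(p.read_text() for p in sorted(SCHEMA_DIR.glob("*.sql")))

def read_schema(table: str) -> str:
    """File-native architecture: the agent requests one table's schema on demand."""
    path = SCHEMA_DIR / f"{table}.sql"
    return path.read_text() if path.exists() else f"-- no schema file for {table}"

# Tool description the agent would see in the file-native setup.
READ_SCHEMA_TOOL = {
    "name": "read_schema",
    "description": "Return the CREATE TABLE statement for a single table.",
    "parameters": {
        "type": "object",
        "properties": {"table": {"type": "string"}},
        "required": ["table"],
    },
}
```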
Model capability is the dominant factor. A 21 percentage point accuracy gap separates frontier and open-source tiers. This dwarfs any format or architecture effect. Before optimizing your prompt template, ask: am I using a model capable enough to benefit from these optimizations?
File-native agents can scale. The study demonstrates successful navigation of schemas with 10,000 tables through domain-partitioned organization. This validates the “treat context like a codebase” approach for large-scale agent deployments.
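The paper demonstrates domain partitioning at the 10,000-table scale but does not prescribe a specific layout; one plausible sketch, with invented directory and file names, looks like this:

```python
# Sketch: domain-partitioned schema layout plus a top-level index the agent
# reads first, so it never has to load all 10,000 tables at once.
#
#   schemas/
#     sales/        orders.sql, invoices.sql, ...
#     inventory/    products.sql, warehouses.sql, ...
#     hr/           employees.sql, payroll.sql, ...
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # hypothetical root, partitioned by domain

def build_index() -> str:
    """One-screen index: domain name -> table names, no column details."""
    lines = []
    for domain in sorted(p for p in SCHEMA_DIR.iterdir() if p.is_dir()):
        tables = sorted(f.stem for f in domain.glob("*.sql"))
        lines.append(f"{domain.name}: {', '.join(tables)}")
    return "\n".join(lines)

# The agent gets build_index() up front, then drills into individual
# domain files (e.g. schemas/sales/orders.sql) only when needed.
```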
Compact isn’t always efficient. Here’s a counterintuitive finding: smaller file formats can consume more tokens at scale. Why? Format-unfamiliar search patterns. When agents don’t recognize a format’s structure, they compensate with more verbose exploration. Token cost ≠ file size.
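To see the gap between byte count and token count for yourself, a quick sketch like the one below works; it assumes tiktoken with a GPT-4-class encoding, and your target model's tokenizer will differ.

```python
# Sketch: bytes vs. tokens for the same content in two formats.
# Requires tiktoken (pip install tiktoken); the encoding choice is an assumption.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "json": '{"table":"orders","columns":["order_id","customer_id","total"]}',
    "markdown": "### orders\n- order_id\n- customer_id\n- total",
}

for label, text in samples.items():
    n_bytes = len(text.encode("utf-8"))
    n_tokens = len(enc.encode(text))
    print(f"{label:8s} bytes={n_bytes:4d} tokens={n_tokens:3d}")

# The cheaper-looking format on disk is not necessarily cheaper in context,
# and the study's point goes further: unfamiliar formats can also trigger
# extra exploratory turns, which no static count captures.
```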
Key Takeaways
Tailor decisions to model capability. Universal best practices are a myth. Test your specific model with your specific format and architecture before committing.
Frontier models get file-native benefits; open-source models vary. If you’re building for GPT-4/Claude, file-based retrieval probably helps. For open-source deployment, benchmark first.
Format sensitivity exists at the model level. Open-source models show stronger format preferences than frontier models. Know your model’s quirks.
Domain partitioning enables scale. For massive schemas, organize by domain and let agents navigate. This works — with the right model.
Measure tokens, not bytes. Your “efficient” compact format might trigger inefficient agent behavior. Profile actual token consumption in production.
The Research Gap
The study uses SQL generation as a proxy. Real agent operations involve multi-step reasoning, tool selection, and error recovery — behaviors potentially affected by context format in ways SQL tasks don’t capture.
And TOON (Token-Oriented Object Notation) — a format specifically designed for LLM consumption — shows promise but needs more investigation. If format truly doesn’t matter for current models, maybe we haven’t found the right format yet.
Looking Ahead
This paper represents the kind of empirical rigor AI engineering desperately needs. Too much agent development relies on vibes, anecdotes, and “what worked for that one demo.” Systematic studies like this build actual knowledge.
The meta-lesson: don’t assume, measure. Your intuitions about what helps LLMs might be wrong. The gap between frontier and open-source models means advice from someone using Claude might actively mislead someone deploying Llama. Context matters — including the context of which model you’re using.
For teams building production agents, the practical implications are clear:
1. Benchmark your specific model/format/architecture combination (a minimal harness is sketched after this list)
2. Expect to re-benchmark when swapping models
3. Invest in model capability before prompt engineering
4. Design file organization for agent navigation, not human readability
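For point 1, a minimal benchmarking harness might look like the sketch below; the model list, the task format, and the run_task and score hooks are placeholders you would replace with your own stack.

```python
# Sketch: grid-benchmark every model/format/architecture combination you
# actually plan to ship. run_task() and score() are hypothetical hooks.
from itertools import product
from statistics import mean

MODELS = ["frontier-model", "open-source-model"]      # placeholders
FORMATS = ["json", "yaml", "markdown"]
ARCHITECTURES = ["inline", "file_retrieval"]

def run_task(model: str, fmt: str, arch: str, task: dict) -> str:
    """Call your agent stack for one task; implementation is yours."""
    raise NotImplementedError

def score(output: str, task: dict) -> float:
    """Return 1.0 for a correct answer, 0.0 otherwise; implementation is yours."""
    raise NotImplementedError

def benchmark(tasks: list[dict]) -> dict[tuple[str, str, str], float]:
    """Accuracy per (model, format, architecture) cell of the grid."""
    results = {}
    for model, fmt, arch in product(MODELS, FORMATS, ARCHITECTURES):
        accuracies = [score(run_task(model, fmt, arch, t), t) for t in tasks]
        results[(model, fmt, arch)] = mean(accuracies)
    return results
```

Expect to rerun the whole grid (point 2) whenever you swap models, since the study's central finding is that the winning cell is model-dependent.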
The “universal best practice” is dead. Long live empirical testing.
Based on analysis of “Structured Context Engineering for File-Native Agentic Systems” (arXiv:2602.05447) by Damon McMillan
Tags: #AI #AIAgent #ContextEngineering #LLM #PromptEngineering #EmpiricalResearch