Structured Context Engineering: What Actually Works for AI Agents

A 9,649-experiment study challenges conventional wisdom about how to feed AI agents structured data


How should you structure the context that AI agents consume? The answer might surprise you: it depends on your model, and the conventional wisdom might be wrong.

A new arXiv paper presents the largest systematic study of context engineering for AI agents working with structured data — and the findings challenge several commonly held assumptions.

The Core Insight

The research used SQL generation as a proxy for programmatic agent operations, running 9,649 experiments across:
– 11 different models
– 4 formats (YAML, Markdown, JSON, TOON)
– Schemas ranging from 10 to 10,000 tables

The conclusions challenge what many practitioners take as gospel:

1. Architecture Choice Is Model-Dependent

  • File-based context retrieval improves accuracy for frontier-tier models (Claude, GPT, Gemini): +2.7%, statistically significant (p=0.029); see the sketch after this list
  • But for open-source models, it shows mixed results: aggregate -7.7% (p<0.001)
  • There’s no universal best practice
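
To make the file-based pattern concrete: instead of pasting the full schema into the prompt, the agent keeps schema files on disk and searches them on demand. A minimal sketch in Python, assuming a hypothetical schemas/ directory with one file per table or domain (this is one plausible reading of the pattern, not the paper’s exact harness):

```python
import subprocess
from pathlib import Path

SCHEMA_DIR = Path("schemas")  # hypothetical: one YAML file per table or domain

def find_relevant_schemas(keyword: str) -> list[Path]:
    """Case-insensitive recursive grep over schema files (Unix grep).

    Returns only the files that mention the keyword, so the agent
    reads a handful of tables instead of the whole schema.
    """
    result = subprocess.run(
        ["grep", "-ril", keyword, str(SCHEMA_DIR)],
        capture_output=True, text=True,
    )
    return [Path(p) for p in result.stdout.splitlines()]

# Pull only the matching files into the model's context:
context = "\n\n".join(p.read_text() for p in find_relevant_schemas("invoice"))
```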

2. Format Doesn’t Matter (Much)

Surprisingly, format choice (YAML, JSON, Markdown, TOON) doesn’t significantly affect aggregate accuracy (χ² = 2.45, p = 0.484).

However, individual open-source models show format-specific sensitivities — so while the average is neutral, your specific model might care a lot.
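
For a feel of what varied across conditions, here is one small table rendered in each of the four formats. The paper’s exact serializations aren’t reproduced here; in particular, the TOON lines follow that format’s tabular syntax but should be treated as illustrative:

```
# YAML
table: orders
columns:
  - {name: id, type: INTEGER}
  - {name: total, type: DECIMAL}

# JSON
{"table": "orders",
 "columns": [{"name": "id", "type": "INTEGER"},
             {"name": "total", "type": "DECIMAL"}]}

# Markdown
| column | type    |
|--------|---------|
| id     | INTEGER |
| total  | DECIMAL |

# TOON
table: orders
columns[2]{name,type}:
  id,INTEGER
  total,DECIMAL
```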

3. Model Capability Dominates Everything

The 21 percentage point accuracy gap between frontier and open-source models dwarfs any format or architecture effect.

This is the most important finding: don’t obsess over context formatting until you’ve picked your model.

4. File-Native Agents Scale

File-native agents can scale to 10,000 tables through domain-partitioned schemas while maintaining high navigation accuracy. Context structure matters for scale, just not in the way people assumed.
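
The paper’s exact layout isn’t reproduced here, but a domain-partitioned schema might look something like this on disk, letting the agent narrow first to a domain and then to individual tables (all names hypothetical):

```
schemas/
├── sales/
│   ├── _domain.md      # one-paragraph summary of the domain
│   ├── orders.yaml
│   └── customers.yaml
├── finance/
│   ├── invoices.yaml
│   └── payments.yaml
└── ...                 # more domains, thousands of tables in total
```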

5. File Size Is a Poor Efficiency Predictor

Compact or novel formats can actually incur token overhead, driven by dense grep output and the model’s unfamiliarity with their patterns. The relationship between file size and runtime efficiency is more complex than expected.
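
One practical check: measure candidate renderings with the tokenizer you will actually deploy against, since character count and token count can diverge. A minimal sketch using OpenAI’s tiktoken package (the sample strings and the cl100k_base encoding are illustrative choices, not the paper’s setup):

```python
import json
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# The same tiny schema in two renderings: fewer characters does not
# automatically mean fewer tokens once the tokenizer has its say.
samples = {
    "json": json.dumps({"table": "orders", "columns": ["id", "total"]}),
    "toon": "table: orders\ncolumns[2]: id,total",  # illustrative TOON-style text
}
for fmt, text in samples.items():
    print(f"{fmt}: {len(text)} chars, {len(enc.encode(text))} tokens")
```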

Why This Matters

These findings have practical implications:

  • Don’t optimize prematurely — Get your model choice right first
  • Test your specific setup — Generic advice may not apply to your combination; a minimal harness is sketched after this list
  • Frontier models are more forgiving — They’re less sensitive to context engineering choices
  • Open-source requires more care — Context optimization matters more for non-frontier models
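
Acting on “test your specific setup” can be as lightweight as a paired evaluation: run one question set through each candidate context configuration and compare accuracy. A minimal harness in Python, where run_agent and the benchmark questions are hypothetical stand-ins for your own agent and test set:

```python
def run_agent(question: str, config: dict) -> str:
    """Stub: replace with a real call to your agent under `config`."""
    return "SELECT ..."  # placeholder answer

def evaluate(configs: dict[str, dict], questions: list[tuple[str, str]]) -> None:
    """Score every context configuration on the same question set."""
    for name, config in configs.items():
        correct = sum(run_agent(q, config) == gold for q, gold in questions)
        print(f"{name}: {correct}/{len(questions)} correct")

evaluate(
    configs={
        "yaml-files": {"format": "yaml", "retrieval": "file"},
        "json-inline": {"format": "json", "retrieval": "inline"},
    },
    questions=[("How many orders shipped in May?", "SELECT COUNT(*) ...")],
)
```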

Key Takeaways

  • Model choice > context engineering — A better model beats better formatting
  • The “right” approach depends on your model — One size doesn’t fit all
  • Scale is achievable with proper architecture — 10,000 tables is tractable
  • Conventional wisdom needs testing — What everyone “knows” may be wrong

Looking Ahead

This research provides evidence-based guidance for deploying AI agents on structured systems. The main takeaway: treat architectural decisions as hypotheses to validate with your specific use case, not as universal truths.

The era of “spray and pray” context engineering is ending. The era of evidence-based context optimization is just beginning.


Based on arXiv paper 2602.05447: “Structured Context Engineering for File-Native Agentic Systems”
