How 16 AI Agents Built a Working C Compiler from Scratch

What happens when you let 16 Claude instances collaborate on building a C compiler that can compile the Linux kernel? Anthropic’s Nicholas Carlini just showed us, and the results upend common assumptions about how far AI-assisted development can scale.
The Core Insight

Forget pair programming. Forget copilots. Anthropic researcher Nicholas Carlini developed agent teams: a new paradigm in which multiple AI instances work in parallel on a shared codebase without active human intervention. The stress test? Building a C compiler, written in Rust, from scratch.
The numbers are staggering:
- Nearly 2,000 Claude Code sessions
- $20,000 in API costs
- 100,000 lines of code produced
- Result: A compiler that can build bootable Linux 6.9 on x86, ARM, and RISC-V
This isn’t a toy demo. The compiler passes 99% of the GCC torture test suite and can compile real-world software including Redis, SQLite, PostgreSQL, FFmpeg, QEMU—and yes, it can compile and run Doom.
Why This Matters
Beyond Pair Programming
The “agent teams” approach represents a fundamental shift from human-AI collaboration to AI-AI collaboration with human supervision. Carlini’s harness is elegantly simple: an infinite loop that spawns Claude in fresh containers, each agent claiming tasks by writing lock files to a shared git repo.
When two agents try to claim the same task, git’s synchronization forces the second agent to pick something else. Merge conflicts happen frequently, but, in Carlini’s words, “Claude is smart enough to figure that out.”
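Carlini hasn’t published the harness in this form, but the description maps onto very little code. Here is a minimal sketch under those assumptions; the container image name, directory layout, and flags are illustrative, not his actual setup:

```python
# Hypothetical sketch of the agent-team harness: an infinite loop that
# spawns fresh containers, with tasks claimed via lock files pushed to a
# shared git repo. Image name, paths, and flags are assumptions.
import subprocess
import time

def git(*args: str) -> subprocess.CompletedProcess:
    """Run a git command in the shared checkout and capture its output."""
    return subprocess.run(["git", *args], capture_output=True, text=True)

def claim_task(task_id: str, agent_id: str) -> bool:
    """Claim a task by committing and pushing a lock file. If another
    agent pushed the same lock first, our push is rejected; we resync
    and report failure so the caller picks a different task."""
    lock_path = f"locks/{task_id}"          # assumes a locks/ directory
    with open(lock_path, "w") as f:
        f.write(agent_id)
    git("add", lock_path)
    git("commit", "-m", f"claim {task_id} ({agent_id})")
    if git("push").returncode != 0:         # lost the race to another agent
        git("fetch", "origin")
        git("reset", "--hard", "origin/main")
        return False
    return True

def harness() -> None:
    """Spawn Claude sessions forever, one fresh container per session."""
    while True:
        agent_id = f"agent-{int(time.time())}"
        subprocess.run([
            "docker", "run", "--rm",
            "-v", "./compiler-repo:/repo",  # the shared git checkout
            "claude-agent-image",           # hypothetical image name
            "--agent-id", agent_id,
        ])
```

The design note here is that git itself is the coordination layer: a rejected push is the only locking primitive the agents need.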
The Art of Test Engineering
The most valuable insight isn’t about the agents—it’s about the infrastructure. Carlini emphasizes that test quality is everything. When Claude works autonomously, it will solve whatever problem your tests define. If the tests are wrong, Claude solves the wrong problem.
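To make that concrete: the natural oracle for a compiler is differential testing against a trusted reference. A minimal sketch, assuming the new compiler is invoked as mycc (a hypothetical name) and GCC is on the PATH:

```python
# Differential testing sketch: the compiler is only as "correct" as this
# check. "mycc" is a hypothetical name for the agents' compiler binary.
import os
import subprocess
import tempfile

def run_with(compiler: str, source: str) -> str:
    """Compile a C source string with the given compiler, run the
    resulting binary, and return its stdout."""
    with tempfile.TemporaryDirectory() as d:
        src, exe = os.path.join(d, "t.c"), os.path.join(d, "t")
        with open(src, "w") as f:
            f.write(source)
        subprocess.run([compiler, src, "-o", exe], check=True)
        return subprocess.run([exe], capture_output=True, text=True).stdout

def differential_test(source: str) -> bool:
    """Pass iff the new compiler's program behaves like GCC's."""
    return run_with("mycc", source) == run_with("gcc", source)

print(differential_test('#include <stdio.h>\nint main(){printf("%d\\n",1<<4);}'))
```

Every behavior this check ignores, such as exit codes, stderr, or handling of undefined behavior, is a gap an autonomous agent can and will ship through.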
Key harness engineering lessons:
- Context window pollution: Don’t flood Claude with thousands of lines of output. Print errors on single lines so grep works.
- Time blindness: LLMs can’t tell time, and left alone they’ll happily spend hours running tests. Include deterministic random sampling and progress indicators (see the runner sketch after this list).
- Parallelism bottlenecks: When all 16 agents hit the same bug while compiling the Linux kernel, they overwrote each other’s fixes. The solution? Use GCC as an oracle to split the remaining failures into disjoint file subsets so agents work on different parts of the tree.
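The first two lessons translate almost directly into harness code. One possible shape for the runner, where the tests/ layout and the run_one_test.sh wrapper are illustrative assumptions:

```python
# Sketch of a grep-friendly, time-bounded test runner: deterministic
# random sampling keeps runs short and reproducible, and every failure
# prints as a single line. Paths and the wrapper script are hypothetical.
import glob
import random
import subprocess

def run_sampled_tests(sample_size: int = 50, seed: int = 0) -> None:
    """Run a deterministic random subset of the suite."""
    tests = sorted(glob.glob("tests/*.c"))     # hypothetical layout
    rng = random.Random(seed)                  # same seed -> same subset
    sample = rng.sample(tests, min(sample_size, len(tests)))
    for i, test in enumerate(sample, 1):
        print(f"[{i}/{len(sample)}] {test}", flush=True)  # progress indicator
        try:
            proc = subprocess.run(
                ["./run_one_test.sh", test],   # hypothetical wrapper script
                capture_output=True, text=True, timeout=30,
            )
        except subprocess.TimeoutExpired:
            print(f"FAIL {test}: timeout after 30s")
            continue
        if proc.returncode != 0:
            lines = proc.stderr.strip().splitlines()
            summary = lines[-1] if lines else "no output"
            print(f"FAIL {test}: {summary}")   # one line per failure, grep-able
```

The third lesson can reuse the differential check sketched earlier: source files where the new compiler disagrees with GCC become disjoint work lists, one per agent.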
Specialization Works
Beyond parallelizing the same task, Carlini assigned agents distinct roles (a configuration sketch follows this list):
- One agent coalesced duplicate code
- Another optimized compiler performance
- Another critiqued the design from a “Rust developer perspective”
- Another maintained documentation
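One simple way to wire this up is to hand each spawned agent a role-specific prompt. The sketch below paraphrases the roles above; the prompt wording and queue layout are assumptions, not Carlini’s actual instructions.

```python
# Hypothetical role assignment: a few agents get specialist prompts, the
# rest work the shared task queue. Prompt wording is illustrative only.
ROLES = {
    "deduplicator": "Find duplicated logic in the compiler and coalesce it.",
    "optimizer": "Profile the compiler and improve its performance.",
    "rust-critic": "Critique recent commits from an expert Rust developer's perspective.",
    "doc-writer": "Keep the documentation in sync with the code.",
}

def role_prompt(agent_index: int) -> str:
    """Return the system prompt for the nth spawned agent."""
    specialists = list(ROLES.values())
    if agent_index < len(specialists):
        return specialists[agent_index]
    return "Pick an unclaimed task from tasks/ and work on it."
```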
Key Takeaways

- Agent teams enable project-scale automation: Not just functions or files—entire 100k-line codebases
- Test engineering is the new bottleneck: Your agents are only as good as your verification
- LLM limitations require design workarounds: Context pollution, time blindness, and parallelism challenges all need architectural solutions
- Opus 4.6 is a capability threshold: Earlier models could barely produce a functional compiler; 4.6 can build one that compiles Linux
- Cost vs. human time: $20,000 sounds expensive until you consider what human engineering time would cost for an equivalent project
The Limitations
Carlini is refreshingly honest about what the compiler can’t do:
- No 16-bit x86 code generator (it still calls GCC for real-mode boot code)
- The assembler and linker are buggy; the demo used GCC’s
- Generated code is less efficient than GCC’s -O0 output
- The compiler’s own Rust code quality is “reasonable” but not expert-level
These limitations are themselves a benchmark—they show exactly where frontier AI capabilities currently end.
Looking Ahead
This project is designed as a capability benchmark, not a production tool. Carlini explicitly frames it as stress-testing what LLMs can “just barely achieve today” to prepare for what they’ll “reliably achieve in the future.”
The implication is clear: if 16 agents can build a working C compiler today, what will they build in 2027? The answer might be anything your test harness can adequately specify.
For developers and technical leaders, the takeaway isn’t “let AI build your compiler.” It’s this: the quality of your automation infrastructure determines the ceiling of what AI can build for you. Invest in tests, verification, and harness engineering. That’s where the real leverage lives.
Based on analysis of “Building a C compiler with a team of parallel Claudes” by Nicholas Carlini (Anthropic)