OpenAI’s Quiet Rebellion Against Nvidia: What the Cerebras Deal Reveals
A dinner-plate-sized chip just changed the AI infrastructure game.
OpenAI’s new Codex-Spark coding model runs at 1,000 tokens per second — fast enough to make competitive coding agents feel glacial. But the real story isn’t the speed. It’s what’s powering it: Cerebras’ Wafer Scale Engine 3, not an Nvidia GPU in sight.
The Core Insight
OpenAI has spent the past year systematically dismantling its dependence on Nvidia. The Cerebras partnership, announced in January and delivering Codex-Spark as its first product, is just one piece of a larger strategy:
- October 2025: Massive multi-year deal with AMD
- November 2025: $38 billion cloud computing agreement with Amazon
- Ongoing: Custom AI chip design for TSMC fabrication
Meanwhile, a planned $100 billion infrastructure deal with Nvidia has “fizzled.” Reuters reported that OpenAI grew dissatisfied with the speed of Nvidia chips for inference tasks — exactly the workload Codex-Spark was built for.
The message is clear: when you’re burning billions on inference, vendor lock-in isn’t just expensive — it’s strategically dangerous.
Why This Matters
The Cerebras approach is architecturally radical. Their Wafer Scale Engine is literally the size of a dinner plate: a single chip fabricated across an entire silicon wafer rather than diced into individual processors. Because model weights can sit in the wafer's on-chip memory instead of being shuttled in from external memory on every token, it sidesteps the memory bandwidth bottleneck that throttles traditional GPU clusters during inference.
Cerebras has measured 2,100 tokens per second on Llama 3.1 70B and 3,000 tokens per second on OpenAI’s own open-weight gpt-oss-120B. That Codex-Spark “only” hits 1,000 tokens per second suggests either a larger model or a more complex architecture under the hood.
For AI coding agents, speed isn’t just a nice-to-have — it’s the competitive moat. When Anthropic’s Claude Code can iterate three times in the time your agent iterates once, developer workflows naturally shift. OpenAI’s Cerebras-powered Codex-Spark is a direct response: win on speed, or lose the developer market.
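To put those throughput figures in wall-clock terms, here is a minimal sketch. The Cerebras and Codex-Spark rates are the ones cited above; the per-turn token count and the baseline GPU serving rate are assumptions for illustration only.

```python
# Back-of-envelope only: the response length and the GPU baseline are assumptions,
# not figures from the article; the other throughput numbers are the ones cited above.
TOKENS_PER_RESPONSE = 1_500  # assumed size of one coding-agent turn (code + explanation)

throughputs_tok_per_s = {
    "typical GPU serving (assumed)": 100,
    "Codex-Spark on Cerebras (cited)": 1_000,
    "Llama 3.1 70B on Cerebras (cited)": 2_100,
    "gpt-oss-120B on Cerebras (cited)": 3_000,
}

for name, tps in throughputs_tok_per_s.items():
    seconds_per_turn = TOKENS_PER_RESPONSE / tps
    turns_per_minute = 60 / seconds_per_turn
    print(f"{name:35s}  {seconds_per_turn:5.1f} s/turn  ~{turns_per_minute:4.0f} turns/min")
```

Under these assumed numbers, the gap between roughly 100 and 1,000 tokens per second is the gap between a 15-second wait and a 1.5-second one per agent turn, which is exactly the iteration advantage described above.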
But the strategic implications extend beyond coding. Every major AI company is racing to reduce Nvidia dependence:
- Google: Custom TPUs, now in their Trillium generation
- Amazon: Trainium chips deployed internally and for AWS customers
- Meta: Custom MTIA accelerators
- Microsoft: Maia AI chip in development
OpenAI’s multi-vendor approach — AMD, Amazon, Cerebras, plus custom silicon — is the most aggressive diversification play in the industry.
Key Takeaways
Inference economics drive chip strategy: Training is a one-time cost; inference runs forever. Optimizing for inference speed at scale justifies unconventional silicon (a rough cost sketch follows below).
Speed is product differentiation for AI coding: A model that codes faster lets developers iterate faster. At the margin, this wins market share.
The Nvidia monopoly is fracturing: Not collapsing — Nvidia still dominates training — but the monoculture is ending. Alternative architectures are proving themselves.
Custom silicon is becoming table stakes: Every major AI lab is either building its own chips or diversifying suppliers. The days of “just buy more H100s” are numbered.
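To see why inference dominates the economics, here is a deliberately crude cost sketch. Every number in it is an assumption picked for illustration; none comes from the article or from any vendor.

```python
# Illustrative only: every figure here is an assumption chosen to show the shape
# of the argument, not a real cost estimate for any model or vendor.
training_cost = 100e6            # assumed one-time training spend, USD
requests_per_day = 100e6         # assumed inference requests per day
tokens_per_request = 2_000       # assumed output tokens generated per request
cost_per_million_tokens = 2.00   # assumed serving cost, USD per 1M output tokens
serving_days = 365 * 2           # assumed two-year serving lifetime

daily_million_tokens = requests_per_day * tokens_per_request / 1e6
inference_cost = daily_million_tokens * cost_per_million_tokens * serving_days

print(f"one-time training:   ${training_cost / 1e6:,.0f}M")
print(f"lifetime inference:  ${inference_cost / 1e6:,.0f}M")
print(f"inference / training ratio: {inference_cost / training_cost:.1f}x")
```

Under these made-up numbers, recurring serving costs overtake the one-time training bill well before the model is retired, which is the whole argument for optimizing silicon around inference rather than training.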
Looking Ahead
The Codex-Spark release is a proof point, not a finish line. Expect OpenAI’s custom TSMC chip to target similar inference workloads with even tighter integration.
For developers, the near-term win is clear: faster coding agents mean faster iteration. But the industry implications are larger. We’re watching the hardware supply chain transform from a monopoly to an oligopoly, with specialized silicon competing on different workload types.
The AI companies that win the next decade won’t just have the best models. They’ll have the best silicon strategy. OpenAI just showed their hand.
Speed kills — and in AI inference, speed is everything.
Based on analysis of “OpenAI sidesteps Nvidia with unusually fast coding model on plate-sized chips” – Ars Technica