Monty: The Rust-Based Python Interpreter That Makes LLM Code Execution Safe
Here’s a truth every AI engineer has confronted: letting an LLM run arbitrary Python code is terrifying, but sandboxing it properly is painful. Docker containers take hundreds of milliseconds to start. Pyodide cold starts are measured in seconds. And just running code directly? That’s a security incident waiting to happen.
Pydantic just released Monty, and it might solve this problem elegantly. It’s a minimal Python interpreter written in Rust, designed specifically for AI agents to write code safely.
The Core Insight
The premise is simple but powerful: LLMs work faster, cheaper, and more reliably when they write code instead of making sequential tool calls.
This isn’t theoretical. Cloudflare documented it with “Code Mode.” Anthropic’s “programmatic tool calling” research confirms it. Hugging Face’s smolagents library is built on the same principle.
The problem has always been execution safety. Until now, your options were:
| Solution | Startup Latency | Security | Complexity |
|---|---|---|---|
| Docker | 195ms | Good | Medium |
| Pyodide/WASM | 2800ms | Poor | Medium |
| Sandboxing Service | 1033ms | Strict | Medium |
| Just run it (YOLO) | 0.1ms | Non-existent | Easy |
| Monty | 0.06ms | Strict | Easy |
Read that again: 0.06 milliseconds. That’s more than three orders of magnitude faster than Docker, and over four orders faster than Pyodide.
Why does this matter? Because when your agent needs to iterate (run, check, modify, run again), those startup costs compound. On startup latency alone, a Docker-based approach tops out around 5 iterations per second (1 s / 195 ms ≈ 5); Monty could theoretically manage over 15,000 (1 s / 0.06 ms ≈ 16,700).
Why This Matters
1. Controlled Access by Design
Monty doesn’t try to be a full Python runtime. It deliberately excludes:
– Standard library (except `sys`, `typing`, `asyncio`, and a few others)
– Third-party packages
– Class definitions (coming soon)
– Match statements (coming soon)
This isn’t a limitation; it’s the point. Your agent doesn’t need `import os` to call filesystem functions. It needs controlled access to functions you explicitly provide.
External functions are your interface. The LLM writes Python that calls `fetch(url)` or `read_file(path)`. You implement those functions and decide exactly what they can do. The sandbox becomes trivial because nothing dangerous is exposed.
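To make the pattern concrete, here’s a minimal sketch. Only `start()` and `resume()` mirror the snippet later in this post; the import path, the `Monty` constructor, the `external_functions` argument, and the attributes on the paused result are illustrative assumptions, not the confirmed API.

```python
from monty import Monty  # import path assumed

# Code the LLM wrote; fetch() is not defined here, it's an external
# function the host provides.
AGENT_CODE = "page = fetch(url)\nlen(page)"

def fetch(url: str) -> str:
    # Host-controlled implementation: enforce whatever policy you want
    # (allow-listed hosts, timeouts, response size caps, ...).
    if not url.startswith('https://example.com'):
        raise PermissionError('host not on the allow-list')
    return '<html>...</html>'

m = Monty(AGENT_CODE, external_functions=['fetch'])  # signature assumed

# Execution pauses at the fetch() call; run the real implementation
# and hand the value back to the interpreter.
progress = m.start(inputs={'url': 'https://example.com'})
final = progress.resume(return_value=fetch(*progress.args))  # .args assumed
```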
2. Pause, Resume, and Snapshot
This is where Monty gets interesting for agent architectures:
```python
# Start execution; pauses when the external fetch() is called
# (`m` is a configured interpreter, e.g. as sketched above)
result = m.start(inputs={'url': 'https://example.com'})

# State can be serialized at the pause point
state = result.dump()

# Later, in a different process, restore and continue
restored = MontySnapshot.load(state)
final = restored.resume(return_value='response data')
```
You can serialize interpreter state at any external function call. Store it in a database. Resume in a different process. Fork execution paths. This enables durable execution patterns that are typically complex to implement.
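Forking, for instance, could look like this sketch: it reuses only the `dump()`, `load()`, and `resume()` calls shown above, with one serialized snapshot resumed twice under different return values.

```python
from monty import MontySnapshot  # import path assumed

state = result.dump()  # serialized state: safe to stash in a database or queue

# Two independent restorations of the same pause point...
path_a = MontySnapshot.load(state).resume(return_value='cached response')
path_b = MontySnapshot.load(state).resume(return_value='live response')
# ...let you compare outcomes, retry with different inputs, or hand the
# snapshot to a different worker entirely.
```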
3. Type Checking Included
Monty ships with ty integrated for type checking. You can validate LLM-generated code against type stubs before execution, catching errors up front rather than at runtime. That’s crucial when the code comes from a model that might hallucinate function signatures.
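Concretely, the stubs can simply describe the external functions you expose. The file name and signatures below are hypothetical:

```python
# externals.pyi (hypothetical): stubs for the host-provided functions.
# If the model generates fetch(url=42) or misspells read_file, checking
# against these signatures rejects the code before it ever runs.
def fetch(url: str) -> str: ...
def read_file(path: str) -> bytes: ...
```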
Key Takeaways
Speed enables new patterns. When execution is effectively free, agents can iterate more aggressively, test hypotheses, and recover from errors.
Restriction is a feature. A Python subset that can only call functions you expose is easier to secure than a full runtime you try to sandbox.
Serialization unlocks distributed execution. Pausing and resuming interpreters across processes enables architectures that would be complex with traditional sandboxes.
The agent use case is specific. Monty isn’t trying to replace CPython. It’s purpose-built for one job: running code that agents write.
Looking Ahead
Monty will power “code-mode” in Pydantic AI. Instead of sequential tool calls, your agent writes Python that calls your tools as functions. The LLM gets the full expressiveness of a programming language. You get safety without container overhead.
The broader implication is that we might be approaching a phase shift in how agents interact with tools. Tool-calling APIs are fundamentally limited: they’re serial, they add per-call overhead, and they constrain how the LLM can express complex logic.
Letting models write code, and then safely executing that code, removes those constraints. An agent that can write `for city in cities: results[city] = get_weather(city)` instead of making a separate tool call per city works differently, and potentially better.
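As a sketch of the difference (`get_weather` stands in for any external function you expose):

```python
# Code-mode: the model emits this program once and the host executes it,
# instead of the model making one round trip per city.
cities = ['London', 'Paris', 'Tokyo']
results = {}
for city in cities:
    results[city] = get_weather(city)  # still pauses at each external call
results
```

Each `get_weather` call still routes through your host-side implementation; what disappears is the model round trip between calls.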
Monty is still experimental. Classes aren’t supported yet. The stdlib is minimal. But the architecture points toward a future where the line between “calling a tool” and “writing a program” disappears.
For AI engineers building agents, that’s worth watching.
Based on analysis of Pydantic’s Monty project on GitHub