Colored Petri Nets: A Practical Safety Harness for LLM-Built Distributed Apps

If you have ever tried to let an LLM “help” you write distributed application logic, you already know the uncomfortable truth: it can generate plausible concurrency code faster than you can review it. And concurrency bugs are the kind you do not notice until your system is under real load, with real money and real users on the line.

One promising way out is not to ask the model to be perfect, but to constrain the space of possible behaviors so that mistakes become easier to detect and test for, and in some cases can be proven impossible. Colored Petri nets (CPNs) are an old idea with a surprisingly modern fit.

Tags: distributed-systems, concurrency, formal-methods, llm-agents, rust, workflows

The Core Insight

The original article argues for a design pattern that shows up again and again in reliable “LLM-enabled” software: you can move much faster when the system has verifiable correctness properties. Tests, compilers, state machines, and strong types all serve the same role—they narrow the surface area where a model can introduce subtle, high-impact errors.

Colored Petri nets extend classic Petri nets by attaching data to tokens. In a standard Petri net, tokens are identity-less markers that move through a bipartite graph of places and transitions. A transition “fires” when its input places have the required tokens; firing consumes tokens from input places and produces tokens in output places.
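To make the firing rule concrete, here is a minimal sketch of a classic (uncolored) net in Python. The class and place names (`PetriNet`, `queued`, `idle_worker`) are hypothetical illustrations, not taken from the article or from any real library.

```python
from collections import Counter

# Hypothetical names throughout; this is not the API of any real library.
class PetriNet:
    def __init__(self, marking):
        # marking: place name -> number of tokens currently in that place
        self.marking = Counter(marking)
        self.transitions = {}

    def add_transition(self, name, inputs, outputs):
        # inputs/outputs: place name -> number of tokens consumed/produced
        self.transitions[name] = (Counter(inputs), Counter(outputs))

    def enabled(self, name):
        # A transition is enabled when every input place holds enough tokens.
        inputs, _ = self.transitions[name]
        return all(self.marking[place] >= n for place, n in inputs.items())

    def fire(self, name):
        if not self.enabled(name):
            raise RuntimeError(f"transition {name!r} is not enabled")
        inputs, outputs = self.transitions[name]
        self.marking -= inputs   # consume tokens from input places
        self.marking += outputs  # produce tokens in output places


net = PetriNet({"queued": 2, "idle_worker": 1})
net.add_transition("start", {"queued": 1, "idle_worker": 1}, {"running": 1})
net.add_transition("finish", {"running": 1}, {"done": 1, "idle_worker": 1})

net.fire("start")
# The single worker token is now consumed, so "start" cannot fire again
# until "finish" hands the worker back.
```

Note that the tokens here are only counts. Nothing distinguishes one queued job from another, which is exactly the limitation that colored tokens remove.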

CPNs add two capabilities that matter for real distributed systems:

1) Guards: Boolean conditions that must hold before a transition can fire. Think of these as explicit, declarative preconditions.

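2) Data-aware joins and forks: A transition can consume tokens from several places and produce tokens into several places in one step. Classic Petri nets can already join and fork; what CPNs add is that arcs and guards can inspect the data each token carries, so a join can match tokens by value. This maps naturally to the “claim resources + do work + emit results” patterns that dominate concurrent orchestration.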
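A small sketch of both capabilities, assuming tokens are plain Python values and each input place is listed once per transition. `ColoredNet`, `find_binding`, and the `jobs`/`workers` places are hypothetical names chosen for illustration; real CPN tools use richer arc expressions than this.

```python
from itertools import product

# Hypothetical sketch; not the API of any real CPN tool.
class ColoredNet:
    def __init__(self):
        self.places = {}       # place name -> list of tokens (any Python value)
        self.transitions = {}  # name -> (input places, guard, produce)

    def put(self, place, token):
        self.places.setdefault(place, []).append(token)

    def add_transition(self, name, inputs, guard, produce):
        # Assumption: each place appears at most once in `inputs`.
        self.transitions[name] = (inputs, guard, produce)

    def find_binding(self, name):
        # Try one token from each input place until the guard accepts a combination.
        inputs, guard, _ = self.transitions[name]
        pools = [self.places.get(place, []) for place in inputs]
        for combo in product(*pools):
            if guard(*combo):
                return combo
        return None

    def fire(self, name):
        binding = self.find_binding(name)
        if binding is None:
            return False
        inputs, _, produce = self.transitions[name]
        for place, token in zip(inputs, binding):
            self.places[place].remove(token)   # join: consume one token per input
        for place, token in produce(*binding):
            self.put(place, token)             # fork: emit tokens to output places
        return True


net = ColoredNet()
net.put("jobs", {"id": 1, "region": "eu"})
net.put("workers", {"name": "w1", "region": "us"})
net.put("workers", {"name": "w2", "region": "eu"})
net.add_transition(
    "assign",
    inputs=["jobs", "workers"],
    guard=lambda job, worker: job["region"] == worker["region"],
    produce=lambda job, worker: [("running", (job["id"], worker["name"]))],
)
fired = net.fire("assign")  # binds job 1 to w2, the only worker in the same region
```

The guard is an ordinary predicate over the bound tokens, so it can be reviewed and unit-tested without running the rest of the system.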

A useful way to reframe CPNs in modern programming terms is: a typed workflow engine where “state” is not a global mutable blob, but a set of tokens that must satisfy explicit constraints to move forward. The article also notes a conceptual alignment with Rust’s typestate pattern: represent legal states in the type system, so illegal transitions do not compile.

Two quotes capture the spirit of the piece:

“Verifiable correctness makes it much easier to take bigger leaps with LLMs.”
“CPNs… provide the potential for formally verifying concurrent programs at build time.”

The key bet is that if your concurrency is expressed as token movement under guards (plus a persistence layer that preserves those semantics), you get a structure that is easier to simulate, test, and reason about than hand-rolled coordination code.

Why This Matters

Most distributed application failures are not “algorithm” failures; they are coordination failures:

two workers accidentally process the same item at the same time
retries amplify load (thundering herds)
rate limits are enforced inconsistently
resource leases are duplicated or never released
backpressure is missing, so downstream systems collapse

Teams often solve these with ad-hoc combinations of:

database transactions (e.g., SELECT ... FOR UPDATE)
queues
per-domain or per-customer throttling maps
scattered locks
opaque “workflow state machines” embedded in code

Those solutions work, but they tend to grow into bespoke coordination layers that are hard to audit and even harder to extend safely.

CPNs offer a different organizing principle: make coordination a first-class part of the design instead of a by-product of business code. The business logic is still yours, but the concurrency rules become explicit artifacts that you can inspect.

This is especially relevant if you are experimenting with agentic systems or LLM-assisted development:

LLMs are good at producing local code changes.
Concurrency correctness is a global property.

A net-based representation forces you to define global constraints (what must be true for work to proceed), rather than hoping the model inferred those constraints correctly from a few comments and function names.

A concrete example: “polite” web scraping

The article uses a scraper scheduler as an intuition pump. A scraper has limited proxies, needs domain-level throttles, wants to avoid duplicate concurrent requests, and must apply cooldowns and retry backoff. Many scrapers implement this with a central database acting as a lease manager.

In CPN terms, scraping becomes a join:

a token representing an available proxy
a token representing a prioritized target
optionally a token representing the domain’s availability

The scrape transition only fires when all required tokens exist and pass guards. After firing, tokens flow through stages like raw_html → parsed → validated → stored, each with its own concurrency limits—naturally creating backpressure.

This style makes it harder to accidentally do impolite things (like exceeding per-domain rates), because the only way to start a scrape is to obtain the right combination of tokens.
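The join described above can be sketched as a single function over a marking. Everything here is an assumption made for illustration: the place names (`proxies`, `targets`, `domain_slots`, `in_flight`), the one-slot-per-domain policy, and the `Target` fields. Cooldowns and retry backoff are left out to keep the sketch short.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Target:
    url: str
    domain: str
    priority: int

def try_scrape(marking):
    """Fire the scrape transition once if a full binding exists.

    Returns the started (proxy, target) pair, or None if the join
    cannot be satisfied.
    """
    if not marking["proxies"]:
        return None
    # Highest priority first. Guard: the target's domain must have a free slot.
    for target in sorted(marking["targets"], key=lambda t: -t.priority):
        if target.domain in marking["domain_slots"]:
            proxy = marking["proxies"].pop(0)
            marking["targets"].remove(target)
            marking["domain_slots"].remove(target.domain)
            marking["in_flight"].append((proxy, target))
            return proxy, target
    return None

def finish_scrape(marking, proxy, target):
    # Completing a scrape returns the proxy token and the domain slot token.
    marking["in_flight"].remove((proxy, target))
    marking["proxies"].append(proxy)
    marking["domain_slots"].append(target.domain)


marking = {
    "proxies": ["p1", "p2"],
    "targets": [
        Target("https://a.example/1", "a.example", 5),
        Target("https://a.example/2", "a.example", 9),
        Target("https://b.example/1", "b.example", 1),
    ],
    "domain_slots": ["a.example", "b.example"],  # one slot token per domain
    "in_flight": [],
}

first = try_scrape(marking)   # takes a.example/2, the highest priority target
second = try_scrape(marking)  # a.example has no free slot, so b.example/1 runs
third = try_scrape(marking)   # no proxy tokens left, so nothing starts
```

The remaining `a.example` target has higher priority than the `b.example` one, yet it waits: there is no code path that starts a scrape without taking the domain's slot token first.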

The “LLM angle” that is easy to miss

The most interesting part is not “Petri nets are cool.” It is that CPNs are a language for coordination that can be checked.

If you let a model generate code for a CPN-like framework, you can ask it to produce:

places and transitions
guards
token schemas
invariants (“no domain has more than N in-flight requests”)

Then you can run deterministic simulation and property tests against the net. In other words, you shift the human effort from line-by-line review toward validating a constrained model.
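As a rough illustration of that workflow, the sketch below runs a seeded random simulation of a two-transition net (`start` and `finish` per domain) and checks the in-flight invariant after every step. The limit `MAX_IN_FLIGHT`, the domain names, and the state layout are hypothetical; a real harness would more likely use a property-testing library and a richer net.

```python
import random

MAX_IN_FLIGHT = 2  # hypothetical per-domain limit

def enabled_moves(state):
    moves = []
    for domain, d in state.items():
        if d["queued"] > 0 and d["in_flight"] < MAX_IN_FLIGHT:  # guard on "start"
            moves.append(("start", domain))
        if d["in_flight"] > 0:
            moves.append(("finish", domain))
    return moves

def apply_move(state, move):
    kind, domain = move
    d = state[domain]
    if kind == "start":
        d["queued"] -= 1
        d["in_flight"] += 1
    else:
        d["in_flight"] -= 1
        d["done"] += 1

def invariant(state):
    # "No domain has more than N in-flight requests."
    return all(0 <= d["in_flight"] <= MAX_IN_FLIGHT for d in state.values())

def simulate(seed, steps=200):
    rng = random.Random(seed)  # seeded, so a given seed always replays identically
    state = {
        "a.example": {"queued": 5, "in_flight": 0, "done": 0},
        "b.example": {"queued": 3, "in_flight": 0, "done": 0},
    }
    trace = []
    for _ in range(steps):
        moves = enabled_moves(state)
        if not moves:
            break  # nothing is enabled: all work has drained
        move = rng.choice(moves)
        apply_move(state, move)
        trace.append(move)
        assert invariant(state), f"invariant broken after {trace}"
    return state, trace


# Explore many interleavings; a failing seed is a reproducible bug report.
for seed in range(100):
    simulate(seed)
```

Because the trace is returned alongside the final state, a failing run can be replayed step by step from its seed, which is the point of reviewing a constrained model instead of thread interleavings.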

That is a good fit for how modern teams already work: we trust compilers and test harnesses far more than we trust “reasoning about threads.”

Key Takeaways

Model your concurrency as explicit state flow, not implicit shared state. Places and tokens become your canonical “what exists” and “what is allowed.”
Guards are your audit trail. Put policy and safety rules (rate limits, resource availability, retry caps) in guards where they can be reviewed and tested.
Use joins and forks to encode coordination. If a job requires a proxy and a target and a domain lease, represent that as a multi-token join.
Prefer simulation to hero debugging. A net lends itself to time travel: you can replay token flows and test changes before production.
Beware the hard part: persistence and partitioning. The article is candid that this is an open problem: how do you scale token state beyond one machine without losing the semantics that make the model safe?
Actionable next step: pick a “coordination-heavy” subsystem (scheduler, retry engine, resource leasing) and re-implement only the decision layer using CPN semantics. Then compare it with the existing implementation on correctness (fewer reachable invalid states) and complexity (less bespoke coordination code).

Looking Ahead

CPNs are not a silver bullet. Two risks are worth stating plainly.

Risk 1: You build a new database by accident.

A correct, performant CPN engine needs atomic multi-token moves, conflict detection, deadlock avoidance, and observability. If you implement that poorly, you may end up with a fragile “workflow database” that is harder to operate than the system it replaced.
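To show what “atomic multi-token moves” and “conflict detection” mean at the smallest scale, here is an in-process sketch that stages a move on a copy of the marking and commits only if every token is still available. `TokenStore` and its methods are hypothetical names. A single lock and a full copy per move would not survive a real workload; a production engine would more plausibly lean on database transactions, which is where the risk above comes from.

```python
import threading

# Hypothetical single-process sketch; not a design for a distributed engine.
class TokenStore:
    def __init__(self, marking):
        self._marking = {place: list(tokens) for place, tokens in marking.items()}
        self._lock = threading.Lock()

    def move(self, consume, produce):
        """Atomically consume and produce tokens.

        consume/produce are lists of (place, token). Either every token in
        `consume` is removed and every token in `produce` is added, or
        nothing changes and False is returned.
        """
        with self._lock:
            staged = {place: list(tokens) for place, tokens in self._marking.items()}
            for place, token in consume:
                if token not in staged.get(place, []):
                    return False  # conflict: the token is gone, discard the staged copy
                staged[place].remove(token)
            for place, token in produce:
                staged.setdefault(place, []).append(token)
            self._marking = staged  # commit by swapping in the staged copy
            return True

    def snapshot(self):
        with self._lock:
            return {place: list(tokens) for place, tokens in self._marking.items()}


store = TokenStore({"leases": ["lease-1"], "jobs": ["job-a", "job-b"]})

# Two workers race for the same lease; only the first move can commit.
ok_a = store.move(
    consume=[("jobs", "job-a"), ("leases", "lease-1")],
    produce=[("running", ("lease-1", "job-a"))],
)
ok_b = store.move(
    consume=[("jobs", "job-b"), ("leases", "lease-1")],
    produce=[("running", ("lease-1", "job-b"))],
)
```

The second move removes `job-b` from the staged copy before it discovers the lease is gone, and the early return throws that partial work away, so `job-b` is still queued afterwards.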

Risk 2: Formalism can become theater.

It is possible to draw beautiful nets and still encode bad policy (or forget key constraints). The value is not the diagram; it is the ability to run the system under a ruleset that is testable and enforceable.

That said, the direction feels right: treat coordination as a first-class program with tight semantics, then use LLMs to help you iterate within those guardrails.

If I were betting on a near-term proof point, I would follow the article’s suggestion: implement a scraper scheduler (or any lease-heavy orchestration component) in a CPN-inspired runtime. If it becomes easier to extend without regressions—and if the simulation tooling catches failures earlier—you have something powerful: an architecture where “LLM acceleration” is earned by design, not hoped for.

Sources

CPNs, LLMs, and Distributed Applications (Sao) https://blog.sao.dev/cpns-llms-distributed-apps/
