The Dark Factory: Where Humans Don’t Even Look at the Code

“If you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.” That’s not hyperbole from a random Twitter thread. It’s a guiding principle from a three-person team that’s building production security software without reviewing a single line of code.

The Core Insight

StrongDM's AI team has published what might be the most radical description of AI-assisted software development yet: “Software Factories and the Agentic Moment.” Their rules are simple and uncomfortable:

  • Code must not be written by humans
  • Code must not be reviewed by humans

Wait—no code review? On security software? How could that possibly work when LLMs are notoriously prone to making inhuman mistakes?

Their answer is elegant: treat test scenarios like holdout sets in machine learning. The coding agents can see the specs but not the validation scenarios. It’s like aggressive testing by an external QA team—except the testers are also AI agents, running thousands of scenarios per hour.

The key insight: move from boolean definitions of success (“the test suite is green”) to probabilistic ones. They call it “satisfaction”—of all observed trajectories through all scenarios, what fraction likely satisfies the user?
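The metric itself is simple to state. Here is a minimal sketch of the idea, with hypothetical names (`Trajectory`, the `satisfied` verdict) standing in for whatever judge or rubric actually scores each run—this is not StrongDM's API:

```python
from dataclasses import dataclass

@dataclass
class Trajectory:
    """One observed agent run through a scenario (illustrative)."""
    scenario: str
    satisfied: bool  # verdict from an automated judge, not a unit test

def satisfaction(trajectories: list[Trajectory]) -> float:
    """Probabilistic success: fraction of runs judged to satisfy the user."""
    if not trajectories:
        return 0.0
    return sum(t.satisfied for t in trajectories) / len(trajectories)

runs = [
    Trajectory("grant-access", True),
    Trajectory("grant-access", True),
    Trajectory("revoke-access", False),
    Trajectory("audit-log", True),
]
print(satisfaction(runs))  # 0.75
```

The point of the shape: instead of a single red/green bit, you get a number you can track, threshold, and drive upward across thousands of scenario runs.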

Why This Matters

The November 2025 inflection point changed everything. Claude Opus 4.5 and GPT 5.2 turned a corner on how reliably coding agents could follow instructions. But StrongDM's team saw it earlier, back in October 2024 with the revised Claude 3.5 Sonnet, when “long-horizon agentic coding workflows began to compound correctness rather than error.”

Their most jaw-dropping innovation: the Digital Twin Universe. To test their permission management software without hitting API rate limits or triggering abuse detection, they built behavioral clones of Okta, Jira, Slack, Google Docs, Google Drive, and Google Sheets.

How do you clone the important parts of Okta? With coding agents, naturally. Dump the full public API documentation into the agent harness and have it build a self-contained imitation. The trick that ensured high fidelity: use popular SDK client libraries as compatibility targets, always aiming for 100% compatibility.

With their own independent clones—free from rate limits—their army of simulated testers could go wild. Test failure modes that would be dangerous against live services. Run scenarios at volumes far exceeding production limits. Validate at rates impossible against the real thing.
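To make the idea concrete, here is a toy sketch of what a behavioral clone looks like in miniature. A real twin would target full SDK compatibility with a service like Okta; this hypothetical `FakeDirectory` only shows the shape—same operations, in memory, with no rate limits to throttle a simulated tester:

```python
import itertools

class FakeDirectory:
    """Toy in-memory stand-in for a directory service's Users API.
    Illustrative only; not Okta's actual interface."""

    def __init__(self):
        self._users = {}
        self._ids = itertools.count(1)

    def create_user(self, email: str) -> dict:
        uid = f"u{next(self._ids)}"
        user = {"id": uid, "email": email, "status": "ACTIVE"}
        self._users[uid] = user
        return user

    def deactivate_user(self, uid: str) -> dict:
        self._users[uid]["status"] = "DEPROVISIONED"
        return self._users[uid]

    def list_users(self) -> list[dict]:
        return list(self._users.values())

# No rate limits, no abuse detection: hammer it as hard as you like.
twin = FakeDirectory()
for i in range(10_000):
    twin.create_user(f"tester{i}@example.com")
print(len(twin.list_users()))  # 10000
```

Creating ten thousand users against the live service would trip every abuse alarm; against the twin it takes milliseconds.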

Key Takeaways

  • Creating a high-fidelity SaaS clone was always possible but never economically feasible. Generations of engineers wanted full in-memory replicas of their CRM to test against but self-censored before ever proposing to build one. Coding agents change that math.

  • The “Gene Transfusion” pattern: Have agents extract patterns from existing systems and reuse them elsewhere. Like genetic code, good architectural patterns can be transplanted between projects.

  • “Semports” for direct language porting: Code goes directly from one language to another, with agents handling the translation.

  • Pyramid Summaries for context management: Multiple levels of summary let agents enumerate short descriptions quickly, then zoom into detail as needed.

  • The Attractor release is meta-delicious: Their coding agent at the heart of the software factory is released as… three markdown spec files. No code. Just feed the specs into your agent of choice.
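Of these patterns, Pyramid Summaries is the easiest to picture in code. This is a hypothetical sketch of the data shape, not StrongDM's implementation: each artifact carries summaries at several granularities, so an agent can scan one-liners cheaply and pay for full detail only where it looks relevant.

```python
# Hypothetical pyramid: one-liner -> paragraph -> full detail per module.
pyramid = {
    "auth.py": {
        "one_liner": "OAuth token refresh and caching",
        "paragraph": "Handles token acquisition, refresh-before-expiry, "
                     "and an in-memory cache keyed by client id.",
        "full": "<entire file contents>",
    },
    "billing.py": {
        "one_liner": "Invoice generation from usage records",
        "paragraph": "Aggregates metered usage into monthly invoices.",
        "full": "<entire file contents>",
    },
}

def scan(level: str = "one_liner") -> dict[str, str]:
    """Cheap pass: enumerate short descriptions of every module."""
    return {name: doc[level] for name, doc in pyramid.items()}

def zoom(name: str) -> str:
    """Expensive pass: fetch full detail for one chosen module."""
    return pyramid[name]["full"]

print(scan())           # every module, one line each
print(zoom("auth.py"))  # full detail for just the relevant one
```

The design choice is a budget trade: the top of the pyramid fits an entire codebase into a few hundred tokens of context, and the agent descends only where the one-liners suggest it should.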

Looking Ahead

This feels like a glimpse of one potential future: software engineers moving from building code to building and semi-monitoring the systems that build code. The Dark Factory.

But there’s a serious caveat in the $1,000/day-per-engineer budget. If these patterns really add $20,000/month per engineer to your costs, they become more business model exercise than engineering practice. Can you create a profitable enough product to afford that overhead?

And building sustainable software businesses looks very different when any competitor can potentially clone your newest features with a few hours of agent work.

Still, there’s a lot to learn here even for teams not burning thousands on tokens. The fundamental question is now unavoidable: What does it take to have agents prove their code works without a human reviewing every line?

The answer seems to involve holdout scenarios, probabilistic satisfaction metrics, and digital twins of the systems you depend on. The Dark Factory isn’t for everyone. But the questions it raises are questions every developer will eventually face.


Based on analysis of “Software Factories and the Agentic Moment” via Simon Willison
