The Architecture of AI Agents: What Lilian Weng’s Landmark Post Still Teaches Us

Back in June 2023, OpenAI’s Lilian Weng published what has become the definitive technical overview of LLM-powered autonomous agents. Nearly three years later, as we witness the explosion of Claude Code, Cursor agents, and coding assistants, her framework remains the clearest mental model for understanding how these systems actually work.
The Core Insight

The anatomy of an AI agent comes down to three components orbiting a central brain:
1. Planning — The ability to decompose complex tasks into manageable steps. This includes Chain-of-Thought reasoning (think step by step), Tree of Thoughts (explore multiple reasoning paths), and self-reflection mechanisms such as ReAct, which interleaves reasoning with acting.
2. Memory — Both short-term (the context window) and long-term (external vector stores). Your agent can only be as smart as what it can remember, and remembering well means fast approximate nearest-neighbor retrieval via algorithms and libraries such as FAISS, HNSW, or ScaNN.
3. Tool Use — The ability to call external APIs, execute code, and search the web. This is what transforms an LLM from a text generator into something that can actually do things.
The elegant insight: these components map directly onto human cognition. Sensory memory becomes embedding representations, short-term memory is in-context learning, and long-term memory is the external vector store. The mechanisms differ from ours, but the division of labor is familiar; the sketch below makes the anatomy concrete.
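To make that anatomy concrete, here is a minimal Python sketch. Everything in it (the Tool, Memory, and Agent names, the llm() stand-in, the keyword-based recall) is an illustrative assumption rather than any particular framework's API; a real agent would use embedding search and an actual model client.

```python
from dataclasses import dataclass, field
from typing import Callable

def llm(prompt: str) -> str:
    """Stand-in for a call to whatever chat/completion model you use."""
    raise NotImplementedError  # wire up your provider here

@dataclass
class Tool:
    name: str
    description: str
    run: Callable[[str], str]        # tool use: act on the outside world

@dataclass
class Memory:
    short_term: list[str] = field(default_factory=list)  # the context window
    long_term: list[str] = field(default_factory=list)   # stand-in for a vector store

    def recall(self, query: str, k: int = 3) -> list[str]:
        # Naive keyword match; a real agent would rank by embedding similarity.
        hits = [m for m in self.long_term if any(w in m for w in query.lower().split())]
        return hits[:k]

@dataclass
class Agent:
    tools: dict[str, Tool]
    memory: Memory = field(default_factory=Memory)

    def plan(self, task: str) -> list[str]:
        # Planning: ask the model to decompose the task into explicit steps.
        steps = llm(f"Break this task into numbered steps:\n{task}")
        return [line.strip() for line in steps.splitlines() if line.strip()]
```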
Why This Matters

ReAct changed everything. Before ReAct, LLMs either reasoned OR acted. ReAct unified them: Thought → Action → Observation, repeated until the task is done. This pattern is now so fundamental that you see it in every serious agent implementation.
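A minimal sketch of that loop, assuming the model answers each turn with a single line of the form 'tool_name: input' or 'finish: answer' (a simplification of the free-form Thought/Action/Observation text the ReAct paper uses), and that llm and tools are supplied by the caller:

```python
def react_loop(task: str, llm, tools: dict, max_steps: int = 8) -> str:
    """Thought -> Action -> Observation, repeated until the task is done."""
    transcript = f"Task: {task}\n"
    for _ in range(max_steps):
        # Thought + Action: the model reasons, then names a tool and its input.
        step = llm(transcript +
                   "Respond with one line, either 'tool_name: input' or 'finish: answer'.")
        transcript += step + "\n"
        action, _, arg = step.partition(":")
        action, arg = action.strip().lower(), arg.strip()
        if action == "finish":
            return arg                                    # task complete
        # Observation: execute the tool and feed the result back into context.
        observation = tools[action](arg) if action in tools else f"Unknown tool '{action}'"
        transcript += f"Observation: {observation}\n"
    return "Stopped: step budget exhausted."
```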
The Reflexion framework introduced something critical: learning from failure. After each action, the agent evaluates the outcome and reflects on it. Failed trajectories become verbal lessons stored in memory and fed back into the prompt for future attempts. This is how agents improve over time without fine-tuning: pure in-context learning from mistakes.
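Here is one hedged sketch of that retry-and-reflect loop; run_episode, evaluate, and llm are assumed callables, and the real Reflexion framework adds heuristics and a dedicated self-reflection prompt that this omits:

```python
def run_with_reflexion(task: str, llm, run_episode, evaluate, max_trials: int = 3):
    """Retry a task, injecting verbal reflections on earlier failures into the prompt."""
    reflections: list[str] = []                      # lessons learned across trials
    for _ in range(max_trials):
        context = "\n".join(f"Reflection: {r}" for r in reflections)
        trajectory = run_episode(task, context)      # one full attempt, with past lessons
        if evaluate(trajectory):                     # success heuristic
            return trajectory
        # Ask the model what went wrong so the next attempt can avoid it.
        reflections.append(llm(
            f"This attempt at '{task}' failed:\n{trajectory}\n"
            "In two sentences, state the mistake and how to avoid it next time."
        ))
    return None                                      # all trials failed
```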
Tool use is the great equalizer. Models can be trained to call calculators, weather APIs, and code executors. HuggingGPT routes tasks to specialist models on Hugging Face. The LLM becomes a conductor, not the entire orchestra.
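In the same spirit, a toy router illustrates the conductor pattern; the specialist names and the routing prompt below are hypothetical placeholders, not the actual HuggingGPT planning schema:

```python
SPECIALISTS = {
    "image-captioning": "vision-language model",
    "speech-to-text": "automatic speech recognition model",
    "code-generation": "code model",
}

def route(task: str, llm) -> str:
    """Ask the LLM to delegate to a specialist instead of doing the work itself."""
    choice = llm(
        "Pick the single best specialist for this task.\n"
        f"Options: {', '.join(SPECIALISTS)}\n"
        f"Task: {task}\n"
        "Answer with the option name only."
    ).strip().lower()
    return choice if choice in SPECIALISTS else next(iter(SPECIALISTS))  # crude fallback
```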
The post also covered ChemCrow (an agent equipped with 13 expert-designed tools for organic synthesis and drug discovery) and Generative Agents (25 AI characters living in a simulated town, complete with relationships, memories, and emergent social behaviors like throwing parties and spreading gossip).
Key Takeaways
Task decomposition is prompt engineering’s mature form. Instead of hoping the model handles complexity, you structurally break it down: “Steps for XYZ: 1. 2. 3.”
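A decomposition step can be as literal as that prompt pattern; the helper below is a hypothetical wrapper that seeds the numbered-list format and splits the model's reply into steps:

```python
def decompose(task: str, llm) -> list[str]:
    """Turn one complex task into an explicit numbered plan before acting on it."""
    plan = llm(f"Steps for {task}:\n1.")                  # seed the numbered-list format
    return [line.strip() for line in f"1.{plan}".splitlines() if line.strip()]
```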
Memory retrieval is a well-understood problem with multiple solutions: HNSW for hierarchical navigable small-world graphs, FAISS for clustering-based vector quantization, ScaNN for anisotropic vector quantization. Pick based on your scale and accuracy needs.
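For example, an HNSW index in FAISS takes only a few lines (assuming faiss-cpu and numpy are installed; the dimension, the random placeholder vectors, and the neighbor count are arbitrary choices here):

```python
import numpy as np
import faiss  # pip install faiss-cpu

d = 384                                                  # embedding dimension
memories = np.random.rand(10_000, d).astype("float32")   # placeholder embeddings

index = faiss.IndexHNSWFlat(d, 32)     # HNSW graph, 32 links per node
index.add(memories)                    # build the long-term memory store

query = np.random.rand(1, d).astype("float32")
distances, ids = index.search(query, 5)   # retrieve the 5 nearest memories
print(ids[0])                             # row indices into the original memory array
```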
The Thought-Action-Observation loop is canonical. If you’re building agents, you’re implementing some variant of this pattern.
Self-play works for closed, perfect-information games, not for the open world. Chess and Go agents can train against themselves because information is perfect and the rules are symmetric. Real-world agency involves hidden information and asymmetric incentives.
LLM evaluation by LLMs is unreliable in expert domains. In the ChemCrow study, GPT-4 and the tool-augmented system looked roughly equivalent when judged by an LLM, but human chemists found ChemCrow's solutions dramatically better. The model doesn't know what it doesn't know.
Looking Ahead
The most prescient warning in Weng’s post: multi-agent systems face fundamental challenges around context length, efficiency, and stability. Three years later, these remain the bottlenecks.
Context windows have grown (Claude models now offer 200K tokens), but self-attention cost still grows quadratically with sequence length. Memory compaction, hierarchical summarization, and retrieval augmentation are all active areas of research.
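One simple compaction pattern, sketched here with a hypothetical llm() summarizer and an arbitrary chunk size, recursively folds the oldest turns into summaries so the working context stays bounded:

```python
def compact(history: list[str], llm, chunk: int = 10) -> list[str]:
    """Fold the oldest messages into summaries until the history fits the budget."""
    while len(history) > chunk:
        oldest, history = history[:chunk], history[chunk:]
        summary = llm("Summarize these messages, keeping facts, names, and open tasks:\n"
                      + "\n".join(oldest))
        history.insert(0, f"[summary] {summary}")   # the summary replaces the raw turns
    return history
```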
The Generative Agents simulation points toward something bigger: what happens when many AI agents interact continuously, forming relationships and social dynamics? We’re starting to see this in production—AI customer service agents handing off to AI specialists, swarms of coding agents collaborating on complex systems.
Weng’s framework gave us the vocabulary: Planning, Memory, Tool Use. The implementations have evolved dramatically. But the fundamental architecture remains surprisingly stable.
If you want to understand where AI agents came from and where they’re going, this post remains required reading. The components haven’t changed. Only the capabilities have.
Based on analysis of “LLM Powered Autonomous Agents” by Lilian Weng