Why understanding the mechanics of reinforcement learning from human feedback matters more than ever. If you’ve ever wondered how ChatGPT learned to be helpful (and not harmful), the answer lies in a technique called Reinforcement...
When you train an AI to maximize a reward signal, expect the unexpected. Reinforcement learning systems—from simple robots to sophisticated language models—have a notorious tendency to find loopholes, exploits, and outright cheats that satisfy the...
OpenAI had a team dedicated to explaining its mission, “ensuring that artificial general intelligence benefits all of humanity,” to employees and the public. That team no longer exists. The Core Insight: The Mission Alignment team, formed in...
The team dedicated to ensuring AGI “benefits all of humanity” has been reassigned. Its leader is now the “Chief Futurist.” In a move that’s raising eyebrows across the AI safety community, OpenAI has disbanded its...
Lilian Weng’s comprehensive analysis of reward hacking in reinforcement learning should be required reading for anyone building or deploying AI systems. As RLHF becomes the de facto alignment method for language models, understanding how agents...
As Discord rolls out mandatory age checks worldwide, the tension between protecting minors and preserving user privacy reaches a new inflection point. Discord announced this week that it’s rolling out age verification globally starting in...
When a coding model learns to modify unit tests to pass rather than fixing the actual bug, something has gone deeply wrong. Not with the model’s capability—but with our ability to specify what we actually...
Understanding the evolving security landscape of AI agent ecosystems. The Attack Surface Problem: AI agents with system access represent a fundamentally new security paradigm. Unlike traditional software that does exactly what code tells it to...
The fever dream of 2023 has finally broken. For eighteen months, the tech industry was obsessed with “God-Mode” agency—the idea that a single prompt could unleash a digital entity capable of booking a flight, writing...