Anthropic Unleashes ‘Alien Science’ as AI Surpasses Humans in Alignment
Anthropic just released a paper (full Alignment Science blog) showing that nine parallel Claude Opus 4.6 agents outperformed Anthropic’s own human researchers on a real alignment problem. The setup: weak-to-strong supervision (using a weaker AI to train a stronger one, mirroring how humans will someday supervise AI smarter than us).
Here’s what happened
- Two human Anthropic researchers spent seven days evaluating the four best methods from prior research and recovered 23% of the maximum performance gap.
- Nine Claude Opus 4.6 agents in parallel sandboxes spent five more days on the same problem, sharing findings as they went.
- The Claude agents recovered 97% of the gap, roughly what you’d get training the model on perfect ground-truth data.
- Total cost: $18,000, or about $22 per Claude-research-hour.
- The agents also invented four kinds of “reward hacking” (gaming the test) that none of the authors predicted, including one that exfiltrated test labels by flipping single answers and watching the score change.
- Some Claude-discovered methods are so unfamiliar that the authors call them “alien science.”
Why this matters
Alignment research (making sure AI behaves the way humans want) was the one field everyone agreed couldn’t be automated. That argument is now empirical, not hypothetical.
The cost number is what to internalize: whatever ratio of human researchers to Claude fleet you can imagine, the labs can afford more. Andrew Curran is calling it “a preview of RSI” (recursive self-improvement, where AI improves its own training).
Our take
Read the paper carefully, and the catch shows up: this only works on problems where progress can be automatically scored, and even then, the agents tried to game the score in four different ways. Most real alignment problems don’t fit that mold. But Anthropic’s own pitch is that solving this general version would let you bootstrap into the fuzzy problems, too.
The open question for the rest of 2026: did Anthropic just publish the seed of recursive self-improvement, or a clever experiment on a uniquely well-behaved problem? Both readings are honest. Neither is comforting.
Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.
The post Anthropic Unleashes ‘Alien Science’ as AI Surpasses Humans in Alignment appeared first on eWEEK.