Need for Speed: Mercury 2 Is 13x Faster Than Claude Haiku
Every AI model you’ve ever used writes the same way: one word at a time, left to right, like a typewriter. If it drifts off course early, tough luck. It keeps typing.
Well, Inception Labs just launched Mercury 2, an AI model that works completely differently:
- Instead of predicting one word after another, it starts with a rough sketch of the entire answer.
- Then, it refines everything at once, like an editor revising a full draft in parallel.
- The technical term is a “diffusion LLM” (dLLM), which uses the same core approach behind AI image generators like Midjourney, but applied to text and reasoning.
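To make the contrast concrete, here's a toy sketch, not Inception's actual algorithm, just an illustration of the two generation styles: an autoregressive loop commits one token per step, while a diffusion-style loop starts from a fully masked sequence and resolves many positions in parallel each round. The target string, mask symbol, and resolve probability are all made up for illustration.

```python
import random

random.seed(0)  # deterministic toy run

TARGET = list("hello world")  # stand-in for the "ideal" finished answer
MASK = "_"

def autoregressive(target):
    """Left-to-right generation: one token committed per step."""
    out = []
    for tok in target:
        out.append(tok)  # no going back once a token is written
    return "".join(out), len(target)

def diffusion_style(target, max_rounds=5):
    """Start fully masked, then refine every position in parallel."""
    seq = [MASK] * len(target)
    rounds = 0
    while MASK in seq and rounds < max_rounds:
        rounds += 1
        for i in range(len(seq)):
            # each round, every unresolved position may resolve at once
            if seq[i] == MASK and random.random() < 0.7:
                seq[i] = target[i]
    # resolve any stragglers in a final pass
    seq = [target[i] if t == MASK else t for i, t in enumerate(seq)]
    return "".join(seq), rounds

ar_text, ar_steps = autoregressive(TARGET)
df_text, df_steps = diffusion_style(TARGET)
print(ar_steps, "sequential steps vs", df_steps, "parallel rounds")
```

Both loops land on the same text, but the diffusion-style version finishes in a handful of rounds instead of one step per token, which is where the throughput win comes from.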
Let us tell you, the speed is real. Independent testing from Artificial Analysis clocked Mercury 2 at 1,196 tokens per second, over 3x faster than the next fastest model in its price class. For context, Claude 4.5 Haiku hits ~89 tokens/sec and GPT-5 Mini ~73. RIP.
Here’s what else matters
- $0.25 per million input tokens, $0.75 per million output (cheaper output than GPT-5 Mini).
- #18 out of 134 models on Artificial Analysis’s intelligence index, with strengths in agentic coding and instruction-following.
- Supports tool use, 128K context, structured outputs, and drops into any OpenAI-compatible stack with zero rewrites.
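"OpenAI-compatible" means your existing client code only needs a new base URL and model name. Here's a minimal sketch of what that looks like; note the endpoint URL and model string below are assumptions for illustration, so check Inception's docs for the real values.

```python
import json

# Assumed values for illustration -- consult Inception's docs for the real ones.
BASE_URL = "https://api.inceptionlabs.ai/v1"  # hypothetical endpoint
MODEL = "mercury-2"                            # hypothetical model name

def build_chat_request(prompt, model=MODEL, max_tokens=256):
    """Build a standard OpenAI-style /chat/completions payload.

    Because the API shape matches OpenAI's, an existing client stack
    only needs to swap the base URL and model name -- no rewrites.
    """
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

payload = build_chat_request("Summarize diffusion LLMs in one sentence.")
print(json.dumps(payload, indent=2))
```

With the official OpenAI SDK, the same idea is one line: point the client's `base_url` at the new endpoint and pass the new model name.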
To be clear: Mercury 2 isn’t trying to dethrone frontier giants like GPT-5.2 or Claude Opus. It’s built for production speed, not leaderboard bragging rights.
So why does a 13x speedup even matter? Because AI isn’t just about chatbots anymore. It’s agent loops, where one task chains dozens of AI calls together.
- Andrej Karpathy (former OpenAI researcher, Tesla AI lead, and notably an Inception investor) drove this home over the weekend when he described the new “Claw” layer of AI.
- By “Claw,” he means local agent platforms like OpenClaw and NanoClaw that orchestrate scheduling, tool calls, and persistent workflows on your own machine.
- He called them “a personal digital house elf.” We prefer “a digital non-human entity (don’t get it twisted) that runs 24/7 for you,” but ya, same energy!
In agent loops, latency compounds at every step. A model that’s 10x faster doesn’t just save time; it changes what you can build. Voice assistants that feel natural. Code agents that keep pace with your thinking. Background automations that actually finish before you forget you started them.
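The compounding is easy to see with back-of-the-envelope arithmetic, using the throughput numbers above and an illustrative agent loop (25 sequential calls of ~400 output tokens each are made-up figures, not a benchmark):

```python
def chain_seconds(calls, tokens_per_call, tokens_per_sec):
    """Wall-clock time for an agent loop of sequential model calls,
    counting only token generation (ignoring network and tool time)."""
    return calls * tokens_per_call / tokens_per_sec

# 25 chained calls, ~400 output tokens each (illustrative numbers)
slow = chain_seconds(25, 400, 89)    # Claude 4.5 Haiku-class throughput
fast = chain_seconds(25, 400, 1196)  # Mercury 2's measured throughput
print(f"{slow:.0f}s vs {fast:.0f}s")  # prints "112s vs 8s"
```

Nearly two minutes of waiting collapses to under ten seconds, which is the difference between a background task and an interactive one.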
The big question
If diffusion can make small models this fast without sacrificing reasoning, will the big labs build their own? We know Google already has one… Expect more experiments soon: switching to a diffusion LLM dramatically increases the number of tokens you can serve per GPU, so there’s every incentive to do this. Why wouldn’t ya?
Check out Corey’s cost breakdown for using Mercury 2 in your OpenClaw setup, geek out with our think-piece on combining diffusion with an energy-based model, or try Mercury 2 yourself.
Editor’s note: This content originally ran in the newsletter of our sister publication, The Neuron. To read more from The Neuron, sign up for its newsletter here.
The post Need for Speed: Mercury 2 Is 13x Faster Than Claude Haiku appeared first on eWEEK.