News Every Day | 12 January 2025, 15:00

It’s getting harder to measure just how good AI is getting

Vox

Toward the end of 2024, I offered a take on all the talk about whether AI’s “scaling laws” were hitting a real-life technical wall. I argued that the question matters less than many think: There are existing AI systems powerful enough to profoundly change our world, and the next few years are going to be defined by progress in AI, whether the scaling laws hold or not.

It’s always a risky business prognosticating about AI, because you can be proven wrong so fast. It’s embarrassing enough as a writer when your predictions for the upcoming year don’t pan out. When your predictions for the upcoming week are proven false? That’s pretty bad.

But less than a week after I wrote that piece, OpenAI’s end-of-year series of releases included its latest large language model (LLM), o3. o3 does not exactly put the lie to claims that the scaling laws that used to define AI progress don’t work quite that well anymore, but it definitively puts the lie to the claim that AI progress is hitting a wall.

o3 is really, really impressive. In fact, to appreciate how impressive it is we’re going to have to digress a little into the science of how we measure AI systems.

Standardized tests for robots

If you want to compare two language models, you want to measure the performance of each of them on a set of problems that they haven’t seen before. That’s harder than it sounds — since these models are fed enormous amounts of text as part of training, they’ve seen most tests before.

So what machine learning researchers do is build benchmarks, tests for AI systems that let us compare them directly to one another and to human performance across a range of tasks: math, programming, reading and interpreting texts, you name it. For a while, we tested AIs on the US Math Olympiad, a mathematics championship, and on physics, biology, and chemistry problems.

The problem is that AIs have been improving so fast that they keep making benchmarks worthless. Once an AI performs well enough on a benchmark we say the benchmark is “saturated,” meaning it’s no longer usefully distinguishing how capable the AIs are, because all of them get near-perfect scores.
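To make "saturated" concrete, here is a minimal sketch in Python. The function name, the 5-point margin, and every score below are hypothetical, chosen only for illustration; researchers do not use a single agreed-upon cutoff.

```python
def is_saturated(scores, ceiling=1.0, margin=0.05):
    """Return True when every model scores within `margin` of the ceiling.

    `scores` maps a model name to its accuracy between 0 and 1.
    The 0.05 margin is an illustrative assumption, not a standard.
    """
    return bool(scores) and min(scores.values()) >= ceiling - margin


# Hypothetical scores: an older benchmark where every model is near-perfect,
# and a newer one that still separates the models.
old_benchmark = {"model_a": 0.97, "model_b": 0.96, "model_c": 0.98}
new_benchmark = {"model_a": 0.55, "model_b": 0.88, "model_c": 0.31}

print(is_saturated(old_benchmark))
print(is_saturated(new_benchmark))
```

When the check passes, the spread between models is too small to say which one is more capable, which is why the benchmark stops being useful.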

2024 was the year in which benchmark after benchmark for AI capabilities became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA that was so difficult that even PhD students in the corresponding fields would generally score less than 70 percent. But the AIs now perform better than humans with relevant PhDs, so it’s not a good way to measure further progress.

On the Math Olympiad qualifier, too, the models now perform among top humans. A benchmark called the MMLU was meant to measure language understanding with questions across many different domains. The best models have saturated that one, too. A benchmark called ARC-AGI was meant to be really, really difficult and measure general humanlike intelligence — but o3 (when tuned for the task) achieves a bombshell 88 percent on it.

We can always create more benchmarks. (We are doing so — ARC-AGI-2 will be announced soon, and is supposed to be much harder.) But at the rate AIs are progressing, each new benchmark only lasts a few years, at best. And perhaps more importantly for those of us who aren’t machine learning researchers, benchmarks increasingly have to measure AI performance on tasks that humans couldn’t do themselves in order to describe what they are and aren’t capable of.

Yes, AIs still make stupid and annoying mistakes. But if it’s been six months since you were paying attention, or if you’ve mostly only been playing around with the free versions of language models available online, which are well behind the frontier, you are overestimating how many stupid and annoying mistakes they make, and underestimating how capable they are on hard, intellectually demanding tasks.

The invisible wall

This week in Time, Garrison Lovely argued that AI progress didn’t “hit a wall” so much as become invisible, primarily improving by leaps and bounds in ways that people don’t pay attention to. (I have never tried to get an AI to solve elite programming or biology or mathematics or physics problems, and wouldn’t be able to tell if it was right anyway.)

Anyone can tell the difference between a 5-year-old learning arithmetic and a high schooler learning calculus, so the progress between those points looks and feels tangible. Most of us can’t really tell the difference between a first-year math undergraduate and the world’s most genius mathematicians, so AI’s progress between those points hasn’t felt like much.

But that progress is in fact a big deal. The way AI is going to truly change our world is by automating an enormous amount of intellectual work that was once done by humans, and three things will drive its ability to do that.

One is getting cheaper. o3 gets astonishing results, but it can cost more than $1,000 to think about a hard question and come up with an answer. However, the end-of-year release of China’s DeepSeek indicated that it might be possible to get high-quality performance very cheaply.

The second is improvements in how we interface with it. Everyone I talk to about AI products is confident there is a ton of innovation to be achieved in how we interact with AIs, how they check their work, and how we choose which AI to use for which task. You could imagine a system where normally a mid-tier chatbot does the work but can internally call in a more expensive model when your question needs it. This is all product work versus sheer technical work, and it’s what I warned in December would transform our world even if all AI progress halted.
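For readers who want to see what that kind of routing might look like, here is a rough sketch in Python. Everything in it is an assumption made for illustration: the `Model` class, the costs, the `looks_hard` heuristic, and the two stand-in models, which only return canned text. A real product would presumably use a far better signal than word count to decide when a question is hard.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Model:
    name: str
    cost_per_call: float  # hypothetical cost in US dollars
    answer: Callable[[str], str]


def looks_hard(question: str) -> bool:
    # Toy heuristic (an assumption): very long questions, or ones that
    # ask for a proof, are treated as hard.
    return len(question.split()) > 40 or "prove" in question.lower()


def route(question: str, cheap: Model, expensive: Model,
          is_hard: Callable[[str], bool] = looks_hard):
    # Use the cheap model by default; escalate only when needed.
    model = expensive if is_hard(question) else cheap
    return model.name, model.answer(question)


# Stand-in models that return canned text instead of calling a real service.
mid_tier = Model("mid-tier", 0.01, lambda q: "quick answer")
frontier = Model("frontier", 1000.0, lambda q: "carefully reasoned answer")

print(route("What is the capital of France?", mid_tier, frontier))
print(route("Prove that the square root of 2 is irrational.", mid_tier, frontier))
```

The design point is that the user never picks a model: the cheap one handles everyday questions, and the expensive one is called in only when the heuristic flags a question as hard.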

And the third is AI systems getting smarter — and for all the declarations about hitting walls, it looks like they are still doing that. The newest systems are better at reasoning, better at problem solving, and just generally closer to being experts in a wide range of fields. To some extent we don’t even know how smart they are because we’re still scrambling to figure out how to measure it once we are no longer really able to use tests against human expertise.

I think that these are the three defining forces of the next few years — that’s how important AI is. Like it or not (and I don’t really like it, myself; I don’t think that this world-changing transition is being handled responsibly at all) none of the three are hitting a wall, and any one of the three would be sufficient to lastingly change the world we live in.

A version of this story originally appeared in the Future Perfect newsletter.
