
It’s getting harder to measure just how good AI is getting

Vox
[Photo illustration: a person navigates the ChatGPT application interface on a smartphone, selecting AI model options.]

Toward the end of 2024, I offered a take on all the talk about whether AI’s “scaling laws” were hitting a real-life technical wall. I argued that the question matters less than many think: There are existing AI systems powerful enough to profoundly change our world, and the next few years are going to be defined by progress in AI, whether the scaling laws hold or not. 

It’s always a risky business prognosticating about AI, because you can be proven wrong so fast. It’s embarrassing enough as a writer when your predictions for the upcoming year don’t pan out. When your predictions for the upcoming week are proven false? That’s pretty bad. 

But less than a week after I wrote that piece, OpenAI’s end-of-year series of releases included their latest large language model (LLM), o3. o3 does not exactly disprove the claim that the scaling laws which used to define AI progress no longer work as well as they once did, but it definitively puts the lie to the claim that AI progress is hitting a wall.

o3 is really, really impressive. In fact, to appreciate how impressive it is, we’re going to have to digress a little into the science of how we measure AI systems.

Standardized tests for robots

If you want to compare two language models, you want to measure the performance of each of them on a set of problems that they haven’t seen before. That’s harder than it sounds — since these models are fed enormous amounts of text as part of training, they’ve seen most tests before. 

So what machine learning researchers do is build benchmarks, tests for AI systems that let us compare them directly to one another and to human performance across a range of tasks: math, programming, reading and interpreting texts, you name it. For a while, we tested AIs on the US Math Olympiad, a mathematics championship, and on physics, biology, and chemistry problems. 

The problem is that AIs have been improving so fast that they keep making benchmarks worthless. Once an AI performs well enough on a benchmark we say the benchmark is “saturated,” meaning it’s no longer usefully distinguishing how capable the AIs are, because all of them get near-perfect scores. 
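The idea of saturation can be sketched in a few lines of Python. The model names, scores, and the 95 percent ceiling below are all invented for illustration; this is just a minimal way to see why near-perfect scores across the board make a benchmark uninformative:

```python
# Hypothetical benchmark scores. The exact numbers are made up, but the
# pattern is the one described in the text: early on, scores are spread
# out and the benchmark tells models apart; later, everything bunches
# up near the ceiling and the benchmark is "saturated."
scores_early = {"model_a": 0.61, "model_b": 0.74}  # useful spread
scores_later = {"model_a": 0.97, "model_b": 0.98}  # saturated

def is_saturated(scores, ceiling=0.95):
    """A benchmark is effectively saturated when every model scores
    near the top, leaving no room to distinguish capability."""
    return all(score >= ceiling for score in scores.values())

print(is_saturated(scores_early))  # False: still informative
print(is_saturated(scores_later))  # True: no longer distinguishes models
```

A saturated benchmark isn’t wrong, exactly; it just stops answering the question researchers built it to answer.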

2024 was the year in which benchmark after benchmark for AI capabilities became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA that was so difficult that even PhD students in the corresponding fields would generally score less than 70 percent. But the AIs now perform better than humans with relevant PhDs, so it’s not a good way to measure further progress. 

On the Math Olympiad qualifier, too, the models now perform among top humans. A benchmark called the MMLU was meant to measure language understanding with questions across many different domains. The best models have saturated that one, too. A benchmark called ARC-AGI was meant to be really, really difficult and measure general humanlike intelligence — but o3 (when tuned for the task) achieves a bombshell 88 percent on it. 

We can always create more benchmarks. (We are doing so — ARC-AGI-2 will be announced soon, and is supposed to be much harder.) But at the rate AIs are progressing, each new benchmark only lasts a few years, at best. And perhaps more importantly for those of us who aren’t machine learning researchers, benchmarks increasingly have to measure AI performance on tasks that humans couldn’t do themselves in order to describe what they are and aren’t capable of. 

Yes, AIs still make stupid and annoying mistakes. But if it’s been six months since you were paying attention, or if you’ve mostly only been playing around with the free versions of language models available online, which are well behind the frontier, you are overestimating how many stupid and annoying mistakes they make, and underestimating how capable they are on hard, intellectually demanding tasks.

The invisible wall

This week in Time, Garrison Lovely argued that AI progress didn’t “hit a wall” so much as become invisible, primarily improving by leaps and bounds in ways that people don’t pay attention to. (I have never tried to get an AI to solve elite programming or biology or mathematics or physics problems, and wouldn’t be able to tell if it was right anyway.)

Anyone can tell the difference between a 5-year-old learning arithmetic and a high schooler learning calculus, so the progress between those points looks and feels tangible. Most of us can’t really tell the difference between a first-year math undergraduate and the world’s most genius mathematicians, so AI’s progress between those points hasn’t felt like much.

But that progress is in fact a big deal. The way AI is going to truly change our world is by automating an enormous amount of intellectual work that was once done by humans, and three things will drive its ability to do that.

One is getting cheaper. o3 gets astonishing results, but it can cost more than $1,000 to think about a hard question and come up with an answer. However, the end-of-year release of China’s DeepSeek indicated that it might be possible to get high-quality performance very cheaply.

The second is improvements in how we interface with it. Everyone I talk to about AI products is confident there’s a ton of innovation still to be achieved in how we interact with AIs, how they check their work, and how we decide which AI to use for which task. You could imagine a system where, normally, a mid-tier chatbot does the work but can internally call in a more expensive model when your question needs it. This is all product work rather than sheer technical work, and it’s what I warned in December would transform our world even if all AI progress halted.

And the third is AI systems getting smarter — and for all the declarations about hitting walls, it looks like they are still doing that. The newest systems are better at reasoning, better at problem solving, and just generally closer to being experts in a wide range of fields. To some extent we don’t even know how smart they are because we’re still scrambling to figure out how to measure it once we are no longer really able to use tests against human expertise.

I think that these are the three defining forces of the next few years — that’s how important AI is. Like it or not (and I don’t really like it, myself; I don’t think that this world-changing transition is being handled responsibly at all), none of the three are hitting a wall, and any one of the three would be sufficient to lastingly change the world we live in.

A version of this story originally appeared in the Future Perfect newsletter. Sign up here!
