News Every Day |

It’s getting harder to measure just how good AI is getting

Vox
In this photo illustration, a person navigates the ChatGPT application interface on a smartphone, selecting AI model options.

Toward the end of 2024, I offered a take on all the talk about whether AI’s “scaling laws” were hitting a real-life technical wall. I argued that the question matters less than many think: There are existing AI systems powerful enough to profoundly change our world, and the next few years are going to be defined by progress in AI, whether the scaling laws hold or not. 

It’s always a risky business prognosticating about AI, because you can be proven wrong so fast. It’s embarrassing enough as a writer when your predictions for the upcoming year don’t pan out. When your predictions for the upcoming week are proven false? That’s pretty bad. 

But less than a week after I wrote that piece, OpenAI’s end-of-year series of releases included its latest large language model (LLM), o3. o3 does not exactly refute claims that the scaling laws that used to define AI progress no longer work as well as they once did, but it definitively puts the lie to the claim that AI progress is hitting a wall.

o3 is really, really impressive. In fact, to appreciate how impressive it is, we’re going to have to digress a little into the science of how we measure AI systems.

Standardized tests for robots

If you want to compare two language models, you want to measure the performance of each of them on a set of problems that they haven’t seen before. That’s harder than it sounds — since these models are fed enormous amounts of text as part of training, they’ve seen most tests before. 

So what machine learning researchers do is build benchmarks, tests for AI systems that let us compare them directly to one another and to human performance across a range of tasks: math, programming, reading and interpreting texts, you name it. For a while, we tested AIs on the US Math Olympiad, a mathematics championship, and on physics, biology, and chemistry problems. 

The problem is that AIs have been improving so fast that they keep making benchmarks worthless. Once an AI performs well enough on a benchmark we say the benchmark is “saturated,” meaning it’s no longer usefully distinguishing how capable the AIs are, because all of them get near-perfect scores. 
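To make "saturated" concrete, here is a minimal Python sketch of one way to flag a benchmark on which every model scores near the ceiling. The 95 percent threshold, the function names, and all of the scores are illustrative assumptions, not an accepted standard or real results.

```python
def is_saturated(scores, max_score=1.0, ceiling_fraction=0.95):
    """Return True when every model scores within the top band of the benchmark.

    `ceiling_fraction` is an assumed, illustrative cutoff; researchers do not
    share a single agreed-upon definition of saturation.
    """
    if not scores:
        raise ValueError("need at least one model score")
    threshold = ceiling_fraction * max_score
    return all(score >= threshold for score in scores.values())


def score_spread(scores):
    """Gap between the best and worst model: a small gap means little signal."""
    return max(scores.values()) - min(scores.values())


# Hypothetical scores (fraction of questions answered correctly).
older_benchmark = {"model_a": 0.97, "model_b": 0.98, "model_c": 0.96}
newer_benchmark = {"model_a": 0.31, "model_b": 0.55, "model_c": 0.88}

for name, scores in [("older", older_benchmark), ("newer", newer_benchmark)]:
    print(name, "saturated:", is_saturated(scores),
          "spread:", round(score_spread(scores), 2))
```

On the made-up "older" benchmark, all three models land within two points of each other near the top, so the test no longer separates them. The "newer" one still does.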

2024 was the year in which benchmark after benchmark for AI capabilities became as saturated as the Pacific Ocean. We used to test AIs against a physics, biology, and chemistry benchmark called GPQA that was so difficult that even PhD students in the corresponding fields would generally score less than 70 percent. But the AIs now perform better than humans with relevant PhDs, so it’s not a good way to measure further progress. 

On the Math Olympiad qualifier, too, the models now perform among top humans. A benchmark called the MMLU was meant to measure language understanding with questions across many different domains. The best models have saturated that one, too. A benchmark called ARC-AGI was meant to be really, really difficult and measure general humanlike intelligence — but o3 (when tuned for the task) achieves a bombshell 88 percent on it. 

We can always create more benchmarks. (We are doing so: ARC-AGI-2 will be announced soon and is supposed to be much harder.) But at the rate AIs are progressing, each new benchmark only lasts a few years, at best. And perhaps more importantly for those of us who aren’t machine learning researchers, benchmarks increasingly have to measure AI performance on tasks that humans couldn’t do themselves in order to describe what the AIs are and aren’t capable of. 

Yes, AIs still make stupid and annoying mistakes. But if it’s been six months since you were paying attention, or if you’ve mostly only been playing around with the free versions of language models available online, which are well behind the frontier, you are overestimating how many stupid and annoying mistakes they make, and underestimating how capable they are on hard, intellectually demanding tasks. 

The invisible wall

This week in Time, Garrison Lovely argued that AI progress didn’t “hit a wall” so much as become invisible, primarily improving by leaps and bounds in ways that people don’t pay attention to. (I have never tried to get an AI to solve elite programming or biology or mathematics or physics problems, and wouldn’t be able to tell if it was right anyway.)

Anyone can tell the difference between a 5-year-old learning arithmetic and a high schooler learning calculus, so the progress between those points looks and feels tangible. Most of us can’t really tell the difference between a first-year math undergraduate and the world’s most genius mathematicians, so AI’s progress between those points hasn’t felt like much.

But that progress is in fact a big deal. The way AI is going to truly change our world is by automating an enormous amount of intellectual work that was once done by humans, and three things will drive its ability to do that.

One is getting cheaper. o3 gets astonishing results, but it can cost more than $1,000 to think about a hard question and come up with an answer. However, the end-of-year release of China’s DeepSeek indicated that it might be possible to get high-quality performance very cheaply.

The second is improvements in how we interface with it. Everyone I talk to about AI products is confident there is a ton of innovation to be achieved in how we interact with AIs, how they check their work, and how we choose which AI to use for which task. You could imagine a system where normally a mid-tier chatbot does the work but can internally call in a more expensive model when your question needs it. This is all product work rather than sheer technical work, and it’s what I warned in December would transform our world even if all AI progress halted.
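The "mid-tier chatbot that calls in a more expensive model" idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Model` class, the confidence scores, the 0.8 cutoff, and the per-call costs are all hypothetical, and the two "models" are stand-in functions rather than real AI systems or any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Model:
    name: str
    cost_per_call: float                         # hypothetical cost in dollars
    answer: Callable[[str], Tuple[str, float]]   # returns (text, confidence 0..1)


def route(question, cheap, expensive, min_confidence=0.8):
    """Try the cheap model first; escalate only if it reports low confidence."""
    text, confidence = cheap.answer(question)
    spent = cheap.cost_per_call
    if confidence >= min_confidence:
        return text, cheap.name, spent
    # The cheap attempt has already been paid for, so the costs add up.
    text, _ = expensive.answer(question)
    return text, expensive.name, spent + expensive.cost_per_call


# Stand-in models: the cheap one is "confident" only on short questions.
def cheap_answer(question):
    confidence = 0.9 if len(question.split()) <= 8 else 0.4
    return f"[mid-tier] {question}", confidence


def expensive_answer(question):
    return f"[frontier] {question}", 0.95


mid_tier = Model("mid-tier", 0.01, cheap_answer)
frontier = Model("frontier", 1.00, expensive_answer)

easy = "What is 2 + 2?"
hard = "Prove that every bounded monotone sequence of real numbers converges to a limit"
for q in (easy, hard):
    _, used, cost = route(q, mid_tier, frontier)
    print(f"{used} handled it for ${cost:.2f}")
```

A real product would need a far better signal than question length for deciding when to escalate. Working out that signal is exactly the kind of product work the paragraph above describes.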

And the third is AI systems getting smarter — and for all the declarations about hitting walls, it looks like they are still doing that. The newest systems are better at reasoning, better at problem solving, and just generally closer to being experts in a wide range of fields. To some extent we don’t even know how smart they are because we’re still scrambling to figure out how to measure it once we are no longer really able to use tests against human expertise.

I think that these are the three defining forces of the next few years — that’s how important AI is. Like it or not (and I don’t really like it, myself; I don’t think that this world-changing transition is being handled responsibly at all) none of the three are hitting a wall, and any one of the three would be sufficient to lastingly change the world we live in.

A version of this story originally appeared in the Future Perfect newsletter.
