The way we measure progress in AI is terrible

Every time a new AI model is released, it’s typically touted as acing a series of performance benchmarks. OpenAI’s GPT-4o, for example, was launched in May with a compilation of results showing it topping every other AI company’s latest model in several tests.

The problem is that these benchmarks are poorly designed, their results hard to replicate, and the metrics they use frequently arbitrary, according to new research. That matters because AI models’ scores against these benchmarks will determine the level of scrutiny and regulation they receive.

“It seems to be like the Wild West because we don’t really have good evaluation standards,” says Anka Reuel, an author of the paper, who is a PhD student in computer science at Stanford University and a member of its Center for AI Safety.

A benchmark is essentially a test that an AI takes. It can be in a multiple-choice format, like the most popular one, the Massive Multitask Language Understanding benchmark (MMLU), or it can evaluate an AI’s ability to do a specific task or the quality of its text responses to a set series of questions.
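
To make the multiple-choice format concrete, here is a minimal sketch of how such a benchmark is typically scored. The sample item and the ask_model function are hypothetical placeholders, not MMLU’s actual harness.

```python
# Minimal sketch of multiple-choice benchmark scoring. The sample item and
# ask_model() are hypothetical placeholders, not MMLU's official code.

QUESTIONS = [
    {
        "question": "Which planet is closest to the Sun?",
        "choices": {"A": "Venus", "B": "Mercury", "C": "Mars", "D": "Earth"},
        "answer": "B",
    },
    # a real benchmark has thousands of such items
]


def ask_model(question: str, choices: dict[str, str]) -> str:
    """Query the model under test and return a letter A-D (stub)."""
    raise NotImplementedError("plug in the model API you are evaluating")


def accuracy(questions: list[dict]) -> float:
    """Fraction of items where the model's letter matches the answer key."""
    correct = sum(
        ask_model(q["question"], q["choices"]) == q["answer"] for q in questions
    )
    return correct / len(questions)
```

The headline number a company reports is usually just this fraction of correct answers, expressed as a percentage.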

AI companies frequently cite benchmarks as testament to a new model’s success. “The developers of these models tend to optimize for the specific benchmarks,” says Anna Ivanova, professor of psychology at the Georgia Institute of Technology and head of its Language, Intelligence, and Thought (LIT) lab, who was not involved in the Stanford research. 

These benchmarks already form part of some governments’ plans for regulating AI. For example, the EU AI Act, whose rules for general-purpose AI models start to apply in August 2025, references benchmarks as a tool for determining whether a model demonstrates “systemic risk”; if it does, it will be subject to higher levels of scrutiny and regulation. The UK’s AI Safety Institute references benchmarks in Inspect, its framework for evaluating the safety of large language models.

But right now, they might not be good enough to use that way. “There’s this potential false sense of safety we’re creating with benchmarks if they aren’t well designed, especially for high-stakes use cases,” says Reuel. “It may look as if the model is safe, but it is not.” 

Given the increasing importance of benchmarks, Reuel and her colleagues wanted to look at the most popular examples to figure out what makes a good one—and whether the ones we use are robust enough. The researchers first set out to verify the benchmark results that developers put out, but often they couldn’t reproduce them. To test a benchmark, you typically need some instructions or code to run it on a model. Many benchmark creators didn’t make the code to run their benchmark publicly available. In other cases, the code was outdated.
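
A small illustration of why the exact harness matters: two equally plausible ways of parsing a model’s free-text answer can score the very same output differently, so without the original code you may not be able to reproduce a published number. Both parsing rules below are hypothetical, not taken from any published benchmark.

```python
# Two plausible but different answer-extraction rules applied to the same raw
# model output. Neither is from a real benchmark; the point is only that the
# choice of rule changes the score.
import re


def extract_strict(output: str) -> str | None:
    """Count the answer only if the reply starts with a bare letter A-D."""
    output = output.strip()
    return output[0] if output and output[0] in "ABCD" else None


def extract_lenient(output: str) -> str | None:
    """Accept the first standalone A-D letter found anywhere in the reply."""
    match = re.search(r"\b([ABCD])\b", output)
    return match.group(1) if match else None


raw = "The correct option is (B), Mercury."
print(extract_strict(raw))   # None -> scored as wrong
print(extract_lenient(raw))  # 'B'  -> scored as right
```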

Benchmark creators often don’t make the questions and answers in their data set publicly available either. If they did, companies could just train their models on the benchmark; it would be like letting a student see the questions and answers on a test before taking it. But keeping them secret makes the benchmarks hard to evaluate independently.

Another issue is that benchmarks are frequently “saturated,” which means all the problems have pretty much been solved. For example, let’s say there’s a test with simple math problems on it. The first generation of an AI model gets 20% on the test, failing. The second generation of the model gets 90%, and the third generation gets 93%. An outsider may look at these results and conclude that AI progress has slowed down, but another interpretation is simply that the benchmark got solved and is no longer a good measure of progress. It fails to capture the difference in ability between the second and third generations of a model.
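
The arithmetic behind that example, sketched roughly below: the 3-point jump from 90% to 93% looks small on the bounded 0–100% scale, even though it removes nearly a third of the remaining errors, and later models have almost no headroom left in which to show further progress.

```python
# Rough arithmetic for the saturation example above (illustrative only).
gen1, gen2, gen3 = 0.20, 0.90, 0.93

# Headroom left on the test after each generation.
for name, score in [("gen1", gen1), ("gen2", gen2), ("gen3", gen3)]:
    print(f"{name}: score {score:.0%}, headroom {1 - score:.0%}")

# The 3-point gain from gen2 to gen3 is a ~30% cut in remaining errors,
# but the bounded score scale makes it look like progress has stalled.
error_cut = ((1 - gen2) - (1 - gen3)) / (1 - gen2)
print(f"relative error reduction, gen2 -> gen3: {error_cut:.0%}")
```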

One of the goals of the research was to define a list of criteria that make a good benchmark. “It’s definitely an important problem to discuss the quality of the benchmarks, what we want from them, what we need from them,” says Ivanova. “The issue is that there isn’t one good standard to define benchmarks. This paper is an attempt to provide a set of evaluation criteria. That’s very useful.”

The paper was accompanied by the launch of a website, BetterBench, that ranks the most popular AI benchmarks. Rating factors include whether or not experts were consulted on the design, whether the tested capability is well defined, and other basics—for example, is there a feedback channel for the benchmark, or has it been peer-reviewed?

The MMLU benchmark had the lowest ratings. “I disagree with these rankings. In fact, I’m an author of some of the papers ranked highly, and would say that the lower-ranked benchmarks are better than them,” says Dan Hendrycks, director of CAIS, the Center for AI Safety, and one of the creators of the MMLU benchmark. That said, Hendrycks still believes that the best way to move the field forward is to build better benchmarks.

Some think the criteria may be missing the bigger picture. “The paper adds something valuable. Implementation criteria and documentation criteria—all of this is important. It makes the benchmarks better,” says Marius Hobbhahn, CEO of Apollo Research, a research organization specializing in AI evaluations. “But for me, the most important question is, do you measure the right thing? You could check all of these boxes, but you could still have a terrible benchmark because it just doesn’t measure the right thing.”

Essentially, even a perfectly designed benchmark that tests a model’s ability to provide compelling analysis of Shakespeare’s sonnets may be useless if what someone is really concerned about is AI’s hacking capabilities.

“You’ll see a benchmark that’s supposed to measure moral reasoning. But what that means isn’t necessarily defined very well. Are people who are experts in that domain being incorporated in the process? Often that isn’t the case,” says Amelia Hardy, another author of the paper and an AI researcher at Stanford University.

There are organizations actively trying to improve the situation. For example, a new benchmark from Epoch AI, a research organization, was designed with input from 60 mathematicians and verified as challenging by two winners of the Fields Medal, the most prestigious award in mathematics. The participation of these experts fulfills one of the criteria in the BetterBench assessment. Even the most advanced current models can answer less than 2% of the questions on the benchmark, which means there is a significant way to go before it is saturated.

“We really tried to represent the full breadth and depth of modern math research,” says Tamay Besiroglu, associate director at Epoch AI. Despite the difficulty of the test, Besiroglu speculates it will take only around four or five years for AI models to score well against it.

And Hendrycks’ organization, CAIS, is collaborating with Scale AI to create a new benchmark, dubbed Humanity’s Last Exam (HLE), that he claims will test AI models against the frontier of human knowledge. “HLE was developed by a global team of academics and subject-matter experts,” says Hendrycks. “HLE contains unambiguous, non-searchable questions that require a PhD-level understanding to solve.” If you want to contribute a question, you can do so here.

Although there is a lot of disagreement over what exactly should be measured, many researchers agree that more robust benchmarks are needed, especially since they set a direction for companies and are a critical tool for governments. 

“Benchmarks need to be really good,” Hardy says. “We need to have an understanding of what ‘really good’ means, which we don’t right now.”
