What it means that new AIs can “reason”

Vox
In this photo illustration, the sign of OpenAI o1, the first in a planned series of “reasoning” models that have been trained to answer more complex questions, is displayed on a smartphone screen on September 13, 2024, in Suqian, Jiangsu Province of China.

An underappreciated fact about large language models (LLMs) is that they produce “live” answers to prompts. You prompt them and they start talking in response, and they talk until they’re done. The result is like asking a person a question and getting a monologue back in which they improvise their answer sentence by sentence.

This explains several of the ways in which large language models can be so frustrating. The model will sometimes contradict itself even within a paragraph, saying something and then immediately following up with the exact opposite because it’s just “reasoning aloud” and sometimes adjusts its impression on the fly. As a result, AIs need a lot of hand-holding to do any complex reasoning.


One well-known way to solve this is called chain-of-thought prompting, where you ask the large language model to effectively “show its work” by “thinking” out loud about the problem and giving an answer only after it has laid out all of its reasoning, step by step.

Chain-of-thought prompting makes language models behave much more intelligently, which isn’t surprising. Compare how you’d answer a question if someone shoves a microphone in your face and demands that you answer immediately to how you’d answer if you had time to compose a draft, review it, and then hit “publish.”
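In practice, the difference between the two answering styles is often just a change to the prompt. Here is a minimal sketch, assuming an OpenAI-style chat message format; the `build_messages` helper and the system-prompt wording are illustrative, not taken from OpenAI’s documentation or from this article.

```python
# Sketch of chain-of-thought prompting: the same question, with and
# without an instruction to reason step by step before answering.
# The message format (role/content dicts) mirrors common chat APIs;
# the helper itself is hypothetical.

def build_messages(question: str, chain_of_thought: bool = False) -> list[dict]:
    """Build a chat-format prompt, optionally asking the model to lay
    out its reasoning before committing to a final answer."""
    if chain_of_thought:
        system = (
            "Think through the problem step by step, showing your "
            "reasoning. Only after you have reasoned it through, give "
            "your final answer on a line starting with 'Answer:'."
        )
    else:
        system = "Answer the question directly and concisely."
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]

question = (
    "A bat and a ball cost $1.10 in total. The bat costs $1.00 more "
    "than the ball. How much does the ball cost?"
)
direct_prompt = build_messages(question)
cot_prompt = build_messages(question, chain_of_thought=True)
print(cot_prompt[0]["content"])
```

The only thing that changes is the instruction, yet on multi-step problems like the one above, the “reason first” version tends to produce markedly better answers. Models like o1 effectively bake this behavior into training rather than leaving it to the prompter.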

The power of think, then answer

OpenAI’s latest model, o1 (nicknamed Strawberry), is the first major LLM release with this “think, then answer” approach built in. 

Unsurprisingly, the company reports that the method makes the model a lot smarter. In a blog post, OpenAI said o1 “performs similarly to PhD students on challenging benchmark tasks in physics, chemistry, and biology. We also found that it excels in math and coding. In a qualifying exam for the International Mathematics Olympiad (IMO), GPT-4o correctly solved only 13 percent of problems, while the reasoning model scored 83 percent.”

This major improvement in the model’s ability to think also intensifies some of the dangerous capabilities that leading AI researchers have long been on the lookout for. Before release, OpenAI tests its models for capabilities related to chemical, biological, radiological, and nuclear weapons — the abilities that would be most sought-after by terrorist groups that don’t have the expertise to build such weapons with current technology.

As my colleague Sigal Samuel wrote recently, OpenAI o1 is the first model to score “medium” risk in this category. That means that while it’s not capable enough to walk, say, a complete beginner through developing a deadly pathogen, the evaluators found that it “can help experts with the operational planning of reproducing a known biological threat.” 

These capabilities are one of the most clear-cut examples of AI as a dual-use technology: a more intelligent model becomes more capable in a wide array of uses, both benign and malign.

If future AI does get good enough to tutor any college biology major through the steps involved in recreating, say, smallpox in the lab, the result could be catastrophic casualties. At the same time, AIs that can tutor people through complex biology projects will do an enormous amount of good by accelerating lifesaving research. It is intelligence itself, artificial or otherwise, that is the double-edged sword.

The point of doing AI safety work to evaluate these risks is to figure out how to mitigate them with policy so we can get the good without the bad.

How to (and how not to) evaluate an AI

Every time OpenAI or one of its competitors (Meta, Google, Anthropic) releases a new model, we retread the same conversations. Some people find a question on which the AI performs very impressively, and awed screenshots circulate. Others find a question on which the AI bombs — say, “how many ‘r’s are there in ‘strawberry’” or “how do you cross a river with a goat” — and share those as proof that AI is still more hype than product. 

Part of this pattern is driven by the lack of good scientific measures of how capable an AI system is. We used to have benchmarks that were meant to describe AI language and reasoning capabilities, but the rapid pace of AI improvement has gotten ahead of them, with benchmarks often “saturated.” This means AI performs as well as a human on these benchmark tests, and as a result they’re no longer useful for measuring further improvements in skill.

I strongly recommend trying AIs out yourself to get a feel for how well they work. (OpenAI o1 is only available to paid subscribers for now, and even then is very rate-limited, but there are new top model releases all the time.) It’s still too easy to fall into the trap of trying to prove a new release “impressive” or “unimpressive” by selectively mining for tasks where it excels or where it embarrasses itself, instead of looking at the big picture.

The big picture is that, across nearly all tasks we’ve invented for them, AI systems are continuing to improve rapidly, but the incredible performance on almost every test we can devise hasn’t yet translated into many economic applications. Companies are still struggling to identify how to make money off LLMs. A big obstacle is the inherent unreliability of the models, and in principle an approach like OpenAI o1’s — in which the model gets more of a chance to think before it answers — might be a way to drastically improve reliability without the expense of training a much bigger model. 

Sometimes, big things can come from small improvements 

In all likelihood, there isn’t going to be a silver bullet that suddenly fixes the longstanding limitations of large language models. Instead, I suspect they’ll be gradually eroded over a series of releases, with the unthinkable becoming achievable and then mundane over the course of a few years — which is precisely how AI has proceeded so far. 

But as ChatGPT — which itself was only a moderate improvement over OpenAI’s previous chatbots but which reached hundreds of millions of people overnight — demonstrates, technical progress being incremental doesn’t mean societal impact is incremental. Sometimes the grind of improvements to various parts of how an LLM operates — or improvements to its UI so that more people will try it, as with ChatGPT itself — pushes us across the threshold from “party trick” to “essential tool.”

And while OpenAI has come under fire recently for ignoring the safety implications of its work and silencing whistleblowers, its o1 release seems to take the policy implications seriously, including collaborating with external organizations to check what the model can do. I’m grateful that they’re making that work possible, and I have a feeling that as models keep improving, we will need such conscientious work more than ever.

A version of this story originally appeared in the Future Perfect newsletter. Sign up here!
