News Every Day | Yesterday, 14:00

Why AI Progress Is Increasingly Invisible

OpenAI co-founder Ilya Sutskever made waves in November when he suggested that advancements in AI are slowing down, explaining that simply scaling up AI models was no longer delivering proportional performance gains.

Sutskever’s comments came on the heels of reports in The Information and Bloomberg that Google and Anthropic were also experiencing similar slowdowns. This led to a wave of articles declaring that AI progress has hit a wall, lending further credence to an increasingly widespread feeling that chatbot capabilities haven’t improved significantly since OpenAI released GPT-4 in March 2023.

On Dec. 20, OpenAI announced o3, its latest model, and reported new state-of-the-art performance on a number of the most challenging technical benchmarks out there, in many cases improving on the previous high score by double-digit percentage points. I believe that o3 signals that we are in a new paradigm of AI progress. And François Chollet, a co-creator of the prominent ARC-AGI benchmark, whom some consider to be an AI scaling skeptic, writes that the model represents a “genuine breakthrough.”

However, in the weeks after OpenAI announced o3, many mainstream news sites made no mention of the new model. Around the time of the announcement, readers would find headlines at the Wall Street Journal, WIRED, and the New York Times suggesting AI was actually slowing down. The muted media response suggests that there is a growing gulf between what AI insiders are seeing and what the public is told.

Indeed, AI progress hasn’t stalled—it’s just become invisible to most people.

Automating behind-the-scenes research

First, AI models are getting better at answering complex questions. For example, in June 2023, the best AI model barely scored better than chance on the hardest set of “Google-proof” PhD-level science questions. In September, OpenAI’s o1 model became the first AI system to surpass the scores of human domain experts. And in December, OpenAI’s o3 model improved on those scores by another 10%.

However, the vast majority of people won’t notice this kind of improvement because they aren’t doing graduate-level science work. But it will be a huge deal if AI starts meaningfully accelerating research and development in scientific fields, and there is some evidence that such an acceleration is already happening. A groundbreaking paper by Aidan Toner-Rodgers at MIT recently found that materials scientists assisted by AI systems “discover 44% more materials, resulting in a 39% increase in patent filings and a 17% rise in downstream product innovation.” Still, 82% of scientists report that the AI tools reduced their job satisfaction, mainly citing “skill underutilization and reduced creativity.”

But the Holy Grail for AI companies is a system that can automate AI research itself, theoretically enabling an explosion in capabilities that drives progress across every other domain. The recent improvements made on this front may be even more dramatic than those made on hard sciences.

In an attempt to provide more realistic tests of AI programming capabilities, researchers developed SWE-Bench, a benchmark that evaluates how well AI agents can fix actual open problems in popular open-source software. The top score on the verified benchmark a year ago was 4.4%. The top score today is closer to 72%, achieved by OpenAI’s o3 model.

This remarkable improvement—from struggling with even the simplest fixes to successfully handling nearly three-quarters of the set of real-world coding tasks—suggests AI systems are rapidly gaining the ability to understand and modify complex software projects. This marks a crucial step toward automating significant portions of software research and development. And this process appears to be well underway. Google’s CEO recently told investors that “more than a quarter of all new code at Google is generated by AI.”

Much of this progress has been driven by improvements to the “scaffolding” built around AI models like GPT-4o, which increase their autonomy and ability to interact with the world. Even without further improvements to base models, better scaffolding can make AI significantly more capable and agentic: a word researchers use to describe an AI model that can act autonomously, make decisions, and adapt to changing circumstances. AI agents are often given the ability to use tools and take multi-step actions on a user’s behalf. Transforming passive chatbots into agents has only become a core focus of the industry in the last year, and progress has been swift.

Perhaps the best head-to-head matchup of elite engineers and AI agents was published in November by METR, a leading AI evaluations group. The researchers created novel, realistic, challenging, and unconventional machine learning tasks to compare human experts and AI agents. While the AI agents beat human experts at two hours of equivalent work, the median engineer won at longer time scales.

But even at eight hours, the best AI agents still managed to beat well over one-third of the human experts. The METR researchers emphasized that there was a “relatively limited effort to set up AI agents to succeed at the tasks, and we strongly expect better elicitation to result in much better performance on these tasks.” They also highlighted how much cheaper the AI agents were than their human counterparts.

The problem with invisible innovation

The hidden improvements in AI over the last year may not represent as big a leap in overall performance as the jump between GPT-3.5 and GPT-4. And it is possible we don’t see a jump that big ever again. But the narrative that there hasn’t been much progress since then is undermined by significant under-the-radar advancements. And this invisible progress could leave us dangerously unprepared for what is to come.

The big risk is that policymakers and the public tune out this progress because they can’t see the improvements first-hand. Everyday users will still encounter frequent hallucinations and basic reasoning failures, which also get triumphantly amplified by AI skeptics. These obvious errors make it easy to dismiss AI’s rapid advancement in more specialized domains.

There’s a common view in the AI world, shared by both proponents and opponents of regulation, that the U.S. federal government won’t mandate guardrails on the technology unless there’s a major galvanizing incident. Such an incident, often called a “warning shot,” could be innocuous, like a credible demonstration of dangerous AI capabilities that doesn’t harm anyone. But it could also take the form of a major disaster caused or enabled by an AI system, or a society upended by devastating labor automation.

The worst-case scenario is that AI systems become scary powerful but no warning shots are fired (or heeded) before a system permanently escapes human control and acts decisively against us.

Last month, Apollo Research, an evaluations group that works with top AI companies, published evidence that, under the right conditions, the most capable AI models were able to scheme against their developers and users. When given instructions to strongly follow a goal, the systems sometimes attempted to subvert oversight, fake alignment, and hide their true capabilities. In rare cases, systems engaged in deceptive behavior without nudging from the evaluators. When the researchers inspected the models’ reasoning, they found that the chatbots knew what they were doing, using language like “sabotage, lying, manipulation.”

This is not to say that these models are about to conspire against humanity. But there has been a disturbing trend: as AI models get smarter, they get better at following instructions and understanding the intent behind their guidelines, but they also get better at deception. Smarter models may also be more likely to engage in dangerous behavior. For instance, one of the world’s most capable models, OpenAI’s o1, was far more likely to double down on a lie after being caught by the Apollo evaluators.

I fear that the gap between AI’s public face and its true capabilities is widening. While consumers see chatbots that still can’t count the letters in “strawberry,” researchers are documenting systems that can match PhD-level expertise and engage in sophisticated deception. This growing disconnect makes it harder for the public and policymakers to gauge AI’s real progress—progress they’ll need to understand to govern it appropriately. The risk isn’t that AI development has stalled; it’s that we’re losing our ability to track where it’s headed.
