News Every Day | 09 December 2025, 23:13

The problem with ‘human in the loop’ AI? Often, it’s the humans

Welcome to Eye on AI. In this edition…AI is outperforming some professionals…Google plans to bring ads to Gemini…leading AI labs team up on AI agent standards…a new effort to give AI models a longer memory…and the mood turns on LLMs and AGI.

Greetings from San Francisco, where we are just wrapping up Fortune Brainstorm AI. On Thursday, we’ll bring you a roundup of insights from the conference. But today, I want to talk about some notable studies from the past few weeks with potentially big implications for the business impact AI may have.

First, there was a study from the AI evaluations company Vals AI that pitted several legal AI applications as well as ChatGPT against human lawyers on legal research tasks. All of the AI applications beat the average human lawyers (who were allowed to use digital legal search tools) in drafting legal research reports across three criteria: accuracy, authoritativeness, and appropriateness. The lawyers’ aggregate median score was 69%, while ChatGPT scored 74%, Midpage 76%, Alexi 77%, and Counsel Stack, which had the highest overall score, 78%.

One of the more intriguing findings is that for many question types, it was the generalist ChatGPT that was the most accurate, beating out the more specialized applications. And while ChatGPT lost points for authoritativeness and appropriateness, it still topped the human lawyers across those dimensions.

The study has been faulted for not testing some of the better-known and most widely adopted legal AI research tools, such as Harvey, Legora, CoCounsel from Thomson Reuters, or LexisNexis Protégé, and for only testing ChatGPT among the frontier general-purpose models. Still, the findings are notable and comport with what I’ve heard anecdotally from lawyers.

A little while ago I had a conversation with Chris Kercher, a litigator at Quinn Emanuel who founded that firm’s data and analytics group. Quinn Emanuel has been using Anthropic’s general purpose AI model Claude for a lot of tasks. (This was before Anthropic’s latest model, Claude Opus 4.5, debuted.) “Claude Opus 3 writes better than most of my associates,” Kercher told me. “It just does. It is clear and organized. It’s a great model.” He said he is “constantly amazed” by what LLMs can do, finding new issues, strategies, and tactics that he can use to argue cases.

Kercher said that AI models have allowed Quinn Emanuel to “invert” its prior work processes. In the past, junior lawyers, who are known as associates, would spend days researching and writing up legal memos, finding citations for every sentence, before presenting those memos to more senior lawyers, who would incorporate some of that material into briefs or arguments that would actually be presented in court. Today, he said, AI generates drafts that are by and large better, in a fraction of the time, and those drafts are then given to associates to vet. The associates are still responsible for the accuracy of the memos and citations, just as they always were, but now they are fact-checking the AI and editing what it produces rather than performing the initial research and drafting, he said.

He said that the most experienced, senior lawyers often get the most value out of working with AI, because they have the expertise to know how to craft the perfect prompt, along with the professional judgment and discernment to quickly assess the quality of the AI’s response. Is the argument the model has come up with sound? Is it likely to work in front of a particular judge or be convincing to a jury? These sorts of questions still require judgment that comes from experience, Kercher said.

OK, so that’s law, but it likely points to ways in which AI is beginning to upend work within other “knowledge industries” too. Here at Brainstorm AI yesterday, I interviewed Michael Truell, the cofounder and CEO of hot AI coding tool Cursor. He noted that in a University of Chicago study looking at the effects of developers using Cursor, it was often the most experienced software engineers who saw the most benefit from the tool, perhaps for some of the same reasons Kercher says experienced lawyers get the most out of Claude: they have the professional experience to craft the best prompts and the judgment to better assess the tools’ outputs.

Then there was a study out on the use of generative AI to create visuals for advertisements. Business professors at New York University and Emory University tested whether advertisements for beauty products created by human experts alone, created by human experts and then edited by AI models, or created entirely by AI models were most appealing to prospective consumers. They found the ads that were entirely AI generated were chosen as the most effective—increasing clickthrough rates in a trial they conducted online by 19%. Meanwhile, those created by humans and edited by AI were actually less effective than those simply created by human experts with no AI intervention. But, critically, if people were told the ads were AI-generated, their likelihood of buying the product declined by almost a third.

Those findings present a big ethical challenge to brands. Most AI ethicists think people should generally be told when they are consuming content generated by AI. And advertisers do need to negotiate various Federal Trade Commission rulings around “truth in advertising.” But many ads already use actors posing in various roles without needing to necessarily tell people that they are actors—or the ads do so only in very fine print. How different is AI-generated advertising? The study seems to point to a world where more and more advertising will be AI-generated and where disclosures will be minimal.

The study also seems to challenge the conventional wisdom that “centaur” solutions (which combine the strengths of humans and those of AI in complementary ways) will always perform better than either humans or AI alone. (Sometimes this is condensed to the aphorism “AI won’t take your job. A human using AI will take your job.”) A growing body of research seems to suggest that in many areas, this simply isn’t true. Often, the AI on its own actually produces the best results.

But it is also the case that whether centaur solutions work well depends tremendously on the exact design of the human-AI interaction. A study on human doctors using ChatGPT to aid diagnosis, for example, found that humans working with AI could indeed produce better diagnoses than either doctors or ChatGPT alone, but only if ChatGPT was used to render an initial diagnosis and human doctors, with access to the ChatGPT diagnosis, then gave a second opinion. If that process was reversed, and ChatGPT was asked to render the second opinion on the doctor’s diagnosis, the results were worse. In fact, the second-best results came from just having ChatGPT provide the diagnosis. In the advertising study, it would have been good if the researchers had looked at what happens if AI generates the ads and then human experts edit them.

But in any case, momentum towards automation—often without a human in the loop—is building across many fields.

On that happy note, here’s more AI news.

Jeremy Kahn
jeremy.kahn@fortune.com
@jeremyakahn

This story was originally featured on Fortune.com
