News Every Day | 12:00

Publishers are finally getting serious about AI scraping

I think the strongest indicator of how normal using AI has become is the language we use as shorthand for it. It’s now extremely common for someone to say they asked “chat” for some piece of information. We all know what they mean.

But if you needed data on how popular AI portals are now, OpenAI provided it recently when the company revealed that ChatGPT has 900 million users, up from 800 million in the fall. Even if Gemini, Copilot, and Claude weren’t also rising (they are), that would be enough for the media—not to mention brands and marketing/PR agencies—to really understand how fast AI is growing as a discovery channel. Whether or not it’s a source of traffic doesn’t matter; it’s a meaningful layer between publishers and audiences.

That’s obviously the reason there’s been so much interest in the infant field of GEO (generative engine optimization) lately, and why I’ve written about it more than once in the past few months. But the focus on how to get AI search engines to notice and reference content doesn’t mean there shouldn’t be some kind of reckoning with how the content got there in the first place, and what—if any—value exchange that should trigger.

Surveys, such as this one done by OnMessage last fall, consistently show the public believes content providers should be compensated when their content is scraped by AI engines. The AI industry tends to have a different view, often suggesting that “publicly available” data (i.e., stuff on the internet) is fair game. It’s more nuanced than that, of course, but the central issue is one of leverage: The AI companies have it, and publishers by and large don’t.

The push for a better bargain

A new industry coalition is looking to rebalance those scales. In late February, a group of U.K. media companies—including the BBC, the Financial Times, and The Guardian—announced they were forming SPUR, which stands for Standards for Publisher Usage Rights. In an open letter, the leaders of those companies articulated the group’s purpose: “to establish shared technical standards and responsible licensing frameworks that ensure AI developers can access high quality, reliable journalism in legitimate, responsible and convenient ways.”

In other words, SPUR is meant to help lead the publishing industry toward a better bargain between AI companies and the media. Currently, publishers have a hodgepodge of solutions: You could pursue a licensing deal with one of the big AI companies, an option available only to publishers above a certain size. You could sue the AI companies, an expensive proposition. Or you could try to defend your content through a combination of paywalls, bot-blocking protocols, and nascent technologies aimed at getting AI crawlers to pay for access.

The spirit of SPUR is that there’s power in numbers. Although it’s beginning with a handful of U.K. publishers, the group is actively working to recruit media worldwide into the coalition. By taking collective action, which the news media is traditionally allergic to, the coalition stands a better chance of establishing some kind of framework for how AI services will pay for access to content.

It stands an even better chance with allies. Last year, Cloudflare stepped into this fight, advocating on the side of publishers. And it brought to the battlefield technical clout: A significant portion of internet traffic goes through Cloudflare’s network, so it has an outsize say in what the rules are, and which ones get enforced. As part of its push against unauthorized AI scraping, it introduced Pay Per Crawl, a new way to charge bots for access to content.

Cloudflare’s solution is actually one of several on the market, and although SPUR doesn’t intend to play favorites, Pay Per Crawl is exactly the kind of technical barrier the group was created to encourage. The fact is, unauthorized AI crawling is rampant. TollBit, which publishes quarterly reports about bot activity, recently highlighted the problem of third parties using virtual, “headless” browsers (essentially bots accessing sites as if they were humans and then scraping them) on an industrial scale to crawl vast amounts of data, the equivalent of a fishing trawler.

For the longest time, the only technical weapon digital publishers had was the robots exclusion protocol (robots.txt), but it’s an honor system that can easily be ignored or bypassed. The main focus of SPUR, sources tell me, is to help publishers build more defenses. By making it more difficult and cost-prohibitive for AI crawlers to access content, it will encourage the people who operate them to make deals.
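To make the "honor system" point concrete, here is a minimal sketch using Python's standard `urllib.robotparser`. The bot name `ExampleAIBot`, the rules, and the URL are made up for illustration. The thing to notice is that the check runs on the crawler's side: a crawler that never asks the question is not stopped by anything in the file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named AI crawler is disallowed everywhere,
# every other crawler is allowed. The names are illustrative only.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return what robots.txt says about this user agent and URL.

    This is advisory. A well-behaved crawler calls something like this
    before fetching; nothing forces a crawler to call it at all.
    """
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


story_url = "https://example.com/news/story"
blocked = is_allowed("ExampleAIBot", story_url)
open_to_others = is_allowed("SomeOtherBot", story_url)
print(blocked, open_to_others)
```

A crawler that simply announces itself under a different name, or skips the check entirely, gets the page anyway, which is why publishers are looking for enforcement at the network level rather than relying on the file alone.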

Then come the agents

The biggest wild card here is agents. AI services access content largely for three purposes: for training data, for search crawling, and in response to user requests. It’s the last category that is proving most contentious, and it was the impetus behind a war of words between Perplexity and Cloudflare last summer. User agents have traditionally been given a pass from blocking since they effectively act as human proxies, not mass-scraping tools. Importantly, though, they don’t behave as humans do (for example, they don’t look at ads), so many sites (and especially publishers) believe they should be entitled to block them.
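The three purposes can be told apart, at least for crawlers that identify themselves, by the token in the User-Agent header. The sketch below is illustrative only: the token-to-purpose table is an assumption based on names the AI companies have published for their bots, it is not exhaustive, and it may be out of date. The function name `classify_crawler` is hypothetical.

```python
# Assumed mapping of declared crawler tokens to the three purposes named
# in the text. Treat this table as an example, not an authoritative list.
CRAWLER_PURPOSES = {
    "gptbot": "training",
    "claudebot": "training",
    "oai-searchbot": "search",
    "perplexitybot": "search",
    "chatgpt-user": "user_request",
    "perplexity-user": "user_request",
}


def classify_crawler(user_agent_header: str) -> str:
    """Return the declared purpose of a crawler, or "unknown".

    Matching is a case-insensitive substring test on the User-Agent
    header, so it only works for bots that announce themselves honestly.
    """
    header = user_agent_header.lower()
    for token, purpose in CRAWLER_PURPOSES.items():
        if token in header:
            return purpose
    return "unknown"


training = classify_crawler("Mozilla/5.0 (compatible; GPTBot/1.1)")
agent = classify_crawler("Mozilla/5.0 (compatible; ChatGPT-User/1.0)")
browser = classify_crawler("Mozilla/5.0 (Windows NT 10.0) Chrome/120.0")
print(training, agent, browser)
```

A headless browser that presents an ordinary browser User-Agent falls into "unknown" here, which is the weakness of any scheme that depends on self-identification.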

Some believe this aspect of AI crawling should be regulated, and certainly it’s part of the ongoing lawsuits between the media and the AI industry. But those approaches drag on; SPUR is acting now. You can picture this quickly leading to an arms race, and when the players are individual publishers facing the AI industry, that is very asymmetric warfare. But a large, worldwide industry coalition, backed by technical allies like Cloudflare, might actually have a chance to push back.

So now the hard work begins of herding the cats of the media industry. And the clock is ticking: User behavior is shifting rapidly, and asking “chat” what’s happening in the world means more agents are replacing human traffic to news websites. SPUR may give publishers a chance to shape that system, but it is taking form with or without them. Once those rules harden, changing them will be much harder.
