{*}
Add news
March 2010 April 2010 May 2010 June 2010 July 2010
August 2010
September 2010 October 2010 November 2010 December 2010 January 2011 February 2011 March 2011 April 2011 May 2011 June 2011 July 2011 August 2011 September 2011 October 2011 November 2011 December 2011 January 2012 February 2012 March 2012 April 2012 May 2012 June 2012 July 2012 August 2012 September 2012 October 2012 November 2012 December 2012 January 2013 February 2013 March 2013 April 2013 May 2013 June 2013 July 2013 August 2013 September 2013 October 2013 November 2013 December 2013 January 2014 February 2014 March 2014 April 2014 May 2014 June 2014 July 2014 August 2014 September 2014 October 2014 November 2014 December 2014 January 2015 February 2015 March 2015 April 2015 May 2015 June 2015 July 2015 August 2015 September 2015 October 2015 November 2015 December 2015 January 2016 February 2016 March 2016 April 2016 May 2016 June 2016 July 2016 August 2016 September 2016 October 2016 November 2016 December 2016 January 2017 February 2017 March 2017 April 2017 May 2017 June 2017 July 2017 August 2017 September 2017 October 2017 November 2017 December 2017 January 2018 February 2018 March 2018 April 2018 May 2018 June 2018 July 2018 August 2018 September 2018 October 2018 November 2018 December 2018 January 2019 February 2019 March 2019 April 2019 May 2019 June 2019 July 2019 August 2019 September 2019 October 2019 November 2019 December 2019 January 2020 February 2020 March 2020 April 2020 May 2020 June 2020 July 2020 August 2020 September 2020 October 2020 November 2020 December 2020 January 2021 February 2021 March 2021 April 2021 May 2021 June 2021 July 2021 August 2021 September 2021 October 2021 November 2021 December 2021 January 2022 February 2022 March 2022 April 2022 May 2022 June 2022 July 2022 August 2022 September 2022 October 2022 November 2022 December 2022 January 2023 February 2023 March 2023 April 2023 May 2023 June 2023 July 2023 August 2023 September 2023 October 2023 November 2023 December 2023 January 2024 February 2024 March 2024 April 2024 May 2024 June 2024 July 2024 August 2024 September 2024 October 2024 November 2024 December 2024 January 2025 February 2025 March 2025 April 2025 May 2025 June 2025 July 2025 August 2025 September 2025 October 2025 November 2025 December 2025 January 2026 February 2026
1 2 3 4 5 6 7 8 9 10 11 12 13 14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
News Every Day |

News Publishers Are Now Blocking The Internet Archive, And We May All Regret It

Last fall, I wrote about how the fear of AI was leading us to wall off the open internet in ways that would hurt everyone. At the time, I was worried about how companies were conflating legitimate concerns about bulk AI training with basic web accessibility. Not surprisingly, the situation has gotten worse. Now major news publishers are actively blocking the Internet Archive—one of the most important cultural preservation projects on the internet—because they’re worried AI companies might use it as a sneaky “backdoor” to access their content.

This is a mistake we’re going to regret for generations.

Nieman Lab reports that The Guardian, The New York Times, and others are now limiting what the Internet Archive can crawl and preserve:

When The Guardian took a look at who was trying to extract its content, access logs revealed that the Internet Archive was a frequent crawler, said Robert Hahn, head of business affairs and licensing. The publisher decided to limit the Internet Archive’s access to published articles, minimizing the chance that AI companies might scrape its content via the nonprofit’s repository of over one trillion webpage snapshots.

Specifically, Hahn said The Guardian has taken steps to exclude itself from the Internet Archive’s APIs and filter out its article pages from the Wayback Machine’s URLs interface. The Guardian’s regional homepages, topic pages, and other landing pages will continue to appear in the Wayback Machine.

The Times has gone even further:

The New York Times confirmed to Nieman Lab that it’s actively “hard blocking” the Internet Archive’s crawlers. At the end of 2025, the Times also added one of those crawlers — archive.org_bot — to its robots.txt file, disallowing access to its content.

“We believe in the value of The New York Times’s human-led journalism and always want to ensure that our IP is being accessed and used lawfully,” said a Times spokesperson. “We are blocking the Internet Archive’s bot from accessing the Times because the Wayback Machine provides unfettered access to Times content — including by AI companies — without authorization.”

I understand the concern here. I really do. News publishers are struggling, and watching AI companies hoover up their content to train models that might then, in some ways, compete with them for readers is genuinely frustrating. I run a publication myself, remember.

But blocking the Internet Archive isn’t going to stop AI training. What it will do is ensure that significant chunks of our journalistic record and historical cultural context simply… disappear.

And that’s bad.

The Internet Archive is the most famous nonprofit digital library, and has been operating for nearly three decades. It isn’t some fly-by-night operation looking to profit off publisher content. It’s trying to preserve the historical record of the internet—which is way more fragile than most people comprehend. When websites disappear—and they disappear constantly—the Wayback Machine is often the only place that content still exists. Researchers, historians, journalists, and ordinary citizens rely on it to understand what actually happened, what was actually said, what the world actually looked like at a given moment.

In a digital era when few things end up printed on paper, the Internet Archive’s efforts to permanently preserve our digital culture are essential infrastructure for anyone who cares about historical memory.

And now we’re telling them they can’t preserve the work of our most trusted publications.

Think about what this could mean in practice. Future historians trying to understand 2025 will have access to archived versions of random blogs, sketchy content farms, and conspiracy sites—but not The New York Times. Not The Guardian. Not the publications that we consider the most reliable record of what’s happening in the world. We’re creating a historical record that’s systematically biased against quality journalism.

Yes, I’m sure some will argue that the NY Times and The Guardian will never go away. Tell that to the readers of the Rocky Mountain News, which published for 150 years before shutting down in 2009, or to the 2,100+ newspapers that have closed since 2004. Institutions—even big, prominent, established ones—don’t necessarily last.

As one computer scientist quoted in the Nieman piece put it:

“Common Crawl and Internet Archive are widely considered to be the ‘good guys’ and are used by ‘the bad guys’ like OpenAI,” said Michael Nelson, a computer scientist and professor at Old Dominion University. “In everyone’s aversion to not be controlled by LLMs, I think the good guys are collateral damage.”

That’s exactly right. In our rush to punish AI companies, we’re destroying public goods that serve everyone.

The most frustrating bit of all of this: The Guardian admits they haven’t actually documented AI companies scraping their content through the Wayback Machine. This is purely precautionary and theoretical. They’re breaking historical preservation based on a hypothetical threat:

The Guardian hasn’t documented specific instances of its webpages being scraped by AI companies via the Wayback Machine. Instead, it’s taking these measures proactively and is working directly with the Internet Archive to implement the changes.

And, of course, as one of the “good guys” of the internet, the Internet Archive is willing to do exactly what these publishers want. They’ve always been good about removing content or not scraping content that people don’t want in the archive. Sometimes to a fault. But you can never (legitimately) accuse them of malicious archiving (even if music labels and book publishers have).

Either way, we’re sacrificing the historical record not because of proven harm, but because publishers are worried about what might happen. That’s a hell of a tradeoff.

This isn’t even new, of course. Last year, Reddit announced it would block the Internet Archive from archiving its forums—decades of human conversation and cultural history—because Reddit wanted to monetize that content through AI licensing deals. The reasoning was the same: can’t let the Wayback Machine become a backdoor for AI companies to access content Reddit is now selling. But once you start going down that path, it leads to bad places.

The Nieman piece notes that, in the case of USA Today/Gannett, it appears that there was a company-wide decision to tell the Internet Archive to get lost:

In total, 241 news sites from nine countries explicitly disallow at least one out of the four Internet Archive crawling bots.

Most of those sites (87%) are owned by USA Today Co., the largest newspaper conglomerate in the United States formerly known as Gannett. (Gannett sites only make up 18% of Welsh’s original publishers list.) Each Gannett-owned outlet in our dataset disallows the same two bots: “archive.org_bot” and “ia_archiver-web.archive.org”. These bots were added to the robots.txt files of Gannett-owned publications in 2025.

Some Gannett sites have also taken stronger measures to guard their contents from Internet Archive crawlers. URL searches for the Des Moines Register in the Wayback Machine return a message that says, “Sorry. This URL has been excluded from the Wayback Machine.”

A Gannett spokesperson told NiemanLab that it was about “safeguarding our intellectual property” but that’s nonsense. The whole point of libraries and archives is to preserve such content, and they’ve always preserved materials that were protected by copyright law. The claim that they have to be blocked to safeguard such content is both technologically and historically illiterate.

And here’s the extra irony: blocking these crawlers may not even serve publishers’ long-term interests. As I noted in my earlier piece, as more search becomes AI-mediated (whether you like it or not), being absent from training datasets increasingly means being absent from results. It’s a bit crazy to think about how much effort publishers put into “search engine optimization” over the years, only to now block the crawlers that feed the systems a growing number of people are using for search. Publishers blocking archival crawlers aren’t just sacrificing the historical record—they may be making themselves invisible in the systems that increasingly determine how people discover content in the first place.

The Internet Archive’s founder, Brewster Kahle, has been trying to sound the alarm:

“If publishers limit libraries, like the Internet Archive, then the public will have less access to the historical record.”

But that warning doesn’t seem to be getting through. The panic about AI has become so intense that people are willing to sacrifice core internet infrastructure to address it.

What makes this particularly frustrating is that the internet’s openness was never supposed to have asterisks. The fundamental promise wasn’t “publish something and it’s accessible to all, except for technologies we decide we don’t like.” It was just… open. You put something on the public web, people can access it. That simplicity is what made the web transformative.

Now we’re carving out exceptions based on who might access content and what they might do with it. And once you start making those exceptions, where do they end? If the Internet Archive can be blocked because AI companies might use it, what about research databases? What about accessibility tools that help visually impaired users? What about the next technology we haven’t invented yet?

This is a real concern. People say “oh well, blocking machines is different from blocking humans,” but that’s exactly why I mention assistive tech for the visually impaired. Machines accessing content are frequently tools that help humans—including me. I use an AI tool to help fact check my articles, and part of that process involves feeding it the source links. But increasingly, the tool tells me it can’t access those articles to verify whether my coverage accurately reflects them.

I don’t have a clean answer here. Publishers genuinely need to find sustainable business models, and watching their work get ingested by AI systems without compensation is a legitimate grievance—especially when you see how much traffic some of these (usually less scrupulous) crawlers dump on sites. But the solution can’t be to break the historical record of the internet. It can’t be to ensure that our most trusted sources of information are the ones that disappear from archives while the least trustworthy ones remain.

We need to find ways to address AI training concerns that don’t require us to abandon the principle of an open, preservable web. Because right now, we’re building a future where historians, researchers, and citizens can’t access the journalism that documented our era. And that’s not a tradeoff any of us should be comfortable with.

Ria.city






Read also

Is CPS superintendent search ‘on track’ or ‘stalled’? Depends who you ask on school board

Coachless Rennes stuns Ligue 1 leader PSG and ends winless run

Sixth for Ireland in opening leg of 2026 Longines League of Nations, Abu Dhabi

News, articles, comments, with a minute-by-minute update, now on Today24.pro

Today24.pro — latest news 24/7. You can add your news instantly now — here




Sports today


Новости тенниса


Спорт в России и мире


All sports news today





Sports in Russia today


Новости России


Russian.city



Губернаторы России









Путин в России и мире







Персональные новости
Russian.city





Friends of Today24

Музыкальные новости

Персональные новости