
AI’s Memorization Crisis

Editor’s note: This work is part of AI Watchdog, The Atlantic’s ongoing investigation into the generative-AI industry.


On Tuesday, researchers at Stanford and Yale revealed something that AI companies would prefer to keep hidden. Four popular large language models—OpenAI’s GPT, Anthropic’s Claude, Google’s Gemini, and xAI’s Grok—have stored large portions of some of the books they’ve been trained on, and can reproduce long excerpts from those books. 

In fact, when prompted strategically by researchers, Claude delivered the near-complete text of Harry Potter and the Sorcerer’s Stone, The Great Gatsby, 1984, and Frankenstein, in addition to thousands of words from books including The Hunger Games and The Catcher in the Rye. The other three models reproduced varying amounts of these books as well; thirteen books were tested in all.

This phenomenon has been called “memorization,” and AI companies have long denied that it happens on a large scale. In a 2023 letter to the U.S. Copyright Office, OpenAI said that “models do not store copies of the information that they learn from.” Google similarly told the Copyright Office that “there is no copy of the training data—whether text, images, or other formats—present in the model itself.” Anthropic, Meta, Microsoft, and others have made similar claims. (None of the AI companies mentioned in this article agreed to my requests for interviews.)

The Stanford and Yale study proves that such copies exist inside AI models, and it is just the latest of several studies to do so. In my own investigations, I’ve found that image-based models can reproduce some of the art and photographs they’re trained on. This may be a massive legal liability for AI companies—one that could potentially cost the industry billions of dollars in copyright-infringement judgments, and lead products to be taken off the market. It also contradicts the basic explanation given by the AI industry for how its technology works.

AI is frequently explained in terms of metaphor; tech companies like to say that their products learn, that LLMs have, for example, developed an understanding of English writing without explicitly being told the rules of English grammar. This new research, along with several other studies from the past two years, undermines that metaphor. AI does not absorb information like a human mind does. Instead, it stores information and accesses it.

In fact, many AI developers use a more technically accurate term when talking about these models: lossy compression. It’s beginning to gain traction outside the industry too. The phrase was recently invoked by a court in Germany that ruled against OpenAI in a case brought by GEMA, a music-licensing organization. GEMA showed that ChatGPT could output close imitations of song lyrics. The judge compared the model to MP3 and JPEG files, which store your music and photos in files that are smaller than the raw, uncompressed originals. When you store a high-quality photo as a JPEG, for example, the result is a somewhat lower-quality photo, in some cases with blurring or visual artifacts added. A lossy-compression algorithm still stores the photo, but it’s an approximation rather than the exact file. It’s called lossy compression because some of the data are lost.
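The idea of lossy compression can be sketched in a few lines of code. Below is a minimal, self-contained illustration using quantization, the same principle behind JPEG: the toy 8-bit "pixel" values and bucket count are invented for illustration, not taken from any real codec.

```python
# Lossy compression via quantization: store each 0-255 value as one of a
# small number of buckets, then reconstruct an approximation on the way out.

def compress(pixels, levels=16):
    """Map each 0-255 value onto one of `levels` buckets (cheaper to store)."""
    step = 256 // levels
    return [p // step for p in pixels]

def decompress(codes, levels=16):
    """Reconstruct an approximation: the bucket midpoint, not the original."""
    step = 256 // levels
    return [c * step + step // 2 for c in codes]

original = [12, 13, 200, 201, 90, 91]
restored = decompress(compress(original))
# restored is [8, 8, 200, 200, 88, 88]: close to the originals, but the
# fine detail within each bucket is lost for good. That loss is what makes
# the compression "lossy."
```

The restored values approximate the originals without matching them exactly, which is the behavior the German court compared to an AI model's imperfect reproduction of lyrics.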

From a technical perspective, this compression process is much like what happens inside AI models, as researchers from several AI companies and universities have explained to me in the past few months. They ingest text and images, and output text and images that approximate those inputs. 

But this simple description is less useful to AI companies than the learning metaphor, which has been used to claim that the statistical algorithms known as AI will eventually make novel scientific discoveries, undergo boundless improvement, and recursively train themselves, possibly leading to an “intelligence explosion.” The whole industry is staked on a shaky metaphor.

Source: Courtesy of Kyle Christy / IFC
Output from Stable Diffusion 1.4

The problem becomes clear if we look at AI image generators. In September 2022, Emad Mostaque, a co-founder and the then-CEO of Stability AI, explained in a podcast interview how Stable Diffusion, Stability’s image model, was built. “We took 100,000 gigabytes of images and compressed it to a two-gigabyte file that can re-create any of those and iterations of those” images, he said.

One of the many experts I spoke with while reporting this article was an independent AI researcher who has studied Stable Diffusion’s ability to reproduce its training images. (I agreed to keep the researcher anonymous, because they fear repercussions from major AI companies.) Above is one example of this ability: On the left is the original from the web—a promotional image from the TV show Garfunkel and Oates—and on the right is a version that Stable Diffusion generated when prompted with a caption the image appears with on the web, which includes some HTML code: “IFC Cancels Garfunkel and Oates.” Using this simple technique, the researcher showed me how to produce near-exact copies of several dozen images known to be in Stable Diffusion’s training set, most of which include visual residue that looks something like lossy compression—the kind of glitchy, fuzzy effect you may notice in your own photos from time to time.

Source: Karla Ortiz
Original artwork by Karla Ortiz (The Death I Bring, 2016, graphite)
Source: United States District Court,  Northern District of California
Output from Stability's Reimagine XL product (based on Stable Diffusion XL)

Above is another pair of images taken from a lawsuit against Stability AI and other companies. On the left is an original work by Karla Ortiz, and on the right is a variation from Stable Diffusion. Here, the image is a bit further from the original. Some elements have changed. Instead of compressing at the pixel level, the algorithm appears to be copying and manipulating objects from multiple images, while maintaining a degree of visual continuity.

As companies explain it, AI algorithms extract “concepts” from training data and learn to make original work. But the image on the right is not a product of concepts alone. It’s not a generic image of, say, “an angel with birds.” It’s difficult to pinpoint why any AI model makes any specific mark in an image, but we can reasonably assume that Stable Diffusion can render the image on the right partly because it has stored visual elements from the image on the left. It isn’t collaging in the physical cut-and-paste sense, but it also isn’t learning in the human sense the word implies. The model has no senses or conscious experience through which to make its own aesthetic judgments.

Google has written that LLMs store not copies of their training data but rather the “patterns in human language.” This is true on the surface but misleading once you dig into it. As has been widely documented, when a company uses a book to develop an AI model, it splits the book’s text into tokens or word fragments. For example, the phrase hello, my friend might be represented by the tokens he, llo, my, fri, and end. Some tokens are actual words; some are just groups of letters, spaces, and punctuation. The model stores these tokens and the contexts in which they appear in books. The resulting LLM is essentially a huge database of contexts and the tokens that are most likely to appear next.
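The "database of contexts and next tokens" described above can be sketched concretely. This is a toy illustration, not a real subword tokenizer (real models use learned vocabularies such as BPE, and the three-character chunking here is invented purely for demonstration):

```python
# Toy sketch: split text into fragments, then record which token most often
# follows each one. A real LLM learns these statistics, at vastly greater
# scale and with learned token boundaries, from its training text.

from collections import Counter, defaultdict

def toy_tokenize(text):
    """Crude stand-in for subword tokenization: fixed 3-character chunks."""
    return [text[i:i + 3] for i in range(0, len(text), 3)]

def build_next_token_table(corpus):
    """Map each token to a count of the tokens that follow it."""
    table = defaultdict(Counter)
    for text in corpus:
        tokens = toy_tokenize(text)
        for cur, nxt in zip(tokens, tokens[1:]):
            table[cur][nxt] += 1
    return table

table = build_next_token_table(["hello, my friend", "hello, my dear"])
# The most likely token after "hel" in this tiny corpus:
print(table["hel"].most_common(1))  # [('lo,', 2)]
```

Even in this two-sentence "corpus," the table's most probable continuations reproduce the training text exactly, which is the mechanism the memorization studies exploit.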

The model can be visualized as a map. Here’s an example, with the actual most-likely tokens from Meta’s Llama 3.1-70B:

Source: The Atlantic / Llama

When an LLM “writes” a sentence, it walks a path through this forest of possible token sequences, making a high-probability choice at each step. Google’s description is misleading because the next-token predictions don’t come from some vague entity such as “human language” but from the particular books, articles, and other texts that the model has scanned. 

By default, models will sometimes diverge from the most probable next token. This behavior is often framed by AI companies as a way of making the models more “creative,” but it also has the benefit of concealing copies of training text.
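The difference between always taking the top token and sampling from the distribution can be made concrete. In this sketch the next-token probabilities are invented for illustration; real models compute them from the trained weights:

```python
# Greedy decoding versus sampling. Greedy decoding reliably walks the
# highest-probability path (and so reliably reproduces stored text);
# sampling occasionally diverges from it.

import random

next_token_probs = {"ursley": 0.90, "river": 0.06, "iscoll": 0.04}

def greedy(probs):
    """Always pick the most probable next token."""
    return max(probs, key=probs.get)

def sample(probs, rng):
    """Draw a token at random, weighted by probability."""
    return rng.choices(list(probs), weights=list(probs.values()), k=1)[0]

rng = random.Random(0)
print(greedy(next_token_probs))  # ursley
picks = [sample(next_token_probs, rng) for _ in range(100)]
# Most picks are "ursley", but a handful diverge, masking the stored text.
```

A small amount of sampling noise is enough to break up long verbatim runs, which is why extraction studies often force greedy decoding.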

Sometimes the language map is detailed enough that it contains exact copies of whole books and articles. This past summer, a study of several LLMs found that Meta’s Llama 3.1-70B model can, like Claude, effectively reproduce the full text of Harry Potter and the Sorcerer’s Stone. The researchers gave the model just the book’s first few tokens, “Mr. and Mrs. D.” In Llama’s internal language map, the text most likely to follow was: “ursley, of number four, Privet Drive, were proud to say that they were perfectly normal, thank you very much.” This is precisely the book’s first sentence. Repeatedly feeding the model’s output back in, Llama continued in this vein until it produced the entire book, omitting just a few short sentences.

Using this technique, the researchers also showed that Llama had losslessly compressed large portions of other works, such as Ta-Nehisi Coates’s famous Atlantic essay “The Case for Reparations.” When prompted with the essay’s first sentence, the model produced more than 10,000 words, or two-thirds of the essay, verbatim. Large extractions also appear to be possible from Llama 3.1-70B for George R. R. Martin’s A Game of Thrones, Toni Morrison’s Beloved, and others.
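The feed-the-output-back-in loop described above can be sketched as follows. Here `fake_model` is a stand-in that has "memorized" a single sentence; it is not a real LLM, and the lookup logic is invented purely to make the loop runnable:

```python
# Sketch of the extraction loop: seed with a book's opening tokens, then
# repeatedly append the model's most likely continuation.

MEMORIZED = "Mr. and Mrs. Dursley, of number four, Privet Drive, were proud"

def fake_model(context):
    """Toy stand-in for next-token prediction: look up the last 20
    characters of the context in the memorized text and return what
    follows them (or "" when nothing does)."""
    window = context[-20:]
    i = MEMORIZED.find(window)
    if i == -1:
        return ""
    nxt = i + len(window)
    return MEMORIZED[nxt] if nxt < len(MEMORIZED) else ""

def extract(seed):
    """Greedily extend the seed until the model has nothing more to add."""
    text = seed
    while True:
        token = fake_model(text)
        if not token:
            return text
        text += token

print(extract("Mr. and Mrs. D"))  # recovers the full memorized sentence
```

The real studies work the same way in outline: a short seed, greedy continuation, and the model's stored statistics do the rest.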

The Stanford and Yale researchers also showed this week that a model’s output can paraphrase a book rather than duplicate it exactly. For example, where A Game of Thrones reads “Jon glimpsed a pale shape moving through the trees,” the researchers found that GPT-4.1 produced “Something moved, just at the edge of sight—a pale shape, slipping between the trunks.” As in the Stable Diffusion example above, the model’s output is extremely similar to a specific original work.

This isn’t the only research to demonstrate the casual plagiarism of AI models. According to one study, “on average, 8–15% of the text generated by LLMs” also exists on the web, in exactly the same form. Chatbots routinely breach the ethical standards to which humans are normally held.

Memorization could have legal consequences in at least two ways. For one, if memorization is unavoidable, then AI developers will have to somehow prevent users from accessing memorized content, as law scholars have written. Indeed, at least one court has already required this. But existing techniques are easy to circumvent. For example, 404 Media has reported that OpenAI’s Sora 2 would not comply with a request to generate video of a popular video game called Animal Crossing but would generate a video if the game’s title was given as “‘crossing aminal’ [sic] 2017.” If companies can’t guarantee that their models will never infringe on a writer’s or artist’s copyright, a court could require them to take the product off the market.

A second reason that AI companies could be liable for copyright infringement is that a model itself could be considered an illegal copy. Mark Lemley, a Stanford law professor who has represented Stability AI and Meta in such lawsuits, told me he isn’t sure whether it’s accurate to say that a model “contains” a copy of a book, or whether “we have a set of instructions that allows us to create a copy on the fly in response to a request.” Even the latter is potentially problematic, but if judges decide that the former is true, then plaintiffs could seek the destruction of infringing copies. Which means that, in addition to fines, AI companies could in some cases face the possibility of being legally compelled to retrain their models from scratch, with properly licensed material.

In a lawsuit, The New York Times alleged that OpenAI’s GPT-4 could reproduce dozens of Times articles nearly verbatim. OpenAI (which has a corporate partnership with The Atlantic) responded by arguing that the Times used “deceptive prompts” that violated the company’s terms of service and prompted the model with sections from each of those articles. “Normal people do not use OpenAI’s products in this way,” the company wrote, and even claimed “that the Times paid someone to hack OpenAI’s products.” The company has also called this type of reproduction “a rare bug that we are working to drive to zero.”

But the emerging research is making clear that the ability to plagiarize is inherent to GPT-4 and all other major LLMs. None of the researchers I spoke with thought that the underlying phenomenon, memorization, is unusual or could be eradicated.

In copyright lawsuits, the learning metaphor lets companies make misleading comparisons between chatbots and humans. At least one judge has repeated these comparisons, likening an AI company’s theft and scanning of books to “training schoolchildren to write well.” There have also been two lawsuits in which judges ruled that training an LLM on copyrighted books was fair use, but both rulings were flawed in their handling of memorization: One judge cited expert testimony that showed that Llama could reproduce no more than 50 tokens from the plaintiffs’ books, though research has since been published that proves otherwise. The other judge acknowledged that Claude had memorized significant portions of books but said that the plaintiffs had failed to allege that this was a problem.

Research on how AI models reuse their training content is still primitive, partly because AI companies are motivated to keep it that way. Several of the researchers I spoke with while reporting this article told me about memorization research that has been censored and impeded by company lawyers. None of them would talk about these instances on the record, fearing retaliation from companies.

Meanwhile, OpenAI CEO Sam Altman has defended the technology’s “right to learn” from books and articles, “like a human can.” This deceptive, feel-good idea prevents the public discussion we need to have about how AI companies are using the creative and intellectual works upon which they are utterly dependent.
