Accio Lawyers! Microsoft manager trained AI on pirated Potter books
Oh, my. With “AI” systems causing a lot of problems pretty much everywhere, it’s a bad look for one of the world’s most important tech companies to actively promote piracy. But that appears to be just what happened: a post hosted on Microsoft’s developer blog used an apparently pirated set of Harry Potter novels to train an Azure-based “AI” system.
“The Harry Potter series, written by J.K. Rowling, is a globally beloved collection of seven books that follow the journey of a young wizard, Harry Potter, and his friends as they battle the dark forces led by the evil Voldemort,” wrote Pooja Kamath, a Microsoft Senior Product Manager. The blog post then pointed to a Kaggle dataset link that contained seven TXT files, apparently encompassing the entire published novel series.
The blog post was a guide on adding generative “AI” to applications via Azure. The manager said that it could be used to create a Q&A system, or auto-generate Harry Potter fan fiction. “This feature is sure to delight Potterheads, allowing them to explore new adventures and create their own magical stories.” It closes with an LLM-generated image of two children on a train, obviously caricatures of Harry Potter and Ron Weasley, with a Microsoft logo between them.
This is, in technical legalistic terms, a big frickin’ no-no. All the Harry Potter novels are, of course, held under copyright by various entities around the world, including the author. A quick browse on Amazon shows that a complete collection costs $70 USD in ebook format at the time of writing. Hosting or downloading the files for free without paying any kind of royalty is illegal basically everywhere. Yes, that includes downloading them even if all you intend to do is plug them into a large language model.
The original Microsoft how-to post was published in late 2024, and has been removed from the site (though it’s still accessible via the Internet Archive). Ditto for the Kaggle dataset, which was mistakenly marked as “public domain” and only downloaded about 10,000 times, according to a report from Ars Technica. Both the blog post and the pirated dataset seem to have flown under the radar for a year and a half, until a Hacker News thread yesterday brought new attention to them.
It’s shocking that a Microsoft manager would be so casual about ebook piracy in a public post on a Microsoft blog (though Kamath may not have understood how the public domain works and may have assumed the files were marked correctly). But the most popular large language models have been trained on millions of ebooks, many (possibly even a majority) of which were obtained via piracy.
Authors have filed lawsuits against Meta/Facebook, OpenAI, Nvidia, Alphabet/Google, Anthropic, Microsoft, and others, aiming to stop training on copyrighted works and/or seek remuneration for books already incorporated into LLM training without permission. Initial results in the courts have been mixed: some rulings have found the results of training models “transformative” and thus substantively different from the core data (i.e., fair use), while others have found that initial acts of piracy must still be prosecuted.