OpenAI's lawyer says there are too many files from Ilya Sutskever and other employees to share in copyright lawsuit
- OpenAI is trying to negotiate down the number of files it must produce in a copyright case.
- Files belonging to ex-chief scientist Ilya Sutskever are among those under dispute.
- The Authors Guild's case centers on claims that OpenAI trained AI models on books without permission.
A lawyer for OpenAI is seeking to negotiate down the number of documents the company must review and disclose in a high-profile copyright lawsuit, arguing that the latest requests involving its cofounder Ilya Sutskever and seven other current and former employees are too large and too numerous.
In a letter to the judge filed on Wednesday in New York federal court, OpenAI lawyer Carolyn M. Homer said files demanded by the Authors Guild from eight additional people would total hundreds of gigabytes of data "comprising over 886,000 documents."
These eight "custodians" — people thought to have relevant evidence to produce in the pretrial discovery process — include former chief scientist and cofounder Sutskever and researcher Jan Leike, who left the company in May for rival firm Anthropic.
The lawsuit centers on claims that OpenAI's models were trained on books without the authors' permission.
Homer also named other disputed custodians, including OpenAI technical staff members Chelsea Voss, Shantanu Jani, and Jong Wook Kim, pretraining data lead Qiming Yuan, and former employees Andrew Mayne and Cullen O'Keefe.
OpenAI has already agreed to produce documents from 24 custodians, but it is pushing back against the proposed search parameters and the request to add eight new custodians, citing concerns that the additional files would significantly increase the resources needed for review.
According to OpenAI's lawyer, the company's review of the existing 24 custodians, based on its own proposed search terms, would require it to examine "more than 460,000 documents" totaling 359 gigabytes. Homer said that using the Authors Guild's proposed terms, OpenAI would need to review over 1 million documents.
Homer said that applying OpenAI's proposed search terms to the eight disputed custodians would yield more than 375 gigabytes of files, exceeding the volume from the 24 custodians already agreed on by both parties.
The lawyer also said that, based on the proposed search terms, OpenAI estimated a 71% duplication rate between the files of the eight disputed custodians and those of the 24 existing ones.
OpenAI's lawyer said that the "substantial volume of hits," as well as concerns over high duplication rates, meant it would continue to attempt to reach an agreement with the plaintiffs over Sutskever's files and the other disputed custodians.
The dispute marks the latest development in the ongoing class-action lawsuit brought by the Authors Guild — which provides support for writers — against OpenAI. Unsealed documents reviewed by BI this year showed that the ChatGPT maker deleted two datasets, "books1" and "books2," used to train an older AI model named GPT-3.
Authors Guild lawyers said in filings that the two datasets may have included "more than 100,000 published books."
OpenAI is also facing several other cases over copyright infringement, including one brought against the company by The New York Times.
OpenAI and the Authors Guild did not immediately respond to Business Insider's request for comment.