Risk: OpenAI destroyed databases containing more than 100,000 books used to train ChatGPT

The conflict between the authors’ union and OpenAI, owner of ChatGPT, has entered a new chapter, with documents showing that the startup used thousands of books to train its algorithms.

The union is suing the startup, claiming that OpenAI infringed the copyright of published works by using them for AI training.

New evidence suggests that the startup deleted two databases, known as books1 and books2, which contained more than 100,000 published works. According to Business Insider, OpenAI has been reluctant to acknowledge the existence of these files. Documents dated 2020 and only now released reveal that the books1 and books2 databases accounted for 16% of the training data used to create GPT-3, totaling 50 billion words.

OpenAI’s lawyers claim that the use of these databases for training was discontinued at the end of 2021, that the databases were deleted the following year, and that none of the current ChatGPT models were created using these files. Furthermore, those responsible for creating the files are no longer with the company.

Published books are crucial to training high-quality AI models, but the lack of financial compensation for copyright holders has led to legal disputes, including lawsuits brought by the Authors’ Union. The startup seeks to keep the contents of the databases and the identities of the employees confidential.