Long stories short
- China drew up tighter rules to govern generative artificial intelligence in the country.
- Amazon was projected to make $12.9 billion from this year’s Prime Day, up from $8 billion last year.
- Massachusetts lawmakers considered a near-total ban on selling location data from mobile phones.
Shadow theft
Sarah Silverman and two bestselling novelists are suing Meta and OpenAI for $1 billion after accusing them of using “shadow libraries” to train AI chatbots.
So what? Increasing numbers of creators are suing AI companies over copyright infringement. Their lawsuits threaten to expose the opaque world of AI model training and its use of illegal databases of millions of e-books.
A comedian and a chatbot walk into a bar. Silverman is a famous comedian and actor who in 2010 wrote a book called The Bedwetter. In her lawsuit, she claims that OpenAI and Meta copied the book without her approval. For proof, Silverman has submitted an exhibit showing what happened when ChatGPT was asked to summarise her work. The summary was so accurate, she argues, that The Bedwetter must have formed part of ChatGPT’s training dataset.
Shadow libraries. The Silverman lawsuit alleges that AI companies like OpenAI illegally used “shadow libraries” to train their technology. These online databases provide free access to millions of books and articles that would otherwise be behind paywalls. Users download books using peer-to-peer file-sharing technology, much as Napster users shared music 20 years ago.
Shadow libraries are of great value to the AI-training community. Large language models, the programs that form the basis of chatbots like ChatGPT, must be trained on huge amounts of text to mimic human responses.
While some online libraries – like Project Gutenberg, a collection of e-books with expired copyrights – are legal, others are more controversial:
- Library Genesis (aka LibGen): a file-sharing shadow library for academic journals, general interest books, comics and audio books. Has over 80 million science magazine articles. Russian origin.
- Z-Library: began as a mirror for LibGen. Now describes itself as the “world’s largest e-book library”. Located on the Dark Web.
- Sci-Hub: limited to research papers, with more than 88 million files. In 2019, the site’s operator said it was accessed more than 400,000 times a day.
- Bibliotik: an invite-only file-sharing site with 196,640 e-books, arranged by category like a library. It has more than 1,000 active members and attracts 150,000 visits each month.
Not unique. Silverman and her co-plaintiffs, the novelists Richard Kadrey and Christopher Golden, are not the only people to take AI firms to court. A group of visual artists have sued Stability AI, Midjourney and DeviantArt for copyright infringement. Programmers have sued GitHub over GitHub Copilot, an AI product which they say relies on “unprecedented open-source software piracy”. Getty Images has also filed an AI lawsuit, alleging that Stability AI, which created the image-generation tool Stable Diffusion, trained its model on “millions of images protected by copyright”.
What next? Meta and OpenAI have yet to respond to Silverman’s lawsuit but are understood to deny any wrongdoing. More broadly, these lawsuits could define the boundaries of how AI learns and what role copyright laws will play in how training datasets are assembled.
Napster, the music file-sharing website, collapsed in 2001 under a wave of lawsuits. Two decades later, AI companies are beginning to discover that they can’t ignore copyright either.
Thanks for reading. Please tell your friends to sign up, send us ideas and let us know what you think.
Email sensemaker@tortoisemedia.com.
Alexi Mostrous
@AlexiMostrous