
The World (wide web) is not enough: AI systems running out of training data

The current struggle for AI companies is to secure enough chips to train ever-larger AI models. But experts are warning of a new problem: there may not be enough text data on the internet to train the next generation of models. Researchers from Epoch, an AI research institute, estimate that the current pool of quality online language data could be exhausted within this decade. The era of free access to online training data may also be ending, with content providers suing OpenAI over its use of copyrighted material or signing exclusive data licensing deals. AI companies are already considering alternatives. OpenAI and Anthropic, which is backed by Amazon, are both experimenting with “synthetic data”: using AI to generate data that in turn feeds new AI models. But this AI “inbreeding” can ultimately cause models to collapse and produce nonsense. It’s a problem AI companies may have to confront soon anyway – as the internet fills with AI-generated text, they may find they’ve inadvertently polluted their own training data.


Copyright © 2025 Tortoise Media

All Rights Reserved