The current struggle for AI companies is to secure enough chips to train ever-larger AI models. But experts are warning of a new problem: there may not be enough text data on the internet to train the next generation of models. Researchers at Epoch, an AI research institute, estimate that the current pool of high-quality online language data could be exhausted within this decade. The era of free access to online training data may also be ending, with content providers suing OpenAI over its use of copyrighted material or signing exclusive data licensing deals.

AI companies are already considering alternatives. OpenAI and Anthropic, which is backed by Amazon, are both experimenting with “synthetic data”: using AI to generate data that in turn feeds new AI models. But this AI “inbreeding” can ultimately cause models to collapse and produce nonsense. It’s a problem AI companies may have to confront soon anyway: as the internet fills with AI-generated text, they may find they have inadvertently polluted their own training data.
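
The mechanism behind this collapse can be illustrated with a toy experiment (a minimal sketch, not any lab’s actual pipeline: the Gaussian “model”, sample size, and generation count here are all illustrative assumptions). Each generation fits a simple statistical model to the previous generation’s output, then the next generation trains only on samples from that model. Over many rounds, rare values stop being reproduced and the distribution’s spread shrinks toward zero:

```python
import numpy as np

rng = np.random.default_rng(0)

# Generation 0: "real" data drawn from a standard normal distribution.
data = rng.normal(loc=0.0, scale=1.0, size=50)

for gen in range(1, 501):
    # "Train" a trivial model on the current data: estimate its mean and std.
    mu, sigma = data.mean(), data.std()
    # The next generation sees only samples from the previous model,
    # standing in for a model trained on AI-generated text.
    data = rng.normal(loc=mu, scale=sigma, size=50)
    if gen % 100 == 0:
        print(f"generation {gen:3d}: mean={mu:+.3f}, std={sigma:.3f}")
```

On a typical run the printed standard deviation falls steadily with each generation, the toy analogue of a model gradually forgetting the rare patterns that existed in its original, human-written training data.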