
Data Is Fueling the AI Revolution. What Happens When It Runs Out?

Experts warn that the internet may not be big enough to sustain AI innovation

March 20, 2025
by Aaron Mok
Image: Digital streams (Midjourney)

AI seems to have infinite capabilities. Just ask OpenAI’s ChatGPT to write a cover letter or plan a travel itinerary, and it will spit out a comprehensive response in mere seconds. Heck, it can even pass the bar exam. 

Underlying these impressive capabilities are advanced machine learning models trained on massive troves of data. Generative AI chatbots, in particular, are powered by so-called large language models (LLMs), advanced algorithms that can process, analyze, and (seemingly) understand human language. Fueling the advance of these LLMs are hundreds of millions of publicly available texts scraped from across the web, everything from Wikipedia pages and books scanned into Project Gutenberg to code held in open-source repositories like GitHub.

Problem is, AI is gobbling up that data faster than it can be created. “The root of the large language model, which is to make them bigger and train them on more data, is coming to an end,” Berkeley computer science professor Stuart Russell warned attendees at the AI for Good Global Summit in 2023. “We’re literally running out of text in the universe to train these systems on.”

Indeed, a recent study from Epoch, a San Francisco-based AI research group, claims that, at the current rate, the total stock of human-generated internet data could be depleted as early as 2026.

Why it matters

Internet data is “hugely important” for AI’s capabilities, said Dan Klein, a professor in UC Berkeley’s Department of Electrical Engineering and Computer Sciences (EECS). Web data is pumped into AI models during the pre-training stage, when a model learns patterns across a diverse range of text before it’s fine-tuned to accomplish specific tasks. That process serves as the foundation for AI models to understand prompts and generate useful, accurate outputs. “The bulk of that raw knowledge and understanding, the representation of how language works, the concepts that exist in our use of language which reflects the concepts that exist in the real world—all of that is ultimately coming from the web training,” Klein said.
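
To make the pre-training idea concrete, here is a minimal sketch of its core objective, next-token prediction, written in PyTorch style. It illustrates the general recipe rather than any particular company’s pipeline, and the function and variable names are placeholders.

```python
# Minimal sketch of LLM pre-training: next-token prediction over web text.
# "model", "optimizer", and "token_batch" are placeholders for a real
# transformer, a real optimizer, and a batch of token IDs from a web corpus.
import torch.nn.functional as F

def pretrain_step(model, optimizer, token_batch):
    """One gradient step: predict each token from all the tokens before it."""
    inputs, targets = token_batch[:, :-1], token_batch[:, 1:]
    logits = model(inputs)                     # (batch, seq_len, vocab_size)
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # flatten predictions
        targets.reshape(-1),                   # flatten target token IDs
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Fine-tuning for a specific task reuses much the same loop on a far smaller, task-specific data set; it is the web-scale pre-training pass that gives the model its broad grasp of language.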

Tech giants develop their LLMs according to scaling laws, the idea that increasing the size of models—including their pre-training data sets—will improve their capabilities. But some experts say that AI is facing diminishing returns: additional data no longer improves model capabilities as dramatically as it once did.
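
Scaling laws are usually expressed as power laws: loss falls predictably as training data grows, but each additional order of magnitude of data buys a smaller improvement than the last. The back-of-the-envelope sketch below uses invented constants, not any published fit, purely to show the shape of the curve.

```python
# Toy data-scaling curve: loss = irreducible_loss + A / tokens**alpha.
# The constants are invented for illustration, not taken from a published study.
L0, A, ALPHA = 1.7, 500.0, 0.3

def predicted_loss(tokens: float) -> float:
    return L0 + A / tokens ** ALPHA

previous = None
for tokens in [1e9, 1e10, 1e11, 1e12]:
    loss = predicted_loss(tokens)
    gain = "" if previous is None else f" (improvement: {previous - loss:.3f})"
    print(f"{tokens:.0e} tokens -> loss {loss:.3f}{gain}")
    previous = loss
# Each 10x jump in data yields roughly half the improvement of the one before:
# the flattening curve is the "diminishing returns" experts are pointing to.
```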

The specter of an AI plateau is a potential threat to the industry’s business model. As major tech companies continue to spend billions on infrastructure to grow and power their LLMs, investors may begin to wonder when they’ll see returns.

Alane Suhr, an assistant professor in EECS, added that the pressure companies face to advance their AI models as quickly as possible could incentivize the adoption of new data extraction methods that are “ethically dubious.” She pointed to evidence, surfaced in the ongoing Kadrey v. Meta class action lawsuit over alleged copyright infringement, that Meta downloaded tens of millions of pirated books.

A 2024 New York Times investigation into AI data collection practices found that Big Tech companies are cutting corners to get their hands on data, tapping into private user data for model training and scraping content from YouTubers, publishers, and other creators without their consent, a practice that has led to a slew of lawsuits.

Not all data are created equal 

The vast digital archive known as the internet didn’t appear overnight; its large collections of documents accumulated over decades. “There’s not another hidden 30 years of the Web,” Klein said.

But hitting the internet data wall doesn’t necessarily mean there’s no more data to tap into. For instance, Ambi Robotics, a startup founded by Professor Ken Goldberg and other Berkeley researchers, uses physical data collected from the warehouses it works with to train its robots to place, stack, and sort packages on their own. The data is collected by internal systems in real time, ensuring a steady supply of up-to-date training data so the robots don’t drop packages or place them sloppily. “If the data quality isn’t good, the system won’t perform reliably,” said Goldberg.

The emphasis is on good

“I personally think we’re at the stage where the quality of data is more important than quantity,” said Sewon Min, an incoming computer science professor at UC Berkeley, who stresses that accurate, timely, and complete data is crucial for the next phase of AI development. AI models already exhibit advanced levels of intelligence, she says, just on the “medium quality” data they’ve scraped from the internet. High-quality data, then, must be obtained to push AI’s progress even further.

LLMs can inherit biases from the data they are trained on. One BAIR study, for example, found that ChatGPT’s responses tend to default to American English as opposed to, say, British or Ghanaian English. Standard American English is “likely the best-represented variety in its training data,” the researchers wrote. High-quality data helps ensure that a model’s outputs fairly reflect the diversity and complexity of the real world.

A potential solution

Synthetic data, or AI-generated data that mimics real-world conditions, could help address the data wall, according to Berkeley’s Min and Klein. Generative AI algorithms can be used to create hyper-specific data sets tailored to a particular task. For example, a car dealership aiming to target women over 40 living in the Midwest could generate synthetic customer profile data—a digital persona of the ideal buyer’s characteristics—to train its AI model to personalize outreach messages.
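
In practice, generating those synthetic profiles usually means prompting a general-purpose LLM to emit structured records that a downstream model can train on. Here is a hedged sketch using the OpenAI Python client; the model name, prompt, and field names are illustrative assumptions, and any chat-completion API would work similarly.

```python
# Sketch: asking an LLM to produce synthetic customer profiles as training data.
# The model name, prompt, and schema below are illustrative, not prescriptive.
import json
from openai import OpenAI

client = OpenAI()  # assumes an OPENAI_API_KEY is set in the environment

PROMPT = (
    "Return a JSON object with a 'profiles' array of 5 fictional customer "
    "profiles: women over 40 living in the U.S. Midwest who are shopping for "
    "a car. Fields: age, state, occupation, budget_usd, priorities."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT}],
    response_format={"type": "json_object"},  # ask the API to return valid JSON
)

# A real pipeline would validate and de-duplicate these records before using
# them to train or fine-tune the dealership's outreach model.
profiles = json.loads(response.choices[0].message.content)["profiles"]
print(profiles[0])  # one synthetic persona
```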

Synthetic data can also be a workaround when real data is proprietary or private. For instance, publicly available CT scans of liver tumors are scarce, but researchers have found that synthetic images of cancer tumors can be used effectively to enhance AI’s tumor-detection abilities, which in turn can be a helpful tool for doctors.

One promising strategy for creating synthetic data, according to a study coauthored by Berkeley graduate student Charlie Snell and researchers from Google DeepMind, is using reasoning models like OpenAI’s o1. These models lean on test-time compute, a technique in which the AI responds to a prompt by breaking it down into multiple steps. Instead of spitting out an answer instantaneously, the model spends extra time “thinking” before generating a response. These higher-quality responses could be turned into a new training data set for the AI, which, in theory, could improve its future outputs.
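
One way to picture that pipeline: sample several long “reasoning” attempts per prompt, keep the best one according to some checker, and fold the winners back into a training set. The sketch below is a generic best-of-n distillation loop, not the specific method from Snell’s study; generate_with_reasoning and score_answer are hypothetical stand-ins for whatever reasoning model and verifier a lab actually uses.

```python
# Sketch: converting extra test-time compute into new training data.
# generate_with_reasoning() and score_answer() are hypothetical placeholders
# for a real reasoning model and a real verifier (unit tests, a grader, etc.).

def best_of_n(prompt: str, n: int = 8) -> str:
    """Sample n long "thinking" attempts and keep the highest-scoring answer."""
    candidates = [generate_with_reasoning(prompt) for _ in range(n)]
    return max(candidates, key=lambda answer: score_answer(prompt, answer))

def build_synthetic_dataset(prompts: list[str]) -> list[dict]:
    """Distill the selected answers into prompt/response pairs for training."""
    return [{"prompt": p, "response": best_of_n(p)} for p in prompts]
```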

Problem solved?

Not exactly. It’s “very hard to get it right,” according to Professor Klein. Synthetic data may not fully capture the nuances of real-world data, which can skew the accuracy and reliability of the AI model. Artificially generated data can also be biased towards a particular outcome if it isn’t diverse. Cancer-detection AI trained primarily on images of lighter-skinned patients, for instance, could struggle to correctly identify melanomas on darker-skinned patients.

At its worst, shoddy synthetic data could degrade AI performance. A study found that training AI on AI-generated data can produce lower-quality outputs, a phenomenon known as “model collapse.” As more LLM-generated content spills onto the web, future models may be increasingly trained on their own outputs, creating a recursive loop that could make model collapse unavoidable.
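
A toy simulation helps show why that recursive loop is dangerous. In the sketch below, each “generation” re-estimates a word distribution from samples drawn from the previous generation’s model, so rare words that happen not to be sampled vanish and can never come back. It is a cartoon of the dynamic, not a reproduction of the cited study’s experiments.

```python
# Toy model-collapse simulation: every generation trains only on text
# sampled from the previous generation's model, so the long tail erodes.
import numpy as np

rng = np.random.default_rng(0)
vocab_size, sample_size = 1000, 2000

# "Real" data: a long-tailed, Zipf-like word frequency distribution.
probs = 1.0 / np.arange(1, vocab_size + 1)
probs /= probs.sum()

for generation in range(1, 6):
    counts = rng.multinomial(sample_size, probs)  # "train" on model samples
    probs = counts / counts.sum()                 # next generation's model
    print(f"gen {generation}: distinct words remaining = {(probs > 0).sum()}")
# Words that fall out of the sample get probability zero forever, so diversity
# only shrinks: a simplified picture of how recursive training degrades data.
```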

Creating new sources of data also comes at a cost. Training AI is already expensive and requires massive amounts of computing power, not to mention energy, water, and land. Synthetic data generation and test-time compute would inevitably add to that burden. 

What can we expect for the future?

Looking ahead, the skyrocketing costs of scaling LLMs could signal a shift towards developing smaller language models. Instead of training large models to be better at general-purpose tasks, companies can train smaller models to perform exceptionally well at specific ones, such as legal contract negotiation.

“My personal bet is we’re going to see a mixture of general models and specialist models that are much more focused,” Klein said. 

Berkeley alum Dr. A.K. Pradeep ’92 also predicts that the way forward will be using “swarms of small language models” with different types of knowledge and levels of expertise. Using multiple small models in tandem as opposed to one large model to complete tasks, Pradeep says, will allow for more diverse perspectives and creative problem-solving, mimicking how the human brain works with its “billions of connected neurons.”

If companies prioritize training smaller, more expert models, hitting the data wall may not be an issue. But for those aiming to achieve superintelligent AI, creativity will be necessary. Whether Big Tech can adapt—or if AI’s rapid ascent is about to hit a ceiling—remains to be seen.
