The AI data gold rush is here and Corporate America is ready

The era of free AI training data is over. Reddit $RDDT charges millions for API access. The New York Times has sued OpenAI over the use of its articles. Publishers block scrapers. And even if AI companies could still suck up the public Internet, they face a bigger problem: they need entirely different kinds of data for the next leap in capabilities.
Large language models have been built by scraping text and images from the web. But as AI systems move beyond chatbots, they need training data that was never publicly available in the first place. Data that is locked, scattered, or doesn’t even exist yet.
New markets are emerging to unlock these sources. Here are three.
Your digital exhaust, monetized
Most people think of personal data as Social Security numbers and medical records. But almost everything you do online generates data that platforms collect and use: your Spotify $SPOT listening history, your email habits, the documents you write in Google $GOOGL Docs, your conversations with ChatGPT.
When you download your Instagram data, for example, the company doesn’t just provide you with your photos. You get everything Instagram has inferred about you based on your browsing behavior: hundreds of data points ranging from innocuous labels like “interested in nature” to psychological assessments like whether you have depression.
None of this is publicly available. All of it is legally yours.
“If you park your car in a parking lot, the parking lot doesn’t own your car,” says Anna Kazlauskas, CEO of Vana, a company that builds infrastructure for individuals to contribute data from the platforms they use to AI training. The same principle applies to data: you own it, even if it lives on someone else’s server.
The scale is enormous. The dataset behind Meta $META’s Llama 3, drawn largely from scrapes of the public Internet such as Common Crawl, contains roughly 15 trillion tokens. If 100 million people each contributed data exports from just five platforms, that would produce 450 trillion tokens, 30 times more than any existing dataset.
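For the curious, here is a rough back-of-envelope version of that math in Python. The figure of roughly 900,000 tokens per export is an assumption implied by those totals, not a number quoted in the reporting.

```python
# Back-of-envelope check of the token math above.
# The tokens-per-export figure is an inferred assumption, not a quoted number.
llama3_tokens = 15e12            # ~15 trillion tokens behind Llama 3
people = 100e6                   # 100 million contributors
platforms_per_person = 5         # data exports from five platforms each
tokens_per_export = 900_000      # implied average size of one export

total = people * platforms_per_person * tokens_per_export
print(f"{total:,.0f} tokens")                             # 450,000,000,000,000
print(f"{total / llama3_tokens:.0f}x Llama 3's corpus")   # 30x
```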
This type of data could unlock personalized AI that understands your music tastes, or health models trained on real sleep and fitness data, none of which could be built from scraped web content. Kazlauskas says paying people for data only they can provide could also reshape the broader AI debate.
“A lot of the fear around AI comes from the lack of proper attribution and economics,” says Kazlauskas. “If you’re teaching AI how to do your job, you should actually own that AI model.”
Mapping the physical world
Text models could be trained on scraped web data. But the next generation of AI needs accurate, consistent information about the physical world. Robots navigating cities, autonomous vehicles, and augmented reality all need high-fidelity digital maps to make decisions.
The problem is that existing aerial data is fragmented. It comes from various contractors using different sensors at different accuracies, making it almost impossible to train reliable geospatial models. Satellite imagery covers most of the planet but lacks the resolution these models need. The data layer that AI companies require simply doesn’t exist yet.
Spexi is trying to build it using gig workers and drones. The company has more than 10,000 pilots flying standardized missions at an altitude of 80 meters. Over the past 18 months, they have covered more than 6 million acres across 300 North American cities at higher resolution than satellites or traditional aerial imagery, says Bill Lakeland, Spexi’s CEO.
Spexi works with companies like Niantic to train large geospatial models for augmented reality and robotics. Unlike a one-time web scrape, these maps require constant updating as buildings rise and roads change. It is a version of the same problem that plagues ChatGPT and other LLMs: how to keep models current without retraining from scratch. Lakeland’s team is working on algorithms to predict when and where updates are needed, but this remains an unsolved research challenge.
Big Data’s second chance
One of the world’s largest PC makers has been collecting telemetry data for seven years. Nobody ever looked at it. When Sachin Dharmapurikar’s team at The Modern Data Company finally analyzed it, they discovered that two of the 70 fields had been collected incorrectly the entire time.
Dharmapurikar’s company helps businesses turn their existing data into structured, contextualized datasets designed for specific business questions rather than general storage. Ten years ago, companies started tracking and storing everything in the cloud, assuming that data collection would eventually yield insights. Instead, it has created costly, siloed, and unmanaged data landscapes.
When ChatGPT exploded in popularity, many executives thought they had finally found a simple solution: feed all that stored data into an LLM and watch the magic happen. Dharmapurikar calls this the “ChatGPT curse.”
The reality is more complex. Businesses need four things: data quality at scale, the ability to trace lineage and explain how conclusions were reached, governance to prevent AI hallucinations, and semantic metadata that contextualizes data in business terms. The lifetime value of a retail customer is different from that of an enterprise customer, for example. Without this context, models draw the wrong conclusions.
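To make that last point concrete, here is a hypothetical sketch of what semantic metadata for a single field might look like. The structure and field names are illustrative, not taken from The Modern Data Company’s product.

```python
# Hypothetical semantic metadata for one column, so a model knows which
# definition of "lifetime value" applies. Field names are illustrative only.
lifetime_value_metadata = {
    "column": "lifetime_value",
    "unit": "USD",
    "definitions": {
        "retail": "sum of order revenue over the trailing 24 months",
        "enterprise": "total contract value across active subscriptions",
    },
    "lineage": ["orders_raw", "orders_cleaned", "customer_rollup"],
    "owner": "revenue-analytics",
}

# A model (or a human) can look up which definition applies before reasoning.
print(lifetime_value_metadata["definitions"]["enterprise"])
```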
Even when data exists, it is often trapped. Sales, manufacturing, and web teams collect data in silos, and transferring it between departments requires bureaucracy and paperwork. AI needs information from across the organization, but the reality is fragmented systems that don’t communicate with each other.
Dharmapurikar says the industry is finally becoming realistic. “People are now more calculated, more rational and pragmatic about these kinds of things,” he says. “The reality is stark: there is no easy solution.”




