This is part of a series of special reports I’ll do over the next month or two on the trends driving the AI landscape through the second half of 2025 and into 2026. Why am I starting with data?
Because data is everything! Data constrains our ability to train larger and larger models from scratch. We can build more data centers. We can add more power. But can we add more data?
This is one of the most interesting questions on the planet in 2025, and we are learning that the answer is yes. This report dives into two competing trends that are shaping 2025. They’re both durable enough that we can be confident at this point that they will profoundly shape 2026 and 2027.
Put simply, natural data supply is tightening while synthetic data and synthetic training are exploding. This has profound implications for the way model intelligence is going to grow in the future.
Natural Data Tightening: Everywhere you look, companies are moving to restrict or lock down the data available to the makers of ChatGPT and other major models. AI model makers themselves are going tit-for-tat to keep data away from each other (hello Windsurf). Net net, this means available natural data supply is shrinking.
Synthetic Data Exploding: At the same time, model makers are going all in on synthetically generated tokens and synthetic training methods so they can keep scaling intelligence without relying on natural data sources.
Synthetic data refers to tokens generated by AI. Synthetic training goes a step further: it gives these synthetic tokens synthetic (AI-derived) feedback.
There is a widespread misconception that synthetic data = bad data. As you’ll see below, this isn’t true. In fact, it’s increasingly clear that using synthetic data and synthetic training methods improves model quality. Almost all of the frontier models we use today were trained to some degree on synthetic data or used synthetic feedback somewhere in the training process.
So synthetic data is here already, and the data says it’s going to get more prevalent very rapidly. What happens in a world where natural data is disappearing just as synthetic data is exploding? Do models stay aligned? Are there quality implications we aren’t paying attention to? If we assume that we can manage synthetic data safely at scale, where does the bottleneck shift to? That’s what this report explores…