DeepSeek Supposedly Threatens NVIDIA's Moat: Here's Why Jensen Is Going to Keep Laughing All the Way to the Bank
NVIDIA opened down something like 11-12% today. Microsoft took a bath in early trading, to the point where Satya was tweeting about Jevons last night, lol. The sellers don't understand how AI works.
I was going to do a longer post on how DeepSeek actually works, and I probably still will. But hey, you don't get to pick your timing with the market, and we clearly need to have a conversation about some of the more absurd assumptions driving the rumor mill right now.
Over the weekend of January 25-26, DeepSeek hit #1 in the iOS App Store. Apple can't ship an AI product to save its life, but DeepSeek topping its charts still scared the markets.
On Monday, January 27, DeepSeek triggered a sell-off in AI-related stocks. NVIDIA fell roughly 12%, Microsoft slid 4%, and Meta saw ripples of concern. At a glance, many read DeepSeek's R1 model as a sign that the future of AI might require fewer computational resources than once thought. Look more carefully, though, and R1's emergence doesn't negate the indispensable role of massive infrastructure and prior investment, nor does it remove the need for continued growth in chip capacity. Instead, it points to a more complex reality in which optimization gains push the field forward but are unlikely to truly reduce our reliance on high-end compute. After all, when was the last time someone predicting the world would need fewer computers turned out to be right?
The Reality Behind DeepSeek’s Efficiency
DeepSeek’s R1 has made headlines for matching OpenAI’s o1 in reasoning tasks while seemingly using less hardware. Yet the story is more nuanced. R1 was produced via a sophisticated multi-stage training process that combined supervised fine-tuning (SFT) with Group Relative Policy Optimization (GRPO), a cutting-edge reinforcement learning technique.
These are real advances and genuinely clever engineering, and DeepSeek did everyone a huge favor by open-sourcing their approach. I guarantee you Zuck is in a room somewhere with his AI team reverse-engineering this thing right now.
So what’s the secret sauce?
I'll write a longer post on this, but three things stand out (a minimal code sketch of all three follows the list):
No Value Function: GRPO does away with the explicit value (critic) model that PPO-style training requires, cutting memory and compute overhead.
Group-Based Advantage: Each output's advantage is measured against the average reward of its sampled group, which dovetails neatly with how reward models are trained and again saves compute.
KL Divergence in the Loss Function: Instead of folding the KL penalty into the reward signal, GRPO adds it directly to the loss against a frozen reference policy, keeping training stable on complex tasks. Again, read that as an efficiency play.
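Here's what that looks like in code. This is a minimal PyTorch sketch of the GRPO idea, my own illustrative reconstruction rather than DeepSeek's implementation: it assumes scalar rewards per sampled output, per-output summed log-probabilities, and placeholder hyperparameter values.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar rewards for each sampled output.
    # Advantages are normalized within each group, so no learned critic is needed.
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

def grpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, kl_coef=0.04):
    # PPO-style clipped surrogate plus a KL penalty toward a frozen reference
    # policy, added directly to the loss rather than mixed into the reward.
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    policy_term = torch.min(ratio * advantages, clipped * advantages)
    # Low-variance estimator of KL(new || ref).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(policy_term - kl_coef * kl).mean()
```

The thing to notice is what's absent: there is no value network to train or keep in memory, and that is a large part of the efficiency story.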
It all adds up: using rule-based rewards for correctness and format, DeepSeek drastically improved reasoning benchmarks. On AIME 2024, pass@1 leapt from 15.6% to an impressive 71.0%, in line with OpenAI's o1-0912. Yes, these are real gains.
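For readers unfamiliar with the metric: pass@1 is the probability that a single sampled answer is correct. The standard unbiased estimator comes from OpenAI's HumanEval methodology; the numbers in the example below are made up for illustration.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    # Unbiased estimate of the chance that at least one of k samples is correct,
    # given n total samples per problem of which c were correct.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(16, 3, 1))  # 0.1875: with 3 of 16 samples correct, pass@1 is c/n
```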
What’s the rest of the R1 story?
R1 was built on existing foundations. DeepSeek benefited from high-quality chain-of-thought data and methodologies already developed by leading AI labs (e.g., OpenAI). R1's achievements were not born in isolation; they rest on the hundreds of billions of dollars poured into large-scale AI research over the past couple of years.
Training R1 Was Still Resource-Intensive. Despite the efficiency claims, the multi-stage process called for substantial high-end infrastructure—just orchestrated in a more optimized way. Efficiency does not equate to trivial hardware requirements.
All that efficiency still didn't reach true SOTA. The state of the art remains o1 Pro, and we're likely to get o3 and o3 Pro in the next few weeks. I've played with both o1 Pro and DeepSeek's DeepThink, and the difference is stark when you ask the right kinds of questions.
R1 is deliberately misaligned. It is the first world-class model (as far as I know) that deliberately and very publicly censors its outputs. It won't say "Xi Jinping," it won't acknowledge what happened at Tiananmen Square, and it won't discuss the Cultural Revolution. We spill a lot of ink on getting alignment right as models get smarter, and I think deliberately misaligning a model deserves more attention.
From DeepSeek to Statecraft: Why China’s $138B AI Plan?
Some observers have latched on to DeepSeek's efficiency and declared that the era of scaling large language models may be ending. Yet China's announcement of a $138 billion AI investment package over the weekend strongly suggests otherwise. This underscores a simple point: no matter how refined your training techniques, large-scale compute remains indispensable for cutting-edge AI. We won't need fewer computers.
In fact, even if R1's training were 30× more efficient than previous methods, ongoing leadership in AI still demands pushing the frontier, where the cost and complexity of each incremental improvement inevitably rise. Once you've achieved "parity" in certain tasks, the next steps to surpass that level (and to serve these state-of-the-art models at scale) still require enormous infrastructure.
What’s Next? Key Questions:
DeepSeek’s and the industry’s next moves depend on the answers to these questions:
Are current efficiency gains enough to fundamentally shift demand away from advanced chips?
Or are they merely a sign that we can do more with the same level of investment—thus stimulating even more adoption, which in turn fuels demand for greater computational capacity?
The Incremental Value of Intelligence
Investors spooked by potential declines in chip demand may be missing the broader economic principle at play: Jevons Paradox. Improving efficiency often triggers higher overall consumption. If intelligence becomes cheaper to train, more organizations—across every sector—will incorporate AI into their daily functions, unleashing new layers of demand for both advanced hardware and specialized services.
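Here's a toy constant-elasticity model of that argument. Every number is hypothetical; the point is just the shape of the math: when demand for intelligence is elastic enough, making it cheaper increases total spend on compute.

```python
def total_compute_spend(price: float, elasticity: float) -> float:
    # Constant-elasticity demand: usage scales as price**(-elasticity),
    # so total spend (price * usage) scales as price**(1 - elasticity).
    usage = price ** (-elasticity)
    return price * usage

# Hypothetical: intelligence gets 10x cheaper, and demand elasticity is 1.5.
print(total_compute_spend(1.0, 1.5))   # 1.00  (baseline spend)
print(total_compute_spend(0.1, 1.5))   # ~3.16 (spend goes *up* ~3x, not down 10x)
```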
This goes beyond looking at our current use of intelligence. For example, ChatGPT may already be at parity with, or better than, doctors on diagnosis. But we should ask ourselves: could we use more intelligence to improve medical care further? The answer is obviously yes, and there are numerous other fields in need of similar investments in intelligence. Which requires chips.
Beyond the Moat: Continuous Innovation and Infrastructure
Some commentators worry that DeepSeek’s R1 undermines the “moat” of proprietary AI models. Yet the real bulwark in AI is continuous innovation and robust infrastructure, not a single uncopyable system. DeepSeek’s breakthroughs build on a global foundation of research, shared knowledge, and hardware built by firms like NVIDIA and Microsoft.
In short, it’s all about chips:
Model Training: Even with better algorithms, training frontier models still devours immense computational resources.
Serving AI at Scale: As AI products become pervasive, the inference side (running these models for millions of users) also demands colossal compute; a rough back-of-envelope sketch follows this list.
Staying on the Cutting Edge: Pushing past current performance thresholds grows exponentially more resource-intensive, ensuring an ongoing hunger for advanced semiconductors.
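To give a feel for the serving-at-scale point above, here's that back-of-envelope. Every input is a hypothetical placeholder (token volume, active parameter count, per-GPU throughput, utilization), not a claim about any real deployment; the only rule of thumb baked in is the standard approximation of roughly 2 FLOPs per active parameter per generated token.

```python
def gpus_needed(tokens_per_day: float, active_params: float,
                gpu_flops_per_sec: float, utilization: float) -> float:
    # Rough rule of thumb: ~2 * active_params FLOPs per generated token.
    flops_per_day = 2.0 * active_params * tokens_per_day
    usable_flops_per_gpu_day = gpu_flops_per_sec * utilization * 86_400
    return flops_per_day / usable_flops_per_gpu_day

# Hypothetical workload: 1 trillion tokens/day, 40B active params,
# 1e15 FLOP/s per accelerator, 30% realized utilization.
print(round(gpus_needed(1e12, 40e9, 1e15, 0.30)))  # ~3086 GPUs for this one workload
```

Scale the token volume up as adoption grows and the GPU count grows right along with it.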
The Scaling Laws are Laws for a Reason
DeepSeek’s efficiency gains are a milestone, offering a glimpse of how smarter techniques can yield powerful AI without entirely breaking the bank. Yet these advances rest upon prior breakthroughs and massive investments in both algorithms and hardware. What’s more, they do not alter the fundamental trajectory: as intelligence becomes more affordable, demand surges, necessitating ever more scalable infrastructure.
China's $138 billion AI plan is a pointed reminder that scaling laws remain central: leading-edge models still consume massive compute, and more intelligence invariably breeds more applications. While DeepSeek's R1 is undeniably impressive, the future of AI hinges on an ongoing interplay between cutting-edge optimization and robust, ever-expanding compute environments. Efficiency alone does not reduce our reliance on chips; it simply fuels the next wave of adoption and innovation, which inevitably demands even more chips.
On Twitter, I wrote yesterday (riffing off one of your TT videos): "I personally would say 'intelligence' is always in demand, and a cheaper price for that intelligence will spur greater demand, thus requiring more chips. It's just that the supply curve for those chips will be less steep than it was before." After reading this essay, I would amend that ending to say the supply curve will only change temporarily, if that much. To be honest, it will probably be more cost-efficient for Nvidia and the other chip makers to just plow through and maintain production, as if nothing has really changed.
1. Even if they aren't hiding a bag of chips, I agree that these efficiency gains don't mean diminished chip demand. If anything, it seems like the opposite: we can now do more with less, meaning more competition and an accelerated pace of innovation.
2. I do question how this impacts the economics of leading AI labs investing to push frontier models forward. If scorched earth by open source is the strategy, what's the incentive to keep spending big money to advance new models when they're quickly replicated and given away for free afterwards?
3. Would love your expanded thoughts on how the price of energy plays into this. China seemingly has a huge advantage unless there's a lot of US deregulation around nuclear or rapid improvement in renewables. Beyond chip demand, the energy cost to train and serve these models seems like an important input that becomes more pronounced over a long timeline.