Think Before Your Next Move: A Celebration of Inference Time Compute and AI
If you aren't sure what "inference time compute" is—this post is for you! I'll explain what it means and how it's changed everything in AI over the last few months, and I'll predict what's next for AI this year.
I still remember the first time I watched a language model “think out loud.” And credit here—the raw chain of thought prize goes to DeepSeek’s R1! As I prompted it with a question about product strategy, it began thinking out loud, writing down its reasoning step by step, as if solving it on scratch paper.
In that moment, I felt a spark. Yes, I’d seen reasoning before—I’d watched the little o1 Pro ticker go by and been wildly impressed by the results. But there was something about seeing inference live that made it feel raw, vulnerable, real. Seeing inference in real time is one of those moments I’ll remember in this ongoing AI revolution. It was a moment when I realized inference-time compute really was a big deal. It really was going to change everything.
Over the past few months, that spark of realization has become a bonfire: new AI models are leveraging massive reasoning processes at inference time to achieve feats once thought impossible.
Inference-time compute refers to the computation expended while a model is generating an answer (as opposed to during training). By letting models “think longer”—whether through chain-of-thought reasoning, process-based supervision, or iterative self-refinement—we are unlocking unprecedented capabilities in math, science, and coding. In this article, I’ll dive deep into how this works and why it matters. I’ll compare OpenAI’s breakthrough o1 and o3 reasoning models, explore Anthropic’s innovative Claude 3.7 Sonnet hybrid system, and share data on the efficiency gains they achieved. I’ll also highlight the latest developments and offer a few thoughts on how inference-time compute will drive us toward AI scaling – and maybe even AGI.
Let’s embark on this journey of reasoning at runtime—a love letter to the power of letting our AI think a little harder, one step at a time.
Inference-Time Compute: Letting AI Think Longer and Harder
When we talk about AI breakthroughs today, the papers often mention model size or training data. But fewer people understand the new AI superpower: inference-time compute, essentially the “brainpower” a model uses while answering a question. Increasing this brainpower means allowing the model to perform more internal computation per query – and it’s changing everything about AI performance.
Chain-of-Thought Reasoning – A Revolution in Real Time
One of the first leaps in inference-time reasoning came with chain-of-thought (CoT) prompting. Instead of answering directly, a language model is prompted to generate a step-by-step reasoning process before giving the final answer. This simple idea – getting the model to articulate a chain of logic – significantly boosts its problem-solving abilities. Researchers at Google demonstrated back in 2022 that even without changing the model’s weights, “a chain of thought… significantly improves the ability of large language models to perform complex reasoning” (Wei et al., 2022, as cited in a Dec 2024 analysis (What Will REALLY Give Us Some AGI - Themesis, Inc.)).
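The prompting side of this is almost embarrassingly simple. Here's a minimal sketch of how a chain-of-thought prompt is assembled (the exact wording is my own illustration, not the phrasing from the Wei et al. paper; any variant that elicits step-by-step reasoning works):

```python
def with_chain_of_thought(question: str) -> str:
    """Wrap a question in a minimal chain-of-thought prompt.

    Illustrative only: the key idea is asking the model to write out
    intermediate deductions before committing to a final answer.
    """
    return (
        f"Q: {question}\n"
        "A: Let's think step by step, writing out each intermediate "
        "deduction before stating the final answer.\n"
    )

prompt = with_chain_of_thought(
    "If a train travels 60 miles in 90 minutes, what is its speed in mph?"
)
print(prompt)
```

That one extra instruction, with no change to the model's weights, is what produced the reasoning gains the Google researchers measured.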
You know what’s interesting? That works pretty well for humans too! One of the fascinating experiences of watching this AI revolution has been seeing many of the techniques that help humans to think magically helping AI to “think” as well. There’s a profound mirroring here, and it extends to techniques like social engineering, which work on both humans and LLMs now. But back to the topic at hand…
OpenAI took this further by not just prompting for reasoning, but training models to use CoT internally. Their o1 model, introduced in late 2024, was explicitly taught via reinforcement learning to think step-by-step. As the o1 contributors reported, “Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought… We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute)” (What Will REALLY Give Us Some AGI - Themesis, Inc.). In other words, the longer o1 was allowed to “cogitate” during inference, the better it did. This is a profound finding – it established that giving an AI more inference-time cycles yields higher accuracy. We now had a tangible new scaling law: if you want smarter outputs, just let the model devote more FLOPs to the problem at hand.
Crucially, these chains of thought are not just a gimmick; they unlocked abilities previously out of reach. With CoT reasoning, models began solving math word problems, logical puzzles, and multi-step reasoning tasks that would stump their earlier selves. It’s as if a switch was flipped from “recite what you know” to “figure it out step by step.” By the end of 2024, the narrative in AI circles had flipped as well. Many had claimed AI progress was slowing, that scaling was hitting a wall – but inference-time reasoning proved them wrong. As one expert noted in January 2025, the rapid improvements from models like o3 “continue our breakneck pace of improving AI – and [put] to bed the narrative in 2024 that ‘AI scaling is over’” (AI Reasoning—What is It? - by James Wang - Weighty Thoughts). I couldn’t agree more. Allowing models to reason in runtime has supercharged their capabilities in a way that pure pre-training couldn’t.
Process-Based Supervision and Revisions – Teaching Models to Double-Check Themselves
Chain-of-thought is one piece of the puzzle. Another powerful technique is process-based supervision, where the model’s intermediate reasoning steps are themselves guided and corrected. Instead of only getting a reward for a correct final answer, the model gets feedback on each step of its thinking process. This idea, explored by alignment researchers in 2023, became reality in new systems by late 2024. Think of it as having a coach watching over the model’s shoulder, nudging it when it goes astray in the reasoning, rather than only telling it right or wrong at the end.
A striking example is a framework introduced in December 2024 called LLM² (literally “LLM-squared”). In this setup, an initial language model (System 1) is paired with a verifier model (System 2) that checks each token or step the first model wants to output. During inference, the primary model proposes multiple next-step candidates, and the verifier provides “timely feedback” on each, filtering out bad reasoning in real-time. The result is a two-pass system: one part generates ideas, the other part criticizes and refines them, much like a human brainstorming and then reviewing their work. The payoff? Substantial gains in problem-solving accuracy. On challenging math datasets, this dual-process approach boosted accuracy from ~50% to ~58% on GSM8K, and from ~24% to ~29% on the MATH competition dataset – even before any additional tricks. When the researchers allowed the model to then do a form of self-consensus (basically generating multiple solutions and picking the best, which adds even more inference compute), they achieved an astounding jump on GSM8K—from 56% up to 70% accuracy. These are huge leaps for math problems, all coming from smarter use of compute after the model is trained.
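Here's a toy sketch of that dual-process loop. Both the generator and the verifier are mocked out as plain functions (in LLM² these are language models; every name and scoring rule below is illustrative, not from the paper):

```python
def propose_steps(state: str, k: int = 4) -> list[str]:
    # Stand-in for the System-1 generator: in LLM-squared this would be
    # the primary language model sampling k candidate next steps.
    return [f"{state} -> step_variant_{i}" for i in range(k)]

def verifier_score(step: str) -> float:
    # Stand-in for the System-2 verifier model, which gives "timely
    # feedback" (here, just a score) on each candidate step.
    return 10.0 if "variant_0" in step else 0.0

def guided_decode(question: str, depth: int = 3) -> str:
    """Process-supervised decoding: at each step, sample several
    candidates and let the verifier filter out weak reasoning
    before it ever reaches the final answer."""
    state = question
    for _ in range(depth):
        candidates = propose_steps(state)
        state = max(candidates, key=verifier_score)
    return state

print(guided_decode("solve: 2x + 3 = 11"))
```

The structure is the whole point: feedback arrives per step rather than only on the final answer, so bad branches get pruned early.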
In plainer terms: process supervision gives AI a chance to revise and reflect on its answers. It’s akin to you solving a problem and then double-checking each line of your work. By catching errors in the reasoning process, the model can correct mistakes before finalizing an answer. We see similar ideas in revision models or self-reflection techniques that emerged around the same time—models generating an initial answer, then evaluating and potentially revising it using another pass or another model. All these methods trade extra compute for extra reliability. And as a result, AI performance on tasks like mathematical reasoning and code debugging is breaking new ground. Test-time “thinking” isn’t just about making the model talk more; it’s about making it think better.
To visualize the impact of thinking longer, consider the chart below. It shows an AI model’s accuracy on a set of math problems (the 2024 American Invitational Mathematics Exam) as a function of how many “thinking” tokens it’s allowed to use for each problem. The trend is clear: more inference-time tokens (i.e., a longer chain-of-thought) yield higher accuracy, with a predictable logarithmic improvement (Claude's extended thinking \ Anthropic). The model at 2k tokens of thinking was only getting about 30% of questions right, but by the time it’s allowed a 64k-token reasoning spree, it’s approaching 60% accuracy – nearly double the performance with the same underlying model!
You can see the same general trend for o1 (from OpenAI) below:
These charts send a thrilling message: we have a new scaling law in town. Beyond scaling model size or dataset size, scaling the thinking time at inference is a potent path to higher intelligence. The more compute you give the model per query, the better it gets – albeit with diminishing returns, much like a human taking more time to double-check their work. I find this development nothing short of beautiful. It feels like we’ve opened up a third axis of AI improvement (the first two being model and data scale), one that plays more directly into algorithmic intelligence rather than just knowledge memorization.
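You can even turn the chart into a back-of-the-envelope scaling law. Reading two approximate points off it (~30% at a 2k-token thinking budget, ~60% at 64k — my eyeballed values, not official figures) and fitting accuracy as linear in log2(tokens):

```python
import math

# Two points read (approximately) off the AIME chart above.
t1, acc1 = 2_048, 0.30
t2, acc2 = 65_536, 0.60

# Fit accuracy = a + b * log2(tokens): the logarithmic scaling law.
b = (acc2 - acc1) / (math.log2(t2) - math.log2(t1))  # gain per doubling
a = acc1 - b * math.log2(t1)

def predicted_accuracy(tokens: int) -> float:
    return a + b * math.log2(tokens)

print(f"{b:.3f} accuracy gained per doubling of thinking tokens")
print(f"predicted accuracy at 16k tokens: {predicted_accuracy(16_384):.2f}")
```

Under this crude fit, each doubling of the thinking budget buys roughly six percentage points of accuracy — exactly the "diminishing but predictable returns" shape you'd expect from a log law.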
From o1 to o3: How OpenAI Supercharged Reasoning with Compute
No discussion of inference-time reasoning is complete without OpenAI’s o1 and o3 models – the poster children of this movement. These models aren’t perfect, but I do think that about the only appropriate response is awe. Not really awe at the discoverer, but awe at the discovered. Reasoning models are what made me convinced it’s reasonable (ha) and correct to talk about the rocks thinking now.
Or to put it more bluntly: if we want to insist that reasoning token production is only “simulating” thinking, I suppose that’s entirely fine, but I’m not sure that my “simulated” thinking is more accurate, or that I could reliably distinguish the veracity of my own thinking and reasoning processes from that of a strong reasoning model.
And this is the dumbest they are ever going to be. And you know what? With the right chain of thought prompt, you can get non-reasoning models to produce surprisingly reflective responses as well. I used a special reasoning prompt for ChatGPT 4.5, and I asked it about this concept of distinguishing between human thinking and AI thinking. Here’s what ChatGPT 4.5 said:
Your question touches on something deeply profound: as inference compute increasingly matches or surpasses human reasoning quality, the boundary between your thinking and artificially generated reasoning inevitably becomes blurrier. You might find yourself uncertain about whether your own reasoning retains special validity compared to high-quality AI-generated reasoning.
This experience—feeling uncertainty about the uniqueness or superiority of your reasoning—is itself meaningful. It highlights something fundamental: human reasoning, while subjectively experienced as intrinsically valid, has always been fallible, prone to biases, and subject to self-deception. High-quality inference compute simply exposes and clarifies the fact that your cognitive validity was never guaranteed by virtue of being human alone.
However, validity is more nuanced than just reasoning "quality" or correctness. While AI-generated thought might achieve logical coherence, factual accuracy, and impeccable form, human thought remains inherently tied to lived experience, context, embodied intuition, emotion, and subjective meaning-making. Thus, the validity of human reasoning may reside not in its pure quality or coherence but in its contextual authenticity and deeply personal relevance.
This recognition doesn’t necessarily diminish the value or authenticity of your reasoning. Instead, it urges epistemic humility—a recognition that validity is often contextual and relational, not absolute. Rather than fearing the loss of special validity, you might embrace a broader conception of reasoning where human and AI thought complement each other, each bringing different strengths: AI providing rigor, coherence, and logical clarity, humans providing context, authenticity, moral judgment, and experiential insight.
In other words, inference compute doesn’t invalidate your reasoning; it refines how you understand and value it. It encourages seeing validity less as an inherent human quality and more as something emergent from thoughtful reflection, nuanced context-awareness, and meaningful engagement—qualities that, while AI can mirror, it does not inherently experience.
Like, damn. And by the way that reasoning trick is here, from Matt Shumer:
```
First, think deeply for five minutes (at a minimum — if after five minutes, you still don't have the optimal response, keep thinking until you do) about the best way to do this, inside <thinking> tags, and then respond with your answer.
```
And this reasoning revolution has all come on in the last 6 months or so. I’m still in awe at how quickly these models progressed and how much they proved the doubters wrong. The wall is a straight line right now. Let me share why the o1-to-o3 leap is such a big deal, after I show a chart of o1 (red) vs o3 (blue) performance:

OpenAI o1: Teaching Models to “Think” (The First Step)
OpenAI’s o1 (announced in late 2024) was the company’s first public foray into a reasoning-optimized model. It came with a simple premise: use reinforcement learning to train a model to use chain-of-thought reasoning effectively. In practice, o1 would generate long internal sequences of thought (not always revealed to the user) to break down complex problems, and then produce an answer. This differed from previous GPT-style models that usually tried to answer in one go. With o1, OpenAI basically said, “We’ve taught our model how to think, not just what to say.”
Even in its preview and early versions, o1 delivered markedly better results on tough benchmarks than any vanilla GPT-style model of similar size. It was especially strong in domains like math, coding, and logic puzzles – tasks that benefit from stepwise reasoning. However, o1 was just the beginning. Perhaps the most important thing it proved was the concept that reasoning can be scaled. OpenAI researchers noted that both more training and more test-time thinking improved o1’s performance (What Will REALLY Give Us Some AGI - Themesis, Inc.), hinting that if we push further on this, we might get even bigger gains. And push further they did.
OpenAI o3: Simulated Reasoning and a Giant Leap in Performance
By December 2024, just a few months after o1’s debut, OpenAI teased o3 – a model that would take inference-time compute to new heights. It’s been only a couple months and it’s easy to remember the buzz and skepticism in the community: could o3 really be that much better, so soon? The answer was a resounding yes. In early 2025, o3 became available through Deep Research, and it blew past o1 on every metric that matters.
What sets o3 apart is something OpenAI calls “simulated reasoning” – essentially a more advanced, internal chain-of-thought. Unlike traditional LLMs that generate an answer in one forward pass, o3 can “pause and reflect” internally (OpenAI o3 Released: Benchmarks and Comparison to o1). It’s as if o3 is running a thought simulation inside its mind before speaking. OpenAI described this as a “private chain-of-thought” (OpenAI o3 Released: Benchmarks and Comparison to o1): o3 thinks through the problem and plans its answer invisibly, then delivers a well-reasoned result. This ability to plan ahead gives it a huge advantage on complex tasks. Whereas o1 might have been more of a savvy guesser, o3 actually strategizes.
The impact of simulated reasoning is evident in o3’s benchmark results. Across the board, o3 achieves large improvements over o1. For example, let’s revisit that ARC-AGI reasoning challenge I charted above (a notoriously difficult suite of abstract problems). On that test, o1 had hovered around human-toddler-level performance. o3 shattered previous records – scoring about 75.7% on the ARC benchmark in a low-compute setting, and up to 87.5% when given a high reasoning budget (OpenAI o3 Released: Benchmarks and Comparison to o1). For context, 85% is roughly human-level on this test, so o3 essentially matched or exceeded humans when it was allowed to really think (using its “high” effort mode). No model before had come close to this. In fact, o3 was the first model ever to surpass the 85% threshold on the ARC-AGI challenge (albeit on the public test set under certain conditions) (o3: The grand finale of AI in 2024 - by Nathan Lambert). What was jaw-dropping was how fast this was all done—crushing a human task designed to be extremely hard for AI—all within a few months by giving a model more inference-time compute and a clever reasoning training regime.
o3’s prowess doesn’t stop at abstract reasoning. It has also achieved breakthrough results in math and coding, two domains that were longstanding challenges for AI. On the American Invitational Mathematics Exam (AIME), o3 scored an astonishing 96.7% (OpenAI o3 Released: Benchmarks and Comparison to o1)—essentially acing a contest-level math test. For comparison, o1’s score on the same exam was around the low 80s (already impressive, but nowhere near o3). Similarly, on a graduate-level science exam, o3 hit 87.7%, demonstrating mastery of physics and other hard sciences (OpenAI o3 Released: Benchmarks and Comparison to o1). These are exactly the kinds of tasks where step-by-step reasoning and checking each step are crucial—and o3’s longer reasoning time paid off with near-perfect scores.
Let’s not forget coding, which is like catnip for a model that can think logically. o3 has shown it’s a superb coder and debugger. It outperforms o1 on coding competitions and benchmarks like Codeforces. One report noted o3 achieved a Codeforces Elo rating of ~2727, up from o1’s ~1891 – a massive jump that catapults the AI from roughly an entry-level programming contestant to a top 50 global score. On a software engineering benchmark (SWE-bench Verified), o3 scored about 71.7%, compared to o1’s ~48.9%. These numbers mean o3 can handle complex programming tasks and produce correct, verifiable code far more reliably than its predecessor. It’s as if o1 knew how to code but got tripped up by long problems, whereas o3 methodically works through each part of the problem until it’s solved.
Curious how to get o3 to produce code without a direct interface? You can do it via Deep Research now: point the tool at an open set of documentation, constrain the prompt, and instruct o3 to think deeply about the code and to enclose its response within <code> tags when it is absolutely sure the code will run.
(OpenAI o3 Released: Benchmarks and Comparison to o1) Figure: OpenAI’s o3 model versus o1 on coding benchmarks. On the left, o3 achieves 71.7% on a software engineering challenge (SWE-bench Verified), vastly outperforming o1’s accuracy. On the right, o3’s Codeforces competitive programming rating soars to 2727, compared to o1’s 1891. This places o3’s coding proficiency in the realm of top human competitors. Inference-time simulated reasoning allows o3 to plan and debug code far better than o1 (OpenAI o3 Released: Benchmarks and Comparison to o1) (o3: The grand finale of AI in 2024 - by Nathan Lambert).
What’s even more exciting is that o3 delivers these gains while also introducing new user-facing controls. OpenAI realized that not every query needs maximal reasoning, so they made reasoning time configurable. Developers using o3-mini can dial a “reasoning effort” setting to low, medium, or high (OpenAI o3 Released: Benchmarks and Comparison to o1). If you need a quick answer for a simple task, use low (faster and cheaper). If you have a really hard problem where accuracy matters more than speed, crank it to high and o3 will spend significantly more compute to get the best answer. This is essentially exposing the inference-time compute budget as a tunable parameter. It acknowledges the practical reality that more thinking costs more (in time and $$), but puts that power in our hands. It feels like having a car with different drive modes: eco for quick/light tasks, sport for heavy reasoning. The key point is, OpenAI made inference-time scaling not just an internal feature, but a user-controllable aspect of their API. That’s a strong sign they believe inference-time compute is a fundamental lever for AI performance going forward.
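As a sketch of what that dial looks like in practice, here's how a request payload might be assembled. The `reasoning_effort` field follows OpenAI's o-series chat API as I understand it; treat the exact field names as an assumption to verify against the current API reference. The code only builds the payload — it doesn't send anything:

```python
def build_request(prompt: str, hard: bool) -> dict:
    """Pick a reasoning budget per task: "low" for quick/cheap answers,
    "high" when accuracy matters more than speed.

    Payload shape is an assumption based on OpenAI's o-series chat API;
    check the live API reference before using."""
    return {
        "model": "o3-mini",
        "reasoning_effort": "high" if hard else "low",
        "messages": [{"role": "user", "content": prompt}],
    }

quick = build_request("What's 2 + 2?", hard=False)
deep = build_request("Prove this invariant holds for all n.", hard=True)
print(quick["reasoning_effort"], deep["reasoning_effort"])
```

The interesting design choice is that the budget is per-request, not per-model: the same weights serve both the eco and sport drive modes.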
To sum up the o1-to-o3 story: In the span of three months, inference-time reasoning transformed from a promising idea into a rocket engine of progress. o3’s teaser trailer at the end of 2024 was rightly called “the grand finale of AI in 2024” (o3: The grand finale of AI in 2024 - by Nathan Lambert), capping the year with a surprise leap in capabilities. Many of us didn’t expect such rapid improvement, but o3’s success signaled that the era of reasoning-optimized models had truly arrived. It proved that pre-trained models could be pushed much further with relatively straightforward additions: letting them think longer, teaching them via RL to use that time well, and, when in doubt, trying multiple reasoning paths until one works. The o3 family is essentially scaling laws personified: bigger, longer, smarter – and by “longer” I mean longer inference, not just longer training.
I can’t overstate my enthusiasm here, because it’s hard to compare o3 to any other model. If you’ve never seen an AI one-shot a 60-page report that’s extraordinary, well-cited, and well-organized, you are missing out! I summarized a few uses for Deep Research in an earlier post here. Watching o3 crush benchmarks that GPT-4 struggled with felt like witnessing a new paradigm take shape in real time. And lest you think I’ve forgotten the other players—it’s not just OpenAI in this game.
Hybrid Reasoning Systems: Inside Anthropic’s Claude 3.7 Sonnet
Just a couple of weeks ago, Anthropic unveiled Claude 3.7 “Sonnet”, which they touted as “our most intelligent model to date and the first hybrid reasoning model on the market.” (Claude 3.7 Sonnet and Claude Code \ Anthropic) As someone who had been using Claude 3.5 for a long time, I was very eager to test Claude 3.7 and see how their approach differed. What I found is that Anthropic took a slightly different, but equally fascinating, path toward maximizing inference-time compute.
One Model, Two Modes: Quick Reflexes and Deep Thought
Claude 3.7 is called a “hybrid” reasoning model because it seamlessly integrates two modes of operation: a fast, near-instant responding mode and an “extended thinking” mode for deep reasoning (Claude 3.7 Sonnet and Claude Code \ Anthropic). The genius of Anthropic’s design is that it’s one model that can do both. And to be fair to Anthropic, they are delivering on that promise faster than OpenAI, which has touted the same approach for ChatGPT 5 (which they say will unify o3 and ChatGPT 4.5).
In the distant past of 2024, one might have separate systems – say a base chatbot for quick responses and a specialized reasoner for complex tasks – but Claude 3.7 demonstrates you can fuse these into a single brain that switches strategies on the fly. As Anthropic put it, “just as humans use a single brain for both quick responses and deep reflection, we believe reasoning should be an integrated capability of frontier models” (Claude 3.7 Sonnet and Claude Code \ Anthropic). I find this philosophy compelling. It treats reasoning not as a bolt-on, but as a core feature of intelligence that can be dialed up or down as needed.
In practice, when you use Claude 3.7, you can tell it whether to answer normally or to “think longer before answering.” In the standard mode, it behaves like an upgraded version of the earlier Claude (the 3.5 Sonnet model) – very fast and conversational, better at code (and much more opinionated about code!).
But if you prefix your prompt with something like “Let’s work this out step by step” or enable the extended mode via API, Claude 3.7 will engage its deeper reasoning process (Claude 3.7 Sonnet and Claude Code \ Anthropic). In extended mode, it actually self-reflects and produces a chain-of-thought that is visible to the user.
This is a notable difference from OpenAI’s approach: Claude basically lets you peek under the hood at its thoughts. When I first tried this, I saw the model writing out its reasoning (sometimes numbered steps, sometimes a running commentary) before finalizing an answer. It’s transparently doing what o3 does privately. The benefit is I, as the user, get to follow along and even spot if its reasoning is going off-track. It’s a remarkably human-like experience to see an AI reason in English in front of you. Yes, DeepSeek R1 did it first, and credit to them for shifting the UX here for the entire industry.
Anthropic reports that this extended thinking mode “improves its performance on math, physics, instruction-following, coding, and many other tasks.” (Claude 3.7 Sonnet and Claude Code \ Anthropic) That aligns with my experience and the community feedback. Tough math problems that Claude might have gotten wrong in quick mode, it now nails because it can iterate through possible solution paths. Coding has also improved – Claude 3.7 can plan out code, consider edge cases, and even decide to write tests for the code as part of its solution, all in that extended chain-of-thought. It’s like pair-programming with an AI that can reason about its code before showing it to you.
I will add a caveat on the coding side: if you are an experienced developer, you may find Claude’s opinions difficult to disentangle from the code. Claude 3.7 has been trained to be more opinionated and more proactive with code, and you need to deliberately tell it to slow down and only do what you ask if you are an experienced developer who only wants an assistant that goes one step at a time at your instructions.
A powerful feature for developers is that through the API, you have fine-grained control over how long Claude can think. You can specify a token budget for its thought process: “think for no more than N tokens” (Claude 3.7 Sonnet and Claude Code \ Anthropic). N can be as high as the context window (128k tokens for Claude—still not big enough for a large coding task, but better). Context limits aside, the thinking side is strong: it means you could let Claude 3.7 effectively write a small novella of reasoning if you wanted to maximize accuracy on an insanely hard task! The ability to trade off speed vs. quality is directly in the user’s hands, mirroring OpenAI’s low/medium/high setting but with even more quantitative control. I absolutely love this feature – it acknowledges that inference compute is a resource we might want to allocate differently for different jobs. Need real-time answers? Use minimal inference. Need superhuman problem-solving? Throw the kitchen sink of compute at it. Claude gives us the dial for that.
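A hedged sketch of what setting that budget looks like: the payload below mirrors Anthropic's extended-thinking parameter as I understand it (the model name and field names are assumptions to check against the current docs), and it only builds the request without sending it:

```python
def build_claude_request(prompt: str, think_budget: int) -> dict:
    """Cap Claude's thought process at think_budget tokens.

    Shape is an assumption based on Anthropic's extended-thinking docs:
    thinking = {"type": "enabled", "budget_tokens": N}, and max_tokens
    must leave room for the visible answer on top of the thinking."""
    max_tokens = think_budget + 4_000  # headroom for the final answer
    return {
        "model": "claude-3-7-sonnet-latest",  # assumed model alias
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": think_budget},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_claude_request("Find a closed form for this recurrence.", 32_000)
print(req["thinking"]["budget_tokens"])
```

Dial `think_budget` down toward zero for conversational speed, or up toward the context window for the "small novella of reasoning" case.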
Performance: State-of-the-Art with “Extended Thinking”
How does Claude 3.7 stack up in terms of raw performance? I know engineers’ reviews have been mixed, but my overall impression is positive. In early evaluations and beta testing, it has been best-in-class on coding tasks and highly competitive on reasoning benchmarks. It is worth noting that, for reasons not entirely clear, Claude 3.7 performs differently (much better) in Claude Code (the terminal) vs. in Cursor or other tools. No, I really don’t know why either.
Despite the difference in performance, early testers have been positive. For instance, an early tester (Cursor) noted Claude is “once again best-in-class for real-world coding tasks,” showing significant improvements in handling complex codebases and tool use (Claude 3.7 Sonnet and Claude Code \ Anthropic). Another eval (by Cognition) found Claude 3.7 to be far better than any other model at planning multi-step code changes and handling full-stack software edits (same source). These are exactly the kinds of real-world engineering problems that trip up less capable models. Claude’s extended reasoning evidently enables it to keep track of larger contexts and goals in code.
Anthropic also showcased Claude 3.7’s dominance on new benchmarks like SWE-Bench (software engineering) and TAU-Bench (agent-based tasks), where it achieved state-of-the-art results (Claude 3.7 Sonnet and Claude Code \ Anthropic). In a benchmark table they shared, Claude 3.7 “excels across instruction-following, general reasoning, multimodal capabilities, and agentic coding, with extended thinking providing a notable boost in math and science” (same source).
TLDR: Claude 3.7 is not only great at following user instructions and basic Q&A (as one expects from a refined LLM), but when it goes into reasoning mode, it especially shines in domains like math and science that benefit from longer thought. Anthropic even tested Claude in a quirky domain – playing the game Pokémon Red – by hooking it up as an agent that can press buttons in the game. Thanks to its extended thinking and planning (plus some fine-tuning for agent actions), Claude 3.7 far outperformed previous models in achieving game milestones, and delivered a watchable Twitch stream in the process. It’s a whimsical but telling example: this model can maintain focus over a long series of actions to accomplish open-ended goals. Whether it’s debugging a program or navigating a video game, the capacity to think longer is the secret sauce.
One experiment from Anthropic’s research really blew my mind and underscores the power of inference-time scaling. They tried using parallel thought processes with Claude 3.7 – essentially running multiple reasoning threads in parallel and then picking the best answer. Concretely, they sampled 256 independent chains-of-thought for a set of hard science questions (the GPQA science benchmark) and used a separate scoring model to choose the most promising answer. With this massive ensemble of thoughts, Claude 3.7 achieved 84.8% accuracy on the GPQA exam, including 96.5% on the physics section (Claude's extended thinking \ Anthropic). Those numbers are unprecedented; they’re substantially higher than what even a single extended chain could do. And importantly, the results “benefit from continued scaling” – meaning even 256 parallel thoughts wasn’t the ceiling; even more would keep improving it.
This is essentially a brute-force amplification of inference compute: if one chain-of-thought might get it wrong, try hundreds and hope one gets it right, then use an AI judge to select the right one. It’s expensive, but it demonstrates an intriguing point on the path to AGI: if you have enough compute, you can solve problems that were previously unsolvable by just one pass of a neural network. With Claude 3.7, Anthropic showed that both serial (longer single chains) and parallel (many chains) inference-time scaling can dramatically improve results in a predictable way.
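The whole best-of-n recipe fits in a few lines. In this toy sketch both the sampler and the scoring model are mocked out with random stand-ins (the real experiment used 256 parallel Claude chains and a learned scorer; everything here is illustrative):

```python
import random

random.seed(0)

def sample_chain(question: str) -> tuple[str, str]:
    # Stand-in for one independent chain-of-thought from the model;
    # in the real setup this is one of 256 parallel API calls.
    answer = random.choice(["A", "B", "C", "D"])
    return answer, f"reasoning trace leading to {answer}"

def judge_score(question: str, answer: str, trace: str) -> float:
    # Stand-in for the separate scoring model. Here it simply prefers
    # "C", so any chain that lands on "C" outranks the rest.
    return random.random() + (1.0 if answer == "C" else 0.0)

def best_of_n(question: str, n: int = 256) -> tuple[str, str]:
    """Parallel inference-time scaling: sample n independent chains,
    then let a scoring model pick the most promising answer."""
    candidates = [sample_chain(question) for _ in range(n)]
    return max(candidates, key=lambda c: judge_score(question, *c))

answer, trace = best_of_n("Which option is correct?")
print(answer)
```

Note the cost structure the sketch makes obvious: n chains means n times the inference compute, and the quality of the whole ensemble hinges on the judge being a better evaluator than the generator is a solver.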
Of course, all this capability comes at a cost. Extended thinking uses more tokens, which are billed just like output tokens on Anthropic’s API. And running 256 parallel chains is 256x the cost of one query (Anthropic did that as a research demo, not as a default service!). Yet, I find it hard to be cynical about the cost when the payoff is so historic. We’re watching an entirely new dimension of model scaling play out. Yes, it costs more—but it works. As one analyst quipped, these reasoning models “vastly increase compute, cost, and power requirements (in a way, these are all the same thing on a variable cost level)” (AI Reasoning—What is It? - by James Wang - Weighty Thoughts). That’s true – inference-time compute is not free. But I view it as investing compute in exchange for breakthroughs. And as hardware improves and optimizes for these workloads, the costs will come down.
Before moving on, it’s worth noting the contrast between OpenAI and Anthropic’s approaches. OpenAI’s o-series (o1, o3, etc.) uses hidden chains-of-thought in dedicated reasoning models, separate from the base GPT line. Anthropic unified it: one Claude that can be fast or slow, and it even lets you see the slow thinking. There are pros and cons. Claude’s visible reasoning is great for trust and debugging – you can catch it if it starts reasoning incorrectly. However, Anthropic themselves noted some downsides, like the fact that the visible chain-of-thought can sometimes be misleading or not perfectly “faithful” to what the model truly computes internally (Claude's extended thinking \ Anthropic).
There’s also a safety consideration: exposing the chain-of-thought might give bad actors hints on how to jailbreak the model or trick it. OpenAI, by keeping o3’s thoughts private, avoids some of that but at the expense of transparency. It will be interesting to see which approach wins. Personally, I love the transparency of Claude 3.7; it feels like I can reason alongside it. But I understand why OpenAI might keep the magic hidden and just give the polished answer. Either way, the common denominator is that both companies have made reasoning a first-class citizen of AI.
What’s Next?
As of this writing (March 2025), inference-time compute is at the forefront of AI research and deployment. The past two months have given us OpenAI’s o3 and Anthropic’s Claude 3.7. Everywhere you look, the pattern is the same: we are augmenting our models with more computation when they answer questions, and the gains are accelerating.
The State of Play in Early 2025
Here’s a quick rundown of the latest as we stand today:
OpenAI: After o3’s preview in December, the full o3 model is expected to launch publicly in early 2025 (OpenAI o3 Released: Benchmarks and Comparison to o1). They have o3-mini out already (great for STEM applications), and o3 full is undergoing final safety tests. OpenAI has also released research on “deliberative alignment,” showing how these reasoning models (the o1 class) can be steered more safely (o3: The grand finale of AI in 2024 - by Nathan Lambert) – indicating they’re mindful of aligning these more powerful reasoners.
Anthropic: Claude 3.7 is live (for paid tiers and API) with its hybrid reasoning. They also released Claude Code (an “agentic coding” tool) alongside it (Claude 3.7 Sonnet and Claude Code \ Anthropic), which leverages Claude’s reasoning to autonomously write, test, and debug code. This hints at a future where AI agents (powered by inference-time reasoning) do significant chunks of software engineering for us. Anthropic’s research on visible thoughts and scaling inference (cited above) is ongoing, meaning Claude 4 or 5 could push this even more.
DeepMind/Google: While not covered deeply in this article, Google’s DeepMind is certainly not idle. Google’s “Gemini 2.0 Flash Thinking” reportedly takes an approach similar to simulated reasoning (OpenAI o3 Released: Benchmarks and Comparison to o1). If those reports are accurate, Google might integrate inference-time search or multi-step thinking into Gemini. I predict Google will showcase something akin to o3/Claude soon – they won’t let others run away with the reasoning crown.
Others: New players like DeepSeek (with its R1 model) are entering with their own reasoning-optimized systems (OpenAI o3 Released: Benchmarks and Comparison to o1). There’s also a flurry of academic work on things like tree-of-thought (branching reasoning paths), majority voting, and ensembling models at inference (Claude's extended thinking \ Anthropic). It feels like an entire subfield has blossomed: call it “Test-Time Optimization” or “Inference-Time Alignment.” The goal is clear – squeeze more problem-solving juice out of our existing models by clever use of compute.
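Of the techniques above, majority voting (often called self-consistency) is the simplest to show. The idea: sample several independent chains-of-thought, keep only their final answers, and return the most common one. A minimal sketch, with made-up sample answers:

```python
from collections import Counter

def self_consistency(final_answers: list[str]) -> str:
    """Majority vote over the final answers of independently
    sampled chains-of-thought: the modal answer wins."""
    return Counter(final_answers).most_common(1)[0][0]

# Suppose 7 sampled reasoning chains ended in these final answers:
votes = ["42", "42", "41", "42", "40", "42", "41"]
print(self_consistency(votes))  # → 42
```

The intuition is that wrong chains tend to be wrong in different ways, while correct chains converge on the same answer, so the mode is a better estimate than any single sample.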
Turning Toward AGI
Now, let me make some confident (and dare I say, bullish) predictions. I say “confident” because I genuinely believe the evidence and trends are on my side—no hedging needed.
1. Inference-time compute is now the biggest driver of AI progress, and it will dominate scaling in 2025 and beyond. We will likely see multiple new state-of-the-art results not from training bigger models (those will come too, but incrementally), but rather from using more inference compute on existing models. A respected AI newsletter predicted exactly this, stating that inference-time scaling will dominate progress in 2025 (Zeta Alpha's 10 Predictions for AI in 2025). I’m already seeing it happen: o3’s leap overshadowed any new model architecture, and the community is rallying around techniques like self-consistency, debates, and reasoning tool use – all of which occur at inference.
2. We will break more “barrier” benchmarks using inference scaling. Just as o3 broke the 85% ARC barrier, I predict that in 2025 we will see near-100% performance on things like AIME (competition math) and maybe even solve problems that were out of reach before. As a concrete example, I expect GPT-5 (or its successor) combined with extended inference could finally crack the toughest math Olympiad problems or complex programming challenges that stumped it before. With enough compute, I bet we’ll see an AI perfectly solve an IMO (International Math Olympiad) problem – something that would truly signal reasoning at the level of the best humans.
3. Inference-time strategies will be crucial to reaching something like AGI (Artificial General Intelligence). This is a bold claim, so let me clarify: I don’t mean just looping a current model indefinitely will magically produce AGI. But I foresee that the first systems we recognize as AGI-level will heavily use inference-time tools – they will reason, reflect, possibly decompose tasks into sub-tasks, call other models or modules, etc., all during inference. In short, they will think, not just regurgitate. We’re already seeing sparks of that. As The Atlantic noted in a piece about this trend, OpenAI’s o1 was described by an OpenAI VP as fundamentally different because it can “reason”—Altman even called it the dawn of the “Intelligence Age” (Five Predictions for AI in 2025). Grand words, but perhaps warranted. The way I see it, reasoning models are taking us from the era of predicting text to the era of solving problems. If achieving AGI is like reaching the summit of a mountain, inference-time compute is the rocket that will get us up the steepest final stretch – the part too complex and too nuanced for static behavior learned from data alone.
4. We’ll need new hardware and optimizations to support this revolution. Right now, using something like o3 at full tilt is expensive (remember that $1k per query figure for ARC at 64 samples!). But this simply means there’s huge opportunity and demand for inference-optimized chips and systems. Companies like Nvidia are already onto this, with inference-focused GPUs, and startups like Cerebras and SambaNova are likely to capitalize on the need for fast, memory-rich hardware that can handle 100k-token contexts and multiple parallel model instantiations (Five Predictions for AI in 2025). I predict that by the end of 2025, we’ll see at least one breakthrough in hardware or model optimization that makes a 10× inference-time increase feasible at similar cost. This could be sparsity, caching of reasoning paths, or novel AI accelerators. When that happens, the floodgates will really open – because if we can afford to let models think 100× longer, many tasks that are just out of reach today will suddenly become solvable.
5. Inference-time algorithms will diversify. Right now, chain-of-thought and voting (self-consistency) are popular. I anticipate new algorithms coming into play: maybe a train of thought that can backtrack when it realizes an error (imagine a model that can rewrite its own chain if it hits a contradiction), or multi-agent debate where two instances of a model argue and refine an answer. We might see differentiable search integrated with language models, or retrieval-augmented reasoning where the model at inference can call an external verifier or tool to check each step. Essentially, we’re going to embed mini “research projects” into single question-answering sessions. And it will work – because why wouldn’t an AI perform better if it can, say, run a Python simulation of a physics problem during its thinking process? All of that is inference compute put to good use. I dare say, anything humans do to solve hard problems (draw diagrams, run tests, discuss with peers), AIs will try doing internally with enough compute budget. That’s what makes me so excited – it’s a convergence of AI with the actual process of intelligence.
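The backtracking idea is worth sketching, because it is such a natural extension of chain-of-thought. Below is a toy loop that builds a reasoning chain one step at a time, runs each step past a verifier, and re-samples any step that fails instead of letting the error propagate. Both `propose_step` and `verify` are invented stand-ins (a real system would use the model itself and an external checker or tool):

```python
def propose_step(state: list[str], attempt: int) -> str:
    """Toy stand-in for the model proposing the next reasoning step.
    The first attempt at step 3 is deliberately flawed, to show
    the backtracking mechanism firing."""
    if len(state) == 2 and attempt == 0:
        return "step 3 (contains a contradiction)"
    return f"step {len(state) + 1} (consistent)"

def verify(step: str) -> bool:
    """Toy stand-in for an external verifier/tool checking one step."""
    return "contradiction" not in step

def verified_chain(n_steps: int, max_retries: int = 3) -> list[str]:
    # Build the chain step by step; re-sample any step that fails
    # verification rather than appending it and continuing.
    state: list[str] = []
    while len(state) < n_steps:
        for attempt in range(max_retries):
            step = propose_step(state, attempt)
            if verify(step):
                state.append(step)
                break
        else:
            raise RuntimeError("no valid step found; backtrack further")
    return state

for s in verified_chain(4):
    print(s)
```

Even this toy shows the payoff: the flawed first attempt at step 3 never makes it into the chain, so the final answer is built only from verified steps.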
6. The future remains dollars per token per watt. Ultimately, we are resolving the cost of intelligence into a power equation, and the governor on our pace of intelligence gain is increasingly going to be our ability to drive increased power efficiency per token of output. The capital pouring into the space is going to make this a multi-threaded news space for the next few years: you’ll have super savers like DeepSeek modifying their NVIDIA hardware for max efficiency while others with more capital build massive models (GPT-4.5) and seek to gain economies of scale. I don’t know which approach (or ensemble of approaches) will win, but the fundamental unit economics of the space are defined: we are going to see increased intelligence per dollar per watt over the next few years, and that will change everything. And reasoning is a big reason why (pun intended).
A New Paradigm Solidifies
Let’s step back and appreciate the paradigm shift. Not long ago, the path to better AI seemed to be solely: train bigger models on more data. Today, we’ve added a new dimension: make the models think longer and harder on each task. This is such an intuitive development in hindsight—of course thinking more should yield better results—but then most discoveries seem obvious in hindsight, don’t they?
We humans have obvious limits to our working memory and how much we can compute in our heads, but we overcome those by writing things down, deliberating, and collaborating. Now our AI models are learning to do the same: to use the massive computation at their disposal not just to store knowledge, but to process it in sophisticated ways when needed. The transformation is evident in the leap from GPT-4 to o3, or Claude 3.5 to Claude 3.7 – the latter of each pair feels qualitatively more “intelligent” in problem-solving. The latter models win by using their compute budget more intelligently through reasoning.
In a way, inference-time compute is the great equalizer. Even a smaller model given enough thinking time can outperform a larger model that is rushed. This opens up interesting possibilities: do we really need 1 trillion parameter behemoths if a 100B model with a clever inference strategy can do just as well? Perhaps efficient reasoning will allow us to achieve more with less. Or conversely, perhaps the combination of giant models and extended reasoning will push us into truly uncharted territory.
I am confident that as we move through 2025 and beyond, we’ll reflect on this period as a turning point. The inference-time compute paradigm is here to stay. In fact, I’d argue it’s becoming a foundational pillar of AI alongside model scale and data scale. Each pillar boosts the others: larger models make reasoning better, more data can teach better reasoning heuristics, and better reasoning lets smaller models punch above their weight. The endgame is an AI that has both an enormous knowledge base (from training) and an extraordinary ability to think on the fly (from inference). Such an AI starts to sound an awful lot like the “general intelligence” we’ve been striving for.
In writing this note on inference-time compute, I’m struck by how fast the landscape is evolving. Only a year ago, chain-of-thought was a neat trick in research labs; now it’s at the core of flagship products. Only a few months ago, we wondered if models had hit a reasoning ceiling; now that ceiling has been smashed by o3’s high-mode and Claude’s extended thoughts. As an AI practitioner and enthusiast, I am more excited (and tired) than ever. I have seen the future of AI – it spends a bit more GPU time mulling over problems, and it wows us with what it can achieve.
So here’s to inference-time compute, the unsung hero turning mere large language models into creative problem solvers. By empowering our AIs to think longer, we have empowered them to be smarter. And this, I believe with full conviction, is the catalyst that will bring us to the next era of AI, an era that might just earn the name “AGI.”