Opus 4.8 scored 81 in my benchmark. I still wouldn't default to it. (The full breakdown + Nate's Community Slack)

Claude Opus 4.8 is excellent. The harder question is where it should replace your current workflow, where it should be a specialist, and where turning the reasoning dial up can make the work worse.

Nate

Jun 03, 2026

After I read the runs, I wanted the recommendation to be simple: use Opus 4.8.

The score almost lets you say that. In my current benchmark suite, Opus 4.8 is the leader. It scored 81 on the strict average. GPT-5.5 scored 71. The rest of the field was well behind: Gemini 3.5 Flash High Fast at 56, Opus 4.7 at 54, Sonnet 4.6 at 52, GPT-5.4 at 51, and Gemini 3.1 Pro at 38.
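To make the spread concrete, here is a minimal sketch that ranks the strict-average scores quoted above and prints each model's gap to the leader. The scores are the ones reported in this post; the function name `gaps_to_leader` and the table layout are illustrative assumptions, not part of the benchmark tooling.

```python
# Strict-average scores as reported in this post.
scores = {
    "Opus 4.8": 81,
    "GPT-5.5": 71,
    "Gemini 3.5 Flash High Fast": 56,
    "Opus 4.7": 54,
    "Sonnet 4.6": 52,
    "GPT-5.4": 51,
    "Gemini 3.1 Pro": 38,
}


def gaps_to_leader(scores):
    """Return (model, score, points behind the leader), best score first.

    Hypothetical helper for illustration only.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    top_score = ranked[0][1]
    return [(name, score, top_score - score) for name, score in ranked]


for name, score, gap in gaps_to_leader(scores):
    print(f"{name:<28} {score:>3}  behind leader by {gap}")
```

Read this way, the headline numbers are a 10-point lead over GPT-5.5 and a 27-point jump from Opus 4.7 to Opus 4.8. An average like this still says nothing about which individual tasks each model won or lost.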

If all you want is a leaderboard, the article can end there.

But that would be a bad article, and it would make you worse at choosing models.

The result gets more useful when you stop and look at the individual runs. Opus 4.8 won because it was much better than Opus 4.7 at the parts of work that usually break professional AI output: source discipline, operational judgment, canary handling, provenance, self-correction, and knowing when a messy data problem should be reviewed instead of quietly “fixed.”

I care about that more than I care about a slightly prettier answer.

It also did not win every task. GPT-5.5 beat it on the Artemis visualization. Opus 4.8 still had visual and front-end weaknesses in multiple runs. And outside our suite, Andon Labs found a long-horizon business benchmark where Opus 4.8 on max effort did worse than Opus 4.8 on high effort, and both did worse than Opus 4.7.

That last point is the one I keep coming back to, because it breaks the lazy way people talk about model launches.

We are used to asking, “Which model is smartest?”

I still want to know the answer. But if you are actually building, managing a team, buying enterprise licenses, or choosing your own daily tool, the question has more parts:

What work are you doing?
How long does the task run?
What source material does it need?
What tools can the model use?
Can it inspect the artifact it just made?
Does it preserve state when the work gets long?
How much does the human have to babysit it?
What happens when it is uncertain?
What does a failed run cost you?

Those questions decide whether the model saves you time or creates another review queue.

So I am not treating Opus 4.8 as a “switch everything” release. It is one of the best models available right now. It is the best model in my current strict suite. I would use it aggressively for some work. I would not blindly make it my default for every long-running workflow.

Here’s what I’m covering:

Every test, scored and picked apart. Where Opus 4.8 won, where GPT-5.5 beat it, and where the score hides real caveats.
The effort-level trap. The Vending-Bench data on why max can make long-running work worse, and how I configure each mode for real work.
How I actually choose my daily tools. Why I still reach for Codex/5.5 despite the score, plus a routing guide for when to use Opus 4.8, Codex/5.5, and GPT-5.5.
What builders, leaders, and executives should each do differently. Role-specific guidance and four prompts you can paste and use today.

The reasoning is below, along with the tools to make the same decision for your own stack.
