I asked ChatGPT 5.4 a question today. A simple one: “I need to wash my car. The carwash is 100 meters away. Should I walk or drive?”
ChatGPT 5.4, the model OpenAI just positioned as its most capable system for professional work, thought for a couple of seconds. Then it wrote a full essay. Walk, it said. At 100 meters, driving is more hassle than it’s worth. It listed exceptions: icy conditions, mobility issues, carrying heavy items. It closed with a nuanced distinction between “driving there” as a transportation decision and “repositioning the car.” Thorough. Well-structured. Completely wrong.
I asked Claude the same question. Claude thought for a moment and wrote one sentence: “Drive. You need the car at the carwash.”
Gemini 3.1 Pro got it right too, calling it a trick question and noting that unless you have an incredibly long extension cord and a hose, you have to drive. Every frontier model got this right except GPT-5.4 Thinking. The model that OpenAI says is ready to run your professional workflows wrote a careful, confident, wrong answer to a question a child would get right.
And that, in one shot, is the story of GPT-5.4.
Except it isn’t. It’s not that simple, because GPT-5.4 is better than Opus 4.6 at some things (genuinely, measurably better), and I’m not going to take one silly example and milk it for outrage. What I am going to point out is that if you position your model as the best in the world, it has to survive ordinary real-world test cases. It cannot be behind the other frontier models on questions that aren’t even trick questions. And I’m going to show you the full picture, because the full picture is more interesting than any single test.
I ran blind evaluations. Not vibes. Not “I tried it for an afternoon.” Six structured evals, independent judging, outputs labeled by number so the judge never knew which model produced what. I also had a second AI-fluent person verify the results. We tested GPT-5.4 against Claude Opus 4.6 and Gemini 3.1 Pro on real tasks, the kind of work you’d hand a model on a Tuesday afternoon and expect to use Thursday morning.
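If you want to replicate the setup, the core mechanism is small: shuffle the outputs and swap model names for numbers before the judge sees anything. Here’s a minimal sketch in Python; the function name and the placeholder responses are mine for illustration, not the actual eval harness.

```python
import random

def blind_labels(outputs, seed=None):
    """Shuffle and renumber model outputs for blind judging.

    outputs: dict mapping model name -> response text.
    Returns (labeled, key): `labeled` maps an anonymous number to a
    response (this is all the judge sees); `key` maps that number back
    to the model name and stays sealed until scoring is finished.
    """
    rng = random.Random(seed)
    models = list(outputs)
    rng.shuffle(models)
    labeled = {i + 1: outputs[m] for i, m in enumerate(models)}
    key = {i + 1: m for i, m in enumerate(models)}
    return labeled, key

# Illustrative use with placeholder responses (not the real eval data):
labeled, key = blind_labels({
    "gpt-5.4": "...",
    "opus-4.6": "...",
    "gemini-3.1-pro": "...",
})
# The judge scores labeled[1], labeled[2], labeled[3] with no model
# names attached; `key` is consulted only after all scores are recorded.
```

The point of the sealed key is that neither the judge nor the verifier can unconsciously grade toward a favorite; the mapping is only opened after every score is locked in.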
The short version: GPT-5.4 is not the best model. It is not the worst model. It is the most interesting model I’ve tested — and interesting for reasons that have almost nothing to do with the benchmarks OpenAI is promoting.
Here’s what’s inside:
The toggle that splits one model into two products. The single most important finding in the eval suite, and the thing most users will never think about.
Where 5.4 genuinely wins. Quantitative modeling, file processing, and competitive self-knowledge, with the eval data to back it up.
Where it falls apart, and what that reveals about intelligence. Writing quality, product judgment, and a subtle failure mode I’m calling the pipeline problem.
The Steinberger signal. Why the most important AI hire of the year explains this entire release.
What OpenAI is actually building. It isn’t a chatbot. It’s agentic infrastructure, and the benchmarks they chose to promote tell you exactly that.
The models are converging on capability and diverging on philosophy. Pay less attention to who won the benchmark and more attention to what the benchmark was measuring.