Using GPT-5.5 for the first time was the most blown-away I have felt about a model release in a while, and the reason is not benchmark scores. The reason is that I handed it the kind of work that breaks models, the kind with messy files and legal risk and 23 deliverables that have to open in the right format, and it came back with something close to a real executive handoff. That has not happened before.
For the last several months, Anthropic felt to me like it held the lead on practical knowledge work. Opus 4.7 was where I went first when the work was real, Sonnet was the workhorse for everything else, and ChatGPT was something I checked in on rather than something I built around. GPT-5.5 changes that. The model is stronger than anything I have used on complex, multi-step work. The harness around it, Codex plus computer use plus Images 2, turns that strength into a system that can actually finish things. I still use Anthropic models every day, and later in this piece I will get to where Opus remains the right call. But the gap on serious execution work is wide enough that I would have to invent reasons not to start here.
This review is also about where GPT-5.5 is not perfect, because the edges matter just as much when the work is going into production.
Here’s what’s inside:
Three hard tests, not benchmarks. An executive knowledge-work package, a messy 465-file data migration, and an interactive 3D research build, each designed to fail in a different way.
Where GPT-5.5 won by a wide margin. The Dingo test produced the closest thing to a real executive handoff I have seen from any model, with real files, real legal posture, and real artifacts.
Where it still needs help. The Splash Bros data migration cleared the canary check for the first time, but backend hygiene is still not production-safe, and the Artemis visualization shows that blank-canvas visual taste remains Claude’s territory.
How I am actually routing work now. The defaults that changed, the two-model workflows that did not, and the top 10 takeaways and prompt tips I would hand to anyone using these tools for real work.
Five prompts that raise the bar. Not better questions for the model. Better work for the model. A stress-test finder that interviews you until it finds the most ambitious task you can realistically delegate, plus templates for multi-artifact business packages, validated data migrations, structure-first long-form writing, and routing your real weekly work to the right model and surface.
The clearest way to see what changed is through the tests themselves. Start with why the floor moved.