The Complete Guide to ChatGPT o3: Includes Tests vs. Gemini 2.5 Pro, Performance at ~12 Job Skills, OpenAI Notes Review, Links to Other Reviews
I have 23 pages of testing here, but I give you the TLDR on o3 right at the top, along with the actual practical job skills where I expect o3 to excel, and why I think the future accelerated today.
So o3 dropped yesterday, April 16th. Take a breath. First, here are the gigantic o3 model notes from OpenAI.
My first impression is that this is a very, very good model. It is a step forward and arguably State of the Art (SOTA), but AGI claims are definitely overrated. As an example, I asked it to write this Substack for me, and it absolutely failed lol
Kidding aside, I do think we should look at this as a new high-water mark on one of the only benchmarks that really matters: the world’s most generally useful everyday model.
It has become fashionable to talk about model capability scaling so fast that people are going to pick models based on vibes or preferences, because for most tasks people can’t tell the difference.
I think this is partly true:
It is true in that models are now scaling past the median knowledge worker task, and people can reliably use a very wide range of models for low-utility tasks.
It is false in that models continue to show very strong differentials on very high-leverage tasks, so choosing the correct model and giving it the right inputs matters more than it ever has.
Basically, for the 10% of your work that matters the most, picking the absolute best model matters a whole lot. And if it’s the best model, why not use it for most things?
And that’s why I wrote this. I wanted to figure out what the best State of the Art model for everyday use is as of today (April 16, 2025), and I wanted to do it in a rigorous way, which means with a series of very hard challenges, rigorously graded. These challenges indirectly measure work skills, and I’ll catalog how those map too.
Read on for the three challenge prompts, very detailed Google Docs that show the full results from both Gemini 2.5 Pro and ChatGPT o3, and an overall score, plus notes on where each model is strong and weak.
Enjoy, and happy model release day!