Nate’s Substack

The Complete Guide to ChatGPT o3: Includes Tests vs. Gemini 2.5 Pro, Performance at ~12 Job Skills, OpenAI Notes Review, Links to Other Reviews

I have 23 pages of testing here, but I give you the TLDR on o3 right at the top, along with the actual practical job skills where I expect o3 to excel, and why I think the future accelerated today.

Nate
Apr 17, 2025

So o3 dropped yesterday, April 16th. Take a breath. First, here are the gigantic o3 model notes from OpenAI.

My first impression is that this is a very, very good model. It is a step forward, and it is arguably State of the Art (SOTA), but AGI claims are definitely overrated. As an example, I asked it to write this Substack for me, and it absolutely failed lol

Kidding aside, I do think we should look at this as a new high-water mark on one of the only benchmarks that really matters: the world’s most generally useful everyday model.

It has become fashionable to talk about model capability scaling so fast that people are going to pick models based on vibes or preferences, because for most tasks people can’t tell the difference.

I think this is partly true:

  1. It is true in that models are now scaling past the median knowledge worker task and people can reliably use a very wide range of models for low-utility tasks

  2. It is false in that models continue to show very strong differentials on very high leverage tasks, so that choosing the correct model and giving the model the right inputs matters more than it ever has

Basically, for the 10% of your work that matters the most, picking the absolute best model matters a whole lot. And if it’s the best model, why not use it for most things?

And that’s why I wrote this. I wanted to figure out what the best State of the Art model for everyday use is as of today (April 16, 2025), and I wanted to do it in a rigorous way, which means with a series of very hard challenges, rigorously graded. These challenges indirectly measure work skills, and I’ll catalog how those map too.

Read on for the three challenge prompts, very detailed Google Docs that show the full results from both Gemini 2.5 Pro and ChatGPT o3, and an overall score, plus notes on where each model is strong and weak.

Enjoy, and happy model release day!

© 2025 Nate