22 Comments
Philip Shane:

This is fantastic. You're becoming my AI Sommelier!

Nate:

This is sparkling AI from the Sonoma region of California…

Philip Shane:

🤣

Tango Libertad:

We got gadgets and gizmos a-plenty

We got whozits and whatzits galore...

Nate:

lol

Cam:

This isn’t just about choosing the “best AI model.”

It’s the start of the AI Credit Rating System - the Moody’s of the model economy.

✅ Accuracy

⚡️ Efficiency

🛡 Reliability

Built as middleware, it becomes the trust layer.

Whoever builds it? Owns the stack.

Even in-house, the stakes are huge.

Shopify has 8,000+ employees using AI.

What if you knew which models actually delivered - per use case, per team?

A Live Ratings Layer =

🧭 Strategic clarity

📉 Less hype waste

💡 A new internal OS for AI decisions
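
To make this concrete: a minimal sketch of what a live ratings layer could look like, assuming hypothetical models, category scores, and per-use-case weights (every name below is illustrative, not from any real product):

```python
# Hypothetical sketch of a per-use-case "live ratings layer".
# All model names, category scores, and weights are illustrative.

RATINGS = {
    "model-a": {"accuracy": 8.5, "efficiency": 7.0, "reliability": 9.0},
    "model-b": {"accuracy": 9.0, "efficiency": 6.0, "reliability": 7.5},
}

# Each team or use case supplies its own weights (summing to 1.0).
USE_CASE_WEIGHTS = {
    "customer-support": {"accuracy": 0.3, "efficiency": 0.2, "reliability": 0.5},
    "code-generation":  {"accuracy": 0.5, "efficiency": 0.3, "reliability": 0.2},
}

def score(model: str, use_case: str) -> float:
    """Weighted rating of one model under one use case's priorities."""
    weights = USE_CASE_WEIGHTS[use_case]
    return sum(RATINGS[model][cat] * w for cat, w in weights.items())

def rank(use_case: str) -> list[tuple[str, float]]:
    """All models, sorted best-first for a given use case."""
    return sorted(
        ((m, score(m, use_case)) for m in RATINGS),
        key=lambda pair: pair[1],
        reverse=True,
    )

print(rank("customer-support"))  # [('model-a', 8.45), ('model-b', 7.65)]
```

The same model can rank first for one team and last for another; that is the whole point of rating per use case instead of globally.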

Nate:

Moody’s is an interesting analogy here actually. Wonder if we’ll start to see ratings like that for models.

Cam:

On reflection, in a future where the value of a business relies heavily on the quality of proprietary models, perhaps it is the internal model-rating methodology, rather than an external rating, that is of value.

Nate:

That may well be how it shakes out! Work is so varied that the standard benchmarks aren't particularly useful.

Lisa Harper:

It would also be useful to add a category for pricing (free, plus, pro) for those of us who can’t afford to pay for more than one at a time. That said, your little app helped me decide on Claude this month. 😀

Nate:

Pricing adds like a ton of complexity because it changes all the other variables … but I agree it would be helpful

Lisa Harper:

Ah! I was considering a much simpler concept. Simply labeling the models by price level.
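
For example, something as simple as this (the models and tiers below are made up):

```python
# Hypothetical sketch of labeling models by price level.
PRICE_TIER = {
    "model-a": "free",
    "model-b": "plus",
    "model-c": "pro",
}

def label(model: str) -> str:
    """Show the price tier next to the model name."""
    return f"{model} [{PRICE_TIER.get(model, 'unknown')}]"

print(label("model-b"))  # model-b [plus]
```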

Byron:

Learning that ChatGPT o4 underperforms ChatGPT o3 is a huge insight for me. The little descriptions in the box where you choose the model on ChatGPT and Gemini don't make it easy to determine which model to use. Your dashboard does. Thank you.

Guto:

I don't think we can establish that o4 is worse than o3, since the full o4 model is not available yet. We only have o4-mini to choose from.

Also, I didn't see o4-mini in Nate's app, so even comparing o4-mini to o3 in his tool isn't possible yet. Or am I missing something?

Nate:

So the app shows 4o, not o4. And I stayed away from the mini version because of the complexity of showing so many models.

And I feel good about saying 4o is worse than o3, and I’m sure o4 will be better than o3. What a timeline lol

Nate:

Yes I know it’s crazy but 4 < 3 in this world. At least for now. Glad it helps!

Tim McAllister:

Nate, one major gap I’ve seen, especially with models like Claude, is reliability and consistency: frequent output truncation, unexpected timeouts, and failing to finish a task are real issues. I’d recommend adding “Reliability/Consistency” as its own category, or giving users the option to define custom categories for the dealbreaker failure modes they experience. If a model can’t deliver a complete, uninterrupted result, it needs to be scored down, no matter how strong it looks on benchmarks or the other weighted attributes.
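
To illustrate the dealbreaker idea, here's a rough sketch (the threshold, numbers, and names are all made up, not from Nate's app): a model that can't reliably finish gets scored down regardless of its weighted benchmark score.

```python
# Hypothetical sketch: treating completion failures as a dealbreaker
# rather than just another weighted attribute. Threshold is illustrative.

DEALBREAKER_COMPLETION_RATE = 0.95  # below this, the score is penalized

def adjusted_score(weighted_score: float, completion_rate: float) -> float:
    """Score a model down hard if it can't reliably deliver a complete,
    uninterrupted result, no matter how strong its weighted score is."""
    if completion_rate < DEALBREAKER_COMPLETION_RATE:
        # Scale the score by how often the model actually finishes.
        return weighted_score * completion_rate
    return weighted_score

# A model scoring 9.2 that finishes only 80% of tasks drops to 7.36.
print(adjusted_score(9.2, 0.80))
```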

Nate:

Yeah I felt like adding even more categories would be overwhelming, but you're right—that's a separate concern and it's a real issue with Claude, particularly on lower plans.

Tim McAllister:

Great approach, Nate! Thanks for the life preserver! 🛟🛟🛟

Really like how you provide customization to weight the rankings to my needs, use cases, and priorities. That’s critical to me, and it’s why I switched (primarily) from #Claude to #ChatGPT, given Claude’s maddening 😤 shorter “output length limits” and tighter “timeouts and latency controls”.

Nate:

I think personalizing rankings is really important given how differently we all use these models, but I haven’t really seen a leaderboard that does this.

M-at:

This is awesome!

Sean Eisenstein:

This is super cool. Thanks Nate!
