I built a customizable dashboard that turns 7 major AI model releases into one clear decision, and yes, you can tune it to your preferences right now! It's open-source and easy to remix, too.
This is fantastic. You're becoming my AI Sommelier!
This is sparkling AI from the Sonoma region of California…
🤣
We got gadgets and gizmos a-plenty
We got whozits and whatzits galore...
lol
This isn’t just about choosing the “best AI model.”
It’s the start of the AI Credit Rating System - the Moody’s of the model economy.
✅ Accuracy
⚡️ Efficiency
🛡 Reliability
Built as middleware, it becomes the trust layer.
Whoever builds it? Owns the stack.
Even in-house, the stakes are huge.
Shopify has 8,000+ employees using AI.
What if you knew which models actually delivered - per use case, per team?
A Live Ratings Layer =
🧭 Strategic clarity
📉 Less hype waste
💡 A new internal OS for AI decisions
Moody’s is an interesting analogy here actually. Wonder if we’ll start to see ratings like that for models.
On reflection, in a future where the value of a business relies heavily on the quality of proprietary models, perhaps it is the internal model rating methodology, rather than an external rating that is of value.
That may well be how it shakes out! Work is so varied that the standard benchmarks aren't particularly useful.
It would also be useful to add a category for pricing (free, plus, pro) for those of us who can’t afford to pay for more than one at a time. That said, your little app helped me decide on Claude this month. 😀
Pricing adds a ton of complexity because it changes all the other variables… but I agree it would be helpful.
Ah! I was considering a much simpler concept: simply labeling the models by price level.
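A minimal sketch of that simpler concept, assuming a hypothetical data shape rather than the dashboard's actual code: each model carries a price-tier label, and the ranking is filtered to whatever a given budget can access. The field names, scores, and tier assignments below are illustrative assumptions.

```ts
// Hypothetical shape for the dashboard's model entries; all fields and values
// here are illustrative assumptions, not real ratings.
type PriceTier = "free" | "plus" | "pro";

interface ModelEntry {
  name: string;
  score: number;   // overall weighted score from the dashboard
  tier: PriceTier; // cheapest plan the model is available on
}

const models: ModelEntry[] = [
  { name: "Model A", score: 8.1, tier: "plus" },
  { name: "Model B", score: 8.7, tier: "pro" },
  { name: "Model C", score: 8.4, tier: "free" },
];

// Keep only the models a given budget can access, then rank as usual.
function rankWithinBudget(entries: ModelEntry[], budget: PriceTier): ModelEntry[] {
  const order: PriceTier[] = ["free", "plus", "pro"];
  const limit = order.indexOf(budget);
  return entries
    .filter((m) => order.indexOf(m.tier) <= limit)
    .sort((a, b) => b.score - a.score);
}

console.log(rankWithinBudget(models, "plus")); // Model C first, then Model A
```

A simple label like this sidesteps the "pricing changes all the other variables" problem: it doesn't try to fold cost into the score, it just hides the models a given budget can't reach.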
Learning ChatGPT 04 underperforms ChatGPT 03 is a huge insight for me. The little descriptions in the box where you choose the model on ChatGPT and Gemini don't make it easy to determine which model to use. Your dashboard does. Thank you.
I don't think we can establish that o4 is worse than o3, since the full o4 model is not available yet. We only have o4-mini to choose from.
Also, I didn't see o4-mini in Nate's app, so even comparing o4-mini to o3 in his tool isn't possible yet. Or am I missing something?
So the app shows 4o, not o4. And I stayed away from the mini version because of the complexity of showing so many models.
And I feel good about saying 4o is worse than o3, and I’m sure o4 will be better than o3. What a timeline lol
Yes I know it’s crazy but 4 < 3 in this world. At least for now. Glad it helps!
Nate, one major gap I’ve seen, especially with models like Claude, is reliability and consistency: frequent output truncation, unexpected timeouts, or failing to finish a task are real issues. I’d recommend adding “Reliability/Consistency” as its own category, or giving users the option to define custom categories for the dealbreaker failure modes they experience. If a model can’t deliver a complete, uninterrupted result, it should be scored down, no matter how strong it looks on benchmarks or the other weighted attributes.
Yeah I felt like adding even more categories would be overwhelming, but you're right—that's a separate concern and it's a real issue with Claude, particularly on lower plans.
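As a hedged illustration of the suggestion above (not the dashboard's actual scoring code), here is a small TypeScript sketch of a user-defined "Reliability/Consistency" category combined with a dealbreaker rule that scores a model down when that category falls below a floor. The category names, weights, scores, and penalty factor are all assumptions.

```ts
// Sketch of user-defined categories plus a dealbreaker rule; category names,
// weights, scores, and the penalty factor are assumptions for illustration.
type Scores = Record<string, number>; // 0-10 per category

interface RatedModel {
  name: string;
  scores: Scores;
}

// User-chosen weights, including a custom "reliability" category.
const weights: Record<string, number> = {
  accuracy: 0.4,
  speed: 0.2,
  reliability: 0.4,
};

// Categories the user marks as dealbreakers, and the floor below which
// a model is scored down regardless of its other strengths.
const dealbreakers = new Set(["reliability"]);
const DEALBREAKER_FLOOR = 5;

function personalScore(model: RatedModel): number {
  // Standard weighted sum across whatever categories the user defined.
  let score = 0;
  for (const [category, weight] of Object.entries(weights)) {
    score += weight * (model.scores[category] ?? 0);
  }
  // Dealbreaker rule: truncation or timeouts can't be offset by benchmark strength.
  for (const category of dealbreakers) {
    if ((model.scores[category] ?? 0) < DEALBREAKER_FLOOR) {
      score *= 0.5; // heavy, deliberately arbitrary penalty
    }
  }
  return score;
}

const example: RatedModel = {
  name: "Hypothetical Model",
  scores: { accuracy: 9, speed: 7, reliability: 4 },
};
console.log(personalScore(example).toFixed(2)); // "3.30": penalized despite high accuracy
```

Whether the penalty is a multiplier, a hard cap, or outright exclusion is a design choice; the point is that a dealbreaker category can override an otherwise strong weighted score.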
Great approach, Nate! Thanks for the life preserver! 🛟🛟🛟
Really like how you provide customization that lets me weight things to my needs, use cases, and priorities. That's critical to me and why I switched (primarily) from #Claude to #ChatGPT, given Claude's maddening 😤 shorter “output length limits” and its shorter “time outs and latency controls”.
I think personalizing rankings is really important given how differently we all use these models, but I haven’t really seen a leaderboard that does this.
This is awesome!
This is super cool. Thanks Nate!