You "Otter" Do AI Evals—Here's What They Are, How to Get Started, and How Not to Fail
Evals is just short for evaluations: structured tests that check whether your AI actually delivers the performance it claims. This post lays out the path to avoid a "D" on your own AI builds!
Yes yes the pun is awful lol
I feel like only developers talk about evals. And the rest of us are missing out.
Because we hear “evaluation” and we think “boring developer test” when we should be thinking “hey, I’d like to not have catastrophic business risk from AI.”
Yes, it’s that serious. Don’t believe me? Check out the billions evaporated by eval-related issues in this table.
[Table: very large companies that had very public eval-related issues]
Now you might be tempted to throw up your hands and say: “If publicly traded companies screw this up, how can we hope to get it right? Why bother?”
I will tell you: the answer to “evals are hard and risk is high” is not “we shouldn’t do evals.” The risk from smarter systems is high enough that working on evals matters even when you know they won’t be perfect. You’d rather have less risk than go full risk-on and YOLO your prompts into the void. Would you let your software team yeet stuff into production at 4:30 PM on a Friday with zero testing?
Same deal here. Evals matter. Read on for a practical way to go from panic to productive with your eval setup, including a framework to assess risk and eval how-tos matched to various risk profiles. You know, the kind of nerdy stuff that makes my nerdy heart sing lol
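To make “structured test” concrete before we go further: at its simplest, an eval is just a set of prompts, expected behaviors, a grader, and a pass-rate gate. Here’s a minimal sketch in Python. Everything in it is hypothetical (the stub `fake_model`, the cases, the 90% threshold); a real setup would call your actual AI system and use graders suited to your risk profile.

```python
# Minimal eval harness sketch. All names and cases are illustrative;
# swap fake_model() for a call to your real AI system.

def fake_model(prompt: str) -> str:
    # Stub standing in for your model: returns canned answers.
    canned = {
        "What is 2 + 2?": "4",
        "Capital of France?": "Paris",
        "Is the sky green?": "No",
    }
    return canned.get(prompt, "I don't know")

def grade(output: str, expected: str) -> bool:
    # Simple grader: case-insensitive substring match.
    # Real evals often need stricter or rubric-based grading.
    return expected.lower() in output.lower()

# Each case pairs a prompt with what a correct answer must contain.
EVAL_CASES = [
    ("What is 2 + 2?", "4"),
    ("Capital of France?", "Paris"),
    ("Is the sky green?", "No"),
]

def run_evals(model, cases):
    # Run every case through the model and return the pass rate (0.0 to 1.0).
    results = [grade(model(prompt), expected) for prompt, expected in cases]
    return sum(results) / len(results)

if __name__ == "__main__":
    pass_rate = run_evals(fake_model, EVAL_CASES)
    print(f"Pass rate: {pass_rate:.0%}")
    # The gate: block the release if quality drops below a threshold you choose.
    assert pass_rate >= 0.9, "Eval gate failed: do not ship!"
```

The point isn’t the ten lines of plumbing; it’s the habit of running this gate before anything reaches users, the same way tests gate a software release.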