In our platform, we run a multi-step AI agent that generates content for customers in the sports world. The agent is never fixed: we change the underlying models, reword prompts, restructure workflow steps, adjust the agent logic. Any of those can quietly make the output worse, and for a while the only safeguard was a human, reading generated content and forming an opinion. That doesn’t scale at all, highly subjective and misses slow drift completely.

The obvious instinct is to write tests. But a unit test assumes the same input gives the same output, and an LLM breaks that on the first run. The same prompt produces different valid text every time, so equality checks are useless.
So I built an evaluation suite, on top of an open-source tool called Promptfoo, that stops trying to pin the output and grades it against fixed criteria for what “good” looks like. It works in three layers.
Structural checks, deterministic. Exactly X options, exactly one correct, real sources, output that matches the required schema. Pass or fail, no opinion needed.
Quality, graded by an LLM. This one is more interesting. A set of rubrics, one job each: are the wrong answers plausible but definitely wrong, does the question give away its own answer, does it follow the customer’s rules, is it any fun to answer, is it controversial?
Consistency and drift. The correct answer should not always land on option A, and the style should not wander away from examples we have already signed off.
Non-determinism is handled by running each case a few times and gating on a pass rate, not one lucky result. The judge always stays on the same model, generally the strongest one available. Every run is scored and stored, so I can put two of them side by side: change a prompt, see which rubric moved and by how much. It runs nightly for a baseline, and on demand against anything that touches the agent.
Prompts, model choice and agent parameters aren’t little settings you tweak in isolation. They behave like code. Fix one thing with a reworded prompt and you break another, or the pipeline quietly slows down, and you only notice much later. So they need the same safety net we built for code long ago: a way to test them and catch the regressions automatically.
This article was originally published on LinkedIn on 2026-06-06.