Put your LLM output under test, or every prompt tweak is a silent regression

Sat, 06 Jun 2026 08:00:00 +0200

On our platform, we run a multi-step AI agent that generates content for customers in the sports world. The agent is never fixed: we change the underlying models, reword prompts, restructure workflow steps, and adjust the agent logic. Any of those can quietly make the output worse, and for a while the only safeguard was a human reading generated content and forming an opinion. That doesn’t scale, is highly subjective, and misses slow drift completely.

Promptfoo on Egor Lynko
