Agent workflows fail in subtle ways: brittle tool choice, silent retries, degraded explanations, and confident answers built from the wrong retrieval context. A demo rarely exercises these edges for long enough to expose them.
An eval harness makes these failures observable before users find them for you. It is not one benchmark score. It is a repeatable set of cases that represent the workflow you actually care about.
We usually start with historical tickets, support cases, exception records, or analyst notes. The best eval cases are boringly real: missing fields, contradictory instructions, partial data, stale policies, duplicate customers, and requests that should be escalated rather than answered.
Each case needs an expected behaviour, not always an expected sentence. For an operational agent, success might mean choosing the right tool, refusing to act without approval, routing to a human, or attaching the evidence that explains a recommendation.
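One way to encode "expected behaviour, not an expected sentence" is to have each case assert on what the agent did rather than on its wording. The sketch below is illustrative only: `EvalCase`, `AgentResult`, `check_case`, and every field name are hypothetical, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class AgentResult:
    """What the agent actually did on one case (hypothetical shape)."""
    tool_called: Optional[str] = None
    escalated: bool = False
    evidence_ids: Tuple[str, ...] = ()


@dataclass(frozen=True)
class EvalCase:
    """Expected behaviour for one real-world request (hypothetical shape)."""
    case_id: str
    request: str
    expected_tool: Optional[str] = None   # None means "any tool, or no tool"
    must_escalate: bool = False           # the request should go to a human
    required_evidence: FrozenSet[str] = frozenset()


def check_case(case: EvalCase, result: AgentResult) -> List[str]:
    """Return a list of behavioural failures; an empty list means the case passed."""
    failures: List[str] = []
    if case.must_escalate and not result.escalated:
        failures.append("expected escalation to a human")
    if case.expected_tool is not None and result.tool_called != case.expected_tool:
        failures.append(
            f"expected tool {case.expected_tool!r}, got {result.tool_called!r}"
        )
    missing = case.required_evidence - set(result.evidence_ids)
    if missing:
        failures.append(f"missing evidence: {sorted(missing)}")
    return failures


# Usage: a duplicate-customer request that should be escalated with evidence attached.
case = EvalCase(
    case_id="dup-customer-001",
    request="Merge these two accounts",
    must_escalate=True,
    required_evidence=frozenset({"crm:123", "crm:456"}),
)
result = AgentResult(tool_called="merge_accounts", evidence_ids=("crm:123",))
failures = check_case(case, result)
```

Returning a list of reasons instead of a single boolean keeps each failure explainable, which helps when a case fails for more than one reason at once.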
The harness should run during development and again before release. A change to the prompt, model, retriever, tool schema, or ranking logic should be treated like any other code change that can regress production behaviour.
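A minimal sketch of such a release gate, assuming each harness run produces a mapping from case ID to pass or fail. The function names and the 0.9 threshold are placeholders chosen for illustration, not recommendations.

```python
from typing import Dict, List, Tuple


def find_regressions(baseline: Dict[str, bool], candidate: Dict[str, bool]) -> List[str]:
    """Case IDs that passed on the baseline but fail (or are missing) on the candidate."""
    return sorted(
        case_id
        for case_id, passed in baseline.items()
        if passed and not candidate.get(case_id, False)
    )


def release_gate(
    baseline: Dict[str, bool],
    candidate: Dict[str, bool],
    min_pass_rate: float = 0.9,  # assumed threshold; tune per workflow
) -> Tuple[bool, List[str], float]:
    """Block the release if any case regressed or the overall pass rate is too low."""
    regressions = find_regressions(baseline, candidate)
    pass_rate = sum(candidate.values()) / len(candidate) if candidate else 0.0
    ok = not regressions and pass_rate >= min_pass_rate
    return ok, regressions, pass_rate


# Usage: the candidate fixes case "c" but breaks case "b".
baseline = {"a": True, "b": True, "c": False}
candidate = {"a": True, "b": False, "c": True}
ok, regressions, pass_rate = release_gate(baseline, candidate)
```

Tracking regressions per case, rather than only the aggregate pass rate, matters here: in the usage example the pass rate is unchanged between the two runs, yet a previously working case now fails.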
Observability matters after launch. Trace logs, tool calls, refusal reasons, latency, and human override rates tell you whether the agent is becoming more useful or merely more confident. Without these signals, operations teams are left debugging anecdotes.
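As a rough illustration, a few of these signals can be rolled up from per-request trace records. The `Trace` shape and `summarise` function below are hypothetical; a real system would read the same fields from its own tracing or logging backend.

```python
import statistics
from collections import Counter
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Trace:
    """One production request (hypothetical record shape)."""
    latency_ms: float
    human_override: bool                  # a human changed or rejected the agent's output
    refusal_reason: Optional[str] = None  # set only when the agent refused to act


def summarise(traces: List[Trace]) -> Dict[str, Any]:
    """Aggregate override rate, median latency, and refusal reasons for a batch of traces."""
    if not traces:
        raise ValueError("no traces to summarise")
    return {
        "override_rate": sum(t.human_override for t in traces) / len(traces),
        "median_latency_ms": statistics.median([t.latency_ms for t in traces]),
        "refusal_reasons": Counter(
            t.refusal_reason for t in traces if t.refusal_reason is not None
        ),
    }


# Usage with four made-up traces.
traces = [
    Trace(latency_ms=100, human_override=False),
    Trace(latency_ms=300, human_override=True),
    Trace(latency_ms=200, human_override=False, refusal_reason="needs_approval"),
    Trace(latency_ms=400, human_override=True),
]
summary = summarise(traces)
```

Comparing these summaries across releases, not just within one, is what shows whether a change made the agent more useful or only more assertive.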
The goal is not to make LLM systems perfectly deterministic. It is to make their important failures visible, repeatable, and small enough to fix. Production agents deserve the same release discipline as any other operational system.