Field Notes · 15 June 2026 · 9 min read

The Eval Harness Is the Deliverable, Not the Agent

Most AI projects treat evaluation as the thing you bolt on after the agent works. That's backwards. With agentic systems — multi-step, tool-calling, non-deterministic — you can't eyeball correctness, and 'it looked good in the demo' is not a release gate. The eval harness is the deliverable. The agent is just the thing it scores.

By Daniel Usvyat · Founder & Principal, USQRD

Why the Demo Lies and the Harness Tells the Truth

A demo is a single happy-path trace under controlled conditions. An agent in production faces inputs the demo never had: malformed data, ambiguous instructions, tools that time out, retrieval that returns the wrong chunk. The demo tells you the system *can* work once. It tells you nothing about how often it works, or what happens at the edges.

This is the same gap we wrote about in why RAG demos work and production doesn't: retrieval drift, stale indexes, and chunk boundaries don't show up in a curated demo. With agents the surface area is larger because every additional step compounds the failure probability. If steps fail independently, a four-step agent at 90% per-step reliability completes only about 66% of tasks end to end (0.9^4 ≈ 0.66), and by seven steps it is worse than a coin flip.
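The compounding is easy to tabulate. The sketch below assumes every step succeeds independently with the same probability, which real agents rarely satisfy exactly, so read the numbers as a rough guide. The function name is illustrative.

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability")
    return per_step ** steps


# How quickly 90% per-step reliability erodes as the chain grows.
for steps in (1, 4, 7, 10):
    rate = end_to_end_success(0.90, steps)
    print(f"{steps:>2} steps -> {rate:.1%} end-to-end")
```

At 90% per step, four steps come out at roughly two in three tasks, and the chain drops below one in two at seven steps.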

The harness exists to make that compounding visible and measurable. It's the difference between 'we think it's better' and 'pass rate went from 71% to 88% on the regression set, p95 latency held at 4.2s, and cost per task dropped 12%.' One of those sentences you can take to a stakeholder. The other gets your project killed in month four.


The Four Layers of a Real Harness

A real eval harness isn't a notebook with twenty test cases. Across our engagements, the ones that hold up under iteration have four distinct layers, each catching a different class of failure.

Skip any one of them and you create a blind spot you'll discover in production — usually after a stakeholder does.

→Golden sets — a curated, human-labelled set of inputs with known-good outputs or acceptance criteria. This is your ground truth. It's slow to build and the most valuable thing you own. 50 well-chosen cases beat 500 scraped ones.
→Regression suites — every bug you fix and every edge case you hit becomes a permanent test. The suite grows monotonically. This is what lets you change a prompt or swap a model without silently breaking last month's fix.
→LLM-as-judge — automated scoring for outputs too open-ended for exact match. Powerful for scale, but it is itself a model with its own error rate and biases. Treat its scores as estimates, not facts.
→Cost and latency gates — hard thresholds in the pipeline. A change that improves accuracy 3% but doubles p95 latency or triples token spend should fail the gate, not ship quietly.
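The fourth layer is the simplest to make concrete. The following is a minimal sketch of a gate check, not a prescribed implementation: the class names, field names, and threshold values are all hypothetical, and your own pipeline decides where the metrics come from.

```python
from dataclasses import dataclass


@dataclass
class RunMetrics:
    pass_rate: float          # fraction of eval cases passed, 0.0 to 1.0
    p95_latency_s: float      # 95th-percentile task latency, in seconds
    cost_per_task_usd: float  # mean spend per task


@dataclass
class Gates:
    min_pass_rate: float
    max_p95_latency_s: float
    max_cost_per_task_usd: float


def p95(latencies: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list of samples."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def check_gates(metrics: RunMetrics, gates: Gates) -> list[str]:
    """Return readable gate failures; an empty list means every gate cleared."""
    failures = []
    if metrics.pass_rate < gates.min_pass_rate:
        failures.append(
            f"pass rate {metrics.pass_rate:.1%} < {gates.min_pass_rate:.1%}")
    if metrics.p95_latency_s > gates.max_p95_latency_s:
        failures.append(
            f"p95 latency {metrics.p95_latency_s:.2f}s > {gates.max_p95_latency_s:.2f}s")
    if metrics.cost_per_task_usd > gates.max_cost_per_task_usd:
        failures.append(
            f"cost ${metrics.cost_per_task_usd:.3f} > ${gates.max_cost_per_task_usd:.3f}")
    return failures


# Hypothetical thresholds: the change is more accurate but blows the latency gate.
gates = Gates(min_pass_rate=0.85, max_p95_latency_s=5.0, max_cost_per_task_usd=0.10)
better_but_slow = RunMetrics(pass_rate=0.91, p95_latency_s=8.4, cost_per_task_usd=0.08)
print(check_gates(better_but_slow, gates))
```

Returning a list of failures rather than a single boolean keeps the CI log readable: the person whose change was blocked sees which gate tripped and by how much.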

LLM-as-Judge Is Useful and Also a Trap

LLM-as-judge is what makes evals scale for open-ended outputs — summaries, multi-turn conversations, tool-use reasoning where there's no single correct string. Without it you're stuck hand-grading every run, which means you grade nothing once the novelty wears off.

But the judge is a model, and it inherits every weakness of a model. It's biased toward longer answers. It's swayed by surface fluency over correctness. It drifts when the provider updates the underlying model under you. We've seen judge scores move several points with no change to the agent at all — purely because the judge model changed beneath the eval.

The discipline that makes it trustworthy: calibrate the judge against a human-labelled subset and track agreement as a metric in its own right. If judge-human agreement drops below your threshold, the judge is broken and its scores are noise. Pin model versions where you can, use structured rubrics over vibes-based scoring, and never let a judge grade a dimension you haven't checked it can actually grade.

→Calibrate against human labels and treat agreement rate as a first-class metric.
→Use explicit rubrics, not 'rate this 1-10' — narrow questions get reliable answers.
→Pin judge model versions and re-baseline when you can't.
→Never trust a judge on a dimension you haven't verified it can score.
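Tracking judge-human agreement can be as small as the sketch below. It computes raw agreement plus Cohen's kappa, a standard chance-corrected statistic that we add here for illustration; the article itself only asks for an agreement metric. The labels, the function names, and the 0.85 threshold are assumptions.

```python
from collections import Counter


def agreement_rate(human: list[str], judge: list[str]) -> float:
    """Fraction of cases where the judge's label matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)


def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement corrected for what two raters would match on by chance."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    expected = sum(
        human_counts[label] * judge_counts[label] for label in human_counts
    ) / (n * n)
    if expected == 1.0:  # both raters used one identical label throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical calibration subset: human labels vs. judge labels per case.
human = ["pass", "pass", "pass", "pass", "fail", "fail", "fail", "fail"]
judge = ["pass", "pass", "pass", "fail", "fail", "fail", "fail", "pass"]

MIN_AGREEMENT = 0.85  # assumed threshold; pick yours from your own data
rate = agreement_rate(human, judge)
print(f"agreement={rate:.2f} kappa={cohens_kappa(human, judge):.2f}")
if rate < MIN_AGREEMENT:
    print("judge below calibration threshold: treat its scores as noise")
```

Kappa matters when one label dominates: a judge that says "pass" to everything can post high raw agreement on a mostly-passing set while carrying no signal.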

Why Teams Skip It — and What It Costs Later

Teams skip the harness for understandable reasons. It's invisible in a demo, so it doesn't impress the people approving budget. It's slow and unglamorous to build — labelling golden sets is real work. And in the early phase the agent is changing so fast that tests feel like they'll just be thrown away.

The cost arrives later and compounds. Without a regression suite, every prompt tweak is a gamble — you fix one case and silently break three. Without golden sets, 'better' is a matter of opinion, so iteration becomes argument. Without cost gates, spend creeps until finance asks why the agent costs more than the team running it. This is a big part of why pilots die in month four: not a technical wall, but the loss of confidence that comes from not being able to prove the thing is improving.

The cruel irony is that the teams moving fastest in the demo phase are often the ones who stall hardest in production — because they have no instrumentation to debug *why* it broke, and no safety net to change anything without fear.

How We Make the Harness the Backbone of Delivery

In our work we build the harness before the agent is anything more than a stub. The first golden set gets written from the actual acceptance criteria the stakeholder cares about, in week one. It's small and ugly and it grows every week. By the time the agent works, the thing that proves it works already exists.

From there, every change runs through the harness in CI. A prompt edit, a model swap, a new tool — none of it ships without passing the regression suite and clearing the cost and latency gates. This is what lets a team iterate aggressively without fear, which is the whole point. The harness isn't a brake; it's what lets you go fast safely. It's a core reason enterprise AI projects with a defined shape actually ship.
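One way to wire the regression suite into CI is to compare per-case results between the last accepted run and the candidate, and fail the build on any case that used to pass and no longer does. This is a sketch under assumptions: the case IDs, the dict-of-booleans result format, and the function names are all hypothetical.

```python
def diff_runs(
    baseline: dict[str, bool], candidate: dict[str, bool]
) -> tuple[list[str], list[str]]:
    """Compare per-case pass/fail results between two runs.

    A case missing from the candidate run counts as a regression, so a
    test cannot disappear quietly.
    """
    regressions = sorted(
        case for case, passed in baseline.items()
        if passed and not candidate.get(case, False)
    )
    newly_passing = sorted(
        case for case, passed in candidate.items()
        if passed and not baseline.get(case, False)
    )
    return regressions, newly_passing


def ci_exit_code(baseline: dict[str, bool], candidate: dict[str, bool]) -> int:
    """Return 1 (fail the build) if any previously passing case now fails."""
    regressions, newly_passing = diff_runs(baseline, candidate)
    print(f"newly passing: {newly_passing}")
    print(f"regressions:   {regressions}")
    return 1 if regressions else 0


# Hypothetical results: the prompt edit fixed one case and broke another.
baseline = {"refund-partial": True, "date-ambiguous": True, "tool-timeout": False}
candidate = {"refund-partial": True, "date-ambiguous": False, "tool-timeout": True}
print("exit code:", ci_exit_code(baseline, candidate))
```

Comparing case by case, rather than comparing aggregate pass rates, is the design choice that matters here: in the example the pass rate is unchanged, yet a previously fixed case has broken.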

The harness also outlives the engagement. When we hand a system over, the team inherits the agent *and* the instrument that tells them whether their next change made it better or worse. That's the difference between owning a system and owning a black box you're afraid to touch.

What's Still Hard — and the Honest Caveats

None of this is solved cleanly. Building golden sets for genuinely subjective tasks — tone, judgment, taste — is still mostly human labour, and it doesn't scale the way anyone wants. Evaluating multi-turn agentic trajectories, where the right next action depends on state three steps back, is an open problem; most teams (us included) approximate it with checkpoint scoring and accept the gaps.

Judge drift from provider model updates is a real operational tax. And there's a quieter failure mode: a harness can give false confidence if your golden set doesn't represent production traffic. An eval suite that's 95% green on cases that never occur in the wild is worse than no suite, because it manufactures trust you haven't earned. The fix is unglamorous — sample real production traffic, label it, feed it back into the golden set continuously.
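For the sampling step, one option is reservoir sampling, which draws a fixed-size uniform sample from a traffic stream of unknown length without holding the stream in memory. This is our illustrative choice, not a method the article prescribes, and the function name and request strings are hypothetical.

```python
import random
from typing import Iterable, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> list[T]:
    """Uniformly sample up to k items from a stream (Algorithm R)."""
    if k < 0:
        raise ValueError("k must be non-negative")
    rng = random.Random(seed)
    sample: list[T] = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)       # fill the reservoir first
        else:
            j = rng.randint(0, i)     # inclusive on both ends
            if j < k:
                sample[j] = item      # keep the new item with probability k/(i+1)
    return sample


# Hypothetical usage: pull 5 requests from a day's traffic for human labelling.
traffic = (f"request-{n}" for n in range(10_000))
to_label = reservoir_sample(traffic, k=5, seed=42)
print(to_label)
```

A uniform sample mirrors the volume of production traffic, so rare edge cases will be underrepresented. In practice you would likely pair it with targeted sampling of failures or low-confidence runs.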

The thesis holds regardless. For agentic systems the harness is the deliverable because it's the only thing that converts 'seems to work' into a defensible number, and the only thing that lets you keep changing the system without breaking it. Build it first. Everything else is downstream of it.

Frequently asked questions

What is an eval harness for an AI agent?

It's the system that scores your agent's behaviour automatically and repeatably — golden sets of known-good cases, a growing regression suite, LLM-as-judge for open-ended outputs, and cost/latency gates. It's what lets you prove a change made the agent better rather than just different.

Is LLM-as-judge reliable enough to use in production evals?

It's reliable enough to scale evaluation, but only if you calibrate it against human-labelled cases and track judge-human agreement as its own metric. Treat its scores as estimates, pin model versions, and use narrow rubrics rather than vague 1-10 ratings.

How many test cases do I need in a golden set?

Fewer than you think, if they're well chosen. 50 cases that mirror real production traffic and edge cases beat 500 scraped ones — representativeness matters far more than volume, and a green suite of unrepresentative cases is actively dangerous.

Should we build the eval harness before or after the agent?

Before. Write the first golden set from your actual acceptance criteria in week one, then grow it as you build. By the time the agent works, the instrument that proves it works already exists — and you can iterate without fear of silent regressions.
