Golden Eval Datasets Rot: How to Keep Yours Honest

Everyone wants to talk about the agent. Almost nobody wants to talk about the 400 carefully labelled examples that tell you whether the agent actually works. Building golden eval datasets is the least glamorous, most consequential work in shipping production AI — and it's where we see the most teams quietly cut corners, then act surprised when their green dashboard stops correlating with reality.
By Daniel Usvyat · Founder & Principal, USQRD
Synthetic Happy-Paths Are a Comfort Blanket, Not an Eval Set
The fastest way to build an eval set is to ask a model to generate one. Prompt GPT for 200 example queries and expected answers, run them, watch the score hit 94%, ship. We've inherited several of these and they share a tell: they only contain the cases the author already imagined the agent handling well. Synthetic data inherits the blind spots of whoever generated it.
Real users do not behave like synthetic users. They paste half a spreadsheet into a chat box. They ask three questions in one sentence. They reference a previous turn that didn't happen. They use your product's internal jargon, misspell the SKU, and abandon mid-flow. None of that shows up when a model invents 'realistic' examples, because the model's idea of realistic is the median case — and the median case was never your problem.
In our work the rule is simple: synthetic data is acceptable for bootstrapping coverage of a known category, never for defining what the agent must handle. The defining examples come from production. If you haven't shipped yet, they come from a closed beta with real humans and instrumented logging — not from a prompt that says 'generate edge cases'.
- Synthetic data is fine to pad coverage of a category you've already identified.
- It is never fine to use synthetic data to discover which categories exist.
- If your eval set scores higher than your real users' satisfaction, your eval set is fiction.
Mining Edge Cases from Prod Traffic Is the Actual Job
The valuable examples are the ones that broke something. We pull eval candidates from three streams of production traffic: explicit negative signals (thumbs-down, escalations to a human, abandoned sessions), implicit ones (retries, rephrasings, conversations that loop), and the long tail of inputs that the agent handled but that no test ever covered. That third bucket is the dangerous one — it's working today, untested, one prompt change away from breaking.
This is why the eval harness is the real deliverable, not the agent. The harness is what turns a stream of messy production traces into labelled, replayable cases. Without instrumentation that captures the full input — context window, tool calls, retrieved chunks, the lot — you can't reconstruct a failure, and a failure you can't reconstruct can't become a regression test.
A practical heuristic: every production incident becomes a golden case before it's considered closed. Someone reports the agent did something dumb, you reproduce it, you label the correct behaviour, you add it to the set, and only then do you fix it. This is the same discipline as a regression test in normal engineering — it just feels heavier because labelling agent behaviour is genuinely harder than asserting a return value.
- Negative signals: thumbs-down, human escalation, abandonment.
- Implicit friction: retries, rephrasings, looping conversations.
- The untested long tail: inputs that work today but no eval covers.
- Every reproduced incident becomes a labelled case before the fix ships.
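The triage and promotion steps above can be sketched in a few lines of Python. This is a minimal illustration, not a description of any particular harness: the `Trace` and `GoldenCase` schemas, every field name, and the `covered_by_eval` flag (assumed to be filled in by a separate coverage check) are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Trace:
    """One captured production interaction (hypothetical schema)."""
    trace_id: str
    user_input: str
    thumbs_down: bool = False
    escalated: bool = False
    abandoned: bool = False
    retries: int = 0
    rephrasings: int = 0
    looped: bool = False
    covered_by_eval: bool = False  # assumption: set by a separate coverage check


def candidate_stream(trace: Trace) -> Optional[str]:
    """Name the stream a trace falls into, or None if it is not a candidate."""
    if trace.thumbs_down or trace.escalated or trace.abandoned:
        return "negative_signal"
    if trace.retries > 0 or trace.rephrasings > 0 or trace.looped:
        return "implicit_friction"
    if not trace.covered_by_eval:
        return "untested_long_tail"
    return None


@dataclass
class GoldenCase:
    """A labelled, replayable case derived from a production trace."""
    case_id: str
    user_input: str
    expected_behaviour: str  # written by a human labeller, never inferred
    source_trace_id: str
    stream: str


def promote(trace: Trace, expected_behaviour: str) -> GoldenCase:
    """Turn a reproduced incident into a golden case before the fix ships."""
    stream = candidate_stream(trace)
    if stream is None:
        raise ValueError(f"trace {trace.trace_id} is not an eval candidate")
    return GoldenCase(
        case_id=f"golden-{trace.trace_id}",
        user_input=trace.user_input,
        expected_behaviour=expected_behaviour,
        source_trace_id=trace.trace_id,
        stream=stream,
    )
```

A real trace would also carry the context window, tool calls, and retrieved chunks so the case can be replayed. They are left out here only to keep the sketch short.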
Why Golden Sets Rot Within Months
Here's the uncomfortable truth: a golden set is decaying from the day you freeze it, for three independent reasons. First, the world changes — a set built around last quarter's product, pricing, or policy starts asserting answers that are now wrong. We've watched a support agent get marked 'incorrect' on cases where the agent was right and the golden label was stale.
Second, the prompt evolves. You rewrite the system prompt to handle a new tool, and suddenly the agent phrases answers differently. If your evals do exact-match or rigid scoring, the new phrasing fails cases that are behaviourally correct. The eval set was written against a prompt that no longer exists.
Third, labels drift because labelling is subjective. Two annotators disagree on what 'good' looks like, the rubric shifts informally over six months, and nobody re-checks the old labels against the current standard. Label quality decays silently — there's no error message for 'this label was correct in March and wrong now.'
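The second failure mode, rigid scoring that breaks on a rephrased but correct answer, can be shown with a toy comparison. The sketch below is only an illustration: the substring checks are a crude stand-in for rubric-based or judge-based scoring, and the function names and example strings are invented for this example.

```python
def exact_match(output: str, expected: str) -> bool:
    """Rigid scoring: passes only if the wording is identical."""
    return output.strip() == expected.strip()


def behavioural_match(
    output: str, must_include: list[str], must_not_include: list[str]
) -> bool:
    """Looser scoring: checks for required and forbidden content, not wording."""
    text = output.lower()
    has_required = all(s.lower() in text for s in must_include)
    has_forbidden = any(s.lower() in text for s in must_not_include)
    return has_required and not has_forbidden


# Golden label written against the old prompt.
expected = "Your refund will arrive in 5 business days."

# Output after a system-prompt rewrite: same facts, different phrasing.
new_output = "Refunds take 5 business days to reach your account."

rigid = exact_match(new_output, expected)
behavioural = behavioural_match(
    new_output,
    must_include=["refund", "5 business days"],
    must_not_include=["cannot refund"],
)
print(rigid, behavioural)
```

The rigid check fails the rephrased answer while the behavioural check passes it, which is the gap that opens when the prompt moves and the scoring does not.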
Version Eval Sets Alongside Prompts — Always
The single most important pattern we apply: eval sets are versioned artefacts that live next to the prompts and code they evaluate, in the same repo, under the same review. When you change a prompt, the PR shows which eval revision it was scored against. A score without a dataset version attached is meaningless — you literally cannot tell whether the number moved because the agent got better or because someone edited the test.
Concretely, every eval run records three things: the prompt/agent version, the eval-set version, and the scoring rubric version. A green dashboard is only trustworthy when all three are pinned and visible. This is also how you make refreshes safe — you can re-run an old agent against a new eval set, or a new agent against the old set, and reason precisely about what changed.
This is the same insider-threat mindset we apply elsewhere: treat the agent's behaviour as something to be continuously verified, not trusted. The eval set is your verification surface. If it isn't versioned, your verification has no audit trail.
- Eval sets live in the same repo as prompts, reviewed in the same PR.
- Every run pins three versions: agent, dataset, rubric.
- Never compare two scores without confirming the dataset version matched.
- A score with no dataset version attached is noise, not signal.
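One way to make the three-version rule hard to skip is to bake it into the run record itself. The sketch below is a hedged illustration: `EvalRun`, `dataset_version`, and `score_delta` are hypothetical names, and using a content hash as the dataset version is an assumption (a git commit SHA of the eval directory would serve the same purpose).

```python
import hashlib
import json
from dataclasses import dataclass


def dataset_version(cases: list[dict]) -> str:
    """Content hash of the eval set, so any edit to a case changes the version.

    Key order inside a case does not matter; the order of cases does.
    """
    canonical = json.dumps(cases, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class EvalRun:
    """A score is only stored together with the three versions it depends on."""
    agent_version: str
    dataset_version: str
    rubric_version: str
    score: float


def score_delta(before: EvalRun, after: EvalRun) -> float:
    """Compare two runs, refusing when the test itself has changed."""
    if before.dataset_version != after.dataset_version:
        raise ValueError("dataset versions differ; scores are not comparable")
    if before.rubric_version != after.rubric_version:
        raise ValueError("rubric versions differ; scores are not comparable")
    return after.score - before.score
```

Raising instead of warning is a deliberate choice in this sketch: a comparison across dataset versions should need an explicit re-baseline, not a shrug.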
The Refresh Cadence That Doesn't Invalidate History
Refreshing naively destroys your ability to track progress over time — if you swap out half the cases every month, last month's 88% and this month's 88% measure different things. The fix is to partition the set. We keep a stable regression core — cases that should always pass, changed only deliberately when behaviour genuinely should differ — and a rolling fresh slice of recent production edge cases that grows the set's coverage.
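A partitioned set only helps if the two partitions are reported separately. Here is a minimal sketch, assuming each result row carries a hypothetical `partition` field (`"core"` or `"rolling"`) and a boolean `passed` field:

```python
from collections import defaultdict


def pass_rate_by_partition(results: list[dict]) -> dict[str, float]:
    """Pass rate per partition, so core and rolling scores never get blended."""
    passed: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for row in results:
        total[row["partition"]] += 1
        passed[row["partition"]] += int(row["passed"])
    return {name: passed[name] / total[name] for name in total}


results = [
    {"case_id": "core-001", "partition": "core", "passed": True},
    {"case_id": "core-002", "partition": "core", "passed": True},
    {"case_id": "roll-101", "partition": "rolling", "passed": True},
    {"case_id": "roll-102", "partition": "rolling", "passed": False},
]
rates = pass_rate_by_partition(results)
print(rates)
```

The core rate is the number that stays comparable from month to month. The rolling rate shows how the agent handles what production surfaced most recently, and is expected to move as the slice changes.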
The cadence we use across engagements: review new candidate cases weekly, formally version a refresh monthly, and audit existing labels against the current rubric quarterly. The quarterly label audit is the step everyone skips and the one that prevents silent rot: you re-label a random sample of old cases and measure how far the labels have drifted from your current standard. Because the sample is random, high drift in the sample means the rest of the set has probably drifted too, so the whole set needs re-auditing, not just the sampled cases.
When you do change the core, treat it like a schema migration: bump the dataset version, document what changed and why, and re-baseline. Your historical chart should show a clearly marked discontinuity at the version boundary, not a misleading smooth line. Honest discontinuity beats fake continuity every time.
- Weekly: triage new candidate cases from prod traffic.
- Monthly: version a refresh, adding to the rolling slice.
- Quarterly: audit a sample of old labels against the current rubric.
- Treat core changes as migrations: bump the version and re-baseline.
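The quarterly label audit reduces to two small operations: draw a reproducible random sample of old cases, then measure how often a fresh label disagrees with the stored one. The function names and the pass/fail label format below are assumptions for illustration, and what counts as "high" drift is left to the team because no threshold is given here.

```python
import random


def audit_sample(case_ids: list[str], sample_size: int, seed: int = 0) -> list[str]:
    """Pick a reproducible random sample of case ids to re-label."""
    rng = random.Random(seed)  # seeded so the audit can be re-run and reviewed
    return rng.sample(case_ids, min(sample_size, len(case_ids)))


def label_drift(stored: dict[str, str], fresh: dict[str, str]) -> float:
    """Fraction of re-labelled cases whose fresh label differs from the stored one."""
    if not fresh:
        raise ValueError("no re-labelled cases to compare")
    changed = sum(1 for case_id, label in fresh.items() if stored[case_id] != label)
    return changed / len(fresh)


stored_labels = {"a": "pass", "b": "fail", "c": "pass", "d": "pass"}

# Labels produced this quarter by annotators using the current rubric.
fresh_labels = {"a": "pass", "b": "pass"}

drift = label_drift(stored_labels, fresh_labels)
print(drift)
```

Recording the seed and the rubric version next to the drift figure keeps the audit itself versioned, in the same spirit as the eval runs.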
The Failure: When Teams Stop Trusting Their Own Green Dashboard
The worst case we've seen wasn't a team with no evals. It was a team that built a solid eval set, hit green, and then froze it — no refresh cadence, no label audits, no versioning against prompt changes. For a while the dashboard stayed green while real users got steadily worse experiences, because the set no longer represented production. The gap between the green number and reality grew quietly for months.
Then something subtle happens that's hard to recover from: engineers notice the dashboard is green during an incident. They learn the dashboard lies. So they stop looking at it. Now you have all the cost of maintaining evals and none of the benefit — a green light nobody believes, which is functionally identical to having no evals at all, except more expensive and more falsely reassuring.
Rebuilding trust is slower than building it the first time. You have to re-source cases from current traffic, re-audit every label, re-version everything, and then deliberately catch a real regression with the refreshed set so the team sees the dashboard go red when it should. Trust in an eval set is earned by it being right when it matters — and you only get to demonstrate that during an actual failure.
What's Still Hard, and Where to Start
None of this is fully solved. Labelling subjective agent behaviour at scale is still genuinely hard — LLM-as-judge helps with throughput but introduces its own drift, because the judge model changes underneath you too, and now you have a second artefact to version. Capturing complete production traces for multi-step agents with tool calls is an engineering investment most teams underestimate. And there's no clean answer to how large a regression core should be; too small and it's brittle, too large and it's expensive to keep honest.
What we are confident about is the direction: real cases over synthetic, versioned sets over frozen ones, partitioned refresh over wholesale swaps, and label audits as a standing ritual rather than a one-off. The teams that ship reliable agents aren't the ones with the cleverest prompts; they're the ones who would bet the quarter on their green dashboard. The same pattern explains why RAG demos work and production doesn't: the demo is never tested against the cases that break it.
If you're staring at a green dashboard and a quiet unease that it doesn't match what your users feel, that gap is the thing to investigate first. Start by pulling 50 recent thumbs-downs and checking how many your eval set would have caught.
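A rough first pass at that check might look like the sketch below. It assumes a human has already tagged each thumbs-down with a failure category, and that eval cases carry the same category tags. Both the tagging scheme and the category names are hypothetical, and a category match is only a loose proxy: an eval case in the right category does not guarantee it would have caught that specific failure.

```python
def coverage_of_failures(
    failure_categories: list[str], eval_categories: set[str]
) -> tuple[float, list[str]]:
    """Share of reviewed failures whose category the eval set covers at all,
    plus the sorted list of categories with no eval case."""
    if not failure_categories:
        raise ValueError("no failures to check")
    caught = sum(1 for c in failure_categories if c in eval_categories)
    uncovered = sorted({c for c in failure_categories if c not in eval_categories})
    return caught / len(failure_categories), uncovered


# One category tag per reviewed thumbs-down (illustrative values).
thumbs_down_categories = [
    "multi_question",
    "pasted_table",
    "multi_question",
    "typo_sku",
]
eval_set_categories = {"multi_question"}

share, missing = coverage_of_failures(thumbs_down_categories, eval_set_categories)
print(share, missing)
```

The list of uncovered categories is the useful output: each entry is a place where production is failing and the dashboard cannot see it.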
Frequently asked questions
How often should you update an eval dataset for an AI agent?
Triage new production cases weekly, version a refresh monthly, and audit existing labels against your current rubric quarterly. The quarterly label audit is the step most teams skip and the one that catches silent rot.
Why do golden eval sets become unreliable over time?
Three reasons: the world changes so labels go stale, the prompt evolves past the examples it was scored against, and human labels drift as the implicit rubric shifts. None of these throw an error, so the decay is invisible until your dashboard stops matching reality.
Should eval datasets be versioned with prompts?
Yes — always. Eval sets should live in the same repo as your prompts and every run should pin the agent version, dataset version, and rubric version. A score without an attached dataset version is meaningless because you can't tell if the agent improved or someone just edited the test.
Can you use synthetic data to build an AI eval set?
Use it only to pad coverage of categories you've already identified from real traffic, never to discover what categories exist. Synthetic data inherits the blind spots of whoever generated it, so it produces a reassuringly green dashboard that misses the messy inputs real users actually send.

