8 June 2026 · 9 min read

Why Your RAG Demo Works and Production Doesn't

Almost every RAG system we've been brought in to rescue had a great demo. Someone typed five questions, got five clean answers, and the project got funded. Then it met real users and the answers got confidently wrong. The demo isn't lying — it's just measuring the wrong thing. A demo proves the happy path exists. Production is the long tail, the stale document, the question phrased in a way the builder never imagined.

By Daniel Usvyat · Founder & Principal, USQRD

Key takeaways

→Demos pass because they only ever hit the queries the builder chose — production fails on the long tail nobody tested.
→The four killers are retrieval drift, bad chunk boundaries, stale indexes, and long-tail queries — all invisible in a happy-path demo.
→Retrieval quality, not generation quality, is where most RAG systems actually break — and most teams have no way to measure it.
→An eval set built from real query logs catches these failures before your customer does; vibes-based testing never will.
→Production-ready retrieval means versioned indexes, monitored freshness, and a retrieval eval that runs on every change — not just a vector DB and a prompt.

The Demo Is a Curated Sample of One

Here's the structural problem: whoever builds the demo also chooses the questions. They unconsciously ask things the corpus answers well, phrased the way the documents are phrased. That's not cheating; it's human nature. But it means the demo measures the system only on the narrow slice of inputs it handles best.

Production inverts that. Real users ask the question they have, not the question your documents answer. They use internal jargon, abbreviations, and assumptions the corpus never states explicitly. They ask multi-hop questions that need three documents stitched together. The demo never touched any of this.

We've seen systems demo at what felt like 95% and land at 60% accuracy on real traffic — not because anything broke, but because the real distribution of questions was never tested. The gap between those two numbers is the entire problem, and you can't see it until you measure against real queries.

Four Failure Modes the Demo Will Never Show You

Across our engagements, RAG systems fail in production in remarkably consistent ways. None of them are exotic. All of them are invisible in a five-question demo.

Retrieval drift is the quiet one: as the corpus grows, the vectors that used to surface for a query get crowded out by near-duplicates and newer content. A query that returned the right doc at 1,000 documents returns noise at 50,000. Nothing changed in your code — the index changed underneath you.

Chunk boundaries are where most accuracy quietly leaks. If you split a document mid-table or mid-procedure, retrieval pulls half the answer and the model confidently completes the rest. The demo questions happened to land inside clean chunks. Real questions don't.

→Retrieval drift: relevant docs get crowded out as the corpus grows and near-duplicates accumulate.
→Chunk boundaries: naive fixed-size splitting cuts tables, lists, and procedures in half, so retrieval returns fragments.
→Stale indexes: the source updated, the index didn't, and the model answers from a version of reality that no longer exists.
→Long-tail queries: the 40% of questions phrased in ways the builder never anticipated — where recall collapses silently.
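The crowding effect behind retrieval drift is easy to reproduce at toy scale. The sketch below is illustrative only: the two-dimensional vectors, the document ids, and the `top_k` helper are all made up for the example, and a real index would use an embedding model and an approximate nearest-neighbour search.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query, index, k):
    """Return the ids of the k vectors most similar to the query."""
    ranked = sorted(index, key=lambda doc_id: cosine(query, index[doc_id]), reverse=True)
    return ranked[:k]


query = (1.0, 0.2)

# Hypothetical small corpus: the answer-bearing doc ranks first.
small_index = {
    "refund-policy": (0.9, 0.3),
    "shipping-faq": (0.2, 0.9),
    "careers": (0.0, 1.0),
}

# Same corpus after growth: near-duplicates sit slightly closer to the query.
grown_index = dict(small_index)
for i in range(5):
    grown_index[f"refund-announcement-{i}"] = (1.0, 0.2 + 0.01 * i)

print(top_k(query, small_index, k=3)[0])                  # refund-policy
print("refund-policy" in top_k(query, grown_index, k=3))  # False
```

The retrieval code is identical in both calls. Only the index contents changed, which is why this failure never shows up in a code review.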

Why Retrieval — Not Generation — Is Where It Breaks

Teams obsess over the LLM and the prompt because that's the visible, fun part. But in nearly every broken RAG system we've audited, the generation step was fine. The model was reasoning correctly over the context it was given. The context was just wrong.

If retrieval surfaces the wrong chunks, no amount of prompt engineering saves you — the model is reasoning faithfully over bad inputs. Garbage retrieval, confident garbage out. This is why swapping to a bigger model rarely fixes a RAG problem; you're upgrading the wrong component.

The uncomfortable implication: you have to measure retrieval separately from generation. Most teams only look at the final answer, which conflates two failure modes and tells you nothing about which one to fix. You need to know whether the right document was in the top-k at all — before you ever look at what the model did with it.
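One way to separate the two failure modes is to triage each eval case by checking retrieval first. This is a minimal sketch, not a prescribed harness: the `EvalResult` fields and the `triage` labels are hypothetical names, and it assumes each case has a known answer-bearing chunk id and an already-judged answer.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class EvalResult:
    question: str
    gold_chunk_id: str        # chunk known to contain the answer
    retrieved_ids: list[str]  # ranked ids returned by the retriever
    answer_correct: bool      # final answer judged against a reference


def triage(result: EvalResult, k: int = 5) -> str:
    """Attribute a case to retrieval or generation, checking retrieval first."""
    if result.gold_chunk_id not in result.retrieved_ids[:k]:
        # Counted as a retrieval miss even if the answer happened to be right:
        # the model had no grounded way to get there.
        return "retrieval_miss"
    if not result.answer_correct:
        return "generation_error"
    return "ok"


results = [
    EvalResult("What is the refund window?", "c12", ["c12", "c3"], True),
    EvalResult("Who approves travel over 5k?", "c40", ["c7", "c9"], False),
    EvalResult("Is the Pro plan refundable?", "c12", ["c12", "c5"], False),
]
print(Counter(triage(r) for r in results))
```

A breakdown dominated by `retrieval_miss` points at chunking, indexing, or search; one dominated by `generation_error` points at the prompt or the model.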

How We Catch These Before the Customer Does

The answer is unglamorous: a real eval set. Not vibes, not a demo script — a versioned set of question-answer pairs drawn from actual user queries (or, before launch, from domain experts who ask the way real users will). We typically start with 100–300 cases that deliberately over-sample the long tail and the multi-hop questions.

We score retrieval and generation separately. Retrieval gets recall@k and a relevance judgment: was the answer-bearing chunk actually retrieved? Generation gets faithfulness (did it stay grounded in the context?) and correctness against a reference. Splitting these tells you immediately whether to fix chunking or fix the prompt.
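As a rough sketch of what separate scoring can look like, the snippet below computes recall@k for retrieval and a plain correctness rate for generation. The case format (`retrieved`, `relevant`, `answer_correct`) is an assumption made for the example; faithfulness is left out because it needs a human or model judge that a few lines cannot stand in for.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the answer-bearing chunks that appear in the top k results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("each eval case needs at least one relevant chunk id")
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def score_eval_set(cases, k=5):
    """Report retrieval and generation as two separate numbers."""
    n = len(cases)
    return {
        f"recall@{k}": sum(recall_at_k(c["retrieved"], c["relevant"], k) for c in cases) / n,
        "answer_correct": sum(1 for c in cases if c["answer_correct"]) / n,
    }


# Hypothetical cases: the second one needs two chunks and only one was retrieved.
cases = [
    {"retrieved": ["c7", "c2", "c9"], "relevant": ["c2"], "answer_correct": True},
    {"retrieved": ["c1", "c4", "c5"], "relevant": ["c4", "c8"], "answer_correct": False},
]
print(score_eval_set(cases, k=3))  # {'recall@3': 0.75, 'answer_correct': 0.5}
```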

Then we make the eval run on every change — new chunking strategy, new embedding model, re-indexed corpus. This is the same discipline that separates the projects that ship from the ones that stall, a pattern we've written about in what actually works in enterprise AI. Without it you're flying blind, and the customer becomes your test suite.

→Build the eval set from real query logs, not from the demo script — over-sample the long tail.
→Score retrieval (recall@k, relevance) and generation (faithfulness, correctness) as separate metrics.
→Run the full eval on every index rebuild, embedding swap, or chunking change — treat it like CI.
→Track the eval over time so retrieval drift shows up as a falling number, not a customer complaint.
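Treating the eval like CI can be as simple as comparing each run against a stored baseline and failing the build on a meaningful drop. The sketch below assumes metrics arrive as plain dictionaries; the `check_regression` name, the metric keys, and the 0.02 tolerance are illustrative choices, not recommended values.

```python
def check_regression(baseline, current, tolerance=0.02):
    """Return a message for every metric that fell more than `tolerance` below baseline."""
    failures = []
    for metric, old in baseline.items():
        new = current.get(metric)
        if new is None:
            failures.append(f"{metric}: missing from current run")
        elif new < old - tolerance:
            failures.append(f"{metric}: {old:.3f} -> {new:.3f}")
    return failures


# Hypothetical numbers: a re-index lowered recall while generation held steady.
baseline = {"recall@5": 0.82, "faithfulness": 0.90}
current = {"recall@5": 0.74, "faithfulness": 0.91}

failures = check_regression(baseline, current)
print(failures)  # ['recall@5: 0.820 -> 0.740']
```

In a pipeline, a non-empty `failures` list would exit non-zero and block the deploy. Appending each run's metrics to a log gives the time series that makes drift visible as a falling number.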

What a Production-Ready Retrieval Setup Actually Requires

A vector database plus a prompt is a prototype, not a production system. The difference is everything around the retrieval call that keeps it honest as the corpus and the traffic change.

Chunking has to respect document structure: split on headings, keep tables and procedures intact, and overlap enough that boundary questions still work. Indexes need freshness monitoring: when did each source last sync, and is any answer being served from stale data? We've seen systems confidently cite a policy that was superseded months earlier because nobody monitored the index lag.
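A minimal sketch of structure-aware chunking, under some loud assumptions: the input is Markdown-like text, headings start with `#` and sit in their own blank-line-separated block, and a table or numbered procedure is one contiguous block. The `chunk_document` helper is hypothetical, and overlap between chunks is left out to keep the example short.

```python
def chunk_document(text, max_chars=800):
    """Split on headings, then pack whole blocks into chunks without cutting any block."""
    chunks = []
    heading = ""
    current = []  # blocks accumulated under the current heading

    def flush():
        if current:
            prefix = [heading] if heading else []
            chunks.append("\n\n".join(prefix + current))
            current.clear()

    for block in (b.strip() for b in text.split("\n\n")):
        if not block:
            continue
        if block.startswith("#"):
            flush()          # a new heading always starts a new chunk
            heading = block
            continue
        size = sum(len(b) for b in current) + len(block)
        if current and size > max_chars:
            flush()          # start a new chunk instead of splitting the block
        current.append(block)  # an oversized block is kept whole
    flush()
    return chunks


doc = """# Refunds

Refunds are issued within 14 days.

| Plan | Window |
| --- | --- |
| Basic | 14 days |
| Pro | 30 days |

# Shipping

Orders ship in 2 days."""

for chunk in chunk_document(doc, max_chars=60):
    print(repr(chunk))
```

Each chunk carries its heading, and the table lands in one chunk even though it exceeds `max_chars`. A fixed token window would have cut it wherever the count ran out.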

For ambiguous and long-tail queries, plain semantic search isn't enough. Hybrid retrieval (combining keyword and vector), reranking the top candidates, and sometimes query rewriting close most of the recall gap. None of this is exotic — it's just the work that doesn't fit in a demo timeline, which is exactly why so many demos skip it.
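The article does not say how keyword and vector results are combined. One common choice is reciprocal rank fusion (RRF), sketched here as an assumption rather than as the method used in these engagements. The document ids and the two input rankings are made up; `k=60` is the constant conventionally used with RRF.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked id lists; each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical results for one query from two retrievers.
keyword_hits = ["doc-sku-4471", "doc-returns", "doc-pricing"]
vector_hits = ["doc-returns", "doc-warranty", "doc-sku-4471"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['doc-returns', 'doc-sku-4471', 'doc-warranty', 'doc-pricing']
```

Documents that both retrievers rank well rise to the top, and an exact-match hit that semantic search ranked low still survives into the candidate set a reranker would see.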

→Structure-aware chunking that respects headings, tables, and procedures rather than fixed token windows.
→Hybrid retrieval plus a reranker to rescue the long-tail and keyword-heavy queries semantic search misses.
→Versioned, freshness-monitored indexes so stale data surfaces as an alert, not a wrong answer.
→An eval harness wired into deployment so retrieval quality is a number you watch, not a hope.
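Freshness monitoring can start as a comparison of two timestamps per source. The sketch below assumes you can obtain a last-modified time for each source and a last-sync time for its copy in the index; `find_stale_sources`, the source ids, and the 24-hour lag budget are illustrative, not a recommendation.

```python
from datetime import datetime, timedelta, timezone


def find_stale_sources(source_modified, index_synced, now, max_lag=timedelta(hours=24)):
    """Map source id -> reason for every source the index no longer reflects."""
    stale = {}
    for source_id, modified in source_modified.items():
        synced = index_synced.get(source_id)
        if synced is None:
            stale[source_id] = "never indexed"
        elif modified > synced and now - modified > max_lag:
            # The source changed after the last sync and has waited longer than allowed.
            stale[source_id] = f"changed {now - modified} ago, not re-indexed"
    return stale


# Hypothetical timestamps, all in UTC.
now = datetime(2026, 6, 8, 12, 0, tzinfo=timezone.utc)
source_modified = {
    "hr-policy": now - timedelta(days=3),
    "pricing": now - timedelta(days=10),
    "onboarding": now - timedelta(hours=2),
}
index_synced = {
    "hr-policy": now - timedelta(days=30),  # synced before the latest edit
    "pricing": now - timedelta(days=1),     # synced after the latest edit
}

print(find_stale_sources(source_modified, index_synced, now))
```

Run on a schedule, a non-empty result becomes the alert, so a superseded policy is caught before anyone is answered from it.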

What's Still Genuinely Hard

None of this makes RAG a solved problem, and we'd be lying to say otherwise. Multi-hop reasoning across documents is still fragile: a question that needs three sources retrieved and then synthesised into one answer fails far more often than a single-hop question. Building eval sets is real, ongoing work; query distributions shift as users learn what the system can do, so a static eval decays.

And measuring relevance still requires judgment. LLM-as-judge helps at scale but introduces its own biases, so the highest-stakes evals still need human review. Anyone selling you a fully automated, set-and-forget RAG eval is selling theatre.

The honest position is this: production RAG is a measurement discipline, not a model choice. The teams that succeed aren't the ones with the best embedding model — they're the ones who built the eval that tells them when retrieval breaks, and who treat the index as a living system that needs monitoring like any other. If you're weighing whether to build that discipline in-house or buy it, our build vs buy framework for AI agents walks through the honest costs of each path.

Frequently asked questions

Why does my RAG system work in testing but fail in production?

Because testing usually uses questions the builder chose, which unconsciously match how the documents are written. Production traffic is the long tail — differently phrased, multi-hop, and jargon-heavy queries — where retrieval recall quietly collapses.

How do you evaluate a RAG system properly?

Score retrieval and generation separately: use recall@k and relevance to check whether the right chunk was retrieved, and faithfulness plus correctness to check what the model did with it. Build the eval set from real user queries and run it on every index or model change.

What causes retrieval drift in RAG?

As your corpus grows, near-duplicates and newer content crowd out documents that previously surfaced for a query. Nothing in your code changes — the index distribution shifts underneath you, so recall degrades silently over time unless you track it with an eval.

Is a vector database enough for production RAG?

No. A vector DB plus a prompt is a prototype. Production needs structure-aware chunking, hybrid retrieval with reranking, freshness-monitored indexes, and an eval harness wired into deployment so retrieval quality is a number you watch.
