Field Notes · 3 July 2026 · 10 min read

Why LLM Confidence Scores Lie — and Break Your Human Gate

Most production agents we inspect have a human-in-the-loop gate wired to the model's own confidence — a self-reported score, or a logprob average dressed up as one. The assumption is that the model knows when it's unsure. It doesn't. LLM confidence is systematically miscalibrated, and when you gate human review on a number that lies, you build a review process that escalates the easy cases and waves the dangerous ones through. The gate feels like a safety control. It's mostly theatre.

By Daniel Usvyat · Founder & Principal, USQRD

Key takeaways

01 LLMs are overconfident: a model saying "95% sure" is often right 70% of the time, and logprob-derived scores are no better once you fine-tune or RAG-ground the model.
02 Naive confidence gating fails silently: it routes the wrong things to humans and lets confidently-wrong answers through autonomously.
03 The cheap calibration moves that worked for us: abstention thresholds tuned on a reliability diagram, ensemble disagreement, and retrieval-grounding checks. Temperature scaling alone didn't.
04 Stop measuring agents by accuracy. Measure abstention quality: coverage at a fixed error rate, and the cost-weighted trade-off between false autonomy and needless escalation.
05 Let an agent act autonomously only where the calibrated error rate is below the error rate you can tolerate given the cost of being wrong; everything else routes to a human.

The Number You're Gating On Is Lying to You

There are two ways teams extract confidence from an LLM, and both are flawed. The first is to ask the model — "rate your confidence 0–100." The second is to derive a score from token logprobs, usually the mean log-probability of the generated answer. Neither tracks the probability that the answer is actually correct.
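To make the second approach concrete, here is a minimal sketch of a logprob-derived score. It assumes you already have per-token log-probabilities for the generated answer; how you obtain them depends on your provider, and `token_logprobs` is a hypothetical input, not a specific API field.

```python
import math


def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """Map the mean token log-probability to a 0-1 score.

    exp(mean logprob) is the geometric mean of the token probabilities.
    It measures how likely the model found its own wording, which is a
    fluency signal, not the probability that the answer is correct.
    """
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical logprobs for a short answer where every token was "easy".
score = mean_logprob_confidence([-0.05, -0.02, -0.10, -0.03])
print(round(score, 3))
```

Nothing in this calculation looks at whether the answer is true, which is why a fluent wrong answer can score as high as a fluent right one.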

Self-reported confidence is anchored to surface fluency, not correctness. A model will say "I'm 95% confident" about a hallucinated citation in exactly the same tone it uses for a verified fact. In one document-extraction engagement, the agent's self-rated confidence sat above 90 on roughly 80% of outputs — and was wrong on about a quarter of those. That's not a calibration wobble; that's a number with almost no signal in the range you care about.

Logprobs are subtler but break in their own way. Once you RAG-ground or fine-tune a model, the output distribution sharpens and logprobs collapse toward confident-looking values regardless of whether the retrieved context actually supports the answer. The model is fluent about the wrong thing. As we've written in "Why RAG demos work and production doesn't", retrieval drift makes this worse over time: the logprob stays high while the grounding rots underneath it.

What a Reliability Diagram Actually Showed Us

A reliability diagram plots predicted confidence against observed accuracy in bins. Perfect calibration is the diagonal. For the extraction agent above, the raw curve was a flat line hovering near 75% accuracy across every confidence bucket from 60 to 99 — the model's confidence carried almost no information about whether it was right. The Expected Calibration Error was north of 0.2.
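A minimal sketch of the binning behind a reliability diagram, plus Expected Calibration Error as the count-weighted gap between confidence and accuracy. It assumes confidences are scaled to 0-1 and that each output has a ground-truth correct/incorrect label; the toy data is invented to mimic a flat curve and is not the engagement's data.

```python
from dataclasses import dataclass


@dataclass
class Bin:
    lo: float          # lower edge of the confidence bin
    hi: float          # upper edge of the confidence bin
    count: int         # number of outputs in the bin
    mean_conf: float   # average stated confidence in the bin
    accuracy: float    # observed fraction correct in the bin


def reliability_bins(confs, correct, n_bins=10):
    """Group (confidence, was_correct) pairs into equal-width bins."""
    if len(confs) != len(correct):
        raise ValueError("confs and correct must have the same length")
    buckets = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        buckets[idx].append((c, ok))
    bins = []
    for i, items in enumerate(buckets):
        if not items:
            continue  # empty bins carry no evidence, so skip them
        n = len(items)
        bins.append(Bin(
            lo=i / n_bins,
            hi=(i + 1) / n_bins,
            count=n,
            mean_conf=sum(c for c, _ in items) / n,
            accuracy=sum(1 for _, ok in items if ok) / n,
        ))
    return bins


def expected_calibration_error(bins):
    """Count-weighted mean of |accuracy - confidence| over non-empty bins."""
    total = sum(b.count for b in bins)
    return sum(b.count / total * abs(b.accuracy - b.mean_conf) for b in bins)


# Toy data: accuracy is 75% in both bins, whatever the stated confidence.
confs = [0.95] * 4 + [0.65] * 4
correct = [True, True, True, False] * 2
bins = reliability_bins(confs, correct)
ece = expected_calibration_error(bins)
for b in bins:
    print(f"{b.lo:.1f}-{b.hi:.1f}: n={b.count} conf={b.mean_conf:.2f} acc={b.accuracy:.2f}")
print(f"ECE={ece:.2f}")
```

Plotting `mean_conf` against `accuracy` per bin gives the diagram; a flat `accuracy` column is the signature of a confidence score with no gating value.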

That flatness is the killer for gating. If accuracy is roughly constant regardless of stated confidence, then any threshold you pick routes a near-random sample of errors to humans. You're not catching the dangerous cases; you're catching a coin-flip's worth of them while burning reviewer time on confidently-correct outputs.

After the calibration work below, the curve bent toward the diagonal in the region that mattered — high-confidence outputs were genuinely more accurate — and, critically, we got a usable abstention band. The point isn't a prettier chart. It's that you cannot reason about a human gate until you've drawn this diagram on your own data. Build it before you build the gate.

The Calibration Moves That Worked — and the Ones That Didn't

We tried the textbook fixes first. Temperature scaling, which fits a single scalar to soften the logits, barely moved ECE for our generative tasks: it rescales one softmax over a fixed set of classes, and free-form generation has no such answer-level distribution to rescale. Verbalised confidence prompting ("think about how sure you are") produced more cautious-sounding language and no better calibration. Both are popular; both underdelivered.

What moved the needle was cheaper and more structural. Ensemble disagreement: sample the answer 3–5 times at moderate temperature and measure agreement. Consensus is a far better honesty signal than any single self-rating. Retrieval-grounding checks: for RAG outputs, verify that the claim is actually entailed by a retrieved span before trusting it — an unsupported answer gets abstained regardless of how confident the model sounds. And abstention thresholds tuned empirically on the reliability diagram, not guessed.
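Ensemble disagreement can be sketched in a few lines, assuming you have already sampled the same prompt several times and that answers are short enough to compare by normalised exact match (extraction fields, IDs, labels). The `normalise` helper and the sample values are illustrative; free-form answers would need a semantic comparison instead.

```python
from collections import Counter


def normalise(answer: str) -> str:
    """Lower-case and collapse whitespace so trivial variants match."""
    return " ".join(answer.lower().split())


def ensemble_agreement(samples: list[str]) -> tuple[str, float]:
    """Return the majority answer and the fraction of samples that agree with it."""
    if not samples:
        raise ValueError("need at least one sample")
    counts = Counter(normalise(s) for s in samples)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(samples)


# Five hypothetical samples of the same extraction; one disagrees.
samples = ["INV-1042", "inv-1042", "INV-1024", "INV-1042 ", "INV-1042"]
answer, agreement = ensemble_agreement(samples)
print(answer, agreement)  # inv-1042 0.8
```

The agreement fraction then takes the place of the self-reported score when you bin outputs and fit thresholds.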

The pattern is consistent across our engagements: external, verifiable signals beat introspective ones. The model is a bad judge of itself but a decent committee, and grounding is checkable against ground truth. This is the same logic behind treating agents like untrusted insiders rather than trusted narrators — you verify behaviour, you don't take the agent's word for it.

→ Worked: ensemble disagreement across 3–5 samples as a confidence proxy.
→ Worked: entailment / grounding checks that abstain when no retrieved span supports the claim.
→ Worked: abstention thresholds fit to your own reliability diagram, per task.
→ Didn't move ECE meaningfully: temperature scaling alone, verbalised "rate your confidence" prompting.
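One way to fit an abstention threshold empirically rather than guess it, sketched under assumptions: `scores` is whatever external signal you gate on (for example an agreement fraction), `correct` is the labelled outcome on a held-out set, and `max_error` is the tolerated error rate for the task. The function name and toy data are hypothetical.

```python
def fit_abstention_threshold(scores, correct, max_error):
    """Lowest threshold whose accepted set (score >= threshold) stays within max_error.

    Returns None if no threshold meets the tolerance, meaning everything
    should be escalated for this task.
    """
    pairs = sorted(zip(scores, correct), key=lambda p: p[0], reverse=True)
    best = None
    accepted = errors = 0
    i = 0
    while i < len(pairs):
        s = pairs[i][0]
        # Admit every item tied at this score together: a threshold cannot split ties.
        while i < len(pairs) and pairs[i][0] == s:
            accepted += 1
            errors += 0 if pairs[i][1] else 1
            i += 1
        if errors / accepted <= max_error:
            best = s
    return best


# Toy held-out set: agreement scores with labelled outcomes.
scores = [1.0, 1.0, 1.0, 0.8, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4]
correct = [True, True, True, True, True, False, True, False, False, False]
loose = fit_abstention_threshold(scores, correct, max_error=0.2)
strict = fit_abstention_threshold(scores, correct, max_error=0.05)
print(loose, strict)
```

With a set this small the estimate is noisy; in practice you would want enough labelled cases per task for the error rate above the threshold to be trustworthy, and you would re-fit as data shifts.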

The Real Decision: When Does an Agent Get to Act Alone?

Here is the framing we give CTOs. Autonomy is not a model property; it's an economic one. An agent should act autonomously on a task only where its calibrated error rate is below the error rate you can tolerate given the cost of being wrong on that task. High-cost, irreversible actions (moving money, sending external comms, mutating production data) demand a far lower tolerated error rate than reversible, low-stakes ones like drafting an internal summary.

So you don't have one gate. You have a per-action-class policy. For each class, define the tolerated error rate, then read off the reliability diagram what confidence band (after calibration) clears it. Outputs in that band act autonomously; everything else routes to a human or abstains. The agent earns autonomy task by task as the evidence comes in.

This is also where the eval harness becomes non-negotiable — without it you have no per-class error estimates to set thresholds against. As we've argued, the eval harness is the real deliverable; the calibration policy is one of the things it pays for.

→ Irreversible + high-cost (payments, external sends, deletes): require very low calibrated error; default to human review.
→ Reversible + low-cost (internal drafts, suggestions): tolerate higher error; default to autonomous with sampling.
→ Anything where grounding check fails: abstain, regardless of confidence.
→ Anything outside the calibrated confidence band: route to human.
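The rules above can be sketched as a small routing function. Everything named here is an assumption for illustration: the action-class names, the `min_score` values (which would come from your own fitted thresholds), and the `grounded` flag, which stands in for the result of an entailment check run elsewhere.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Route(Enum):
    AUTONOMOUS = "autonomous"
    HUMAN = "human"
    ABSTAIN = "abstain"


@dataclass(frozen=True)
class ActionPolicy:
    # Minimum calibrated score for autonomy; None means always human review.
    min_score: Optional[float]


# Hypothetical per-action-class policies; thresholds are placeholders.
POLICIES = {
    "payment": ActionPolicy(min_score=None),
    "internal_draft": ActionPolicy(min_score=0.6),
}


def route(action_class: str, score: float, grounded: bool) -> Route:
    """Decide who handles an output, given its class, score, and grounding result."""
    if not grounded:
        return Route.ABSTAIN  # failed grounding check: abstain regardless of score
    policy = POLICIES.get(action_class)
    if policy is None or policy.min_score is None:
        return Route.HUMAN  # unknown or high-cost class: default to review
    if score >= policy.min_score:
        return Route.AUTONOMOUS
    return Route.HUMAN  # outside the calibrated band


print(route("internal_draft", 0.8, grounded=True))
print(route("payment", 0.99, grounded=True))
print(route("internal_draft", 0.99, grounded=False))
```

Defaulting unknown action classes to human review keeps new capabilities from silently inheriting autonomy they have not earned.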

Measure Abstention Quality, Not Just Accuracy

Accuracy is the wrong scoreboard for a gated agent, because an agent that abstains on everything has zero errors and zero value. The metric that matters is the trade-off between coverage and risk. Track coverage at a fixed error rate: at a tolerated 2% error, what fraction of cases can the agent handle autonomously? That single number tells you the actual leverage of the deployment.

Then track the two failure modes separately, because they have different costs. False autonomy (acting confidently on a wrong answer) is the expensive one and the reason the gate exists. Needless escalation (routing an easy, correct case to a human) is the cost you pay for safety. A good gate minimises false autonomy first, then claws back coverage by cutting needless escalations. Plotting a risk–coverage curve makes this concrete and lets you negotiate the operating point with the business.
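A minimal sketch of these metrics, assuming a labelled evaluation set with one gating score per case. Risk here means the error rate among the cases the agent would handle autonomously; the function names and toy data are illustrative.

```python
def risk_coverage_curve(scores, correct):
    """Return (threshold, coverage, risk) points, from strictest to loosest threshold."""
    pairs = sorted(zip(scores, correct), key=lambda p: p[0], reverse=True)
    n = len(pairs)
    points = []
    accepted = errors = 0
    i = 0
    while i < n:
        s = pairs[i][0]
        while i < n and pairs[i][0] == s:  # ties are accepted together
            accepted += 1
            errors += 0 if pairs[i][1] else 1
            i += 1
        points.append((s, accepted / n, errors / accepted))
    return points


def coverage_at_error(scores, correct, max_error):
    """Largest autonomous fraction whose error rate stays within max_error."""
    ok = [cov for _, cov, risk in risk_coverage_curve(scores, correct) if risk <= max_error]
    return max(ok, default=0.0)


def gate_outcomes(scores, correct, threshold):
    """Count the four outcomes of gating at a given threshold."""
    out = {"autonomous_correct": 0, "false_autonomy": 0,
           "needless_escalation": 0, "useful_escalation": 0}
    for s, ok in zip(scores, correct):
        if s >= threshold:
            out["autonomous_correct" if ok else "false_autonomy"] += 1
        else:
            out["needless_escalation" if ok else "useful_escalation"] += 1
    return out


# Toy evaluation set.
scores = [1.0, 1.0, 1.0, 0.8, 0.8, 0.8, 0.6, 0.6, 0.4, 0.4]
correct = [True, True, True, True, True, False, True, False, False, False]
cov_strict = coverage_at_error(scores, correct, max_error=0.02)
cov_loose = coverage_at_error(scores, correct, max_error=0.20)
outcomes = gate_outcomes(scores, correct, threshold=0.8)
print(cov_strict, cov_loose, outcomes)
```

Multiplying `false_autonomy` and `needless_escalation` by their respective per-case costs gives a single number for comparing operating points, if the business can supply those costs.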

Be honest about what this buys you and what it doesn't. Calibration reduces confidently-wrong actions; it does not eliminate them, and distribution shift will quietly erode your thresholds over weeks — exactly the kind of slow decay that contributes to pilots dying in month four. The gate is a living control that needs re-fitting, not a one-time number.

The Honest Limits

None of this makes the model self-aware. Ensemble agreement still misses correlated errors — if all five samples share the same wrong assumption from the same bad retrieval, they'll confidently agree on garbage. Grounding checks only help where claims are checkable against retrieved text; open-ended reasoning tasks have no span to verify against, and calibration there remains genuinely hard. We don't have a clean answer for those, and anyone who claims one is selling something.

What we do have is a discipline: draw the reliability diagram on your data, prefer external signals over introspective ones, set per-action-class thresholds tied to the cost of being wrong, and measure coverage-at-fixed-error instead of raw accuracy. That's enough to turn a decorative confidence gate into a real one — and to know, with numbers, which parts of the workflow your agent has actually earned the right to run alone.

Frequently asked questions

Are LLM logprobs a reliable confidence measure?

No. Mean token logprobs reflect output fluency, not correctness, and they collapse toward confident-looking values once you fine-tune or RAG-ground a model. Use them as one weak signal at most, never as your sole gating threshold.

How do I decide when an AI agent can act without human review?

Treat autonomy per action class: estimate the agent's calibrated error rate for that task and allow autonomy only where it falls below the error rate you can tolerate given the cost of being wrong on it. Irreversible, high-cost actions need much lower tolerated error than reversible ones.

What's a better metric than accuracy for a human-in-the-loop agent?

Coverage at a fixed error rate — the fraction of cases the agent handles autonomously while staying under your tolerated error — plus separate tracking of false autonomy versus needless escalation. A risk–coverage curve makes the operating point explicit.

Does temperature scaling fix LLM calibration?

Rarely enough on its own for generative agent tasks. In our work it barely moved Expected Calibration Error; ensemble disagreement and retrieval-grounding checks were far more effective signals for deciding when to abstain or escalate.
