The Agent That 10x'd Traffic and Blew Its Budget

Almost every agent looks cheap in pilot. Fifty requests a day, a generous model, nobody watching the token meter — the demo lands and everyone moves on. Then traffic ramps, and the bill doesn't scale linearly with usage. It scales worse. This is an anonymised account of one engagement where a support agent ran fine at pilot volume and became unaffordable the moment real traffic hit, and exactly where the money went.
By Daniel Usvyat · Founder & Principal, USQRD
The Setup: A Support Agent That Passed Its Pilot
The client was a Series B SaaS company. The agent resolved customer support tickets end-to-end — read the ticket, pull account context, search docs, draft a reply, and in some cases take an action through internal tools. In pilot it handled roughly 50 conversations a day against a friendly internal beta. Resolution quality was good, the demo was clean, and the per-conversation cost looked like a rounding error.
Then it went live to the full inbound queue. Traffic went from ~50 to ~500 conversations a day inside three weeks. The bill didn't grow 10x. It grew closer to 14x, because the inefficiencies that were invisible at low volume compounded once the agent was running constantly against messier, longer real-world conversations.
The pilot numbers that everyone had signed off on were never representative. They measured a quiet system on easy tickets. Production was noisy, multi-turn, and full of the edge cases that make agents loop. This is the same gap we see between a RAG demo and production — the demo measures the happy path, and the happy path is not where the cost lives.
Where the Money Actually Went
We instrumented the agent at the span level — every LLM call, every tool call, every retrieval, tagged to a conversation and a turn. The breakdown was not what the team expected. The useful work — reading the ticket and drafting the final answer — was a minority of the spend. Four failure modes ate the rest.
None of these were exotic. They're the default behaviour of a naively-wired agent loop, and they're invisible until you tag spend per turn and per resolved task rather than looking at one aggregate bill.
- →Redundant tool calls: the agent re-fetched the same account record three or four times in a single conversation because nothing cached it across turns. ~22% of tool-call spend was duplicate fetches.
- →Oversized context windows: the full conversation history plus every retrieved doc was stuffed into context on every call. By turn six, a single LLM call carried ~14k tokens, most of it stale.
- →Re-retrieval on every turn: the agent ran a fresh vector search on each user message even when the topic hadn't changed, paying for embeddings and retrieval round-trips it didn't need.
- →A chatty planner loop: the planning model deliberated — sometimes 5–7 reasoning steps — before taking a single action, often re-deciding things it had already decided. The planner alone was ~40% of token spend.
The Before/After Numbers
We set three target metrics up front, because 'the bill is too high' is not something you can engineer against. The unit that matters for a support agent is cost per resolved task — not cost per call, not cost per token. A cheap call that fails to resolve and triggers three more calls is expensive. A more expensive call that resolves in one shot is cheap.
After four weeks of focused work, here is what moved. Resolution rate held steady at ~82% throughout — the point was to cut waste, not quality.
- →Cost per resolved task: down ~71%. The biggest single contributor was killing planner waste and pruning context.
- →Median latency: down from ~11s to ~4s. Most of the waste was also slow — fewer round-trips and fewer tokens meant faster answers as a side effect.
- →Token spend per conversation: down ~64%, driven almost entirely by context pruning and the smaller routing model.
- →p95 latency: down from ~26s to ~9s, because the step budget capped the worst-case planner loops that had been the tail.
- →Resolution rate: unchanged at ~82%.
The Four Changes That Did the Work
The fixes mapped one-to-one onto the failure modes. None required a smarter model. Most were about not paying for work the agent had already done or didn't need to do.
Caching came first because it was the cheapest win. We cached tool results within a conversation — account lookups, entitlement checks, anything idempotent within the session — keyed by conversation and arguments. Redundant fetches dropped to near zero.
We split the model tier. The original build used one frontier model for everything, including routing decisions like 'is this a billing question or a technical one'. We moved classification and routing to a small, fast model and reserved the frontier model for drafting and judgement calls. Most turns never touched the expensive model at all.
Context pruning replaced 'stuff everything in' with a maintained working state: a short running summary plus the last two turns verbatim plus only the retrieved chunks relevant to the current step. And we cached retrieval — re-running search only when the query embedding moved meaningfully from the last one, not on every message.
The useful work was a minority of the spend. The rest was the agent paying, repeatedly, for things it had already done.
- →Cache tool results within a conversation — idempotent calls should never run twice.
- →Route with a small model; reserve the frontier model for drafting and judgement.
- →Maintain a compact working state instead of replaying full history every call.
- →Re-retrieve only when the query has actually changed.
- →Give the planner a hard step budget — cap reasoning steps and force an action.
Step Budgets: The Fix Nobody Wants to Add
The planner loop was the hardest to fix because the instinct is to let the agent 'think as long as it needs to'. In practice, unbounded planning is where both cost and tail latency hide. We capped the planner at three steps before it was forced to either act or escalate to a human, and we logged every time it hit the cap so we could see whether the budget was too tight.
It almost never was. The conversations that hit the cap were the genuinely hard ones that should have gone to a human anyway. Forcing a decision didn't hurt resolution rate — it just stopped the agent burning tokens re-litigating decisions. This is the same discipline you need on the evaluation side: you can't tune a budget you can't measure, which is why the eval harness is the actual deliverable, not the agent itself.
Step budgets also gave us a clean escalation signal. An agent that hits its budget is telling you something — either the task is out of scope or the tooling is missing. Both are useful, and both were invisible when the planner just looped silently.
The Transferable Principle: Model Unit Economics Before You Ship
The real lesson isn't any single fix. It's that this team shipped without a model of what a resolved task should cost, so there was no number to defend against and no alarm when the curve bent. The blowup was discovered by finance, not by engineering — which is the worst way to find it.
Before you ship an agent, write down the unit economics: target cost per resolved task, expected calls per task, token budget per call, and the resolution rate that makes the whole thing worth running. Then instrument against those numbers from day one. When the gap between pilot and production opens, you'll see it in your own dashboard instead of in a forwarded invoice. This is the same instinct behind deciding build vs buy honestly — you can't make the call without a cost model.
The honest open problem: unit economics drift. As models change, prompts grow, and traffic mix shifts, your cost-per-task moves, and a budget you set at launch rots like any other artefact. The fix isn't a one-time analysis — it's a standing metric you watch the way you watch error rate. Treat cost per resolved task as a first-class SLO, alert on it, and review it when anything in the stack changes.
Frequently asked questions
Why does an AI agent cost more per request at scale than in pilot?
Because the inefficiencies — redundant tool calls, oversized context, re-retrieval, planner loops — compound per turn and per conversation, and production traffic is longer and messier than pilot traffic. The cost curve bends upward as usage grows, so per-task cost rises rather than staying flat.
What is the right unit to measure AI agent cost?
Cost per resolved task, not cost per call or per token. A cheap call that fails and triggers three more is expensive; a pricier call that resolves in one shot is cheap. Measure the outcome, not the input.
How do you reduce LLM token spend without hurting quality?
Prune context to a compact working state instead of replaying full history, route cheap decisions to a smaller model, cache idempotent tool and retrieval results, and cap planner steps. In our work these cut token spend ~64% with no drop in resolution rate.
When should I model an AI agent's unit economics?
Before you ship — write down target cost per resolved task, calls per task, and token budget per call, then instrument against them from day one. Treat cost per resolved task as a standing SLO you alert on, because it drifts as models, prompts, and traffic mix change.
Take the Operational Bottleneck Audit
Our Bottleneck Audit maps where your agent's spend and latency actually go — before finance asks.
Find the cost leak before it 14x's
We instrument your agent at the span level and model cost per resolved task, then ship the fixes. Most engagements pay for themselves in token spend within weeks.
Book a Discovery Call

