Why does the same AI agent task cost $8 in one run and $240 in another?

A recent arXiv paper from Bai et al. (April 2026) measured token consumption across identical agentic coding tasks and found runs on the same task can differ by up to 30x in total tokens. The variance is structural, not a bug: the model decides how many tool calls to make, the retrieval layer decides how much context to pull, the verifier decides whether to loop again. Each is a stochastic decision conditioned on intermediate outputs the user never sees.

Does spending more tokens improve agentic AI accuracy?

No. The same paper found that accuracy peaks at intermediate token cost and saturates at higher costs. A $240 run is usually not a better answer than the $8 run; it is just a longer one. ReAct-style "try, fail, retry" loops compound noise more than they compound insight. Token variance is not only a cost problem; it is a quality problem.

How does compiled execution reduce token consumption versus ReAct loops?

Compiled execution builds a structured plan at design time and runs the steps in order. The runtime does not burn tokens guessing what to do next. ReAct loops do the opposite: every step is a fresh stochastic decision about what to try next, with a chance to loop back, retry, or wander. Compiled plans bound the step count and the token spend. ReAct architectures are the structural reason the same task can cost 30x more on different runs.

What questions should enterprises ask AI agent vendors about token efficiency?

Stop asking about per-token prices and context window sizes. Ask about the coefficient of variation on token spend across runs of the same workflow, what architecture prevents runaway reasoning loops on edge cases, whether policy steps can execute as deterministic code, whether you can observe and cap token spend per workflow, whether the system can route different steps to different models based on cost and fitness, and what the accuracy curve looks like as a function of token spend. The vendors that can answer those questions are the ones building for the world the arXiv paper just described.

The Unit of AI Cost Is Not the Token. It Is the Distribution.

What is the KV cache and why does it matter for AI cost?

The KV cache is the running memory state of a long conversation or agent loop. For a given model, it is not fixed by parameter count; it grows with context length and the number of agent steps. A long agentic session can hold tens of gigabytes of state per user, per session. Architectures that hose the context window with every retrieval pay that bill on every step. Architectures that retrieve precisely, and that compile workflows so step count is bounded, do not.

Agentic AI token spend can vary 30x on identical tasks. The same workflow can cost $8 or $240 with no change to inputs, because every run makes stochastic decisions about tool calls and retry loops the user never sees. The expensive runs are usually no more accurate than the cheap ones. They are just longer.

On May 19, 2026, Sundar Pichai opened Google I/O not with a model drop, but with a token counter. Roughly 9.7 trillion tokens per month in April 2024. 480 trillion at I/O 2025. 3.2 quadrillion at I/O 2026. That is a 330x jump in two years. Google also disclosed that top Google Cloud customers are each processing around one trillion tokens per day, with more than 375 enterprises crossing the cumulative one-trillion-token mark in the prior 12 months.

The same afternoon, OpenAI announced “Guaranteed Capacity”: one-, two-, or three-year compute commitments for enterprise customers, with volume-based discounts redeemable across OpenAI’s model families. Sam Altman framed it directly: “customers are increasingly asking us for certainty on capacity. as models get better, we expect that the world will be capacity-constrained for some time.”

Two of the most powerful AI companies on earth told you, on the same day, that demand is outrunning supply. This is Jevons Paradox in real time. Cheaper, faster models do not reduce token consumption. They explode it.

Inference got 100x cheaper. The bill went up anyway.

This sounds like a paradox until you do the multiplication. Twelve months ago, one million tokens of frontier-class reasoning output cost on the order of $60. Today an equivalent quality of output costs roughly $0.50: about a 120x drop in twelve months. GPT-4-class output has dropped roughly 100x since the original GPT-4 shipped.

By any normal reading of a technology cost curve, that should be deflationary. It should be saving enterprises money. The opposite has happened. Anthropic has signed multi-year capacity deals with both xAI and Amazon. Microsoft’s 2026 Azure capex guide starts with an eight. NVIDIA paid roughly twenty billion dollars to acquire Groq, an inference specialist that did not exist as a serious commercial entity three years ago. The cost curve and the demand curve crossed, and then demand lapped supply.

The amplification stack underneath that demand is brutal. Compare what each workload type consumes against a single-shot completion as the baseline:

Workload	Token consumption vs single-shot	Why
Single-shot completion	1x (baseline)	One prompt, one response.
Reasoning model	~10x output tokens	Spends most tokens thinking out loud before answering.
Agentic workflow (ReAct)	~20x requests, with up to 30x run-to-run variance	Plans, calls tools, retries, synthesizes. Step count is stochastic.
Deep-research query	>10x of an original GPT-4 query	Long context, multi-source synthesis, repeated retrieval.
Compiled execution (MightyBot)	Bounded by the plan	Step count is fixed at design time. No retry loops.

We made every individual token a hundred times cheaper, and then we built a generation of products that consume ten thousand times more tokens. 100x cheaper times 10,000x more equals a 100x larger total bill.
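The multiplication is easy to check. A minimal sketch, using only the two round figures quoted above (the function name is ours, for illustration):

```python
def total_bill_multiplier(price_drop: float, volume_growth: float) -> float:
    """How the total bill changes when the unit price falls by `price_drop`x
    while token volume grows by `volume_growth`x."""
    return volume_growth / price_drop


# Figures from the paragraph above: tokens 100x cheaper, 10,000x more of them.
multiplier = total_bill_multiplier(price_drop=100, volume_growth=10_000)
print(f"Total bill is {multiplier:.0f}x larger")
```

Any volume growth that outpaces the price drop pushes the multiplier above 1, which is the Jevons effect in one line of arithmetic.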

That is the supply story, and it is the one most procurement teams are starting to wake up to. There is a quieter problem underneath the supply problem, and it is going to reshape the way regulated enterprises buy AI.

The 30x variance problem

Two weeks before the I/O keynote, an arXiv paper landed that enterprise buyers have not absorbed yet.

The paper is How Do AI Agents Spend Your Money? Analyzing and Predicting Token Consumption in Agentic Coding Tasks by Longju Bai, Zhemin Huang, Xingyao Wang, Jiao Sun, Rada Mihalcea, Erik Brynjolfsson, Alex Pentland, and Jiaxin Pei. arXiv:2604.22750, v2, April 29, 2026.

The authors ran identical agentic tasks across eight frontier LLMs on SWE-bench Verified. Same task. Same prompt. Same context window. Same tool stack. They measured end-to-end token consumption across many runs.

The run-to-run spread was extreme. From the paper, verbatim: “runs on the same task can differ by up to 30x in total tokens.” The same task could cost eight dollars or two hundred and forty dollars. Same inputs. Same agent. Different luck.
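The 30x figure is a max-to-min ratio. The coefficient of variation (standard deviation divided by mean) is a related but separate statistic. A rough sketch of how you might compute both from your own run logs follows. The token counts are invented for illustration and are not taken from the paper.

```python
from statistics import mean, pstdev


def spread_stats(token_counts: list[int]) -> dict[str, float]:
    """Summarize run-to-run spread for repeated runs of one task."""
    return {
        "max_min_ratio": max(token_counts) / min(token_counts),
        "coefficient_of_variation": pstdev(token_counts) / mean(token_counts),
    }


# Hypothetical token totals for five runs of the same task (not from the paper).
runs = [400_000, 650_000, 900_000, 2_500_000, 12_000_000]
stats = spread_stats(runs)
print(stats)
```

With a heavy right tail like this one, a single runaway run dominates both numbers, which is why a monthly average hides the problem.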

This is not a bug. It is a feature of agentic execution:

The model decides how many tool calls to make.
The retrieval layer decides how much context to pull.
The verifier decides whether to loop again.
Each of these is a stochastic decision conditioned on intermediate outputs the user never sees.

The paper also finds that frontier models cannot accurately predict their own token usage, with weak-to-moderate correlations only up to 0.39, and they systematically underestimate real costs. The agent does not know what it is about to spend. The vendor does not know what it is about to charge you. Nobody is at the wheel of the meter.

The quality problem hiding inside the cost problem

There is a quieter finding buried in the paper, and it is the one that matters most for buyers. The paper does not just find that token cost has wide variance. It finds that accuracy peaks at intermediate cost and saturates at higher costs.

The $240 run is usually not a better answer. It is just a longer one. Agents that burn 30x more tokens on the same task are typically not 30x more correct. Often they are not any more correct. The “try, fail, retry” loops at the heart of most ReAct-style architectures compound noise more than they compound insight. Token variance is not only a cost problem. It is a quality problem: enterprises are paying more for worse answers.
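One way to look for that plateau in your own workloads is to bucket runs by token spend and compare accuracy per bucket. The sketch below assumes you log a token total and a pass/fail outcome per run. The data and bucket edges are hypothetical.

```python
from collections import defaultdict


def accuracy_by_cost_bucket(runs, bucket_edges):
    """Group (tokens, solved) runs into cost buckets; return accuracy per bucket index."""
    totals = defaultdict(lambda: [0, 0])  # bucket index -> [solved, count]
    for tokens, solved in runs:
        # First bucket whose upper edge covers this run; overflow lands in the last bucket.
        idx = next(
            (i for i, edge in enumerate(bucket_edges) if tokens <= edge),
            len(bucket_edges),
        )
        totals[idx][0] += int(solved)
        totals[idx][1] += 1
    return {idx: solved / count for idx, (solved, count) in sorted(totals.items())}


# Hypothetical runs (token total, task solved?), invented for illustration only.
runs = [
    (200_000, False), (300_000, True),
    (800_000, True), (900_000, True),
    (5_000_000, True), (9_000_000, True),
]
curve = accuracy_by_cost_bucket(runs, bucket_edges=[500_000, 2_000_000])
print(curve)
```

If the top bucket is no more accurate than the middle one, the extra tokens in it bought nothing.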

The paper also clocks the model-to-model spread. On the same tasks, Kimi-K2 and Claude-Sonnet-4.5 consumed over 1.5 million more tokens, on average, than GPT-5. Choosing the model is choosing the cost surface. Most enterprises are doing this by feel.

Procurement was not designed for this

Cloud compute pricing is published per instance-hour and stable once reserved. You can size a budget. You can quote a fixed price. You can sign an MSA on a per-million-token rate card.

You can still do all of those things with agents. But the per-million rate card no longer maps to the actual cost of producing a decision. The agent decides how many millions to spend. And it decides differently each run.

The piece that nobody is pricing yet is what happens when the procurement function at a top-tier bank, consulting firm, or pharma company realizes the math. They will not just ask for cheaper tokens. They will ask for tighter controls on variance: better routing, better caching, better observability, better policy limits, and in some cases dedicated capacity.

That is a very different infrastructure requirement from stateless inference. It does not mean cloud disappears. It means the winning cloud looks much more like an execution and control layer for stochastic workloads.

The unit of AI cost is not the token. It is the distribution.

What token-efficient agents actually look like

MightyBot was built before any of this was a public conversation, on a thesis that AI for the work that can’t be wrong requires a different architecture from “AI for the work that’s fun to try.” That thesis is now also a procurement thesis. Here is how it maps to token efficiency in production.

1. Compiled execution, not stochastic loops

MightyBot runs on a Constrained Agent Runtime and a Deterministic Action Layer, supporting three execution patterns: compiled plan execution, stepwise execution, and planned sequences. The default for production work is compiled plan execution: build a structured plan, then run the steps. The runtime does not burn tokens guessing what to do next. It executes a plan that was already shaped at design time.

No ReAct “think, try, fail, think, try, fail” loops. Those are the architecture that produces the 30x variance the arXiv paper measured. They are also the architecture that pays more for worse answers.
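To make the contrast concrete, here is a minimal sketch of the general idea of a compiled plan: a fixed list of steps, each with a token ceiling, run once in order. This is not MightyBot's runtime. `Step`, `execute_plan`, and the stubbed steps are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[dict], dict]  # takes the current state, returns updates
    max_tokens: int              # ceiling this step may spend (0 = pure code)


def execute_plan(plan: list[Step], state: dict) -> tuple[dict, int]:
    """Run every step exactly once, in order. The worst-case token budget
    is the sum of the step ceilings, known before execution starts."""
    worst_case = sum(step.max_tokens for step in plan)
    for step in plan:
        state = {**state, **step.run(state)}
    return state, worst_case


plan = [
    # The lambdas are stand-ins; a real step would call a model or a system of record.
    Step("extract_amount", lambda s: {"amount": 1200}, max_tokens=2_000),
    Step("check_limit", lambda s: {"within_limit": s["amount"] <= 5_000}, max_tokens=0),
    Step("draft_summary", lambda s: {"summary": f"amount={s['amount']}"}, max_tokens=1_000),
]
final_state, worst_case_tokens = execute_plan(plan, {})
print(final_state, worst_case_tokens)
```

The point of the design is that there is no branch back to an earlier step, so the step count and the spending ceiling are properties of the plan rather than of any individual run.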

2. Design-time intelligence versus runtime execution

MightyBot separates the MightyBot Agent Compiler (design time) from the Policy Agent Workflow (runtime). Creative reasoning happens once, during workflow design and policy authoring. At runtime, the system does not pay tokens to re-discover the answer every time the workflow runs. The plan is compiled. The execution is cheap.

3. Policy evaluated through deterministic code paths

Most enterprise policies are precise enough to encode as code: three-way matches, threshold checks, variance calculations, comparison against a canonical field dictionary. MightyBot evaluates them through deterministic code paths, not through the model. Every step that does not require model reasoning is a step that does not consume tokens. Most enterprise workflows are ninety percent deterministic. Treating them as if they were ninety percent generative is the root cause of most runaway token bills.
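As an illustration of a policy that needs no model call, here is a sketch of a three-way match between a purchase order, a goods receipt, and an invoice. The field names and the one-cent tolerance are assumptions made for the example, not MightyBot's schema.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    sku: str
    quantity: int
    unit_price: Decimal


def three_way_match(po: Line, receipt: Line, invoice: Line,
                    price_tolerance: Decimal = Decimal("0.01")) -> list[str]:
    """Return mismatch reasons; an empty list means the match passes."""
    problems = []
    if not (po.sku == receipt.sku == invoice.sku):
        problems.append("sku mismatch")
    if invoice.quantity > receipt.quantity:
        problems.append("invoiced quantity exceeds received quantity")
    if abs(invoice.unit_price - po.unit_price) > price_tolerance:
        problems.append("invoice price differs from PO price")
    return problems


po = Line("A-1", 10, Decimal("4.00"))
receipt = Line("A-1", 10, Decimal("4.00"))
invoice = Line("A-1", 12, Decimal("4.50"))
problems = three_way_match(po, receipt, invoice)
print(problems)
```

The check costs zero tokens, returns the same answer on every run, and its reasons can be logged verbatim.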

4. Search instead of context-stuffing, and KV-cache discipline

MightyBot’s Megastore layer uses hybrid search over a unified repository of documents, system records, and API responses. Tokens go to the page, paragraph, and character offsets that matter for the current policy decision, not to the entire request dumped into the context window. The paper’s finding that input tokens drive the overall cost is exactly the problem this is built to solve.

This matters more than ever because the KV cache, the running memory state of a long conversation or agent loop, is the silent monster of the inference era. For a given model, it is not fixed by parameter count; it grows with context length and number of agent steps. A long agentic session can hold tens of gigabytes of state per user, per session.

Architectures that hose the context window with every retrieval pay that bill on every step. Architectures that retrieve precisely, and that compile workflows so step count is bounded, do not.
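The "tens of gigabytes" figure can be sanity-checked with the standard back-of-envelope estimate for a transformer KV cache: one key vector and one value vector per token, per layer, per KV head. The model shape below is a hypothetical configuration chosen for the example, not a description of any specific vendor's model.

```python
def kv_cache_bytes(context_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes of KV cache for one sequence: a key and a value vector
    per token, per layer, per KV head."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * context_tokens


# Hypothetical shape: 80 layers, 8 KV heads, head dimension 128, 16-bit values.
size = kv_cache_bytes(context_tokens=128 * 1024, n_layers=80,
                      n_kv_heads=8, head_dim=128)
print(f"{size / 2**30:.0f} GiB")  # prints "40 GiB"
```

For a fixed model shape the estimate is linear in context length, so halving what retrieval puts in the window halves the cache.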

5. Model-agnostic routing

The platform is model-agnostic by design. Sovereign or latency-sensitive steps can route to smaller specialized models. Frontier reasoning routes to whichever model fits the step best. Deterministic checks route to code. You pay for capability where you actually need it, and only there. The 1.5 million token spread between Kimi-K2 and GPT-5 the paper measured stops being an accident and starts being a decision.
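A routing rule of this kind can be as simple as a lookup table. The sketch below is illustrative only: the step kinds and the route names ("code", "small-model", "frontier-model") are placeholders we made up, not real endpoints or MightyBot configuration.

```python
from enum import Enum


class StepKind(Enum):
    DETERMINISTIC_CHECK = "deterministic_check"
    EXTRACTION = "extraction"
    OPEN_REASONING = "open_reasoning"


# Hypothetical route names; a real deployment would map these to actual executors.
ROUTES = {
    StepKind.DETERMINISTIC_CHECK: "code",
    StepKind.EXTRACTION: "small-model",
    StepKind.OPEN_REASONING: "frontier-model",
}


def route(kind: StepKind, sovereign: bool = False) -> str:
    """Pick an executor for a step. Deterministic checks always run as code;
    sovereign steps that need a model stay on the small model."""
    if kind is StepKind.DETERMINISTIC_CHECK:
        return ROUTES[kind]
    if sovereign:
        return "small-model"
    return ROUTES[kind]


print(route(StepKind.OPEN_REASONING), route(StepKind.OPEN_REASONING, sovereign=True))
```

Because the table is explicit, the cost surface becomes something you can review and change, which is the decision the paragraph above describes.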

6. Continuous evaluation and a why-trail on every run

Every decision carries a why-trail: policy version applied, data inputs consulted, evidence pointers with page and character offsets, timestamped execution log. The Continuous Evaluation Engine tracks accuracy by input type and data source, time-to-decision, and human override frequency. Token spend per workflow becomes a measured, trended quantity. Outlier runs surface instead of hiding inside an unbounded monthly invoice.
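As a rough picture of what such a record and an outlier check could look like, consider the sketch below. The `RunRecord` fields mirror the items listed above, but the schema, the names, and the three-times-median threshold are our assumptions, not the product's actual format.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class RunRecord:
    run_id: str
    policy_version: str
    # Evidence pointers: (document, page, start offset, end offset)
    evidence: tuple[tuple[str, int, int, int], ...]
    tokens_spent: int


def outlier_runs(records: list[RunRecord], factor: float = 3.0) -> list[str]:
    """Return ids of runs whose token spend exceeds `factor` times the median."""
    typical = median(r.tokens_spent for r in records)
    return [r.run_id for r in records if r.tokens_spent > factor * typical]


# Hypothetical records for four runs of one workflow.
records = [
    RunRecord("r1", "v7", (("loan.pdf", 3, 120, 180),), 10_000),
    RunRecord("r2", "v7", (("loan.pdf", 3, 120, 180),), 12_000),
    RunRecord("r3", "v7", (("loan.pdf", 4, 10, 64),), 11_000),
    RunRecord("r4", "v7", (("loan.pdf", 9, 0, 400),), 90_000),
]
print(outlier_runs(records))
```

The median is used instead of the mean so that the runaway run being hunted does not drag the baseline up with it.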

What this architecture is actually optimized for

Token efficiency

Tokens go to the steps that actually need a model. Deterministic policy checks run as code. Retrieval is precise, not context-stuffed. Workflows are compiled, not improvised. The variance the arXiv paper measured is the variance MightyBot is built to eliminate.

Fast execution

Compiled plans do not wait around for the model to decide what to do next. Complex evidence-heavy reviews that take 90 minutes in a manual process complete in roughly five. Loan administrators handle an order of magnitude more cases per day. Cycle times collapse because the system is not paying itself to think out loud.

High accuracy

Accuracy is 99%+ across millions of live production tasks. The accuracy is the architecture, not a marketing number. When the system stops paying the model to guess at structure and policy, the model stops guessing wrong.

These outcomes are observed across more than two hundred financial institutions in production, processing more than one hundred billion dollars in lending activity. They are not separable from the token-efficiency story. The cycle-time reduction is a token reduction. The throughput uplift is a cost-per-decision reduction. The accuracy floor is what you get when an agentic architecture stops paying the system to guess.

The questions to ask your AI vendor in 2026

If you are buying agentic AI for regulated work, the questions you ask vendors need to change. The token-economy data is now public. The 30x variance is now in the literature. The vendors that cannot answer the new questions are the ones selling you variance and calling it intelligence.

Stop asking:

What’s your per-token price?
What’s your context window?
Which model do you use?
Can you summarize this PDF?

Ask instead:

What is the coefficient of variation on token spend across runs of the same workflow?
What is the architecture that prevents runaway reasoning loops on edge cases?
Can policy steps execute as deterministic code, or do they all hit the LLM?
Can I observe, alert on, and cap token spend per workflow, per tenant, per policy version?
Can the system route different steps to different models based on cost and fitness?
What is the accuracy curve as a function of token spend? Where does it plateau?
What is in the why-trail for every decision, and can I export it on audit?
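The "cap token spend per workflow" question has a simple shape in code. Here is a minimal sketch of a hard per-run budget, assuming the caller knows or can estimate a step's token usage before committing to it. `TokenBudget` and `TokenBudgetExceeded` are hypothetical names.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a charge would push a run past its cap."""


class TokenBudget:
    """Tracks spend against a hard cap for one workflow run."""

    def __init__(self, cap: int) -> None:
        self.cap = cap
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record usage; refuse any charge that would cross the cap."""
        if self.spent + tokens > self.cap:
            raise TokenBudgetExceeded(
                f"cap {self.cap} would be exceeded: {self.spent} + {tokens}"
            )
        self.spent += tokens


budget = TokenBudget(cap=10_000)
budget.charge(6_000)
try:
    budget.charge(5_000)  # would bring the total to 11,000
    halted = False
except TokenBudgetExceeded:
    halted = True
print(budget.spent, halted)
```

A vendor that can enforce something like this per workflow, per tenant, and per policy version can also tell you what happens to the task when the cap trips, which is worth asking as a follow-up.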

Buy the distribution, not the token.

The token economy is exploding. That is the supply story, and it makes for great keynote slides. The demand story is more interesting and harder to fit on a slide. Enterprises will pay a premium not for cheaper tokens, but for tighter variance. The next era of AI infrastructure is execution control, not just inference.

Token efficiency stops being a virtue and starts being a procurement requirement. Compiled execution beats stochastic loops. Deterministic code beats LLM calls wherever policy is precise enough to encode. Why-trails on every decision beat black-box outputs that nobody can defend at audit.

If your AI agent can cost you eight dollars or two hundred and forty dollars for the same task, with no change to any input, you do not have an AI agent. You have a slot machine. The work that can’t be wrong is not work you want to leave on a slot machine.

