AI agent budgeting is usually framed as a token procurement problem. Finance asks which model is cheapest. Engineering asks which context window is largest. Procurement negotiates a rate card. Everyone feels responsible, but nobody has actually budgeted the workflow.
That is the mistake.
The unit of cost in production AI is not the token. It is the completed decision: one claim triaged, one draw package reviewed, one invoice approved, one covenant package checked, one exception routed to a human. Tokens are an input. Decisions are the output the business pays for.
If you budget tokens but do not budget decisions, you have not built a cost model. You have built a meter and hoped the workflow behaves.
Why token rates do not answer the business question
A per-token rate tells you what a model call of a given size costs. It does not tell you how many model calls the workflow will make, or how much context each one will carry.
That distinction did not matter much in the era of single-shot prompts. A user asked a question, a model answered, and the rough cost was prompt tokens plus response tokens. The pricing unit and the product unit were close enough.
Agents break that relationship. An agentic workflow may classify a document, retrieve evidence, inspect a policy, call a tool, validate a field, retry extraction, ask a second model to judge ambiguity, summarize findings, and write an audit trail. The token rate only prices each step. It does not bound the number of steps.
That is how two vendors can both say they use inexpensive models while producing completely different economics in production. The first system may execute a compiled six-step plan. The second may improvise through a loop until it finds an answer. Same input, same token rate, different cost per decision.
For regulated workflows, that difference is not a rounding error. It determines whether the agent can run across thousands of cases per day or stays trapped in a pilot.
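The arithmetic behind that gap is easy to sketch. The example below is illustrative only: the token rate, the step counts, and the per-step token sizes are made-up assumptions, not measurements from any vendor or model.

```python
# Hypothetical figures for illustration only.
RATE_PER_1K_TOKENS = 0.002  # assumed blended price in dollars per 1,000 tokens


def cost_per_decision(tokens_per_step: list[int],
                      rate_per_1k: float = RATE_PER_1K_TOKENS) -> float:
    """Sum the token cost of every model call needed to complete one decision."""
    return sum(tokens_per_step) * rate_per_1k / 1000


# A compiled plan: six steps, each with a bounded amount of context.
compiled_plan = [3000] * 6

# An improvised loop: 25 steps, each re-reading a context that keeps growing.
improvised_loop = [3000 + 1000 * i for i in range(25)]

compiled_cost = cost_per_decision(compiled_plan)
loop_cost = cost_per_decision(improvised_loop)

print(f"compiled plan:   ${compiled_cost:.3f} per decision")
print(f"improvised loop: ${loop_cost:.3f} per decision")
print(f"ratio: {loop_cost / compiled_cost:.1f}x at the same token rate")
```

With these assumed numbers the loop costs roughly 20 times as much per decision, even though both systems pay the same price per token.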
What cost per decision includes
Cost per decision starts with token spend, but it does not end there.
It includes the number of model calls required to complete the workflow. It includes the amount of context retrieved on each step. It includes the cost of vision, OCR, embeddings, structured extraction, tool calls, storage, and human review. It includes rework when the agent produces an incomplete answer. It includes debugging time when a run is expensive and nobody can explain why.
A useful budget model has at least five layers.
- Input cost: documents, pages, images, records, messages, and system data required to start the workflow.
- Execution cost: model calls, retrieved context, tools, code paths, and deterministic checks used to process the input.
- Exception cost: human review, escalations, low-confidence handling, re-runs, and missing data requests.
- Audit cost: evidence capture, policy versioning, logs, report generation, and reviewer notes.
- Variance cost: the spread between normal runs and expensive outlier runs.
Most teams model only the second layer, execution cost. The actual business case lives across all five.
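One way to keep all five layers visible is to carry them as separate fields in the cost model. The sketch below is a hypothetical structure with invented dollar amounts. Treating variance as a per-decision reserve for outlier runs is one modeling choice among several, not something the layers themselves prescribe.

```python
from dataclasses import dataclass, fields


@dataclass
class DecisionCost:
    """Per-decision cost in dollars, split across the five layers."""
    input_cost: float      # documents, pages, images, records
    execution_cost: float  # model calls, retrieval, tools, checks
    exception_cost: float  # human review, escalations, re-runs
    audit_cost: float      # evidence capture, logs, reports
    variance_cost: float   # assumed reserve for expensive outlier runs

    def total(self) -> float:
        return sum(getattr(self, f.name) for f in fields(self))

    def execution_share(self) -> float:
        """Fraction of the total that an execution-only model would capture."""
        return self.execution_cost / self.total()


# Hypothetical numbers for a single invoice approval.
invoice = DecisionCost(
    input_cost=0.05,
    execution_cost=0.40,
    exception_cost=0.90,  # e.g. amortized share of cases sent to a reviewer
    audit_cost=0.10,
    variance_cost=0.15,
)

print(f"total per decision: ${invoice.total():.2f}")
print(f"execution share:    {invoice.execution_share():.0%}")
```

In this invented example, an execution-only budget would capture a quarter of the real cost per decision.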
Variance is the budget killer
Average cost per run is useful, but variance is what destroys budgets.
If a loan review usually costs $0.40 but occasionally costs $12 because the agent falls into a retrieval loop, finance cannot forecast the month. If an invoice check usually takes 20 seconds but sometimes runs for 11 minutes, operations cannot set service levels. If a claims packet sometimes causes the system to re-read every uploaded document on every step, engineering cannot reason about capacity.
This is why production buyers should ask for a cost distribution, not a single average.
The practical questions are simple:
- What does the median run cost?
- What does the 95th percentile run cost?
- What causes outliers?
- Which step consumes the most context?
- Which policy version changed the cost curve?
- Can a workflow stop before it exceeds a defined budget?
If a vendor cannot answer these questions, the buyer is not buying a production cost model. They are buying an experiment.
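The first two questions can be answered from nothing more than a list of per-run costs. A minimal sketch, using Python's standard `statistics` module and an invented sample in which most loan reviews cost $0.40 and a minority fall into a $12 retrieval loop:

```python
import statistics


def cost_distribution(run_costs: list[float]) -> dict[str, float]:
    """Summarize per-run costs as a distribution instead of a single average."""
    ordered = sorted(run_costs)
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    p95 = statistics.quantiles(ordered, n=100, method="inclusive")[94]
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": p95,
        "max": ordered[-1],
    }


# Hypothetical sample: 90 normal runs and 10 runs that hit a retrieval loop.
sample_costs = [0.40] * 90 + [12.00] * 10
summary = cost_distribution(sample_costs)

for name, value in summary.items():
    print(f"{name:>6}: ${value:.2f}")
```

For this sample the mean is about $1.56, the median is $0.40, and the 95th percentile is $12.00. The average describes a run that never actually happens, which is why a single number is not enough to forecast from.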
The architecture of a budgetable agent
AI agent cost controls are not a dashboard feature. They are an architecture feature.
A budgetable agent has a defined plan before runtime. It knows the expected steps, the allowed tools, the required evidence, the validation rules, and the escalation path. The runtime still uses models where judgment is needed, but it does not spend tokens deciding the workflow structure from scratch on every case.
That is the difference between budgeting a factory line and budgeting a conversation.
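As a rough illustration of what "a defined plan before runtime" can look like as data, here is a hypothetical sketch. The class names, fields, and the invoice example are assumptions made for this illustration and do not describe any particular product's internals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    kind: str                          # "model" or "deterministic"
    allowed_tools: tuple[str, ...] = ()


@dataclass(frozen=True)
class WorkflowPlan:
    name: str
    steps: tuple[Step, ...]
    required_evidence: tuple[str, ...]
    escalation_path: str

    def planned_model_calls(self) -> int:
        """Model calls the plan expects, before any retries."""
        return sum(1 for s in self.steps if s.kind == "model")

    def missing_evidence(self, provided: set[str]) -> list[str]:
        """Evidence that must be requested before the run can start."""
        return [e for e in self.required_evidence if e not in provided]


# Hypothetical invoice approval plan.
plan = WorkflowPlan(
    name="invoice_approval",
    steps=(
        Step("classify_document", "model"),
        Step("extract_fields", "model", allowed_tools=("ocr",)),
        Step("match_purchase_order", "deterministic", allowed_tools=("erp_lookup",)),
        Step("check_totals", "deterministic"),
        Step("resolve_ambiguity", "model"),
        Step("write_audit_record", "deterministic"),
    ),
    required_evidence=("invoice_pdf", "purchase_order"),
    escalation_path="accounts_payable_review",
)

print(plan.planned_model_calls())
print(plan.missing_evidence({"invoice_pdf"}))
```

Because the plan exists before the first token is spent, the expected number of model calls is known up front rather than discovered on the invoice.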
MightyBot is built around compiled execution for exactly this reason. Plain-English policies become structured workflows. Deterministic checks run as code. Retrieval targets the evidence needed for the current policy decision. Model calls are reserved for classification, extraction, ambiguity resolution, and narrative synthesis where language judgment actually matters.
The result is not just lower cost. It is a narrower cost distribution. A narrower distribution is what makes the workflow budgetable.
What to measure before deployment
Before an agent goes live, run a cost calibration pass.
Start with a representative test set: clean cases, messy cases, edge cases, large document packages, missing documents, conflicting values, and exceptions that should go to a human. Run each case through the workflow and record the cost profile by step.
For every workflow, capture:
- total model calls
- input tokens by step
- output tokens by step
- retrieved chunks and pages
- deterministic checks run
- tool calls made
- elapsed time
- confidence score
- exception reason
- human review result
Then compute cost per completed decision, not cost per isolated model response. A run that produces an answer nobody can use has not completed the decision.
This calibration pass gives finance a forecast, operations a service-level expectation, and engineering a set of outliers to fix before launch.
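The difference between cost per run and cost per completed decision is a change of denominator. A minimal sketch, assuming a simplified run record and invented input and output token rates:

```python
from dataclasses import dataclass

# Assumed prices in dollars per 1,000 tokens; substitute real contract rates.
INPUT_RATE_PER_1K = 0.001
OUTPUT_RATE_PER_1K = 0.004


@dataclass
class RunRecord:
    case_id: str
    model_calls: int
    input_tokens: int
    output_tokens: int
    completed: bool  # True only if the run produced a usable decision


def run_cost(run: RunRecord) -> float:
    return (run.input_tokens * INPUT_RATE_PER_1K
            + run.output_tokens * OUTPUT_RATE_PER_1K) / 1000


def cost_per_completed_decision(runs: list[RunRecord]) -> float:
    """Total spend across all runs divided by the runs that finished the decision."""
    completed = sum(1 for r in runs if r.completed)
    if completed == 0:
        raise ValueError("no completed decisions in this calibration set")
    return sum(run_cost(r) for r in runs) / completed


# Hypothetical calibration set: two usable runs and one that nobody could use.
runs = [
    RunRecord("clean-001", 6, 20_000, 2_000, completed=True),
    RunRecord("messy-002", 9, 60_000, 5_000, completed=True),
    RunRecord("edge-003", 14, 100_000, 3_000, completed=False),
]

per_run = sum(run_cost(r) for r in runs) / len(runs)
per_decision = cost_per_completed_decision(runs)
print(f"average cost per run:        ${per_run:.3f}")
print(f"cost per completed decision: ${per_decision:.3f}")
```

In this invented set the average run costs about $0.073, but each completed decision costs $0.11, because the spend on the unusable run still has to be paid for by the decisions that finished.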
The controls buyers should demand
Production AI buyers should require workflow-level controls.
There should be a maximum number of model calls per workflow. There should be a maximum amount of retrieved context per step. There should be a retry policy that is explicit rather than emergent. There should be model routing by task: small models for classification, deterministic code for calculations, stronger models for high-judgment synthesis.
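These controls are simple enough to express as explicit configuration. The sketch below is hypothetical: the limit values, the route names, and the use of whitespace word counts as a stand-in for a real tokenizer are all assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkflowLimits:
    max_model_calls: int = 12
    max_context_tokens_per_step: int = 8000
    max_retries_per_step: int = 1  # explicit, not emergent


# Hypothetical routing table: task type to the cheapest adequate executor.
ROUTES = {
    "classification": "small-model",
    "calculation": "deterministic-code",
    "synthesis": "strong-model",
}


def route(task_type: str) -> str:
    """Refuse unknown task types instead of silently using the largest model."""
    if task_type not in ROUTES:
        raise ValueError(f"no route defined for task type: {task_type}")
    return ROUTES[task_type]


def trim_context(chunks: list[str], limit_tokens: int) -> list[str]:
    """Keep retrieved chunks in order until the per-step budget would be exceeded."""
    kept, used = [], 0
    for chunk in chunks:
        size = len(chunk.split())  # crude proxy for a real token count
        if used + size > limit_tokens:
            break
        kept.append(chunk)
        used += size
    return kept


limits = WorkflowLimits()
print(route("classification"))
print(trim_context(["a b c", "d e", "f g h i"], limit_tokens=5))
```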
There should also be budget alerts that are tied to the business unit, policy version, and workflow type. A spend spike should not appear as a mysterious monthly invoice. It should point to a specific policy, document type, step, and exception pattern.
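Attribution of this kind is a grouping problem. A minimal sketch, assuming each cost event is logged as a dictionary with hypothetical field names, and that a spike means spend above 1.5 times a stored baseline:

```python
from collections import defaultdict


def spend_by(events: list[dict], keys: tuple[str, ...]) -> dict[tuple, float]:
    """Total cost grouped by the chosen attribution fields."""
    totals: dict[tuple, float] = defaultdict(float)
    for event in events:
        totals[tuple(event[k] for k in keys)] += event["cost"]
    return dict(totals)


def find_spikes(current: dict[tuple, float],
                baseline: dict[tuple, float],
                threshold: float = 1.5) -> dict[tuple, float]:
    """Groups whose spend exceeds threshold x baseline; unseen groups always flag."""
    return {
        key: cost
        for key, cost in current.items()
        if cost > threshold * baseline.get(key, 0.0)
    }


# Hypothetical cost events for one reporting period.
events = [
    {"business_unit": "lending", "policy_version": "v3", "step": "retrieve", "cost": 4.0},
    {"business_unit": "lending", "policy_version": "v3", "step": "retrieve", "cost": 5.0},
    {"business_unit": "lending", "policy_version": "v2", "step": "retrieve", "cost": 1.0},
]
baseline = {("v3", "retrieve"): 2.0, ("v2", "retrieve"): 1.0}

current = spend_by(events, keys=("policy_version", "step"))
spikes = find_spikes(current, baseline)
print(spikes)
```

Here the alert names policy version v3 and the retrieval step, which is a far more useful starting point than a larger number on the monthly bill.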
Most importantly, the workflow should be able to stop. If required evidence is missing, the agent should request the missing evidence instead of spending more tokens guessing. If a policy exception is ambiguous, the agent should escalate instead of retrying. If a run exceeds its expected bounds, the system should preserve the state and hand it to a reviewer.
Cost control and compliance control are the same muscle. Both require bounded execution.
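Bounded execution can be sketched as a loop that checks its limits before every step and returns its state when it stops. This is a simplified illustration: each step is a hypothetical `(name, cost, status)` tuple standing in for real step execution, and the outcome names are assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    COMPLETED = "completed"
    NEEDS_EVIDENCE = "needs_evidence"
    ESCALATED = "escalated"
    BUDGET_EXCEEDED = "budget_exceeded"


@dataclass
class RunState:
    model_calls: int = 0
    spent: float = 0.0
    completed_steps: list[str] = field(default_factory=list)


def run_workflow(steps, max_model_calls: int, max_spend: float):
    """Run steps in order; stop and preserve state instead of guessing or retrying."""
    state = RunState()
    for name, cost, status in steps:
        if status == "missing_evidence":
            return Outcome.NEEDS_EVIDENCE, state   # request it, do not guess
        if status == "ambiguous":
            return Outcome.ESCALATED, state        # hand to a human, do not retry
        over_calls = state.model_calls + 1 > max_model_calls
        over_spend = state.spent + cost > max_spend
        if over_calls or over_spend:
            return Outcome.BUDGET_EXCEEDED, state  # state goes to a reviewer
        state.model_calls += 1
        state.spent += cost
        state.completed_steps.append(name)
    return Outcome.COMPLETED, state


# Hypothetical run whose third step would push spend past the budget.
steps = [
    ("classify", 0.01, "ok"),
    ("extract", 0.05, "ok"),
    ("judge", 0.20, "ok"),
]
outcome, state = run_workflow(steps, max_model_calls=10, max_spend=0.10)
print(outcome.value, state.completed_steps)
```

The same loop serves both purposes: the budget check bounds cost, and the evidence and ambiguity checks bound what the agent is allowed to decide on its own.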
The CFO question
The CFO does not need to know the price of one million tokens. The CFO needs to know what it costs to process one claim, one invoice, one credit file, one covenant package, or one compliance review at production volume.
That answer should include a range, not only an average. It should identify what drives expensive cases. It should show how policy changes affect spend. It should separate deterministic work from model work. It should show the cost of exceptions and the cost avoided by automation.
If the vendor cannot explain cost per decision, the vendor cannot explain ROI.
AI agent budgets will not mature around cheaper tokens. They will mature around controlled execution. The teams that learn this first will have a material advantage: they will deploy agents into real workflows while everyone else is still negotiating a token rate card.