AI agents are running in production, but most teams can’t answer a basic question: why did the agent do that? Traditional observability tools were built for request/response systems. Agents need something different.
Software teams have spent a decade building muscle around observability. Metrics, logs, traces: the three pillars work well for microservices, APIs, and web applications. You instrument your code, ship telemetry to Datadog or Grafana, and when something breaks, you trace the request from ingress to database and back.
AI agents break every assumption that model relies on. An agent doesn’t process a request and return a response. It receives an input, reasons about it, decides which tools to call, executes a multi-step workflow, and produces an output that may or may not be correct. The workflow might branch. It might loop. It might call an LLM four times or forty times depending on the complexity of the input. Traditional APM tools can tell you the agent ran. They cannot tell you why it made the decisions it made.
This is the observability gap. You can see latency, error rates, and throughput. You cannot see which tools the agent considered but rejected, how many tokens it consumed at each step, whether its reasoning was sound, or whether the output was actually correct. Closing this gap is the difference between running agents in production and running agents in production safely.
Why Traditional Observability Falls Short
Request/response tracing assumes a linear flow: request comes in, hits service A, calls service B, queries a database, returns a response. You can model this as a directed acyclic graph. Every span has a clear parent, and the trace has a clear start and end.
Agent workflows don’t follow this pattern. An agent might receive an invoice, extract fields using an LLM, realize it needs additional context, query a knowledge base, re-evaluate its extraction based on the new context, validate the results against a schema, fail validation on two fields, retry those specific fields with a different prompt strategy, and finally produce output. That’s not a DAG. That’s a decision tree with loops and conditional branches.
A Datadog trace of this workflow shows you API latency for each external call. It does not show you that the agent chose to call the wrong API because it misinterpreted the input. It does not show you that the agent retried a field extraction three times, burning tokens each time, because the initial prompt was ambiguous. The trace looks green (no errors, acceptable latency) while the agent produced incorrect output. This is the core problem: traditional observability measures system health, not agent correctness.
The Three Pillars of Agent Observability
Agent observability requires three capabilities that traditional tools don’t provide.
Execution traces with decision context. Not just “what happened” but “what was considered.” At each decision point, log the agent’s reasoning: what alternatives it evaluated, what evidence it used, and why it chose the path it did. When an agent selects tool A over tool B, you need to know what information drove that selection. Without this, debugging agent failures means guessing.
Token accounting. Track token usage per step, per LLM call, per workflow. This isn’t just cost management. Token consumption is a proxy for agent efficiency. If a workflow that normally uses 2,000 tokens suddenly uses 20,000, something changed: the input is unusual, the agent is retrying, or a prompt is degrading. Trending token costs per workflow is one of the most reliable early warning signals for agent problems.
Output validation telemetry. Automated checks on agent outputs, tracked as first-class observability data. Did the extracted data match the expected schema? Did the agent call the right tools in the right order? Did the output pass business rules? Validation results should flow into the same observability pipeline as logs and traces, not sit in a separate system.
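As a rough sketch of what this can look like, the snippet below validates an extraction result against a schema and emits the outcome as a structured log entry in the same pipeline as everything else; the schema, field names, and logger setup are illustrative assumptions, not a prescribed format.

```python
import json
import logging
import time

from jsonschema import ValidationError, validate  # assumes the jsonschema package is available

logger = logging.getLogger("agent.validation")

# Hypothetical schema for an invoice-extraction workflow
INVOICE_SCHEMA = {
    "type": "object",
    "required": ["vendor_name", "total", "line_items"],
    "properties": {
        "vendor_name": {"type": "string"},
        "total": {"type": "number"},
        "line_items": {"type": "array", "minItems": 1},
    },
}

def record_validation(workflow_id: str, output: dict) -> bool:
    """Run schema checks on agent output and emit the result as observability data."""
    try:
        validate(instance=output, schema=INVOICE_SCHEMA)
        passed, detail = True, None
    except ValidationError as err:
        passed, detail = False, err.message
    # One structured entry per validation check, in the same pipeline as logs and traces
    logger.info(json.dumps({
        "event": "output_validation",
        "workflow_id": workflow_id,
        "passed": passed,
        "detail": detail,
        "ts": time.time(),
    }))
    return passed
```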
Structured Logging for Agent Workflows
Standard log lines are nearly useless for debugging agents. A log that says "Processing document invoice_4821.pdf" tells you nothing about why the agent made the choices it made. Agent logs need to be structured around decisions, not events.
Every decision point in an agent workflow should emit a structured log entry with these fields:
- Input state: what data the agent had when it made the decision
- Policy evaluated: which rule, prompt, or instruction guided the decision
- Decision made: what the agent chose to do
- Evidence used: what specific input data influenced the choice
- Confidence score: how certain the agent was (if the model provides calibrated confidence)
- Output state: what changed as a result of the decision
```json
{
  "step": "field_extraction",
  "field": "vendor_name",
  "input_state": {"document_type": "invoice", "page_count": 2},
  "policy": "extract_vendor_v3",
  "decision": "extracted_from_header",
  "evidence": "Found 'Acme Corp' in document header region",
  "confidence": 0.94,
  "output": {"vendor_name": "Acme Corp"},
  "tokens_used": 340,
  "model": "gpt-4o",
  "latency_ms": 820
}
```
These structured logs are queryable. You can find every instance where confidence dropped below 0.8, every case where the agent chose an unusual extraction strategy, every workflow where a specific policy version was active. This turns debugging from “read through thousands of log lines” into “query for the anomaly.”
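As a small sketch, assuming the decision entries are exported as JSON Lines, those queries become one-line filters; the file path below is hypothetical.

```python
import json

# Hypothetical log export: one structured decision entry per line
with open("agent_decisions.jsonl") as f:
    decisions = [json.loads(line) for line in f]

# Every decision where confidence dropped below 0.8
low_confidence = [d for d in decisions if d.get("confidence", 1.0) < 0.8]

# Every decision made under a specific policy version
with_policy_v3 = [d for d in decisions if d.get("policy") == "extract_vendor_v3"]

print(f"{len(low_confidence)} low-confidence decisions, {len(with_policy_v3)} under extract_vendor_v3")
```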
Distributed Tracing for Multi-Step Agents
Agent workflows span multiple services, APIs, and LLM calls. A single workflow might hit a document parser, two different LLMs, a vector database, a validation service, and an output API. You need distributed tracing to follow the full execution path.
Use OpenTelemetry-compatible tracing, but extend spans with agent-specific attributes. Every span in an agent trace should include:
- Step name and step type (LLM call, tool call, validation, decision)
- Policy version (which version of the agent’s instructions was active)
- Token count (input tokens, output tokens, total)
- Model identifier (which model was called, including version)
- Confidence score (for LLM steps that produce structured output)
This lets you trace from a business-level problem (“invoice 4821 was processed incorrectly”) all the way back to the technical root cause (“model returned 0.3 confidence on the line_items field at step 3, but the threshold was set to 0.25, so it passed validation when it shouldn’t have”). Without agent-aware tracing, that investigation takes hours of manual log correlation. With it, you query for the trace ID and see the full decision chain in seconds.
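A minimal sketch of an agent-aware span using the OpenTelemetry Python SDK follows; the attribute names and the call_llm_for_field helper are illustrative, not an established convention.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.workflow")

def extract_field(document: dict, field: str) -> dict:
    with tracer.start_as_current_span("field_extraction") as span:
        # Agent-specific attributes alongside the usual timing data
        span.set_attribute("agent.step_type", "llm_call")
        span.set_attribute("agent.policy_version", "extract_vendor_v3")
        span.set_attribute("agent.model", "gpt-4o")

        result = call_llm_for_field(document, field)  # hypothetical LLM helper

        span.set_attribute("agent.tokens.input", result["input_tokens"])
        span.set_attribute("agent.tokens.output", result["output_tokens"])
        span.set_attribute("agent.confidence", result["confidence"])
        return result
```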
Propagate trace context across async boundaries. If an agent queues work for later processing, the trace context needs to follow. Otherwise you end up with disconnected trace fragments that are impossible to correlate.
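One way to carry that context over a queue boundary, sketched with OpenTelemetry's propagation helpers; the queue and handle function are stand-ins for your own infrastructure.

```python
from opentelemetry import trace
from opentelemetry.propagate import extract, inject

tracer = trace.get_tracer("agent.workflow")

def enqueue_work(queue, payload: dict) -> None:
    headers: dict = {}
    inject(headers)  # serialize the current trace context into the message
    queue.put({"payload": payload, "otel_headers": headers})

def process_work(queue) -> None:
    message = queue.get()
    ctx = extract(message["otel_headers"])
    # The async step becomes a child span of the original workflow trace
    with tracer.start_as_current_span("async_step", context=ctx):
        handle(message["payload"])  # hypothetical work handler
```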
Token Budgets and Cost Alerts
Every agent workflow should have a token budget. This is not primarily about cost control (though that matters). It’s about anomaly detection.
A well-tuned agent workflow has a predictable token consumption pattern. Invoice processing might use 1,500 to 3,000 tokens depending on document complexity. If a run consumes 30,000 tokens, something is wrong. The agent might be stuck in a retry loop. It might be processing unexpected input that causes it to branch into expensive reasoning paths. It might be calling the LLM repeatedly because its initial extraction failed validation.
Set token budgets per workflow type and alert when a run exceeds the budget. This catches problems that traditional error monitoring misses. A workflow that succeeds but burns 10x the normal tokens is a problem: it’s either doing unnecessary work, or it’s producing correct output through brute force rather than efficient execution. Both are worth investigating.
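A sketch of that budget check; the budget table and alert hook are assumptions about your own setup.

```python
# Hypothetical token budgets per workflow type
TOKEN_BUDGETS = {"invoice_processing": 3_000, "contract_review": 12_000}

def check_token_budget(workflow_type: str, tokens_used: int, alert) -> None:
    """Alert when a single run exceeds its budget, even if the run succeeded."""
    budget = TOKEN_BUDGETS.get(workflow_type)
    if budget is not None and tokens_used > budget:
        alert(
            f"{workflow_type} used {tokens_used} tokens (budget {budget}); "
            "check for retry loops or unusual input"
        )
```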
Track three metrics: tokens per successful run (is efficiency stable?), tokens per failed run (are failures expensive?), and total token spend per workflow type per day (is overall cost trending in the right direction?). A sudden spike in any of these indicates a regression.
Debugging Non-Deterministic Behavior
The hardest problem in agent observability: the same input can produce different outputs. Run the same document through the same agent twice, and you might get slightly different extractions. This is inherent to LLM-based systems, and it makes traditional debugging techniques (reproduce the bug, find the root cause) much harder.
Three techniques help.
Input fingerprinting and output hashing. Compute a hash of every input and every output. When you see divergent outputs for the same input fingerprint, flag it for review. Over time, this builds a dataset of inputs that produce unstable outputs, which tells you where your agent needs better prompts, stricter validation, or deterministic fallbacks.
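A minimal fingerprinting sketch; canonical JSON serialization keeps the hash stable across key ordering, and the seen_outputs store is a stand-in for whatever database you use.

```python
import hashlib
import json

def fingerprint(obj: dict) -> str:
    """Stable hash of a JSON-serializable input or output."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def flag_divergence(seen_outputs: dict, input_obj: dict, output_obj: dict) -> bool:
    """Return True when the same input fingerprint maps to a different output hash."""
    key, out_hash = fingerprint(input_obj), fingerprint(output_obj)
    previous = seen_outputs.setdefault(key, out_hash)
    return previous != out_hash
```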
Workflow replay. Record the full input state (document, context, configuration) for every workflow run. When a failure occurs, replay the workflow with identical input to test reproducibility. If the failure reproduces, it’s a systematic issue. If it doesn’t, it’s non-deterministic LLM behavior that needs guardrails.
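A sketch of the record-and-replay loop, assuming a simple file store and a run_workflow entry point of your own.

```python
import json
from pathlib import Path

RECORD_DIR = Path("workflow_runs")  # hypothetical storage location

def record_run(run_id: str, document: dict, context: dict, config: dict) -> None:
    """Persist the full input state so a failing run can be replayed later."""
    RECORD_DIR.mkdir(exist_ok=True)
    state = {"document": document, "context": context, "config": config}
    (RECORD_DIR / f"{run_id}.json").write_text(json.dumps(state))

def replay(run_id: str, run_workflow) -> dict:
    """Re-run the workflow with identical input to test whether a failure reproduces."""
    state = json.loads((RECORD_DIR / f"{run_id}.json").read_text())
    return run_workflow(state["document"], state["context"], state["config"])
```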
Hybrid execution isolation. Separate deterministic steps from non-deterministic steps in your tracing. MightyBot’s compiled execution approach makes this particularly tractable: deterministic, code-based steps always produce the same output for the same input, so when output diverges, you know the variance originated in an LLM step. This narrows the debugging surface dramatically. Instead of investigating the entire workflow, you investigate only the non-deterministic steps that varied.
Dashboards That Matter
Most agent dashboards show vanity metrics: number of agents running, total workflows completed, average latency. These tell you the system is alive. They don’t tell you the system is working correctly.
Build dashboards around these metrics instead:
- Success rate by workflow type. Not just “did it complete without errors” but “did it produce validated, correct output.” A workflow that completes but fails output validation is not a success.
- Token cost per workflow, trending over time. Rising token costs mean degrading efficiency. This is often the first sign that prompts need tuning or that input patterns have shifted.
- Exception rate by policy version. When you update an agent’s policies, track whether the new version produces more or fewer exceptions. This is your A/B test for agent behavior changes.
- Mean time to detect agent errors. How long between an agent producing bad output and your team discovering it? This measures the effectiveness of your entire observability stack.
- Validation pass rate by field and document type. Granular accuracy metrics that tell you exactly where the agent struggles.
The goal is a dashboard where a single glance tells you: are our agents healthy, accurate, and cost-efficient? If any of those dimensions degrades, the dashboard should make it obvious within minutes, not days.
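As one possible backing for those panels, the sketch below emits validated-success and token metrics with Prometheus-style instruments; the metric and label names are illustrative, not a standard.

```python
from prometheus_client import Counter, Histogram  # assumes prometheus_client is installed

WORKFLOWS = Counter(
    "agent_workflows_total",
    "Workflow outcomes by type, policy version, and validated result",
    ["workflow_type", "policy_version", "outcome"],  # outcome: validated_success | validation_failed | error
)
TOKENS = Histogram(
    "agent_workflow_tokens",
    "Token consumption per workflow run",
    ["workflow_type"],
    buckets=(500, 1_000, 2_000, 4_000, 8_000, 16_000, 32_000),
)

def record_run_result(workflow_type: str, policy_version: str, outcome: str, tokens: int) -> None:
    WORKFLOWS.labels(workflow_type, policy_version, outcome).inc()
    TOKENS.labels(workflow_type).observe(tokens)
```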
Related Reading
- Building AI Agents That Don’t Hallucinate: Ground truth validation and retrieval strategies that complement observability
- Fault-Tolerant AI Agent Pipelines: Designing agent architectures that recover gracefully from failures
- What Are AI Agent Guardrails?: Policy enforcement and safety checks for autonomous agent systems
- What Is Deterministic AI?: How compiled execution reduces the non-deterministic surface area of agent workflows
Frequently Asked Questions
Can I use existing APM tools like Datadog or New Relic for agent observability?
They’re a starting point, not a solution. Traditional APM tools capture latency, error rates, and throughput, which you still need. But they lack agent-specific primitives: decision traces, token accounting, confidence scores, and output validation. Use your APM for infrastructure-level monitoring, then layer agent-specific observability on top using structured logs and custom OpenTelemetry spans.
How much logging overhead is acceptable for production agent workflows?
Decision-level structured logging adds minimal latency (sub-millisecond per log entry) but can generate significant data volume. A typical agent workflow might produce 20 to 50 structured log entries. The cost is storage, not performance. Compress and sample judiciously: log every decision for failed workflows, sample at 10 to 20 percent for successful ones. Never skip logging for workflows that exceed their token budget.
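A sketch of that sampling policy, using the thresholds suggested above:

```python
import random

def should_keep_decision_logs(succeeded: bool, tokens_used: int, token_budget: int,
                              sample_rate: float = 0.15) -> bool:
    """Keep full decision logs for failures and budget overruns; sample the rest."""
    if not succeeded:
        return True  # always log failed workflows
    if tokens_used > token_budget:
        return True  # never skip workflows that exceeded their token budget
    return random.random() < sample_rate  # roughly 10 to 20 percent of successful runs
```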
What’s the most important metric to track for AI agents in production?
Validated success rate: the percentage of workflows that complete and produce output that passes all validation checks. Raw completion rate is misleading because agents can “succeed” while producing incorrect output. If you can only track one metric, track how often your agent produces correct, validated results.
How do you handle debugging when LLM providers don’t expose reasoning traces?
You instrument around the black box. Log the exact prompt sent, the full response received, and the decision your agent made based on that response. Track input/output pairs over time to build a behavioral profile. When outputs diverge unexpectedly, you may not know why the model responded differently, but you can identify which inputs trigger instability and add deterministic handling for those cases.