Agent Observability: Tracing and Debugging
Traditional software observability — logs, metrics, and distributed tracing — falls short for LLM agents. An agent might make 8 LLM calls, invoke 5 tools, backtrack twice, and spend $0.47 in tokens before returning a single answer. When that answer is wrong, you need to reconstruct the reasoning trajectory, not just the HTTP call chain. This post covers the tools and techniques that make multi-step agent execution inspectable, debuggable, and cost-accountable.
Why Agent Observability Matters
Agents are non-deterministic, multi-step systems with emergent execution paths. Unlike a REST API where the call graph is fixed at compile time, an agent's execution graph is generated at runtime by the LLM itself. This creates unique observability challenges:
- Variable execution depth — the same prompt can trigger 2 steps or 20 steps depending on the model's reasoning
- Hidden costs — each LLM call consumes tokens; without accounting you can't budget or optimize
- Silent failures — the agent may return a plausible but wrong answer without raising any exception
- Tool interaction bugs — errors often emerge at the boundary between the LLM's output and the tool's expected input format
- Latency attribution — a 30-second response could be 2s of LLM time and 28s of tool execution, or the reverse
The observability hierarchy for agents, from most to least critical:
- Trace every LLM call — input prompt, output completion, token counts, latency, model used
- Trace tool invocations — tool name, input arguments, output, success/failure, latency
- Link spans into runs — group all calls from a single user request into one trace
- Compute cost — multiply token counts by per-model pricing, aggregate per user/run/day
- Evaluate quality — score the final output against ground truth or LLM-as-judge criteria
LangSmith: Tracing & Evaluation
LangSmith is LangChain's hosted observability platform. It captures every LLM call, chain step, and tool invocation as nested spans in a trace tree. If you're already using LangChain, integration is essentially zero-config — set two environment variables and every chain execution is automatically traced.
```python
# LangSmith setup — just environment variables
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
os.environ["LANGCHAIN_PROJECT"] = "customer-support-agent"

# That's it. Every LangChain call is now traced.
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor

llm = ChatOpenAI(model="gpt-4o")
agent = AgentExecutor(agent=agent_runnable, tools=tools)

# This run appears in LangSmith with full trace tree
result = agent.invoke({"input": "What's the refund policy for order #1234?"})
```
LangSmith organizes data into three core concepts:
- Runs — a single execution of a chain/agent, containing nested child runs for each step
- Traces — the root run plus all descendant runs, visualized as a tree or timeline
- Datasets & Evaluators — test sets paired with scoring functions for regression testing agent quality
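The datasets-and-evaluators workflow looks roughly like the sketch below. It assumes the langsmith SDK's Client and evaluate helpers (exact signatures vary by SDK version), and the dataset name, example values, and contains_expected evaluator are illustrative placeholders:

```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Build a small regression dataset of query -> expected answer pairs
dataset = client.create_dataset("support-agent-regressions")
client.create_examples(
    inputs=[{"input": "What's the refund policy for order #1234?"}],
    outputs=[{"expected": "30-day refund window"}],
    dataset_id=dataset.id,
)

# A simple evaluator: does the agent's answer contain the reference phrase?
def contains_expected(run, example):
    answer = run.outputs["output"]
    return {"key": "contains_expected", "score": int(example.outputs["expected"] in answer)}

# Run the agent over the dataset and score every output
evaluate(
    lambda inputs: {"output": agent.invoke(inputs)["output"]},
    data="support-agent-regressions",
    evaluators=[contains_expected],
)
```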
For non-LangChain code, use the @traceable decorator to manually instrument functions:
```python
from langsmith import traceable

@traceable(run_type="chain", name="customer_support_agent")
def handle_support_query(query: str) -> str:
    # Step 1: Classify intent
    intent = classify_intent(query)
    # Step 2: Retrieve relevant docs
    docs = retrieve_context(query, intent)
    # Step 3: Generate response
    response = generate_response(query, docs, intent)
    return response

@traceable(run_type="llm")
def classify_intent(query: str) -> str:
    # Traced as a child span under the parent chain
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Classify intent: {query}"}]
    )
    return response.choices[0].message.content

@traceable(run_type="retriever")
def retrieve_context(query: str, intent: str) -> list:
    # Traced as a retriever span with input/output docs
    return vector_store.similarity_search(query, k=5)
```
LangFuse: Open-Source Alternative
LangFuse is an open-source LLM observability platform that provides tracing, prompt management, and evaluation without vendor lock-in. You can self-host it (Docker Compose or Kubernetes) or use their managed cloud. The key advantage: your trace data stays in your infrastructure.
| | LangSmith | LangFuse |
|---|---|---|
| Hosting | Managed SaaS only | Self-hosted or managed cloud |
| LangChain integration | Zero-config, automatic | Callback handler (one line; see the sketch below) |
| Non-LangChain code | @traceable decorator + SDK | @observe decorator + low-level SDK |
| Evaluation | Built-in datasets, evaluators, comparison views | Scoring API, annotation queues, model-based evals |
| Prompt management | Hub for versioned prompts | Built-in versioned prompt registry |
| Pricing | Free tier (5K traces/mo), paid plans scale | Open source (self-host free), cloud has free tier |
| Best for | LangChain-native teams wanting a turnkey solution | Teams needing data sovereignty or custom infra |
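That one-line LangChain integration is LangFuse's callback handler. A minimal sketch, assuming the v2-style Python SDK import path (newer SDK versions may expose the handler under a different module) and the agent from the LangSmith example above:

```python
from langfuse.callback import CallbackHandler

# Reads LANGFUSE_PUBLIC_KEY / LANGFUSE_SECRET_KEY / LANGFUSE_HOST from the environment
langfuse_handler = CallbackHandler()

# Pass the handler as a callback and every chain/agent step is traced to LangFuse
result = agent.invoke(
    {"input": "What's the refund policy for order #1234?"},
    config={"callbacks": [langfuse_handler]},
)
```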
Outside of LangChain, LangFuse's tracing model is slightly different: rather than relying on framework auto-instrumentation, you mark the functions you care about with the @observe decorator (or use the low-level SDK) to create traces, spans, and generations explicitly:
```python
from langfuse.decorators import observe, langfuse_context

@observe()
def agent_pipeline(user_query: str) -> str:
    # Root trace created automatically by @observe
    # Step 1: Plan
    plan = create_plan(user_query)

    # Step 2: Execute tools, collecting one result per plan step
    results = [execute_tool(step) for step in plan.steps]

    # Step 3: Synthesize
    answer = synthesize_answer(user_query, results)

    # Attach metadata and scores
    langfuse_context.update_current_trace(
        user_id="user_abc",
        metadata={"plan_steps": len(plan.steps)},
        tags=["production", "v2.1"]
    )
    return answer

@observe(as_type="generation")
def create_plan(query: str) -> Plan:
    # Tracked as an LLM generation with token counts
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": PLANNER_PROMPT},
                  {"role": "user", "content": query}]
    )
    # LangFuse auto-extracts token usage from OpenAI responses
    langfuse_context.update_current_observation(
        model="gpt-4o",
        usage={"input": response.usage.prompt_tokens,
               "output": response.usage.completion_tokens}
    )
    return parse_plan(response.choices[0].message.content)
```
Custom Tracing Implementation
When you can't use LangSmith or LangFuse — perhaps due to air-gapped environments, compliance requirements, or you're building on a non-LangChain framework — you need custom tracing. The core idea: wrap every LLM call and tool invocation in a span that records inputs, outputs, timing, and token usage, then ship those spans to your existing observability backend.
```python
import time, uuid, json
from dataclasses import dataclass, field
from typing import Any, Optional
from contextlib import contextmanager
import threading

@dataclass
class Span:
    span_id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])
    trace_id: str = ""
    parent_id: Optional[str] = None
    name: str = ""
    span_type: str = "generic"  # "llm" | "tool" | "retriever" | "chain"
    start_time: float = 0.0
    end_time: float = 0.0
    input_data: Any = None
    output_data: Any = None
    tokens_in: int = 0
    tokens_out: int = 0
    model: str = ""
    error: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000

class ConsoleExporter:
    """Default exporter: print finished spans as JSON lines."""
    def export(self, span: Span):
        print(json.dumps({"name": span.name, "type": span.span_type,
                          "duration_ms": span.duration_ms, "error": span.error}))

# Thread-local storage for trace context propagation
_context = threading.local()

class Tracer:
    def __init__(self, exporter=None):
        self.spans: list[Span] = []
        self.exporter = exporter or ConsoleExporter()

    @contextmanager
    def span(self, name: str, span_type: str = "generic", **kwargs):
        span = Span(
            name=name,
            span_type=span_type,
            trace_id=getattr(_context, "trace_id", str(uuid.uuid4())[:8]),
            parent_id=getattr(_context, "current_span_id", None),
            start_time=time.time(),
            **kwargs
        )
        prev_span_id = getattr(_context, "current_span_id", None)
        _context.current_span_id = span.span_id
        try:
            yield span
        except Exception as e:
            span.error = str(e)
            raise
        finally:
            span.end_time = time.time()
            self.spans.append(span)
            self.exporter.export(span)
            _context.current_span_id = prev_span_id
```
Usage looks like this — every LLM call and tool invocation gets wrapped:
```python
# OTLPExporter is any custom exporter object with an .export(span) method
tracer = Tracer(exporter=OTLPExporter("http://jaeger:4317"))

def run_agent(query: str) -> str:
    with tracer.span("agent_run", span_type="chain") as root:
        root.input_data = query
        _context.trace_id = root.trace_id

        # LLM call — traced with token counts
        with tracer.span("plan", span_type="llm") as s:
            resp = openai_client.chat.completions.create(
                model="gpt-4o",
                messages=[...]
            )
            s.tokens_in = resp.usage.prompt_tokens
            s.tokens_out = resp.usage.completion_tokens
            s.model = "gpt-4o"
            s.output_data = resp.choices[0].message.content

        # Tool call — traced with input/output
        with tracer.span("tool:search_db", span_type="tool") as s:
            s.input_data = {"query": "order #1234"}
            result = db.search("order #1234")
            s.output_data = result

        # ... further steps produce final_answer ...
        root.output_data = final_answer
    return final_answer
```
For a decorator-based approach that's less verbose:
```python
import functools

# Module-level tracer shared by the decorator
_global_tracer = Tracer()

def trace(name: str = None, span_type: str = "generic"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span_name = name or func.__name__
            with _global_tracer.span(span_name, span_type) as s:
                s.input_data = {"args": str(args), "kwargs": str(kwargs)}
                result = func(*args, **kwargs)
                s.output_data = str(result)[:500]  # Truncate large outputs
                return result
        return wrapper
    return decorator

# Clean usage:
@trace(span_type="llm")
def classify_intent(query: str) -> str: ...

@trace(span_type="tool")
def search_knowledge_base(query: str) -> list: ...

@trace(span_type="chain")
def run_agent(user_input: str) -> str: ...
```
Token Accounting & Cost Tracking
Token accounting is the financial observability layer for LLM agents. Without it, you're flying blind on costs. A single agent run can cost anywhere from $0.002 (simple GPT-4o-mini classification) to $2.50+ (complex multi-step GPT-4o reasoning with tool use). At scale, the difference between "we optimized our prompts" and "we didn't" can be $50K/month.
Here's a practical token accounting implementation:
```python
from dataclasses import dataclass, field
from collections import defaultdict

# Pricing per 1M tokens (as of mid-2024)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "claude-3.5": {"input": 3.00, "output": 15.00},
    "claude-haiku": {"input": 0.25, "output": 1.25},
}

@dataclass
class TokenLedger:
    """Per-run token accounting."""
    entries: list = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int, step_name: str = ""):
        pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
        cost = (input_tokens * pricing["input"] +
                output_tokens * pricing["output"]) / 1_000_000
        self.entries.append({
            "model": model,
            "step": step_name,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": cost,
        })

    @property
    def total_cost(self) -> float:
        return sum(e["cost_usd"] for e in self.entries)

    @property
    def cost_by_model(self) -> dict:
        breakdown = defaultdict(float)
        for e in self.entries:
            breakdown[e["model"]] += e["cost_usd"]
        return dict(breakdown)

    def summary(self) -> str:
        total_in = sum(e["input_tokens"] for e in self.entries)
        total_out = sum(e["output_tokens"] for e in self.entries)
        return (f"Tokens: {total_in} in + {total_out} out | "
                f"Cost: ${self.total_cost:.4f} | "
                f"Steps: {len(self.entries)}")
```
One nuance worth internalizing: input tokens (usage.prompt_tokens) include the entire context window — system prompt, conversation history, and tool definitions. In a multi-turn agent, the context grows with every step. A 5-step agent might use 15K total input tokens even if each individual message is only 200 tokens, because the full history is sent on every call.
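To see why, here is a worked sketch assuming a 2,400-token base context (system prompt plus tool schemas) and roughly 200 new tokens per turn; both numbers are illustrative:

```python
BASE_CONTEXT = 2400   # system prompt + tool definitions (assumed)
PER_TURN = 200        # tokens added to the history each step (assumed)

total_input = 0
for step in range(1, 6):
    call_input = BASE_CONTEXT + PER_TURN * step   # full history is resent on every call
    total_input += call_input
    print(f"step {step}: {call_input} input tokens")

print(f"total input tokens across 5 steps: {total_input}")
# → 2600 + 2800 + 3000 + 3200 + 3400 = 15000 input tokens
```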
Integrate the ledger into your agent loop:
```python
ledger = TokenLedger()

def agent_step(messages: list, tools: list, step_name: str):
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )
    # Record token usage for this step
    ledger.record(
        model="gpt-4o",
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        step_name=step_name
    )
    return response

# After agent completes:
print(ledger.summary())
# → Tokens: 8420 in + 1230 out | Cost: $0.0334 | Steps: 4

# Ship to your metrics system
total_in = sum(e["input_tokens"] for e in ledger.entries)
metrics.gauge("agent.cost_usd", ledger.total_cost, tags=["agent:support"])
metrics.histogram("agent.tokens.input", total_in, tags=["agent:support"])
```
Debugging Agent Failures
Agent failures are fundamentally different from traditional software bugs. The code doesn't crash — the LLM returns a plausible but incorrect answer, calls the wrong tool, or loops indefinitely. Debugging requires reasoning trajectory analysis, not stack traces.
Common failure modes include wrong tool selection, malformed tool arguments, infinite reasoning loops, and plausible-but-wrong final answers. All of them require the full reasoning trajectory, not a stack trace, to diagnose.
A practical debugging helper that captures enough context for post-mortem analysis:
```python
import json, datetime

class AgentDebugger:
    """Records full execution trace for post-mortem analysis."""

    def __init__(self):
        self.steps = []
        self.start_time = datetime.datetime.utcnow()

    def log_llm_call(self, step_name: str, messages: list, response, model: str):
        self.steps.append({
            "type": "llm",
            "step": step_name,
            "model": model,
            "messages": messages,
            "output": response.choices[0].message.model_dump(),
            "tokens": {
                "input": response.usage.prompt_tokens,
                "output": response.usage.completion_tokens
            },
            "finish_reason": response.choices[0].finish_reason,
            "timestamp": datetime.datetime.utcnow().isoformat()
        })

    def log_tool_call(self, tool_name: str, args: dict, result, error: str = None):
        self.steps.append({
            "type": "tool",
            "tool": tool_name,
            "args": args,
            "result": str(result)[:2000],
            "error": error,
            "timestamp": datetime.datetime.utcnow().isoformat()
        })

    def dump(self, path: str = None):
        """Export trace for offline analysis."""
        trace = {
            "start": self.start_time.isoformat(),
            "steps": self.steps,
            "total_steps": len(self.steps),
            "llm_calls": len([s for s in self.steps if s["type"] == "llm"]),
            "tool_calls": len([s for s in self.steps if s["type"] == "tool"]),
            "errors": [s for s in self.steps if s.get("error")]
        }
        if path:
            with open(path, "w") as f:
                json.dump(trace, f, indent=2)
        return trace
```
Infinite loops deserve special attention: a max_steps=15 guard with a fallback response ("I couldn't complete this request") prevents runaway costs and latency.
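A minimal sketch of that guard in a hand-rolled agent loop, reusing the agent_step function from the token accounting section; TOOLS and execute_tool are placeholders for your own tool schemas and dispatcher:

```python
MAX_STEPS = 15
FALLBACK = "I couldn't complete this request. Please try rephrasing or contact support."

def run_agent_with_guard(query: str) -> str:
    messages = [{"role": "user", "content": query}]
    for step in range(MAX_STEPS):
        response = agent_step(messages, tools=TOOLS, step_name=f"step_{step}")
        message = response.choices[0].message
        if not message.tool_calls:          # model produced a final answer
            return message.content
        messages.append(message)            # keep the assistant turn in history
        for call in message.tool_calls:     # execute each requested tool
            messages.append({
                "role": "tool",
                "tool_call_id": call.id,
                "content": str(execute_tool(call)),
            })
    # Step budget exhausted: return the fallback instead of looping forever
    return FALLBACK
```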
Strategies for systematic agent debugging:
Trace Replay
Record full execution traces (prompts, tool results, LLM outputs) and replay them deterministically. Mock tool calls with recorded outputs to isolate whether the bug is in the LLM's reasoning or in tool behavior. LangSmith's "playground" and LangFuse's trace view both support this.
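A minimal replay sketch built on the AgentDebugger dump format above; run_agent_with_tools is a hypothetical entry point that accepts a tool-execution function, so the only thing that can change between runs is the LLM's reasoning:

```python
import json

def load_recorded_tools(trace_path: str) -> dict:
    """Index recorded tool results by (tool name, serialized args)."""
    with open(trace_path) as f:
        trace = json.load(f)
    return {
        (s["tool"], json.dumps(s["args"], sort_keys=True)): s["result"]
        for s in trace["steps"] if s["type"] == "tool"
    }

def replay(query: str, trace_path: str) -> str:
    recorded = load_recorded_tools(trace_path)

    def mocked_tool(tool_name: str, args: dict):
        # Serve the recorded output instead of hitting live systems
        return recorded[(tool_name, json.dumps(args, sort_keys=True))]

    return run_agent_with_tools(query, tool_executor=mocked_tool)
```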
Differential Testing
Run the same inputs through two model versions (or prompt versions) and diff the traces. Look for divergence points — where does the new version first deviate? This catches regressions from prompt changes that appear fine on simple cases but fail on edge cases.
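A sketch of finding the divergence point between two AgentDebugger dumps (trace_v1 and trace_v2 are assumed to be dumps of the same query under two prompt or model versions):

```python
def first_divergence(trace_a: dict, trace_b: dict):
    """Compare two trace dumps step by step; return the first mismatch."""
    for i, (a, b) in enumerate(zip(trace_a["steps"], trace_b["steps"])):
        if a["type"] != b["type"]:
            return i, f"step type changed: {a['type']} -> {b['type']}"
        if a["type"] == "tool" and a["tool"] != b["tool"]:
            return i, f"tool changed: {a['tool']} -> {b['tool']}"
        if a["type"] == "llm" and a["output"] != b["output"]:
            return i, "LLM output diverged"
    if len(trace_a["steps"]) != len(trace_b["steps"]):
        return min(len(trace_a["steps"]), len(trace_b["steps"])), "trace lengths differ"
    return None, "traces identical"

# Usage: run the same query through prompt v1 and v2, dump both, then diff
step_idx, reason = first_divergence(trace_v1, trace_v2)
print(f"First divergence at step {step_idx}: {reason}")
```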
Checkpoint Inspection
Insert "checkpoint" assertions between agent steps that validate intermediate state. For example, after a retrieval step, assert that at least one document was returned. After a planning step, assert the plan has between 1 and 10 steps. These catch silent failures early.
LLM-as-Judge Monitoring
Use a cheap, fast model (GPT-4o-mini) to evaluate agent outputs in real time. Flag responses that score below a threshold for human review. This gives you a quality signal without manual review of every trace — focus human attention on the 5-10% of runs that look problematic.
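A minimal judge sketch using the OpenAI SDK; the rubric prompt, the 0.7 review threshold, and the review_queue destination are assumptions to adapt to your own pipeline:

```python
JUDGE_PROMPT = """Rate the assistant's answer to the user's question on a 0-10 scale
for correctness and helpfulness. Respond with only the number."""

def judge_response(question: str, answer: str) -> float:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # cheap, fast judge model
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": f"Question: {question}\n\nAnswer: {answer}"},
        ],
    )
    # Normalize the 0-10 rating to a 0-1 score
    return float(response.choices[0].message.content.strip()) / 10

score = judge_response(user_query, agent_answer)
if score < 0.7:
    # Route low-scoring runs to a human review queue with the full trace attached
    review_queue.submit(query=user_query, answer=agent_answer,
                        score=score, trace=debugger.dump())
```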
Key takeaway: Agent observability is not optional — it's the foundation that makes agents production-viable. The combination of hierarchical tracing (LangSmith, LangFuse, or custom), token-level cost accounting, and systematic debugging workflows transforms agents from unpredictable black boxes into inspectable, optimizable systems. Start with tracing, add cost tracking, then build evaluation pipelines. The tooling investment pays for itself the first time you debug a $50 runaway agent loop.