Agent Observability: Tracing and Debugging

Traditional software observability — logs, metrics, and distributed tracing — falls short for LLM agents. An agent might make 8 LLM calls, invoke 5 tools, backtrack twice, and spend $0.47 in tokens before returning a single answer. When that answer is wrong, you need to reconstruct the reasoning trajectory, not just the HTTP call chain. This post covers the tools and techniques that make multi-step agent execution inspectable, debuggable, and cost-accountable.

Why Agent Observability Matters

Agents are non-deterministic, multi-step systems with emergent execution paths. Unlike a REST API where the call graph is fixed at compile time, an agent's execution graph is generated at runtime by the LLM itself. This creates unique observability challenges.

Meeting those challenges takes a four-layer observability stack:

  • Traces: end-to-end execution graph, with spans for each LLM call, tool invocation, and retrieval step
  • Metrics: token counts, latency per step, cost per run, tool success rate, retry counts
  • Evaluations: automated scoring of final answers, intermediate step quality, hallucination detection
  • Structured logs: raw prompts/completions, tool inputs/outputs, error messages, retry context

Key insight: Traditional APM tools (Datadog, New Relic) can monitor the infrastructure running your agent, but they cannot inspect the reasoning inside it. You need LLM-native observability that understands concepts like token usage, prompt templates, and chain-of-thought steps.

The observability hierarchy for agents, from most to least critical:

  1. Trace every LLM call — input prompt, output completion, token counts, latency, model used
  2. Trace tool invocations — tool name, input arguments, output, success/failure, latency
  3. Link spans into runs — group all calls from a single user request into one trace
  4. Compute cost — multiply token counts by per-model pricing, aggregate per user/run/day
  5. Evaluate quality — score the final output against ground truth or LLM-as-judge criteria

LangSmith: Tracing & Evaluation

LangSmith is LangChain's hosted observability platform. It captures every LLM call, chain step, and tool invocation as nested spans in a trace tree. If you're already using LangChain, integration is essentially zero-config — set a few environment variables and every chain execution is automatically traced.

# LangSmith setup — just environment variables
import os

os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = "ls-..."
os.environ["LANGCHAIN_PROJECT"] = "customer-support-agent"

# That's it. Every LangChain call is now traced.
from langchain_openai import ChatOpenAI
from langchain.agents import AgentExecutor

llm = ChatOpenAI(model="gpt-4o")
agent = AgentExecutor(agent=agent_runnable, tools=tools)  # agent_runnable and tools defined elsewhere

# This run appears in LangSmith with full trace tree
result = agent.invoke({"input": "What's the refund policy for order #1234?"})

LangSmith organizes data into three core concepts: runs (individual spans such as an LLM call or tool invocation), traces (the tree of runs produced by a single execution), and projects (named collections of traces, set via LANGCHAIN_PROJECT).

For non-LangChain code, use the @traceable decorator to manually instrument functions:

from langsmith import traceable

@traceable(run_type="chain", name="customer_support_agent")
def handle_support_query(query: str) -> str:
    # Step 1: Classify intent
    intent = classify_intent(query)

    # Step 2: Retrieve relevant docs
    docs = retrieve_context(query, intent)

    # Step 3: Generate response
    response = generate_response(query, docs, intent)
    return response

@traceable(run_type="llm")
def classify_intent(query: str) -> str:
    # Traced as a child span under the parent chain
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": f"Classify intent: {query}"}]
    )
    return response.choices[0].message.content

@traceable(run_type="retriever")
def retrieve_context(query: str, intent: str) -> list:
    # Traced as a retriever span with input/output docs
    return vector_store.similarity_search(query, k=5)

Evaluation tip: LangSmith's evaluation framework lets you run your agent against a dataset and score every output with custom evaluators (exact match, LLM-as-judge, regex). This is critical for catching regressions when you change prompts or models — you can compare run-over-run accuracy before deploying.
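
As a sketch of what that looks like with the evaluation SDK (the dataset name, evaluator, and experiment prefix below are hypothetical, and the exact import path varies by SDK version):

from langsmith.evaluation import evaluate

# Hypothetical exact-match evaluator: compares the agent's answer
# to the reference answer stored on the dataset example.
def exact_match(run, example):
    return {"key": "exact_match",
            "score": int(run.outputs["output"] == example.outputs["answer"])}

results = evaluate(
    lambda inputs: {"output": handle_support_query(inputs["query"])},  # target under test
    data="support-golden-set",            # hypothetical dataset registered in LangSmith
    evaluators=[exact_match],
    experiment_prefix="support-agent-v2",
)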

LangFuse: Open-Source Alternative

LangFuse is an open-source LLM observability platform that provides tracing, prompt management, and evaluation without vendor lock-in. You can self-host it (Docker Compose or Kubernetes) or use their managed cloud. The key advantage: your trace data stays in your infrastructure.

LangSmith

  • Hosting: Managed SaaS only
  • LangChain integration: Zero-config, automatic
  • Non-LangChain: @traceable decorator + SDK
  • Evaluation: Built-in datasets, evaluators, comparison views
  • Prompt management: Hub for versioned prompts
  • Pricing: Free tier (5K traces/mo), paid plans scale
  • Best for: LangChain-native teams wanting turnkey solution

LangFuse

  • Hosting: Self-hosted or managed cloud
  • LangChain integration: Callback handler (one line; see the sketch after this comparison)
  • Non-LangChain: @observe decorator + low-level SDK
  • Evaluation: Scoring API, annotation queues, model-based evals
  • Prompt management: Built-in versioned prompt registry
  • Pricing: Open source (self-host free), cloud has free tier
  • Best for: Teams needing data sovereignty or custom infra
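
That one-line LangChain integration is a callback handler. A minimal sketch, assuming the LANGFUSE_PUBLIC_KEY, LANGFUSE_SECRET_KEY, and LANGFUSE_HOST environment variables are set, and reusing the agent from the LangSmith example:

from langfuse.callback import CallbackHandler

langfuse_handler = CallbackHandler()  # credentials are read from environment variables

# Every chain step, LLM call, and tool invocation in this run is reported
# to LangFuse as a nested observation under one trace.
result = agent.invoke(
    {"input": "What's the refund policy for order #1234?"},
    config={"callbacks": [langfuse_handler]},
)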

Outside LangChain, LangFuse's tracing model differs slightly from LangSmith's @traceable: you mark functions with the @observe decorator (or use the low-level SDK) and explicitly attach generations, token usage, and trace metadata:

from langfuse.decorators import observe, langfuse_context

@observe()
def agent_pipeline(user_query: str) -> str:
    # Root trace created automatically by @observe

    # Step 1: Plan
    plan = create_plan(user_query)

    # Step 2: Execute tools, collecting each result
    results = []
    for step in plan.steps:
        results.append(execute_tool(step))

    # Step 3: Synthesize
    answer = synthesize_answer(user_query, results)

    # Attach metadata and scores
    langfuse_context.update_current_trace(
        user_id="user_abc",
        metadata={"plan_steps": len(plan.steps)},
        tags=["production", "v2.1"]
    )
    return answer

@observe(as_type="generation")
def create_plan(query: str) -> Plan:
    # Tracked as an LLM generation with token counts
    response = openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": PLANNER_PROMPT},
                  {"role": "user", "content": query}]
    )

    # Report token usage explicitly (auto-extraction applies when using LangFuse's OpenAI wrapper)
    langfuse_context.update_current_observation(
        model="gpt-4o",
        usage={"input": response.usage.prompt_tokens,
               "output": response.usage.completion_tokens}
    )
    return parse_plan(response.choices[0].message.content)

Self-hosting note: LangFuse runs as a Next.js app backed by PostgreSQL. For production, deploy behind your existing auth proxy (e.g., OAuth2 Proxy) and use a managed Postgres instance. The Docker Compose setup works for evaluation but needs connection pooling (PgBouncer) at scale — each concurrent trace opens a DB connection.
Agent trace tree (waterfall view):

agent_pipeline (total run): 1480ms
├─ classify_intent (LLM call): 320ms
├─ retrieve_context (retrieval): 180ms
├─ tool: search_orders (tool execution): 340ms
├─ tool: get_refund_policy (tool execution): 45ms
└─ generate_response (LLM call): 580ms

Custom Tracing Implementation

When you can't use LangSmith or LangFuse — perhaps due to air-gapped environments, compliance requirements, or you're building on a non-LangChain framework — you need custom tracing. The core idea: wrap every LLM call and tool invocation in a span that records inputs, outputs, timing, and token usage, then ship those spans to your existing observability backend.

import time, uuid, json
from dataclasses import dataclass, field
from typing import Any, Optional
from contextlib import contextmanager
import threading

@dataclass
class Span:
    span_id: str = field(default_factory=lambda: str(uuid.uuid4())[:8])
    trace_id: str = ""
    parent_id: Optional[str] = None
    name: str = ""
    span_type: str = "generic"   # "llm" | "tool" | "retriever" | "chain"
    start_time: float = 0.0
    end_time: float = 0.0
    input_data: Any = None
    output_data: Any = None
    tokens_in: int = 0
    tokens_out: int = 0
    model: str = ""
    error: Optional[str] = None
    metadata: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> float:
        return (self.end_time - self.start_time) * 1000

# Thread-local storage for trace context propagation
_context = threading.local()

class ConsoleExporter:
    """Fallback exporter: print each finished span as a JSON line."""
    def export(self, span: "Span"):
        print(json.dumps({"name": span.name, "type": span.span_type,
                          "duration_ms": round(span.duration_ms, 1),
                          "error": span.error}))

class Tracer:
    def __init__(self, exporter=None):
        self.spans: list[Span] = []
        self.exporter = exporter or ConsoleExporter()

    @contextmanager
    def span(self, name: str, span_type: str = "generic", **kwargs):
        span = Span(
            name=name,
            span_type=span_type,
            trace_id=getattr(_context, "trace_id", str(uuid.uuid4())[:8]),
            parent_id=getattr(_context, "current_span_id", None),
            start_time=time.time(),
            **kwargs
        )
        prev_span_id = getattr(_context, "current_span_id", None)
        _context.current_span_id = span.span_id
        try:
            yield span
        except Exception as e:
            span.error = str(e)
            raise
        finally:
            span.end_time = time.time()
            self.spans.append(span)
            self.exporter.export(span)
            _context.current_span_id = prev_span_id

Usage looks like this — every LLM call and tool invocation gets wrapped:

tracer = Tracer(exporter=OTLPExporter("http://jaeger:4317"))

def run_agent(query: str) -> str:
    with tracer.span("agent_run", span_type="chain") as root:
        root.input_data = query
        _context.trace_id = root.trace_id

        # LLM call — traced with token counts
        with tracer.span("plan", span_type="llm") as s:
            resp = openai_client.chat.completions.create(
                model="gpt-4o", messages=[...]
            )
            s.tokens_in = resp.usage.prompt_tokens
            s.tokens_out = resp.usage.completion_tokens
            s.model = "gpt-4o"
            s.output_data = resp.choices[0].message.content

        # Tool call — traced with input/output
        with tracer.span("tool:search_db", span_type="tool") as s:
            s.input_data = {"query": "order #1234"}
            result = db.search("order #1234")
            s.output_data = result

        # Final synthesis LLM call omitted for brevity; assume it produced `final_answer`
        root.output_data = final_answer
    return final_answer

OpenTelemetry integration: Export your custom spans to any OTLP-compatible backend (Jaeger, Grafana Tempo, Datadog). This lets you correlate agent traces with your existing service traces — you can see the agent's LLM calls alongside the API gateway latency and database queries in a single distributed trace.
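
A minimal sketch of the OTLPExporter used above, assuming the opentelemetry-sdk and OTLP gRPC exporter packages are installed. Because our spans are exported after they finish, timing and token data are attached as attributes rather than reconstructing exact start/end timestamps:

from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

class OTLPExporter:
    """Adapter: forwards finished Span dataclasses to an OTLP endpoint."""
    def __init__(self, endpoint: str):
        provider = TracerProvider()
        provider.add_span_processor(
            BatchSpanProcessor(OTLPSpanExporter(endpoint=endpoint, insecure=True)))
        self._tracer = provider.get_tracer("agent-tracer")

    def export(self, span: Span):
        # Re-emit the finished span with agent-specific attributes
        with self._tracer.start_as_current_span(span.name) as otel_span:
            otel_span.set_attribute("agent.span_type", span.span_type)
            otel_span.set_attribute("agent.model", span.model)
            otel_span.set_attribute("agent.tokens_in", span.tokens_in)
            otel_span.set_attribute("agent.tokens_out", span.tokens_out)
            otel_span.set_attribute("agent.duration_ms", span.duration_ms)
            if span.error:
                otel_span.set_attribute("agent.error", span.error)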

For a decorator-based approach that's less verbose:

import functools

# Module-level tracer shared by all decorated functions
_global_tracer = Tracer()

def trace(name: str = None, span_type: str = "generic"):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            span_name = name or func.__name__
            with _global_tracer.span(span_name, span_type) as s:
                s.input_data = {"args": str(args), "kwargs": str(kwargs)}
                result = func(*args, **kwargs)
                s.output_data = str(result)[:500]  # Truncate large outputs
                return result
        return wrapper
    return decorator

# Clean usage:
@trace(span_type="llm")
def classify_intent(query: str) -> str: ...

@trace(span_type="tool")
def search_knowledge_base(query: str) -> list: ...

@trace(span_type="chain")
def run_agent(user_input: str) -> str: ...

Token Accounting & Cost Tracking

Token accounting is the financial observability layer for LLM agents. Without it, you're flying blind on costs. A single agent run can cost anywhere from $0.002 (simple GPT-4o-mini classification) to $2.50+ (complex multi-step GPT-4o reasoning with tool use). At scale, the difference between "we optimized our prompts" and "we didn't" can be $50K/month.

Token cost breakdown for a typical agent run:

  • Planning (2.1K tokens): $0.012
  • Classify (800 tokens): $0.0003
  • Retrieve: $0.001
  • Synthesis (3.2K tokens): $0.019
  • Total: ~$0.032 per run → $32/day at 1K runs → $960/month

After optimization (swap classify to GPT-4o-mini, cache retrieval, shorten prompts): ~$0.009 per run → $9/day at 1K runs → $270/month, a 72% reduction and an annual saving of ~$8,280.

Here's a practical token accounting implementation:

from dataclasses import dataclass, field
from collections import defaultdict

# Pricing per 1M tokens (as of mid-2024)
MODEL_PRICING = {
    "gpt-4o":       {"input": 2.50, "output": 10.00},
    "gpt-4o-mini":  {"input": 0.15, "output": 0.60},
    "claude-3.5":   {"input": 3.00, "output": 15.00},
    "claude-haiku": {"input": 0.25, "output": 1.25},
}

@dataclass
class TokenLedger:
    """Per-run token accounting."""
    entries: list = field(default_factory=list)

    def record(self, model: str, input_tokens: int, output_tokens: int,
                  step_name: str = ""):
        pricing = MODEL_PRICING.get(model, {"input": 0, "output": 0})
        cost = (input_tokens * pricing["input"] +
                output_tokens * pricing["output"]) / 1_000_000
        self.entries.append({
            "model": model, "step": step_name,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": cost,
        })

    @property
    def total_cost(self) -> float:
        return sum(e["cost_usd"] for e in self.entries)

    @property
    def cost_by_model(self) -> dict:
        breakdown = defaultdict(float)
        for e in self.entries:
            breakdown[e["model"]] += e["cost_usd"]
        return dict(breakdown)

    def summary(self) -> str:
        total_in = sum(e["input_tokens"] for e in self.entries)
        total_out = sum(e["output_tokens"] for e in self.entries)
        return (f"Tokens: {total_in} in + {total_out} out | "
                f"Cost: ${self.total_cost:.4f} | "
                f"Steps: {len(self.entries)}")

Warning: Token counts from the API response (usage.prompt_tokens) include the entire context window — system prompt, conversation history, and tool definitions. In a multi-turn agent, the context grows with every step. A 5-step agent might use 15K total input tokens even if each individual message is only 200 tokens, because the full history is sent on every call.
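
A rough back-of-the-envelope sketch of that growth; the token sizes are invented purely for illustration:

# Illustrative numbers: a fixed system prompt + tool schemas, plus an
# assistant reply and a tool result (~200 tokens each) appended per step.
system_and_tools = 1_200
per_message = 200

context, total_input = system_and_tools, 0
for step in range(5):              # 5-step agent loop
    total_input += context         # the entire history is resent on every call
    context += 2 * per_message     # this step's reply + tool result join the history

print(total_input)                 # ~10,000 input tokens, even though each
                                   # individual message is only ~200 tokens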

Integrate the ledger into your agent loop:

ledger = TokenLedger()

def agent_step(messages: list, tools: list, step_name: str):
    response = openai_client.chat.completions.create(
        model="gpt-4o", messages=messages, tools=tools
    )

    # Record token usage for this step
    ledger.record(
        model="gpt-4o",
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        step_name=step_name
    )
    return response

# After agent completes:
print(ledger.summary())
# → Tokens: 8420 in + 1230 out | Cost: $0.0334 | Steps: 4

# Ship to your metrics system (e.g., a StatsD/Datadog-style client)
metrics.gauge("agent.cost_usd", ledger.total_cost, tags=["agent:support"])
metrics.histogram("agent.tokens.input",
                  sum(e["input_tokens"] for e in ledger.entries),
                  tags=["agent:support"])

Debugging Agent Failures

Agent failures are fundamentally different from traditional software bugs. The code doesn't crash — the LLM returns a plausible but incorrect answer, calls the wrong tool, or loops indefinitely. Debugging requires reasoning trajectory analysis, not stack traces.

Common agent failure modes and how to diagnose them:

  • Tool misuse (~40% of failures): wrong tool selected, malformed tool arguments, tool output misinterpreted, missing required parameters
  • Reasoning errors (~35% of failures): incorrect step ordering, hallucinated intermediate facts, premature conclusion, context window overflow
  • Infrastructure (~25% of failures): rate limits / timeouts, tool API downtime, token limit exceeded, infinite loop with no stop condition

The debugging workflow:

  1. Reproduce — replay the exact trace with the same inputs and tool mocks
  2. Isolate — identify which span produced the first incorrect output
  3. Inspect — examine the prompt sent to the LLM at that span (was the context correct?)
  4. Fix — adjust the prompt, tool description, or add guardrails, then regression-test

A practical debugging helper that captures enough context for post-mortem analysis:

import json, datetime

class AgentDebugger:
    """Records full execution trace for post-mortem analysis."""

    def __init__(self):
        self.steps = []
        self.start_time = datetime.datetime.utcnow()

    def log_llm_call(self, step_name: str, messages: list,
                       response, model: str):
        self.steps.append({
            "type": "llm",
            "step": step_name,
            "model": model,
            "messages": messages,
            "output": response.choices[0].message.model_dump(),
            "tokens": {
                "input": response.usage.prompt_tokens,
                "output": response.usage.completion_tokens
            },
            "finish_reason": response.choices[0].finish_reason,
            "timestamp": datetime.datetime.utcnow().isoformat()
        })

    def log_tool_call(self, tool_name: str, args: dict,
                        result, error: str = None):
        self.steps.append({
            "type": "tool",
            "tool": tool_name,
            "args": args,
            "result": str(result)[:2000],
            "error": error,
            "timestamp": datetime.datetime.utcnow().isoformat()
        })

    def dump(self, path: str = None):
        """Export trace for offline analysis."""
        trace = {
            "start": self.start_time.isoformat(),
            "steps": self.steps,
            "total_steps": len(self.steps),
            "llm_calls": len([s for s in self.steps if s["type"] == "llm"]),
            "tool_calls": len([s for s in self.steps if s["type"] == "tool"]),
            "errors": [s for s in self.steps if s.get("error")]
        }
        if path:
            with open(path, "w") as f:
                json.dump(trace, f, indent=2)
        return trace

Infinite loop protection: Always set a maximum step count on your agent loop. Without it, a confused model can loop indefinitely — calling the same tool with the same arguments or alternating between two tools. A simple max_steps=15 guard with a fallback response ("I couldn't complete this request") prevents runaway costs and latency.
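
A sketch of that guard built on the agent_step helper above; TOOLS and run_tool are hypothetical stand-ins for your tool schema and dispatcher:

MAX_STEPS = 15

def run_agent_loop(query: str) -> str:
    messages = [{"role": "user", "content": query}]
    for i in range(MAX_STEPS):
        response = agent_step(messages, tools=TOOLS, step_name=f"step_{i}")
        choice = response.choices[0]

        if choice.finish_reason != "tool_calls":
            return choice.message.content            # model produced a final answer

        messages.append(choice.message)              # keep the tool-call request in history
        for tool_call in choice.message.tool_calls:
            messages.append({
                "role": "tool",
                "tool_call_id": tool_call.id,
                "content": run_tool(tool_call),      # hypothetical tool dispatcher
            })

    # Fallback after MAX_STEPS: stop instead of looping forever
    return "I couldn't complete this request."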

Strategies for systematic agent debugging:

Trace Replay

Record full execution traces (prompts, tool results, LLM outputs) and replay them deterministically. Mock tool calls with recorded outputs to isolate whether the bug is in the LLM's reasoning or in tool behavior. LangSmith's "playground" and LangFuse's trace view both support this.
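
A sketch of tool mocking for replay, built on the AgentDebugger dump format above; it assumes the agent calls tools through a single dispatcher you can swap out:

class ReplayTools:
    """Serves recorded tool outputs from a dumped trace so a replay run
    exercises only the LLM's reasoning, never the live tools."""

    def __init__(self, trace: dict):
        self._recorded = [s for s in trace["steps"] if s["type"] == "tool"]
        self._cursor = 0

    def __call__(self, tool_name: str, args: dict) -> str:
        step = self._recorded[self._cursor]
        self._cursor += 1
        if step["tool"] != tool_name:
            raise AssertionError(
                f"Replay diverged: expected {step['tool']!r}, agent called {tool_name!r}")
        return step["result"]

# Usage sketch: load a saved trace, then re-run the agent with mocked tools
# replay = ReplayTools(json.load(open("trace.json")))
# run_agent(query, tool_dispatcher=replay)   # hypothetical dispatcher hook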

Differential Testing

Run the same inputs through two model versions (or prompt versions) and diff the traces. Look for divergence points — where does the new version first deviate? This catches regressions from prompt changes that appear fine on simple cases but fail on edge cases.
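
A sketch of the divergence check over two dumped traces (same AgentDebugger format as above):

from typing import Optional

def first_divergence(trace_a: dict, trace_b: dict) -> Optional[int]:
    """Return the index of the first step where two traces differ, or None."""
    for i, (a, b) in enumerate(zip(trace_a["steps"], trace_b["steps"])):
        if a["type"] != b["type"]:
            return i
        # LLM steps diverge on output; tool steps diverge on arguments or result
        keys = ("output",) if a["type"] == "llm" else ("args", "result")
        if any(a.get(k) != b.get(k) for k in keys):
            return i
    if trace_a["total_steps"] != trace_b["total_steps"]:
        return min(trace_a["total_steps"], trace_b["total_steps"])
    return None

# divergence = first_divergence(debugger_v1.dump(), debugger_v2.dump())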

Checkpoint Inspection

Insert "checkpoint" assertions between agent steps that validate intermediate state. For example, after a retrieval step, assert that at least one document was returned. After a planning step, assert the plan has between 1 and 10 steps. These catch silent failures early.

LLM-as-Judge Monitoring

Use a cheap, fast model (GPT-4o-mini) to evaluate agent outputs in real time. Flag responses that score below a threshold for human review. This gives you a quality signal without manual review of every trace — focus human attention on the 5-10% of runs that look problematic.
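
A sketch of a real-time judge; the prompt, threshold, and review-queue hook are assumptions to adapt to your traffic:

JUDGE_PROMPT = ("Rate the assistant's answer from 1 (unusable) to 5 (excellent) "
                "for relevance and factual grounding. Respond with only the number.")

def judge_response(query: str, answer: str) -> int:
    response = openai_client.chat.completions.create(
        model="gpt-4o-mini",   # cheap, fast judge model
        messages=[{"role": "system", "content": JUDGE_PROMPT},
                  {"role": "user", "content": f"Question: {query}\n\nAnswer: {answer}"}],
    )
    return int(response.choices[0].message.content.strip())

score = judge_response(query, answer)
if score <= 2:                                   # threshold is an assumption; tune it
    flag_for_human_review(query, answer, score)  # hypothetical review-queue hook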

Production checklist: Before deploying an agent, ensure you have: (1) tracing enabled for every LLM call and tool invocation, (2) token accounting with per-run cost calculation, (3) max-step guardrails on the agent loop, (4) alerting on cost spikes and error rate increases, and (5) a trace replay capability for debugging production issues. Without these, you're deploying a black box.

Key takeaway: Agent observability is not optional — it's the foundation that makes agents production-viable. The combination of hierarchical tracing (LangSmith, LangFuse, or custom), token-level cost accounting, and systematic debugging workflows transforms agents from unpredictable black boxes into inspectable, optimizable systems. Start with tracing, add cost tracking, then build evaluation pipelines. The tooling investment pays for itself the first time you debug a $50 runaway agent loop.