
Agent Error Handling: Retries and Fallbacks

Agent Error Taxonomy

LLM-powered agents operate in inherently unreliable environments. Every external call — to a model API, a tool endpoint, or a vector store — is a potential failure point. Before building retry or fallback logic, you need a precise taxonomy of what can go wrong. Agent errors fall into four broad categories, each demanding a different recovery strategy.

Transient: rate limits (429), timeouts, network blips → retry with back-off
Semantic: malformed JSON, schema violations, hallucinated tools → re-prompt / parse fix
Resource: token limit exceeded, context overflow, budget exhaustion → truncate / downgrade
Fatal: auth failures (401), invalid API keys, service deprecation → abort & escalate

Transient Errors

These are temporary failures caused by infrastructure hiccups — rate limits (HTTP 429), connection timeouts, server 503 responses, and DNS resolution failures. They resolve on their own and are the primary target for retry logic. The key characteristic: the same request will succeed if you wait and try again.

Semantic Errors

The model returned a response, but it doesn't conform to expectations — malformed JSON, calling a tool that doesn't exist, or returning a schema-violating payload. Retrying the identical prompt rarely helps; instead you need to re-prompt with corrective context (e.g., append the parse error message to the next call).
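As a minimal sketch of that re-prompt loop (SemanticError is defined in the implementation section below; call_model and parse_json are hypothetical placeholders for your own LLM call and validator):

def call_with_repair(prompt: str, max_repairs: int = 2) -> dict:
    """Re-prompt with the parse error appended until the output parses."""
    for _ in range(max_repairs + 1):
        raw = call_model(prompt)       # placeholder for your LLM call
        try:
            return parse_json(raw)     # placeholder parser/validator
        except ValueError as e:
            # Feed the parse error back so the model can correct itself
            prompt += f"\n\n[Your last reply failed to parse: {e}. Return valid JSON only.]"
    raise SemanticError("Output still malformed after repair attempts")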

Resource Errors

The request is valid, but you've hit a hard constraint — token budget exceeded, context window overflow, or a spending cap. Recovery means reducing the payload: summarize history, drop older tool results, or switch to a cheaper model with a smaller context window.
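One way to sketch the truncation step, assuming a chat-style message list and a hypothetical count_tokens helper:

def truncate_history(messages: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest non-system messages until the history fits the budget."""
    kept = list(messages)
    while count_tokens(kept) > max_tokens and len(kept) > 1:
        # Keep the system prompt (index 0); drop the oldest turn after it
        kept.pop(1)
    return kept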

Fatal Errors

No retry or prompt rewrite will fix these. Authentication failures, revoked API keys, permanently removed endpoints, and policy violations fall here. The correct response is to abort immediately, log the error, and alert an operator.

Design principle: Classify before you retry. A retry loop that hammers a 401 endpoint wastes tokens and time. Always inspect the error type or HTTP status code first, then branch into the appropriate recovery path.
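In code, that branching might look like the following sketch, assuming an HTTP response object resp and hypothetical recovery helpers (schedule_retry, reprompt_with_feedback, and so on); classify_error is shown in full in the implementation section below:

error = classify_error(resp.status_code, resp.text)
if isinstance(error, TransientError):
    schedule_retry(request)              # back off and try again
elif isinstance(error, SemanticError):
    reprompt_with_feedback(request, error)
elif isinstance(error, ResourceError):
    shrink_context_and_retry(request)
else:                                    # FatalError or unknown
    abort_and_alert(error)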

Retry Strategies

Not all retries are equal. A naive fixed-interval retry can amplify failures during an outage (the "thundering herd" problem). Production agent systems use one of three strategies, each with different trade-offs between latency and reliability.

Exponential Back-off with Jitter

The gold standard for transient errors. After each failed attempt, the wait time doubles and a random jitter component prevents synchronized retries across distributed agents. The formula is:

delay = min(base × 2^attempt + random(0, jitter), max_delay)

import random, time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=60.0, jitter=0.5):
    """Decorator: exponential back-off with jitter for transient errors."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError as e:
                    if attempt == max_retries:
                        raise MaxRetriesExceeded(func.__name__, max_retries) from e
                    delay = min(base_delay * (2 ** attempt) + random.uniform(0, jitter), max_delay)
                    log(f"Retry {attempt+1}/{max_retries} for {func.__name__}, waiting {delay:.2f}s")
                    time.sleep(delay)
        return wrapper
    return decorator

Linear Back-off

Simpler and more predictable than exponential. Each retry adds a fixed increment (e.g., 2 s, 4 s, 6 s). Works well when you expect short-lived blips but want a tighter latency ceiling than exponential back-off allows.
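A linear variant is a small change to the decorator above (this sketch reuses time, wraps, and TransientError from the exponential back-off example):

def retry_linear(max_retries=3, increment=2.0):
    """Decorator sketch: linear back-off, waiting increment, 2x increment, 3x increment, ..."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError:
                    if attempt == max_retries:
                        raise
                    time.sleep(increment * (attempt + 1))  # 2 s, 4 s, 6 s, ...
        return wrapper
    return decorator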

Adaptive Retry with Circuit Breaker

Combines retries with a circuit-breaker pattern borrowed from microservice resilience. After n consecutive failures, the breaker "opens" and all calls immediately fail-fast for a cooldown window. This prevents a cascading failure where a struggling upstream service gets hammered by retries from every agent in your fleet.

class CircuitBreaker:
    """Three-state breaker: CLOSED → OPEN → HALF_OPEN → CLOSED."""
    def __init__(self, failure_threshold=5, cooldown=30.0):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.state = "CLOSED"
        self.last_failure_time = 0

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.time() - self.last_failure_time > self.cooldown:
                self.state = "HALF_OPEN"
            else:
                raise CircuitOpenError("Circuit breaker is OPEN")
        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except TransientError:
            self._on_failure()
            raise

    def _on_success(self):
        self.failure_count = 0
        self.state = "CLOSED"

    def _on_failure(self):
        self.failure_count += 1
        self.last_failure_time = time.time()
        if self.failure_count >= self.failure_threshold:
            self.state = "OPEN"
Warning: Never retry on fatal errors (401, 403, invalid API key). Your classifier must distinguish retryable from non-retryable exceptions before entering the retry loop, or you risk burning through your budget on doomed requests.

Fallback Chains

When retries are exhausted, the agent shouldn't simply crash. A fallback chain defines an ordered list of alternative strategies, each progressively simpler or cheaper. The agent walks down the chain until one succeeds or every option is exhausted.

GPT-4o (primary model) → fail → GPT-4o-mini (cheaper fallback) → fail → Claude Haiku (cross-provider) → fail → Cached / Static (last resort)

The most common fallback pattern for agents is model cascading: start with the most capable (and expensive) model, and fall through to cheaper or cross-provider alternatives when it fails. But fallback chains aren't limited to model swaps — you can also fall back on strategy changes.

Model Cascading

GPT-4o → GPT-4o-mini → Claude Haiku → local model. Each step trades quality for availability and cost. The agent passes the same prompt to each model in sequence.

Strategy Downgrade

ReAct loop (multi-step reasoning) → single-shot tool call → direct LLM answer → cached response. Each step reduces capability but increases reliability and reduces latency.

class FallbackChain:
    """Walk an ordered list of callables until one succeeds."""
    def __init__(self, handlers: list):
        self.handlers = handlers

    def execute(self, *args, **kwargs):
        errors = []
        for i, handler in enumerate(self.handlers):
            try:
                log(f"Trying handler {i}: {handler.__name__}")
                result = handler(*args, **kwargs)
                if i > 0:
                    log(f"Fallback to handler {i} succeeded")
                return result
            except FatalError:
                raise  # Never swallow fatal errors
            except Exception as e:
                errors.append((handler.__name__, e))
                log(f"Handler {handler.__name__} failed: {e}")
        raise AllHandlersFailed(errors)

# ── Usage ──
chain = FallbackChain([
    call_gpt4o,          # Primary: best quality
    call_gpt4o_mini,     # Fallback 1: cheaper
    call_claude_haiku,   # Fallback 2: cross-provider
    return_cached,       # Fallback 3: static response
])
response = chain.execute(prompt=user_query)
Key insight: Cross-provider fallbacks protect you from correlated outages. If your primary and fallback are both on OpenAI, a single API outage takes down your entire chain. Mix providers (OpenAI + Anthropic + a local model) for true resilience.
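The same FallbackChain also expresses the strategy downgrade described above. In this sketch, run_react_agent, run_single_tool_call, and answer_directly are hypothetical stand-ins for your own entry points; return_cached and user_query come from the usage example above:

strategy_chain = FallbackChain([
    run_react_agent,        # Multi-step reasoning with tools
    run_single_tool_call,   # One tool call, simplified prompt
    answer_directly,        # LLM-only response
    return_cached,          # Static last resort
])
response = strategy_chain.execute(prompt=user_query)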

Graceful Degradation

Graceful degradation is the principle that an agent should always return something useful, even when it can't deliver the ideal response. Rather than surfacing a raw error to the user, the agent progressively strips capabilities while maintaining a coherent user experience.

Degradation Levels

Think of degradation as a ladder with four rungs. The agent starts at the top and steps down only when a higher-quality response is impossible:

Level 0 — Full capability: multi-step ReAct with tools, citations, structured output
Level 1 — Reduced tools: single tool call, simplified prompt, no citations
Level 2 — LLM only: direct model response, no tool use, disclaimer appended
Level 3 — Static fallback: cached answer or "I can't help right now" with retry ETA
class DegradationManager:
    """Tracks the current degradation level and adjusts agent behavior."""
    LEVELS = ["full", "reduced", "llm_only", "static"]

    def __init__(self):
        self.level = 0

    def degrade(self):
        if self.level < len(self.LEVELS) - 1:
            self.level += 1
            log(f"Degraded to level {self.level}: {self.LEVELS[self.level]}")

    def get_config(self) -> dict:
        configs = {
            "full":     {"tools": True,  "multi_step": True,  "citations": True},
            "reduced":  {"tools": True,  "multi_step": False, "citations": False},
            "llm_only": {"tools": False, "multi_step": False, "citations": False},
            "static":   {"tools": False, "multi_step": False, "citations": False},
        }
        return configs[self.LEVELS[self.level]]

    def reset(self):
        self.level = 0
        log("Degradation reset to full capability")

The degradation manager integrates with your retry and fallback logic. When a retry fails, instead of trying the exact same call, the agent drops one degradation level and retries with reduced capability. This gives you a smooth failure curve instead of a cliff edge.
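A rough sketch of that integration, assuming a hypothetical run_agent(prompt, config) entry point that accepts the config dict returned by get_config():

manager = DegradationManager()

def invoke_with_degradation(prompt: str, max_attempts: int = 4):
    for _ in range(max_attempts):
        try:
            return run_agent(prompt, config=manager.get_config())
        except FatalError:
            raise
        except AgentError:
            manager.degrade()  # Drop one capability level, then retry
    return {"response": "I'm unable to help right now.", "degraded": True}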

Important: Always inform the user when operating in a degraded mode. Append a disclaimer like "This response was generated without access to live data" so users know the answer quality may be lower than usual.

Recovery Patterns

Recovery goes beyond "retry until it works." In a multi-step agent loop, a failure at step 4 of 7 doesn't necessarily mean starting over. Well-designed agents implement checkpoint-based recovery — they persist intermediate state so they can resume from the last successful step.

Checkpoint & Resume

After each successful tool call or reasoning step, the agent serializes its state (conversation history, accumulated tool results, current plan) to a durable store. On failure, the orchestrator loads the latest checkpoint and retries only the failed step.

class CheckpointedAgent:
    """Persist state after each step for checkpoint-based recovery."""
    def __init__(self, store, agent_id: str):
        self.store = store
        self.agent_id = agent_id

    def run(self, plan: list):
        state = self.store.load(self.agent_id) or {"step": 0, "results": []}
        for i in range(state["step"], len(plan)):
            try:
                result = self.execute_step(plan[i], state)
                state["results"].append(result)
                state["step"] = i + 1
                self.store.save(self.agent_id, state)  # Durable checkpoint
            except FatalError:
                state["status"] = "failed"
                self.store.save(self.agent_id, state)
                raise
            except Exception as e:
                log(f"Step {i} failed: {e}. State checkpointed.")
                raise StepFailed(step=i, error=e, agent_id=self.agent_id)
        state["status"] = "completed"
        self.store.save(self.agent_id, state)
        return state["results"]

Compensating Actions

Some agent steps have side effects — sending an email, updating a database, or posting to an API. If a later step fails, you may need to undo previous side effects. This is the compensating action (or "saga") pattern. Each step registers a compensator that runs on rollback.

class SagaOrchestrator:
    """Execute steps with compensating actions for rollback on failure."""
    def __init__(self):
        self.compensations = []

    def execute(self, steps: list):
        for step in steps:
            try:
                result = step["action"]()
                self.compensations.append(step["compensate"])
            except Exception as e:
                log(f"Step failed: {e}. Rolling back...")
                self._rollback()
                raise

    def _rollback(self):
        for compensate in reversed(self.compensations):
            try:
                compensate()
            except Exception as e:
                log(f"Compensation failed: {e}")  # Log but continue

# ── Example: agent booking workflow ──
saga = SagaOrchestrator()
saga.execute([
    {"action": reserve_flight, "compensate": cancel_flight},
    {"action": reserve_hotel,  "compensate": cancel_hotel},
    {"action": charge_payment, "compensate": refund_payment},
])
Production tip: Store checkpoints in Redis or a database, not in-memory. Agent processes can crash or be restarted by container orchestrators. External state stores ensure recovery survives process restarts.
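A minimal Redis-backed store compatible with the CheckpointedAgent above might look like this sketch (it uses the redis-py client; the key naming scheme and TTL are assumptions):

import json
import redis

class RedisCheckpointStore:
    """Durable checkpoint store so recovery survives process restarts."""
    def __init__(self, url: str = "redis://localhost:6379/0", ttl: int = 86400):
        self.client = redis.Redis.from_url(url)
        self.ttl = ttl  # Expire stale checkpoints after a day

    def save(self, agent_id: str, state: dict) -> None:
        self.client.set(f"checkpoint:{agent_id}", json.dumps(state), ex=self.ttl)

    def load(self, agent_id: str) -> dict | None:
        raw = self.client.get(f"checkpoint:{agent_id}")
        return json.loads(raw) if raw else None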

Implementation in Python

Let's bring everything together into a production-ready agent error handler. This implementation composes the retry decorator, fallback chain, degradation manager, and circuit breaker into a unified ResilientAgent class that handles the full lifecycle of an agent call.

import time, random, logging
from dataclasses import dataclass, field
from typing import Callable, Any

logger = logging.getLogger("agent.resilience")

# ── Custom Exception Hierarchy ──
class AgentError(Exception): pass
class TransientError(AgentError): pass
class SemanticError(AgentError): pass
class ResourceError(AgentError): pass
class FatalError(AgentError): pass
class CircuitOpenError(AgentError): pass

# ── Error Classifier ──
def classify_error(status_code: int, body: str) -> AgentError:
    if status_code in (429, 502, 503, 504):
        return TransientError(f"HTTP {status_code}")
    if status_code == 400 and "json" in body.lower():
        return SemanticError("Malformed output")
    if status_code == 400 and "token" in body.lower():
        return ResourceError("Token limit exceeded")
    if status_code in (401, 403):
        return FatalError(f"Auth failure: HTTP {status_code}")
    return AgentError(f"Unknown error: HTTP {status_code}")

# ── Resilient Agent ──
@dataclass
class ResilientAgent:
    primary_model: Callable
    fallback_models: list[Callable] = field(default_factory=list)
    max_retries: int = 3
    base_delay: float = 1.0
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)
    degradation: DegradationManager = field(default_factory=DegradationManager)

    def invoke(self, prompt: str, **kwargs) -> dict:
        # Step 1: Try each model in turn, with retries per model
        all_models = [self.primary_model] + self.fallback_models
        for model in all_models:
            try:
                return self._call_with_retry(model, prompt, **kwargs)
            except FatalError:
                raise
            except AgentError as e:
                logger.warning(f"{model.__name__} exhausted: {e}")
                self.degradation.degrade()
                continue
        # Step 2: All models failed — static fallback
        logger.error("All models failed. Returning static fallback.")
        return {
            "response": "I'm unable to process your request right now.",
            "degraded": True,
            "level": self.degradation.level,
        }

    def _call_with_retry(self, model, prompt, **kwargs):
        for attempt in range(self.max_retries + 1):
            try:
                result = self.breaker.call(model, prompt, **kwargs)
                self.degradation.reset()
                return result
            except (FatalError, CircuitOpenError):
                raise
            except SemanticError as e:
                if attempt == self.max_retries:
                    raise
                prompt = self._repair_prompt(prompt, e)
                continue  # Retry with corrected prompt
            except TransientError:
                if attempt == self.max_retries:
                    raise
                delay = min(self.base_delay * (2 ** attempt) + random.uniform(0, 0.5), 60)
                time.sleep(delay)

    def _repair_prompt(self, prompt, error):
        return prompt + (f"\n\n[System: previous response had error: {error}. "
                         "Please fix the output format and try again.]")

The ResilientAgent encapsulates every pattern discussed in this article: error classification routes failures to the right recovery path, exponential back-off with jitter handles transient failures, the circuit breaker prevents cascading overload, the fallback chain provides model-level redundancy, and the degradation manager ensures the user always gets a response — even if it's a reduced-quality one.

Observability Integration

Error handling without observability is flying blind. Every retry, fallback, and degradation event should emit structured metrics. Here's a minimal integration:

from prometheus_client import Counter, Histogram

# ── Metrics ──
RETRY_COUNT = Counter("agent_retries_total", "Retry attempts", ["model", "error_type"])
FALLBACK_COUNT = Counter("agent_fallbacks_total", "Fallback activations", ["from_model", "to_model"])
DEGRADATION_LEVEL = Histogram("agent_degradation_level", "Degradation level at response")
LATENCY = Histogram("agent_call_seconds", "End-to-end call latency", ["model", "status"])

# ── Instrumented retry hooks (integrate into _call_with_retry) ──
def on_retry(model_name: str, error_type: str):
    RETRY_COUNT.labels(model=model_name, error_type=error_type).inc()

def on_fallback(from_model: str, to_model: str):
    FALLBACK_COUNT.labels(from_model=from_model, to_model=to_model).inc()

def on_response(level: int, latency: float, model: str):
    DEGRADATION_LEVEL.observe(level)
    LATENCY.labels(model=model, status="ok").observe(latency)
Key metrics to alert on: A spike in agent_fallbacks_total signals provider instability. A sustained increase in agent_degradation_level means users are getting lower-quality responses. Set thresholds and page your on-call when either drifts above baseline.

Building resilient agents is not about preventing every failure — it's about building systems that fail gracefully, recover quickly, and degrade predictably. Combine the patterns in this article — error classification, exponential back-off, fallback chains, graceful degradation, checkpoint recovery, and observability — and your agents will handle production chaos with confidence.

Next steps: Explore Agent Memory for persistent state management across sessions, and Agent Observability for deeper tracing and debugging of multi-step agent workflows.