LLM-powered agents operate in inherently unreliable environments. Every external call — to a model API,
a tool endpoint, or a vector store — is a potential failure point. Before building retry or fallback logic,
you need a precise taxonomy of what can go wrong. Agent errors fall into four broad categories, each
demanding a different recovery strategy.
Transient Errors
These are temporary failures caused by infrastructure hiccups — rate limits (HTTP 429), connection
timeouts, server 503 responses, and DNS resolution failures. They resolve on their own and are the
primary target for retry logic. The key characteristic: the same request will succeed if
you wait and try again.
Semantic Errors
The model returned a response, but it doesn't conform to expectations — malformed JSON, calling a
tool that doesn't exist, or returning a schema-violating payload. Retrying the identical prompt
rarely helps; instead you need to re-prompt with corrective context (e.g., append
the parse error message to the next call).
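A minimal sketch of that corrective loop, assuming a generic llm(prompt) -> str callable and JSON output (the attempt limit is illustrative):
import json

def call_until_valid(llm, prompt: str, max_attempts: int = 3):
    """Re-prompt with the parse error appended until the output is valid JSON."""
    for _ in range(max_attempts):
        raw = llm(prompt)
        try:
            return json.loads(raw)
        except json.JSONDecodeError as e:
            # Feed the parse error back as corrective context
            prompt += f"\n\nYour previous reply was not valid JSON ({e}). Reply with valid JSON only."
    raise ValueError(f"No valid JSON after {max_attempts} attempts")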
Resource Errors
The request is valid, but you've hit a hard constraint — token budget exceeded, context window
overflow, or a spending cap. Recovery means reducing the payload: summarize history,
drop older tool results, or switch to a cheaper model with a smaller context window.
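One way to implement that payload reduction is a sketch like the following, assuming a count_tokens(messages) helper backed by your tokenizer of choice:
def trim_history(messages: list, max_tokens: int, count_tokens) -> list:
    """Drop the oldest non-system messages until the payload fits the budget."""
    trimmed = list(messages)
    while count_tokens(trimmed) > max_tokens and len(trimmed) > 2:
        trimmed.pop(1)  # Keep the system prompt (index 0); evict the oldest turn
    return trimmed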
Fatal Errors
No retry or prompt rewrite will fix these. Authentication failures, revoked API keys, permanently
removed endpoints, and policy violations fall here. The correct response is to abort
immediately, log the error, and alert an operator.
Design principle: Classify before you retry. A retry loop that hammers a 401 endpoint
wastes tokens and time. Always inspect the error type or HTTP status code first, then branch into the
appropriate recovery path.
Retry Strategies
Not all retries are equal. A naive fixed-interval retry can amplify failures during an outage (the
"thundering herd" problem). Production agent systems use one of three strategies, each with different
trade-offs between latency and reliability.
Exponential Back-off with Jitter
The gold standard for transient errors. After each failed attempt the wait time doubles, and a random jitter component prevents synchronized retries across distributed agents. The formula is delay = min(base_delay * 2^attempt + random(0, jitter), max_delay), implemented below as a decorator:
import random, time
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=1.0, max_delay=60.0, jitter=0.5):
    """Decorator: exponential back-off with jitter for transient errors."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except TransientError as e:
                    if attempt == max_retries:
                        raise MaxRetriesExceeded(func.__name__, max_retries) from e
                    delay = min(base_delay * (2 ** attempt) + random.uniform(0, jitter), max_delay)
                    log(f"Retry {attempt + 1}/{max_retries} for {func.__name__}, waiting {delay:.2f}s")
                    time.sleep(delay)
        return wrapper
    return decorator
Linear Back-off
Simpler and more predictable than exponential. Each retry adds a fixed increment (e.g., 2 s, 4 s, 6 s).
Works well when you expect short-lived blips but want a tighter latency ceiling than exponential
back-off allows.
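A sketch of the linear variant, reusing the TransientError type defined in the classifier later in this article (the increment is illustrative):
import time

def retry_linear(func, max_retries=3, increment=2.0):
    """Retry with an additive delay: 2 s, then 4 s, then 6 s."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except TransientError:
            if attempt == max_retries:
                raise
            time.sleep(increment * (attempt + 1))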
Adaptive Retry with Circuit Breaker
Combines retries with a circuit-breaker pattern borrowed from microservice resilience. After n
consecutive failures, the breaker "opens" and all calls immediately fail-fast for a cooldown window. This
prevents a cascading failure where a struggling upstream service gets hammered by retries from every agent
in your fleet.
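A minimal sketch of such a breaker (the threshold and cooldown defaults are illustrative); the ResilientAgent in the final section assumes an object with this shape:
import time

class CircuitOpenError(Exception):
    """Raised when the breaker is open and calls are failing fast."""

class CircuitBreaker:
    """Open after `threshold` consecutive failures; fail fast for `cooldown` seconds."""
    def __init__(self, threshold: int = 5, cooldown: float = 30.0):
        self.threshold = threshold
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown:
                raise CircuitOpenError("Circuit open; failing fast")
            self.opened_at = None  # Cooldown elapsed: allow a probe call
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.time()
            raise
        self.failures = 0  # Any success closes the circuit
        return result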
Warning: Never retry on fatal errors (401, 403, invalid API key). Your
classifier must distinguish retryable from non-retryable exceptions before entering the retry
loop, or you risk burning through your budget on doomed requests.
Fallback Chains
When retries are exhausted, the agent shouldn't simply crash. A fallback chain defines
an ordered list of alternative strategies, each progressively simpler or cheaper. The agent walks down
the chain until one succeeds or every option is exhausted.
The most common fallback pattern for agents is model cascading: start with the most
capable (and expensive) model, and fall through to cheaper or cross-provider alternatives when it fails.
But fallback chains aren't limited to model swaps — you can also fall back on strategy changes.
Model Cascading
GPT-4o → GPT-4o-mini → Claude Haiku → local model. Each step trades quality for availability and
cost. The agent passes the same prompt to each model in sequence.
Strategy Downgrade
ReAct loop (multi-step reasoning) → single-shot tool call → direct LLM answer → cached response.
Each step reduces capability but increases reliability and reduces latency.
class FallbackChain:
    """Walk an ordered list of callables until one succeeds."""
    def __init__(self, handlers: list):
        self.handlers = handlers

    def execute(self, *args, **kwargs):
        errors = []
        for i, handler in enumerate(self.handlers):
            try:
                log(f"Trying handler {i}: {handler.__name__}")
                result = handler(*args, **kwargs)
                if i > 0:
                    log(f"Fallback to handler {i} succeeded")
                return result
            except FatalError:
                raise  # Never swallow fatal errors
            except Exception as e:
                errors.append((handler.__name__, e))
                log(f"Handler {handler.__name__} failed: {e}")
        raise AllHandlersFailed(errors)
# ── Usage ──
chain = FallbackChain([
    call_gpt4o,        # Primary: best quality
    call_gpt4o_mini,   # Fallback 1: cheaper
    call_claude_haiku, # Fallback 2: cross-provider
    return_cached,     # Fallback 3: static response
])
response = chain.execute(prompt=user_query)
Key insight: Cross-provider fallbacks protect you from correlated outages. If your
primary and fallback are both on OpenAI, a single API outage takes down your entire chain. Mix providers
(OpenAI + Anthropic + a local model) for true resilience.
Graceful Degradation
Graceful degradation is the principle that an agent should always return something useful, even
when it can't deliver the ideal response. Rather than surfacing a raw error to the user, the agent
progressively strips capabilities while maintaining a coherent user experience.
Degradation Levels
Think of degradation as a ladder with four rungs. The agent starts at the top and steps down only when
a higher-quality response is impossible.
The degradation manager integrates with your retry and fallback logic. When a retry fails, instead
of trying the exact same call, the agent drops one degradation level and retries with reduced
capability. This gives you a smooth failure curve instead of a cliff edge.
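A minimal sketch of a degradation manager with that behavior (the number of levels is illustrative; the ResilientAgent below assumes this interface):
class DegradationManager:
    """Track the current rung: 0 = full capability, max_level = static fallback."""
    def __init__(self, max_level: int = 3):
        self.level = 0
        self.max_level = max_level

    def degrade(self):
        """Step one rung down the ladder after a failed recovery attempt."""
        self.level = min(self.level + 1, self.max_level)

    def reset(self):
        """Restore full capability after a successful call."""
        self.level = 0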
Important: Always inform the user when operating in a degraded mode. Append a
disclaimer like "This response was generated without access to live data" so users know
the answer quality may be lower than usual.
Recovery Patterns
Recovery goes beyond "retry until it works." In a multi-step agent loop, a failure at step 4 of 7
doesn't necessarily mean starting over. Well-designed agents implement checkpoint-based
recovery — they persist intermediate state so they can resume from the last successful step.
Checkpoint & Resume
After each successful tool call or reasoning step, the agent serializes its state (conversation history,
accumulated tool results, current plan) to a durable store. On failure, the orchestrator loads the
latest checkpoint and retries only the failed step.
class CheckpointedAgent:
    """Persist state after each step for checkpoint-based recovery."""
    def __init__(self, store, agent_id: str):
        self.store = store
        self.agent_id = agent_id

    def run(self, plan: list):
        state = self.store.load(self.agent_id) or {"step": 0, "results": []}
        for i in range(state["step"], len(plan)):
            try:
                result = self.execute_step(plan[i], state)
                state["results"].append(result)
                state["step"] = i + 1
                self.store.save(self.agent_id, state)  # Durable checkpoint
            except FatalError:
                state["status"] = "failed"
                self.store.save(self.agent_id, state)
                raise
            except Exception as e:
                log(f"Step {i} failed: {e}. State checkpointed.")
                raise StepFailed(step=i, error=e, agent_id=self.agent_id)
        state["status"] = "completed"
        self.store.save(self.agent_id, state)
        return state["results"]
Compensating Actions
Some agent steps have side effects — sending an email, updating a database, or posting to an API. If a
later step fails, you may need to undo previous side effects. This is the compensating
action (or "saga") pattern. Each step registers a compensator that runs on rollback.
class SagaOrchestrator:
    """Execute steps with compensating actions for rollback on failure."""
    def __init__(self):
        self.compensations = []

    def execute(self, steps: list):
        for step in steps:
            try:
                step["action"]()
                self.compensations.append(step["compensate"])
            except Exception as e:
                log(f"Step failed: {e}. Rolling back...")
                self._rollback()
                raise

    def _rollback(self):
        for compensate in reversed(self.compensations):
            try:
                compensate()
            except Exception as e:
                log(f"Compensation failed: {e}")  # Log but continue

# ── Example: agent booking workflow ──
saga = SagaOrchestrator()
saga.execute([
    {"action": reserve_flight, "compensate": cancel_flight},
    {"action": reserve_hotel,  "compensate": cancel_hotel},
    {"action": charge_payment, "compensate": refund_payment},
])
Production tip: Store checkpoints in Redis or a database, not in-memory. Agent processes
can crash or be restarted by container orchestrators. External state stores ensure recovery survives
process restarts.
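A sketch of such a store compatible with the CheckpointedAgent above, assuming the redis-py client (the key naming and TTL are illustrative):
import json
import redis

class RedisCheckpointStore:
    """Durable checkpoint store that survives agent process restarts."""
    def __init__(self, host="localhost", port=6379, ttl=3600):
        self.client = redis.Redis(host=host, port=port)
        self.ttl = ttl  # Expire stale checkpoints after an hour

    def save(self, agent_id: str, state: dict):
        self.client.set(f"checkpoint:{agent_id}", json.dumps(state), ex=self.ttl)

    def load(self, agent_id: str):
        raw = self.client.get(f"checkpoint:{agent_id}")
        return json.loads(raw) if raw else None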
Implementation in Python
Let's bring everything together into a production-ready agent error handler. This implementation
composes the retry decorator, fallback chain, degradation manager, and circuit breaker into a unified
ResilientAgent class that handles the full lifecycle of an agent call.
import time, random, logging
from dataclasses import dataclass, field
from typing import Callable, Any
logger = logging.getLogger("agent.resilience")
# ── Custom Exception Hierarchy ──
class AgentError(Exception): pass
class TransientError(AgentError): pass
class SemanticError(AgentError): pass
class ResourceError(AgentError): pass
class FatalError(AgentError): pass

# ── Error Classifier ──
def classify_error(status_code: int, body: str) -> AgentError:
    if status_code in (429, 502, 503, 504):
        return TransientError(f"HTTP {status_code}")
    if status_code == 400 and "json" in body.lower():
        return SemanticError("Malformed output")
    if status_code == 400 and "token" in body.lower():
        return ResourceError("Token limit exceeded")
    if status_code in (401, 403):
        return FatalError(f"Auth failure: HTTP {status_code}")
    return AgentError(f"Unknown error: HTTP {status_code}")
# ── Resilient Agent ──
@dataclass
class ResilientAgent:
    primary_model: Callable
    fallback_models: list[Callable] = field(default_factory=list)
    max_retries: int = 3
    base_delay: float = 1.0
    breaker: CircuitBreaker = field(default_factory=CircuitBreaker)
    degradation: DegradationManager = field(default_factory=DegradationManager)

    def invoke(self, prompt: str, **kwargs) -> dict:
        # Step 1: Walk the model chain, retrying each model before falling through
        all_models = [self.primary_model] + self.fallback_models
        for model in all_models:
            try:
                return self._call_with_retry(model, prompt, **kwargs)
            except FatalError:
                raise
            except AgentError as e:
                logger.warning(f"{model.__name__} exhausted: {e}")
                self.degradation.degrade()
                continue
        # Step 2: All models failed — static fallback
        logger.error("All models failed. Returning static fallback.")
        return {
            "response": "I'm unable to process your request right now.",
            "degraded": True,
            "level": self.degradation.level,
        }

    def _call_with_retry(self, model, prompt, **kwargs):
        for attempt in range(self.max_retries + 1):
            try:
                result = self.breaker.call(model, prompt, **kwargs)
                self.degradation.reset()
                return result
            except (FatalError, CircuitOpenError):
                raise
            except SemanticError as e:
                if attempt == self.max_retries:
                    raise
                prompt = self._repair_prompt(prompt, e)
                continue  # Retry with corrected prompt
            except TransientError:
                if attempt == self.max_retries:
                    raise
                delay = min(self.base_delay * (2 ** attempt) + random.uniform(0, 0.5), 60)
                time.sleep(delay)

    def _repair_prompt(self, prompt, error):
        return prompt + f"\n\n[System: previous response had error: {error}. " \
                        "Please fix the output format and try again.]"
The ResilientAgent encapsulates every pattern discussed in this article: error
classification routes failures to the right recovery path, exponential back-off with jitter handles
transient failures, the circuit breaker prevents cascading overload, the fallback chain provides
model-level redundancy, and the degradation manager ensures the user always gets a response — even
if it's a reduced-quality one.
Observability Integration
Error handling without observability is flying blind. Every retry, fallback, and degradation event
should emit structured metrics. Here's a minimal sketch, assuming the prometheus_client library (the metric names are illustrative but match those referenced below):
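from prometheus_client import Counter, Gauge

agent_retries_total = Counter(
    "agent_retries_total", "Retry attempts", ["model", "error_type"])
agent_fallbacks_total = Counter(
    "agent_fallbacks_total", "Fallback chain activations", ["from_model", "to_model"])
agent_degradation_level = Gauge(
    "agent_degradation_level", "Current degradation level (0 = full capability)")

# Emit from the recovery paths, e.g. inside ResilientAgent.invoke:
#   agent_retries_total.labels(model=model.__name__, error_type="transient").inc()
#   agent_fallbacks_total.labels(from_model="gpt4o", to_model="gpt4o_mini").inc()
#   agent_degradation_level.set(self.degradation.level)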
Key metrics to alert on: A spike in agent_fallbacks_total signals provider
instability. A sustained increase in agent_degradation_level means users are getting
lower-quality responses. Set thresholds and page your on-call when either drifts above baseline.
Building resilient agents is not about preventing every failure — it's about building systems that
fail gracefully, recover quickly, and degrade predictably. Combine the patterns in this
article — error classification, exponential back-off, fallback chains, graceful degradation, checkpoint
recovery, and observability — and your agents will handle production chaos with confidence.
Next steps: Explore Agent Memory for persistent
state management across sessions, and Agent Observability for
deeper tracing and debugging of multi-step agent workflows.