Circuit Breaker Pattern
In distributed systems, failures are not a question of if but when. A single unresponsive downstream service can consume all your threads, exhaust connection pools, and bring down your entire platform in seconds. The circuit breaker pattern—borrowed from electrical engineering—provides a mechanism to fail fast, protect the caller, and give the failing service time to recover. This post walks you through the complete pattern: the state machine, implementation strategies, production-grade configuration, and real-world lessons from Netflix’s battle-tested infrastructure.
1 · The Cascading Failure Problem
Consider a typical microservices architecture where Service A calls Service B, which calls Service C, which calls Service D. When Service D becomes unresponsive (not down—unresponsive), something insidious happens:
- Service C threads block waiting for D’s response. The default HTTP timeout is often 30–60 seconds.
- Service C’s thread pool fills up—200 threads, all blocked waiting on D.
- Service B can no longer get responses from C, and its threads start blocking.
- Service A blocks on B, and now your user-facing API returns 503s to every customer.
One slow dependency has just taken down your entire platform. This is a cascading failure, and it is the single most common failure mode in distributed systems.
Why Timeouts Alone Are Not Enough
Setting aggressive timeouts helps, but doesn’t solve the problem:
| Approach | Problem |
|---|---|
| Long timeouts (30s) | Threads block for too long, pools exhausted quickly |
| Short timeouts (1s) | False positives during normal latency spikes, legitimate slow queries fail |
| No timeout | Threads block forever, guaranteed cascading failure |
| Timeout + retry | Amplifies load on a struggling service (retry storm), making failure worse |
What we need is a mechanism that detects when a downstream service is failing and stops sending requests to it entirely for a period. Enter the circuit breaker.
2 · The Circuit Breaker State Machine
A circuit breaker has exactly three states, analogous to an electrical circuit breaker:
CLOSED: Normal Operation
The circuit is closed—requests flow through to the downstream service normally. Under the hood, the circuit breaker is silently tracking failure metrics:
- Failure count or failure rate over a sliding window
- Slow call rate—percentage of calls exceeding a duration threshold
- Each response is classified: success, failure (exception), or slow (exceeded threshold)
When the failure rate exceeds a configured threshold (e.g., 50% of the last 10 calls), the circuit trips and transitions to OPEN.
OPEN: Failing Fast
The circuit is open—no requests reach the downstream service. Instead, every call returns immediately with:
- A fallback response (cached data, default value, degraded experience)
- Or a fast-failure exception (such as `CircuitBreakerOpenException`)
This protects both the caller (no blocked threads) and the downstream service (no load while recovering). The circuit breaker starts a wait timer (e.g., 60 seconds).
HALF-OPEN: Testing Recovery
After the wait timer expires, the circuit moves to half-open. A limited number of probe requests (typically 3–10) are allowed through to test whether the downstream service has recovered:
- If probes succeed at or above the threshold → circuit transitions back to CLOSED
- If probes fail → circuit transitions back to OPEN and the wait timer resets
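To make the three states concrete, here is a minimal Resilience4j sketch that wires up the thresholds described above; the circuit name `inventory` and the placeholder supplier stand in for a real downstream call.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class StateMachineDemo {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // trip at 50% failures...
                .slidingWindowSize(10)                           // ...over the last 10 calls
                .waitDurationInOpenState(Duration.ofSeconds(60)) // stay OPEN for 60 s
                .permittedNumberOfCallsInHalfOpenState(5)        // 5 probe calls while HALF-OPEN
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("inventory", config);

        // The decorated supplier enforces the state machine around the real call.
        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, () -> "inventory payload"); // placeholder call

        try {
            String result = guarded.get();   // CLOSED or HALF-OPEN: the call goes through
        } catch (CallNotPermittedException e) {
            // OPEN: rejected immediately, no thread blocks on the downstream service
        }
    }
}
```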
▶ Interactive: Circuit Breaker State Machine. Step through the complete lifecycle: normal operation → failures accumulate → circuit opens → timeout → half-open probe → recovery or re-open.
3 · Failure Thresholds & Sliding Windows
The circuit breaker’s behavior is governed by several critical configuration parameters. Getting these right is the difference between a useful safety net and an over-sensitive alarm that cries wolf.
Sliding Window Types
Resilience4j (the standard Java circuit breaker library) supports two window types:
Count-Based Window
Tracks the last N calls. Failure rate is calculated over these N calls. Simple, predictable, but doesn’t account for time—a burst of 10 failures in 1 second is treated the same as 10 failures spread over 10 minutes.
failureRateThreshold: 50%
→ trips after 5 failures in last 10 calls
Time-Based Window
Tracks calls within the last N seconds. More reflective of real-world conditions—a service that failed 5 minutes ago shouldn’t prevent current requests. Requires more memory (stores timestamps and results for each call).
failureRateThreshold: 50%
→ trips if >50% of calls failed in last 60s
Critical Configuration Parameters
| Parameter | Description | Typical Value | Too Low | Too High |
|---|---|---|---|---|
| `failureRateThreshold` | % failures to trip circuit | 50% | False trips on normal errors | Doesn’t trip when service is degraded |
| `slowCallRateThreshold` | % slow calls to trip circuit | 80% | Trips on occasional slow queries | Doesn’t detect latency issues |
| `slowCallDurationThreshold` | Duration to classify a call as “slow” | 2–5s | Normal calls classified as slow | Truly slow calls pass as normal |
| `slidingWindowSize` | Number of calls or seconds in window | 10–100 calls / 60s | Noisy, reacts to small bursts | Slow to detect real failures |
| `minimumNumberOfCalls` | Min calls before rate is calculated | 5–10 | 1 failure trips circuit (rate = 100%) | Slow to activate on low-traffic services |
| `waitDurationInOpenState` | Time before half-open transition | 30–60s | Probes overloaded service too soon | Unnecessarily long outage for callers |
| `permittedNumberOfCallsInHalfOpenState` | Probe requests in half-open | 3–10 | Single fluke decides state | Too much load on recovering service |
4 · Resilience4j: Production Configuration
Resilience4j is the de facto standard circuit breaker library for Java / Spring Boot applications, replacing the deprecated Netflix Hystrix. Here is a production-ready configuration with annotations explaining every parameter.
Spring Boot YAML Configuration
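A representative configuration for a single circuit instance (a sketch: the instance name `paymentApi` and the listed exception classes are placeholders to adapt):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentApi:
        slidingWindowType: COUNT_BASED        # or TIME_BASED
        slidingWindowSize: 20                 # last 20 calls
        minimumNumberOfCalls: 10              # don't compute a rate before 10 calls
        failureRateThreshold: 50              # % of failures that trips the circuit
        slowCallDurationThreshold: 2s         # calls slower than 2s count as "slow"
        slowCallRateThreshold: 80             # % of slow calls that trips the circuit
        waitDurationInOpenState: 60s          # how long to stay OPEN before probing
        permittedNumberOfCallsInHalfOpenState: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
        registerHealthIndicator: true         # expose circuit state via /actuator/health
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.payment.ValidationException   # business errors must not trip the circuit
```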
Java Annotation-Based Usage
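A sketch of the annotation-based usage; `PaymentClient`, the order ID parameter, and the REST call are illustrative stand-ins:

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class PaymentClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // "paymentApi" must match the instance name configured in YAML.
    @CircuitBreaker(name = "paymentApi", fallbackMethod = "chargeFallback")
    public String charge(String orderId) {
        return restTemplate.postForObject("https://payments.internal/charge", orderId, String.class);
    }

    // Fallback: same parameters as the protected method plus a trailing Throwable.
    private String chargeFallback(String orderId, Throwable cause) {
        return "QUEUED_FOR_RETRY:" + orderId;   // e.g., enqueue the charge for later processing
    }
}
```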
Retry Configuration (with Circuit Breaker)
The default Resilience4j aspect order is Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead, which means retries happen outside the circuit breaker. If the circuit is open, each retry attempt gets a CallNotPermittedException immediately, so no wasted calls reach a known-failed service.
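One way to express this pairing in code is a sketch using Resilience4j's `RetryConfig` (the instance name `paymentApi` and the numbers are illustrative): the retry ignores the circuit breaker's rejection exception so an open circuit is never retried.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class RetryWiring {
    static Retry paymentRetry() {
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)                                             // initial call + 2 retries
                .intervalFunction(
                        IntervalFunction.ofExponentialBackoff(500, 2.0))    // 500 ms, 1 s, 2 s, ...
                .ignoreExceptions(CallNotPermittedException.class)          // open circuit: fail fast, never retry
                .build();
        return Retry.of("paymentApi", config);
    }
}
```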
5 · Fallback Strategies
When the circuit is open, you must decide what to return to the caller. The choice of fallback strategy depends on the operation’s semantics and consistency requirements.
| Strategy | When to Use | Example | Risk |
|---|---|---|---|
| Cached Response | Read operations with stale-tolerant data | Product catalog, user profile, config | Stale data served; cache may be cold |
| Default Value | Optional enrichment data | Recommendations = empty list; rating = “N/A” | Degraded experience, may confuse users |
| Degraded Service | Partial functionality acceptable | Show products without personalized pricing | Feature loss, potential revenue impact |
| Queue for Retry | Write operations that must eventually succeed | Payment processing, order submission | Delayed processing, eventual consistency |
| Fail Fast with Error | Operations where wrong data is worse than no data | Financial calculations, compliance checks | User sees error, but data integrity preserved |
| Alternative Service | Redundant providers available | Primary CDN down → secondary CDN | Increased cost, configuration complexity |
Fallback Implementation Pattern
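A sketch of a layered fallback (last-known-good cache first, then a safe default); the in-memory cache and the placeholder catalog call are assumptions for illustration:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class CatalogClient {

    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("catalog");
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();  // last known good responses

    public List<String> productsFor(String category) {
        Supplier<List<String>> remoteCall =
                CircuitBreaker.decorateSupplier(breaker, () -> fetchFromCatalogService(category));
        try {
            List<String> products = remoteCall.get();
            cache.put(category, products);               // refresh the fallback cache on success
            return products;
        } catch (Exception e) {                           // open circuit or downstream failure
            // Fallback 1: last known good data; fallback 2: safe default (empty list)
            return cache.getOrDefault(category, List.of());
        }
    }

    private List<String> fetchFromCatalogService(String category) {
        // Placeholder for the real HTTP call to the catalog service.
        throw new UnsupportedOperationException("call catalog service here");
    }
}
```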
6 · Bulkhead Pattern
Named after the watertight compartments in a ship’s hull, the bulkhead pattern isolates resources (threads, connections, memory) per downstream dependency. If one dependency fails, only its allocated resources are consumed—the rest of the system continues operating.
🚨 Without Bulkhead
All dependencies share a single thread pool (200 threads). When Service D is slow:
- D consumes all 200 threads
- Services B, C, E starved of threads
- Entire application down
✅ With Bulkhead
Each dependency gets an isolated pool (50 threads each). When Service D is slow:
- D consumes its 50 threads
- B, C, E still have their 50 threads each
- 75% of application still works
Bulkhead Types
| Type | Mechanism | Pros | Cons |
|---|---|---|---|
| Semaphore | Limits concurrent calls (semaphore counter) | Lightweight, no thread overhead | Doesn’t protect against slow calls blocking the caller’s thread |
| Thread Pool | Dedicated thread pool per dependency | Full isolation, timeout per dependency, caller never blocked | Thread overhead, context switching, harder to debug |
Resilience4j Bulkhead Configuration
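A sketch of both bulkhead types in Resilience4j (the dependency name and pool sizes are illustrative):

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.time.Duration;

public class BulkheadWiring {

    // Semaphore bulkhead: caps concurrent calls, executed on the caller's own threads.
    static Bulkhead semaphoreBulkhead() {
        BulkheadConfig config = BulkheadConfig.custom()
                .maxConcurrentCalls(25)                   // at most 25 in-flight calls to this dependency
                .maxWaitDuration(Duration.ofMillis(100))  // wait briefly for a permit, then reject
                .build();
        return Bulkhead.of("recommendationService", config);
    }

    // Thread-pool bulkhead: full isolation, the caller's thread is never blocked by the dependency.
    static ThreadPoolBulkhead threadPoolBulkhead() {
        ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
                .coreThreadPoolSize(10)
                .maxThreadPoolSize(50)                    // the dependency can never consume more than 50 threads
                .queueCapacity(100)                       // overflow queues here before being rejected
                .build();
        return ThreadPoolBulkhead.of("recommendationService", config);
    }
}
```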
7 · Retry with Exponential Backoff & Jitter
Retries are essential but dangerous. Naive retries (immediate, fixed interval) create retry storms that amplify load on a struggling service. The solution: exponential backoff with jitter.
Exponential Backoff
Each retry waits exponentially longer than the previous one:
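For example, assuming a 100 ms base delay and a multiplier of 2 (illustrative values), the wait before attempt n is base × 2^(n-1), capped at some maximum. A minimal sketch:

```java
import java.time.Duration;

public class Backoff {
    // Delay before retry attempt n (1-based): base * 2^(n-1), capped at maxDelay.
    static Duration exponentialDelay(int attempt, Duration base, Duration maxDelay) {
        long candidate = base.toMillis() * (1L << (attempt - 1));   // 100, 200, 400, 800, ...
        return Duration.ofMillis(Math.min(candidate, maxDelay.toMillis()));
    }
}
```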
Why Jitter Is Critical
Without jitter, all clients that failed at roughly the same time will retry at exactly the same time, creating synchronized retry waves. Adding randomized jitter spreads retries across the interval:
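A minimal sketch of "full jitter", where the actual wait is drawn uniformly between zero and the exponential delay so that clients spread out instead of retrying in lockstep (Resilience4j ships a similar strategy as `IntervalFunction.ofExponentialRandomBackoff`):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {
    // Full jitter: wait a random duration in [0, exponential delay] before retry attempt n.
    static Duration delayWithJitter(int attempt, Duration base, Duration maxDelay) {
        long exponential = Math.min(base.toMillis() * (1L << (attempt - 1)), maxDelay.toMillis());
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(exponential + 1));
    }
}
```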
8 · Cascading Failure Prevention
Let’s visualize how the circuit breaker pattern prevents cascading failures across a service chain. Compare the behavior with and without circuit breakers when a downstream dependency fails.
▶ Interactive: Cascading Failure Prevention. Service A → B → C → D chain; D fails. Step through to see the cascading failure without a circuit breaker, then how the circuit breaker stops the domino effect.
9 · Library Comparison
| Library | Language | Status | Features | Notes |
|---|---|---|---|---|
| Netflix Hystrix | Java | 🔴 Deprecated (2018) | Circuit breaker, bulkhead (thread pool), dashboard, metrics | Pioneer of the pattern. Used RxJava internally. Replaced by Resilience4j. |
| Resilience4j | Java / Kotlin | 🟢 Active | Circuit breaker, retry, rate limiter, bulkhead, time limiter, cache | Lightweight, functional, Spring Boot integration. The recommended replacement for Hystrix. |
| Polly | .NET (C#) | 🟢 Active | Circuit breaker, retry, bulkhead, timeout, fallback, hedging | Fluent API, supports async. Part of .NET Foundation. Polly v8+ with ResiliencePipeline. |
| Sentinel | Java | 🟢 Active | Circuit breaker, flow control, concurrency limiting, system load protection | Alibaba project. Dashboard included. Strong in flow-control and system protection. |
| gobreaker | Go | 🟢 Active | Circuit breaker (Sony open-source) | Minimalist, idiomatic Go. No bulkhead or retry—combine with other Go libraries. |
| opossum | Node.js | 🟢 Active | Circuit breaker with events, fallback, health checks | Event-driven API. Prometheus metrics plugin available. |
Polly (.NET) Example
Go (gobreaker) Example
10 · How to Implement
Whether you use a library or build a custom circuit breaker, the core implementation follows this pattern:
Step-by-Step Implementation
- Define your failure criteria. What counts as a failure? HTTP 5xx? Timeouts? Specific exceptions? Business logic errors should not trip the circuit.
- Choose window type. Count-based for simplicity; time-based for accuracy. Start with count-based (window = 20).
- Set initial thresholds conservatively. Start with 50% failure rate, 20-call window, 60-second wait duration. Tune from production data.
- Implement fallbacks for every circuit. No circuit breaker should just throw an exception—always have a degraded response path.
- Add bulkheads alongside circuit breakers. Circuit breakers detect failure; bulkheads contain the blast radius before the circuit trips.
- Place retry outside the circuit breaker. Calls against an open circuit then fail fast with CallNotPermittedException instead of being retried against a known-failed service.
- Emit metrics for every state transition. Without observability, circuit breakers are invisible black boxes.
Custom Implementation Skeleton
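Below is a minimal sketch of the state machine in Java: count-based, tripping on consecutive failures, with coarse synchronization. Real implementations add sliding windows, slow-call detection, finer-grained concurrency, and metrics. Usage would look like `breaker.call(() -> client.fetch(), () -> cachedValue)`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Count-based circuit breaker sketch: consecutive failures trip it, a wait timer
 *  gates the HALF-OPEN probes. Coarse synchronization; illustration only. */
public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the circuit
    private final int halfOpenProbes;     // probe calls allowed while HALF_OPEN
    private final Duration openWait;      // how long to stay OPEN before probing

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int probesIssued = 0;
    private int probeSuccesses = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, int halfOpenProbes, Duration openWait) {
        this.failureThreshold = failureThreshold;
        this.halfOpenProbes = halfOpenProbes;
        this.openWait = openWait;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openWait))) {
                state = State.HALF_OPEN;          // wait timer expired: start probing
                probesIssued = 0;
                probeSuccesses = 0;
            } else {
                return fallback.get();            // fail fast while OPEN
            }
        }
        if (state == State.HALF_OPEN && probesIssued >= halfOpenProbes) {
            return fallback.get();                // probe budget used up for this cycle
        }
        if (state == State.HALF_OPEN) {
            probesIssued++;
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {            // "failure" here = any runtime exception
            onFailure();
            return fallback.get();
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++probeSuccesses >= halfOpenProbes) {   // all probes succeeded: recover
                state = State.CLOSED;
                consecutiveFailures = 0;
            }
        } else {
            consecutiveFailures = 0;
        }
    }

    private void onFailure() {
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;                         // trip (or re-open) and restart the wait timer
            openedAt = Instant.now();
        }
    }
}
```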
11 · Monitoring Circuit Breaker State
In production, you must have comprehensive monitoring of circuit breaker behavior. Without it, you’ll never know if your circuits are tuned correctly or if they’re silently degrading service quality.
Key Metrics to Track
- State transitions
- Rejection rate
- Fallback rate
- Recovery time
Prometheus + Micrometer Metrics
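With Spring Boot, circuit breaker metrics are published automatically once Actuator and Micrometer are on the classpath. In plain Java the binding looks roughly like this (a sketch; the circuit name is a placeholder and exact Prometheus metric names can vary by Resilience4j version):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CircuitBreakerMetrics {
    public static void main(String[] args) {
        CircuitBreakerRegistry breakers = CircuitBreakerRegistry.ofDefaults();
        MeterRegistry meters = new SimpleMeterRegistry();   // use a Prometheus registry in production

        // Publishes gauges/timers for circuit state, call counts by outcome,
        // failure rate, slow-call rate, and not-permitted (rejected) calls.
        TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(breakers).bindTo(meters);

        // Log every state transition with context, per the best practices below.
        CircuitBreaker payment = breakers.circuitBreaker("paymentApi");
        payment.getEventPublisher().onStateTransition(event ->
                System.out.printf("circuit %s: %s%n",
                        event.getCircuitBreakerName(), event.getStateTransition()));
    }
}
```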
Grafana Dashboard Panels
A production circuit breaker dashboard should include these panels:
- Circuit State Timeline: State band chart showing CLOSED/OPEN/HALF-OPEN transitions over time for all services
- Failure Rate vs Threshold: Line chart of `failure_rate` with a horizontal threshold line. Alert when rate approaches threshold.
- Rejection Rate: Percentage of requests rejected (not permitted) per service. Non-zero = circuit was open.
- Fallback Invocations: Count of fallback calls per service per minute. Trends show service reliability.
- Recovery Time Distribution: Histogram of time spent in OPEN state before successful recovery.
- Call Duration Percentiles: p50, p95, p99 latency. Expect sudden drops when circuit opens (fast-fail is fast).
Alerting Rules
Alert on every transition to OPEN, on circuits that stay open longer than expected, and on flapping circuits (rapid OPEN ↔ CLOSED cycles), which usually indicate a mis-tuned `waitDurationInOpenState` or `minimumNumberOfCalls`.
12 · Real-World: Netflix Circuit Breakers
Netflix pioneered the circuit breaker pattern in microservices at scale, creating the Hystrix library (now deprecated but historically foundational). Their experience during AWS outages provides the definitive case study.
Netflix’s Architecture Context
- ~1,000+ microservices in production (as of their Hystrix era)
- ~10 billion Hystrix executions per day across the fleet
- >99.99% upstream availability target even when individual services fail
- Every API call between services was wrapped in a `HystrixCommand`
The 2011 AWS US-East-1 Outage
In April 2011, AWS experienced a major EBS (Elastic Block Store) outage in the US-East-1 region. Multiple Netflix dependencies became unresponsive:
- Personalization service: Could not read user preferences from EBS-backed databases
- Bookmark service: Could not retrieve “continue watching” positions
- Rating service: Could not load user ratings
Without circuit breakers, the Netflix API would have blocked on all three services, exhausted its threads, and returned 503 errors to every device worldwide.
With Hystrix circuit breakers, what actually happened:
- Personalization circuit opened → returned generic (non-personalized) recommendations
- Bookmark circuit opened → “continue watching” row hidden from UI
- Rating circuit opened → ratings shown as “Not Available”
- Streaming continued to work. Users could still browse, search, and watch content.
Netflix’s Configuration Philosophy
| Parameter | Netflix Default | Rationale |
|---|---|---|
| Thread pool size per service | 10 threads | Small pools force bulkheading; 100 dependencies × 10 threads = 1,000 total |
| Request timeout | 1,000ms | If a dependency can’t respond in 1s, something is wrong |
| Error threshold | 50% of 20 calls | Balanced between sensitivity and stability |
| Sleep window (wait duration) | 5,000ms | Aggressively short—try to recover quickly |
| Metrics rolling window | 10 seconds | Recent failures matter more than old ones |
Key Takeaways from Netflix
- Every external call gets a circuit breaker. No exceptions. If it crosses a network boundary, it gets a circuit breaker.
- Thread pool isolation (bulkhead) was equally important. The thread pool per dependency was what actually prevented cascading failure before the circuit tripped.
- Fallbacks must be designed upfront. You cannot retrofit meaningful fallbacks during an outage. They must be part of the original design.
- Dashboard visibility is critical. The Hystrix Dashboard provided real-time visualization of circuit state across hundreds of services. Without this, operators could not make informed decisions.
- Test in production with fault injection. Netflix’s Chaos Engineering (Chaos Monkey, Chaos Kong) regularly injects failures to validate that circuit breakers and fallbacks work correctly.
From Hystrix to Resilience4j
Netflix deprecated Hystrix in 2018, recommending Resilience4j as the replacement. Key reasons:
- Hystrix required RxJava, adding significant complexity
- Thread pool isolation added overhead for every call (context switch)
- Resilience4j uses a simpler functional approach with decorators
- Resilience4j supports both semaphore and thread pool bulkheads (choose per use case)
- Better integration with Spring Boot, Micrometer, and modern Java features
13 · Best Practices & Anti-Patterns
✅ Do
- Circuit break every cross-network call
- Design fallbacks at architecture time, not incident time
- Use bulkheads alongside circuit breakers
- Monitor state transitions and alert on opens
- Use exponential backoff with jitter for retries
- Test circuit breakers with fault injection regularly
- Set `minimumNumberOfCalls` to avoid false trips
- Log every state change with context (which dependency, what failure)
❌ Don’t
- Don’t circuit break on business logic errors (validation failures)
- Don’t set wait duration too short (retry storms)
- Don’t forget to tune thresholds from production data
- Don’t use a single shared thread pool for all dependencies
- Don’t retry without backoff and jitter
- Don’t ignore flapping circuits—they indicate misconfiguration
- Don’t treat circuit breakers as a replacement for proper error handling
- Don’t forget health checks—circuit breakers protect callers but don’t fix the root cause