
Circuit Breaker Pattern

In distributed systems, failures are not a question of if but when. A single unresponsive downstream service can consume all your threads, exhaust connection pools, and bring down your entire platform in seconds. The circuit breaker pattern—borrowed from electrical engineering—provides a mechanism to fail fast, protect the caller, and give the failing service time to recover. This post walks you through the complete pattern: the state machine, implementation strategies, production-grade configuration, and real-world lessons from Netflix’s battle-tested infrastructure.

1 · The Cascading Failure Problem

Consider a typical microservices architecture where Service A calls Service B, which calls Service C, which calls Service D. When Service D becomes unresponsive (not down—unresponsive), something insidious happens:

  1. Service C threads block waiting for D’s response. The default HTTP timeout is often 30–60 seconds.
  2. Service C’s thread pool fills up—200 threads, all blocked waiting on D.
  3. Service B can no longer get responses from C, and its threads start blocking.
  4. Service A blocks on B, and now your user-facing API returns 503s to every customer.

One slow dependency has just taken down your entire platform. This is a cascading failure, and it is the single most common failure mode in distributed systems.

⚠ Critical Insight: Slow services are more dangerous than dead services. A dead service fails immediately (connection refused), consuming minimal resources. A slow service holds threads, memory, and connections hostage for the duration of the timeout—often 30–60x longer than a normal request.

Why Timeouts Alone Are Not Enough

Setting aggressive timeouts helps, but doesn’t solve the problem:

| Approach | Problem |
|---|---|
| Long timeouts (30s) | Threads block for too long, pools exhausted quickly |
| Short timeouts (1s) | False positives during normal latency spikes, legitimate slow queries fail |
| No timeout | Threads block forever, guaranteed cascading failure |
| Timeout + retry | Amplifies load on a struggling service (retry storm), making failure worse |

What we need is a mechanism that detects when a downstream service is failing and stops sending requests to it entirely for a period. Enter the circuit breaker.

2 · The Circuit Breaker State Machine

A circuit breaker has exactly three states, analogous to an electrical circuit breaker:

CLOSED · Normal Operation

The circuit is closed—requests flow through to the downstream service normally. Under the hood, the circuit breaker silently tracks failure metrics (failure count, failure rate, and slow-call rate) over a sliding window of recent calls.

When the failure rate exceeds a configured threshold (e.g., 50% of the last 10 calls), the circuit trips and transitions to OPEN.

failure_rate = failures_in_window / total_calls_in_window

if failure_rate >= threshold → transition CLOSED → OPEN

OPEN · Failing Fast

The circuit is open—no requests reach the downstream service. Instead, every call returns immediately with an error (in Resilience4j, a CallNotPermittedException) or with the configured fallback response; no remote call is made.

This protects both the caller (no blocked threads) and the downstream service (no load while recovering). The circuit breaker starts a wait timer (e.g., 60 seconds).

While OPEN:
  • All requests → immediate fallback (no remote call)
  • Thread consumption: ~0 (no blocking)
  • After wait_duration expires → transition OPEN → HALF-OPEN

HALF-OPEN · Testing Recovery

After the wait timer expires, the circuit moves to half-open. A limited number of probe requests (typically 3–10) are allowed through to test whether the downstream service has recovered:

HALF-OPEN permits N probe requests:
  if success_rate >= threshold → CLOSED (service recovered)
  if failure_rate >= threshold → OPEN (still failing, reset timer)
💡 Key Design: The half-open state is what makes circuit breakers self-healing. Without it, you’d need manual intervention to re-enable a dependency after an outage. The probe requests provide automatic recovery detection.

▶ Interactive demo: Circuit Breaker State Machine. Step through the complete lifecycle: normal operation → failures accumulate → circuit opens → timeout → half-open probe → recovery or re-open.
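To make the lifecycle concrete, here is a minimal sketch using Resilience4j's programmatic API (the library itself is covered in the next sections). The configuration values are illustrative and callDownstream is a placeholder for a real remote call; this is a demonstration of the transitions, not production code.

import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;

import java.time.Duration;

public class StateMachineDemo {

    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                         // trip at ≥50% failures
                .slidingWindowSize(10)                            // over the last 10 calls
                .minimumNumberOfCalls(5)                          // need 5 calls before computing the rate
                .waitDurationInOpenState(Duration.ofSeconds(60))  // stay OPEN for 60s
                .permittedNumberOfCallsInHalfOpenState(3)         // 3 probes in HALF_OPEN
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("inventoryService", config);

        for (int i = 0; i < 20; i++) {
            try {
                // CLOSED / HALF_OPEN: the call goes through and its outcome is recorded.
                String result = breaker.executeSupplier(StateMachineDemo::callDownstream);
                System.out.println("success: " + result);
            } catch (CallNotPermittedException e) {
                // OPEN: the breaker rejects the call immediately; no remote call is made.
                System.out.println("fast-fail, state=" + breaker.getState());
            } catch (RuntimeException e) {
                // The failure counts toward the failure rate in the sliding window.
                System.out.println("failure recorded, state=" + breaker.getState());
            }
        }
    }

    // Placeholder for a remote call to a dependency that is currently failing.
    private static String callDownstream() {
        throw new RuntimeException("downstream unavailable");
    }
}

Running this against a failing dependency shows the transition: the first five calls record failures, the failure rate crosses the threshold, and every subsequent call is rejected immediately without touching the network.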

3 · Failure Thresholds & Sliding Windows

The circuit breaker’s behavior is governed by several critical configuration parameters. Getting these right is the difference between a useful safety net and an over-sensitive alarm that cries wolf.

Sliding Window Types

Resilience4j (the standard Java circuit breaker library) supports two window types:

Count-Based Window

Tracks the last N calls. Failure rate is calculated over these N calls. Simple, predictable, but doesn’t account for time—a burst of 10 failures in 1 second is treated the same as 10 failures spread over 10 minutes.

slidingWindowSize: 10
failureRateThreshold: 50%

→ trips after 5 failures in last 10 calls

Time-Based Window

Tracks calls within the last N seconds. More reflective of real-world conditions—a service that failed 5 minutes ago shouldn’t prevent current requests. Requires more memory (stores timestamps and results for each call).

slidingWindowSize: 60 (seconds)
failureRateThreshold: 50%

→ trips if >50% of calls failed in last 60s

Critical Configuration Parameters

| Parameter | Description | Typical Value | Too Low | Too High |
|---|---|---|---|---|
| failureRateThreshold | % failures to trip circuit | 50% | False trips on normal errors | Doesn't trip when service is degraded |
| slowCallRateThreshold | % slow calls to trip circuit | 80% | Trips on occasional slow queries | Doesn't detect latency issues |
| slowCallDurationThreshold | Duration to classify a call as "slow" | 2–5s | Normal calls classified as slow | Truly slow calls pass as normal |
| slidingWindowSize | Number of calls or seconds in window | 10–100 calls / 60s | Noisy, reacts to small bursts | Slow to detect real failures |
| minimumNumberOfCalls | Min calls before rate is calculated | 5–10 | 1 failure trips circuit (rate = 100%) | Slow to activate on low-traffic services |
| waitDurationInOpenState | Time before half-open transition | 30–60s | Probes overloaded service too soon | Unnecessarily long outage for callers |
| permittedNumberOfCallsInHalfOpenState | Probe requests in half-open | 3–10 | Single fluke decides state | Too much load on recovering service |

4 · Resilience4j: Production Configuration

Resilience4j is the de facto standard circuit breaker library for Java / Spring Boot applications, replacing the deprecated Netflix Hystrix. Here is a production-ready configuration with annotations explaining every parameter.

Spring Boot YAML Configuration

# application.yml — Resilience4j Circuit Breaker Config
resilience4j:
  circuitbreaker:
    configs:
      default:
        # Sliding window: count-based, last 20 calls
        slidingWindowType: COUNT_BASED
        slidingWindowSize: 20
        # Trip when ≥50% of calls fail
        failureRateThreshold: 50
        # Also trip if ≥80% of calls are slow
        slowCallRateThreshold: 80
        slowCallDurationThreshold: 3s
        # Need at least 5 calls before computing rate
        minimumNumberOfCalls: 5
        # Wait 60s in OPEN before probing
        waitDurationInOpenState: 60s
        # Allow 5 probe requests in HALF-OPEN
        permittedNumberOfCallsInHalfOpenState: 5
        # Automatically transition from OPEN to HALF-OPEN
        automaticTransitionFromOpenToHalfOpenEnabled: true
        # Count these exceptions as failures
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
          - org.springframework.web.client.HttpServerErrorException
        # Ignore these (don't count as failure or success)
        ignoreExceptions:
          - com.example.BusinessValidationException
    instances:
      paymentService:
        baseConfig: default
        # Override: payments are critical, stricter thresholds
        failureRateThreshold: 30
        waitDurationInOpenState: 30s
      inventoryService:
        baseConfig: default
        # More tolerant: occasional inventory misses are acceptable
        failureRateThreshold: 70
        slidingWindowSize: 50

Java Annotation-Based Usage

@Slf4j  // Lombok logger; provides the `log` used in the fallback
@Service
public class PaymentGateway {

    private final RestTemplate restTemplate;
    private final PaymentCacheService cache;
    private final MessageQueueClient messageQueue; // assumed wrapper around your async retry queue

    @CircuitBreaker(name = "paymentService", fallbackMethod = "paymentFallback")
    @Retry(name = "paymentService")
    @Bulkhead(name = "paymentService")
    public PaymentResult processPayment(PaymentRequest request) {
        return restTemplate.postForObject(
            "https://payments.internal/api/v1/charge",
            request,
            PaymentResult.class
        );
    }

    // Fallback: called when the circuit is OPEN or the call fails
    public PaymentResult paymentFallback(PaymentRequest req, Throwable t) {
        log.warn("Payment circuit open, queuing for retry: {}", t.getMessage());
        // Queue payment for async retry
        messageQueue.send("payment-retry", req);
        return PaymentResult.pending(req.getOrderId());
    }
}

Retry Configuration (with Circuit Breaker)

# Retry wraps around the circuit breaker
# Order: Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead → Function
resilience4j:
  retry:
    instances:
      paymentService:
        maxAttempts: 3
        waitDuration: 500ms
        # Exponential backoff: 500ms → 1s → 2s
        enableExponentialBackoff: true
        exponentialBackoffMultiplier: 2
        exponentialMaxWaitDuration: 5s
        # Add randomized jitter (±200ms) to prevent thundering herd
        enableRandomizedWait: true
        randomizedWaitFactor: 0.4
        retryExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.BusinessValidationException
⚠ Decorator Order Matters: In Resilience4j, the default decoration order is Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead. This means retries happen outside the circuit breaker. If the circuit is open, the retry will get a CallNotPermittedException immediately—no wasted retries against a known-failed service.

5 · Fallback Strategies

When the circuit is open, you must decide what to return to the caller. The choice of fallback strategy depends on the operation’s semantics and consistency requirements.

| Strategy | When to Use | Example | Risk |
|---|---|---|---|
| Cached Response | Read operations with stale-tolerant data | Product catalog, user profile, config | Stale data served; cache may be cold |
| Default Value | Optional enrichment data | Recommendations = empty list; rating = "N/A" | Degraded experience, may confuse users |
| Degraded Service | Partial functionality acceptable | Show products without personalized pricing | Feature loss, potential revenue impact |
| Queue for Retry | Write operations that must eventually succeed | Payment processing, order submission | Delayed processing, eventual consistency |
| Fail Fast with Error | Operations where wrong data is worse than no data | Financial calculations, compliance checks | User sees error, but data integrity preserved |
| Alternative Service | Redundant providers available | Primary CDN down → secondary CDN | Increased cost, configuration complexity |

Fallback Implementation Pattern

@CircuitBreaker(name = "productService", fallbackMethod = "getProductFallback")
public Product getProduct(String productId) {
    return productClient.getProduct(productId);
}

// Tiered fallback: local cache → distributed cache → skeleton default
public Product getProductFallback(String productId, Throwable t) {
    // Tier 1: Try local cache
    Product cached = localCache.get("product:" + productId);
    if (cached != null) {
        log.info("Serving cached product {}", productId);
        cached.setStale(true); // Mark as potentially stale
        return cached;
    }

    // Tier 2: Try distributed cache (Redis)
    Product redisCached = redisCache.get("product:" + productId);
    if (redisCached != null) {
        log.info("Serving Redis-cached product {}", productId);
        redisCached.setStale(true);
        return redisCached;
    }

    // Tier 3: Minimal default product
    log.warn("No cache for product {}, returning skeleton", productId);
    return Product.skeleton(productId, "Product temporarily unavailable");
}

6 · Bulkhead Pattern

Named after the watertight compartments in a ship’s hull, the bulkhead pattern isolates resources (threads, connections, memory) per downstream dependency. If one dependency fails, only its allocated resources are consumed—the rest of the system continues operating.

🚨 Without Bulkhead

All dependencies share a single thread pool (200 threads). When Service D is slow:

  • D consumes all 200 threads
  • Services B, C, E starved of threads
  • Entire application down

✅ With Bulkhead

Each dependency gets an isolated pool (50 threads each). When Service D is slow:

  • D consumes its 50 threads
  • B, C, E still have their 50 threads each
  • 75% of application still works

Bulkhead Types

| Type | Mechanism | Pros | Cons |
|---|---|---|---|
| Semaphore | Limits concurrent calls (semaphore counter) | Lightweight, no thread overhead | Doesn't protect against slow calls blocking the caller's thread |
| Thread Pool | Dedicated thread pool per dependency | Full isolation, timeout per dependency, caller never blocked | Thread overhead, context switching, harder to debug |

Resilience4j Bulkhead Configuration

resilience4j:
  bulkhead:
    instances:
      paymentService:
        # Semaphore bulkhead: max 25 concurrent calls
        maxConcurrentCalls: 25
        # Wait max 500ms for a permit before failing
        maxWaitDuration: 500ms
  thread-pool-bulkhead:
    instances:
      inventoryService:
        # Thread-pool bulkhead: dedicated thread pool
        maxThreadPoolSize: 20
        coreThreadPoolSize: 10
        queueCapacity: 50
        keepAliveDuration: 30s
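For completeness, here is a hedged sketch of how the two bulkhead types above are selected in code with the Spring annotation. The class name, the inventory URL, and the use of RestTemplate are placeholders; the bulkhead instance names refer to the YAML config above.

import io.github.resilience4j.bulkhead.annotation.Bulkhead;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

import java.util.concurrent.CompletableFuture;

@Service
public class DownstreamClients {

    private final RestTemplate restTemplate = new RestTemplate();

    // Semaphore bulkhead ("paymentService" above): at most 25 concurrent calls,
    // but the call still executes on the caller's thread.
    @Bulkhead(name = "paymentService")
    public String charge(String orderId) {
        return restTemplate.postForObject(
            "https://payments.internal/api/v1/charge", orderId, String.class);
    }

    // Thread-pool bulkhead ("inventoryService" above): the call runs on the dedicated
    // pool, so the method must return a CompletableFuture and the caller never blocks.
    @Bulkhead(name = "inventoryService", type = Bulkhead.Type.THREADPOOL)
    public CompletableFuture<String> checkStock(String sku) {
        return CompletableFuture.completedFuture(
            restTemplate.getForObject(
                "https://inventory.internal/api/v1/stock/" + sku, String.class));
    }
}

The trade-off mirrors the table above: the semaphore variant is cheaper, while the thread-pool variant fully isolates the caller from slow calls.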

7 · Retry with Exponential Backoff & Jitter

Retries are essential but dangerous. Naive retries (immediate, fixed interval) create retry storms that amplify load on a struggling service. The solution: exponential backoff with jitter.

Exponential Backoff

Each retry waits exponentially longer than the previous one:

wait_time = base_delay × multiplier^(attempt - 1)

Attempt 1: 500ms × 2^0 = 500ms
Attempt 2: 500ms × 2^1 = 1,000ms
Attempt 3: 500ms × 2^2 = 2,000ms
Attempt 4: 500ms × 2^3 = 4,000ms (capped at max_delay)

Why Jitter Is Critical

Without jitter, all clients that failed at roughly the same time will retry at exactly the same time, creating synchronized retry waves. Adding randomized jitter spreads retries across the interval:

Full Jitter (recommended):
  wait = random(0, base_delay × 2^attempt)

Equal Jitter:
  temp = base_delay × 2^attempt
  wait = temp/2 + random(0, temp/2)

Decorrelated Jitter (AWS recommendation):
  wait = min(max_delay, random(base_delay, prev_wait × 3))
💡 AWS Study: Amazon’s analysis showed that “Full Jitter” dramatically outperforms “Equal Jitter” and “No Jitter” in total client work required to complete requests during contention. See AWS Architecture Blog: Exponential Backoff and Jitter.
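As a concrete illustration of the Full Jitter variant, here is a small self-contained sketch; the class name, delays, and cap are illustrative, and in practice you would let a library such as Resilience4j or Polly handle this. The backoff ceiling doubles per attempt (500ms, 1s, 2s, ...) and is capped, and the actual wait is a uniformly random value below that ceiling.

import java.util.concurrent.Callable;
import java.util.concurrent.ThreadLocalRandom;

public final class FullJitterRetry {

    // Calls `operation`, retrying failures with Full Jitter exponential backoff.
    public static <T> T callWithRetry(Callable<T> operation, int maxAttempts) throws Exception {
        final long baseDelayMs = 500;   // ceiling for the first retry
        final long maxDelayMs = 5_000;  // cap on any single wait

        for (int attempt = 1; ; attempt++) {
            try {
                return operation.call();
            } catch (Exception e) {
                if (attempt >= maxAttempts) {
                    throw e; // retries exhausted, surface the last failure
                }
                // Exponential ceiling: base × 2^(attempt - 1), capped at maxDelayMs.
                long ceiling = Math.min(maxDelayMs, baseDelayMs * (1L << (attempt - 1)));
                // Full Jitter: sleep a uniformly random duration in [0, ceiling).
                Thread.sleep(ThreadLocalRandom.current().nextLong(ceiling));
            }
        }
    }
}

As noted earlier, such a retry wrapper belongs outside the circuit breaker and should not retry an open-circuit rejection, otherwise it simply hammers a circuit that is already open.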

8 · Cascading Failure Prevention

Let’s visualize how the circuit breaker pattern prevents cascading failures across a service chain. Compare the behavior with and without circuit breakers when a downstream dependency fails.

▶ Interactive demo: Cascading Failure Prevention. Service A → B → C → D chain; D fails. Step through to see the cascading failure without a circuit breaker, then how the circuit breaker stops the domino effect.

9 · Library Comparison

| Library | Language | Status | Features | Notes |
|---|---|---|---|---|
| Netflix Hystrix | Java | 🔴 Deprecated (2018) | Circuit breaker, bulkhead (thread pool), dashboard, metrics | Pioneer of the pattern. Used RxJava internally. Replaced by Resilience4j. |
| Resilience4j | Java / Kotlin | 🟢 Active | Circuit breaker, retry, rate limiter, bulkhead, time limiter, cache | Lightweight, functional, Spring Boot integration. The recommended replacement for Hystrix. |
| Polly | .NET (C#) | 🟢 Active | Circuit breaker, retry, bulkhead, timeout, fallback, hedging | Fluent API, supports async. Part of the .NET Foundation. Polly v8+ uses ResiliencePipeline. |
| Sentinel | Java | 🟢 Active | Circuit breaker, flow control, concurrency limiting, system load protection | Alibaba project. Dashboard included. Strong in flow control and system protection. |
| gobreaker | Go | 🟢 Active | Circuit breaker (Sony open source) | Minimalist, idiomatic Go. No bulkhead or retry—combine with other Go libraries. |
| opossum | Node.js | 🟢 Active | Circuit breaker with events, fallback, health checks | Event-driven API. Prometheus metrics plugin available. |

Polly (.NET) Example

// Polly v8+ ResiliencePipeline
var pipeline = new ResiliencePipelineBuilder<HttpResponseMessage>()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions<HttpResponseMessage>
    {
        FailureRatio = 0.5,
        SamplingDuration = TimeSpan.FromSeconds(30),
        MinimumThroughput = 10,
        BreakDuration = TimeSpan.FromSeconds(60),
        ShouldHandle = new PredicateBuilder<HttpResponseMessage>()
            .Handle<HttpRequestException>()
            .Handle<TimeoutRejectedException>()
            .HandleResult(r => (int)r.StatusCode >= 500),
        OnOpened = args =>
        {
            logger.LogWarning("Circuit opened for {BreakDuration}", args.BreakDuration);
            return ValueTask.CompletedTask;
        },
        OnClosed = args =>
        {
            logger.LogInformation("Circuit closed, service recovered");
            return ValueTask.CompletedTask;
        }
    })
    .AddRetry(new RetryStrategyOptions<HttpResponseMessage>
    {
        MaxRetryAttempts = 3,
        Delay = TimeSpan.FromMilliseconds(500),
        BackoffType = DelayBackoffType.Exponential,
        UseJitter = true
    })
    .Build();

Go (gobreaker) Example

import (
    "log"
    "time"

    "github.com/sony/gobreaker"
)

cb := gobreaker.NewCircuitBreaker(gobreaker.Settings{
    Name:        "payment-service",
    MaxRequests: 3,                // half-open probes
    Interval:    60 * time.Second, // counter reset interval
    Timeout:     30 * time.Second, // open → half-open duration
    ReadyToTrip: func(counts gobreaker.Counts) bool {
        failureRatio := float64(counts.TotalFailures) / float64(counts.Requests)
        return counts.Requests >= 5 && failureRatio >= 0.5
    },
    OnStateChange: func(name string, from, to gobreaker.State) {
        log.Printf("circuit breaker %s: %s → %s", name, from, to)
    },
})

// httpClient is assumed to be a pre-configured *http.Client with its own timeout.
result, err := cb.Execute(func() (interface{}, error) {
    return httpClient.Get("https://payments.internal/charge")
})

10 · How to Implement

Whether you use a library or build a custom circuit breaker, the core implementation follows this pattern:

Step-by-Step Implementation

  1. Define your failure criteria. What counts as a failure? HTTP 5xx? Timeouts? Specific exceptions? Business logic errors should not trip the circuit.
  2. Choose window type. Count-based for simplicity; time-based for accuracy. Start with count-based (window = 20).
  3. Set initial thresholds conservatively. Start with 50% failure rate, 20-call window, 60-second wait duration. Tune from production data.
  4. Implement fallbacks for every circuit. No circuit breaker should just throw an exception—always have a degraded response path.
  5. Add bulkheads alongside circuit breakers. Circuit breakers detect failure; bulkheads contain the blast radius before the circuit trips.
  6. Configure retry outside the circuit breaker. With retry as the outer decorator, open-circuit calls fail fast without wasting retries against a known-failed service.
  7. Emit metrics for every state transition. Without observability, circuit breakers are invisible black boxes.

Custom Implementation Skeleton

import java.util.function.Supplier;

// Minimal single-threaded sketch: synchronization is omitted, and `metrics`
// stands in for your metrics client.
public class CircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private State state = State.CLOSED;
    private int failureCount = 0;
    private int successCount = 0;
    private long lastFailureTime = 0;

    private final int failureThreshold;  // e.g., 5
    private final long waitDurationMs;   // e.g., 60000
    private final int halfOpenMaxCalls;  // e.g., 3

    public CircuitBreaker(int failureThreshold, long waitDurationMs, int halfOpenMaxCalls) {
        this.failureThreshold = failureThreshold;
        this.waitDurationMs = waitDurationMs;
        this.halfOpenMaxCalls = halfOpenMaxCalls;
    }

    public <T> T execute(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (System.currentTimeMillis() - lastFailureTime > waitDurationMs) {
                state = State.HALF_OPEN;
                successCount = 0;
                failureCount = 0;
            } else {
                metrics.increment("circuit_breaker.rejected");
                return fallback.get(); // Fast fail
            }
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (Exception e) {
            onFailure();
            return fallback.get();
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            successCount++;
            if (successCount >= halfOpenMaxCalls) {
                state = State.CLOSED; // Recovered!
                failureCount = 0;
                metrics.emit("circuit_breaker.closed");
            }
        } else {
            failureCount = Math.max(0, failureCount - 1); // Slow recovery
        }
    }

    private void onFailure() {
        failureCount++;
        lastFailureTime = System.currentTimeMillis();
        if (state == State.HALF_OPEN || failureCount >= failureThreshold) {
            state = State.OPEN;
            metrics.emit("circuit_breaker.opened");
        }
    }
}
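Usage of the skeleton might look like the following; the inventory client, DTO, and threshold values are placeholders:

// Trip after 5 failures, stay OPEN for 60s, require 3 successful probes to close.
CircuitBreaker inventoryBreaker = new CircuitBreaker(5, 60_000, 3);

InventoryStatus status = inventoryBreaker.execute(
    () -> inventoryClient.getStock(sku),   // primary remote call
    () -> InventoryStatus.unknown(sku)     // degraded fallback when OPEN or failing
);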

11 · Monitoring Circuit Breaker State

In production, you must have comprehensive monitoring of circuit breaker behavior. Without it, you’ll never know if your circuits are tuned correctly or if they’re silently degrading service quality.

Key Metrics to Track

| Metric | Example | What It Tells You |
|---|---|---|
| State Transitions | CLOSED→OPEN | Count per minute, alert on spikes |
| Rejection Rate | 2.3% | Requests rejected by open circuit |
| Fallback Rate | 1.8% | Requests served by fallback |
| Recovery Time | 45s | Avg time from OPEN to CLOSED |

Prometheus + Micrometer Metrics

# Resilience4j automatically exports these via Micrometer:

# Circuit breaker state (gauge: 0=closed, 1=open, 2=half-open)
resilience4j_circuitbreaker_state{name="paymentService"}

# Total calls by outcome
resilience4j_circuitbreaker_calls_seconds_count{
  name="paymentService",
  kind="successful|failed|ignored|not_permitted"
}

# Failure rate (gauge, 0-100)
resilience4j_circuitbreaker_failure_rate{name="paymentService"}

# Slow call rate (gauge, 0-100)
resilience4j_circuitbreaker_slow_call_rate{name="paymentService"}

# Buffered calls in sliding window
resilience4j_circuitbreaker_buffered_calls{
  name="paymentService",
  kind="successful|failed"
}

# State transition events (counter)
resilience4j_circuitbreaker_state_transitions_total{
  name="paymentService",
  from_state="CLOSED",
  to_state="OPEN"
}
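The Spring Boot starter wires this binding up automatically when Micrometer is on the classpath. If you manage the registries yourself, a rough sketch of attaching the metrics and logging state transitions might look like this (class and method names are illustrative, assuming the resilience4j-micrometer module):

import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.core.instrument.MeterRegistry;

public class CircuitBreakerObservability {

    public void bind(CircuitBreakerRegistry breakerRegistry, MeterRegistry meterRegistry) {
        // Exposes the resilience4j_circuitbreaker_* gauges and counters listed above.
        TaggedCircuitBreakerMetrics
                .ofCircuitBreakerRegistry(breakerRegistry)
                .bindTo(meterRegistry);

        // Log every state transition with the dependency name and the from/to states.
        breakerRegistry.getAllCircuitBreakers().forEach(cb ->
                cb.getEventPublisher().onStateTransition(event ->
                        System.out.printf("circuit %s: %s%n",
                                event.getCircuitBreakerName(),
                                event.getStateTransition())));
    }
}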

Grafana Dashboard Panels

A production circuit breaker dashboard should include these panels:

  1. Circuit State Timeline: State band chart showing CLOSED/OPEN/HALF-OPEN transitions over time for all services
  2. Failure Rate vs Threshold: Line chart of failure_rate with a horizontal threshold line. Alert when rate approaches threshold.
  3. Rejection Rate: Percentage of requests rejected (not permitted) per service. Non-zero = circuit was open.
  4. Fallback Invocations: Count of fallback calls per service per minute. Trends show service reliability.
  5. Recovery Time Distribution: Histogram of time spent in OPEN state before successful recovery.
  6. Call Duration Percentiles: p50, p95, p99 latency. Expect sudden drops when circuit opens (fast-fail is fast).

Alerting Rules

# Prometheus alerting rules
groups:
  - name: circuit-breaker-alerts
    rules:
      - alert: CircuitBreakerOpened
        expr: resilience4j_circuitbreaker_state == 1
        for: 30s
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is OPEN"
          description: "Service {{ $labels.name }} circuit has been open for >30s. Fallbacks are being served."

      - alert: CircuitBreakerHighRejectionRate
        expr: >
          rate(resilience4j_circuitbreaker_calls_seconds_count{kind="not_permitted"}[5m])
          /
          rate(resilience4j_circuitbreaker_calls_seconds_count[5m]) > 0.1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: ">10% of requests to {{ $labels.name }} are being rejected"

      - alert: CircuitBreakerFlapping
        expr: >
          increase(resilience4j_circuitbreaker_state_transitions_total[10m]) > 5
        for: 1m
        labels:
          severity: warning
        annotations:
          summary: "Circuit breaker {{ $labels.name }} is flapping (>5 transitions in 10m)"
⚠ Flapping Detection: A circuit breaker that rapidly oscillates between OPEN and CLOSED (flapping) usually indicates the threshold is too sensitive, the wait duration is too short, or the downstream service is in a degraded (not failed) state. Consider increasing waitDurationInOpenState or minimumNumberOfCalls.

12 · Real-World: Netflix Circuit Breakers

Netflix pioneered the circuit breaker pattern in microservices at scale, creating the Hystrix library (now deprecated but historically foundational). Their experience during AWS outages provides the definitive case study.

Netflix’s Architecture Context

The 2011 AWS US-East-1 Outage

In April 2011, AWS experienced a major EBS (Elastic Block Store) outage in the US-East-1 region. Multiple Netflix dependencies became unresponsive:

  1. Personalization service: Could not read user preferences from EBS-backed databases
  2. Bookmark service: Could not retrieve “continue watching” positions
  3. Rating service: Could not load user ratings

Without circuit breakers, the Netflix API would have blocked on all three services, consuming all threads, and returning 503 errors to every device worldwide.

With Hystrix circuit breakers, what actually happened:

  1. Personalization circuit opened → returned generic (non-personalized) recommendations
  2. Bookmark circuit opened → “continue watching” row hidden from UI
  3. Rating circuit opened → ratings shown as “Not Available”
  4. Streaming continued to work. Users could still browse, search, and watch content.
💡 Netflix Lesson: “The best experience in an outage is a slightly degraded experience, not a completely broken one.” Circuit breakers enabled Netflix to shed non-critical functionality while preserving core streaming. Users might not even notice the degradation.

Netflix’s Configuration Philosophy

| Parameter | Netflix Default | Rationale |
|---|---|---|
| Thread pool size per service | 10 threads | Small pools force bulkheading; 100 dependencies × 10 threads = 1,000 total |
| Request timeout | 1,000ms | If a dependency can't respond in 1s, something is wrong |
| Error threshold | 50% of 20 calls | Balanced between sensitivity and stability |
| Sleep window (wait duration) | 5,000ms | Aggressively short—try to recover quickly |
| Metrics rolling window | 10 seconds | Recent failures matter more than old ones |

Key Takeaways from Netflix

  1. Every external call gets a circuit breaker. No exceptions. If it crosses a network boundary, it gets a circuit breaker.
  2. Thread pool isolation (bulkhead) was equally important. The thread pool per dependency was what actually prevented cascading failure before the circuit tripped.
  3. Fallbacks must be designed upfront. You cannot retrofit meaningful fallbacks during an outage. They must be part of the original design.
  4. Dashboard visibility is critical. The Hystrix Dashboard provided real-time visualization of circuit state across hundreds of services. Without this, operators could not make informed decisions.
  5. Test in production with fault injection. Netflix’s Chaos Engineering (Chaos Monkey, Chaos Kong) regularly injects failures to validate that circuit breakers and fallbacks work correctly.

From Hystrix to Resilience4j

Netflix deprecated Hystrix in 2018, recommending Resilience4j as the replacement. Key reasons:

  • Hystrix moved into maintenance mode, and Netflix's internal focus shifted toward adaptive concurrency limits that react to real-time performance rather than pre-configured static thresholds.
  • Resilience4j is lightweight and modular: you pull in only the decorators you need (circuit breaker, retry, rate limiter, bulkhead, time limiter) without heavyweight runtime dependencies.
  • It is built on Java 8 functional interfaces and offers first-class Spring Boot and Micrometer integration.

13 · Best Practices & Anti-Patterns

✅ Do

  • Circuit break every cross-network call
  • Design fallbacks at architecture time, not incident time
  • Use bulkheads alongside circuit breakers
  • Monitor state transitions and alert on opens
  • Use exponential backoff with jitter for retries
  • Test circuit breakers with fault injection regularly
  • Set minimumNumberOfCalls to avoid false trips
  • Log every state change with context (which dependency, what failure)

❌ Don’t

  • Don’t circuit break on business logic errors (validation failures)
  • Don’t set wait duration too short (retry storms)
  • Don’t forget to tune thresholds from production data
  • Don’t use a single shared thread pool for all dependencies
  • Don’t retry without backoff and jitter
  • Don’t ignore flapping circuits—they indicate misconfiguration
  • Don’t treat circuit breakers as a replacement for proper error handling
  • Don’t forget health checks—circuit breakers protect callers but don’t fix the root cause
💡 Interview Tip: In system design interviews, when discussing resilience, mention three layers: timeouts (first line of defense), circuit breakers (detect and isolate failure), and bulkheads (contain blast radius). Then describe the three states (Closed, Open, Half-Open) and give a concrete fallback example. Mentioning Netflix Hystrix’s history and Resilience4j as the modern standard shows depth. If you can discuss exponential backoff with jitter and why “Full Jitter” outperforms other strategies, you’ll demonstrate production experience.