Circuit Breaker Pattern
In distributed systems, failures are not a question of if but when. A single unresponsive downstream service can consume all your threads, exhaust connection pools, and bring down your entire platform in seconds. The circuit breaker pattern—borrowed from electrical engineering—provides a mechanism to fail fast, protect the caller, and give the failing service time to recover. This post walks you through the complete pattern: the state machine, implementation strategies, production-grade configuration, and real-world lessons from Netflix’s battle-tested infrastructure.
1 · The Cascading Failure Problem
Consider a typical microservices architecture where Service A calls Service B, which calls Service C, which calls Service D. When Service D becomes unresponsive (not down—unresponsive), something insidious happens:
- Service C threads block waiting for D’s response. The default HTTP timeout is often 30–60 seconds.
- Service C’s thread pool fills up—200 threads, all blocked waiting on D.
- Service B can no longer get responses from C, and its threads start blocking.
- Service A blocks on B, and now your user-facing API returns 503s to every customer.
One slow dependency has just taken down your entire platform. This is a cascading failure, and it is the single most common failure mode in distributed systems.
Why Timeouts Alone Are Not Enough
Setting aggressive timeouts helps, but doesn’t solve the problem:
| Approach | Problem |
|---|---|
| Long timeouts (30s) | Threads block for too long, pools exhausted quickly |
| Short timeouts (1s) | False positives during normal latency spikes, legitimate slow queries fail |
| No timeout | Threads block forever, guaranteed cascading failure |
| Timeout + retry | Amplifies load on a struggling service (retry storm), making failure worse |
What we need is a mechanism that detects when a downstream service is failing and stops sending requests to it entirely for a period. Enter the circuit breaker.
2 · The Circuit Breaker State Machine
A circuit breaker has exactly three states, analogous to an electrical circuit breaker:
CLOSED: Normal Operation
The circuit is closed—requests flow through to the downstream service normally. Under the hood, the circuit breaker is silently tracking failure metrics:
- Failure count or failure rate over a sliding window
- Slow call rate—percentage of calls exceeding a duration threshold
- Each response is classified: success, failure (exception), or slow (exceeded threshold)
When the failure rate exceeds a configured threshold (e.g., 50% of the last 10 calls), the circuit trips and transitions to OPEN.
OPEN: Failing Fast
The circuit is open—no requests reach the downstream service. Instead, every call returns immediately with:
- A fallback response (cached data, default value, degraded experience)
- Or a fast-failure exception (such as `CircuitBreakerOpenException`)
This protects both the caller (no blocked threads) and the downstream service (no load while recovering). The circuit breaker starts a wait timer (e.g., 60 seconds).
HALF-OPEN: Testing Recovery
After the wait timer expires, the circuit moves to half-open. A limited number of probe requests (typically 3–10) are allowed through to test whether the downstream service has recovered:
- If probes succeed at or above the threshold → circuit transitions back to CLOSED
- If probes fail → circuit transitions back to OPEN and the wait timer resets
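To make the three states concrete, here is a minimal Resilience4j sketch that wires up the thresholds described above; the circuit name `inventory` and the placeholder supplier stand in for a real downstream call.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class StateMachineDemo {
    public static void main(String[] args) {
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                        // trip at 50% failures...
                .slidingWindowSize(10)                           // ...over the last 10 calls
                .waitDurationInOpenState(Duration.ofSeconds(60)) // stay OPEN for 60 s
                .permittedNumberOfCallsInHalfOpenState(5)        // 5 probe calls while HALF-OPEN
                .build();

        CircuitBreaker breaker = CircuitBreaker.of("inventory", config);

        // The decorated supplier enforces the state machine around the real call.
        Supplier<String> guarded =
                CircuitBreaker.decorateSupplier(breaker, () -> "inventory payload"); // placeholder call

        try {
            String result = guarded.get();   // CLOSED or HALF-OPEN: the call goes through
        } catch (CallNotPermittedException e) {
            // OPEN: rejected immediately, no thread blocks on the downstream service
        }
    }
}
```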
▶ Interactive: Circuit Breaker State Machine. Step through the complete lifecycle: normal operation → failures accumulate → circuit opens → timeout → half-open probe → recovery or re-open.
3 · Failure Thresholds & Sliding Windows
The circuit breaker’s behavior is governed by several critical configuration parameters. Getting these right is the difference between a useful safety net and an over-sensitive alarm that cries wolf.
Sliding Window Types
Resilience4j (the standard Java circuit breaker library) supports two window types:
Count-Based Window
Tracks the last N calls. Failure rate is calculated over these N calls. Simple, predictable, but doesn’t account for time—a burst of 10 failures in 1 second is treated the same as 10 failures spread over 10 minutes.
failureRateThreshold: 50%
→ trips after 5 failures in last 10 calls
Time-Based Window
Tracks calls within the last N seconds. More reflective of real-world conditions—a service that failed 5 minutes ago shouldn’t prevent current requests. Requires more memory (stores timestamps and results for each call).
failureRateThreshold: 50%
→ trips if >50% of calls failed in last 60s
Critical Configuration Parameters
| Parameter | Description | Typical Value | Too Low | Too High |
|---|---|---|---|---|
| `failureRateThreshold` | % failures to trip circuit | 50% | False trips on normal errors | Doesn’t trip when service is degraded |
| `slowCallRateThreshold` | % slow calls to trip circuit | 80% | Trips on occasional slow queries | Doesn’t detect latency issues |
| `slowCallDurationThreshold` | Duration to classify a call as “slow” | 2–5s | Normal calls classified as slow | Truly slow calls pass as normal |
| `slidingWindowSize` | Number of calls or seconds in window | 10–100 calls / 60s | Noisy, reacts to small bursts | Slow to detect real failures |
| `minimumNumberOfCalls` | Min calls before rate is calculated | 5–10 | 1 failure trips circuit (rate = 100%) | Slow to activate on low-traffic services |
| `waitDurationInOpenState` | Time before half-open transition | 30–60s | Probes overloaded service too soon | Unnecessarily long outage for callers |
| `permittedNumberOfCallsInHalfOpenState` | Probe requests in half-open | 3–10 | Single fluke decides state | Too much load on recovering service |
4 · Resilience4j: Production Configuration
Resilience4j is the de facto standard circuit breaker library for Java / Spring Boot applications, replacing the deprecated Netflix Hystrix. Here is a production-ready configuration with annotations explaining every parameter.
Spring Boot YAML Configuration
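A representative configuration for a single circuit instance (a sketch: the instance name `paymentApi` and the listed exception classes are placeholders to adapt):

```yaml
resilience4j:
  circuitbreaker:
    instances:
      paymentApi:
        slidingWindowType: COUNT_BASED        # or TIME_BASED
        slidingWindowSize: 20                 # last 20 calls
        minimumNumberOfCalls: 10              # don't compute a rate before 10 calls
        failureRateThreshold: 50              # % of failures that trips the circuit
        slowCallDurationThreshold: 2s         # calls slower than 2s count as "slow"
        slowCallRateThreshold: 80             # % of slow calls that trips the circuit
        waitDurationInOpenState: 60s          # how long to stay OPEN before probing
        permittedNumberOfCallsInHalfOpenState: 5
        automaticTransitionFromOpenToHalfOpenEnabled: true
        registerHealthIndicator: true         # expose circuit state via /actuator/health
        recordExceptions:
          - java.io.IOException
          - java.util.concurrent.TimeoutException
        ignoreExceptions:
          - com.example.payment.ValidationException   # business errors must not trip the circuit
```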
Java Annotation-Based Usage
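A sketch of the annotation-based usage; `PaymentClient`, the order ID parameter, and the REST call are illustrative stand-ins:

```java
import io.github.resilience4j.circuitbreaker.annotation.CircuitBreaker;
import org.springframework.stereotype.Service;
import org.springframework.web.client.RestTemplate;

@Service
public class PaymentClient {

    private final RestTemplate restTemplate = new RestTemplate();

    // "paymentApi" must match the instance name configured in YAML.
    @CircuitBreaker(name = "paymentApi", fallbackMethod = "chargeFallback")
    public String charge(String orderId) {
        return restTemplate.postForObject("https://payments.internal/charge", orderId, String.class);
    }

    // Fallback: same parameters as the protected method plus a trailing Throwable.
    private String chargeFallback(String orderId, Throwable cause) {
        return "QUEUED_FOR_RETRY:" + orderId;   // e.g., enqueue the charge for later processing
    }
}
```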
Retry Configuration (with Circuit Breaker)
The default Resilience4j aspect order is Retry → CircuitBreaker → RateLimiter → TimeLimiter → Bulkhead, which means retries happen outside the circuit breaker. If the circuit is open, each retry attempt gets a CallNotPermittedException immediately, so no wasted calls reach a known-failed service.
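One way to express this pairing in code is a sketch using Resilience4j's `RetryConfig` (the instance name `paymentApi` and the numbers are illustrative): the retry ignores the circuit breaker's rejection exception so an open circuit is never retried.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.core.IntervalFunction;
import io.github.resilience4j.retry.Retry;
import io.github.resilience4j.retry.RetryConfig;

public class RetryWiring {
    static Retry paymentRetry() {
        RetryConfig config = RetryConfig.custom()
                .maxAttempts(3)                                             // initial call + 2 retries
                .intervalFunction(
                        IntervalFunction.ofExponentialBackoff(500, 2.0))    // 500 ms, 1 s, 2 s, ...
                .ignoreExceptions(CallNotPermittedException.class)          // open circuit: fail fast, never retry
                .build();
        return Retry.of("paymentApi", config);
    }
}
```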
5 · Fallback Strategies
When the circuit is open, you must decide what to return to the caller. The choice of fallback strategy depends on the operation’s semantics and consistency requirements.
| Strategy | When to Use | Example | Risk |
|---|---|---|---|
| Cached Response | Read operations with stale-tolerant data | Product catalog, user profile, config | Stale data served; cache may be cold |
| Default Value | Optional enrichment data | Recommendations = empty list; rating = “N/A” | Degraded experience, may confuse users |
| Degraded Service | Partial functionality acceptable | Show products without personalized pricing | Feature loss, potential revenue impact |
| Queue for Retry | Write operations that must eventually succeed | Payment processing, order submission | Delayed processing, eventual consistency |
| Fail Fast with Error | Operations where wrong data is worse than no data | Financial calculations, compliance checks | User sees error, but data integrity preserved |
| Alternative Service | Redundant providers available | Primary CDN down → secondary CDN | Increased cost, configuration complexity |
Fallback Implementation Pattern
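A sketch of a layered fallback (last-known-good cache first, then a safe default); the in-memory cache and the placeholder catalog call are assumptions for illustration:

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

public class CatalogClient {

    private final CircuitBreaker breaker = CircuitBreaker.ofDefaults("catalog");
    private final Map<String, List<String>> cache = new ConcurrentHashMap<>();  // last known good responses

    public List<String> productsFor(String category) {
        Supplier<List<String>> remoteCall =
                CircuitBreaker.decorateSupplier(breaker, () -> fetchFromCatalogService(category));
        try {
            List<String> products = remoteCall.get();
            cache.put(category, products);               // refresh the fallback cache on success
            return products;
        } catch (Exception e) {                           // open circuit or downstream failure
            // Fallback 1: last known good data; fallback 2: safe default (empty list)
            return cache.getOrDefault(category, List.of());
        }
    }

    private List<String> fetchFromCatalogService(String category) {
        // Placeholder for the real HTTP call to the catalog service.
        throw new UnsupportedOperationException("call catalog service here");
    }
}
```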
6 · Bulkhead Pattern
Named after the watertight compartments in a ship’s hull, the bulkhead pattern isolates resources (threads, connections, memory) per downstream dependency. If one dependency fails, only its allocated resources are consumed—the rest of the system continues operating.
🚨 Without Bulkhead
All dependencies share a single thread pool (200 threads). When Service D is slow:
- D consumes all 200 threads
- Services B, C, E starved of threads
- Entire application down
✅ With Bulkhead
Each dependency gets an isolated pool (50 threads each). When Service D is slow:
- D consumes its 50 threads
- B, C, E still have their 50 threads each
- 75% of application still works
Bulkhead Types
| Type | Mechanism | Pros | Cons |
|---|---|---|---|
| Semaphore | Limits concurrent calls (semaphore counter) | Lightweight, no thread overhead | Doesn’t protect against slow calls blocking the caller’s thread |
| Thread Pool | Dedicated thread pool per dependency | Full isolation, timeout per dependency, caller never blocked | Thread overhead, context switching, harder to debug |
Resilience4j Bulkhead Configuration
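A sketch of both bulkhead types in Resilience4j (the dependency name and pool sizes are illustrative):

```java
import io.github.resilience4j.bulkhead.Bulkhead;
import io.github.resilience4j.bulkhead.BulkheadConfig;
import io.github.resilience4j.bulkhead.ThreadPoolBulkhead;
import io.github.resilience4j.bulkhead.ThreadPoolBulkheadConfig;
import java.time.Duration;

public class BulkheadWiring {

    // Semaphore bulkhead: caps concurrent calls, executed on the caller's own threads.
    static Bulkhead semaphoreBulkhead() {
        BulkheadConfig config = BulkheadConfig.custom()
                .maxConcurrentCalls(25)                   // at most 25 in-flight calls to this dependency
                .maxWaitDuration(Duration.ofMillis(100))  // wait briefly for a permit, then reject
                .build();
        return Bulkhead.of("recommendationService", config);
    }

    // Thread-pool bulkhead: full isolation, the caller's thread is never blocked by the dependency.
    static ThreadPoolBulkhead threadPoolBulkhead() {
        ThreadPoolBulkheadConfig config = ThreadPoolBulkheadConfig.custom()
                .coreThreadPoolSize(10)
                .maxThreadPoolSize(50)                    // the dependency can never consume more than 50 threads
                .queueCapacity(100)                       // overflow queues here before being rejected
                .build();
        return ThreadPoolBulkhead.of("recommendationService", config);
    }
}
```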
7 · Retry with Exponential Backoff & Jitter
Retries are essential but dangerous. Naive retries (immediate, fixed interval) create retry storms that amplify load on a struggling service. The solution: exponential backoff with jitter.
Exponential Backoff
Each retry waits exponentially longer than the previous one:
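For example, assuming a 100 ms base delay and a multiplier of 2 (illustrative values), the wait before attempt n is base × 2^(n-1), capped at some maximum. A minimal sketch:

```java
import java.time.Duration;

public class Backoff {
    // Delay before retry attempt n (1-based): base * 2^(n-1), capped at maxDelay.
    static Duration exponentialDelay(int attempt, Duration base, Duration maxDelay) {
        long candidate = base.toMillis() * (1L << (attempt - 1));   // 100, 200, 400, 800, ...
        return Duration.ofMillis(Math.min(candidate, maxDelay.toMillis()));
    }
}
```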
Why Jitter Is Critical
Without jitter, all clients that failed at roughly the same time will retry at exactly the same time, creating synchronized retry waves. Adding randomized jitter spreads retries across the interval:
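A minimal sketch of "full jitter", where the actual wait is drawn uniformly between zero and the exponential delay so that clients spread out instead of retrying in lockstep (Resilience4j ships a similar strategy as `IntervalFunction.ofExponentialRandomBackoff`):

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

public class JitteredBackoff {
    // Full jitter: wait a random duration in [0, exponential delay] before retry attempt n.
    static Duration delayWithJitter(int attempt, Duration base, Duration maxDelay) {
        long exponential = Math.min(base.toMillis() * (1L << (attempt - 1)), maxDelay.toMillis());
        return Duration.ofMillis(ThreadLocalRandom.current().nextLong(exponential + 1));
    }
}
```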
8 · Cascading Failure Prevention
Let’s visualize how the circuit breaker pattern prevents cascading failures across a service chain. Compare the behavior with and without circuit breakers when a downstream dependency fails.
▶ Interactive: Cascading Failure Prevention. Service A → B → C → D chain; D fails. Step through to see the cascading failure without a circuit breaker, then how the circuit breaker stops the domino effect.
9 · Library Comparison
| Library | Language | Status | Features | Notes |
|---|---|---|---|---|
| Netflix Hystrix | Java | 🔴 Deprecated (2018) | Circuit breaker, bulkhead (thread pool), dashboard, metrics | Pioneer of the pattern. Used RxJava internally. Replaced by Resilience4j. |
| Resilience4j | Java / Kotlin | 🟢 Active | Circuit breaker, retry, rate limiter, bulkhead, time limiter, cache | Lightweight, functional, Spring Boot integration. The recommended replacement for Hystrix. |
| Polly | .NET (C#) | 🟢 Active | Circuit breaker, retry, bulkhead, timeout, fallback, hedging | Fluent API, supports async. Part of .NET Foundation. Polly v8+ with ResiliencePipeline. |
| Sentinel | Java | 🟢 Active | Circuit breaker, flow control, concurrency limiting, system load protection | Alibaba project. Dashboard included. Strong in flow-control and system protection. |
| gobreaker | Go | 🟢 Active | Circuit breaker (Sony open-source) | Minimalist, idiomatic Go. No bulkhead or retry—combine with other Go libraries. |
| opossum | Node.js | 🟢 Active | Circuit breaker with events, fallback, health checks | Event-driven API. Prometheus metrics plugin available. |
Polly (.NET) Example
Go (gobreaker) Example
10 · How to Implement
Whether you use a library or build a custom circuit breaker, the core implementation follows this pattern:
Step-by-Step Implementation
- Define your failure criteria. What counts as a failure? HTTP 5xx? Timeouts? Specific exceptions? Business logic errors should not trip the circuit.
- Choose window type. Count-based for simplicity; time-based for accuracy. Start with count-based (window = 20).
- Set initial thresholds conservatively. Start with 50% failure rate, 20-call window, 60-second wait duration. Tune from production data.
- Implement fallbacks for every circuit. No circuit breaker should just throw an exception—always have a degraded response path.
- Add bulkheads alongside circuit breakers. Circuit breakers detect failure; bulkheads contain the blast radius before the circuit trips.
- Place retry outside the circuit breaker. Calls against an open circuit then fail fast with CallNotPermittedException instead of being retried against a known-failed service.
- Emit metrics for every state transition. Without observability, circuit breakers are invisible black boxes.
Custom Implementation Skeleton
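Below is a minimal sketch of the state machine in Java: count-based, tripping on consecutive failures, with coarse synchronization. Real implementations add sliding windows, slow-call detection, finer-grained concurrency, and metrics. Usage would look like `breaker.call(() -> client.fetch(), () -> cachedValue)`.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.function.Supplier;

/** Count-based circuit breaker sketch: consecutive failures trip it, a wait timer
 *  gates the HALF-OPEN probes. Coarse synchronization; illustration only. */
public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;   // consecutive failures that trip the circuit
    private final int halfOpenProbes;     // probe calls allowed while HALF_OPEN
    private final Duration openWait;      // how long to stay OPEN before probing

    private State state = State.CLOSED;
    private int consecutiveFailures = 0;
    private int probesIssued = 0;
    private int probeSuccesses = 0;
    private Instant openedAt;

    public SimpleCircuitBreaker(int failureThreshold, int halfOpenProbes, Duration openWait) {
        this.failureThreshold = failureThreshold;
        this.halfOpenProbes = halfOpenProbes;
        this.openWait = openWait;
    }

    public synchronized <T> T call(Supplier<T> action, Supplier<T> fallback) {
        if (state == State.OPEN) {
            if (Instant.now().isAfter(openedAt.plus(openWait))) {
                state = State.HALF_OPEN;          // wait timer expired: start probing
                probesIssued = 0;
                probeSuccesses = 0;
            } else {
                return fallback.get();            // fail fast while OPEN
            }
        }
        if (state == State.HALF_OPEN && probesIssued >= halfOpenProbes) {
            return fallback.get();                // probe budget used up for this cycle
        }
        if (state == State.HALF_OPEN) {
            probesIssued++;
        }
        try {
            T result = action.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {            // "failure" here = any runtime exception
            onFailure();
            return fallback.get();
        }
    }

    private void onSuccess() {
        if (state == State.HALF_OPEN) {
            if (++probeSuccesses >= halfOpenProbes) {   // all probes succeeded: recover
                state = State.CLOSED;
                consecutiveFailures = 0;
            }
        } else {
            consecutiveFailures = 0;
        }
    }

    private void onFailure() {
        if (state == State.HALF_OPEN || ++consecutiveFailures >= failureThreshold) {
            state = State.OPEN;                         // trip (or re-open) and restart the wait timer
            openedAt = Instant.now();
        }
    }
}
```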
11 · Monitoring Circuit Breaker State
In production, you must have comprehensive monitoring of circuit breaker behavior. Without it, you’ll never know if your circuits are tuned correctly or if they’re silently degrading service quality.
Key Metrics to Track
- State transitions
- Rejection rate
- Fallback rate
- Recovery time
Prometheus + Micrometer Metrics
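With Spring Boot, circuit breaker metrics are published automatically once Actuator and Micrometer are on the classpath. In plain Java the binding looks roughly like this (a sketch; the circuit name is a placeholder and exact Prometheus metric names can vary by Resilience4j version):

```java
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;
import io.github.resilience4j.micrometer.tagged.TaggedCircuitBreakerMetrics;
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.simple.SimpleMeterRegistry;

public class CircuitBreakerMetrics {
    public static void main(String[] args) {
        CircuitBreakerRegistry breakers = CircuitBreakerRegistry.ofDefaults();
        MeterRegistry meters = new SimpleMeterRegistry();   // use a Prometheus registry in production

        // Publishes gauges/timers for circuit state, call counts by outcome,
        // failure rate, slow-call rate, and not-permitted (rejected) calls.
        TaggedCircuitBreakerMetrics.ofCircuitBreakerRegistry(breakers).bindTo(meters);

        // Log every state transition with context, per the best practices below.
        CircuitBreaker payment = breakers.circuitBreaker("paymentApi");
        payment.getEventPublisher().onStateTransition(event ->
                System.out.printf("circuit %s: %s%n",
                        event.getCircuitBreakerName(), event.getStateTransition()));
    }
}
```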
Grafana Dashboard Panels
A production circuit breaker dashboard should include these panels:
- Circuit State Timeline: State band chart showing CLOSED/OPEN/HALF-OPEN transitions over time for all services
- Failure Rate vs Threshold: Line chart of `failure_rate` with a horizontal threshold line. Alert when rate approaches threshold.
- Rejection Rate: Percentage of requests rejected (not permitted) per service. Non-zero = circuit was open.
- Fallback Invocations: Count of fallback calls per service per minute. Trends show service reliability.
- Recovery Time Distribution: Histogram of time spent in OPEN state before successful recovery.
- Call Duration Percentiles: p50, p95, p99 latency. Expect sudden drops when circuit opens (fast-fail is fast).
Alerting Rules
Alert on every transition to OPEN, on circuits that stay open longer than expected, and on flapping circuits (rapid OPEN ↔ CLOSED cycles), which usually indicate a mis-tuned `waitDurationInOpenState` or `minimumNumberOfCalls`.
12 · Real-World: Netflix Circuit Breakers
Netflix pioneered the circuit breaker pattern in microservices at scale, creating the Hystrix library (now deprecated but historically foundational). Their experience during AWS outages provides the definitive case study.
Netflix’s Architecture Context
- ~1,000+ microservices in production (as of their Hystrix era)
- ~10 billion Hystrix executions per day across the fleet
- >99.99% upstream availability target even when individual services fail
- Every API call between services was wrapped in a `HystrixCommand`
The 2011 AWS US-East-1 Outage
In April 2011, AWS experienced a major EBS (Elastic Block Store) outage in the US-East-1 region. Multiple Netflix dependencies became unresponsive:
- Personalization service: Could not read user preferences from EBS-backed databases
- Bookmark service: Could not retrieve “continue watching” positions
- Rating service: Could not load user ratings
Without circuit breakers, the Netflix API would have blocked on all three services, exhausted its threads, and returned 503 errors to every device worldwide.
With Hystrix circuit breakers, what actually happened:
- Personalization circuit opened → returned generic (non-personalized) recommendations
- Bookmark circuit opened → “continue watching” row hidden from UI
- Rating circuit opened → ratings shown as “Not Available”
- Streaming continued to work. Users could still browse, search, and watch content.
Netflix’s Configuration Philosophy
| Parameter | Netflix Default | Rationale |
|---|---|---|
| Thread pool size per service | 10 threads | Small pools force bulkheading; 100 dependencies × 10 threads = 1,000 total |
| Request timeout | 1,000ms | If a dependency can’t respond in 1s, something is wrong |
| Error threshold | 50% of 20 calls | Balanced between sensitivity and stability |
| Sleep window (wait duration) | 5,000ms | Aggressively short—try to recover quickly |
| Metrics rolling window | 10 seconds | Recent failures matter more than old ones |
Key Takeaways from Netflix
- Every external call gets a circuit breaker. No exceptions. If it crosses a network boundary, it gets a circuit breaker.
- Thread pool isolation (bulkhead) was equally important. The thread pool per dependency was what actually prevented cascading failure before the circuit tripped.
- Fallbacks must be designed upfront. You cannot retrofit meaningful fallbacks during an outage. They must be part of the original design.
- Dashboard visibility is critical. The Hystrix Dashboard provided real-time visualization of circuit state across hundreds of services. Without this, operators could not make informed decisions.
- Test in production with fault injection. Netflix’s Chaos Engineering (Chaos Monkey, Chaos Kong) regularly injects failures to validate that circuit breakers and fallbacks work correctly.
From Hystrix to Resilience4j
Netflix deprecated Hystrix in 2018, recommending Resilience4j as the replacement. Key reasons:
- Hystrix required RxJava, adding significant complexity
- Thread pool isolation added overhead for every call (context switch)
- Resilience4j uses a simpler functional approach with decorators
- Resilience4j supports both semaphore and thread pool bulkheads (choose per use case)
- Better integration with Spring Boot, Micrometer, and modern Java features
13 · Best Practices & Anti-Patterns
✅ Do
- Circuit break every cross-network call
- Design fallbacks at architecture time, not incident time
- Use bulkheads alongside circuit breakers
- Monitor state transitions and alert on opens
- Use exponential backoff with jitter for retries
- Test circuit breakers with fault injection regularly
- Set `minimumNumberOfCalls` to avoid false trips
- Log every state change with context (which dependency, what failure)
❌ Don’t
- Don’t circuit break on business logic errors (validation failures)
- Don’t set wait duration too short (retry storms)
- Don’t forget to tune thresholds from production data
- Don’t use a single shared thread pool for all dependencies
- Don’t retry without backoff and jitter
- Don’t ignore flapping circuits—they indicate misconfiguration
- Don’t treat circuit breakers as a replacement for proper error handling
- Don’t forget health checks—circuit breakers protect callers but don’t fix the root cause