Heartbeat, Health Checks & Failure Detection
In a single-machine program, detecting failure is trivial: the process either runs or it doesn’t. The operating system tells you. But in a distributed system, you have no shared memory, no shared clock, and no shared fate. When Node B stops responding to Node A, is it crashed? Is the network down? Is it simply slow under load? This ambiguity sits at the heart of every distributed systems problem, and failure detection is the mechanism we build to manage it.
This post covers the full spectrum: from simple heartbeats to probabilistic failure detectors, from Kubernetes health probes to gossip-based protocols like SWIM, and from graceful degradation patterns to the terrifying split-brain problem. Every section includes real-world configuration examples you can drop into production.
The Failure Detection Problem
Before we build solutions, we need to understand why failure detection is fundamentally hard in distributed systems. The difficulty is not an engineering shortcoming—it is rooted in a mathematical impossibility result proven in 1985.
Why You Can’t Tell Dead from Slow
Imagine Node A sends a request to Node B and waits for a response. After 10 seconds of silence, Node A faces a decision:
- Node B crashed — The process is dead. No response will ever come.
- The network is partitioned — Node B is alive and well, but packets can’t reach Node A. From B’s perspective, it’s A that disappeared.
- Node B is slow — A long garbage collection pause, heavy disk I/O, or CPU saturation is causing delays. The response will arrive eventually—in 30 seconds, 2 minutes, or 10 minutes.
- The response was lost — Node B processed the request and sent a reply, but the return packet was dropped by a flaky switch.
From Node A’s perspective, all four scenarios look identical: silence. There is no way to distinguish them without additional information. This is the core challenge.
FLP Impossibility
The Fischer-Lynch-Paterson (FLP) impossibility result (1985) proves that in an asynchronous system where even a single process can crash, no deterministic algorithm can guarantee consensus. The key insight is that if messages can be delayed arbitrarily (which they can in any real network), you cannot distinguish a crashed process from a very slow one.
Formally: in an asynchronous model with reliable channels, there is no deterministic protocol that solves consensus if even one process may fail by crashing. This means any practical failure detector must be probabilistic or make timing assumptions.
Two Types of Mistakes
| Mistake | Description | Consequence |
|---|---|---|
| False positive | A healthy node is declared dead | Unnecessary failover, data re-replication, load spike on remaining nodes, potential split brain |
| False negative | A dead node is considered alive | Requests routed to a dead node, timeouts, user-facing errors, stale data served |
Every failure detector trades off between these two. A shorter timeout catches failures faster (fewer false negatives) but triggers more false positives. A longer timeout reduces false positives but delays detection. There is no perfect setting—only the right tradeoff for your system.
Heartbeat Mechanism
The heartbeat is the simplest and most widely used failure detection mechanism. The idea is primitive but effective: a node periodically sends a small “I’m alive” message. If the monitor doesn’t receive heartbeats for a configurable duration, it suspects the node has failed.
Push vs. Pull Heartbeats
| Model | How It Works | Pros | Cons |
|---|---|---|---|
| Push (active) | Each node periodically sends heartbeat messages to a monitor or to peers. The monitor passively listens. | Simple to implement. Node controls the pace. Low latency for detection. | N nodes → N messages per interval. Monitor can be overwhelmed. |
| Pull (polling) | A central monitor periodically pings each node and expects a response within a timeout. | Monitor controls the load. Can stagger checks. Works through firewalls (outbound only). | Monitor is SPOF. Polling interval limits detection speed. Higher latency. |
| Hybrid | Nodes push heartbeats, but the monitor also pulls if it hasn’t heard from a node recently. | Best of both worlds. Redundant detection paths. | More complex to implement correctly. |
Heartbeat Parameters
A typical heartbeat system has three configurable parameters:
- Interval (T) — How often heartbeats are sent. Common values: 1–10 seconds. Shorter intervals detect failures faster but generate more network traffic. At 1,000 nodes with a 1-second interval, that’s 1,000 packets/sec just for heartbeats.
- Timeout (k × T) — How many missed heartbeats before suspicion. Typically k = 3 to 5. With T = 5s and k = 3, a node is suspected after 15 seconds of silence.
- Suspicion vs. Confirmation — Many systems have a two-stage process: first the node is suspected, then after further verification (or a longer timeout), it is confirmed dead. This reduces false positives from transient network issues.
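These parameters map directly onto a small monitoring loop. The sketch below is illustrative only (the interval, multiplier, and node IDs are placeholder assumptions, not tied to any particular system): it records the last heartbeat time per node and suspects any node that has been silent longer than k × T.
# Minimal push-model heartbeat monitor (illustrative sketch)
import time

INTERVAL_S = 5     # T: expected heartbeat interval
K = 3              # missed heartbeats before suspicion

last_seen = {}     # node_id -> timestamp of most recent heartbeat
suspected = set()

def on_heartbeat(node_id):
    """Called whenever a heartbeat message arrives from a node."""
    last_seen[node_id] = time.time()
    suspected.discard(node_id)   # any message clears the suspicion

def check_timeouts():
    """Run periodically (e.g., once per second) on the monitor."""
    now = time.time()
    for node_id, ts in last_seen.items():
        if node_id not in suspected and now - ts > K * INTERVAL_S:
            suspected.add(node_id)
            print(f"{node_id} suspected: silent for {now - ts:.1f}s")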
Cassandra Heartbeat Configuration (cassandra.yaml)
# Phi accrual suspicion threshold before marking a node DOWN
# Default: 8
phi_convict_threshold: 8
# Gossip interval in milliseconds
# Each node gossips to 1-3 random peers per interval
gossip_interval_in_ms: 1000
# How long to wait for a node to be considered dead
# after being marked DOWN (allows time for flapping)
dead_retry_interval_in_ms: 30000
# RPC timeout for inter-node communication
rpc_timeout_in_ms: 10000
# Cross-datacenter latency can be higher
cross_dc_rtt_in_ms: 100
ZooKeeper Session Heartbeat (zoo.cfg)
# Tick time is the basic unit of time (milliseconds)
# Heartbeats sent every tickTime
tickTime=2000
# Number of ticks for initial sync (followers → leader)
initLimit=10
# Number of ticks a follower may lag behind the leader
# before it is dropped (syncLimit * tickTime = 10s here)
syncLimit=5
# Client session timeout range
minSessionTimeout=4000 # 2 * tickTime
maxSessionTimeout=40000 # 20 * tickTime
Heartbeat Message Contents
A heartbeat is not just a ping. Production heartbeats carry useful metadata:
{
"node_id": "node-17",
"timestamp": 1714502400000,
"generation": 42, // incremented on restart
"version": "3.11.4",
"load": {
"cpu_percent": 34.2,
"memory_percent": 67.8,
"disk_free_gb": 128.4,
"active_connections": 1247
},
"status": "NORMAL", // NORMAL | JOINING | LEAVING | DRAINING
"tokens": ["...", "..."] // consistent hash ring tokens
}
This piggybacking approach is used by Cassandra, Consul, and etcd. Instead of sending separate monitoring messages, the heartbeat itself carries the information needed for load balancing, capacity planning, and cluster membership decisions.
Phi Accrual Failure Detector
Simple heartbeat-based detection uses a fixed timeout: if no heartbeat arrives within T seconds, the node is dead. This is crude. Network latency varies, GC pauses happen, and a fixed threshold that works at 2 AM may trigger false positives during peak traffic at 2 PM.
The Phi (φ) Accrual Failure Detector, introduced by Hayashibara et al. (2004), takes a fundamentally different approach. Instead of a binary “alive or dead” decision, it computes a continuous suspicion level (the φ value) that represents how confident the detector is that the node has failed.
How It Works
Collect Heartbeat Arrival Times
Maintain a sliding window of the last N heartbeat inter-arrival times. For example, if heartbeats arrive at times 0, 1.02, 2.01, 3.05, 4.00, the inter-arrival times are [1.02, 0.99, 1.04, 0.95].
Model the Distribution
Assume the inter-arrival times follow a normal distribution (Gaussian). Compute the sample mean (μ) and sample variance (σ²) from the sliding window. Cassandra uses a window of the last 1,000 heartbeat intervals.
Compute φ
When the current time is t_now and the last heartbeat arrived at t_last, compute how late the heartbeat is: $\Delta = t_{now} - t_{last}$. Then compute φ using the complementary CDF of the normal distribution:
$$\phi = -\log_{10}\left(1 - F\left(\Delta \;\middle|\; \mu, \sigma^2\right)\right)$$
where $F$ is the cumulative distribution function of the normal distribution with mean $\mu$ and variance $\sigma^2$. Intuitively, φ represents the negative log-probability that a heartbeat is this late by chance.
Compare Against Threshold
If $\phi \geq \phi_{threshold}$, suspect the node. The threshold is configurable:
- $\phi = 1$ → 10% chance this is a false positive
- $\phi = 2$ → 1% chance
- $\phi = 3$ → 0.1% chance
- $\phi = 8$ → 0.000001% chance (Cassandra default)
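The computation described above fits in a few lines once you keep a sliding window of inter-arrival times. Here is a minimal sketch under the same normal-distribution assumption (the window size and variance floor are illustrative choices, not Cassandra's or Akka's actual code):
# Phi accrual suspicion level (illustrative sketch)
import math
from collections import deque
from statistics import mean, pstdev

WINDOW = 1000
intervals = deque(maxlen=WINDOW)   # observed heartbeat inter-arrival times
last_arrival = None

def record_heartbeat(now):
    """Append the latest inter-arrival time to the sliding window."""
    global last_arrival
    if last_arrival is not None:
        intervals.append(now - last_arrival)
    last_arrival = now

def phi(now):
    """-log10 of the probability that a heartbeat is at least this late."""
    if not intervals:
        return 0.0
    mu = mean(intervals)
    sigma = max(pstdev(intervals), 1e-3)          # floor avoids zero variance
    delta = now - last_arrival
    # P(heartbeat later than delta) = 1 - F(delta | mu, sigma^2)
    p_later = 0.5 * math.erfc((delta - mu) / (sigma * math.sqrt(2)))
    return -math.log10(max(p_later, 1e-15))       # cap to avoid log(0)
A monitor would call record_heartbeat() on every arrival and compare phi(time.time()) against the configured threshold (8 by default in Cassandra).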
Why Phi Accrual Is Superior
| Property | Fixed Timeout | Phi Accrual |
|---|---|---|
| Adapts to network conditions | ❌ No. Must be manually tuned. | ✅ Yes. Automatically adjusts based on observed latency. |
| Handles GC pauses | ❌ Causes false positives during long pauses. | ✅ If pauses are occasional, the distribution absorbs them. |
| Cross-datacenter | ❌ Need different timeouts per DC. | ✅ Automatically adapts to higher latency. |
| Configurable accuracy | Binary: alive or dead. | Continuous: φ = 1 (low suspicion) to φ = 20+ (certain failure). |
| Used by | ZooKeeper, Redis Sentinel, most custom systems | Cassandra, Akka, Hazelcast |
Cassandra Phi Accrual Configuration
# phi_convict_threshold: suspicion threshold
# Default: 8 (99.999999% confidence before marking DOWN)
# Lower values (5-6) = faster detection, more false positives
# Higher values (10-12) = slower detection, fewer false positives
# Recommended: 8 for same-DC, 12 for cross-DC
phi_convict_threshold: 8
# The window size for tracking heartbeat arrivals
# Larger window = more stable estimates, slower to adapt
# Default: 1000 samples
phi_convict_max_window_size: 1000
Akka Phi Accrual Configuration (application.conf)
akka {
remote {
watch-failure-detector {
# Threshold for phi accrual failure detector
threshold = 10.0
# Number of heartbeat samples to use for calculation
max-sample-size = 200
# Minimum standard deviation (avoids 0-variance edge case)
min-std-deviation = 100 ms
# How often heartbeat request is sent
heartbeat-interval = 1 s
# Number of potentially lost/delayed heartbeats before
# considering it an anomaly (adjusts the first heartbeat
# timing)
acceptable-heartbeat-pause = 3 s
# How often to check for nodes marked unreachable
unreachable-nodes-reaper-interval = 1s
}
}
}
Health Check Patterns
Heartbeats tell you whether a node is alive. But “alive” is not the same as “useful.” A process can be running but unable to serve traffic because it’s still loading data, its database connection pool is exhausted, or a dependent service is down. Health checks solve this by probing the functional state of a service.
Three Types of Health Checks
| Check Type | Question It Answers | When It Fails | Response |
|---|---|---|---|
| Liveness | Is the process running and not stuck? | Deadlock, infinite loop, OOM but process not killed | Restart the container/process |
| Readiness | Can it accept and serve traffic right now? | Still warming cache, DB connection down, rate limited | Remove from load balancer (don’t restart) |
| Startup | Has initial boot completed? | Still loading large dataset, running migrations | Don’t check liveness until startup succeeds |
Kubernetes Probe Types
Kubernetes supports four probe mechanisms:
- HTTP GET — Sends an HTTP GET request to a specified path. Any status code ≥ 200 and < 400 means success.
- TCP Socket — Attempts to open a TCP connection to a specified port. If the connection succeeds, the probe passes.
- Exec — Runs an arbitrary command inside the container. Exit code 0 = healthy, non-zero = unhealthy.
- gRPC — (Kubernetes 1.24+) Calls the gRPC health checking protocol on a specified port.
Complete Kubernetes Health Probe Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 3
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: myregistry/order-service:v2.4.1
ports:
- containerPort: 8080
# STARTUP PROBE: wait for initialization
# Allows up to 5 min (30 * 10s) for startup
startupProbe:
httpGet:
path: /healthz/startup
port: 8080
failureThreshold: 30
periodSeconds: 10
# LIVENESS PROBE: detect deadlocks / stuck processes
# Restarts pod if 3 consecutive checks fail
livenessProbe:
httpGet:
path: /healthz/live
port: 8080
initialDelaySeconds: 0 # startup probe handles delay
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
# READINESS PROBE: control traffic routing
# Removes from Service endpoints if unhealthy
readinessProbe:
httpGet:
path: /healthz/ready
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
successThreshold: 2
Implementing Health Endpoints
A production-grade health endpoint doesn’t just return 200 OK. It checks dependencies and reports detailed status:
// Express.js health endpoint implementation
app.get('/healthz/live', (req, res) => {
// Liveness: only checks if the process is responsive
// Do NOT check external dependencies here
res.status(200).json({ status: 'alive', uptime: process.uptime() });
});
app.get('/healthz/ready', async (req, res) => {
const checks = {
database: await checkDatabase(),
redis: await checkRedis(),
kafka: await checkKafka(),
};
const allHealthy = Object.values(checks).every(c => c.healthy);
const status = allHealthy ? 200 : 503;
res.status(status).json({
status: allHealthy ? 'ready' : 'not_ready',
checks,
timestamp: new Date().toISOString(),
});
});
app.get('/healthz/startup', async (req, res) => {
if (!cacheWarmed || !migrationsComplete) {
return res.status(503).json({ status: 'starting' });
}
res.status(200).json({ status: 'started' });
});
async function checkDatabase() {
try {
const start = Date.now();
await db.query('SELECT 1');
return { healthy: true, latency_ms: Date.now() - start };
} catch (err) {
return { healthy: false, error: err.message };
}
}
async function checkRedis() {
try {
const start = Date.now();
await redis.ping();
return { healthy: true, latency_ms: Date.now() - start };
} catch (err) {
return { healthy: false, error: err.message };
}
}
Health Check Anti-Patterns
- Cascading health failures: Service A’s readiness checks Service B, which checks Service C. If C goes down, A and B both go unready—cascading outage. Solution: only check direct dependencies, and use circuit breakers.
- Expensive health checks: Running a complex query or loading large data in the health endpoint. Solution: cache the result and update it in a background goroutine.
- No timeout on probes: Kubernetes default timeout is 1 second. If your database check takes 5 seconds, the probe always fails. Always set timeoutSeconds appropriately.
- Checking non-critical dependencies: Your /ready endpoint checks the email service. Email is down, so your API stops serving—even though 99% of requests don’t need email. Solution: only check dependencies critical to the majority of traffic.
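A common way to avoid the expensive-health-check trap above is to run the dependency checks on a timer and have the endpoint return the cached result. A minimal sketch, assuming a placeholder check function and refresh period:
# Background-refreshed readiness status (illustrative sketch)
import threading
import time

_status = {"healthy": False, "checked_at": 0.0}

def _expensive_dependency_check():
    # Placeholder: replace with a real, bounded-time dependency probe
    return True

def _refresh_loop(period_s=10):
    while True:
        healthy = _expensive_dependency_check()
        _status.update(healthy=healthy, checked_at=time.time())
        time.sleep(period_s)

threading.Thread(target=_refresh_loop, daemon=True).start()

def ready_handler():
    """The readiness endpoint returns the cached result instantly."""
    code = 200 if _status["healthy"] else 503
    return code, dict(_status)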
Gossip-Based Failure Detection
Centralized heartbeat monitoring has a fundamental problem: the monitor is a single point of failure, and at scale (thousands of nodes), it becomes a bottleneck. Gossip-based protocols solve this by making failure detection a collective, decentralized activity.
How Gossip Works
Each node maintains a membership list—a table of all known nodes and their heartbeat counters. Periodically (e.g., every second), each node:
- Increments its own heartbeat counter.
- Selects a random peer from its membership list.
- Sends its full membership list to that peer.
- Merges the received list with its own, keeping the highest heartbeat counter for each node.
If a node’s heartbeat counter hasn’t been incremented within a timeout period, it is suspected as failed. This is “epidemic-style” dissemination: information spreads like a rumor, reaching all $N$ nodes in $O(\log N)$ gossip rounds.
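The heart of the protocol is the merge step: for every known node, keep the highest heartbeat counter seen so far and remember when it last advanced. A minimal sketch of one gossip round (message transport is abstracted away; the names are illustrative):
# One gossip round: bump own counter, merge a peer's view (sketch)
import random
import time

SELF = "node-a"
FAIL_TIMEOUT_S = 10
# node_id -> {"heartbeat": counter, "last_updated": local wall-clock time}
membership = {SELF: {"heartbeat": 0, "last_updated": time.time()}}

def gossip_round(send, peers):
    """Increment our own counter and push our view to one random peer."""
    membership[SELF]["heartbeat"] += 1
    membership[SELF]["last_updated"] = time.time()
    send(random.choice(peers), membership)

def merge(remote_view):
    """Keep the highest heartbeat counter seen for each node."""
    now = time.time()
    for node, remote in remote_view.items():
        local = membership.get(node)
        if local is None or remote["heartbeat"] > local["heartbeat"]:
            membership[node] = {"heartbeat": remote["heartbeat"],
                                "last_updated": now}

def suspected():
    """Nodes whose counter has not advanced within the timeout."""
    cutoff = time.time() - FAIL_TIMEOUT_S
    return [n for n, entry in membership.items()
            if entry["last_updated"] < cutoff]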
The SWIM Protocol
SWIM: Scalable Weakly-consistent Infection-style Membership
SWIM (Das et al., 2002) is the gold standard for gossip-based failure detection. Used by Consul (HashiCorp), Memberlist (Go library), and Serf. It improves basic gossip in three crucial ways:
Ping
Each protocol period, node $M_i$ selects a random target $M_j$ and sends a ping. If $M_j$ responds with an ack within a timeout, $M_j$ is alive. Done.
Ping-Req
If $M_j$ doesn’t respond to the direct ping, $M_i$ does not immediately suspect $M_j$. Instead, $M_i$ selects $k$ random nodes (typically $k = 3$) and asks them to ping $M_j$ on its behalf (ping-req). If any of the $k$ nodes gets an ack from $M_j$, $M_j$ is alive. This handles the case where only the $M_i \leftrightarrow M_j$ path is broken.
Suspicion Sub-Protocol
If neither the direct ping nor any indirect ping gets an ack, $M_j$ is marked SUSPECT. The suspicion is disseminated via gossip (piggybacked on regular protocol messages). $M_j$ remains in suspect state for a configurable timeout. If $M_j$ sends any message during this window, it refutes the suspicion and returns to ALIVE. If the timeout expires without refutation, $M_j$ is marked CONFIRM (dead) and removed from the membership list.
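Putting the three mechanisms together, one protocol period for node $M_i$ looks roughly like the sketch below. It assumes ping() and ping_req() helpers that return True when an ack arrives within the timeout; this is a simplification, not memberlist's actual implementation.
# One SWIM protocol period for node M_i (illustrative sketch)
import random

K_INDIRECT = 3   # helpers asked to probe the target on our behalf

def protocol_period(target, members, ping, ping_req, suspect):
    """ping/ping_req return True if an ack arrived within the timeout."""
    if ping(target):
        return                              # direct ack: target is alive
    candidates = [m for m in members if m != target]
    helpers = random.sample(candidates, min(K_INDIRECT, len(candidates)))
    if any(ping_req(helper, target) for helper in helpers):
        return                              # some path to the target works
    suspect(target)                         # start the suspicion sub-protocol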
SWIM vs. Traditional Heartbeat
| Property | Central Heartbeat | SWIM |
|---|---|---|
| Message complexity per period | $O(N)$ — every node → monitor | $O(1)$ per node — each node sends one ping + up to $k$ ping-reqs |
| False positive rate | Depends on single path quality | Lower: $k+1$ independent paths must all fail |
| Detection time | $O(1)$ protocol periods (fast for monitor) | $O(1)$ protocol periods per node, $O(\log N)$ for cluster-wide dissemination |
| SPOF | Yes — the monitor | No — fully decentralized |
| Network load | All traffic to one node — fan-in bottleneck | Distributed evenly across all nodes |
Consul (Memberlist/SWIM) Configuration
# /etc/consul.d/consul.hcl
# SWIM protocol configuration
gossip_lan {
# Interval between protocol rounds
gossip_interval = "200ms"
# Number of random nodes to gossip to
gossip_nodes = 3
# How often a random node is probed, and how long to wait
# for a direct ack before falling back to indirect probes
probe_interval = "1s"
probe_timeout = "500ms"
# Suspicion multiplier:
# suspicion_timeout = suspicion_mult * log(N) * probe_interval
suspicion_mult = 4
# How many nodes to ask for indirect probes
indirect_checks = 3
# Number of retransmissions for protocol messages
retransmit_mult = 4
}
# WAN gossip (cross-datacenter, higher latency)
gossip_wan {
gossip_interval = "500ms"
probe_interval = "3s"
probe_timeout = "2s"
suspicion_mult = 6
}
Gossip Protocol Optimizations
- Piggybacking: Instead of sending separate protocol messages, SWIM piggybacks membership updates on top of ping/ack messages. This is why SWIM achieves $O(1)$ message overhead per node per period—no extra messages needed for dissemination.
- Infection-style dissemination: Each update has a retransmission counter that starts at $\lambda \cdot \log N$ (where $\lambda$ is the retransmit multiplier). Each time the update is piggybacked on a message, the counter decrements. When it reaches zero, the update is dropped. This guarantees $O(\log N)$ rounds for convergence while bounding message size.
- Suspicion with incarnation numbers: Each node has an incarnation number that it increments to refute false suspicions. If Node A suspects Node B, B can send a message with an incremented incarnation number to override the suspicion. This prevents the “accusation storm” problem.
- Round-robin target selection: Instead of truly random target selection (which can miss some nodes for many rounds), SWIM uses a round-robin-shuffled list to ensure every node is probed within $N$ protocol periods. This provides a worst-case detection bound.
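The incarnation rule above can be written as a small precedence check: for the same incarnation, SUSPECT outranks ALIVE, and a higher incarnation (which only the node itself can produce) outranks anything older. A minimal sketch with illustrative field names:
# SWIM-style update precedence with incarnation numbers (sketch)
from dataclasses import dataclass

@dataclass
class MemberState:
    status: str        # "ALIVE" or "SUSPECT"
    incarnation: int

RANK = {"ALIVE": 0, "SUSPECT": 1}

def should_apply(current: MemberState, update: MemberState) -> bool:
    """Apply an incoming gossip update only if it outranks the local state."""
    if update.incarnation != current.incarnation:
        return update.incarnation > current.incarnation
    return RANK[update.status] > RANK[current.status]

def refute(self_state: MemberState) -> MemberState:
    """A live node refutes a suspicion by bumping its own incarnation."""
    return MemberState("ALIVE", self_state.incarnation + 1)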
Graceful Degradation
Detecting failure is only half the battle. What you do after detection determines whether the failure cascades into a system-wide outage or is absorbed gracefully. This section covers the four key patterns for handling detected failures.
Circuit Breaker Pattern
Inspired by electrical circuit breakers, this pattern prevents a service from repeatedly calling a failing dependency:
| State | Behavior | Transition |
|---|---|---|
| CLOSED | All requests pass through normally. Failures are counted. | If failure count ≥ threshold within window → OPEN |
| OPEN | All requests immediately fail without calling the dependency. Returns fallback response or error. | After timeout period → HALF-OPEN |
| HALF-OPEN | Allow one (or a few) test requests through to the dependency. | If test succeeds → CLOSED. If test fails → OPEN (reset timer). |
# Python circuit breaker implementation
import time
from enum import Enum
from threading import Lock
class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""
    pass
class CircuitBreaker:
def __init__(self, failure_threshold=5, recovery_timeout=30,
half_open_max_calls=3):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_max_calls = half_open_max_calls
self.state = State.CLOSED
self.failure_count = 0
self.last_failure_time = None
self.half_open_calls = 0
self.lock = Lock()
def call(self, func, *args, fallback=None, **kwargs):
with self.lock:
if self.state == State.OPEN:
if self._should_attempt_reset():
self.state = State.HALF_OPEN
self.half_open_calls = 0
else:
if fallback:
return fallback()
raise CircuitOpenError(
f"Circuit open since {self.last_failure_time}")
if self.state == State.HALF_OPEN:
if self.half_open_calls >= self.half_open_max_calls:
if fallback:
return fallback()
raise CircuitOpenError("Half-open call limit reached")
self.half_open_calls += 1
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
if fallback:
return fallback()
raise
def _on_success(self):
with self.lock:
self.failure_count = 0
self.state = State.CLOSED
    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed probe while HALF_OPEN re-opens the circuit immediately
            if (self.state == State.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = State.OPEN
def _should_attempt_reset(self):
return (time.time() - self.last_failure_time
>= self.recovery_timeout)
Retry with Exponential Backoff
When a call fails, retrying immediately is usually worse than not retrying at all—it adds load to an already-struggling system. Exponential backoff with jitter spreads retries over time:
import random
import time
def retry_with_backoff(func, max_retries=5, base_delay=1.0,
max_delay=60.0, jitter=True):
"""
Retry with exponential backoff and full jitter.
Delay = random(0, min(max_delay, base_delay * 2^attempt))
"""
for attempt in range(max_retries + 1):
try:
return func()
except Exception as e:
if attempt == max_retries:
raise # exhausted retries
delay = min(max_delay, base_delay * (2 ** attempt))
if jitter:
# Full jitter: uniform random between 0 and delay
delay = random.uniform(0, delay)
print(f"Attempt {attempt+1} failed: {e}. "
f"Retrying in {delay:.2f}s...")
time.sleep(delay)
Bulkhead Pattern
Named after the watertight compartments in a ship: if one compartment floods, the others stay dry. In software, the bulkhead pattern isolates failures by partitioning resources:
- Thread pool isolation: Each downstream dependency gets its own thread pool. If the payment service is slow and exhausts its 20 threads, the inventory service still has its own 20 threads and keeps working.
- Connection pool isolation: Separate database connection pools for critical (checkout) and non-critical (analytics) queries. If analytics runs a slow query and exhausts its pool, checkout still works.
- Process isolation: Run different services in separate containers/processes. A memory leak in one service can’t crash another.
- Sidecar isolation: In service meshes, the proxy sidecar handles retries, circuit breaking, and rate limiting outside the application process.
# Java-style bulkhead with semaphore (conceptual Python)
import requests
from threading import Semaphore
from functools import wraps

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free capacity."""
    pass
def bulkhead(max_concurrent, name="default"):
"""Limit concurrent calls to a dependency."""
sem = Semaphore(max_concurrent)
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
acquired = sem.acquire(timeout=5)
if not acquired:
raise BulkheadFullError(
f"Bulkhead '{name}' full: "
f"{max_concurrent} concurrent calls")
try:
return func(*args, **kwargs)
finally:
sem.release()
return wrapper
return decorator
@bulkhead(max_concurrent=20, name="payment-service")
def call_payment_service(order_id, amount):
return requests.post(
"https://payment.internal/charge",
json={"order_id": order_id, "amount": amount},
timeout=5
)
Fallback Strategies
| Strategy | Description | Example |
|---|---|---|
| Cached response | Return the last known good response from cache | Product catalog returns stale (5-min old) prices when price service is down |
| Default value | Return a safe, hardcoded default | Recommendation service down → show “Popular Products” instead |
| Graceful feature removal | Disable the failing feature, keep rest working | Review service down → hide reviews section, product page still works |
| Queue for later | Write the request to a queue and process when healthy | Email service down → enqueue confirmation email for retry |
| Alternate dependency | Route to a backup service or different region | Primary payment processor down → switch to secondary |
Split Brain
Split brain is the most dangerous failure mode in distributed systems. It occurs when a network partition divides a cluster into two (or more) groups, each believing the other is dead. Both sides continue operating independently, potentially accepting conflicting writes, electing competing leaders, and corrupting data.
How Split Brain Happens
Consider a 5-node database cluster with Node 1 as the leader:
- A network switch failure partitions the cluster: Nodes {1, 2} on one side, Nodes {3, 4, 5} on the other.
- Nodes {3, 4, 5} stop receiving heartbeats from Node 1. Their failure detector marks Node 1 as dead.
- Nodes {3, 4, 5} elect Node 3 as the new leader and start accepting writes.
- Meanwhile, Nodes {1, 2} are still running. Node 1 thinks Nodes 3, 4, 5 are dead but continues serving as leader.
- Now you have two leaders accepting writes simultaneously. Data diverges. When the partition heals, you have conflicting state that may be impossible to reconcile.
Prevention Strategies
1. Quorum-Based Decisions
The most fundamental defense: require a majority (quorum) to make any decision. In a cluster of $N$ nodes, a quorum is $\lfloor N/2 \rfloor + 1$ nodes.
- With 5 nodes, quorum = 3. If the partition splits {2} vs {3}, only the side with 3 nodes can elect a leader.
- With 3 nodes, quorum = 2. Tolerates 1 node failure.
- Why odd numbers: With 4 nodes, quorum = 3. A 2-2 split means neither side has quorum—the whole system stops. With 5 nodes, a 2-3 split has one side with quorum. This is why consensus systems (etcd, ZooKeeper, Raft) always use odd numbers of nodes.
# etcd cluster configuration requiring quorum
# etcd uses Raft consensus - automatically requires majority
# Cluster of 5 nodes (tolerates 2 failures)
ETCD_INITIAL_CLUSTER="
etcd1=https://10.0.1.1:2380,
etcd2=https://10.0.1.2:2380,
etcd3=https://10.0.1.3:2380,
etcd4=https://10.0.1.4:2380,
etcd5=https://10.0.1.5:2380"
# Election timeout: how long a follower waits for heartbeat
# before starting election
ETCD_ELECTION_TIMEOUT=5000 # 5 seconds
# Heartbeat interval: leader sends heartbeat every N ms
ETCD_HEARTBEAT_INTERVAL=500 # 500ms
# A leader needs acks from at least 3/5 nodes for any write
# A candidate needs votes from at least 3/5 nodes to win election
2. STONITH (Shoot The Other Node In The Head)
When a node detects a potential split-brain, it forcibly kills the other node before taking over. This is common in traditional HA database setups:
- Power fencing: Use IPMI (Intelligent Platform Management Interface) or iLO/iDRAC to remotely power off the other node’s hardware.
- SBD (Storage-Based Death): Both nodes monitor a shared storage device (SAN, shared disk). A node that can’t write to the SBD device within a timeout kills itself (self-fencing).
- Cloud fencing: Use the cloud provider API to stop the other node’s instance (e.g., AWS ec2 stop-instances).
Pacemaker STONITH Configuration (Linux HA)
# Configure IPMI-based fencing for node2
# ipaddr: IPMI address of the target's management controller
# pcmk_host_list: node this fence device can kill
# pcmk_reboot_action="off": power off, don't reboot
# power_timeout=20: wait up to 20s for power off
pcs stonith create ipmi-fence fence_ipmilan \
  ipaddr="192.168.1.100" \
  login="admin" \
  passwd="secret" \
  pcmk_host_list="node2" \
  pcmk_reboot_action="off" \
  power_timeout=20 \
  lanplus=1
# Both nodes must have fencing configured for each other
# If fencing fails, the node that can't fence should
# self-fence (suicide) to prevent split brain
pcs property set stonith-enabled=true
pcs property set no-quorum-policy=suicide
3. Lease-Based Leadership
The leader must periodically renew a lease (a time-limited lock). If it can’t renew the lease (because it’s partitioned from the lease store), it must step down immediately:
# Lease-based leader election pseudocode
import time

class LeaderElection:
    def __init__(self, node_id, lease_store, lease_duration=15,
                 renew_interval=5):
        self.node_id = node_id
        self.store = lease_store        # e.g., etcd, ZooKeeper, DynamoDB
        self.duration = lease_duration  # seconds
        self.renew_interval = renew_interval
        self.is_leader = False
def try_acquire(self):
"""Try to become leader by acquiring the lease."""
success = self.store.create_if_not_exists(
key="/leader",
value=self.node_id,
ttl=self.duration
)
if success:
self.is_leader = True
self._start_renewal_loop()
return success
def _start_renewal_loop(self):
while self.is_leader:
time.sleep(self.renew_interval)
try:
self.store.refresh(key="/leader",
ttl=self.duration)
except (NetworkError, TimeoutError):
# Can't reach lease store — MUST step down
# Another node may acquire the lease
self.is_leader = False
self._stop_accepting_writes()
break
4. Witness / Tiebreaker Node
Deploy a lightweight “witness” node in a third location. It doesn’t store data, but it votes in elections. In a 2-datacenter setup with 2 nodes each, the witness in a third location creates a 5-node quorum. This is used by:
- SQL Server Always On: Azure Cloud Witness or file share witness.
- CockroachDB: Locality-aware placement with odd-numbered node counts.
- Redis Sentinel: An odd number of sentinels (3 or 5) monitoring the primary.
Redis Sentinel Configuration (sentinel.conf)
# Three sentinels monitoring a Redis primary
# Sentinel 1: 10.0.1.1:26379
# Sentinel 2: 10.0.2.1:26379
# Sentinel 3: 10.0.3.1:26379 (witness in third AZ)
# Name of the monitored master
# Last parameter: quorum (2 of 3 sentinels must agree)
sentinel monitor mymaster 10.0.1.100 6379 2
# How often sentinels send PING to the master
sentinel down-after-milliseconds mymaster 5000
# After declaring master down, wait this long before
# attempting failover (handles transient partitions)
sentinel failover-timeout mymaster 30000
# How many replicas can be reconfigured simultaneously
sentinel parallel-syncs mymaster 1
# Password used to authenticate with the monitored master and replicas
sentinel auth-pass mymaster <password>
Split Brain Recovery
Despite all prevention measures, split brain can still happen (bugs, misconfigurations, Byzantine failures). When it does, you need a recovery strategy:
- Last-writer-wins (LWW): Use timestamps or logical clocks to pick one version. Simple but loses data.
- Conflict-free Replicated Data Types (CRDTs): Design data structures that can be merged without conflict (counters, sets, registers). Used by Redis Enterprise, Riak.
- Manual resolution: Flag conflicting records and let operators or application logic resolve them. Amazon’s shopping cart (Dynamo paper) famously took this approach.
- Application-level merge: Define custom merge functions per data type. Two versions of a user profile? Merge the fields individually, take the latest email, union the address list.
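As a concrete illustration of the first two strategies, here is a minimal sketch of last-writer-wins and of a grow-only counter (G-Counter), the simplest CRDT. The record layout and node names are illustrative assumptions:
# Divergence-resolution sketches: LWW register and G-Counter merge
def lww_merge(a, b):
    """Last-writer-wins: keep the version with the later timestamp.
    Simple, but silently discards the losing side's write."""
    return a if a["timestamp"] >= b["timestamp"] else b

def gcounter_merge(a, b):
    """G-Counter: per-node counts, merged by taking the max per node.
    The counter's value is the sum over all nodes; no increment is lost."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

# Each side of the partition incremented independently during the split
side_a = {"node-1": 5, "node-2": 2}
side_b = {"node-1": 3, "node-3": 4}
merged = gcounter_merge(side_a, side_b)   # {"node-1": 5, "node-2": 2, "node-3": 4}
total = sum(merged.values())              # 11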
Summary & Decision Guide
| Scenario | Recommended Approach | Why |
|---|---|---|
| Small cluster (< 10 nodes) | Simple heartbeat with fixed timeout | Low complexity, easy to debug |
| Variable network latency | Phi accrual failure detector | Adapts to changing conditions automatically |
| Large cluster (100+ nodes) | SWIM / gossip-based protocol | $O(1)$ per-node overhead, no SPOF |
| Kubernetes workloads | Separate liveness, readiness, startup probes | Kubernetes-native, granular control |
| Cross-datacenter replication | Phi accrual + quorum-based consensus | Handles high-latency links, prevents split brain |
| Leader election | Lease-based + STONITH/fencing | Guarantees at most one leader at any time |
| Calling external services | Circuit breaker + retry + bulkhead | Prevents cascading failures |