High Level Design Series · Building Blocks · Part 9

Heartbeat, Health Checks & Failure Detection

In a single-machine program, detecting failure is trivial: the process either runs or it doesn’t. The operating system tells you. But in a distributed system, you have no shared memory, no shared clock, and no shared fate. When Node B stops responding to Node A, is it crashed? Is the network down? Is it simply slow under load? This ambiguity sits at the heart of every distributed systems problem, and failure detection is the mechanism we build to manage it.

This post covers the full spectrum: from simple heartbeats to probabilistic failure detectors, from Kubernetes health probes to gossip-based protocols like SWIM, and from graceful degradation patterns to the terrifying split-brain problem. Every section includes real-world configuration examples you can drop into production.

The Failure Detection Problem

Before we build solutions, we need to understand why failure detection is fundamentally hard in distributed systems. The difficulty is not an engineering shortcoming—it is a mathematical impossibility proven in 1985.

Why You Can’t Tell Dead from Slow

Imagine Node A sends a request to Node B and waits for a response. After 10 seconds of silence, Node A faces a decision. The silence could mean any of four things:

  • Node B has crashed.
  • The network between A and B is partitioned or dropping packets.
  • Node B is alive but too slow to respond (overloaded, garbage-collecting, swapping).
  • Node B processed the request, but the response was lost or delayed in transit.

From Node A’s perspective, all four scenarios look identical: silence. There is no way to distinguish them without additional information. This is the core challenge.

FLP Impossibility

The Fischer-Lynch-Paterson (FLP) impossibility result (1985) proves that in an asynchronous system where even a single process can crash, no deterministic algorithm can guarantee consensus. The key insight is that if messages can be delayed arbitrarily (which they can in any real network), you cannot distinguish a crashed process from a very slow one.

Formally: in an asynchronous model with reliable channels, there is no deterministic protocol that solves consensus if even one process may fail by crashing. This means any practical failure detector must be probabilistic or make timing assumptions.

Chandra-Toueg (1996): Introduced the concept of unreliable failure detectors with two properties: completeness (every crashed process is eventually suspected by every correct process) and accuracy (correct processes are not suspected forever). They proved that even a failure detector with eventual accuracy (class ◊S) is sufficient to solve consensus. This is the theoretical foundation for every practical failure detector we use today.

Two Types of Mistakes

Mistake | Description | Consequence
False positive | A healthy node is declared dead | Unnecessary failover, data re-replication, load spike on remaining nodes, potential split brain
False negative | A dead node is considered alive | Requests routed to a dead node, timeouts, user-facing errors, stale data served

Every failure detector trades off between these two. A shorter timeout catches failures faster (fewer false negatives) but triggers more false positives. A longer timeout reduces false positives but delays detection. There is no perfect setting—only the right tradeoff for your system.

Typical detection timeouts:

  • < 1s — aggressive detection (gaming, trading)
  • 5-10s — typical web services
  • 30-60s — conservative (databases, consensus)

Heartbeat Mechanism

The heartbeat is the simplest and most widely used failure detection mechanism. The idea is primitive but effective: a node periodically sends a small “I’m alive” message. If the monitor doesn’t receive heartbeats for a configurable duration, it suspects the node has failed.

Push vs. Pull Heartbeats

Model | How It Works | Pros | Cons
Push (active) | Each node periodically sends heartbeat messages to a monitor or to peers. The monitor passively listens. | Simple to implement. Node controls the pace. Low latency for detection. | N nodes → N messages per interval. Monitor can be overwhelmed.
Pull (polling) | A central monitor periodically pings each node and expects a response within a timeout. | Monitor controls the load. Can stagger checks. Works through firewalls (outbound only). | Monitor is SPOF. Polling interval limits detection speed. Higher latency.
Hybrid | Nodes push heartbeats, but the monitor also pulls if it hasn’t heard from a node recently. | Best of both worlds. Redundant detection paths. | More complex to implement correctly.
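
To make the push model concrete, here is a minimal monitor sketch in Python (class and method names are illustrative, not tied to any particular system): the monitor records when it last heard from each node and suspects any node that has been silent longer than the timeout.

import time

class HeartbeatMonitor:
    """Push-model monitor: nodes report in, the monitor checks for silence."""

    def __init__(self, timeout=10.0):
        self.timeout = timeout
        self.last_seen = {}           # node_id -> monotonic timestamp

    def on_heartbeat(self, node_id):
        """Call whenever a heartbeat message arrives from node_id."""
        self.last_seen[node_id] = time.monotonic()

    def suspected_nodes(self):
        """Nodes that have been silent for longer than the timeout."""
        now = time.monotonic()
        return [node for node, seen in self.last_seen.items()
                if now - seen > self.timeout]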

Heartbeat Parameters

A typical heartbeat system has three configurable parameters: the heartbeat interval (how often “I’m alive” messages are sent), the failure timeout (how long silence is tolerated before a node is suspected), and a confirmation threshold or grace period (how many missed heartbeats, or how long a suspected node stays suspected, before it is declared dead and removed).

Cassandra Heartbeat Configuration (cassandra.yaml)

# Suspicion threshold for the phi accrual failure detector
# (higher = fewer false positives, slower detection)
# Default: 8
phi_convict_threshold: 8

# Gossip interval in milliseconds
# Each node gossips to 1-3 random peers per interval
gossip_interval_in_ms: 1000

# How long to wait for a node to be considered dead
# after being marked DOWN (allows time for flapping)
dead_retry_interval_in_ms: 30000

# RPC timeout for inter-node communication
rpc_timeout_in_ms: 10000

# Cross-datacenter latency can be higher
cross_dc_rtt_in_ms: 100

ZooKeeper Session Heartbeat (zoo.cfg)

# Tick time is the basic unit of time (milliseconds)
# Heartbeats sent every tickTime
tickTime=2000

# Number of ticks for initial sync (followers → leader)
initLimit=10

# Number of ticks a follower may take to sync with the leader
# If a follower falls behind by more than syncLimit * tickTime,
# it is dropped from the ensemble
syncLimit=5

# Client session timeout range
minSessionTimeout=4000   # 2 * tickTime
maxSessionTimeout=40000  # 20 * tickTime

Heartbeat Message Contents

A heartbeat is not just a ping. Production heartbeats carry useful metadata:

{
  "node_id": "node-17",
  "timestamp": 1714502400000,
  "generation": 42,          // incremented on restart
  "version": "3.11.4",
  "load": {
    "cpu_percent": 34.2,
    "memory_percent": 67.8,
    "disk_free_gb": 128.4,
    "active_connections": 1247
  },
  "status": "NORMAL",        // NORMAL | JOINING | LEAVING | DRAINING
  "tokens": ["...", "..."]   // consistent hash ring tokens
}

This piggybacking approach is used by Cassandra, Consul, and etcd. Instead of sending separate monitoring messages, the heartbeat itself carries the information needed for load balancing, capacity planning, and cluster membership decisions.

⚡ Heartbeat Timeline

Watch Node A send heartbeats. See what happens when they stop arriving.


Phi Accrual Failure Detector

Simple heartbeat-based detection uses a fixed timeout: if no heartbeat arrives within T seconds, the node is dead. This is crude. Network latency varies, GC pauses happen, and a fixed threshold that works at 2 AM may trigger false positives during peak traffic at 2 PM.

The Phi (φ) Accrual Failure Detector, introduced by Hayashibara et al. (2004), takes a fundamentally different approach. Instead of a binary “alive or dead” decision, it computes a continuous suspicion level (the φ value) that represents how confident the detector is that the node has failed.

How It Works

Collect Heartbeat Arrival Times

Maintain a sliding window of the last N heartbeat inter-arrival times. For example, if heartbeats arrive at times 0, 1.02, 2.01, 3.05, 4.00, the inter-arrival times are [1.02, 0.99, 1.04, 0.95].

Model the Distribution

Assume the inter-arrival times follow a normal distribution (Gaussian). Compute the sample mean (μ) and sample variance (σ²) from the sliding window. Cassandra uses a window of the last 1,000 heartbeat intervals.

Compute φ

When the current time is t_now and the last heartbeat arrived at t_last, compute how late the heartbeat is: $\Delta = t_{now} - t_{last}$. Then compute φ using the complementary CDF of the normal distribution:

$$\phi = -\log_{10}\left(1 - F\left(\Delta \;\middle|\; \mu, \sigma^2\right)\right)$$

where $F$ is the cumulative distribution function of the normal distribution with mean $\mu$ and variance $\sigma^2$. Intuitively, φ represents the negative log-probability that a heartbeat is this late by chance.

Compare Against Threshold

If $\phi \geq \phi_{threshold}$, suspect the node. The threshold is configurable:

  • $\phi = 1$ → 10% chance this is a false positive
  • $\phi = 2$ → 1% chance
  • $\phi = 3$ → 0.1% chance
  • $\phi = 8$ → 0.000001% chance (Cassandra default)
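
A minimal sketch of this computation, assuming the normal model described above (the class name and defaults are illustrative; Cassandra’s actual implementation differs in details such as its interval estimation):

import math
import statistics
import time
from collections import deque

class PhiAccrualDetector:
    """Track heartbeat inter-arrival times and compute the phi suspicion level."""

    def __init__(self, window_size=1000, min_std=0.05):
        self.intervals = deque(maxlen=window_size)  # sliding window of inter-arrival times
        self.last_arrival = None
        self.min_std = min_std                      # floor on sigma to avoid divide-by-zero

    def heartbeat(self, now=None):
        """Record a heartbeat arrival."""
        now = now if now is not None else time.monotonic()
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now=None):
        """phi = -log10(P(heartbeat is at least this late)) under a normal model."""
        now = now if now is not None else time.monotonic()
        if len(self.intervals) < 2:
            return 0.0
        mu = statistics.mean(self.intervals)
        sigma = max(statistics.stdev(self.intervals), self.min_std)
        delta = now - self.last_arrival
        z = (delta - mu) / sigma
        p_later = 0.5 * math.erfc(z / math.sqrt(2))  # 1 - CDF of the normal distribution
        p_later = max(p_later, 1e-300)               # avoid log(0)
        return -math.log10(p_later)

Each arriving heartbeat calls heartbeat(); a background task compares phi() against the convict threshold (e.g., 8) and marks the node DOWN when it is exceeded.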

Why Phi Accrual Is Superior

Property | Fixed Timeout | Phi Accrual
Adapts to network conditions | ❌ No. Must be manually tuned. | ✅ Yes. Automatically adjusts based on observed latency.
Handles GC pauses | ❌ Causes false positives during long pauses. | ✅ If pauses are occasional, the distribution absorbs them.
Cross-datacenter | ❌ Need different timeouts per DC. | ✅ Automatically adapts to higher latency.
Configurable accuracy | Binary: alive or dead. | Continuous: φ = 1 (low suspicion) to φ = 20+ (certain failure).
Used by | ZooKeeper, Redis Sentinel, most custom systems | Cassandra, Akka, Hazelcast

Cassandra Phi Accrual Configuration

# phi_convict_threshold: suspicion threshold
# Default: 8 (99.999999% confidence before marking DOWN)
# Lower values (5-6) = faster detection, more false positives
# Higher values (10-12) = slower detection, fewer false positives
# Recommended: 8 for same-DC, 12 for cross-DC
phi_convict_threshold: 8

# The window size for tracking heartbeat arrivals
# Larger window = more stable estimates, slower to adapt
# Default: 1000 samples
phi_convict_max_window_size: 1000

Akka Phi Accrual Configuration (application.conf)

akka {
  remote {
    watch-failure-detector {
      # Threshold for phi accrual failure detector
      threshold = 10.0

      # Number of heartbeat samples to use for calculation
      max-sample-size = 200

      # Minimum standard deviation (avoids 0-variance edge case)
      min-std-deviation = 100 ms

      # How often heartbeat request is sent
      heartbeat-interval = 1 s

      # Number of potentially lost/delayed heartbeats before
      # considering it an anomaly (adjusts the first heartbeat
      # timing)
      acceptable-heartbeat-pause = 3 s

      # How often to check for nodes marked unreachable
      unreachable-nodes-reaper-interval = 1s
    }
  }
}

📈 Phi Accrual Failure Detector

Watch φ grow as heartbeats get delayed. When φ crosses the threshold, the node is suspected dead.


Health Check Patterns

Heartbeats tell you whether a node is alive. But “alive” is not the same as “useful.” A process can be running but unable to serve traffic because it’s still loading data, its database connection pool is exhausted, or a dependent service is down. Health checks solve this by probing the functional state of a service.

Three Types of Health Checks

Check Type | Question It Answers | When It Fails | Response
Liveness | Is the process running and not stuck? | Deadlock, infinite loop, OOM but process not killed | Restart the container/process
Readiness | Can it accept and serve traffic right now? | Still warming cache, DB connection down, rate limited | Remove from load balancer (don’t restart)
Startup | Has initial boot completed? | Still loading large dataset, running migrations | Don’t check liveness until startup succeeds
⚠ Common mistake: Using the same endpoint for both liveness and readiness. If your readiness check fails because the database is down, and you use the same check for liveness, Kubernetes will restart your pod—which won’t fix the database. You’ll get a restart loop that makes everything worse. Always separate liveness (is my process healthy?) from readiness (can I serve traffic?).

Kubernetes Probe Types

Kubernetes supports three probe mechanisms: httpGet (probe an HTTP endpoint and expect a 2xx/3xx response), tcpSocket (check that a TCP port accepts connections), and exec (run a command inside the container and check for a zero exit code). The example below uses httpGet for all three probe types:

Complete Kubernetes Health Probe Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: order-service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: order-service
  template:
    metadata:
      labels:
        app: order-service
    spec:
      containers:
      - name: order-service
        image: myregistry/order-service:v2.4.1
        ports:
        - containerPort: 8080

        # STARTUP PROBE: wait for initialization
        # Allows up to 5 min (30 * 10s) for startup
        startupProbe:
          httpGet:
            path: /healthz/startup
            port: 8080
          failureThreshold: 30
          periodSeconds: 10

        # LIVENESS PROBE: detect deadlocks / stuck processes
        # Restarts pod if 3 consecutive checks fail
        livenessProbe:
          httpGet:
            path: /healthz/live
            port: 8080
          initialDelaySeconds: 0   # startup probe handles delay
          periodSeconds: 10
          timeoutSeconds: 3
          failureThreshold: 3
          successThreshold: 1

        # READINESS PROBE: control traffic routing
        # Removes from Service endpoints if unhealthy
        readinessProbe:
          httpGet:
            path: /healthz/ready
            port: 8080
          initialDelaySeconds: 0
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
          successThreshold: 2

Implementing Health Endpoints

A production-grade health endpoint doesn’t just return 200 OK. It checks dependencies and reports detailed status:

// Express.js health endpoint implementation
// (assumes `app`, `db`, `redis`, and a Kafka client are initialized elsewhere;
// checkKafka() follows the same pattern as checkRedis(), and cacheWarmed /
// migrationsComplete are application flags set during boot)

app.get('/healthz/live', (req, res) => {
  // Liveness: only checks if the process is responsive
  // Do NOT check external dependencies here
  res.status(200).json({ status: 'alive', uptime: process.uptime() });
});

app.get('/healthz/ready', async (req, res) => {
  const checks = {
    database: await checkDatabase(),
    redis: await checkRedis(),
    kafka: await checkKafka(),
  };

  const allHealthy = Object.values(checks).every(c => c.healthy);
  const status = allHealthy ? 200 : 503;

  res.status(status).json({
    status: allHealthy ? 'ready' : 'not_ready',
    checks,
    timestamp: new Date().toISOString(),
  });
});

app.get('/healthz/startup', async (req, res) => {
  if (!cacheWarmed || !migrationsComplete) {
    return res.status(503).json({ status: 'starting' });
  }
  res.status(200).json({ status: 'started' });
});

async function checkDatabase() {
  try {
    const start = Date.now();
    await db.query('SELECT 1');
    return { healthy: true, latency_ms: Date.now() - start };
  } catch (err) {
    return { healthy: false, error: err.message };
  }
}

async function checkRedis() {
  try {
    const start = Date.now();
    await redis.ping();
    return { healthy: true, latency_ms: Date.now() - start };
  } catch (err) {
    return { healthy: false, error: err.message };
  }
}
💡 Deep vs. Shallow Health Checks: A shallow health check (e.g., return 200 immediately) only confirms the HTTP server is running. A deep health check (e.g., query the database, check cache) confirms the service can actually do useful work. Use shallow checks for liveness (you don’t want to restart a pod because PostgreSQL is temporarily unreachable) and deep checks for readiness (you do want to stop routing traffic if the database is down).

Health Check Anti-Patterns

  • One endpoint for both liveness and readiness: as warned above, a failed dependency turns into a restart loop instead of a traffic drain.
  • Deep dependency checks in the liveness probe: a transient database blip restarts every pod that depends on it.
  • Expensive health checks (full queries, large test transactions) that become a source of load themselves when every instance is probed every few seconds.

Gossip-Based Failure Detection

Centralized heartbeat monitoring has a fundamental problem: the monitor is a single point of failure, and at scale (thousands of nodes), it becomes a bottleneck. Gossip-based protocols solve this by making failure detection a collective, decentralized activity.

How Gossip Works

Each node maintains a membership list—a table of all known nodes and their heartbeat counters. Periodically (e.g., every second), each node:

  1. Increments its own heartbeat counter.
  2. Selects a random peer from its membership list.
  3. Sends its full membership list to that peer.
  4. Merges the received list with its own, keeping the highest heartbeat counter for each node.

If a node’s heartbeat counter hasn’t been incremented within a timeout period, it is suspected as failed. This is “epidemic-style” dissemination: information spreads like a rumor, reaching all $N$ nodes in $O(\log N)$ gossip rounds.

Convergence guarantee: With $N$ nodes, each gossiping to 1 random peer per round, it takes approximately $\lceil \log_2 N \rceil$ rounds for information to reach every node with high probability. For 1,000 nodes with 1-second intervals, that’s about 10 seconds for cluster-wide awareness. For 1,000,000 nodes: ~20 seconds.
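
A minimal sketch of the merge rule described above, assuming a membership table of node_id → (heartbeat_counter, last_updated); the function names and timeout are illustrative:

import random
import time

def merge_membership(local, received):
    """Keep the highest heartbeat counter seen for each node."""
    for node_id, (counter, _) in received.items():
        if node_id not in local or counter > local[node_id][0]:
            local[node_id] = (counter, time.monotonic())  # counter, when we last saw it advance

def gossip_round(my_id, membership, send):
    """One round: bump own counter, send the full list to one random peer."""
    counter, _ = membership[my_id]
    membership[my_id] = (counter + 1, time.monotonic())
    peers = [n for n in membership if n != my_id]
    if peers:
        send(random.choice(peers), dict(membership))

def suspected(membership, my_id, timeout=10.0):
    """Nodes whose counter hasn't advanced within `timeout` seconds."""
    now = time.monotonic()
    return [n for n, (_, updated) in membership.items()
            if n != my_id and now - updated > timeout]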

The SWIM Protocol

SWIM: Scalable Weakly-consistent Infection-style Membership

SWIM (Das et al., 2002) is the gold standard for gossip-based failure detection. Used by Consul (HashiCorp), Memberlist (Go library), and Serf. It improves basic gossip in three crucial ways:

Ping

Each protocol period, node $M_i$ selects a random target $M_j$ and sends a ping. If $M_j$ responds with an ack within a timeout, $M_j$ is alive. Done.

Ping-Req

If $M_j$ doesn’t respond to the direct ping, $M_i$ does not immediately suspect $M_j$. Instead, $M_i$ selects $k$ random nodes (typically $k = 3$) and asks them to ping $M_j$ on its behalf (ping-req). If any of the $k$ nodes gets an ack from $M_j$, $M_j$ is alive. This handles the case where only the $M_i \leftrightarrow M_j$ path is broken.

Suspicion Sub-Protocol

If neither the direct ping nor any indirect ping gets an ack, $M_j$ is marked SUSPECT. The suspicion is disseminated via gossip (piggybacked on regular protocol messages). $M_j$ remains in suspect state for a configurable timeout. If $M_j$ sends any message during this window, it refutes the suspicion and returns to ALIVE. If the timeout expires without refutation, $M_j$ is marked CONFIRM (dead) and removed from the membership list.
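
The three steps combine into a single protocol period per node. Here is a hedged sketch of one such period (not Consul’s or memberlist’s actual code; ping and ping_req are assumed transport callbacks that return True on an ack):

import random
import time

def swim_probe_round(self_id, members, ping, ping_req,
                     k=3, suspicion_timeout=5.0):
    """One SWIM protocol period for node `self_id` (sketch).

    members: dict node_id -> {"state": ..., "deadline": ...}
    """
    candidates = [n for n, m in members.items()
                  if n != self_id and m["state"] != "CONFIRM"]
    if not candidates:
        return
    target = random.choice(candidates)

    # 1. Direct ping
    if ping(target):
        members[target]["state"] = "ALIVE"
        return

    # 2. Indirect pings via k random helpers (covers a broken A <-> B path)
    helpers = random.sample([n for n in candidates if n != target],
                            k=min(k, len(candidates) - 1))
    if any(ping_req(h, target) for h in helpers):
        members[target]["state"] = "ALIVE"
        return

    # 3. No ack at all: mark SUSPECT; the target may refute before the deadline
    members[target]["state"] = "SUSPECT"
    members[target]["deadline"] = time.monotonic() + suspicion_timeout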

SWIM vs. Traditional Heartbeat

Property | Central Heartbeat | SWIM
Message complexity per period | $O(N)$ — every node → monitor | $O(1)$ per node — each node sends one ping + up to $k$ ping-reqs
False positive rate | Depends on single path quality | Lower: $k+1$ independent paths must all fail
Detection time | $O(1)$ protocol periods (fast for monitor) | $O(1)$ protocol periods per node, $O(\log N)$ for cluster-wide dissemination
SPOF | Yes — the monitor | No — fully decentralized
Network load | All traffic to one node — fan-in bottleneck | Distributed evenly across all nodes

Consul (Memberlist/SWIM) Configuration

# /etc/consul.d/consul.hcl

# SWIM protocol configuration
gossip_lan {
  # Interval between protocol rounds
  gossip_interval = "200ms"

  # Number of random nodes to gossip to
  gossip_nodes = 3

  # How often to directly probe a random node, and how long
  # to wait for its ack
  probe_interval = "1s"
  probe_timeout  = "500ms"

  # Suspicion multiplier:
  # suspicion_timeout = suspicion_mult * log(N) * probe_interval
  suspicion_mult = 4

  # How many nodes to ask for indirect probes
  indirect_checks = 3

  # Number of retransmissions for protocol messages
  retransmit_mult = 4
}

# WAN gossip (cross-datacenter, higher latency)
gossip_wan {
  gossip_interval = "500ms"
  probe_interval  = "3s"
  probe_timeout   = "2s"
  suspicion_mult  = 6
}
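
Plugging numbers into the suspicion formula from the comment above: with suspicion_mult = 4, probe_interval = 1s, and a 1,000-node cluster (base-10 log), the suspicion timeout is roughly $4 \times \log_{10}(1000) \times 1\,\mathrm{s} = 12\,\mathrm{s}$ — a suspected node has on the order of ten seconds to refute the suspicion before being confirmed dead. (The exact value depends on the log base and rounding the implementation uses.)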

Gossip Protocol Optimizations

Graceful Degradation

Detecting failure is only half the battle. What you do after detection determines whether the failure cascades into a system-wide outage or is absorbed gracefully. This section covers the four key patterns for handling detected failures.

Circuit Breaker Pattern

Inspired by electrical circuit breakers, this pattern prevents a service from repeatedly calling a failing dependency:

State | Behavior | Transition
CLOSED | All requests pass through normally. Failures are counted. | If failure count ≥ threshold within window → OPEN
OPEN | All requests immediately fail without calling the dependency. Returns fallback response or error. | After timeout period → HALF-OPEN
HALF-OPEN | Allow one (or a few) test requests through to the dependency. | If test succeeds → CLOSED. If test fails → OPEN (reset timer).
# Python circuit breaker implementation
import time
from enum import Enum
from threading import Lock

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30,
                 half_open_max_calls=3):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max_calls = half_open_max_calls
        self.state = State.CLOSED
        self.failure_count = 0
        self.last_failure_time = None
        self.half_open_calls = 0
        self.lock = Lock()

    def call(self, func, *args, fallback=None, **kwargs):
        with self.lock:
            if self.state == State.OPEN:
                if self._should_attempt_reset():
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    if fallback:
                        return fallback()
                    raise CircuitOpenError(
                        f"Circuit open since {self.last_failure_time}")

            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max_calls:
                    if fallback:
                        return fallback()
                    raise CircuitOpenError("Half-open call limit reached")
                self.half_open_calls += 1

        try:
            result = func(*args, **kwargs)
            self._on_success()
            return result
        except Exception as e:
            self._on_failure()
            if fallback:
                return fallback()
            raise

    def _on_success(self):
        with self.lock:
            self.failure_count = 0
            self.state = State.CLOSED

    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            if self.failure_count >= self.failure_threshold:
                self.state = State.OPEN

    def _should_attempt_reset(self):
        return (time.time() - self.last_failure_time
                >= self.recovery_timeout)
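
A hypothetical usage sketch of the class above (the service URL and fallback are illustrative):

import requests

breaker = CircuitBreaker(failure_threshold=5, recovery_timeout=30)

def fetch_recommendations(user_id):
    # Any exception raised here counts as a failure against the breaker
    resp = requests.get(f"https://recs.internal/users/{user_id}", timeout=2)
    resp.raise_for_status()
    return resp.json()

# Returns live data while CLOSED; returns the fallback instantly while OPEN
recs = breaker.call(fetch_recommendations, "user-42",
                    fallback=lambda: {"items": [], "source": "fallback"})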

Retry with Exponential Backoff

When a call fails, retrying immediately is usually worse than not retrying at all—it adds load to an already-struggling system. Exponential backoff with jitter spreads retries over time:

import random
import time

def retry_with_backoff(func, max_retries=5, base_delay=1.0,
                       max_delay=60.0, jitter=True):
    """
    Retry with exponential backoff and full jitter.
    Delay = random(0, min(max_delay, base_delay * 2^attempt))
    """
    for attempt in range(max_retries + 1):
        try:
            return func()
        except Exception as e:
            if attempt == max_retries:
                raise  # exhausted retries

            delay = min(max_delay, base_delay * (2 ** attempt))
            if jitter:
                # Full jitter: uniform random between 0 and delay
                delay = random.uniform(0, delay)

            print(f"Attempt {attempt+1} failed: {e}. "
                  f"Retrying in {delay:.2f}s...")
            time.sleep(delay)
Why jitter matters: Without jitter, if 1,000 clients all fail at the same time, they all retry at exactly 1s, then 2s, then 4s—creating thundering herd spikes. Full jitter (uniform random between 0 and the backoff delay) spreads retries uniformly, reducing peak load by ~50%. AWS recommends full jitter for all retries.
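
A hedged usage sketch of the helper above (the endpoint is hypothetical):

import requests

def get_inventory():
    resp = requests.get("https://inventory.internal/items", timeout=2)
    resp.raise_for_status()
    return resp.json()

# Retries up to 4 times, with delays drawn from at most 0.5s, 1s, 2s, 4s (jittered)
items = retry_with_backoff(get_inventory, max_retries=4, base_delay=0.5)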

Bulkhead Pattern

Named after the watertight compartments in a ship: if one compartment floods, the others stay dry. In software, the bulkhead pattern isolates failures by partitioning resources:

# Java-style bulkhead with semaphore (conceptual Python)
import requests
from threading import Semaphore
from functools import wraps

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free concurrency slots."""

def bulkhead(max_concurrent, name="default"):
    """Limit concurrent calls to a dependency."""
    sem = Semaphore(max_concurrent)

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            acquired = sem.acquire(timeout=5)
            if not acquired:
                raise BulkheadFullError(
                    f"Bulkhead '{name}' full: "
                    f"{max_concurrent} concurrent calls")
            try:
                return func(*args, **kwargs)
            finally:
                sem.release()
        return wrapper
    return decorator

@bulkhead(max_concurrent=20, name="payment-service")
def call_payment_service(order_id, amount):
    return requests.post(
        "https://payment.internal/charge",
        json={"order_id": order_id, "amount": amount},
        timeout=5
    )

Fallback Strategies

Strategy | Description | Example
Cached response | Return the last known good response from cache | Product catalog returns stale (5-min old) prices when price service is down
Default value | Return a safe, hardcoded default | Recommendation service down → show “Popular Products” instead
Graceful feature removal | Disable the failing feature, keep rest working | Review service down → hide reviews section, product page still works
Queue for later | Write the request to a queue and process when healthy | Email service down → enqueue confirmation email for retry
Alternate dependency | Route to a backup service or different region | Primary payment processor down → switch to secondary
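
The first strategy (cached response) is simple enough to sketch in a few lines; the helper name and staleness limit here are illustrative:

import time

_last_good = {}  # key -> (value, unix timestamp of last successful fetch)

def with_cached_fallback(key, fetch, max_stale_s=300):
    """Call fetch(); on failure, serve the last good value if it is
    no older than max_stale_s seconds."""
    try:
        value = fetch()
        _last_good[key] = (value, time.time())
        return value, "fresh"
    except Exception:
        if key in _last_good:
            value, fetched_at = _last_good[key]
            if time.time() - fetched_at <= max_stale_s:
                return value, "stale"
        raise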

Split Brain

Split brain is the most dangerous failure mode in distributed systems. It occurs when a network partition divides a cluster into two (or more) groups, each believing the other is dead. Both sides continue operating independently, potentially accepting conflicting writes, electing competing leaders, and corrupting data.

How Split Brain Happens

Consider a 5-node database cluster with Node 1 as the leader:

  1. A network switch failure partitions the cluster: Nodes {1, 2} on one side, Nodes {3, 4, 5} on the other.
  2. Nodes {3, 4, 5} stop receiving heartbeats from Node 1. Their failure detector marks Node 1 as dead.
  3. Nodes {3, 4, 5} elect Node 3 as the new leader and start accepting writes.
  4. Meanwhile, Nodes {1, 2} are still running. Node 1 thinks Nodes 3, 4, 5 are dead but continues serving as leader.
  5. Now you have two leaders accepting writes simultaneously. Data diverges. When the partition heals, you have conflicting state that may be impossible to reconcile.
🚨 Real-world incident: In October 2018, GitHub’s MySQL cluster experienced a split-brain event during a roughly 40-second network partition between its data centers. Writes were accepted on both sides of the partition, and when connectivity returned the divergent writes could not be reconciled automatically; affected data had to be inspected and repaired by hand, and the site ran degraded for about a day. GitHub’s incident report describes it as one of their most complex recovery operations.

Prevention Strategies

1. Quorum-Based Decisions

The most fundamental defense: require a majority (quorum) to make any decision. In a cluster of $N$ nodes, a quorum is $\lfloor N/2 \rfloor + 1$ nodes.
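
A quick sanity check of the formula (plain Python, illustrative):

def quorum(n):
    """Majority quorum for an n-node cluster: floor(n/2) + 1."""
    return n // 2 + 1

for n in (3, 4, 5, 7):
    print(f"N={n}: quorum={quorum(n)}, tolerates {n - quorum(n)} failures")
# N=3: quorum=2, tolerates 1
# N=4: quorum=3, tolerates 1   <- an even-sized cluster buys nothing extra
# N=5: quorum=3, tolerates 2
# N=7: quorum=4, tolerates 3

Because two disjoint partitions can never both contain a majority, quorum rules out two simultaneous leaders.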

# etcd cluster configuration requiring quorum
# etcd uses Raft consensus - automatically requires majority

# Cluster of 5 nodes (tolerates 2 failures)
ETCD_INITIAL_CLUSTER="
  etcd1=https://10.0.1.1:2380,
  etcd2=https://10.0.1.2:2380,
  etcd3=https://10.0.1.3:2380,
  etcd4=https://10.0.1.4:2380,
  etcd5=https://10.0.1.5:2380"

# Election timeout: how long a follower waits for heartbeat
# before starting election
ETCD_ELECTION_TIMEOUT=5000       # 5 seconds

# Heartbeat interval: leader sends heartbeat every N ms
ETCD_HEARTBEAT_INTERVAL=500      # 500ms

# A leader needs acks from at least 3/5 nodes for any write
# A candidate needs votes from at least 3/5 nodes to win election

2. STONITH (Shoot The Other Node In The Head)

When a node detects a potential split-brain, it forcibly kills the other node before taking over. This is common in traditional HA database setups:

Pacemaker STONITH Configuration (Linux HA)

# Configure IPMI-based fencing for node2:
# power it off (don't reboot) and wait up to 20s for power-off
pcs stonith create ipmi-fence fence_ipmilan \
  ipaddr="192.168.1.100" \
  login="admin" \
  passwd="secret" \
  pcmk_host_list="node2" \
  pcmk_reboot_action="off" \
  power_timeout=20 \
  lanplus=1

# Both nodes must have fencing configured for each other
# If fencing fails, the node that can't fence should
# self-fence (suicide) to prevent split brain
pcs property set stonith-enabled=true
pcs property set no-quorum-policy=suicide

3. Lease-Based Leadership

The leader must periodically renew a lease (a time-limited lock). If it can’t renew the lease (because it’s partitioned from the lease store), it must step down immediately:

# Lease-based leader election pseudocode
class LeaderElection:
    def __init__(self, node_id, lease_store, lease_duration=15,
                 renew_interval=5):
        self.node_id = node_id    # this node's identity, written into the lease
        self.store = lease_store  # e.g., etcd, ZooKeeper, DynamoDB
        self.duration = lease_duration  # seconds
        self.renew_interval = renew_interval
        self.is_leader = False

    def try_acquire(self):
        """Try to become leader by acquiring the lease."""
        success = self.store.create_if_not_exists(
            key="/leader",
            value=self.node_id,
            ttl=self.duration
        )
        if success:
            self.is_leader = True
            self._start_renewal_loop()
        return success

    def _start_renewal_loop(self):
        while self.is_leader:
            time.sleep(self.renew_interval)
            try:
                self.store.refresh(key="/leader",
                                   ttl=self.duration)
            except (NetworkError, TimeoutError):
                # Can't reach lease store — MUST step down
                # Another node may acquire the lease
                self.is_leader = False
                self._stop_accepting_writes()
                break
⚠ Lease clock skew danger: Leases depend on wall-clock time. If the leader’s clock is running fast, it may think the lease is valid while the lease store has already expired it. Solution: use monotonic clocks for measuring lease duration, and always include a safety margin (fence the write path before the lease actually expires, e.g., stop writes when 80% of the lease has elapsed).
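
A minimal sketch of that safety margin using a monotonic clock (the class and method names are illustrative, and the 80% fraction mirrors the rule of thumb above):

import time

class LocalLeaseView:
    """Track lease validity locally with a monotonic clock and a safety margin."""

    def __init__(self, duration_s=15.0, safety_fraction=0.8):
        self.duration_s = duration_s
        self.safety_fraction = safety_fraction
        self.renewed_at = None           # monotonic timestamp of last renewal

    def on_renewed(self):
        """Call right after the lease store confirms acquisition/renewal."""
        self.renewed_at = time.monotonic()

    def may_write(self):
        """Allow writes only while less than 80% of the lease has elapsed."""
        if self.renewed_at is None:
            return False
        elapsed = time.monotonic() - self.renewed_at
        return elapsed < self.duration_s * self.safety_fraction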

4. Witness / Tiebreaker Node

Deploy a lightweight “witness” node in a third location. It doesn’t store data, but it votes in elections. In a 2-datacenter setup with 2 nodes each, a witness in a third location brings the voter count to 5, so one side can still assemble a majority of 3 during a partition. Redis Sentinel’s quorum of sentinels, MongoDB’s replica-set arbiters, and SQL Server’s mirroring witness all follow this pattern:

Redis Sentinel Configuration (sentinel.conf)

# Three sentinels monitoring a Redis primary
# Sentinel 1: 10.0.1.1:26379
# Sentinel 2: 10.0.2.1:26379
# Sentinel 3: 10.0.3.1:26379 (witness in third AZ)

# Name of the monitored master
# Last parameter: quorum (2 of 3 sentinels must agree)
sentinel monitor mymaster 10.0.1.100 6379 2

# Milliseconds without a valid PING reply before this sentinel
# marks the master as subjectively down
sentinel down-after-milliseconds mymaster 5000

# After declaring master down, wait this long before
# attempting failover (handles transient partitions)
sentinel failover-timeout mymaster 30000

# How many replicas can be reconfigured simultaneously
sentinel parallel-syncs mymaster 1

# Password used to authenticate with the master and replicas
sentinel auth-pass mymaster <password>

Split Brain Recovery

Despite all prevention measures, split brain can still happen (bugs, misconfigurations, Byzantine failures). When it does, recovery generally means picking one side as authoritative, reconciling or discarding the divergent writes (manually if automated merging is impossible), and only then allowing the minority side to rejoin the cluster.

Summary & Decision Guide

Scenario | Recommended Approach | Why
Small cluster (< 10 nodes) | Simple heartbeat with fixed timeout | Low complexity, easy to debug
Variable network latency | Phi accrual failure detector | Adapts to changing conditions automatically
Large cluster (100+ nodes) | SWIM / gossip-based protocol | $O(1)$ per-node overhead, no SPOF
Kubernetes workloads | Separate liveness, readiness, startup probes | Kubernetes-native, granular control
Cross-datacenter replication | Phi accrual + quorum-based consensus | Handles high-latency links, prevents split brain
Leader election | Lease-based + STONITH/fencing | Guarantees at most one leader at any time
Calling external services | Circuit breaker + retry + bulkhead | Prevents cascading failures
Key takeaway: Failure detection is not a solved problem—it is a spectrum of tradeoffs. You choose your detector based on your scale, latency tolerance, and the cost of each type of mistake. The Phi accrual detector is the closest thing to a “default good choice,” but even it must be paired with proper health checks, graceful degradation, and split-brain prevention to build a truly resilient system.