Heartbeat, Health Checks & Failure Detection
In a single-machine program, detecting failure is trivial: the process either runs or it doesn’t. The operating system tells you. But in a distributed system, you have no shared memory, no shared clock, and no shared fate. When Node B stops responding to Node A, is it crashed? Is the network down? Is it simply slow under load? This ambiguity sits at the heart of every distributed systems problem, and failure detection is the mechanism we build to manage it.
This post covers the full spectrum: from simple heartbeats to probabilistic failure detectors, from Kubernetes health probes to gossip-based protocols like SWIM, and from graceful degradation patterns to the terrifying split-brain problem. Every section includes real-world configuration examples you can drop into production.
The Failure Detection Problem
Before we build solutions, we need to understand why failure detection is fundamentally hard in distributed systems. The difficulty is not an engineering shortcoming—it is rooted in a mathematical impossibility result proven in 1985.
Why You Can’t Tell Dead from Slow
Imagine Node A sends a request to Node B and waits for a response. After 10 seconds of silence, Node A faces a decision:
- Node B crashed — The process is dead. No response will ever come.
- The network is partitioned — Node B is alive and well, but packets can’t reach Node A. From B’s perspective, it’s A that disappeared.
- Node B is slow — A long garbage collection pause, heavy disk I/O, or CPU saturation is causing delays. The response will arrive eventually—in 30 seconds, 2 minutes, or 10 minutes.
- The response was lost — Node B processed the request and sent a reply, but the return packet was dropped by a flaky switch.
From Node A’s perspective, all four scenarios look identical: silence. There is no way to distinguish them without additional information. This is the core challenge.
FLP Impossibility
The Fischer-Lynch-Paterson (FLP) impossibility result (1985) proves that in an asynchronous system where even a single process can crash, no deterministic algorithm can guarantee consensus. The key insight is that if messages can be delayed arbitrarily (which they can in any real network), you cannot distinguish a crashed process from a very slow one.
Formally: in an asynchronous model with reliable channels, there is no deterministic protocol that solves consensus if even one process may fail by crashing. This means any practical failure detector must be probabilistic or make timing assumptions.
Two Types of Mistakes
| Mistake | Description | Consequence |
|---|---|---|
| False positive | A healthy node is declared dead | Unnecessary failover, data re-replication, load spike on remaining nodes, potential split brain |
| False negative | A dead node is considered alive | Requests routed to a dead node, timeouts, user-facing errors, stale data served |
Every failure detector trades off between these two. A shorter timeout catches failures faster (fewer false negatives) but triggers more false positives. A longer timeout reduces false positives but delays detection. There is no perfect setting—only the right tradeoff for your system.
Heartbeat Mechanism
The heartbeat is the simplest and most widely used failure detection mechanism. The idea is primitive but effective: a node periodically sends a small “I’m alive” message. If the monitor doesn’t receive heartbeats for a configurable duration, it suspects the node has failed.
Push vs. Pull Heartbeats
| Model | How It Works | Pros | Cons |
|---|---|---|---|
| Push (active) | Each node periodically sends heartbeat messages to a monitor or to peers. The monitor passively listens. | Simple to implement. Node controls the pace. Low latency for detection. | N nodes → N messages per interval. Monitor can be overwhelmed. |
| Pull (polling) | A central monitor periodically pings each node and expects a response within a timeout. | Monitor controls the load. Can stagger checks. Works through firewalls (outbound only). | Monitor is SPOF. Polling interval limits detection speed. Higher latency. |
| Hybrid | Nodes push heartbeats, but the monitor also pulls if it hasn’t heard from a node recently. | Best of both worlds. Redundant detection paths. | More complex to implement correctly. |
Heartbeat Parameters
A typical heartbeat system has three configurable parameters:
- Interval (T) — How often heartbeats are sent. Common values: 1–10 seconds. Shorter intervals detect failures faster but generate more network traffic. At 1,000 nodes with a 1-second interval, that’s 1,000 packets/sec just for heartbeats.
- Timeout (k × T) — How many missed heartbeats before suspicion. Typically k = 3 to 5. With T = 5s and k = 3, a node is suspected after 15 seconds of silence.
- Suspicion vs. Confirmation — Many systems have a two-stage process: first the node is suspected, then after further verification (or a longer timeout), it is confirmed dead. This reduces false positives from transient network issues.
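These parameters map directly onto a small monitoring loop. The sketch below is illustrative only (the interval, multiplier, and node IDs are placeholder assumptions, not tied to any particular system): it records the last heartbeat time per node and suspects any node that has been silent longer than k × T.
# Minimal push-model heartbeat monitor (illustrative sketch)
import time

INTERVAL_S = 5     # T: expected heartbeat interval
K = 3              # missed heartbeats before suspicion

last_seen = {}     # node_id -> timestamp of most recent heartbeat
suspected = set()

def on_heartbeat(node_id):
    """Called whenever a heartbeat message arrives from a node."""
    last_seen[node_id] = time.time()
    suspected.discard(node_id)   # any message clears the suspicion

def check_timeouts():
    """Run periodically (e.g., once per second) on the monitor."""
    now = time.time()
    for node_id, ts in last_seen.items():
        if node_id not in suspected and now - ts > K * INTERVAL_S:
            suspected.add(node_id)
            print(f"{node_id} suspected: silent for {now - ts:.1f}s")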
Cassandra Heartbeat Configuration (cassandra.yaml)
# Phi accrual suspicion threshold before marking a node DOWN
# Default: 8
phi_convict_threshold: 8
# Gossip interval in milliseconds
# Each node gossips to 1-3 random peers per interval
gossip_interval_in_ms: 1000
# How long to wait for a node to be considered dead
# after being marked DOWN (allows time for flapping)
dead_retry_interval_in_ms: 30000
# RPC timeout for inter-node communication
rpc_timeout_in_ms: 10000
# Cross-datacenter latency can be higher
cross_dc_rtt_in_ms: 100
ZooKeeper Session Heartbeat (zoo.cfg)
# Tick time is the basic unit of time (milliseconds)
# Heartbeats sent every tickTime
tickTime=2000
# Number of ticks for initial sync (followers → leader)
initLimit=10
# Number of ticks a follower may lag behind the leader
# before it is dropped (syncLimit * tickTime = 10s here)
syncLimit=5
# Client session timeout range
minSessionTimeout=4000 # 2 * tickTime
maxSessionTimeout=40000 # 20 * tickTime
Heartbeat Message Contents
A heartbeat is not just a ping. Production heartbeats carry useful metadata:
{
"node_id": "node-17",
"timestamp": 1714502400000,
"generation": 42, // incremented on restart
"version": "3.11.4",
"load": {
"cpu_percent": 34.2,
"memory_percent": 67.8,
"disk_free_gb": 128.4,
"active_connections": 1247
},
"status": "NORMAL", // NORMAL | JOINING | LEAVING | DRAINING
"tokens": ["...", "..."] // consistent hash ring tokens
}
This piggybacking approach is used by Cassandra, Consul, and etcd. Instead of sending separate monitoring messages, the heartbeat itself carries the information needed for load balancing, capacity planning, and cluster membership decisions.
Phi Accrual Failure Detector
Simple heartbeat-based detection uses a fixed timeout: if no heartbeat arrives within T seconds, the node is dead. This is crude. Network latency varies, GC pauses happen, and a fixed threshold that works at 2 AM may trigger false positives during peak traffic at 2 PM.
The Phi (φ) Accrual Failure Detector, introduced by Hayashibara et al. (2004), takes a fundamentally different approach. Instead of a binary “alive or dead” decision, it computes a continuous suspicion level (the φ value) that represents how confident the detector is that the node has failed.
How It Works
Collect Heartbeat Arrival Times
Maintain a sliding window of the last N heartbeat inter-arrival times. For example, if heartbeats arrive at times 0, 1.02, 2.01, 3.05, 4.00, the inter-arrival times are [1.02, 0.99, 1.04, 0.95].
Model the Distribution
Assume the inter-arrival times follow a normal distribution (Gaussian). Compute the sample mean (μ) and sample variance (σ²) from the sliding window. Cassandra uses a window of the last 1,000 heartbeat intervals.
Compute φ
When the current time is t_now and the last heartbeat arrived at t_last, compute how late the heartbeat is: $\Delta = t_{now} - t_{last}$. Then compute φ using the complementary CDF of the normal distribution:
$$\phi = -\log_{10}\left(1 - F\left(\Delta \;\middle|\; \mu, \sigma^2\right)\right)$$
where $F$ is the cumulative distribution function of the normal distribution with mean $\mu$ and variance $\sigma^2$. Intuitively, φ represents the negative log-probability that a heartbeat is this late by chance.
Compare Against Threshold
If $\phi \geq \phi_{threshold}$, suspect the node. The threshold is configurable:
- $\phi = 1$ → 10% chance this is a false positive
- $\phi = 2$ → 1% chance
- $\phi = 3$ → 0.1% chance
- $\phi = 8$ → 0.000001% chance (Cassandra default)
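The computation described above fits in a few lines once you keep a sliding window of inter-arrival times. Here is a minimal sketch under the same normal-distribution assumption (the window size and variance floor are illustrative choices, not Cassandra's or Akka's actual code):
# Phi accrual suspicion level (illustrative sketch)
import math
from collections import deque
from statistics import mean, pstdev

WINDOW = 1000
intervals = deque(maxlen=WINDOW)   # observed heartbeat inter-arrival times
last_arrival = None

def record_heartbeat(now):
    """Append the latest inter-arrival time to the sliding window."""
    global last_arrival
    if last_arrival is not None:
        intervals.append(now - last_arrival)
    last_arrival = now

def phi(now):
    """-log10 of the probability that a heartbeat is at least this late."""
    if not intervals:
        return 0.0
    mu = mean(intervals)
    sigma = max(pstdev(intervals), 1e-3)          # floor avoids zero variance
    delta = now - last_arrival
    # P(heartbeat later than delta) = 1 - F(delta | mu, sigma^2)
    p_later = 0.5 * math.erfc((delta - mu) / (sigma * math.sqrt(2)))
    return -math.log10(max(p_later, 1e-15))       # cap to avoid log(0)
A monitor would call record_heartbeat() on every arrival and compare phi(time.time()) against the configured threshold (8 by default in Cassandra).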
Why Phi Accrual Is Superior
| Property | Fixed Timeout | Phi Accrual |
|---|---|---|
| Adapts to network conditions | ❌ No. Must be manually tuned. | ✅ Yes. Automatically adjusts based on observed latency. |
| Handles GC pauses | ❌ Causes false positives during long pauses. | ✅ If pauses are occasional, the distribution absorbs them. |
| Cross-datacenter | ❌ Need different timeouts per DC. | ✅ Automatically adapts to higher latency. |
| Configurable accuracy | Binary: alive or dead. | Continuous: φ = 1 (low suspicion) to φ = 20+ (certain failure). |
| Used by | ZooKeeper, Redis Sentinel, most custom systems | Cassandra, Akka, Hazelcast |
Cassandra Phi Accrual Configuration
# phi_convict_threshold: suspicion threshold
# Default: 8 (99.999999% confidence before marking DOWN)
# Lower values (5-6) = faster detection, more false positives
# Higher values (10-12) = slower detection, fewer false positives
# Recommended: 8 for same-DC, 12 for cross-DC
phi_convict_threshold: 8
# The window size for tracking heartbeat arrivals
# Larger window = more stable estimates, slower to adapt
# Default: 1000 samples
phi_convict_max_window_size: 1000
Akka Phi Accrual Configuration (application.conf)
akka {
remote {
watch-failure-detector {
# Threshold for phi accrual failure detector
threshold = 10.0
# Number of heartbeat samples to use for calculation
max-sample-size = 200
# Minimum standard deviation (avoids 0-variance edge case)
min-std-deviation = 100 ms
# How often heartbeat request is sent
heartbeat-interval = 1 s
# Number of potentially lost/delayed heartbeats before
# considering it an anomaly (adjusts the first heartbeat
# timing)
acceptable-heartbeat-pause = 3 s
# How often to check for nodes marked unreachable
unreachable-nodes-reaper-interval = 1s
}
}
}
Health Check Patterns
Heartbeats tell you whether a node is alive. But “alive” is not the same as “useful.” A process can be running but unable to serve traffic because it’s still loading data, its database connection pool is exhausted, or a dependent service is down. Health checks solve this by probing the functional state of a service.
Three Types of Health Checks
| Check Type | Question It Answers | When It Fails | Response |
|---|---|---|---|
| Liveness | Is the process running and not stuck? | Deadlock, infinite loop, OOM but process not killed | Restart the container/process |
| Readiness | Can it accept and serve traffic right now? | Still warming cache, DB connection down, rate limited | Remove from load balancer (don’t restart) |
| Startup | Has initial boot completed? | Still loading large dataset, running migrations | Don’t check liveness until startup succeeds |
Kubernetes Probe Types
Kubernetes supports four probe mechanisms:
- HTTP GET — Sends an HTTP GET request to a specified path. Any status code ≥ 200 and < 400 means success.
- TCP Socket — Attempts to open a TCP connection to a specified port. If the connection succeeds, the probe passes.
- Exec — Runs an arbitrary command inside the container. Exit code 0 = healthy, non-zero = unhealthy.
- gRPC — (Kubernetes 1.24+) Calls the gRPC health checking protocol on a specified port.
Complete Kubernetes Health Probe Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: order-service
spec:
replicas: 3
selector:
matchLabels:
app: order-service
template:
metadata:
labels:
app: order-service
spec:
containers:
- name: order-service
image: myregistry/order-service:v2.4.1
ports:
- containerPort: 8080
# STARTUP PROBE: wait for initialization
# Allows up to 5 min (30 * 10s) for startup
startupProbe:
httpGet:
path: /healthz/startup
port: 8080
failureThreshold: 30
periodSeconds: 10
# LIVENESS PROBE: detect deadlocks / stuck processes
# Restarts pod if 3 consecutive checks fail
livenessProbe:
httpGet:
path: /healthz/live
port: 8080
initialDelaySeconds: 0 # startup probe handles delay
periodSeconds: 10
timeoutSeconds: 3
failureThreshold: 3
successThreshold: 1
# READINESS PROBE: control traffic routing
# Removes from Service endpoints if unhealthy
readinessProbe:
httpGet:
path: /healthz/ready
port: 8080
initialDelaySeconds: 0
periodSeconds: 5
timeoutSeconds: 3
failureThreshold: 2
successThreshold: 2
Implementing Health Endpoints
A production-grade health endpoint doesn’t just return 200 OK. It checks dependencies and reports detailed status:
// Express.js health endpoint implementation
app.get('/healthz/live', (req, res) => {
// Liveness: only checks if the process is responsive
// Do NOT check external dependencies here
res.status(200).json({ status: 'alive', uptime: process.uptime() });
});
app.get('/healthz/ready', async (req, res) => {
const checks = {
database: await checkDatabase(),
redis: await checkRedis(),
kafka: await checkKafka(),
};
const allHealthy = Object.values(checks).every(c => c.healthy);
const status = allHealthy ? 200 : 503;
res.status(status).json({
status: allHealthy ? 'ready' : 'not_ready',
checks,
timestamp: new Date().toISOString(),
});
});
app.get('/healthz/startup', async (req, res) => {
if (!cacheWarmed || !migrationsComplete) {
return res.status(503).json({ status: 'starting' });
}
res.status(200).json({ status: 'started' });
});
async function checkDatabase() {
try {
const start = Date.now();
await db.query('SELECT 1');
return { healthy: true, latency_ms: Date.now() - start };
} catch (err) {
return { healthy: false, error: err.message };
}
}
async function checkRedis() {
try {
const start = Date.now();
await redis.ping();
return { healthy: true, latency_ms: Date.now() - start };
} catch (err) {
return { healthy: false, error: err.message };
}
}
Health Check Anti-Patterns
- Cascading health failures: Service A’s readiness checks Service B, which checks Service C. If C goes down, A and B both go unready—cascading outage. Solution: only check direct dependencies, and use circuit breakers.
- Expensive health checks: Running a complex query or loading large data in the health endpoint. Solution: cache the result and update it in a background goroutine.
- No timeout on probes: Kubernetes default timeout is 1 second. If your database check takes 5 seconds, the probe always fails. Always set timeoutSeconds appropriately.
- Checking non-critical dependencies: Your /ready endpoint checks the email service. Email is down, so your API stops serving—even though 99% of requests don’t need email. Solution: only check dependencies critical to the majority of traffic.
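A common way to avoid the expensive-health-check trap above is to run the dependency checks on a timer and have the endpoint return the cached result. A minimal sketch, assuming a placeholder check function and refresh period:
# Background-refreshed readiness status (illustrative sketch)
import threading
import time

_status = {"healthy": False, "checked_at": 0.0}

def _expensive_dependency_check():
    # Placeholder: replace with a real, bounded-time dependency probe
    return True

def _refresh_loop(period_s=10):
    while True:
        healthy = _expensive_dependency_check()
        _status.update(healthy=healthy, checked_at=time.time())
        time.sleep(period_s)

threading.Thread(target=_refresh_loop, daemon=True).start()

def ready_handler():
    """The readiness endpoint returns the cached result instantly."""
    code = 200 if _status["healthy"] else 503
    return code, dict(_status)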
Gossip-Based Failure Detection
Centralized heartbeat monitoring has a fundamental problem: the monitor is a single point of failure, and at scale (thousands of nodes), it becomes a bottleneck. Gossip-based protocols solve this by making failure detection a collective, decentralized activity.
How Gossip Works
Each node maintains a membership list—a table of all known nodes and their heartbeat counters. Periodically (e.g., every second), each node:
- Increments its own heartbeat counter.
- Selects a random peer from its membership list.
- Sends its full membership list to that peer.
- Merges the received list with its own, keeping the highest heartbeat counter for each node.
If a node’s heartbeat counter hasn’t been incremented within a timeout period, it is suspected as failed. This is “epidemic-style” dissemination: information spreads like a rumor, reaching all $N$ nodes in $O(\log N)$ gossip rounds.
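The heart of the protocol is the merge step: for every known node, keep the highest heartbeat counter seen so far and remember when it last advanced. A minimal sketch of one gossip round (message transport is abstracted away; the names are illustrative):
# One gossip round: bump own counter, merge a peer's view (sketch)
import random
import time

SELF = "node-a"
FAIL_TIMEOUT_S = 10
# node_id -> {"heartbeat": counter, "last_updated": local wall-clock time}
membership = {SELF: {"heartbeat": 0, "last_updated": time.time()}}

def gossip_round(send, peers):
    """Increment our own counter and push our view to one random peer."""
    membership[SELF]["heartbeat"] += 1
    membership[SELF]["last_updated"] = time.time()
    send(random.choice(peers), membership)

def merge(remote_view):
    """Keep the highest heartbeat counter seen for each node."""
    now = time.time()
    for node, remote in remote_view.items():
        local = membership.get(node)
        if local is None or remote["heartbeat"] > local["heartbeat"]:
            membership[node] = {"heartbeat": remote["heartbeat"],
                                "last_updated": now}

def suspected():
    """Nodes whose counter has not advanced within the timeout."""
    cutoff = time.time() - FAIL_TIMEOUT_S
    return [n for n, entry in membership.items()
            if entry["last_updated"] < cutoff]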
The SWIM Protocol
SWIM: Scalable Weakly-consistent Infection-style Membership
SWIM (Das et al., 2002) is the gold standard for gossip-based failure detection. Used by Consul (HashiCorp), Memberlist (Go library), and Serf. It improves basic gossip in three crucial ways:
Ping
Each protocol period, node $M_i$ selects a random target $M_j$ and sends a ping. If $M_j$ responds with an ack within a timeout, $M_j$ is alive. Done.
Ping-Req
If $M_j$ doesn’t respond to the direct ping, $M_i$ does not immediately suspect $M_j$. Instead, $M_i$ selects $k$ random nodes (typically $k = 3$) and asks them to ping $M_j$ on its behalf (ping-req). If any of the $k$ nodes gets an ack from $M_j$, $M_j$ is alive. This handles the case where only the $M_i \leftrightarrow M_j$ path is broken.
Suspicion Sub-Protocol
If neither the direct ping nor any indirect ping gets an ack, $M_j$ is marked SUSPECT. The suspicion is disseminated via gossip (piggybacked on regular protocol messages). $M_j$ remains in suspect state for a configurable timeout. If $M_j$ sends any message during this window, it refutes the suspicion and returns to ALIVE. If the timeout expires without refutation, $M_j$ is marked CONFIRM (dead) and removed from the membership list.
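Putting the three mechanisms together, one protocol period for node $M_i$ looks roughly like the sketch below. It assumes ping() and ping_req() helpers that return True when an ack arrives within the timeout; this is a simplification, not memberlist's actual implementation.
# One SWIM protocol period for node M_i (illustrative sketch)
import random

K_INDIRECT = 3   # helpers asked to probe the target on our behalf

def protocol_period(target, members, ping, ping_req, suspect):
    """ping/ping_req return True if an ack arrived within the timeout."""
    if ping(target):
        return                              # direct ack: target is alive
    candidates = [m for m in members if m != target]
    helpers = random.sample(candidates, min(K_INDIRECT, len(candidates)))
    if any(ping_req(helper, target) for helper in helpers):
        return                              # some path to the target works
    suspect(target)                         # start the suspicion sub-protocol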
SWIM vs. Traditional Heartbeat
| Property | Central Heartbeat | SWIM |
|---|---|---|
| Message complexity per period | $O(N)$ — every node → monitor | $O(1)$ per node — each node sends one ping + up to $k$ ping-reqs |
| False positive rate | Depends on single path quality | Lower: $k+1$ independent paths must all fail |
| Detection time | $O(1)$ protocol periods (fast for monitor) | $O(1)$ protocol periods per node, $O(\log N)$ for cluster-wide dissemination |
| SPOF | Yes — the monitor | No — fully decentralized |
| Network load | All traffic to one node — fan-in bottleneck | Distributed evenly across all nodes |
Consul (Memberlist/SWIM) Configuration
# /etc/consul.d/consul.hcl
# SWIM protocol configuration
gossip_lan {
# Interval between protocol rounds
gossip_interval = "200ms"
# Number of random nodes to gossip to
gossip_nodes = 3
# How often a random node is probed, and how long to wait
# for a direct ack before falling back to indirect probes
probe_interval = "1s"
probe_timeout = "500ms"
# Suspicion multiplier:
# suspicion_timeout = suspicion_mult * log(N) * probe_interval
suspicion_mult = 4
# How many nodes to ask for indirect probes
indirect_checks = 3
# Number of retransmissions for protocol messages
retransmit_mult = 4
}
# WAN gossip (cross-datacenter, higher latency)
gossip_wan {
gossip_interval = "500ms"
probe_interval = "3s"
probe_timeout = "2s"
suspicion_mult = 6
}
Gossip Protocol Optimizations
- Piggybacking: Instead of sending separate protocol messages, SWIM piggybacks membership updates on top of ping/ack messages. This is why SWIM achieves $O(1)$ message overhead per node per period—no extra messages needed for dissemination.
- Infection-style dissemination: Each update has a retransmission counter that starts at $\lambda \cdot \log N$ (where $\lambda$ is the retransmit multiplier). Each time the update is piggybacked on a message, the counter decrements. When it reaches zero, the update is dropped. This guarantees $O(\log N)$ rounds for convergence while bounding message size.
- Suspicion with incarnation numbers: Each node has an incarnation number that it increments to refute false suspicions. If Node A suspects Node B, B can send a message with an incremented incarnation number to override the suspicion. This prevents the “accusation storm” problem.
- Round-robin target selection: Instead of truly random target selection (which can miss some nodes for many rounds), SWIM uses a round-robin-shuffled list to ensure every node is probed within $N$ protocol periods. This provides a worst-case detection bound.
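The incarnation rule above can be written as a small precedence check: for the same incarnation, SUSPECT outranks ALIVE, and a higher incarnation (which only the node itself can produce) outranks anything older. A minimal sketch with illustrative field names:
# SWIM-style update precedence with incarnation numbers (sketch)
from dataclasses import dataclass

@dataclass
class MemberState:
    status: str        # "ALIVE" or "SUSPECT"
    incarnation: int

RANK = {"ALIVE": 0, "SUSPECT": 1}

def should_apply(current: MemberState, update: MemberState) -> bool:
    """Apply an incoming gossip update only if it outranks the local state."""
    if update.incarnation != current.incarnation:
        return update.incarnation > current.incarnation
    return RANK[update.status] > RANK[current.status]

def refute(self_state: MemberState) -> MemberState:
    """A live node refutes a suspicion by bumping its own incarnation."""
    return MemberState("ALIVE", self_state.incarnation + 1)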
Graceful Degradation
Detecting failure is only half the battle. What you do after detection determines whether the failure cascades into a system-wide outage or is absorbed gracefully. This section covers the four key patterns for handling detected failures.
Circuit Breaker Pattern
Inspired by electrical circuit breakers, this pattern prevents a service from repeatedly calling a failing dependency:
| State | Behavior | Transition |
|---|---|---|
| CLOSED | All requests pass through normally. Failures are counted. | If failure count ≥ threshold within window → OPEN |
| OPEN | All requests immediately fail without calling the dependency. Returns fallback response or error. | After timeout period → HALF-OPEN |
| HALF-OPEN | Allow one (or a few) test requests through to the dependency. | If test succeeds → CLOSED. If test fails → OPEN (reset timer). |
# Python circuit breaker implementation
import time
from enum import Enum
from threading import Lock
class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""
    pass
class CircuitBreaker:
def __init__(self, failure_threshold=5, recovery_timeout=30,
half_open_max_calls=3):
self.failure_threshold = failure_threshold
self.recovery_timeout = recovery_timeout
self.half_open_max_calls = half_open_max_calls
self.state = State.CLOSED
self.failure_count = 0
self.last_failure_time = None
self.half_open_calls = 0
self.lock = Lock()
def call(self, func, *args, fallback=None, **kwargs):
with self.lock:
if self.state == State.OPEN:
if self._should_attempt_reset():
self.state = State.HALF_OPEN
self.half_open_calls = 0
else:
if fallback:
return fallback()
raise CircuitOpenError(
f"Circuit open since {self.last_failure_time}")
if self.state == State.HALF_OPEN:
if self.half_open_calls >= self.half_open_max_calls:
if fallback:
return fallback()
raise CircuitOpenError("Half-open call limit reached")
self.half_open_calls += 1
try:
result = func(*args, **kwargs)
self._on_success()
return result
except Exception as e:
self._on_failure()
if fallback:
return fallback()
raise
def _on_success(self):
with self.lock:
self.failure_count = 0
self.state = State.CLOSED
    def _on_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = time.time()
            # A failed probe while HALF_OPEN re-opens the circuit immediately
            if (self.state == State.HALF_OPEN
                    or self.failure_count >= self.failure_threshold):
                self.state = State.OPEN
def _should_attempt_reset(self):
return (time.time() - self.last_failure_time
>= self.recovery_timeout)
Retry with Exponential Backoff
When a call fails, retrying immediately is usually worse than not retrying at all—it adds load to an already-struggling system. Exponential backoff with jitter spreads retries over time:
import random
import time
def retry_with_backoff(func, max_retries=5, base_delay=1.0,
max_delay=60.0, jitter=True):
"""
Retry with exponential backoff and full jitter.
Delay = random(0, min(max_delay, base_delay * 2^attempt))
"""
for attempt in range(max_retries + 1):
try:
return func()
except Exception as e:
if attempt == max_retries:
raise # exhausted retries
delay = min(max_delay, base_delay * (2 ** attempt))
if jitter:
# Full jitter: uniform random between 0 and delay
delay = random.uniform(0, delay)
print(f"Attempt {attempt+1} failed: {e}. "
f"Retrying in {delay:.2f}s...")
time.sleep(delay)
Bulkhead Pattern
Named after the watertight compartments in a ship: if one compartment floods, the others stay dry. In software, the bulkhead pattern isolates failures by partitioning resources:
- Thread pool isolation: Each downstream dependency gets its own thread pool. If the payment service is slow and exhausts its 20 threads, the inventory service still has its own 20 threads and keeps working.
- Connection pool isolation: Separate database connection pools for critical (checkout) and non-critical (analytics) queries. If analytics runs a slow query and exhausts its pool, checkout still works.
- Process isolation: Run different services in separate containers/processes. A memory leak in one service can’t crash another.
- Sidecar isolation: In service meshes, the proxy sidecar handles retries, circuit breaking, and rate limiting outside the application process.
# Java-style bulkhead with semaphore (conceptual Python)
import requests
from threading import Semaphore
from functools import wraps

class BulkheadFullError(Exception):
    """Raised when the bulkhead has no free capacity."""
    pass
def bulkhead(max_concurrent, name="default"):
"""Limit concurrent calls to a dependency."""
sem = Semaphore(max_concurrent)
def decorator(func):
@wraps(func)
def wrapper(*args, **kwargs):
acquired = sem.acquire(timeout=5)
if not acquired:
raise BulkheadFullError(
f"Bulkhead '{name}' full: "
f"{max_concurrent} concurrent calls")
try:
return func(*args, **kwargs)
finally:
sem.release()
return wrapper
return decorator
@bulkhead(max_concurrent=20, name="payment-service")
def call_payment_service(order_id, amount):
return requests.post(
"https://payment.internal/charge",
json={"order_id": order_id, "amount": amount},
timeout=5
)
Fallback Strategies
| Strategy | Description | Example |
|---|---|---|
| Cached response | Return the last known good response from cache | Product catalog returns stale (5-min old) prices when price service is down |
| Default value | Return a safe, hardcoded default | Recommendation service down → show “Popular Products” instead |
| Graceful feature removal | Disable the failing feature, keep rest working | Review service down → hide reviews section, product page still works |
| Queue for later | Write the request to a queue and process when healthy | Email service down → enqueue confirmation email for retry |
| Alternate dependency | Route to a backup service or different region | Primary payment processor down → switch to secondary |
Split Brain
Split brain is the most dangerous failure mode in distributed systems. It occurs when a network partition divides a cluster into two (or more) groups, each believing the other is dead. Both sides continue operating independently, potentially accepting conflicting writes, electing competing leaders, and corrupting data.
How Split Brain Happens
Consider a 5-node database cluster with Node 1 as the leader:
- A network switch failure partitions the cluster: Nodes {1, 2} on one side, Nodes {3, 4, 5} on the other.
- Nodes {3, 4, 5} stop receiving heartbeats from Node 1. Their failure detector marks Node 1 as dead.
- Nodes {3, 4, 5} elect Node 3 as the new leader and start accepting writes.
- Meanwhile, Nodes {1, 2} are still running. Node 1 thinks Nodes 3, 4, 5 are dead but continues serving as leader.
- Now you have two leaders accepting writes simultaneously. Data diverges. When the partition heals, you have conflicting state that may be impossible to reconcile.
Prevention Strategies
1. Quorum-Based Decisions
The most fundamental defense: require a majority (quorum) to make any decision. In a cluster of $N$ nodes, a quorum is $\lfloor N/2 \rfloor + 1$ nodes.
- With 5 nodes, quorum = 3. If the partition splits {2} vs {3}, only the side with 3 nodes can elect a leader.
- With 3 nodes, quorum = 2. Tolerates 1 node failure.
- Why odd numbers: With 4 nodes, quorum = 3. A 2-2 split means neither side has quorum—the whole system stops. With 5 nodes, a 2-3 split has one side with quorum. This is why consensus systems (etcd, ZooKeeper, Raft) always use odd numbers of nodes.
# etcd cluster configuration requiring quorum
# etcd uses Raft consensus - automatically requires majority
# Cluster of 5 nodes (tolerates 2 failures)
ETCD_INITIAL_CLUSTER="
etcd1=https://10.0.1.1:2380,
etcd2=https://10.0.1.2:2380,
etcd3=https://10.0.1.3:2380,
etcd4=https://10.0.1.4:2380,
etcd5=https://10.0.1.5:2380"
# Election timeout: how long a follower waits for heartbeat
# before starting election
ETCD_ELECTION_TIMEOUT=5000 # 5 seconds
# Heartbeat interval: leader sends heartbeat every N ms
ETCD_HEARTBEAT_INTERVAL=500 # 500ms
# A leader needs acks from at least 3/5 nodes for any write
# A candidate needs votes from at least 3/5 nodes to win election
2. STONITH (Shoot The Other Node In The Head)
When a node detects a potential split-brain, it forcibly kills the other node before taking over. This is common in traditional HA database setups:
- Power fencing: Use IPMI (Intelligent Platform Management Interface) or iLO/iDRAC to remotely power off the other node’s hardware.
- SBD (Storage-Based Death): Both nodes monitor a shared storage device (SAN, shared disk). A node that can’t write to the SBD device within a timeout kills itself (self-fencing).
- Cloud fencing: Use the cloud provider API to stop the other node’s instance (e.g., AWS ec2 stop-instances).
Pacemaker STONITH Configuration (Linux HA)
# Configure IPMI-based fencing for node2
# ipaddr: IPMI address of the target's management controller
# pcmk_host_list: node this fence device can kill
# pcmk_reboot_action="off": power off, don't reboot
# power_timeout=20: wait up to 20s for power off
pcs stonith create ipmi-fence fence_ipmilan \
  ipaddr="192.168.1.100" \
  login="admin" \
  passwd="secret" \
  pcmk_host_list="node2" \
  pcmk_reboot_action="off" \
  power_timeout=20 \
  lanplus=1
# Both nodes must have fencing configured for each other
# If fencing fails, the node that can't fence should
# self-fence (suicide) to prevent split brain
pcs property set stonith-enabled=true
pcs property set no-quorum-policy=suicide
3. Lease-Based Leadership
The leader must periodically renew a lease (a time-limited lock). If it can’t renew the lease (because it’s partitioned from the lease store), it must step down immediately:
# Lease-based leader election pseudocode
import time

class LeaderElection:
    def __init__(self, node_id, lease_store, lease_duration=15,
                 renew_interval=5):
        self.node_id = node_id
        self.store = lease_store        # e.g., etcd, ZooKeeper, DynamoDB
        self.duration = lease_duration  # seconds
        self.renew_interval = renew_interval
        self.is_leader = False
def try_acquire(self):
"""Try to become leader by acquiring the lease."""
success = self.store.create_if_not_exists(
key="/leader",
value=self.node_id,
ttl=self.duration
)
if success:
self.is_leader = True
self._start_renewal_loop()
return success
def _start_renewal_loop(self):
while self.is_leader:
time.sleep(self.renew_interval)
try:
self.store.refresh(key="/leader",
ttl=self.duration)
except (NetworkError, TimeoutError):
# Can't reach lease store — MUST step down
# Another node may acquire the lease
self.is_leader = False
self._stop_accepting_writes()
break
4. Witness / Tiebreaker Node
Deploy a lightweight “witness” node in a third location. It doesn’t store data, but it votes in elections. In a 2-datacenter setup with 2 nodes each, the witness in a third location creates a 5-node quorum. This is used by:
- SQL Server Always On: Azure Cloud Witness or file share witness.
- CockroachDB: Locality-aware placement with odd-numbered node counts.
- Redis Sentinel: An odd number of sentinels (3 or 5) monitoring the primary.
Redis Sentinel Configuration (sentinel.conf)
# Three sentinels monitoring a Redis primary
# Sentinel 1: 10.0.1.1:26379
# Sentinel 2: 10.0.2.1:26379
# Sentinel 3: 10.0.3.1:26379 (witness in third AZ)
# Name of the monitored master
# Last parameter: quorum (2 of 3 sentinels must agree)
sentinel monitor mymaster 10.0.1.100 6379 2
# How often sentinels send PING to the master
sentinel down-after-milliseconds mymaster 5000
# After declaring master down, wait this long before
# attempting failover (handles transient partitions)
sentinel failover-timeout mymaster 30000
# How many replicas can be reconfigured simultaneously
sentinel parallel-syncs mymaster 1
# Password used to authenticate with the monitored master and replicas
sentinel auth-pass mymaster <password>
Split Brain Recovery
Despite all prevention measures, split brain can still happen (bugs, misconfigurations, Byzantine failures). When it does, you need a recovery strategy:
- Last-writer-wins (LWW): Use timestamps or logical clocks to pick one version. Simple but loses data.
- Conflict-free Replicated Data Types (CRDTs): Design data structures that can be merged without conflict (counters, sets, registers). Used by Redis Enterprise, Riak.
- Manual resolution: Flag conflicting records and let operators or application logic resolve them. Amazon’s shopping cart (Dynamo paper) famously took this approach.
- Application-level merge: Define custom merge functions per data type. Two versions of a user profile? Merge the fields individually, take the latest email, union the address list.
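As a concrete illustration of the first two strategies, here is a minimal sketch of last-writer-wins and of a grow-only counter (G-Counter), the simplest CRDT. The record layout and node names are illustrative assumptions:
# Divergence-resolution sketches: LWW register and G-Counter merge
def lww_merge(a, b):
    """Last-writer-wins: keep the version with the later timestamp.
    Simple, but silently discards the losing side's write."""
    return a if a["timestamp"] >= b["timestamp"] else b

def gcounter_merge(a, b):
    """G-Counter: per-node counts, merged by taking the max per node.
    The counter's value is the sum over all nodes; no increment is lost."""
    return {node: max(a.get(node, 0), b.get(node, 0))
            for node in set(a) | set(b)}

# Each side of the partition incremented independently during the split
side_a = {"node-1": 5, "node-2": 2}
side_b = {"node-1": 3, "node-3": 4}
merged = gcounter_merge(side_a, side_b)   # {"node-1": 5, "node-2": 2, "node-3": 4}
total = sum(merged.values())              # 11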
Summary & Decision Guide
| Scenario | Recommended Approach | Why |
|---|---|---|
| Small cluster (< 10 nodes) | Simple heartbeat with fixed timeout | Low complexity, easy to debug |
| Variable network latency | Phi accrual failure detector | Adapts to changing conditions automatically |
| Large cluster (100+ nodes) | SWIM / gossip-based protocol | $O(1)$ per-node overhead, no SPOF |
| Kubernetes workloads | Separate liveness, readiness, startup probes | Kubernetes-native, granular control |
| Cross-datacenter replication | Phi accrual + quorum-based consensus | Handles high-latency links, prevents split brain |
| Leader election | Lease-based + STONITH/fencing | Guarantees at most one leader at any time |
| Calling external services | Circuit breaker + retry + bulkhead | Prevents cascading failures |