Leader Election
In a distributed system, many tasks require exactly one node to take charge: writing to a shared log, scheduling jobs, rebalancing partitions, or coordinating a two-phase commit. The mechanism that selects that one node from a group of peers is called leader election. Get it wrong and you get split brain, data corruption, or total cluster paralysis. This post surveys the major algorithms and tools — from classical approaches (Bully, Ring) to production-grade systems (Raft, ZooKeeper, etcd) — along with the safety mechanisms (fencing tokens, leases) that make leader election reliable in the real world.
1 · Why Leader Election?
Many distributed problems become dramatically simpler when exactly one node acts as the leader (also called the master, primary, or coordinator). The leader centralizes decision-making so that the rest of the cluster can follow a single, authoritative source of truth.
Single Writer / Log Ordering
A replicated state machine requires a totally ordered log of commands. Without a single writer, concurrent writes create conflicting orderings. The leader serializes all writes, assigns monotonically increasing sequence numbers, and replicates the log to followers. This is the foundation of Raft, ZAB (ZooKeeper), and Multi-Paxos.
Coordination & Mutual Exclusion
Tasks like database schema migrations, shard rebalancing, or snapshot creation must be performed by exactly one node. A leader can act as a coordinator, issuing commands to participants without risking concurrent conflicting operations. This is far simpler than distributed locking across all nodes.
Task Scheduling
In systems like Apache Kafka or Kubernetes, a controller node assigns partitions to brokers, schedules pods to nodes, or triggers failovers. Without a single scheduler, assignment conflicts cause duplicate work, missed tasks, or oscillating state. The leader acts as the authoritative scheduler.
Consistency Guarantees
If every node independently accepts writes, you need a conflict-resolution strategy (CRDTs, last-write-wins, etc.). A single leader eliminates write conflicts entirely — every write is ordered through one node, giving you linearizability by construction.
2 · The Bully Algorithm
The Bully Algorithm, proposed by Hector Garcia-Molina in 1982, is one of the simplest and most intuitive leader election algorithms. The idea: the node with the highest ID always wins. If a node suspects the leader is dead, it "bullies" lower-ID nodes into submission.
How It Works
- Every node has a unique numeric ID (e.g., 1 through N).
- When a node detects the leader has failed (timeout on heartbeat), it initiates an election.
- The initiator sends an `ELECTION` message to all nodes with higher IDs.
- If any higher-ID node responds with `OK` (meaning "I'll take it from here"), the initiator backs off.
- Each responding node then starts its own election by sending `ELECTION` to nodes with even higher IDs.
- The process repeats until a node sends `ELECTION` and receives no response — that node is the highest alive and declares itself leader.
- The new leader broadcasts a `COORDINATOR` message to all nodes.
Pseudocode
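A compact Go sketch of the election logic. The message-transport functions are hypothetical stubs, and failure detection (the heartbeat timeout that triggers the election) is assumed to happen elsewhere:

```go
// Sketch of the Bully election logic; transport is abstracted
// behind sendElection/broadcastCoordinator (hypothetical stubs).
package bully

type Node struct {
	ID    int
	Peers []int // IDs of all other nodes
}

// sendElection sends ELECTION to a peer and reports whether it
// answered OK before a timeout. A dead peer never answers.
func sendElection(peer int) (ok bool) { /* network call */ return false }

// broadcastCoordinator announces this node as the new leader.
func broadcastCoordinator(self int) { /* network call */ }

// StartElection runs one round of the Bully algorithm from this node.
func (n *Node) StartElection() {
	heardOK := false
	for _, p := range n.Peers {
		if p > n.ID && sendElection(p) {
			// A higher-ID node is alive; it takes over the election.
			heardOK = true
		}
	}
	if !heardOK {
		// No higher-ID node responded: we are the highest alive node.
		broadcastCoordinator(n.ID)
	}
	// Otherwise: wait for a COORDINATOR message; if none arrives
	// within a timeout, restart the election.
}
```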
Complexity Analysis
| Metric | Best Case | Worst Case |
|---|---|---|
| Messages | O(n) — highest node detects failure, sends n−1 COORDINATOR messages | O(n²) — lowest node detects failure, cascade of ELECTION + OK messages |
| Rounds | 1 round | O(n) rounds |
| Time | O(1) message delays | O(n) message delays |
Trade-offs
✅ Advantages
- Simple to understand and implement
- Deterministic outcome (highest alive ID always wins)
- Fast in the best case (1 round)
❌ Disadvantages
- O(n²) messages in worst case
- Highest-ID node is always elected — no load balancing
- Vulnerable to a "flapping" high-ID node that keeps crashing and restarting
- No partition tolerance — relies on reliable failure detection
3 · The Ring Algorithm
The Ring Algorithm (often attributed to Chang and Roberts, 1979) arranges nodes in a logical ring. When a node detects a leader failure, it passes an election token around the ring. The token collects node IDs, and the highest ID wins.
How It Works
- Nodes are arranged in a logical ring — each node knows its successor.
- When a node detects leader failure, it creates an `ELECTION` message containing its own ID and forwards it to its successor.
- Each node that receives the message appends its own ID (if alive) and forwards it.
- When the message returns to the initiator (it has traversed the full ring), the node with the highest ID in the list is declared leader.
- A `COORDINATOR` message is sent around the ring to inform everyone.
Pseudocode
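A matching Go sketch with the same kind of hypothetical transport stubs. Each node knows only its successor, and the token accumulates IDs until it returns to the initiator:

```go
// Sketch of the ring election; forward/announceCoordinator are
// hypothetical stubs that pass messages to a node's successor.
package ring

type Election struct {
	Initiator int
	IDs       []int // node IDs collected so far around the ring
}

func forward(successor int, msg Election) { /* network call */ }

func announceCoordinator(successor, leader int) { /* network call */ }

// StartElection is called by the node that detects leader failure.
func StartElection(self, succ int) {
	forward(succ, Election{Initiator: self, IDs: []int{self}})
}

// OnElection handles an ELECTION message arriving at node `self`,
// whose ring successor is `succ`.
func OnElection(self, succ int, msg Election) {
	if msg.Initiator == self {
		// Token has gone full circle: pick the highest collected ID.
		leader := msg.IDs[0]
		for _, id := range msg.IDs {
			if id > leader {
				leader = id
			}
		}
		announceCoordinator(succ, leader) // COORDINATOR goes around the ring
		return
	}
	msg.IDs = append(msg.IDs, self) // add ourselves and pass it on
	forward(succ, msg)
}
```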
Complexity
The ring algorithm always sends exactly 2n messages (n for the election round, n for the coordinator announcement), regardless of which node initiates. This is better than the Bully algorithm's worst case, but slower because the message must traverse the entire ring.
| Algorithm | Best Case Messages | Worst Case Messages | Partition Tolerant? |
|---|---|---|---|
| Bully | O(n) | O(n²) | No |
| Ring (Chang-Roberts) | O(n) | O(n) | No (ring breaks) |
| Raft | O(n) | O(n) | Yes (majority quorum) |
4 · Raft-Based Leader Election
Raft (Ongaro & Ousterhout, 2014) is the industry standard for consensus-based leader election. Unlike Bully or Ring, Raft is partition-tolerant and guarantees safety: at most one leader per term, and a leader must be accepted by a majority of nodes.
Terms & States
Time is divided into terms — monotonically increasing integers. Each term has at most one leader. Every node is in one of three states:
- Follower: Passive; responds to RPCs from the leader and candidates.
- Candidate: Actively seeking votes to become the new leader.
- Leader: Handles all client requests and replicates the log.
Randomized Election Timeouts
Each follower has an election timeout — a random duration, typically between 150 ms and 300 ms. If the follower does not hear from the leader (via heartbeats) before its timeout fires, it assumes the leader has failed and becomes a candidate. Because the timeouts are randomized, one follower usually times out before the others, which avoids most split votes.
The Election Process
- Timeout fires: A follower increments its term to `T+1`, transitions to candidate, and votes for itself.
- RequestVote RPC: The candidate sends `RequestVote(term=T+1, candidateId, lastLogIndex, lastLogTerm)` to all other nodes.
- Vote granting: A node grants its vote if (a) it has not already voted in term `T+1`, and (b) the candidate's log is at least as up-to-date as the voter's log.
- Majority wins: If the candidate receives votes from a majority of the cluster (⌊n/2⌋ + 1), it becomes leader for term `T+1`.
- Heartbeats: The new leader immediately sends `AppendEntries` heartbeats to all followers to assert authority and prevent new elections.
Pseudocode
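A Go sketch of the candidate side. The types and RPC stub are hypothetical, and the persistence of `currentTerm`/`votedFor` that real Raft requires is omitted:

```go
// Sketch of Raft's candidate logic; RPC transport and persistence
// are abstracted away (hypothetical types).
package raft

type Entry struct{ Term int }

type Node struct {
	id          int
	peers       []int // all other nodes; cluster size is len(peers)+1
	currentTerm int
	votedFor    int
	log         []Entry
}

// requestVote sends a RequestVote RPC and reports whether the vote
// was granted before the RPC timed out.
func requestVote(peer, term, candidateID, lastLogIndex, lastLogTerm int) bool {
	/* network call */
	return false
}

// startElection runs when this node's randomized election timeout
// fires without hearing an AppendEntries heartbeat from a leader.
func (n *Node) startElection() {
	n.currentTerm++   // move to term T+1
	n.votedFor = n.id // vote for ourselves
	votes := 1

	lastIndex := len(n.log) - 1
	lastTerm := 0
	if lastIndex >= 0 {
		lastTerm = n.log[lastIndex].Term
	}
	for _, p := range n.peers {
		if requestVote(p, n.currentTerm, n.id, lastIndex, lastTerm) {
			votes++
		}
	}

	clusterSize := len(n.peers) + 1
	if votes >= clusterSize/2+1 { // strict majority: ⌊n/2⌋+1
		n.becomeLeader() // immediately send AppendEntries heartbeats
	}
	// Otherwise: stay candidate; a fresh randomized timeout triggers
	// a new election with an incremented term (handles split votes).
}

func (n *Node) becomeLeader() { /* send heartbeats */ }
```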
Split Votes
If two candidates start elections simultaneously and split the vote so neither gets a majority, neither wins. After another random timeout, one of them starts a new election with an incremented term. The randomized timeouts make consecutive split votes statistically unlikely — Raft converges in 1–2 rounds in practice.
5 · ZooKeeper Ephemeral Sequential Nodes
Apache ZooKeeper provides a powerful primitive for leader election: ephemeral sequential znodes. This approach is elegant, battle-tested, and avoids the herd effect (thundering herd problem).
The Mechanism
- Each candidate creates an ephemeral sequential znode under a known path: `/election/node-`. ZooKeeper appends a monotonically increasing sequence number: `/election/node-0000000001`, `/election/node-0000000002`, etc.
- The candidate with the lowest sequence number is the leader.
- Each non-leader node sets a watch on the znode immediately before it in the sequence (not on the leader — this prevents the thundering herd).
- When the leader crashes, its ephemeral znode is automatically deleted by ZooKeeper (the session is gone).
- The watch on the deleted znode fires, notifying the next node in sequence, which becomes the new leader.
ZooKeeper Election Code (Java)
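A minimal sketch against the standard `org.apache.zookeeper` client. Error handling, session expiry, and re-volunteering are omitted, and the `/election` parent znode is assumed to already exist:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;

public class LeaderElection implements Watcher {
    private final ZooKeeper zk;
    private String myNode; // e.g. "node-0000000007"

    public LeaderElection(ZooKeeper zk) { this.zk = zk; }

    public void volunteer() throws KeeperException, InterruptedException {
        // Ephemeral + sequential: deleted when our session dies, and
        // suffixed with a monotonically increasing sequence number.
        String path = zk.create("/election/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring("/election/".length());
        checkLeadership();
    }

    private void checkLeadership() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        if (myNode.equals(children.get(0))) {
            onElectedLeader(); // lowest sequence number: we lead
            return;
        }
        // Watch only our immediate predecessor, not the leader,
        // to avoid the thundering herd.
        String predecessor = children.get(children.indexOf(myNode) - 1);
        if (zk.exists("/election/" + predecessor, this) == null) {
            checkLeadership(); // predecessor vanished between calls; re-check
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { checkLeadership(); } catch (Exception e) { /* re-volunteer */ }
        }
    }

    private void onElectedLeader() { /* start leading */ }
}
```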
Avoiding the Thundering Herd
A naive approach would have all non-leader nodes watch the leader znode. When the leader dies, all N−1 nodes receive the notification simultaneously and race to become leader. This creates O(n) simultaneous getChildren calls — the thundering herd problem.
The sequential-predecessor watch pattern ensures that when the leader dies, only one node (the second in sequence) is notified. It checks if it's now the smallest and becomes leader. O(1) notifications, O(1) ZooKeeper operations.
6 · etcd Leader Election with Leases
etcd, the distributed key-value store powering Kubernetes, provides leader election via leases and its concurrency package. A lease is a time-bound lock on a key — if the leader fails to renew the lease within the TTL, it automatically expires.
How etcd Election Works
- A candidate creates a lease with a TTL (e.g., 15 seconds) and keeps it alive with periodic `KeepAlive` RPCs.
- The candidate attempts to become leader by calling `Campaign()`, which creates a key under an election prefix with the lease attached.
- etcd uses its Raft-based consensus internally — the first key created in the election prefix wins.
- If the leader crashes or fails to renew the lease, the key is automatically deleted after TTL seconds.
- The next waiting candidate (which was blocked on `Campaign()`) is promoted to leader.
etcd Election Code (Go)
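A sketch using `go.etcd.io/etcd/client/v3` and its `concurrency` package. The endpoint, election prefix, and node name are placeholders:

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"localhost:2379"}, // placeholder endpoint
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Session creates a lease (15s TTL) and keeps it alive in the background.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Campaign blocks until we own the lowest key under the election
	// prefix — i.e., until we are elected leader.
	election := concurrency.NewElection(session, "/leader-election/")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected leader; doing leader work")

	// If this process dies or is partitioned, the lease expires and the
	// next candidate blocked in Campaign() takes over automatically.
	<-session.Done() // lease lost: stop acting as leader
	log.Println("lost leadership")
}
```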
7 · Split Brain Prevention
Split brain is the most dangerous failure mode in leader election: two nodes both believe they are the leader and independently accept writes. This leads to divergent state, data corruption, and conflicting decisions.
How Split Brain Happens
- Network partition: The leader is isolated. Followers in the other partition elect a new leader. The old leader doesn't know it's been replaced.
- GC pause / process freeze: The leader pauses for 30 seconds during a GC. Followers time out and elect a new leader. The old leader wakes up, still thinking it's the leader.
- Asymmetric partition: The leader can send to followers but can't receive responses. It keeps acting as leader while followers elect a new one.
Prevention Strategies
| Strategy | Mechanism | Used By |
|---|---|---|
| Majority quorum | Leader must replicate to ⌊n/2⌋+1 nodes before committing. An isolated leader cannot get a quorum. | Raft, Paxos, ZAB |
| Lease expiry | Leader holds a time-bound lease. Must renew before TTL. If partition prevents renewal, leadership automatically expires. | etcd, Chubby, DynamoDB |
| Fencing tokens | Every new leader gets a monotonically increasing token. Resource servers reject requests with stale tokens. | HBase, custom systems |
| STONITH | "Shoot The Other Node In The Head" — physically power-off the old leader via IPMI/BMC before the new leader starts. | Pacemaker, HA clusters |
| Epoch numbers | Similar to Raft terms. Every election increments the epoch. Followers reject commands from stale epochs. | Kafka (KRaft), ZooKeeper |
8 · Fencing Tokens for Safety
Even with quorum-based election, there is a window where an old leader might still issue writes. Fencing tokens close this window by giving each leader a unique, monotonically increasing token that resource servers validate.
How Fencing Works
- Each time a new leader is elected, the election system issues a fencing token (an integer that strictly increases with each election — this is the Raft "term" or ZooKeeper "zxid").
- The leader includes its fencing token in every request to storage/resources.
- The storage server tracks the highest token it has seen.
- If a request arrives with a token lower than the highest seen, it is rejected — it came from a stale leader.
Example Scenario
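Replaying the GC-pause scenario from Section 7 with tokens (the token numbers here are illustrative): leader A is elected with token 33, then freezes for 30 seconds. The cluster elects leader B with token 34, and B writes to storage, so the storage server now records 34 as the highest token seen. When A wakes up and issues a write carrying token 33, the server compares 33 < 34 and rejects the request. Without the token, A's stale write would have been accepted silently.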
Fencing Token Validation
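A minimal storage-side sketch in Go (hypothetical types; in a real system the highest-seen token must be persisted, and the check must be atomic with the write):

```go
package fencing

import (
	"errors"
	"sync"
)

var ErrStaleLeader = errors.New("fencing: token older than highest seen")

// Store guards a resource with fencing-token validation.
type Store struct {
	mu           sync.Mutex
	highestToken int64
	data         map[string][]byte
}

func NewStore() *Store {
	return &Store{data: make(map[string][]byte)}
}

// Write applies the mutation only if the token is at least as new as
// the highest token seen so far; stale leaders are rejected.
func (s *Store) Write(token int64, key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highestToken {
		return ErrStaleLeader // request from a deposed leader
	}
	s.highestToken = token // remember the newest leader we've heard from
	s.data[key] = value
	return nil
}
```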
9 · Lease-Based Leadership
A lease is a time-bounded grant of leadership. The leader must periodically renew the lease before it expires. If the leader fails to renew (crash, partition, GC pause), the lease expires and other candidates can acquire it.
How Leases Work
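The cycle is: acquire the lease atomically, do leader work, renew well before the TTL expires, and stand down the moment a renewal fails. Below is a minimal Go sketch over a hypothetical key-value interface offering atomic acquire-if-absent and renew-if-owner operations (the kind of primitive Redis `SET NX PX` or an etcd lease provides):

```go
package lease

import "time"

// KV is a hypothetical key-value interface assumed to offer atomic,
// TTL-bound operations.
type KV interface {
	// AcquireIfAbsent sets key=owner with a TTL only if the key is
	// absent; returns true if we now hold the lease.
	AcquireIfAbsent(key, owner string, ttl time.Duration) bool
	// Renew extends the TTL only if owner still holds the key.
	Renew(key, owner string, ttl time.Duration) bool
}

// RunLeader loops forever: acquire the lease, lead while renewals
// succeed, and stop leading the moment a renewal fails.
func RunLeader(kv KV, owner string, ttl time.Duration, lead func(stop <-chan struct{})) {
	const key = "leader-lease" // placeholder lease key
	for {
		if !kv.AcquireIfAbsent(key, owner, ttl) {
			time.Sleep(ttl / 2) // someone else leads; retry later
			continue
		}
		stop := make(chan struct{})
		go lead(stop)
		// Renew at TTL/3 so a couple of failed renewals still leave
		// time before the lease actually expires.
		for kv.Renew(key, owner, ttl) {
			time.Sleep(ttl / 3)
		}
		close(stop) // renewal failed: relinquish leadership immediately
	}
}
```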
The Clock Skew Problem
Leases depend on time, and clocks across machines can skew. If the leader's clock is slow and the follower's clock is fast, the follower might think the lease has expired while the leader thinks it's still valid. Solutions:
- Conservative renewal: Renew at TTL/3 so there's a 2/3 × TTL buffer.
- Bound clock skew: Use NTP and assume a maximum skew ε. The effective lease duration becomes TTL − ε (see the helper sketch after this list).
- Combine with fencing tokens: Even if two leaders overlap briefly, the fencing token prevents stale writes.
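As a concrete illustration of that arithmetic (a hypothetical helper, with example values): with TTL = 15 s and an assumed skew bound ε = 500 ms, the leader should treat its lease as valid only until acquiredAt + 14.5 s, and renew every 5 s.

```go
package lease

import "time"

// SafeUntil returns the latest instant a leader holding a lease may
// safely act: lease start plus TTL, minus an assumed clock-skew bound.
func SafeUntil(acquiredAt time.Time, ttl, maxSkew time.Duration) time.Time {
	return acquiredAt.Add(ttl - maxSkew)
}

// RenewInterval follows the conservative TTL/3 rule, leaving room for
// two missed renewals before the lease actually expires.
func RenewInterval(ttl time.Duration) time.Duration {
	return ttl / 3
}
```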
Leases vs Heartbeats
| Aspect | Heartbeat | Lease |
|---|---|---|
| Direction | Leader → followers | Leader → lease server |
| Failure detection | Followers detect missing heartbeats | Lease server detects expired TTL |
| Time dependency | Relative timeouts | Absolute TTL (clock-dependent) |
| Safety | Requires additional mechanism (terms, fencing) | Built-in expiry prevents stale leadership |
| Typical use | Raft, ZAB | etcd, Chubby, DynamoDB |
10 · When to Avoid Leader Election
Leader election concentrates decision-making on one node, which makes that node a bottleneck and its failover a brief availability gap. Not every distributed system needs a leader; consider the leaderless alternatives below.
When Leaderless Works Better
- High write throughput: A single leader serializes all writes — it becomes the bottleneck. Leaderless systems (Cassandra, DynamoDB, Riak) allow any node to accept writes, distributing load evenly.
- Availability over consistency: If your application tolerates eventual consistency, a leader adds unnecessary complexity and latency (every write must go through one node).
- Geo-distributed deployments: With data centers on different continents, routing all writes to one leader introduces cross-continent latency. Multi-leader or leaderless systems accept writes locally.
- CRDTs and conflict-free data: If your data model uses Conflict-Free Replicated Data Types (counters, sets, registers), concurrent writes merge automatically without a leader.
Leader vs Leaderless Comparison
| Aspect | Leader-Based | Leaderless |
|---|---|---|
| Consistency | Strong (linearizable) | Eventual (tunable) |
| Write throughput | Limited by leader | Scales with nodes |
| Read throughput | Can scale with followers | Scales with nodes |
| Availability | Leader failure → brief downtime | Any node serves requests |
| Complexity | Election, failover, fencing | Conflict resolution, read repair |
| Examples | etcd, ZooKeeper, Kafka (KRaft) | Cassandra, DynamoDB, Riak |
11 · Real-World Leader Election
Kafka Controller Election
In Apache Kafka, the controller is a special broker that manages partition leadership, replica assignment, and topic creation. There must be exactly one controller in the cluster.
Pre-KRaft (ZooKeeper-based)
- Brokers race to create an ephemeral znode `/controller` in ZooKeeper.
- The first one to create it wins (ZooKeeper's atomic `create` ensures exclusivity).
- Other brokers set a watch on `/controller`.
- When the controller crashes, the ephemeral znode is deleted, watches fire, and brokers race again.
- Problem: thundering herd — all N brokers rush to ZooKeeper simultaneously.
KRaft (Raft-based, Kafka 3.x+)
- Kafka runs its own Raft-based consensus (KRaft) among a set of controller quorum nodes.
- These nodes elect a leader using Raft's standard term + majority vote mechanism.
- The elected leader maintains the metadata log (partition assignments, ISR lists, configs).
- Benefits: no external dependency (ZooKeeper removed), fewer moving parts, faster failover.
Elasticsearch Master Election
Elasticsearch uses a master node for cluster state management: index creation/deletion, shard allocation, and node membership. The election uses the Zen Discovery protocol (pre-7.x) or a custom Raft-like protocol (7.x+).
- Master-eligible nodes: Configured with `node.roles: [master]`. Only these participate in elections.
- Quorum: `discovery.seed_hosts` lists known master-eligible nodes. The cluster needs a majority (⌊n/2⌋+1) of master-eligible nodes to elect a master.
- Cluster state: The elected master publishes cluster state updates. Nodes that don't acknowledge are removed.
- Minimum master nodes: In pre-7.x, `discovery.zen.minimum_master_nodes` had to be set manually to avoid split brain. 7.x+ automatically manages the quorum.
More Real-World Examples
| System | Election Mechanism | Notes |
|---|---|---|
| Kubernetes | etcd lease-based (via client-go/tools/leaderelection) | kube-controller-manager and kube-scheduler use leader election for HA |
| Redis Sentinel | Raft-like majority vote among Sentinels | Sentinels vote on which Sentinel performs the failover |
| MySQL Group Replication | Paxos-based consensus | Single-primary mode elects one writable node |
| CockroachDB | Per-range Raft | Each range (shard) has its own Raft group and elected leader |
| HDFS | ZooKeeper ZKFC (NameNode HA) | ZooKeeper Failover Controller elects active NameNode |
| Google Chubby | Paxos | The original distributed lock service; inspired ZooKeeper |
12 · Summary & Decision Framework
Choosing the right leader election approach depends on your system's requirements. Here is a decision framework:
| Requirement | Recommended Approach |
|---|---|
| Simple, small cluster (<10 nodes), no partitions | Bully or Ring algorithm |
| Strong consistency, partition tolerance | Raft / Paxos / ZAB |
| Application-level election using existing infra | ZooKeeper ephemeral nodes or etcd leases |
| Kubernetes-native HA | etcd + leaderelection package |
| High availability, eventual consistency OK | Avoid election — use leaderless (Cassandra, DynamoDB) |
| Need split-brain protection | Fencing tokens + quorum writes |