High Level Design Series · Distributed Systems · Part 7

Leader Election

In a distributed system, many tasks require exactly one node to take charge: writing to a shared log, scheduling jobs, rebalancing partitions, or coordinating a two-phase commit. The mechanism that selects that one node from a group of peers is called leader election. Get it wrong and you get split brain, data corruption, or total cluster paralysis. This post covers every algorithm and tool you need — from classical approaches (Bully, Ring) to production-grade systems (Raft, ZooKeeper, etcd) — along with the safety mechanisms (fencing tokens, leases) that make leader election reliable in the real world.

1 · Why Leader Election?

Many distributed problems become dramatically simpler when exactly one node acts as the leader (also called the master, primary, or coordinator). The leader centralizes decision-making so that the rest of the cluster can follow a single, authoritative source of truth.

Single Writer / Log Ordering

A replicated state machine requires a totally ordered log of commands. Without a single writer, concurrent writes create conflicting orderings. The leader serializes all writes, assigns monotonically increasing sequence numbers, and replicates the log to followers. This is the foundation of Raft, ZAB (ZooKeeper), and Multi-Paxos.

Coordination & Mutual Exclusion

Tasks like database schema migrations, shard rebalancing, or snapshot creation must be performed by exactly one node. A leader can act as a coordinator, issuing commands to participants without risking concurrent conflicting operations. This is far simpler than distributed locking across all nodes.

Task Scheduling

In systems like Apache Kafka or Kubernetes, a controller node assigns partitions to brokers, schedules pods to nodes, or triggers failovers. Without a single scheduler, assignment conflicts cause duplicate work, missed tasks, or oscillating state. The leader acts as the authoritative scheduler.

Consistency Guarantees

If every node independently accepts writes, you need a conflict-resolution strategy (CRDTs, last-write-wins, etc.). A single leader eliminates write conflicts entirely — every write is ordered through one node, giving you linearizability by construction.

Key Insight: Leader election is not about making one node "special" — it is about reducing a distributed coordination problem to a single-node coordination problem, which is fundamentally simpler.

2 · The Bully Algorithm

The Bully Algorithm, proposed by Hector Garcia-Molina in 1982, is one of the simplest and most intuitive leader election algorithms. The idea: the node with the highest ID always wins. During an election, higher-ID nodes "bully" lower-ID nodes into backing off.

How It Works

  1. Every node has a unique numeric ID (e.g., 1 through N).
  2. When a node detects the leader has failed (timeout on heartbeat), it initiates an election.
  3. The initiator sends an ELECTION message to all nodes with higher IDs.
  4. If any higher-ID node responds with OK (meaning "I'll take it from here"), the initiator backs off.
  5. Each responding node then starts its own election by sending ELECTION to nodes with even higher IDs.
  6. The process repeats until a node sends ELECTION and receives no response — that node is the highest alive and declares itself leader.
  7. The new leader broadcasts a COORDINATOR message to all nodes.

Pseudocode

```
function detectLeaderFailure(self):
    // Triggered when heartbeat from leader times out
    startElection(self)

function startElection(self):
    higherNodes = { n ∈ cluster | n.id > self.id }
    if higherNodes is empty:
        // I have the highest ID — I am the new leader
        self.isLeader = true
        for each node n in cluster:
            send(n, COORDINATOR, self.id)
        return

    // Send ELECTION to all higher-ID nodes
    for each node n in higherNodes:
        send(n, ELECTION, self.id)

    // Wait for OK responses (with timeout)
    responses = waitForOK(timeout=T)
    if responses is empty:
        // No higher node responded — I win
        self.isLeader = true
        for each node n in cluster:
            send(n, COORDINATOR, self.id)
    else:
        // A higher node will take over — back off
        waitForCoordinator(timeout=T2)

function onReceive(msg, sender):
    if msg == ELECTION:
        send(sender, OK)         // "I outrank you"
        startElection(self)      // Start my own election
    if msg == COORDINATOR:
        self.leader = sender.id  // Accept new leader
        self.isLeader = false
```

Complexity Analysis

| Metric | Best Case | Worst Case |
| --- | --- | --- |
| Messages | O(n) — highest node detects failure, sends n−1 COORDINATOR messages | O(n²) — lowest node detects failure, cascade of ELECTION + OK messages |
| Rounds | 1 round | O(n) rounds |
| Time | O(1) message delays | O(n) message delays |

Trade-offs

✅ Advantages

  • Simple to understand and implement
  • Deterministic outcome (highest alive ID always wins)
  • Fast in the best case (1 round)

❌ Disadvantages

  • O(n²) messages in worst case
  • Highest-ID node is always elected — no load balancing
  • Vulnerable to a "flapping" high-ID node that keeps crashing and restarting
  • No partition tolerance — relies on reliable failure detection

[Interactive demo: a 5-node Bully election in which node 5 (the leader) dies, triggering the cascade of ELECTION → OK → COORDINATOR messages.]

3 · The Ring Algorithm

The Ring Algorithm (Chang-Roberts, 1979) arranges nodes in a logical ring. When a node detects a leader failure, it passes an election token around the ring. The token collects node IDs, and the highest ID wins.

How It Works

  1. Nodes are arranged in a logical ring — each node knows its successor.
  2. When a node detects leader failure, it creates an ELECTION message containing its own ID and forwards it to its successor.
  3. Each node that receives the message appends its own ID (if alive) and forwards it.
  4. When the message returns to the initiator (it has traversed the full ring), the node with the highest ID in the list is declared leader.
  5. A COORDINATOR message is sent around the ring to inform everyone.

Pseudocode

```
function startRingElection(self):
    msg = ElectionMessage { candidates: [self.id], initiator: self.id }
    send(self.successor, msg)

function onElectionMessage(self, msg):
    if msg.initiator == self.id:
        // Message has gone full circle — determine winner
        leader = max(msg.candidates)
        coordinatorMsg = CoordinatorMessage { leader: leader, initiator: self.id }
        send(self.successor, coordinatorMsg)
    else:
        // Add my ID and forward
        msg.candidates.append(self.id)
        send(self.successor, msg)

function onCoordinatorMessage(self, msg):
    self.leader = msg.leader
    self.isLeader = (msg.leader == self.id)
    // Forward until the announcement returns to its initiator
    if self.successor.id != msg.initiator:
        send(self.successor, msg)
```

Complexity

With a single initiator, the ring algorithm sends exactly 2n messages (n to carry the election token around the ring, n for the coordinator announcement), regardless of which node starts the election. This beats the Bully algorithm's worst case, but elections take longer in wall-clock time because the token must traverse the entire ring sequentially.

| Algorithm | Best Case Messages | Worst Case Messages | Partition Tolerant? |
| --- | --- | --- | --- |
| Bully | O(n) | O(n²) | No |
| Ring (Chang-Roberts) | O(n) | O(n) | No (ring breaks) |
| Raft | O(n) | O(n) | Yes (majority quorum) |

4 · Raft-Based Leader Election

Raft (Ongaro & Ousterhout, 2014) is the industry standard for consensus-based leader election. Unlike Bully or Ring, Raft is partition-tolerant and guarantees safety: at most one leader per term, and a leader must be accepted by a majority of nodes.

Terms & States

Time is divided into terms — monotonically increasing integers. Each term has at most one leader. Every node is in one of three states:

  • Follower: passive; responds to RequestVote and AppendEntries RPCs but initiates nothing on its own.
  • Candidate: campaigning for votes after its election timeout fires.
  • Leader: handles all client writes and sends periodic heartbeats to suppress new elections.

Randomized Election Timeouts

Each follower has an election timeout — a random duration between 150ms and 300ms. If the follower does not hear from the leader (via heartbeats) before its timeout fires, it assumes the leader has failed and becomes a candidate. The randomization ensures that in most cases, only one follower times out first, avoiding split votes.
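As a rough sketch in Go (the 150–300 ms window is the range the Raft paper suggests; real systems tune it to sit well above the network round-trip time):

```go
package main

import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws a fresh timeout in [150ms, 300ms).
// Re-randomizing on every timer reset means two followers that tied
// once are unlikely to tie again in the next round.
func randomElectionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	return base + time.Duration(rand.Int63n(int64(base)))
}
```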

The Election Process

  1. Timeout fires: A follower increments its term to T+1, transitions to candidate, and votes for itself.
  2. RequestVote RPC: The candidate sends RequestVote(term=T+1, candidateId, lastLogIndex, lastLogTerm) to all other nodes.
  3. Vote granting: A node grants its vote if:
      (a) it has not already voted in term T+1, AND
      (b) the candidate's log is at least as up-to-date as the voter's log.
  4. Majority wins: If the candidate receives votes from a majority of the cluster (⌊n/2⌋ + 1), it becomes leader for term T+1.
  5. Heartbeats: The new leader immediately sends AppendEntries heartbeats to all followers to assert authority and prevent new elections.

Pseudocode

```
// On election timeout (no heartbeat from leader)
function onElectionTimeout(self):
    self.currentTerm += 1
    self.state = CANDIDATE
    self.votedFor = self.id
    self.votesReceived = 1   // Vote for myself
    for each node n in cluster (n ≠ self):
        sendRequestVote(n, {
            term:         self.currentTerm,
            candidateId:  self.id,
            lastLogIndex: len(self.log) - 1,
            lastLogTerm:  self.log[len(self.log) - 1].term
        })
    // Await responses...

function onRequestVoteReply(self, reply):
    if reply.term > self.currentTerm:
        stepDown(self, reply.term)   // A higher term exists
        return
    if reply.voteGranted:
        self.votesReceived += 1
        if self.votesReceived > clusterSize / 2:
            becomeLeader(self)

function onRequestVote(self, request):
    if request.term < self.currentTerm:
        return { term: self.currentTerm, voteGranted: false }
    if request.term > self.currentTerm:
        stepDown(self, request.term)

    logOk = request.lastLogTerm > self.lastLogTerm
            OR (request.lastLogTerm == self.lastLogTerm
                AND request.lastLogIndex >= len(self.log) - 1)

    if (self.votedFor == null OR self.votedFor == request.candidateId) AND logOk:
        self.votedFor = request.candidateId
        resetElectionTimer()
        return { term: self.currentTerm, voteGranted: true }
    else:
        return { term: self.currentTerm, voteGranted: false }
```

Split Votes

If two candidates start elections simultaneously and split the vote so neither gets a majority, neither wins. After another random timeout, one of them starts a new election with an incremented term. The randomized timeouts make consecutive split votes statistically unlikely — Raft converges in 1–2 rounds in practice.

Safety guarantee: At most one leader per term. Since votes are granted at most once per term, and a leader needs a majority, two leaders would require two majorities in the same term — which is impossible since any two majorities overlap in at least one node.
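The quorum arithmetic is worth pinning down; a minimal helper, where integer division does the flooring:

```go
// quorum returns the majority size for a cluster of n nodes: ⌊n/2⌋ + 1.
// quorum(3) == 2, quorum(4) == 3, quorum(5) == 3. Any two sets of this
// size overlap in at least one node, which is why two leaders cannot
// both win votes in the same term.
func quorum(n int) int {
	return n/2 + 1
}
```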

5 · ZooKeeper Ephemeral Sequential Nodes

Apache ZooKeeper provides a powerful primitive for leader election: ephemeral sequential znodes. This approach is elegant, battle-tested, and avoids the herd effect (thundering herd problem).

The Mechanism

  1. Each candidate creates an ephemeral sequential znode under a known path: /election/node-. ZooKeeper appends a monotonically increasing sequence number: /election/node-0000000001, /election/node-0000000002, etc.
  2. The candidate with the lowest sequence number is the leader.
  3. Each non-leader node sets a watch on the znode immediately before it in the sequence (not on the leader — this prevents the thundering herd).
  4. When the leader crashes, its ephemeral znode is automatically deleted by ZooKeeper (the session is gone).
  5. The watch on the deleted znode fires, notifying the next node in sequence, which becomes the new leader.

ZooKeeper Election Code (Java)

```java
import java.util.Collections;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.Watcher.Event.EventType;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;
import org.apache.zookeeper.data.Stat;

public class LeaderElection implements Watcher {
    private final ZooKeeper zk;
    private String myZnode;
    private static final String ELECTION_PATH = "/election";

    public LeaderElection(ZooKeeper zk) {
        this.zk = zk;
    }

    public void volunteer() throws Exception {
        // Create ephemeral sequential znode
        myZnode = zk.create(
            ELECTION_PATH + "/node-",
            new byte[0],
            ZooDefs.Ids.OPEN_ACL_UNSAFE,
            CreateMode.EPHEMERAL_SEQUENTIAL
        );
        System.out.println("Created: " + myZnode);
        checkLeadership();
    }

    private void checkLeadership() throws Exception {
        // Get all children sorted by sequence number
        List<String> children = zk.getChildren(ELECTION_PATH, false);
        Collections.sort(children);

        String smallestNode = children.get(0);
        String myNodeName = myZnode.substring(myZnode.lastIndexOf('/') + 1);

        if (myNodeName.equals(smallestNode)) {
            // I am the leader!
            System.out.println("I am the leader: " + myZnode);
            onElectedLeader();
        } else {
            // Watch the node JUST before me (avoids herd effect)
            int myIndex = children.indexOf(myNodeName);
            String predecessorNode = children.get(myIndex - 1);

            // Set watch on predecessor
            Stat stat = zk.exists(
                ELECTION_PATH + "/" + predecessorNode,
                this  // Watcher callback
            );
            if (stat == null) {
                // Predecessor already gone — re-check
                checkLeadership();
            } else {
                System.out.println("Watching predecessor: " + predecessorNode);
            }
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == EventType.NodeDeleted) {
            // Predecessor deleted — maybe I'm leader now
            try {
                checkLeadership();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void onElectedLeader() {
        // Leader-only work starts here
    }
}
```

Avoiding the Thundering Herd

A naive approach would have all non-leader nodes watch the leader znode. When the leader dies, all N−1 nodes receive the notification simultaneously and race to become leader. This creates O(n) simultaneous getChildren calls — the thundering herd problem.

The sequential-predecessor watch pattern ensures that when the leader dies, only one node (the second in sequence) is notified. It checks if it's now the smallest and becomes leader. O(1) notifications, O(1) ZooKeeper operations.

[Interactive demo: clients create sequential ephemeral znodes; the lowest sequence number leads, and failover fires when the leader's znode disappears.]

6 · etcd Leader Election with Leases

etcd, the distributed key-value store powering Kubernetes, provides leader election via leases and its concurrency package. A lease is a time-bound lock on a key — if the leader fails to renew the lease within the TTL, it automatically expires.

How etcd Election Works

  1. A candidate creates a lease with a TTL (e.g., 15 seconds) and keeps it alive with periodic KeepAlive RPCs.
  2. The candidate attempts to become leader by calling Campaign(), which creates a key under an election prefix with the lease attached.
  3. etcd uses its Raft-based consensus internally — the first key created in the election prefix wins.
  4. If the leader crashes or fails to renew the lease, the key is automatically deleted after TTL seconds.
  5. The next waiting candidate (which was blocked on Campaign()) is promoted to leader.

etcd Election Code (Go)

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	// Connect to etcd cluster
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Create a session (lease with keep-alive)
	session, err := concurrency.NewSession(cli,
		concurrency.WithTTL(15), // 15-second lease TTL
	)
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Create an election on prefix "/my-app/leader"
	election := concurrency.NewElection(session, "/my-app/leader")

	// Campaign to become leader (blocks until elected)
	ctx := context.Background()
	if err := election.Campaign(ctx, "node-1"); err != nil {
		log.Fatal(err)
	}
	fmt.Println("I am the leader!")

	// Do leader work...
	doLeaderWork()

	// Voluntarily resign leadership
	if err := election.Resign(ctx); err != nil {
		log.Fatal(err)
	}
}

func doLeaderWork() {
	// Leader-only work goes here
}
```
etcd vs ZooKeeper: Both solve the same problem. etcd uses Raft and exposes a gRPC, Go-native API; ZooKeeper uses ZAB and a Java-centric API. Kubernetes chose etcd for its simplicity and native Go support. Kafka relied on ZooKeeper through the 3.x line; KRaft now replaces it with a built-in Raft implementation.
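Non-candidates can also track leadership changes with the same concurrency package; a small sketch, assuming `ctx` and `election` are set up as in the example above:

```go
package main

import (
	"context"
	"fmt"

	"go.etcd.io/etcd/client/v3/concurrency"
)

// watchLeader streams leadership changes; any process can call this,
// including ones that never campaign.
func watchLeader(ctx context.Context, election *concurrency.Election) {
	for resp := range election.Observe(ctx) {
		if len(resp.Kvs) > 0 {
			fmt.Printf("current leader value: %s\n", resp.Kvs[0].Value)
		}
	}
}
```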

7 · Split Brain Prevention

Split brain is the most dangerous failure mode in leader election: two nodes both believe they are the leader and independently accept writes. This leads to divergent state, data corruption, and conflicting decisions.

How Split Brain Happens

The classic trigger is a network partition: the leader is cut off from a majority of the cluster, the remaining nodes time out and elect a replacement, and the old leader, unaware it has been deposed, keeps accepting writes on its side of the partition. A long GC pause or an overloaded node produces the same effect without any network failure: the leader merely looks dead, a new one is elected, and the original wakes up still believing it is in charge.

Prevention Strategies

| Strategy | Mechanism | Used By |
| --- | --- | --- |
| Majority quorum | Leader must replicate to ⌊n/2⌋+1 nodes before committing. An isolated leader cannot reach a quorum. | Raft, Paxos, ZAB |
| Lease expiry | Leader holds a time-bound lease and must renew it before the TTL. If a partition prevents renewal, leadership automatically expires. | etcd, Chubby, DynamoDB |
| Fencing tokens | Every new leader gets a monotonically increasing token. Resource servers reject requests with stale tokens. | HBase, custom systems |
| STONITH | "Shoot The Other Node In The Head": physically power off the old leader via IPMI/BMC before the new leader starts. | Pacemaker, HA clusters |
| Epoch numbers | Similar to Raft terms. Every election increments the epoch; followers reject commands from stale epochs. | Kafka (KRaft), ZooKeeper |

8 · Fencing Tokens for Safety

Even with quorum-based election, there is a window where an old leader might still issue writes. Fencing tokens close this window by giving each leader a unique, monotonically increasing token that resource servers validate.

How Fencing Works

  1. Each time a new leader is elected, the election system issues a fencing token — an integer that strictly increases with each election (the Raft term, or in ZooKeeper the epoch recorded in the high bits of the zxid).
  2. The leader includes its fencing token in every request to storage/resources.
  3. The storage server tracks the highest token it has seen.
  4. If a request arrives with a token lower than the highest seen, it is rejected — it came from a stale leader.

Example Scenario

```
Time 0: Leader A elected, gets fencing token = 34
Time 1: Leader A writes to storage with token 34 → ✅ accepted
Time 2: Leader A gets a long GC pause (appears dead)
Time 3: Leader B elected, gets fencing token = 35
Time 4: Leader B writes to storage with token 35 → ✅ accepted
Time 5: Leader A wakes up, tries to write with token 34 → ❌ REJECTED
        (storage has seen token 35, which is higher)
```
Warning: Fencing tokens only work if all resource servers participate in validation. A storage system that doesn't check the token creates a safety hole. This is why Martin Kleppmann argues that Redlock (Redis distributed locks) is unsafe — Redis doesn't have built-in fencing token validation.

Fencing Token Validation

```
// Storage server side
class FencedStorage:
    highestTokenSeen = 0
    data = {}

    function write(key, value, fencingToken):
        if fencingToken < highestTokenSeen:
            raise StaleLeaderError(
                "Token " + fencingToken +
                " < highest seen " + highestTokenSeen
            )
        highestTokenSeen = fencingToken
        data[key] = value
        return OK
```
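The same check as a runnable Go sketch (the type name mirrors the pseudocode; a production server would also persist the high-water mark so a restart cannot reset it):

```go
package main

import (
	"fmt"
	"sync"
)

// FencedStorage rejects writes carrying a token older than the highest
// one it has ever seen, fencing off stale leaders.
type FencedStorage struct {
	mu               sync.Mutex
	highestTokenSeen int64
	data             map[string]string
}

func NewFencedStorage() *FencedStorage {
	return &FencedStorage{data: make(map[string]string)}
}

func (s *FencedStorage) Write(key, value string, token int64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highestTokenSeen {
		return fmt.Errorf("stale leader: token %d < highest seen %d",
			token, s.highestTokenSeen)
	}
	s.highestTokenSeen = token
	s.data[key] = value
	return nil
}
```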

9 · Lease-Based Leadership

A lease is a time-bounded grant of leadership. The leader must periodically renew the lease before it expires. If the leader fails to renew (crash, partition, GC pause), the lease expires and other candidates can acquire it.

How Leases Work

Lease lifecycle:

  1. Acquire: Leader obtains lease with TTL = 15s
  2. Renew: Leader sends keepalive every 5s (TTL/3)
  3. Use: Leader performs work only while lease is valid
  4. Expire: If renewal fails, lease expires after 15s
  5. Re-elect: Another candidate acquires the lease
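A sketch of step 2's renewal loop (the `Lease` interface here is hypothetical, standing in for etcd, Chubby, or a homegrown service; renewing at TTL/3 leaves room for another attempt before the lease lapses):

```go
package main

import (
	"context"
	"time"
)

// Lease is a hypothetical client handle to a lease service; only
// renewal matters for this sketch.
type Lease interface {
	Renew(ctx context.Context) error
}

// renewLoop renews at TTL/3. Returning on the first error is the
// conservative choice: the caller must stop doing leader-only work
// before the lease can possibly expire.
func renewLoop(ctx context.Context, lease Lease, ttl time.Duration) error {
	ticker := time.NewTicker(ttl / 3)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-ticker.C:
			if err := lease.Renew(ctx); err != nil {
				return err // assume the lease may lapse: step down
			}
		}
	}
}
```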

The Clock Skew Problem

Leases depend on time, and clocks across machines can skew. If the leader's clock is slow and the follower's clock is fast, the follower might think the lease has expired while the leader thinks it is still valid. Common mitigations: have the holder treat the lease as expiring earlier than the grantor does (a safety margin that bounds the assumed drift), measure lease lifetime with a local monotonic clock rather than wall-clock time, and back the lease with fencing tokens so that a node acting on an expired lease still cannot corrupt state.
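A sketch of the first mitigation, the conservative local deadline (`maxClockSkew` is an assumed drift bound, not something the code can measure):

```go
package main

import "time"

// leaseStillValid applies a safety margin so the holder stops acting
// as leader before any reasonably-skewed observer considers the lease
// expired. acquiredAt should come from time.Now(), which in Go carries
// a monotonic clock reading, so the local elapsed time is skew-free.
func leaseStillValid(acquiredAt time.Time, ttl, maxClockSkew time.Duration) bool {
	return time.Now().Before(acquiredAt.Add(ttl - maxClockSkew))
}
```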

Leases vs Heartbeats

| Aspect | Heartbeat | Lease |
| --- | --- | --- |
| Direction | Leader → followers | Leader → lease server |
| Failure detection | Followers detect missing heartbeats | Lease server detects expired TTL |
| Time dependency | Relative timeouts | Absolute TTL (clock-dependent) |
| Safety | Requires additional mechanism (terms, fencing) | Built-in expiry prevents stale leadership |
| Typical use | Raft, ZAB | etcd, Chubby, DynamoDB |

10 · When to Avoid Leader Election

Leader election introduces a single point of failure and a bottleneck. Not every distributed system needs a leader.

When Leaderless Works Better

Leaderless (quorum-replicated) designs fit when availability and write throughput matter more than strict ordering: write traffic spans regions, eventual consistency is acceptable, and conflicts are rare or mechanically resolvable (last-write-wins, version vectors, CRDTs). Systems like Cassandra, DynamoDB, and Riak accept writes at any replica and reconcile divergence later through read repair and anti-entropy.

Leader vs Leaderless Comparison

| Aspect | Leader-Based | Leaderless |
| --- | --- | --- |
| Consistency | Strong (linearizable) | Eventual (tunable) |
| Write throughput | Limited by leader | Scales with nodes |
| Read throughput | Can scale with followers | Scales with nodes |
| Availability | Leader failure → brief downtime | Any node serves requests |
| Complexity | Election, failover, fencing | Conflict resolution, read repair |
| Examples | etcd, ZooKeeper, Kafka (KRaft) | Cassandra, DynamoDB, Riak |
💡 Interview Tip: When asked "would you use leader election here?", consider the trade-offs. If the system needs strong consistency and ordered operations (like a distributed lock service or a metadata store), leader election is the right choice. If the system needs high availability and write throughput across regions (like a social media feed or analytics pipeline), leaderless is better. Show that you understand when to use each approach.

11 · Real-World Leader Election

Kafka Controller Election

In Apache Kafka, the controller is a special broker that manages partition leadership, replica assignment, and topic creation. There must be exactly one controller in the cluster.

Pre-KRaft (ZooKeeper-based)

Brokers raced to create an ephemeral /controller znode in ZooKeeper; the first to succeed became the controller. When the controller died, its session expired, the znode was deleted, and the surviving brokers raced again. Each election bumped a controller epoch so brokers could ignore commands from a deposed controller.

KRaft (Raft-based, Kafka 3.x+)

A small quorum of dedicated controller nodes replicates the cluster metadata log using Kafka's built-in Raft implementation. The Raft leader of that log is the active controller, and failover is ordinary Raft leader election, with no external coordinator required.

```properties
# server.properties for a KRaft controller
process.roles=controller
node.id=1
controller.quorum.voters=1@controller1:9093,2@controller2:9093,3@controller3:9093
controller.listener.names=CONTROLLER
listeners=CONTROLLER://controller1:9093
```

Elasticsearch Master Election

Elasticsearch uses a master node for cluster state management: index creation/deletion, shard allocation, and node membership. Elections use the Zen Discovery protocol (pre-7.x) or a custom Raft-like coordination protocol (7.x+).

```yaml
# elasticsearch.yml (Elasticsearch 8.x master election config)
node.roles: [master, data]
cluster.name: my-cluster
cluster.initial_master_nodes:
  - master-node-1
  - master-node-2
  - master-node-3
discovery.seed_hosts:
  - "10.0.1.1:9300"
  - "10.0.1.2:9300"
  - "10.0.1.3:9300"
```

More Real-World Examples

| System | Election Mechanism | Notes |
| --- | --- | --- |
| Kubernetes | etcd lease-based (via client-go/tools/leaderelection) | kube-controller-manager and kube-scheduler use leader election for HA |
| Redis Sentinel | Raft-like majority vote among Sentinels | Sentinels vote on which Sentinel performs the failover |
| MySQL Group Replication | Paxos-based consensus | Single-primary mode elects one writable node |
| CockroachDB | Per-range Raft | Each range (shard) has its own Raft group and elected leader |
| HDFS | ZooKeeper ZKFC (NameNode HA) | ZooKeeper Failover Controller elects the active NameNode |
| Google Chubby | Paxos | The original distributed lock service; inspired ZooKeeper |
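The Kubernetes row is easy to make concrete. A sketch using client-go's leaderelection package (the lease name `my-controller`, the namespace, and the 15s/10s/2s timings are illustrative; `clientset` and `podName` are assumed to be initialized elsewhere):

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func runWithLeaderElection(ctx context.Context, clientset *kubernetes.Clientset, podName string) {
	// A Lease object in the cluster acts as the lock.
	lock := &resourcelock.LeaseLock{
		LeaseMeta: metav1.ObjectMeta{Name: "my-controller", Namespace: "default"},
		Client:    clientset.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{
			Identity: podName, // must be unique per replica
		},
	}

	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   15 * time.Second, // how long a lease is valid
		RenewDeadline:   10 * time.Second, // give up leading if renewal takes longer
		RetryPeriod:     2 * time.Second,  // how often non-leaders retry
		ReleaseOnCancel: true,
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) {
				// Leader-only work runs here until ctx is cancelled.
			},
			OnStoppedLeading: func() {
				// Stop leader-only work immediately.
			},
		},
	})
}
```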

12 · Summary & Decision Framework

Choosing the right leader election approach depends on your system's requirements. Here is a decision framework:

| Requirement | Recommended Approach |
| --- | --- |
| Simple, small cluster (<10 nodes), no partitions | Bully or Ring algorithm |
| Strong consistency, partition tolerance | Raft / Paxos / ZAB |
| Application-level election using existing infra | ZooKeeper ephemeral nodes or etcd leases |
| Kubernetes-native HA | etcd + leaderelection package |
| High availability, eventual consistency OK | Avoid election — use leaderless (Cassandra, DynamoDB) |
| Need split-brain protection | Fencing tokens + quorum writes |
💡 Interview Tip: In system design interviews, when leader election comes up, mention: (1) why you need a leader (single writer, coordination), (2) how you'll elect one (Raft-based, ZooKeeper, etcd), (3) split brain prevention (fencing tokens, majority quorum), and (4) failover time (typically TTL + election time ≈ 5–30 seconds). If the interviewer asks about alternatives, discuss leaderless architectures and when they're appropriate.