Leader Election
In a distributed system, many tasks require exactly one node to take charge: writing to a shared log, scheduling jobs, rebalancing partitions, or coordinating a two-phase commit. The mechanism that selects that one node from a group of peers is called leader election. Get it wrong and you get split brain, data corruption, or total cluster paralysis. This post surveys the major algorithms and tools — from classical approaches (Bully, Ring) to production-grade systems (Raft, ZooKeeper, etcd) — along with the safety mechanisms (fencing tokens, leases) that make leader election reliable in the real world.
1 · Why Leader Election?
Many distributed problems become dramatically simpler when exactly one node acts as the leader (also called the master, primary, or coordinator). The leader centralizes decision-making so that the rest of the cluster can follow a single, authoritative source of truth.
Single Writer / Log Ordering
A replicated state machine requires a totally ordered log of commands. Without a single writer, concurrent writes create conflicting orderings. The leader serializes all writes, assigns monotonically increasing sequence numbers, and replicates the log to followers. This is the foundation of Raft, ZAB (ZooKeeper), and Multi-Paxos.
Coordination & Mutual Exclusion
Tasks like database schema migrations, shard rebalancing, or snapshot creation must be performed by exactly one node. A leader can act as a coordinator, issuing commands to participants without risking concurrent conflicting operations. This is far simpler than distributed locking across all nodes.
Task Scheduling
In systems like Apache Kafka or Kubernetes, a controller node assigns partitions to brokers, schedules pods to nodes, or triggers failovers. Without a single scheduler, assignment conflicts cause duplicate work, missed tasks, or oscillating state. The leader acts as the authoritative scheduler.
Consistency Guarantees
If every node independently accepts writes, you need a conflict-resolution strategy (CRDTs, last-write-wins, etc.). A single leader eliminates write conflicts entirely — every write is ordered through one node, giving you linearizability by construction.
2 · The Bully Algorithm
The Bully Algorithm, proposed by Hector Garcia-Molina in 1982, is one of the simplest and most intuitive leader election algorithms. The idea: the node with the highest ID always wins. If a node suspects the leader is dead, it "bullies" lower-ID nodes into submission.
How It Works
- Every node has a unique numeric ID (e.g., 1 through N).
- When a node detects the leader has failed (timeout on heartbeat), it initiates an election.
- The initiator sends an `ELECTION` message to all nodes with higher IDs.
- If any higher-ID node responds with `OK` (meaning "I'll take it from here"), the initiator backs off.
- Each responding node then starts its own election by sending `ELECTION` to nodes with even higher IDs.
- The process repeats until a node sends `ELECTION` and receives no response — that node is the highest alive and declares itself leader.
- The new leader broadcasts a `COORDINATOR` message to all nodes.
Pseudocode
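A compact Go sketch of the election logic. The message-transport functions are hypothetical stubs, and failure detection (the heartbeat timeout that triggers the election) is assumed to happen elsewhere:

```go
// Sketch of the Bully election logic; transport is abstracted
// behind sendElection/broadcastCoordinator (hypothetical stubs).
package bully

type Node struct {
	ID    int
	Peers []int // IDs of all other nodes
}

// sendElection sends ELECTION to a peer and reports whether it
// answered OK before a timeout. A dead peer never answers.
func sendElection(peer int) (ok bool) { /* network call */ return false }

// broadcastCoordinator announces this node as the new leader.
func broadcastCoordinator(self int) { /* network call */ }

// StartElection runs one round of the Bully algorithm from this node.
func (n *Node) StartElection() {
	heardOK := false
	for _, p := range n.Peers {
		if p > n.ID && sendElection(p) {
			// A higher-ID node is alive; it takes over the election.
			heardOK = true
		}
	}
	if !heardOK {
		// No higher-ID node responded: we are the highest alive node.
		broadcastCoordinator(n.ID)
	}
	// Otherwise: wait for a COORDINATOR message; if none arrives
	// within a timeout, restart the election.
}
```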
Complexity Analysis
| Metric | Best Case | Worst Case |
|---|---|---|
| Messages | O(n) — highest node detects failure, sends n−1 COORDINATOR messages | O(n²) — lowest node detects failure, cascade of ELECTION + OK messages |
| Rounds | 1 round | O(n) rounds |
| Time | O(1) message delays | O(n) message delays |
Trade-offs
✅ Advantages
- Simple to understand and implement
- Deterministic outcome (highest alive ID always wins)
- Fast in the best case (1 round)
❌ Disadvantages
- O(n²) messages in worst case
- Highest-ID node is always elected — no load balancing
- Vulnerable to a "flapping" high-ID node that keeps crashing and restarting
- No partition tolerance — relies on reliable failure detection
3 · The Ring Algorithm
The Ring Algorithm (often attributed to Chang and Roberts, 1979) arranges nodes in a logical ring. When a node detects a leader failure, it passes an election token around the ring. The token collects node IDs, and the highest ID wins.
How It Works
- Nodes are arranged in a logical ring — each node knows its successor.
- When a node detects leader failure, it creates an `ELECTION` message containing its own ID and forwards it to its successor.
- Each node that receives the message appends its own ID (if alive) and forwards it.
- When the message returns to the initiator (it has traversed the full ring), the node with the highest ID in the list is declared leader.
- A `COORDINATOR` message is sent around the ring to inform everyone.
Pseudocode
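A matching Go sketch with the same kind of hypothetical transport stubs. Each node knows only its successor, and the token accumulates IDs until it returns to the initiator:

```go
// Sketch of the ring election; forward/announceCoordinator are
// hypothetical stubs that pass messages to a node's successor.
package ring

type Election struct {
	Initiator int
	IDs       []int // node IDs collected so far around the ring
}

func forward(successor int, msg Election) { /* network call */ }

func announceCoordinator(successor, leader int) { /* network call */ }

// StartElection is called by the node that detects leader failure.
func StartElection(self, succ int) {
	forward(succ, Election{Initiator: self, IDs: []int{self}})
}

// OnElection handles an ELECTION message arriving at node `self`,
// whose ring successor is `succ`.
func OnElection(self, succ int, msg Election) {
	if msg.Initiator == self {
		// Token has gone full circle: pick the highest collected ID.
		leader := msg.IDs[0]
		for _, id := range msg.IDs {
			if id > leader {
				leader = id
			}
		}
		announceCoordinator(succ, leader) // COORDINATOR goes around the ring
		return
	}
	msg.IDs = append(msg.IDs, self) // add ourselves and pass it on
	forward(succ, msg)
}
```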
Complexity
The ring algorithm always sends exactly 2n messages (n for the election round, n for the coordinator announcement), regardless of which node initiates. This is better than the Bully algorithm's worst case, but slower because the message must traverse the entire ring.
| Algorithm | Best Case Messages | Worst Case Messages | Partition Tolerant? |
|---|---|---|---|
| Bully | O(n) | O(n²) | No |
| Ring (Chang-Roberts) | O(n) | O(n) | No (ring breaks) |
| Raft | O(n) | O(n) | Yes (majority quorum) |
4 · Raft-Based Leader Election
Raft (Ongaro & Ousterhout, 2014) is the industry standard for consensus-based leader election. Unlike Bully or Ring, Raft is partition-tolerant and guarantees safety: at most one leader per term, and a leader must be accepted by a majority of nodes.
Terms & States
Time is divided into terms — monotonically increasing integers. Each term has at most one leader. Every node is in one of three states:
- Follower: Passive; responds to RPCs from the leader and candidates.
- Candidate: Actively seeking votes to become the new leader.
- Leader: Handles all client requests and replicates the log.
Randomized Election Timeouts
Each follower has an election timeout — a random duration, typically between 150 ms and 300 ms. If the follower does not hear from the leader (via heartbeats) before its timeout fires, it assumes the leader has failed and becomes a candidate. Because the timeouts are randomized, one follower usually times out before the others, which avoids most split votes.
The Election Process
- Timeout fires: A follower increments its term to `T+1`, transitions to candidate, and votes for itself.
- RequestVote RPC: The candidate sends `RequestVote(term=T+1, candidateId, lastLogIndex, lastLogTerm)` to all other nodes.
- Vote granting: A node grants its vote if (a) it has not already voted in term `T+1`, and (b) the candidate's log is at least as up-to-date as the voter's log.
- Majority wins: If the candidate receives votes from a majority of the cluster (⌊n/2⌋ + 1), it becomes leader for term `T+1`.
- Heartbeats: The new leader immediately sends `AppendEntries` heartbeats to all followers to assert authority and prevent new elections.
Pseudocode
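A Go sketch of the candidate side. The types and RPC stub are hypothetical, and the persistence of `currentTerm`/`votedFor` that real Raft requires is omitted:

```go
// Sketch of Raft's candidate logic; RPC transport and persistence
// are abstracted away (hypothetical types).
package raft

type Entry struct{ Term int }

type Node struct {
	id          int
	peers       []int // all other nodes; cluster size is len(peers)+1
	currentTerm int
	votedFor    int
	log         []Entry
}

// requestVote sends a RequestVote RPC and reports whether the vote
// was granted before the RPC timed out.
func requestVote(peer, term, candidateID, lastLogIndex, lastLogTerm int) bool {
	/* network call */
	return false
}

// startElection runs when this node's randomized election timeout
// fires without hearing an AppendEntries heartbeat from a leader.
func (n *Node) startElection() {
	n.currentTerm++   // move to term T+1
	n.votedFor = n.id // vote for ourselves
	votes := 1

	lastIndex := len(n.log) - 1
	lastTerm := 0
	if lastIndex >= 0 {
		lastTerm = n.log[lastIndex].Term
	}
	for _, p := range n.peers {
		if requestVote(p, n.currentTerm, n.id, lastIndex, lastTerm) {
			votes++
		}
	}

	clusterSize := len(n.peers) + 1
	if votes >= clusterSize/2+1 { // strict majority: ⌊n/2⌋+1
		n.becomeLeader() // immediately send AppendEntries heartbeats
	}
	// Otherwise: stay candidate; a fresh randomized timeout triggers
	// a new election with an incremented term (handles split votes).
}

func (n *Node) becomeLeader() { /* send heartbeats */ }
```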
Split Votes
If two candidates start elections simultaneously and split the vote so neither gets a majority, neither wins. After another random timeout, one of them starts a new election with an incremented term. The randomized timeouts make consecutive split votes statistically unlikely — Raft converges in 1–2 rounds in practice.
5 · ZooKeeper Ephemeral Sequential Nodes
Apache ZooKeeper provides a powerful primitive for leader election: ephemeral sequential znodes. This approach is elegant, battle-tested, and avoids the herd effect (thundering herd problem).
The Mechanism
- Each candidate creates an ephemeral sequential znode under a known path: `/election/node-`. ZooKeeper appends a monotonically increasing sequence number: `/election/node-0000000001`, `/election/node-0000000002`, etc.
- The candidate with the lowest sequence number is the leader.
- Each non-leader node sets a watch on the znode immediately before it in the sequence (not on the leader — this prevents the thundering herd).
- When the leader crashes, its ephemeral znode is automatically deleted by ZooKeeper (the session is gone).
- The watch on the deleted znode fires, notifying the next node in sequence, which becomes the new leader.
ZooKeeper Election Code (Java)
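A minimal sketch against the standard `org.apache.zookeeper` client. Error handling, session expiry, and re-volunteering are omitted, and the `/election` parent znode is assumed to already exist:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.*;

public class LeaderElection implements Watcher {
    private final ZooKeeper zk;
    private String myNode; // e.g. "node-0000000007"

    public LeaderElection(ZooKeeper zk) { this.zk = zk; }

    public void volunteer() throws KeeperException, InterruptedException {
        // Ephemeral + sequential: deleted when our session dies, and
        // suffixed with a monotonically increasing sequence number.
        String path = zk.create("/election/node-", new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
        myNode = path.substring("/election/".length());
        checkLeadership();
    }

    private void checkLeadership() throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        if (myNode.equals(children.get(0))) {
            onElectedLeader(); // lowest sequence number: we lead
            return;
        }
        // Watch only our immediate predecessor, not the leader,
        // to avoid the thundering herd.
        String predecessor = children.get(children.indexOf(myNode) - 1);
        if (zk.exists("/election/" + predecessor, this) == null) {
            checkLeadership(); // predecessor vanished between calls; re-check
        }
    }

    @Override
    public void process(WatchedEvent event) {
        if (event.getType() == Event.EventType.NodeDeleted) {
            try { checkLeadership(); } catch (Exception e) { /* re-volunteer */ }
        }
    }

    private void onElectedLeader() { /* start leading */ }
}
```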
Avoiding the Thundering Herd
A naive approach would have all non-leader nodes watch the leader znode. When the leader dies, all N−1 nodes receive the notification simultaneously and race to become leader. This creates O(n) simultaneous getChildren calls — the thundering herd problem.
The sequential-predecessor watch pattern ensures that when the leader dies, only one node (the second in sequence) is notified. It checks if it's now the smallest and becomes leader. O(1) notifications, O(1) ZooKeeper operations.
6 · etcd Leader Election with Leases
etcd, the distributed key-value store powering Kubernetes, provides leader election via leases and its concurrency package. A lease is a time-bound lock on a key — if the leader fails to renew the lease within the TTL, it automatically expires.
How etcd Election Works
- A candidate creates a lease with a TTL (e.g., 15 seconds) and keeps it alive with periodic `KeepAlive` RPCs.
- The candidate attempts to become leader by calling `Campaign()`, which creates a key under an election prefix with the lease attached.
- etcd uses its Raft-based consensus internally — the first key created in the election prefix wins.
- If the leader crashes or fails to renew the lease, the key is automatically deleted after TTL seconds.
- The next waiting candidate (which was blocked on `Campaign()`) is promoted to leader.
etcd Election Code (Go)
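A sketch using `go.etcd.io/etcd/client/v3` and its `concurrency` package. The endpoint, election prefix, and node name are placeholders:

```go
package main

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
	"go.etcd.io/etcd/client/v3/concurrency"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints: []string{"localhost:2379"}, // placeholder endpoint
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	// Session creates a lease (15s TTL) and keeps it alive in the background.
	session, err := concurrency.NewSession(cli, concurrency.WithTTL(15))
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// Campaign blocks until we own the lowest key under the election
	// prefix — i.e., until we are elected leader.
	election := concurrency.NewElection(session, "/leader-election/")
	if err := election.Campaign(context.Background(), "node-1"); err != nil {
		log.Fatal(err)
	}
	log.Println("elected leader; doing leader work")

	// If this process dies or is partitioned, the lease expires and the
	// next candidate blocked in Campaign() takes over automatically.
	<-session.Done() // lease lost: stop acting as leader
	log.Println("lost leadership")
}
```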
7 · Split Brain Prevention
Split brain is the most dangerous failure mode in leader election: two nodes both believe they are the leader and independently accept writes. This leads to divergent state, data corruption, and conflicting decisions.
How Split Brain Happens
- Network partition: The leader is isolated. Followers in the other partition elect a new leader. The old leader doesn't know it's been replaced.
- GC pause / process freeze: The leader pauses for 30 seconds during a GC. Followers time out and elect a new leader. The old leader wakes up, still thinking it's the leader.
- Asymmetric partition: The leader can send to followers but can't receive responses. It keeps acting as leader while followers elect a new one.
Prevention Strategies
| Strategy | Mechanism | Used By |
|---|---|---|
| Majority quorum | Leader must replicate to ⌊n/2⌋+1 nodes before committing. An isolated leader cannot get a quorum. | Raft, Paxos, ZAB |
| Lease expiry | Leader holds a time-bound lease. Must renew before TTL. If partition prevents renewal, leadership automatically expires. | etcd, Chubby, DynamoDB |
| Fencing tokens | Every new leader gets a monotonically increasing token. Resource servers reject requests with stale tokens. | HBase, custom systems |
| STONITH | "Shoot The Other Node In The Head" — physically power-off the old leader via IPMI/BMC before the new leader starts. | Pacemaker, HA clusters |
| Epoch numbers | Similar to Raft terms. Every election increments the epoch. Followers reject commands from stale epochs. | Kafka (KRaft), ZooKeeper |
8 · Fencing Tokens for Safety
Even with quorum-based election, there is a window where an old leader might still issue writes. Fencing tokens close this window by giving each leader a unique, monotonically increasing token that resource servers validate.
How Fencing Works
- Each time a new leader is elected, the election system issues a fencing token (an integer that strictly increases with each election — this is the Raft "term" or ZooKeeper "zxid").
- The leader includes its fencing token in every request to storage/resources.
- The storage server tracks the highest token it has seen.
- If a request arrives with a token lower than the highest seen, it is rejected — it came from a stale leader.
Example Scenario
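Replaying the GC-pause scenario from Section 7 with tokens (the token numbers here are illustrative): leader A is elected with token 33, then freezes for 30 seconds. The cluster elects leader B with token 34, and B writes to storage, so the storage server now records 34 as the highest token seen. When A wakes up and issues a write carrying token 33, the server compares 33 < 34 and rejects the request. Without the token, A's stale write would have been accepted silently.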
Fencing Token Validation
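A minimal storage-side sketch in Go (hypothetical types; in a real system the highest-seen token must be persisted, and the check must be atomic with the write):

```go
package fencing

import (
	"errors"
	"sync"
)

var ErrStaleLeader = errors.New("fencing: token older than highest seen")

// Store guards a resource with fencing-token validation.
type Store struct {
	mu           sync.Mutex
	highestToken int64
	data         map[string][]byte
}

func NewStore() *Store {
	return &Store{data: make(map[string][]byte)}
}

// Write applies the mutation only if the token is at least as new as
// the highest token seen so far; stale leaders are rejected.
func (s *Store) Write(token int64, key string, value []byte) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token < s.highestToken {
		return ErrStaleLeader // request from a deposed leader
	}
	s.highestToken = token // remember the newest leader we've heard from
	s.data[key] = value
	return nil
}
```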
9 · Lease-Based Leadership
A lease is a time-bounded grant of leadership. The leader must periodically renew the lease before it expires. If the leader fails to renew (crash, partition, GC pause), the lease expires and other candidates can acquire it.
How Leases Work
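The cycle is: acquire the lease atomically, do leader work, renew well before the TTL expires, and stand down the moment a renewal fails. Below is a minimal Go sketch over a hypothetical key-value interface offering atomic acquire-if-absent and renew-if-owner operations (the kind of primitive Redis `SET NX PX` or an etcd lease provides):

```go
package lease

import "time"

// KV is a hypothetical key-value interface assumed to offer atomic,
// TTL-bound operations.
type KV interface {
	// AcquireIfAbsent sets key=owner with a TTL only if the key is
	// absent; returns true if we now hold the lease.
	AcquireIfAbsent(key, owner string, ttl time.Duration) bool
	// Renew extends the TTL only if owner still holds the key.
	Renew(key, owner string, ttl time.Duration) bool
}

// RunLeader loops forever: acquire the lease, lead while renewals
// succeed, and stop leading the moment a renewal fails.
func RunLeader(kv KV, owner string, ttl time.Duration, lead func(stop <-chan struct{})) {
	const key = "leader-lease" // placeholder lease key
	for {
		if !kv.AcquireIfAbsent(key, owner, ttl) {
			time.Sleep(ttl / 2) // someone else leads; retry later
			continue
		}
		stop := make(chan struct{})
		go lead(stop)
		// Renew at TTL/3 so a couple of failed renewals still leave
		// time before the lease actually expires.
		for kv.Renew(key, owner, ttl) {
			time.Sleep(ttl / 3)
		}
		close(stop) // renewal failed: relinquish leadership immediately
	}
}
```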
The Clock Skew Problem
Leases depend on time, and clocks across machines can skew. If the leader's clock is slow and the follower's clock is fast, the follower might think the lease has expired while the leader thinks it's still valid. Solutions:
- Conservative renewal: Renew at TTL/3 so there's a 2/3 × TTL buffer.
- Bound clock skew: Use NTP and assume a maximum skew ε. The effective lease duration becomes TTL − ε (see the helper sketch after this list).
- Combine with fencing tokens: Even if two leaders overlap briefly, the fencing token prevents stale writes.
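As a concrete illustration of that arithmetic (a hypothetical helper, with example values): with TTL = 15 s and an assumed skew bound ε = 500 ms, the leader should treat its lease as valid only until acquiredAt + 14.5 s, and renew every 5 s.

```go
package lease

import "time"

// SafeUntil returns the latest instant a leader holding a lease may
// safely act: lease start plus TTL, minus an assumed clock-skew bound.
func SafeUntil(acquiredAt time.Time, ttl, maxSkew time.Duration) time.Time {
	return acquiredAt.Add(ttl - maxSkew)
}

// RenewInterval follows the conservative TTL/3 rule, leaving room for
// two missed renewals before the lease actually expires.
func RenewInterval(ttl time.Duration) time.Duration {
	return ttl / 3
}
```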
Leases vs Heartbeats
| Aspect | Heartbeat | Lease |
|---|---|---|
| Direction | Leader → followers | Leader → lease server |
| Failure detection | Followers detect missing heartbeats | Lease server detects expired TTL |
| Time dependency | Relative timeouts | Absolute TTL (clock-dependent) |
| Safety | Requires additional mechanism (terms, fencing) | Built-in expiry prevents stale leadership |
| Typical use | Raft, ZAB | etcd, Chubby, DynamoDB |
10 · When to Avoid Leader Election
Leader election concentrates decision-making on one node, which makes that node a bottleneck and its failover a brief availability gap. Not every distributed system needs a leader; consider the leaderless alternatives below.
When Leaderless Works Better
- High write throughput: A single leader serializes all writes — it becomes the bottleneck. Leaderless systems (Cassandra, DynamoDB, Riak) allow any node to accept writes, distributing load evenly.
- Availability over consistency: If your application tolerates eventual consistency, a leader adds unnecessary complexity and latency (every write must go through one node).
- Geo-distributed deployments: With data centers on different continents, routing all writes to one leader introduces cross-continent latency. Multi-leader or leaderless systems accept writes locally.
- CRDTs and conflict-free data: If your data model uses Conflict-Free Replicated Data Types (counters, sets, registers), concurrent writes merge automatically without a leader.
Leader vs Leaderless Comparison
| Aspect | Leader-Based | Leaderless |
|---|---|---|
| Consistency | Strong (linearizable) | Eventual (tunable) |
| Write throughput | Limited by leader | Scales with nodes |
| Read throughput | Can scale with followers | Scales with nodes |
| Availability | Leader failure → brief downtime | Any node serves requests |
| Complexity | Election, failover, fencing | Conflict resolution, read repair |
| Examples | etcd, ZooKeeper, Kafka (KRaft) | Cassandra, DynamoDB, Riak |
11 · Real-World Leader Election
Kafka Controller Election
In Apache Kafka, the controller is a special broker that manages partition leadership, replica assignment, and topic creation. There must be exactly one controller in the cluster.
Pre-KRaft (ZooKeeper-based)
- Brokers race to create an ephemeral znode `/controller` in ZooKeeper.
- The first one to create it wins (ZooKeeper's atomic `create` ensures exclusivity).
- Other brokers set a watch on `/controller`.
- When the controller crashes, the ephemeral znode is deleted, watches fire, and brokers race again.
- Problem: thundering herd — all N brokers rush to ZooKeeper simultaneously.
KRaft (Raft-based, Kafka 3.x+)
- Kafka runs its own Raft-based consensus (KRaft) among a set of controller quorum nodes.
- These nodes elect a leader using Raft's standard term + majority vote mechanism.
- The elected leader maintains the metadata log (partition assignments, ISR lists, configs).
- Benefits: no external dependency (ZooKeeper removed), fewer moving parts, faster failover.
Elasticsearch Master Election
Elasticsearch uses a master node for cluster state management: index creation/deletion, shard allocation, and node membership. The election uses the Zen Discovery protocol (pre-7.x) or a custom Raft-like protocol (7.x+).
- Master-eligible nodes: Configured with `node.roles: [master]`. Only these participate in elections.
- Quorum: `discovery.seed_hosts` lists known master-eligible nodes. The cluster needs a majority (⌊n/2⌋+1) of master-eligible nodes to elect a master.
- Cluster state: The elected master publishes cluster state updates. Nodes that don't acknowledge are removed.
- Minimum master nodes: In pre-7.x, `discovery.zen.minimum_master_nodes` had to be set manually to avoid split brain. 7.x+ automatically manages the quorum.
More Real-World Examples
| System | Election Mechanism | Notes |
|---|---|---|
| Kubernetes | etcd lease-based (via client-go/tools/leaderelection) | kube-controller-manager and kube-scheduler use leader election for HA |
| Redis Sentinel | Raft-like majority vote among Sentinels | Sentinels vote on which Sentinel performs the failover |
| MySQL Group Replication | Paxos-based consensus | Single-primary mode elects one writable node |
| CockroachDB | Per-range Raft | Each range (shard) has its own Raft group and elected leader |
| HDFS | ZooKeeper ZKFC (NameNode HA) | ZooKeeper Failover Controller elects active NameNode |
| Google Chubby | Paxos | The original distributed lock service; inspired ZooKeeper |
12 · Summary & Decision Framework
Choosing the right leader election approach depends on your system's requirements. Here is a decision framework:
| Requirement | Recommended Approach |
|---|---|
| Simple, small cluster (<10 nodes), no partitions | Bully or Ring algorithm |
| Strong consistency, partition tolerance | Raft / Paxos / ZAB |
| Application-level election using existing infra | ZooKeeper ephemeral nodes or etcd leases |
| Kubernetes-native HA | etcd + leaderelection package |
| High availability, eventual consistency OK | Avoid election — use leaderless (Cassandra, DynamoDB) |
| Need split-brain protection | Fencing tokens + quorum writes |