Consensus: Paxos & Raft
The Consensus Problem
At the heart of every reliable distributed system lies a deceptively simple question: how do multiple nodes agree on a single value? This is the consensus problem, and it is the foundation upon which leader election, replicated state machines, distributed locks, atomic broadcasts, and total order delivery are all built.
Formally, a consensus protocol must satisfy three properties:
- Agreement: All non-faulty processes decide on the same value.
- Validity: The decided value must have been proposed by some process (no "magic" values).
- Termination: Every non-faulty process eventually decides (liveness).
In practice, we also care about total order — not just agreeing on one value, but on a sequence of values (a replicated log). This is what turns consensus into a replicated state machine: every node applies the same commands in the same order, so all nodes converge to the same state.
Replicated State Machine:
Client → propose(cmd) → Consensus Module → Log
↓
State Machine applies log[0], log[1], log[2]...
↓
All replicas reach identical state
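To make the replicated state machine idea concrete, here is a minimal Go sketch (the types and names are illustrative, not taken from any real system): two replicas that apply the same log in the same order end up with identical state.
```go
package main

import "fmt"

// Command is one entry of the replicated log.
type Command struct {
	Key, Value string
}

// StateMachine applies committed commands in log order.
type StateMachine struct {
	applied int               // number of log entries applied so far
	state   map[string]string // the materialized state
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

// Apply consumes every committed-but-unapplied entry, in order.
func (sm *StateMachine) Apply(log []Command, commitIndex int) {
	for ; sm.applied < commitIndex; sm.applied++ {
		cmd := log[sm.applied]
		sm.state[cmd.Key] = cmd.Value
	}
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}

	a, b := NewStateMachine(), NewStateMachine()
	a.Apply(log, 3) // replica A applies log[0..2]
	b.Apply(log, 3) // replica B applies the same log

	fmt.Println(a.state, b.state) // identical: map[x:3 y:2] map[x:3 y:2]
}
```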
Why is this hard? Because distributed systems must contend with arbitrary message delays, message loss, network partitions, and node crashes — all happening simultaneously. You cannot simply "ask everyone and take a vote" because messages may never arrive, nodes may crash mid-vote, and the network may split into disconnected halves.
FLP Impossibility Theorem
The FLP result (Fischer, Lynch, and Paterson, 1985) states that no deterministic consensus protocol can guarantee termination in an asynchronous system in which even a single process may crash. It is the most important impossibility result in distributed computing. Let's break down each of those three conditions carefully:
- Deterministic: The protocol's behavior is fully determined by the messages it receives — no random coin flips.
- Asynchronous: There is no upper bound on message delivery time. A message sent now might arrive in 1ms or 1 year — you can't tell the difference between a slow node and a dead one.
- One faulty process: Even if only a single process can crash (fail-stop, not Byzantine), consensus becomes impossible to guarantee.
The proof works by showing that the system can always be in a bivalent state — a state from which both 0 and 1 are still possible outcomes. The adversary (scheduler) can always delay messages to keep the system in a bivalent state indefinitely, preventing termination.
FLP Proof Sketch:
1. Start: System is in a bivalent initial configuration
(both 0-valent and 1-valent states are reachable)
2. Key lemma: From any bivalent configuration, there exists
a single step (delivering one message) that leads to
another bivalent configuration.
3. An adversary can perpetually pick these "bivalence-preserving"
message deliveries → the system never decides.
4. Therefore: No deterministic protocol can guarantee termination
in an asynchronous system with even 1 crash failure.
Implications & Workarounds
FLP does not say consensus is impossible in practice. It says you cannot have all three of determinism, asynchrony, and guaranteed termination. Real systems break one of these assumptions:
| Workaround | What It Breaks | Example |
|---|---|---|
| Randomization | Determinism | Ben-Or's protocol, randomized Paxos |
| Failure detectors (timeouts) | Pure asynchrony | Paxos, Raft use election timeouts |
| Partial synchrony | Pure asynchrony | DLS protocol — assumes messages arrive within a bound eventually |
Paxos and Raft break pure asynchrony by using timeouts as failure detectors. They remain safe (never decide inconsistently) regardless of timing, but liveness (termination) depends on timeouts eventually being accurate — a partially synchronous assumption.
Paxos
Paxos, invented by Leslie Lamport in 1989 (published 1998), is the foundational consensus algorithm. It solves single-value consensus in an asynchronous system with crash failures. Lamport originally described it as a protocol used by part-time legislators on the Greek island of Paxos — a presentation so allegorical that it took the community years to fully parse it.
Roles
🎯 Proposer
Proposes values to the acceptors. Any node can be a proposer. In practice, a distinguished proposer (leader) reduces conflicts.
✅ Acceptor
Votes on proposals. A value is chosen when accepted by a majority (quorum) of acceptors.
📖 Learner
Learns the chosen value. Doesn't participate in the protocol — just observes the outcome.
In practice, a single physical node often plays all three roles simultaneously.
The Two Phases of Basic Paxos
Basic Paxos decides on a single value. It proceeds in two phases, each with a request and response:
Phase 1: Prepare / Promise
- Prepare(n): A proposer selects a proposal number n (globally unique, monotonically increasing) and sends a Prepare(n) request to a majority of acceptors.
- Promise(n, v?): An acceptor receiving Prepare(n) checks:
  - If n is greater than any proposal number it has already responded to, it promises not to accept any proposal with a number less than n.
  - If it has already accepted a proposal (m, v), it includes that in its promise: "I promise, and by the way, I already accepted value v at proposal m."
  - If n is not greater, it ignores (or NACKs) the prepare.
Phase 2: Accept / Accepted
- Accept(n, v): Once the proposer receives promises from a majority, it sends Accept(n, v) where:
  - If any promise included a previously accepted value, the proposer must use the value from the highest-numbered accepted proposal. This is the critical safety rule.
  - If no promise included a previously accepted value, the proposer can choose any value.
- Accepted(n, v): An acceptor receiving Accept(n, v) accepts it unless it has already promised to a higher proposal number. If accepted, it notifies the learners.
Basic Paxos — Single Value Consensus:
Proposer Acceptor A1 Acceptor A2 Acceptor A3
| | | |
|--- Prepare(1) --→ | | |
|--- Prepare(1) ----------------→ | |
|--- Prepare(1) --------------------------------→ |
| | | |
|←-- Promise(1,⊥) --| | |
|←-- Promise(1,⊥) --------------| | |
| | | |
| (majority received — 2 of 3) | |
| | | |
|--- Accept(1,"X") → | | |
|--- Accept(1,"X") --------------→ | |
|--- Accept(1,"X") --------------------------------→
| | | |
|←-- Accepted(1,"X")| | |
|←-- Accepted(1,"X")------------| | |
| | | |
| Value "X" is CHOSEN (majority accepted)
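To make the acceptor's rules concrete, here is a minimal, single-process Go sketch of an acceptor's state and its Promise/Accept handling. The types and method names are illustrative, not drawn from any particular Paxos implementation.
```go
package main

import "fmt"

type Proposal struct {
	N     int    // proposal number
	Value string // proposed value
}

type Acceptor struct {
	promisedN int       // highest proposal number promised (0 = none)
	accepted  *Proposal // highest-numbered proposal accepted, if any
}

// Prepare handles Prepare(n): promise not to accept anything below n,
// and report any previously accepted proposal.
func (a *Acceptor) Prepare(n int) (ok bool, prev *Proposal) {
	if n <= a.promisedN {
		return false, nil // ignore / NACK
	}
	a.promisedN = n
	return true, a.accepted
}

// Accept handles Accept(n, v): accept unless we promised a higher number.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.accepted = &Proposal{N: n, Value: v}
	return true
}

func main() {
	acc := &Acceptor{}
	ok, prev := acc.Prepare(1)
	fmt.Println(ok, prev)           // true <nil> — nothing accepted yet
	fmt.Println(acc.Accept(1, "X")) // true — value X accepted at proposal 1
	ok, prev = acc.Prepare(2)
	fmt.Println(ok, prev.Value)     // true X — a later proposer learns about X
}
```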
Why Paxos is Safe
The key insight is the Prepare phase constraint: if any value has already been accepted by a majority, any future proposer will discover that value during its Prepare phase (because any two majorities overlap in at least one member). The proposer is then forced to re-propose that value, ensuring the chosen value can never change.
With 2f+1 acceptors, any two majorities of f+1 share at least one member. This guarantees that a new proposer's Prepare phase will always intersect with any previously accepted majority.
Quorum Intersection Example (5 acceptors, majority = 3):
Majority 1: {A1, A2, A3} accepted value "X" at proposal 5
Majority 2: {A3, A4, A5} respond to Prepare(7)
A3 is in both majorities → A3 tells Proposer 2:
"I already accepted (5, 'X')"
Proposer 2 must re-propose "X" → safety preserved!
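The proposer-side half of that rule fits in a few lines: after Phase 1, re-propose the value attached to the highest-numbered accepted proposal among the promises, falling back to the proposer's own value only if no promise carried one. (The types below are hypothetical, for illustration.)
```go
package main

import "fmt"

type Promise struct {
	AcceptedN     int    // 0 if the acceptor had accepted nothing
	AcceptedValue string
}

// chooseValue implements the Phase 2 safety rule: use the value from the
// highest-numbered accepted proposal, or our own value if none exists.
func chooseValue(promises []Promise, own string) string {
	bestN, value := 0, own
	for _, p := range promises {
		if p.AcceptedN > bestN {
			bestN, value = p.AcceptedN, p.AcceptedValue
		}
	}
	return value
}

func main() {
	// A3 already accepted (5, "X"); the new proposer must re-propose "X".
	promises := []Promise{{0, ""}, {0, ""}, {5, "X"}}
	fmt.Println(chooseValue(promises, "Y")) // X
}
```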
Dueling Proposers (Livelock)
Basic Paxos has a liveness issue: two proposers can continuously preempt each other's Prepare phases, creating livelock:
Livelock scenario:
P1: Prepare(1) → gets promises → about to send Accept(1,X)
P2: Prepare(2) → preempts P1's promises
P1: sees rejection → Prepare(3) → preempts P2
P2: sees rejection → Prepare(4) → preempts P1
... neither ever reaches Accept phase
The standard fix is to elect a distinguished proposer (leader). Only the leader proposes, eliminating conflicts. If the leader crashes, a new leader is elected. This is exactly what Multi-Paxos and Raft formalize.
Multi-Paxos
Basic Paxos decides one value. To build a replicated log, we need to agree on a sequence of values — one Paxos instance per log slot. Multi-Paxos optimizes this by electing a stable leader who skips Phase 1 for subsequent slots:
Multi-Paxos Optimization:
Slot 1: Full Paxos (Phase 1 + Phase 2) — leader established
Slot 2: Phase 2 only (leader already has promises)
Slot 3: Phase 2 only
Slot 4: Phase 2 only
...
If leader crashes → new leader runs full Phase 1 for next slot
Steady-state: 1 round-trip per consensus decision
(vs 2 round-trips for basic Paxos)
Multi-Paxos is what systems like Google's Chubby and Spanner actually implement. However, Lamport's papers leave many practical details unspecified — leader election, log compaction, membership changes, snapshotting — which is why Raft was created.
Raft: Understandable Consensus
Raft was designed by Diego Ongaro and John Ousterhout (2014) with one overriding goal: understandability. They conducted a user study at Stanford showing that students learned Raft significantly faster and more correctly than Paxos. The key design choice was decomposition: Raft separates consensus into three relatively independent sub-problems.
- Leader Election — how to elect a leader, and what happens when it fails.
- Log Replication — how the leader replicates commands to followers.
- Safety — what guarantees ensure the log is never inconsistent.
Node States
Every Raft node is in exactly one of three states:
State Transitions:
┌──────────┐ election timeout ┌───────────┐ receives majority ┌────────┐
│ Follower │ ─────────────────→ │ Candidate │ ──────────────────→ │ Leader │
└──────────┘ └───────────┘ └────────┘
↑ │ ↑ │
│ discovers current leader │ │ election timeout │
│ or new term │ │ (split vote → retry) │
│←──────────────────────────────┘ └─────────────────────────────│
│ │
└────────────────── discovers higher term ───────────────────────┘
All nodes start as Followers. A follower passively receives RPCs from the leader. If it doesn't hear from a leader within the election timeout (randomized, typically 150–300ms), it becomes a Candidate and starts an election. If it wins, it becomes the Leader.
Terms
Raft divides time into terms, numbered with consecutive integers. Each term begins with an election. Terms act as a logical clock in Raft — they allow nodes to detect stale leaders and obsolete information.
Term Timeline:
Term 1 Term 2 Term 3 Term 4 Term 5
┌──────────────┐┌──────────┐┌──────────┐┌────────────┐┌────────→
│ Election → L1││Election→L2││Election ││Election→ L4││Election→L5
│ normal ops ││normal ops ││(no ││normal ops ││normal ops
│ ││ ││ winner) ││ ││
└──────────────┘└──────────┘└──────────┘└────────────┘└────────→
↑
split vote — term with no leader
Rules about terms (see the sketch after this list):
- Each term has at most one leader.
- If a node receives a message with a higher term, it immediately updates its term and reverts to Follower.
- If a node receives a message with a lower term, it rejects the message.
- Terms are the fundamental mechanism for leader deduplication — they ensure two nodes never believe they are leader for the same term.
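A minimal sketch of the term bookkeeping every RPC handler performs first — the Node type and names here are illustrative:
```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	currentTerm int
	state       State
}

// observeTerm applies the two term rules: a higher term demotes us to
// Follower; a lower term means the message is stale and must be rejected.
func (n *Node) observeTerm(msgTerm int) (accept bool) {
	if msgTerm > n.currentTerm {
		n.currentTerm = msgTerm
		n.state = Follower
	}
	return msgTerm >= n.currentTerm
}

func main() {
	n := &Node{currentTerm: 5, state: Leader}
	fmt.Println(n.observeTerm(4), n.state) // false Leader — stale message rejected
	fmt.Println(n.observeTerm(6), n.state) // true Follower — newer term observed
}
```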
Leader Election
When a follower's election timeout expires without receiving an AppendEntries heartbeat from the leader, it starts an election:
- Increment current term by 1.
- Transition to Candidate state.
- Vote for itself.
- Send RequestVote RPCs to all other nodes, including:
  - term — the candidate's current term.
  - candidateId — the candidate's identifier.
  - lastLogIndex — index of the candidate's last log entry.
  - lastLogTerm — term of the candidate's last log entry.
Three outcomes are possible:
- Wins: Receives votes from a majority (including its own vote). Becomes Leader. Immediately sends heartbeats to all nodes to establish authority and prevent new elections.
- Loses: Receives an AppendEntries from a node with a term ≥ its own. Reverts to Follower — someone else won.
- Split vote: Neither wins. Election timeout expires again → increment term → start new election.
RequestVote RPC:
Arguments:
term — candidate's term
candidateId — candidate requesting vote
lastLogIndex — index of candidate's last log entry
lastLogTerm — term of candidate's last log entry
Results:
term — currentTerm, for candidate to update itself
voteGranted — true means candidate received vote
Receiver implementation:
1. Reply false if term < currentTerm
2. If votedFor is null or candidateId, AND candidate's log
is at least as up-to-date as receiver's log, grant vote
The Election Restriction (Safety)
A voter only grants its vote if the candidate's log is at least as up-to-date as the voter's log. "Up-to-date" is compared lexicographically: first by last log entry's term, then by log length.
Log comparison (who is more up-to-date?):
Node A: [T1:x, T1:y, T3:z] → lastLogTerm=3, lastLogIndex=3
Node B: [T1:x, T1:y, T2:w] → lastLogTerm=2, lastLogIndex=3
Node C: [T1:x, T1:y] → lastLogTerm=1, lastLogIndex=2
A > B (higher last term: 3 > 2)
B > C (higher last term: 2 > 1)
A > C (higher last term: 3 > 1)
Only A can become leader — it has the most up-to-date log.
This restriction guarantees the Leader Completeness Property: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms. This is Raft's key safety guarantee.
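A compact sketch of the follower-side vote logic, combining the receiver rules from the RPC description with the up-to-date comparison above. The types are illustrative, not taken from any specific library.
```go
package main

import "fmt"

type Entry struct{ Term int }

type Node struct {
	currentTerm int
	votedFor    string // "" = none
	log         []Entry
}

type RequestVoteArgs struct {
	Term         int
	CandidateID  string
	LastLogIndex int
	LastLogTerm  int
}

func (n *Node) lastLogTerm() int {
	if len(n.log) == 0 {
		return 0
	}
	return n.log[len(n.log)-1].Term
}

// upToDate reports whether the candidate's log is at least as up-to-date
// as ours: compare last terms first, then log length.
func (n *Node) upToDate(lastLogTerm, lastLogIndex int) bool {
	if lastLogTerm != n.lastLogTerm() {
		return lastLogTerm > n.lastLogTerm()
	}
	return lastLogIndex >= len(n.log)
}

func (n *Node) RequestVote(args RequestVoteArgs) bool {
	if args.Term < n.currentTerm {
		return false // stale candidate
	}
	if args.Term > n.currentTerm {
		n.currentTerm, n.votedFor = args.Term, "" // newer term: reset our vote
	}
	if (n.votedFor == "" || n.votedFor == args.CandidateID) &&
		n.upToDate(args.LastLogTerm, args.LastLogIndex) {
		n.votedFor = args.CandidateID
		return true
	}
	return false
}

func main() {
	// Voter's log ends in term 3; a candidate whose log ends in term 2 is refused.
	voter := &Node{currentTerm: 3, log: []Entry{{1}, {1}, {3}}}
	fmt.Println(voter.RequestVote(RequestVoteArgs{Term: 4, CandidateID: "B", LastLogIndex: 3, LastLogTerm: 2})) // false
	fmt.Println(voter.RequestVote(RequestVoteArgs{Term: 4, CandidateID: "A", LastLogIndex: 3, LastLogTerm: 3})) // true
}
```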
Split Vote Handling
Split votes occur when multiple nodes time out simultaneously and become candidates. Raft handles this with randomized election timeouts:
Split Vote Prevention:
Each node picks a random election timeout from [150ms, 300ms].
Scenario: Leader crashes at t=0
Node B timeout: 173ms → becomes Candidate at t=173ms
Node C timeout: 241ms → would become Candidate at t=241ms
Node D timeout: 198ms → would become Candidate at t=198ms
Node E timeout: 289ms → would become Candidate at t=289ms
Node B starts election first (t=173ms).
Network round-trip ≈ 15ms, so by t=188ms:
B has votes from B, C, D, E (majority reached before D times out)
B becomes leader at t=188ms.
D's timer would fire at t=198ms, but it has already received B's heartbeat → stays Follower.
Split votes are rare because the timeout range (150ms) is much
larger than the network round-trip time (~15ms).
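The mechanism itself is tiny. Here is a sketch using Go's standard library and the 150–300ms range from the example — a fresh timeout is drawn every time the election timer resets.
```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout draws a fresh randomized timeout from [150ms, 300ms).
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	for _, node := range []string{"B", "C", "D", "E"} {
		fmt.Printf("node %s election timeout: %v\n", node, electionTimeout())
	}
}
```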
Log Replication
Once a leader is elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machine. The leader appends the command to its log as a new entry, then replicates it to followers using AppendEntries RPCs.
AppendEntries RPC:
Arguments:
term — leader's term
leaderId — so followers can redirect clients
prevLogIndex — index of log entry immediately preceding new ones
prevLogTerm — term of prevLogIndex entry
entries[] — log entries to store (empty for heartbeat)
leaderCommit — leader's commitIndex
Results:
term — currentTerm, for leader to update itself
success — true if follower contained entry matching
prevLogIndex and prevLogTerm
Receiver implementation:
1. Reply false if term < currentTerm
2. Reply false if log doesn't contain entry at prevLogIndex
whose term matches prevLogTerm
3. If existing entry conflicts with new one (same index,
different term), delete it and all that follow
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex, set commitIndex =
min(leaderCommit, index of last new entry)
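Here is a self-contained sketch of the follower side of AppendEntries, following the five receiver steps above. Indices are 1-based as in the Raft paper (log[i-1] holds the entry at index i); the types are illustrative.
```go
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

type Follower struct {
	currentTerm int
	log         []Entry
	commitIndex int
}

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []Entry
	LeaderCommit int
}

func (f *Follower) AppendEntries(a AppendEntriesArgs) bool {
	// 1. Reject stale leaders; otherwise adopt the newer term.
	if a.Term < f.currentTerm {
		return false
	}
	if a.Term > f.currentTerm {
		f.currentTerm = a.Term
	}
	// 2. Consistency check: we must have a matching entry at PrevLogIndex.
	if a.PrevLogIndex > 0 {
		if len(f.log) < a.PrevLogIndex || f.log[a.PrevLogIndex-1].Term != a.PrevLogTerm {
			return false
		}
	}
	// 3 + 4. Delete conflicting entries, then append any new ones.
	for i, e := range a.Entries {
		idx := a.PrevLogIndex + 1 + i // 1-based index of this entry
		if len(f.log) >= idx && f.log[idx-1].Term != e.Term {
			f.log = f.log[:idx-1] // conflict: truncate this entry and all after
		}
		if len(f.log) < idx {
			f.log = append(f.log, e)
		}
	}
	// 5. Advance commitIndex.
	if a.LeaderCommit > f.commitIndex {
		f.commitIndex = min(a.LeaderCommit, a.PrevLogIndex+len(a.Entries))
	}
	return true
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Follower has an extra uncommitted entry {2,"f"} that the leader overwrites.
	f := &Follower{currentTerm: 2, log: []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "f"}}}
	ok := f.AppendEntries(AppendEntriesArgs{
		Term: 3, PrevLogIndex: 3, PrevLogTerm: 2,
		Entries: []Entry{{3, "d"}, {3, "e"}}, LeaderCommit: 3,
	})
	fmt.Println(ok, f.log, f.commitIndex) // true [{1 a} {1 b} {2 c} {3 d} {3 e}] 3
}
```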
Commit Rules
A log entry is committed once the leader has replicated it to a majority of nodes. Once committed, the entry is durable — it will eventually be applied to every node's state machine.
Commit Process:
Leader log: [T1:x] [T1:y] [T2:z] [T3:w]
✓ ✓ ✓ ?
committed committed committed uncommitted
Node A (Leader): [T1:x] [T1:y] [T2:z] [T3:w] commitIndex=3
Node B: [T1:x] [T1:y] [T2:z] [T3:w] commitIndex=3
Node C: [T1:x] [T1:y] [T2:z] commitIndex=3
Node D: [T1:x] [T1:y] [T2:z] [T3:w] commitIndex=3
Node E: [T1:x] [T1:y] commitIndex=2
Entry [T3:w] at index 4:
Replicated on: A, B, D = 3 out of 5 = majority ✓
→ Leader advances commitIndex to 4
→ Leader tells followers via next AppendEntries
→ All nodes apply [T3:w] to state machine
Figure 8 Scenario (Why the Commit Rule Matters):
Time Leader Action
──── ────── ──────
T2 S1 Replicates [T2:cmd] to S1, S2 only → NOT committed
T3 S5 Elected leader (votes from S3,S4,S5), appends [T3:cmd'] to its own log only, then crashes
T4 S1 Re-elected! Keeps replicating [T2:cmd] → now on S1,S2,S3
Can S1 commit [T2:cmd] just because it's on a majority now?
NO! S5 might still win election and overwrite it.
S1 must first replicate a NEW entry [T4:something] to a majority.
Once [T4:something] is committed, all entries before it
(including [T2:cmd]) are implicitly committed.
This is the "never commit entries from previous terms by counting
replicas" rule — one of the subtlest points in Raft.
Log Matching Property
Raft maintains the following invariant:
- If two entries in different logs have the same index and same term, they store the same command.
- If two entries in different logs have the same index and same term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry per index per term. The second follows from the consistency check in AppendEntries: the leader includes prevLogIndex and prevLogTerm, and the follower rejects the RPC if its log doesn't match at that point. This creates an inductive proof that the logs are consistent.
Log Repair (Handling Inconsistencies)
After a leader crash, follower logs may diverge — they may have missing entries, extra uncommitted entries, or both. The leader repairs this by finding the latest point of agreement with each follower:
Log Repair with nextIndex:
Leader maintains nextIndex[i] for each follower i.
Initially set to leader's last log index + 1.
If AppendEntries fails (consistency check):
leader decrements nextIndex[i]
retries AppendEntries with earlier entries
Repeats until follower accepts → logs are now consistent.
Example:
Leader log: [1:a] [1:b] [2:c] [3:d] [3:e]
Follower: [1:a] [1:b] [2:c] [2:f] ← diverged at index 4
Step 1: Leader sends entry 5 with prevLogIndex=4, prevLogTerm=3
Follower: "I don't have term 3 at index 4" → REJECT
Step 2: Leader sends entries 4,5 with prevLogIndex=3, prevLogTerm=2
Follower: "I have term 2 at index 3!" → ACCEPT
Follower deletes [2:f], appends [3:d] [3:e]
Follower log now: [1:a] [1:b] [2:c] [3:d] [3:e] ✓
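The same repair sequence as a self-contained sketch: the "leader" backs off nextIndex one entry per rejection until the follower's consistency check passes, then overwrites the divergent suffix. All names are illustrative, and the follower logic here is deliberately simplified (it always rewrites the suffix once the check passes); the decrement-by-one back-off is the basic form, and real implementations often back up faster.
```go
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

// followerAppend performs the consistency check, then replaces everything
// after prevIndex with the leader's entries (simplified AppendEntries steps 2-4).
func followerAppend(log *[]Entry, prevIndex, prevTerm int, entries []Entry) bool {
	if prevIndex > 0 && (len(*log) < prevIndex || (*log)[prevIndex-1].Term != prevTerm) {
		return false // no matching entry at prevIndex → reject
	}
	*log = append((*log)[:prevIndex], entries...)
	return true
}

func main() {
	leader := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {3, "d"}, {3, "e"}}
	follower := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "f"}} // diverged at index 4

	nextIndex := len(leader) + 1 // start optimistic: leader's last index + 1
	for {
		prev := nextIndex - 1
		prevTerm := 0
		if prev > 0 {
			prevTerm = leader[prev-1].Term
		}
		if followerAppend(&follower, prev, prevTerm, leader[prev:]) {
			break // logs are now consistent
		}
		nextIndex-- // rejected: back up one entry and retry
	}
	fmt.Println(follower) // [{1 a} {1 b} {2 c} {3 d} {3 e}]
}
```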
Log Compaction & Snapshots
If the log grows without bound, it will eventually consume all storage and take forever to replay on restart. Raft uses snapshotting to compact the log:
- Each node independently takes a snapshot of its state machine state at a committed log index.
- All log entries up to that index are discarded.
- The snapshot includes:
- The state machine state.
- lastIncludedIndex — the last log entry included in the snapshot.
- lastIncludedTerm — the term of that entry.
Snapshot:
Before: Log = [1:a][1:b][2:c][2:d][3:e][3:f][3:g][4:h]
↑ commitIndex=8
After snapshot at index 5:
Snapshot = { state: {...}, lastIncludedIndex: 5, lastIncludedTerm: 3 }
Log = [3:f][3:g][4:h]
Memory/disk savings: 5 entries removed
If a follower is so far behind that the leader has already discarded the entries it needs, the leader sends an InstallSnapshot RPC instead of log entries. The follower replaces its state machine with the snapshot and resumes replication from there.
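A minimal sketch of the compaction step itself, assuming an illustrative Snapshot type and a log addressed by 1-based indices:
```go
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

type Snapshot struct {
	State             map[string]string
	LastIncludedIndex int
	LastIncludedTerm  int
}

// compact snapshots the state machine up to index `upto` (which must be
// committed) and returns the snapshot plus the truncated log. firstIndex is
// the 1-based index of the first entry currently held in `log`.
func compact(state map[string]string, log []Entry, firstIndex, upto int) (Snapshot, []Entry) {
	snap := Snapshot{
		State:             state,
		LastIncludedIndex: upto,
		LastIncludedTerm:  log[upto-firstIndex].Term,
	}
	return snap, log[upto-firstIndex+1:]
}

func main() {
	log := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "d"}, {3, "e"}, {3, "f"}, {3, "g"}, {4, "h"}}
	snap, rest := compact(map[string]string{"k": "v"}, log, 1, 5)
	fmt.Printf("snapshot: idx=%d term=%d, remaining log: %v\n",
		snap.LastIncludedIndex, snap.LastIncludedTerm, rest)
	// snapshot: idx=5 term=3, remaining log: [{3 f} {3 g} {4 h}]
}
```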
Cluster Membership Changes
Adding or removing nodes from a Raft cluster is handled via joint consensus — a two-phase approach where the cluster transitions through a configuration that requires majorities from both the old and new configurations:
Joint Consensus for Membership Change:
C_old = {A, B, C}
C_new = {A, B, C, D, E}
Phase 1: Leader commits C_old,new (joint configuration)
Decisions require majorities from BOTH C_old AND C_new.
e.g., need 2-of-{A,B,C} AND 3-of-{A,B,C,D,E}
Phase 2: Leader commits C_new
Now only C_new majority required.
This prevents "split-brain" — at no point can two independent
majorities exist that could elect different leaders.
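The quorum rule during the joint phase is easy to express directly — a small sketch with illustrative names:
```go
package main

import "fmt"

// majority reports whether the given votes cover a majority of cfg.
func majority(cfg []string, votes map[string]bool) bool {
	n := 0
	for _, node := range cfg {
		if votes[node] {
			n++
		}
	}
	return n > len(cfg)/2
}

// jointQuorum requires a majority of BOTH the old and the new configuration.
func jointQuorum(cOld, cNew []string, votes map[string]bool) bool {
	return majority(cOld, votes) && majority(cNew, votes)
}

func main() {
	cOld := []string{"A", "B", "C"}
	cNew := []string{"A", "B", "C", "D", "E"}

	// A and B alone: a majority of C_old, but only 2 of 5 in C_new → not enough.
	fmt.Println(jointQuorum(cOld, cNew, map[string]bool{"A": true, "B": true})) // false
	// A, B, D, E: 2-of-3 in C_old and 4-of-5 in C_new → decision allowed.
	fmt.Println(jointQuorum(cOld, cNew, map[string]bool{"A": true, "B": true, "D": true, "E": true})) // true
}
```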
Raft in Practice
etcd
etcd is a distributed key-value store built on Raft. It's the backbone of Kubernetes — storing all cluster state: pod definitions, service endpoints, ConfigMaps, Secrets, RBAC rules, and more.
# etcd cluster operations
# Write a key
$ etcdctl put /registry/services/nginx '{"port": 80, "replicas": 3}'
OK
# Read a key
$ etcdctl get /registry/services/nginx
/registry/services/nginx
{"port": 80, "replicas": 3}
# Watch for changes on a key
$ etcdctl watch /registry/services/nginx
PUT
/registry/services/nginx
{"port": 80, "replicas": 5}
# Check cluster health
$ etcdctl endpoint health --cluster
https://etcd1:2379 is healthy: committed proposal: took = 1.5ms
https://etcd2:2379 is healthy: committed proposal: took = 1.8ms
https://etcd3:2379 is healthy: committed proposal: took = 2.1ms
# Check who is the Raft leader
$ etcdctl endpoint status --cluster -w table
+------------------------+------------------+--------+---------+
| ENDPOINT | ID | LEADER | RAFT IDX|
+------------------------+------------------+--------+---------+
| https://etcd1:2379 | 8e9e05c52164 | true | 45892 |
| https://etcd2:2379 | 4b3e89d1a234 | false | 45892 |
| https://etcd3:2379 | 91c7b2ef8901 | false | 45890 |
+------------------------+------------------+--------+---------+
etcd implements Raft with several production-grade enhancements:
- Lease-based leader election: Prevents split-brain by requiring the leader to periodically renew its lease.
- Linearizable reads: By default, reads go through Raft (ReadIndex protocol) to guarantee linearizability. This adds ~1 RTT to reads but ensures you never read stale data.
- Watch API: Clients can subscribe to key changes — essential for Kubernetes controllers.
- MVCC storage: etcd stores all revisions of a key, enabling point-in-time queries and efficient watches.
- Snapshot + WAL: Combines periodic snapshots with a write-ahead log for durability and fast recovery.
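For programmatic access, the same operations look roughly like this with the official Go client — a sketch assuming the go.etcd.io/etcd/client/v3 package, placeholder endpoints, and trimmed error handling:
```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the cluster (endpoints are placeholders).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd1:2379", "https://etcd2:2379", "https://etcd3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The write goes through Raft: acknowledged only after a majority has it.
	if _, err := cli.Put(ctx, "/registry/services/nginx", `{"port": 80, "replicas": 3}`); err != nil {
		panic(err)
	}

	// Linearizable read (the default), served via the ReadIndex protocol.
	resp, err := cli.Get(ctx, "/registry/services/nginx")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
}
```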
Consul
HashiCorp Consul uses Raft for its service mesh and KV store. Like etcd, it elects a leader among server nodes, which then replicates all state changes. Consul adds service discovery, health checking, and a Connect service mesh on top of Raft.
# Consul Raft operations
# Check Raft peers
$ consul operator raft list-peers
Node ID Address State Voter
dc1-s1 aedf4c8a-7b5c-8d2e-9f1a-3c4b5d6e7f8a 10.0.0.1:8300 leader true
dc1-s2 bf1e5d9b-8c6d-9e3f-a02b-4d5c6e7f8091 10.0.0.2:8300 follower true
dc1-s3 c0f26ea3-9d7e-af40-b13c-5e6d7f809102 10.0.0.3:8300 follower true
CockroachDB
CockroachDB uses Raft at a different granularity: each range (a contiguous slice of the keyspace, typically ~512MB) has its own Raft group. A single CockroachDB cluster may have thousands of independent Raft groups, each with its own leader. This design enables:
- Horizontal scalability: Leadership is distributed across nodes, avoiding a single bottleneck.
- Locality-aware replication: Ranges can have different replication topologies (e.g., pin certain ranges to specific regions for low-latency reads).
- Parallel consensus: Different ranges can commit independently, enabling high throughput.
CockroachDB Range-Level Raft:
Keyspace: [a ─────── g ─────── m ─────── s ─────── z]
Range 1 Range 2 Range 3 Range 4
Range 1: Raft group {Node1★, Node2, Node3} ★ = leader
Range 2: Raft group {Node2★, Node3, Node4}
Range 3: Raft group {Node3★, Node4, Node1}
Range 4: Raft group {Node4★, Node1, Node2}
Each range has its own Raft leader → leadership is distributed!
Transaction spanning Range 1 + Range 3 uses parallel Raft commits.
ZAB (ZooKeeper Atomic Broadcast)
ZAB is the consensus protocol used by Apache ZooKeeper. It was designed specifically for primary-backup systems and predates the Raft paper. While similar to Raft in many ways, ZAB has some key differences:
- Primary-ordered atomic broadcast: ZAB guarantees that all writes from a single primary are delivered in order, and that writes from different primaries are totally ordered across all replicas.
- Two modes: Recovery (leader election + synchronization) and Broadcast (normal operation).
- Epoch-based: Like Raft's terms, ZAB uses epochs. Each new leader increments the epoch, and all nodes adopt the new epoch during recovery.
ZAB Protocol Phases:
Phase 0: Leader Election
- Nodes exchange votes. Node with highest (epoch, zxid) wins.
- Similar to Raft's RequestVote but uses a different comparison.
Phase 1: Discovery
- New leader collects followers' histories.
- Determines the most up-to-date follower history.
Phase 2: Synchronization
- Leader sends its history to all followers.
- Followers update their state to match.
- Analogous to Raft's log repair but done eagerly on election.
Phase 3: Broadcast (normal operation)
- Leader proposes writes (PROPOSAL → ACK → COMMIT).
- 2-phase commit: wait for majority ACK, then COMMIT.
- Similar to Raft's AppendEntries.
Key differences from Raft:
- ZAB uses transaction IDs (zxids) — 64-bit values where the high 32 bits are the epoch and the low 32 bits are a counter. This embeds the "term" inside the transaction ID (see the sketch after this list).
- ZAB performs full synchronization on leader election (Phase 2). Raft repairs logs incrementally during normal operation.
- ZAB's broadcast is a two-phase commit (PROPOSAL → ACK → COMMIT), while Raft uses a single AppendEntries RPC with piggybacked commit information.
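A small sketch of that zxid layout — epoch in the high 32 bits, counter in the low 32 — so zxids compare correctly as plain integers:
```go
package main

import "fmt"

// makeZxid packs an epoch and a per-epoch counter into a single 64-bit zxid.
func makeZxid(epoch, counter uint32) uint64 {
	return uint64(epoch)<<32 | uint64(counter)
}

// splitZxid recovers the epoch and counter from a zxid.
func splitZxid(zxid uint64) (epoch, counter uint32) {
	return uint32(zxid >> 32), uint32(zxid & 0xFFFFFFFF)
}

func main() {
	z := makeZxid(3, 42) // 3rd epoch, 42nd transaction
	e, c := splitZxid(z)
	fmt.Printf("zxid=%#x epoch=%d counter=%d\n", z, e, c)

	// zxids from a later epoch always compare greater, regardless of counter.
	fmt.Println(makeZxid(4, 0) > makeZxid(3, 1_000_000)) // true
}
```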
Viewstamped Replication (VR)
Viewstamped Replication, designed by Brian Oki and Barbara Liskov (1988), is one of the earliest replicated state machine protocols. It predates both Paxos and Raft and introduced many of the ideas that Raft later adopted.
- Views: Equivalent to Raft's terms. Each view has a designated primary (leader). View changes are triggered by timeout-based failure detection.
- View Change Protocol: When the primary is suspected of failing, replicas initiate a view change — similar to Raft's leader election but more complex.
- Normal Operation: The primary receives client requests, assigns them operation numbers, and replicates them to backups. Once a majority acknowledges, the primary executes the operation and replies to the client.
Viewstamped Replication — Normal Operation:
Client → Primary: REQUEST(op, client-id, request-number)
Primary → Backups: PREPARE(view, op-number, op, commit-number)
Backups → Primary: PREPAREOK(view, op-number, replica-id)
Once majority PREPAREOK received:
Primary executes op → Client: REPLY(view, request-number, result)
Primary piggybacks commit-number in next PREPARE
Almost identical to Raft's AppendEntries flow!
VR's view change is more complex than Raft's leader election because VR doesn't use persistent storage — it recovers state from other replicas rather than from disk. The VR Revisited paper (2012) simplified this significantly.
Protocol Comparison
| Property | Paxos | Raft | ZAB | Viewstamped Replication |
|---|---|---|---|---|
| Year | 1989 (pub. 1998) | 2014 | 2008 | 1988 (revised 2012) |
| Author(s) | Lamport | Ongaro & Ousterhout | Junqueira et al. | Oki & Liskov |
| Leader Required? | Optional (Multi-Paxos yes) | Yes (strong leader) | Yes (primary) | Yes (primary) |
| Logical Clock | Proposal numbers | Terms | Epochs + zxid | View numbers |
| Phases per Decision | 2 (basic), 1 (Multi-Paxos steady state) | 1 (AppendEntries) | 2 (PROPOSAL→COMMIT) | 1 (PREPARE→PREPAREOK) |
| Log Repair | Not specified | Incremental (nextIndex decrement) | Eager (on leader election) | Eager (on view change) |
| Understandability | Low (notoriously hard) | High (designed for it) | Medium | Medium |
| Used By | Chubby, Spanner | etcd, Consul, CockroachDB, TiKV | ZooKeeper | Harp, PBFT (influenced) |
| Fault Tolerance | f of 2f+1 crash failures | f of 2f+1 crash failures | f of 2f+1 crash failures | f of 2f+1 crash failures |
| Persistent State | Required (voted-for, accepted) | Required (currentTerm, votedFor, log) | Required (epoch, history) | Not required (recovers from peers) |
Correctness Arguments
The Raft paper proves five key properties. Together they establish that Raft implements a correct replicated state machine:
| Property | Statement | Mechanism |
|---|---|---|
| Election Safety | At most one leader per term | Each node votes at most once per term. Majority required to win. Two majorities must overlap → at most one winner. |
| Leader Append-Only | A leader never overwrites or deletes entries in its log | Leader only appends new entries. Never modifies existing entries. |
| Log Matching | If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index | AppendEntries consistency check: prevLogIndex + prevLogTerm must match. Inductive argument from first entry. |
| Leader Completeness | If a log entry is committed in a given term, it will be present in the logs of all leaders for all higher terms | Election restriction: voter rejects candidates with less up-to-date logs. Committed entry is on a majority → any future majority overlaps → future leader has the entry. |
| State Machine Safety | If a node has applied a log entry at a given index to its state machine, no other node will ever apply a different log entry at that index | Follows from Log Matching + Leader Completeness. If entry is applied, it's committed. By Leader Completeness, all future leaders have it. By Log Matching, no different entry can exist at that index. |
Practical Considerations
Performance Characteristics
Raft Latency Breakdown:
Client write:
1. Client → Leader: ~0.5ms (within datacenter)
2. Leader appends to local log: ~0.1ms (fsync to WAL)
3. Leader → Followers: ~0.5ms (parallel RPCs)
4. Followers append + respond: ~0.6ms (fsync + network)
5. Leader commits + responds: ~0.1ms
────────────────────────────────────────
Total write latency: ~1.8ms (within DC)
Cross-region (multi-DC): ~50-150ms (WAN RTT dominates)
Client read (linearizable):
Option A: Read through Raft log: ~1.8ms (same as write)
Option B: ReadIndex (leader check): ~0.8ms (1 RTT to confirm leadership)
Option C: Lease-based reads: ~0.1ms (local read, requires clock sync)
Throughput:
Typical: 10,000-50,000 writes/sec (3-node cluster, SSD)
etcd benchmark: ~16,000 writes/sec (256-byte values, 3 nodes)
Cluster Sizing
Recommended Cluster Sizes:
3 nodes: Tolerates 1 failure. Minimum for production.
Best for: small services, dev/staging, low-latency requirements.
5 nodes: Tolerates 2 failures. Standard for critical services.
Best for: Kubernetes control plane, production databases.
7 nodes: Tolerates 3 failures. Rarely needed.
Best for: cross-region deployments (2 regions × 3 + 1 witness).
Why always odd?
Even numbers waste a node without improving fault tolerance:
4 nodes → tolerates 1 failure (same as 3!)
6 nodes → tolerates 2 failures (same as 5!)
Formula: f failures tolerated with 2f+1 nodes.
Common Pitfalls
- Clock skew: Raft doesn't depend on synchronized clocks for safety, but election timeouts behave poorly if clocks drift significantly. Lease-based reads do depend on bounded clock skew.
- Disk latency: Every Raft write requires an fsync before responding. Slow disks (spinning rust, overloaded EBS) directly increase commit latency. Use SSDs.
- Network partitions: A partitioned leader continues to accept writes that will never commit (no majority). Clients connected to the partitioned leader will experience timeouts. Implement client-side retries with leader rediscovery.
- Large snapshots: If the state machine is large (GBs), snapshot transfer to a lagging follower can saturate the network. Throttle snapshot transfers and use incremental approaches where possible.
- Thundering herd on leader failure: When the leader dies, all followers time out simultaneously. Randomized timeouts help, but in pathological cases you may see several failed elections before a leader is established.
Summary
Consensus is the foundation of reliable distributed systems. The FLP impossibility result tells us we can't have everything — determinism, asynchrony, and guaranteed termination — so practical protocols like Paxos, Raft, and ZAB trade pure asynchrony for timeout-based failure detection.
- Paxos is the theoretical foundation — provably correct, but notoriously difficult to implement.
- Raft is Paxos made understandable — same safety guarantees, cleaner decomposition, widely adopted in production systems.
- ZAB powers ZooKeeper with eager synchronization and epoch-based ordering.
- Viewstamped Replication introduced many ideas that Raft later formalized, including view-based leader election and replicated state machines.
In the next post, we explore vector clocks — the mechanism for tracking causality and detecting conflicts in systems that don't use consensus for every operation.