High Level Design Series · Distributed Systems · Part 1

Consensus: Paxos & Raft

The Consensus Problem

At the heart of every reliable distributed system lies a deceptively simple question: how do multiple nodes agree on a single value? This is the consensus problem, and it is the foundation upon which leader election, replicated state machines, distributed locks, atomic broadcasts, and total order delivery are all built.

Formally, a consensus protocol must satisfy three properties:

  1. Agreement: no two correct nodes decide different values.
  2. Validity: the decided value must have been proposed by some node.
  3. Termination: every correct node eventually decides on a value.

In practice, we also care about total order — not just agreeing on one value, but on a sequence of values (a replicated log). This is what turns consensus into a replicated state machine: every node applies the same commands in the same order, so all nodes converge to the same state.

Replicated State Machine:
  Client → propose(cmd) → Consensus Module → Log
                                               ↓
                             State Machine applies log[0], log[1], log[2]...
                                               ↓
                             All replicas reach identical state
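
To make the diagram concrete, here is a minimal sketch of the apply loop, assuming a simple key-value state machine (the Command type and its fields are illustrative, not tied to any particular protocol): feed the same committed log, in order, into the same deterministic apply function, and every replica ends in the same state.

  package main

  import "fmt"

  // Command is a deterministic operation on the replicated state machine.
  type Command struct {
      Key, Value string
  }

  // StateMachine applies committed log entries in order.
  type StateMachine struct {
      state       map[string]string
      lastApplied int // number of log entries applied so far
  }

  func NewStateMachine() *StateMachine {
      return &StateMachine{state: make(map[string]string)}
  }

  // Apply consumes the committed prefix of the log. Because every replica
  // applies the same commands in the same order, all replicas converge.
  func (sm *StateMachine) Apply(log []Command, commitIndex int) {
      for sm.lastApplied < commitIndex {
          cmd := log[sm.lastApplied]
          sm.state[cmd.Key] = cmd.Value
          sm.lastApplied++
      }
  }

  func main() {
      log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}
      replicaA, replicaB := NewStateMachine(), NewStateMachine()
      replicaA.Apply(log, 3)
      replicaB.Apply(log, 3)
      fmt.Println(replicaA.state, replicaB.state) // identical state on both replicas
  }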

Why is this hard? Because distributed systems must contend with arbitrary message delays, message loss, network partitions, and node crashes — all happening simultaneously. You cannot simply "ask everyone and take a vote" because messages may never arrive, nodes may crash mid-vote, and the network may split into disconnected halves.

FLP Impossibility Theorem

Fischer, Lynch, and Paterson (1985): no deterministic consensus protocol can guarantee termination in an asynchronous system with even one faulty (crash-prone) process.

This is the most important impossibility result in distributed computing. Let's break it down carefully:

  1. Deterministic: the protocol uses no randomization; its behavior is fully determined by its inputs and the order in which messages arrive.
  2. Asynchronous: there is no bound on message delay or on relative processing speed, and no synchronized clocks, so a slow process cannot be distinguished from a crashed one.
  3. One faulty process: a single crash failure is enough; Byzantine behavior is not required.
  4. Guarantee: the protocol may still terminate in most runs; FLP says only that there is always at least one schedule of message deliveries in which it never decides.

The proof works by showing that the system can always be in a bivalent state — a state from which both 0 and 1 are still possible outcomes. The adversary (scheduler) can always delay messages to keep the system in a bivalent state indefinitely, preventing termination.

FLP Proof Sketch:

  1. Start: System is in a bivalent initial configuration
     (both 0-valent and 1-valent states are reachable)

  2. Key lemma: From any bivalent configuration, there exists
     a single step (delivering one message) that leads to
     another bivalent configuration.

  3. An adversary can perpetually pick these "bivalence-preserving"
     message deliveries → the system never decides.

  4. Therefore: No deterministic protocol can guarantee termination
     in an asynchronous system with even 1 crash failure.

Implications & Workarounds

FLP does not say consensus is impossible in practice. It says you cannot have all three of determinism, asynchrony, and guaranteed termination. Real systems break one of these assumptions:

  Workaround                      What It Breaks     Example
  ──────────                      ──────────────     ───────
  Randomization                   Determinism        Ben-Or's protocol, randomized Paxos
  Failure detectors (timeouts)    Pure asynchrony    Paxos, Raft use election timeouts
  Partial synchrony               Pure asynchrony    DLS protocol — assumes messages arrive within a bound eventually

Paxos and Raft break pure asynchrony by using timeouts as failure detectors. They remain safe (never decide inconsistently) regardless of timing, but liveness (termination) depends on timeouts eventually being accurate — a partially synchronous assumption.

"The FLP result doesn't mean we give up — it means we know exactly which assumption we're relaxing and why." — Nancy Lynch

Paxos

Paxos, invented by Leslie Lamport in 1989 (published 1998), is the foundational consensus algorithm. It solves single-value consensus in an asynchronous system with crash failures. Lamport originally described it as a protocol used by legislators on the fictional Greek island of Paxos — a presentation so allegorical that it took the community years to fully parse it.

Roles

🎯 Proposer

Proposes values to the acceptors. Any node can be a proposer. In practice, a distinguished proposer (leader) reduces conflicts.

✅ Acceptor

Votes on proposals. A value is chosen when accepted by a majority (quorum) of acceptors.

📖 Learner

Learns the chosen value. Doesn't participate in the protocol — just observes the outcome.

In practice, a single physical node often plays all three roles simultaneously.

The Two Phases of Basic Paxos

Basic Paxos decides on a single value. It proceeds in two phases, each with a request and response:

Phase 1: Prepare / Promise

  1. Prepare(n): A proposer selects a proposal number n (globally unique, monotonically increasing) and sends a Prepare(n) request to a majority of acceptors.
  2. Promise(n, v?): An acceptor receiving Prepare(n) checks:
    • If n is greater than any proposal number it has already responded to, it promises not to accept any proposal with a number less than n.
    • If it has already accepted a proposal (m, v), it includes that in its promise: "I promise, and by the way, I already accepted value v at proposal m."
    • If n is not greater, it ignores (or NACKs) the prepare.

Phase 2: Accept / Accepted

  1. Accept(n, v): Once the proposer receives promises from a majority, it sends Accept(n, v) where:
    • If any promise included a previously accepted value, the proposer must use the value from the highest-numbered accepted proposal. This is the critical safety rule.
    • If no promise included a previously accepted value, the proposer can choose any value.
  2. Accepted(n, v): An acceptor receiving Accept(n, v) accepts it unless it has already promised to a higher proposal number. If accepted, it notifies the learners.
Basic Paxos — Single Value Consensus:

  Proposer             Acceptor A1    Acceptor A2    Acceptor A3
     |                    |              |              |
     |--- Prepare(1) --→  |              |              |
     |--- Prepare(1) ----------------→   |              |
     |--- Prepare(1) --------------------------------→  |
     |                    |              |              |
     |←-- Promise(1,⊥) --|              |              |
     |←-- Promise(1,⊥) --------------|  |              |
     |                    |              |              |
     | (majority received — 2 of 3)     |              |
     |                    |              |              |
     |--- Accept(1,"X") → |              |              |
     |--- Accept(1,"X") --------------→  |              |
     |--- Accept(1,"X") --------------------------------→
     |                    |              |              |
     |←-- Accepted(1,"X")|              |              |
     |←-- Accepted(1,"X")------------|  |              |
     |                    |              |              |
     | Value "X" is CHOSEN (majority accepted)
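
The acceptor's side of both phases is small enough to sketch directly. The following is a minimal, single-node illustration of the rules above; the struct and field names are my own, and a real acceptor must persist promisedN and the accepted pair to stable storage before replying.

  package paxos

  // Promise is an acceptor's reply to Prepare(n).
  type Promise struct {
      OK            bool
      AcceptedN     int    // highest proposal number already accepted (0 if none)
      AcceptedValue string // value accepted with AcceptedN, if any
  }

  // Acceptor holds the state Paxos requires.
  type Acceptor struct {
      promisedN     int // highest n this acceptor has promised
      acceptedN     int // highest n this acceptor has accepted
      acceptedValue string
  }

  // HandlePrepare implements Phase 1: promise to ignore proposals below n
  // and report any value this acceptor has already accepted.
  func (a *Acceptor) HandlePrepare(n int) Promise {
      if n <= a.promisedN {
          return Promise{OK: false} // NACK: already promised an equal or higher n
      }
      a.promisedN = n
      return Promise{OK: true, AcceptedN: a.acceptedN, AcceptedValue: a.acceptedValue}
  }

  // HandleAccept implements Phase 2: accept (n, v) unless a higher-numbered
  // Prepare has been promised in the meantime.
  func (a *Acceptor) HandleAccept(n int, v string) bool {
      if n < a.promisedN {
          return false
      }
      a.promisedN = n
      a.acceptedN = n
      a.acceptedValue = v
      return true
  }

A proposer, on receiving a majority of OK promises, would scan them and, if any carries a previously accepted value, adopt the value with the highest AcceptedN before sending Accept; otherwise it is free to propose its own value.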

Why Paxos is Safe

The key insight is the Prepare phase constraint: if any value has already been accepted by a majority, any future proposer will discover that value during its Prepare phase (because any two majorities overlap in at least one member). The proposer is then forced to re-propose that value, ensuring the chosen value can never change.

Quorum Intersection: With 2f+1 acceptors, any two majorities of f+1 share at least one member. This guarantees that a new proposer's Prepare phase will always intersect with any previously accepted majority.
Quorum Intersection Example (5 acceptors, majority = 3):

  Majority 1: {A1, A2, A3} accepted value "X" at proposal 5
  Majority 2: {A3, A4, A5} respond to Prepare(7)

  A3 is in both majorities → A3 tells Proposer 2:
    "I already accepted (5, 'X')"
  
  Proposer 2 must re-propose "X" → safety preserved!

Dueling Proposers (Livelock)

Basic Paxos has a liveness issue: two proposers can continuously preempt each other's Prepare phases, creating livelock:

Livelock scenario:
  P1: Prepare(1) → gets promises → about to send Accept(1,X)
  P2: Prepare(2) → preempts P1's promises
  P1: sees rejection → Prepare(3) → preempts P2
  P2: sees rejection → Prepare(4) → preempts P1
  ... neither ever reaches Accept phase

The standard fix is to elect a distinguished proposer (leader). Only the leader proposes, eliminating conflicts. If the leader crashes, a new leader is elected. This is exactly what Multi-Paxos and Raft formalize.

Multi-Paxos

Basic Paxos decides one value. To build a replicated log, we need to agree on a sequence of values — one Paxos instance per log slot. Multi-Paxos optimizes this by electing a stable leader who skips Phase 1 for subsequent slots:

Multi-Paxos Optimization:

  Slot 1: Full Paxos (Phase 1 + Phase 2) — leader established
  Slot 2: Phase 2 only (leader already has promises)
  Slot 3: Phase 2 only
  Slot 4: Phase 2 only
  ...
  
  If leader crashes → new leader runs full Phase 1 for next slot
  
  Steady-state: 1 round-trip per consensus decision
  (vs 2 round-trips for basic Paxos)

Multi-Paxos is what systems like Google's Chubby and Spanner actually implement. However, Lamport's papers leave many practical details unspecified — leader election, log compaction, membership changes, snapshotting — which is why Raft was created.

The Paxos Problem: Paxos is notoriously difficult to understand and even harder to implement correctly. Google's Chubby team wrote: "There are significant gaps between the description of the Paxos algorithm and the needs of a real-world system... the final system will be based on an unproven protocol."

Raft: Understandable Consensus

Raft was designed by Diego Ongaro and John Ousterhout (2014) with one overriding goal: understandability. They conducted a user study at Stanford showing that students learned Raft significantly faster and more correctly than Paxos. The key design choice was decomposition: Raft separates consensus into three relatively independent sub-problems.

  1. Leader Election — how to elect a leader, and what happens when it fails.
  2. Log Replication — how the leader replicates commands to followers.
  3. Safety — what guarantees ensure the log is never inconsistent.

Node States

Every Raft node is in exactly one of three states:

  • Follower: passive; responds to RPCs from leaders and candidates and never initiates requests on its own.
  • Candidate: an intermediate state used to elect a new leader; it requests votes from the other nodes.
  • Leader: handles all client requests and replicates log entries to followers.

State Transitions:

  ┌──────────┐  election timeout  ┌───────────┐  receives majority  ┌────────┐
  │ Follower │ ─────────────────→ │ Candidate │ ──────────────────→ │ Leader │
  └──────────┘                    └───────────┘                     └────────┘
       ↑                               │  ↑                             │
       │   discovers current leader    │  │  election timeout           │
       │   or new term                 │  │  (split vote → retry)       │
       │←──────────────────────────────┘  └─────────────────────────────│
       │                                                                │
       └────────────────── discovers higher term ───────────────────────┘

All nodes start as Followers. A follower passively receives RPCs from the leader. If it doesn't hear from a leader within the election timeout (randomized, typically 150–300ms), it becomes a Candidate and starts an election. If it wins, it becomes the Leader.

Terms

Raft divides time into terms, numbered with consecutive integers. Each term begins with an election. Terms act as a logical clock in Raft — they allow nodes to detect stale leaders and obsolete information.

Term Timeline:

  Term 1          Term 2      Term 3       Term 4        Term 5
  ┌──────────────┐┌──────────┐┌──────────┐┌────────────┐┌────────→
  │ Election → L1││Election→L2││Election  ││Election→ L4││Election→L5
  │ normal ops   ││normal ops ││(no       ││normal ops  ││normal ops
  │              ││           ││ winner)  ││            ││
  └──────────────┘└──────────┘└──────────┘└────────────┘└────────→
                                  ↑
                            split vote — term with no leader

Rules about terms:

  1. Each node stores currentTerm, which increases monotonically and is persisted to stable storage.
  2. Nodes exchange their current term on every RPC; if a node sees a term larger than its own, it updates currentTerm and reverts to Follower.
  3. If a candidate or leader discovers its term is stale, it immediately steps down to Follower.
  4. A node rejects any RPC that carries a stale (smaller) term.

Leader Election

When a follower's election timeout expires without receiving an AppendEntries heartbeat from the leader, it starts an election:

  1. Increment current term by 1.
  2. Transition to Candidate state.
  3. Vote for itself.
  4. Send RequestVote RPCs to all other nodes, including:
    • term — the candidate's current term.
    • candidateId — the candidate's identifier.
    • lastLogIndex — index of the candidate's last log entry.
    • lastLogTerm — term of the candidate's last log entry.

Three outcomes are possible:

  1. Wins: Receives votes from a majority (including its own vote). Becomes Leader. Immediately sends heartbeats to all nodes to establish authority and prevent new elections.
  2. Loses: Receives an AppendEntries from a node with a term ≥ its own. Reverts to Follower — someone else won.
  3. Split vote: Neither wins. Election timeout expires again → increment term → start new election.
RequestVote RPC:

  Arguments:
    term         — candidate's term
    candidateId  — candidate requesting vote
    lastLogIndex — index of candidate's last log entry
    lastLogTerm  — term of candidate's last log entry

  Results:
    term         — currentTerm, for candidate to update itself
    voteGranted  — true means candidate received vote

  Receiver implementation:
    1. Reply false if term < currentTerm
    2. If votedFor is null or candidateId, AND candidate's log
       is at least as up-to-date as receiver's log, grant vote

The Election Restriction (Safety)

A voter only grants its vote if the candidate's log is at least as up-to-date as the voter's log. "Up-to-date" is compared lexicographically: first by last log entry's term, then by log length.

Log comparison (who is more up-to-date?):

  Node A: [T1:x, T1:y, T3:z]   → lastLogTerm=3, lastLogIndex=3
  Node B: [T1:x, T1:y, T2:w]   → lastLogTerm=2, lastLogIndex=3
  Node C: [T1:x, T1:y]         → lastLogTerm=1, lastLogIndex=2

  A > B  (higher last term: 3 > 2)
  B > C  (higher last term: 2 > 1)
  A > C  (higher last term: 3 > 1)

  Only A can become leader — it has the most up-to-date log.

This restriction guarantees the Leader Completeness Property: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms. This is Raft's key safety guarantee.
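
Putting the voting rules and the election restriction together, the receiver side of RequestVote might look like the following sketch. The field names and the flat logTerms slice are illustrative, not the paper's reference implementation.

  package raft

  // RequestVoteArgs mirrors the RPC arguments listed above.
  type RequestVoteArgs struct {
      Term, CandidateID, LastLogIndex, LastLogTerm int
  }

  // Node holds the persistent state relevant to voting.
  type Node struct {
      currentTerm int
      votedFor    int   // -1 means "voted for nobody this term"
      logTerms    []int // term of each log entry; entry i+1 lives at logTerms[i]
  }

  func (n *Node) lastLogInfo() (index, term int) {
      if len(n.logTerms) == 0 {
          return 0, 0
      }
      return len(n.logTerms), n.logTerms[len(n.logTerms)-1]
  }

  // HandleRequestVote grants a vote only if the candidate's term is current,
  // this node has not already voted for someone else this term, and the
  // candidate's log is at least as up-to-date (compare last terms, then length).
  func (n *Node) HandleRequestVote(args RequestVoteArgs) (term int, voteGranted bool) {
      if args.Term < n.currentTerm {
          return n.currentTerm, false
      }
      if args.Term > n.currentTerm { // newer term: adopt it, forget the old vote
          n.currentTerm = args.Term
          n.votedFor = -1
      }
      lastIndex, lastTerm := n.lastLogInfo()
      upToDate := args.LastLogTerm > lastTerm ||
          (args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
      if (n.votedFor == -1 || n.votedFor == args.CandidateID) && upToDate {
          n.votedFor = args.CandidateID
          return n.currentTerm, true
      }
      return n.currentTerm, false
  }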


Split Vote Handling

Split votes occur when multiple nodes time out simultaneously and become candidates. Raft handles this with randomized election timeouts:

Split Vote Prevention:

  Each node picks a random election timeout from [150ms, 300ms].
  
  Scenario: Leader crashes at t=0
  
    Node B timeout: 173ms → becomes Candidate at t=173ms
    Node C timeout: 241ms → would become Candidate at t=241ms
    Node D timeout: 198ms → would become Candidate at t=198ms
    Node E timeout: 289ms → would become Candidate at t=289ms
  
  Node B starts election first (t=173ms).
  Network round-trip ≈ 15ms, so by t=188ms:
    B has votes from B, C, D, E (majority reached before D times out)
    B becomes leader at t=188ms.
  
  D's timeout at t=198ms → receives B's heartbeat → stays Follower.
  
  Split votes are rare because the timeout range (150ms) is much
  larger than the network round-trip time (~15ms).
Raft Paper Insight: The election timeout range should be an order of magnitude greater than the broadcast time. If broadcast time is 15ms, a range of [150ms, 300ms] ensures that in most cases, only one node times out before the election completes.
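
A small sketch of the timer handling, using the 150–300ms range from above; the wiring around time.Timer is an illustrative choice, not the only way to implement it.

  package raft

  import (
      "math/rand"
      "time"
  )

  const (
      electionTimeoutMin = 150 * time.Millisecond
      electionTimeoutMax = 300 * time.Millisecond
  )

  // randomElectionTimeout draws a fresh timeout from [min, max). Drawing a
  // new value for every election attempt and every heartbeat keeps nodes
  // desynchronized, which is what makes split votes rare.
  func randomElectionTimeout() time.Duration {
      spread := electionTimeoutMax - electionTimeoutMin
      return electionTimeoutMin + time.Duration(rand.Int63n(int64(spread)))
  }

  // resetElectionTimer re-arms the follower's timer; if it fires before the
  // next AppendEntries heartbeat arrives, the follower becomes a candidate.
  func resetElectionTimer(t *time.Timer) {
      if !t.Stop() {
          select { // drain the channel if the timer already fired
          case <-t.C:
          default:
          }
      }
      t.Reset(randomElectionTimeout())
  }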

Log Replication

Once a leader is elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machine. The leader appends the command to its log as a new entry, then replicates it to followers using AppendEntries RPCs.

AppendEntries RPC:

  Arguments:
    term           — leader's term
    leaderId       — so followers can redirect clients
    prevLogIndex   — index of log entry immediately preceding new ones
    prevLogTerm    — term of prevLogIndex entry
    entries[]      — log entries to store (empty for heartbeat)
    leaderCommit   — leader's commitIndex

  Results:
    term           — currentTerm, for leader to update itself
    success        — true if follower contained entry matching
                     prevLogIndex and prevLogTerm

  Receiver implementation:
    1. Reply false if term < currentTerm
    2. Reply false if log doesn't contain entry at prevLogIndex
       whose term matches prevLogTerm
    3. If existing entry conflicts with new one (same index,
       different term), delete it and all that follow
    4. Append any new entries not already in the log
    5. If leaderCommit > commitIndex, set commitIndex =
       min(leaderCommit, index of last new entry)
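
A hedged sketch of those five steps as a single handler, assuming 1-based log indices and illustrative struct names; it mirrors the spec above, not any particular implementation.

  package raft

  type Entry struct {
      Term    int
      Command string
  }

  type AppendEntriesArgs struct {
      Term, PrevLogIndex, PrevLogTerm, LeaderCommit int
      Entries                                       []Entry
  }

  type Follower struct {
      currentTerm int
      log         []Entry // entry at index i (1-based) lives at log[i-1]
      commitIndex int
  }

  // HandleAppendEntries follows the five receiver steps listed above.
  func (f *Follower) HandleAppendEntries(a AppendEntriesArgs) (term int, success bool) {
      if a.Term < f.currentTerm { // 1. reject a stale leader
          return f.currentTerm, false
      }
      f.currentTerm = a.Term
      // 2. consistency check: our log must contain prevLogIndex with prevLogTerm.
      if a.PrevLogIndex > 0 &&
          (a.PrevLogIndex > len(f.log) || f.log[a.PrevLogIndex-1].Term != a.PrevLogTerm) {
          return f.currentTerm, false
      }
      // 3 + 4. drop conflicting entries, append anything new.
      for i, e := range a.Entries {
          idx := a.PrevLogIndex + 1 + i // 1-based index of this entry
          if idx <= len(f.log) {
              if f.log[idx-1].Term == e.Term {
                  continue // already present
              }
              f.log = f.log[:idx-1] // conflict: truncate this entry and everything after
          }
          f.log = append(f.log, e)
      }
      // 5. advance commitIndex as far as the leader allows.
      if a.LeaderCommit > f.commitIndex {
          lastNew := a.PrevLogIndex + len(a.Entries)
          if a.LeaderCommit < lastNew {
              f.commitIndex = a.LeaderCommit
          } else {
              f.commitIndex = lastNew
          }
      }
      return f.currentTerm, true
  }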

Commit Rules

A log entry is committed once the leader has replicated it to a majority of nodes. Once committed, the entry is durable — it will eventually be applied to every node's state machine.

Commit Process:

  Leader log:   [T1:x] [T1:y] [T2:z] [T3:w]
                  ✓       ✓      ✓      ?
                committed committed committed uncommitted

  Node A (Leader): [T1:x] [T1:y] [T2:z] [T3:w]  commitIndex=3
  Node B:          [T1:x] [T1:y] [T2:z] [T3:w]  commitIndex=3
  Node C:          [T1:x] [T1:y] [T2:z]          commitIndex=3
  Node D:          [T1:x] [T1:y] [T2:z] [T3:w]  commitIndex=3
  Node E:          [T1:x] [T1:y]                  commitIndex=2

  Entry [T3:w] at index 4:
    Replicated on: A, B, D = 3 out of 5 = majority ✓
    → Leader advances commitIndex to 4
    → Leader tells followers via next AppendEntries
    → All nodes apply [T3:w] to state machine
Critical Safety Rule: A leader can only commit entries from its current term by counting replicas. Entries from previous terms are committed indirectly — they are committed only when a subsequent entry from the current term is committed. This prevents the subtle scenario where a log entry could be replicated to a majority, then overwritten by a new leader (see Figure 8 of the Raft paper).
Figure 8 Scenario (Why the Commit Rule Matters):

  Time  Leader  Action
  ────  ──────  ──────
  T2    S1      Replicates [T2:cmd] to S1, S2 only, then crashes → NOT committed
  T3    S5      Elected leader (votes from S3, S4, S5); appends [T3:cmd'] to
                its own log only, then crashes
  T4    S1      Re-elected. Resumes replicating [T2:cmd], which now reaches
                S1, S2, S3 (a majority)

  Can S1 commit [T2:cmd] just because it's on a majority now?
  NO! S5 could still win a later election (its last log term, 3, beats
  term 2 everywhere else) and overwrite [T2:cmd] with [T3:cmd'].
  
  S1 must first replicate a NEW entry [T4:something] to a majority.
  Once [T4:something] is committed, all entries before it
  (including [T2:cmd]) are implicitly committed.
  
  This is the "never commit entries from previous terms by counting
  replicas" rule — one of the subtlest points in Raft.
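
In code, the rule shows up as an extra condition when the leader advances its commit index: the entry at the candidate index must be from the current term. A minimal sketch follows; matchIndex holds, per node (leader included), the highest log index known to be replicated there, and all names are illustrative.

  package raft

  // maybeAdvanceCommitIndex returns the new commit index: the largest N above
  // the current commitIndex such that (a) a majority of nodes have replicated
  // entry N (matchIndex[i] >= N) and (b) the entry at N is from the leader's
  // current term. Entries below N are then committed implicitly.
  func maybeAdvanceCommitIndex(matchIndex []int, logTerms []int, currentTerm, commitIndex int) int {
      clusterSize := len(matchIndex)
      for n := len(logTerms); n > commitIndex; n-- {
          if logTerms[n-1] != currentTerm {
              break // terms never increase going backwards, so nothing below qualifies either
          }
          replicated := 0
          for _, m := range matchIndex {
              if m >= n {
                  replicated++
              }
          }
          if replicated > clusterSize/2 { // a majority holds entry n
              return n
          }
      }
      return commitIndex
  }

In the Figure 8 scenario above, the term-2 entry never passes the term check for the term-4 leader, so it becomes committed only once a term-4 entry above it reaches a majority.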


Log Matching Property

Raft maintains the following invariant, the Log Matching Property:

  1. If two entries in different logs have the same index and term, they store the same command.
  2. If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property follows from the fact that a leader creates at most one entry per index per term. The second follows from the consistency check in AppendEntries: the leader includes prevLogIndex and prevLogTerm, and the follower rejects the RPC if its log doesn't match at that point. This creates an inductive proof that the logs are consistent.

Log Repair (Handling Inconsistencies)

After a leader crash, follower logs may diverge — they may have missing entries, extra uncommitted entries, or both. The leader repairs this by finding the latest point of agreement with each follower:

Log Repair with nextIndex:

  Leader maintains nextIndex[i] for each follower i.
  Initially set to leader's last log index + 1.

  If AppendEntries fails (consistency check):
    leader decrements nextIndex[i]
    retries AppendEntries with earlier entries
  
  Repeats until follower accepts → logs are now consistent.

  Example:
    Leader log: [1:a] [1:b] [2:c] [3:d] [3:e]
    Follower:   [1:a] [1:b] [2:c] [2:f]    ← diverged at index 4

    Step 1: Leader sends entry 5 with prevLogIndex=4, prevLogTerm=3
            Follower: "I don't have term 3 at index 4" → REJECT
    Step 2: Leader sends entries 4,5 with prevLogIndex=3, prevLogTerm=2
            Follower: "I have term 2 at index 3!" → ACCEPT
            Follower deletes [2:f], appends [3:d] [3:e]
    
    Follower log now: [1:a] [1:b] [2:c] [3:d] [3:e] ✓
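
A sketch of the leader-side loop, assuming the naive back-up-by-one strategy described above and reusing the Entry and AppendEntriesArgs types from the AppendEntries sketch earlier. The Leader struct and the sendAppendEntries function field are illustrative stand-ins for real RPC plumbing.

  package raft

  type Leader struct {
      currentTerm int
      log         []Entry
      commitIndex int
      nextIndex   []int // per peer: index of the next entry to send
      matchIndex  []int // per peer: highest index known to be replicated there
      // sendAppendEntries stands in for the real RPC; it returns the
      // follower's success flag.
      sendAppendEntries func(peer int, args AppendEntriesArgs) bool
  }

  // replicateTo keeps sending AppendEntries to one follower, backing
  // nextIndex up by one after every rejected consistency check, until the
  // follower accepts and its log matches the leader's.
  func (l *Leader) replicateTo(peer int) {
      for l.nextIndex[peer] >= 1 {
          prevIndex := l.nextIndex[peer] - 1
          prevTerm := 0
          if prevIndex > 0 {
              prevTerm = l.log[prevIndex-1].Term
          }
          ok := l.sendAppendEntries(peer, AppendEntriesArgs{
              Term:         l.currentTerm,
              PrevLogIndex: prevIndex,
              PrevLogTerm:  prevTerm,
              Entries:      l.log[prevIndex:], // everything from nextIndex onward
              LeaderCommit: l.commitIndex,
          })
          if ok {
              l.nextIndex[peer] = len(l.log) + 1
              l.matchIndex[peer] = len(l.log)
              return
          }
          l.nextIndex[peer]-- // rejected: back up one entry and retry
      }
  }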

Log Compaction & Snapshots

If the log grows without bound, it will eventually consume all storage and take forever to replay on restart. Raft uses snapshotting to compact the log:

  1. Each node independently takes a snapshot of its state machine state at a committed log index.
  2. All log entries up to that index are discarded.
  3. The snapshot includes:
    • The state machine state.
    • lastIncludedIndex — the last log entry included in the snapshot.
    • lastIncludedTerm — the term of that entry.
Snapshot:

  Before: Log = [1:a][1:b][2:c][2:d][3:e][3:f][3:g][4:h]
                                                      ↑ commitIndex=8
  
  After snapshot at index 5:
    Snapshot = { state: {...}, lastIncludedIndex: 5, lastIncludedTerm: 3 }
    Log = [3:f][3:g][4:h]
    
  Memory/disk savings: 5 entries removed

If a follower is so far behind that the leader has already discarded the entries it needs, the leader sends an InstallSnapshot RPC instead of log entries. The follower replaces its state machine with the snapshot and resumes replication from there.
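
A minimal sketch of the compaction step, assuming a key-value state machine, 1-based indices, and an in-memory log that still starts at index 1 (all illustrative assumptions):

  package raft

  // Snapshot captures the state machine up to and including LastIncludedIndex.
  type Snapshot struct {
      State             map[string]string // state machine contents at that point
      LastIncludedIndex int
      LastIncludedTerm  int
  }

  type LogEntry struct {
      Term    int
      Command string
  }

  // compact snapshots the applied state at snapshotIndex (a committed,
  // already-applied index >= 1) and discards every log entry up to and
  // including it. The returned log starts at snapshotIndex+1.
  func compact(state map[string]string, log []LogEntry, snapshotIndex int) (Snapshot, []LogEntry) {
      snap := Snapshot{
          State:             state,
          LastIncludedIndex: snapshotIndex,
          LastIncludedTerm:  log[snapshotIndex-1].Term,
      }
      // Copy the tail so the discarded prefix can be garbage-collected.
      tail := make([]LogEntry, len(log)-snapshotIndex)
      copy(tail, log[snapshotIndex:])
      return snap, tail
  }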

Cluster Membership Changes

Adding or removing nodes from a Raft cluster is handled via joint consensus — a two-phase approach where the cluster transitions through a configuration that requires majorities from both the old and new configurations:

Joint Consensus for Membership Change:

  C_old = {A, B, C}
  C_new = {A, B, C, D, E}

  Phase 1: Leader commits C_old,new (joint configuration)
    Decisions require majorities from BOTH C_old AND C_new.
    e.g., need 2-of-{A,B,C} AND 3-of-{A,B,C,D,E}
    
  Phase 2: Leader commits C_new
    Now only C_new majority required.
    
  This prevents "split-brain" — at no point can two independent
  majorities exist that could elect different leaders.
Single-Server Changes (Raft Optimization): If you add/remove one server at a time, a simpler protocol suffices — no joint consensus needed. This is what etcd and most Raft implementations use in practice.
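
For the joint phase itself, the quorum rule is easy to state in code: a decision needs a majority of the old configuration and a majority of the new one. A minimal sketch, with node IDs as plain strings (an illustrative assumption):

  package raft

  // hasMajority reports whether the acknowledging nodes form a majority of
  // the given configuration.
  func hasMajority(acks map[string]bool, config []string) bool {
      count := 0
      for _, node := range config {
          if acks[node] {
              count++
          }
      }
      return count > len(config)/2
  }

  // jointQuorum is the decision rule while C_old,new is in effect: an
  // election or a commit succeeds only with a majority of the old
  // configuration AND a majority of the new one.
  func jointQuorum(acks map[string]bool, oldConfig, newConfig []string) bool {
      return hasMajority(acks, oldConfig) && hasMajority(acks, newConfig)
  }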

Raft in Practice

etcd

etcd is a distributed key-value store built on Raft. It's the backbone of Kubernetes — storing all cluster state: pod definitions, service endpoints, ConfigMaps, Secrets, RBAC rules, and more.

# etcd cluster operations

# Write a key
$ etcdctl put /registry/services/nginx '{"port": 80, "replicas": 3}'
OK

# Read a key
$ etcdctl get /registry/services/nginx
/registry/services/nginx
{"port": 80, "replicas": 3}

# Watch for changes to a key
$ etcdctl watch /registry/services/nginx
PUT
/registry/services/nginx
{"port": 80, "replicas": 5}

# Check cluster health
$ etcdctl endpoint health --cluster
https://etcd1:2379 is healthy: committed proposal: took = 1.5ms
https://etcd2:2379 is healthy: committed proposal: took = 1.8ms
https://etcd3:2379 is healthy: committed proposal: took = 2.1ms

# Check who is the Raft leader
$ etcdctl endpoint status --cluster -w table
+------------------------+------------------+--------+---------+
|        ENDPOINT        |        ID        | LEADER | RAFT IDX|
+------------------------+------------------+--------+---------+
| https://etcd1:2379     | 8e9e05c52164     | true   |   45892 |
| https://etcd2:2379     | 4b3e89d1a234     | false  |   45892 |
| https://etcd3:2379     | 91c7b2ef8901     | false  |   45890 |
+------------------------+------------------+--------+---------+

etcd implements Raft with several production-grade enhancements:

  • Pre-Vote: a node first checks that it could actually win an election before incrementing its term, so a briefly partitioned member does not disrupt a stable leader when it rejoins.
  • Learner (non-voting) members: new nodes catch up on the log before being promoted to voters.
  • Linearizable reads via ReadIndex: the leader confirms it is still leader with a quorum instead of writing a no-op through the log.
  • Write-ahead log plus periodic snapshots: every state change is fsynced to the WAL before being acknowledged, and snapshots bound log growth.

etcd Performance: etcd is designed for metadata storage, not bulk data. Recommended limits: ≤8GB database size, ≤1.5MB per value, ≤100 values per transaction. Kubernetes clusters with >5,000 nodes or >100,000 pods may need etcd tuning (heartbeat interval, election timeout, snapshot count).

Consul

HashiCorp Consul uses Raft for its service mesh and KV store. Like etcd, it elects a leader among server nodes, which then replicates all state changes. Consul adds service discovery, health checking, and a Connect service mesh on top of Raft.

# Consul Raft operations

# Check Raft peers
$ consul operator raft list-peers
Node    ID                                    Address           State    Voter
dc1-s1  aedf4c8a-7b5c-8d2e-9f1a-3c4b5d6e7f8a 10.0.0.1:8300   leader   true
dc1-s2  bf1e5d9b-8c6d-9e3f-a02b-4d5c6e7f8091 10.0.0.2:8300   follower true
dc1-s3  c0f26ea3-9d7e-af40-b13c-5e6d7f809102 10.0.0.3:8300   follower true

CockroachDB

CockroachDB uses Raft at a different granularity: each range (a contiguous slice of the keyspace, typically ~512MB) has its own Raft group. A single CockroachDB cluster may have thousands of independent Raft groups, each with its own leader. This design enables:

  • Distributed write load: Raft leadership (and therefore write coordination) is spread across all nodes instead of funneling through a single leader.
  • Independent replication and rebalancing: ranges can be split, moved, and re-replicated without touching the rest of the keyspace.
  • Horizontal scalability: adding nodes adds both storage capacity and more Raft groups to lead.

CockroachDB Range-Level Raft:

  Keyspace: [a ─────── g ─────── m ─────── s ─────── z]
              Range 1    Range 2    Range 3    Range 4
              
  Range 1: Raft group {Node1★, Node2, Node3}   ★ = leader
  Range 2: Raft group {Node2★, Node3, Node4}
  Range 3: Raft group {Node3★, Node4, Node1}
  Range 4: Raft group {Node4★, Node1, Node2}
  
  Each range has its own Raft leader → leadership is distributed!
  Transaction spanning Range 1 + Range 3 uses parallel Raft commits.

ZAB (ZooKeeper Atomic Broadcast)

ZAB is the consensus protocol used by Apache ZooKeeper. It was designed specifically for primary-backup systems and predates the Raft paper. While similar to Raft in many ways, ZAB has some key differences:

ZAB Protocol Phases:

  Phase 0: Leader Election
    - Nodes exchange votes. Node with highest (epoch, zxid) wins.
    - Similar to Raft's RequestVote but uses a different comparison.

  Phase 1: Discovery
    - New leader collects followers' histories.
    - Determines the most up-to-date follower history.

  Phase 2: Synchronization
    - Leader sends its history to all followers.
    - Followers update their state to match.
    - Analogous to Raft's log repair but done eagerly on election.

  Phase 3: Broadcast (normal operation)
    - Leader proposes writes (PROPOSAL → ACK → COMMIT).
    - 2-phase commit: wait for majority ACK, then COMMIT.
    - Similar to Raft's AppendEntries.

Key differences from Raft:

  • Identifiers: ZAB tags every transaction with a zxid, a 64-bit value combining an epoch and a counter; Raft uses (term, index) pairs.
  • Leader recovery: a new ZAB leader first runs Discovery and Synchronization to collect and install the most up-to-date history, while Raft's election restriction guarantees the new leader already holds every committed entry and repairs followers lazily.
  • Commit signaling: ZAB sends explicit COMMIT messages after a majority of ACKs; Raft piggybacks the leader's commitIndex on subsequent AppendEntries.
  • Ordering guarantees: ZAB is built around primary order, preserving the FIFO order of requests issued by the primary within an epoch.

Viewstamped Replication (VR)

Viewstamped Replication, designed by Brian Oki and Barbara Liskov (1988), is one of the earliest replicated state machine protocols. It predates both Paxos and Raft and introduced many of the ideas that Raft later adopted.

Viewstamped Replication — Normal Operation:

  Client → Primary: REQUEST(op, client-id, request-number)
  Primary → Backups: PREPARE(view, op-number, op, commit-number)
  Backups → Primary: PREPAREOK(view, op-number, replica-id)
  
  Once majority PREPAREOK received:
    Primary executes op → Client: REPLY(view, request-number, result)
    Primary piggybacks commit-number in next PREPARE

  Almost identical to Raft's AppendEntries flow!

VR's view change is more complex than Raft's leader election because VR doesn't use persistent storage — it recovers state from other replicas rather than from disk. The VR Revisited paper (2012) simplified this significantly.

Protocol Comparison

  Paxos (Lamport, 1989; published 1998)
    Leader required:      optional (Multi-Paxos: yes)
    Logical clock:        proposal numbers
    Phases per decision:  2 (basic), 1 (Multi-Paxos steady state)
    Log repair:           not specified
    Understandability:    low (notoriously hard)
    Used by:              Chubby, Spanner
    Fault tolerance:      f of 2f+1 crash failures
    Persistent state:     required (votedFor, accepted value)

  Raft (Ongaro & Ousterhout, 2014)
    Leader required:      yes (strong leader)
    Logical clock:        terms
    Phases per decision:  1 (AppendEntries)
    Log repair:           incremental (nextIndex decrement)
    Understandability:    high (designed for it)
    Used by:              etcd, Consul, CockroachDB, TiKV
    Fault tolerance:      f of 2f+1 crash failures
    Persistent state:     required (currentTerm, votedFor, log)

  ZAB (Junqueira et al., 2008)
    Leader required:      yes (primary)
    Logical clock:        epochs + zxid
    Phases per decision:  2 (PROPOSAL → COMMIT)
    Log repair:           eager (on leader election)
    Understandability:    medium
    Used by:              ZooKeeper
    Fault tolerance:      f of 2f+1 crash failures
    Persistent state:     required (epoch, history)

  Viewstamped Replication (Oki & Liskov, 1988; revised 2012)
    Leader required:      yes (primary)
    Logical clock:        view numbers
    Phases per decision:  1 (PREPARE → PREPAREOK)
    Log repair:           eager (on view change)
    Understandability:    medium
    Used by:              Harp; influenced PBFT
    Fault tolerance:      f of 2f+1 crash failures
    Persistent state:     not required (recovers from peers)

Correctness Arguments

The Raft paper proves five key properties. Together they establish that Raft implements a correct replicated state machine:

  1. Election Safety: at most one leader can be elected in a given term.
    • Mechanism: each node votes at most once per term and winning requires a majority; since any two majorities overlap, at most one candidate can win.
  2. Leader Append-Only: a leader never overwrites or deletes entries in its own log; it only appends new entries.
  3. Log Matching: if two logs contain an entry with the same index and term, the logs are identical in all entries up through that index.
    • Mechanism: the AppendEntries consistency check (prevLogIndex and prevLogTerm must match), applied inductively from the first entry.
  4. Leader Completeness: if a log entry is committed in a given term, it will be present in the logs of all leaders for all higher-numbered terms.
    • Mechanism: the election restriction rejects candidates with less up-to-date logs; a committed entry is on a majority, and any future electing majority overlaps with it, so every future leader holds the entry.
  5. State Machine Safety: if a node has applied a log entry at a given index to its state machine, no other node will ever apply a different log entry at that index.
    • Mechanism: follows from Log Matching and Leader Completeness; an applied entry is committed, every future leader holds it, and no conflicting entry can exist at that index.

"The Raft algorithm was designed so that its safety properties could be formally verified. The TLA+ specification of Raft has been model-checked." — Ongaro, PhD dissertation

Practical Considerations

Performance Characteristics

Raft Latency Breakdown:

  Client write:
    1. Client → Leader:              ~0.5ms (within datacenter)
    2. Leader appends to local log:  ~0.1ms (fsync to WAL)
    3. Leader → Followers:           ~0.5ms (parallel RPCs)
    4. Followers append + respond:   ~0.6ms (fsync + network)
    5. Leader commits + responds:    ~0.1ms
    ────────────────────────────────────────
    Total write latency:             ~1.8ms (within DC)
    Cross-region (multi-DC):         ~50-150ms (WAN RTT dominates)

  Client read (linearizable):
    Option A: Read through Raft log:    ~1.8ms (same as write)
    Option B: ReadIndex (leader check): ~0.8ms (1 RTT to confirm leadership)
    Option C: Lease-based reads:        ~0.1ms (local read, requires clock sync)

  Throughput:
    Typical: 10,000-50,000 writes/sec (3-node cluster, SSD)
    etcd benchmark: ~16,000 writes/sec (256-byte values, 3 nodes)

Cluster Sizing

Recommended Cluster Sizes:

  3 nodes: Tolerates 1 failure. Minimum for production.
           Best for: small services, dev/staging, low-latency requirements.
  
  5 nodes: Tolerates 2 failures. Standard for critical services.
           Best for: Kubernetes control plane, production databases.
  
  7 nodes: Tolerates 3 failures. Rarely needed.
           Best for: cross-region deployments (2 regions × 3 + 1 witness).
  
  Why always odd?
    Even numbers waste a node without improving fault tolerance:
    4 nodes → tolerates 1 failure (same as 3!)
    6 nodes → tolerates 2 failures (same as 5!)
    
  Formula: f failures tolerated with 2f+1 nodes.

Common Pitfalls

  • Even-sized clusters: 4 or 6 nodes add cost and failure surface without tolerating any additional failures.
  • Timeouts tuned too tight: if the election timeout is close to the network round-trip or fsync latency, GC pauses and slow disks trigger spurious elections and leadership flapping.
  • Acknowledging before fsync: currentTerm, votedFor, and log entries must reach stable storage before a node replies, or a crash can violate the protocol's guarantees.
  • Reading from followers and expecting linearizability: followers may be arbitrarily stale; use ReadIndex or leader leases for linearizable reads.
  • Unbounded log growth: without snapshotting, restarts and follower catch-up become painfully slow.

Summary

Consensus is the foundation of reliable distributed systems. The FLP impossibility result tells us we can't have everything — determinism, asynchrony, and guaranteed termination — so practical protocols like Paxos, Raft, and ZAB trade pure asynchrony for timeout-based failure detection.

In the next post, we explore vector clocks — the mechanism for tracking causality and detecting conflicts in systems that don't use consensus for every operation.