Consensus: Paxos & Raft
The Consensus Problem
At the heart of every reliable distributed system lies a deceptively simple question: how do multiple nodes agree on a single value? This is the consensus problem, and it is the foundation upon which leader election, replicated state machines, distributed locks, atomic broadcasts, and total order delivery are all built.
Formally, a consensus protocol must satisfy three properties:
- Agreement: All non-faulty processes decide on the same value.
- Validity: The decided value must have been proposed by some process (no "magic" values).
- Termination: Every non-faulty process eventually decides (liveness).
In practice, we also care about total order — not just agreeing on one value, but on a sequence of values (a replicated log). This is what turns consensus into a replicated state machine: every node applies the same commands in the same order, so all nodes converge to the same state.
Replicated State Machine:
Client → propose(cmd) → Consensus Module → Log
↓
State Machine applies log[0], log[1], log[2]...
↓
All replicas reach identical state
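To make the replicated state machine idea concrete, here is a minimal Go sketch (the types and names are illustrative, not taken from any real system): two replicas that apply the same log in the same order end up with identical state.
```go
package main

import "fmt"

// Command is one entry of the replicated log.
type Command struct {
	Key, Value string
}

// StateMachine applies committed commands in log order.
type StateMachine struct {
	applied int               // number of log entries applied so far
	state   map[string]string // the materialized state
}

func NewStateMachine() *StateMachine {
	return &StateMachine{state: make(map[string]string)}
}

// Apply consumes every committed-but-unapplied entry, in order.
func (sm *StateMachine) Apply(log []Command, commitIndex int) {
	for ; sm.applied < commitIndex; sm.applied++ {
		cmd := log[sm.applied]
		sm.state[cmd.Key] = cmd.Value
	}
}

func main() {
	log := []Command{{"x", "1"}, {"y", "2"}, {"x", "3"}}

	a, b := NewStateMachine(), NewStateMachine()
	a.Apply(log, 3) // replica A applies log[0..2]
	b.Apply(log, 3) // replica B applies the same log

	fmt.Println(a.state, b.state) // identical: map[x:3 y:2] map[x:3 y:2]
}
```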
Why is this hard? Because distributed systems must contend with arbitrary message delays, message loss, network partitions, and node crashes — all happening simultaneously. You cannot simply "ask everyone and take a vote" because messages may never arrive, nodes may crash mid-vote, and the network may split into disconnected halves.
FLP Impossibility Theorem
The FLP result (Fischer, Lynch, and Paterson, 1985) states that no deterministic consensus protocol can guarantee termination in an asynchronous system in which even a single process may crash. It is the most important impossibility result in distributed computing. Let's break down each of those three conditions carefully:
- Deterministic: The protocol's behavior is fully determined by the messages it receives — no random coin flips.
- Asynchronous: There is no upper bound on message delivery time. A message sent now might arrive in 1ms or 1 year — you can't tell the difference between a slow node and a dead one.
- One faulty process: Even if only a single process can crash (fail-stop, not Byzantine), consensus becomes impossible to guarantee.
The proof works by showing that the system can always be in a bivalent state — a state from which both 0 and 1 are still possible outcomes. The adversary (scheduler) can always delay messages to keep the system in a bivalent state indefinitely, preventing termination.
FLP Proof Sketch:
1. Start: System is in a bivalent initial configuration
(both 0-valent and 1-valent states are reachable)
2. Key lemma: From any bivalent configuration, there exists
a single step (delivering one message) that leads to
another bivalent configuration.
3. An adversary can perpetually pick these "bivalence-preserving"
message deliveries → the system never decides.
4. Therefore: No deterministic protocol can guarantee termination
in an asynchronous system with even 1 crash failure.
Implications & Workarounds
FLP does not say consensus is impossible in practice. It says you cannot have all three of determinism, asynchrony, and guaranteed termination. Real systems break one of these assumptions:
| Workaround | What It Breaks | Example |
|---|---|---|
| Randomization | Determinism | Ben-Or's protocol, randomized Paxos |
| Failure detectors (timeouts) | Pure asynchrony | Paxos, Raft use election timeouts |
| Partial synchrony | Pure asynchrony | DLS protocol — assumes messages arrive within a bound eventually |
Paxos and Raft break pure asynchrony by using timeouts as failure detectors. They remain safe (never decide inconsistently) regardless of timing, but liveness (termination) depends on timeouts eventually being accurate — a partially synchronous assumption.
Paxos
Paxos, invented by Leslie Lamport in 1989 (published 1998), is the foundational consensus algorithm. It solves single-value consensus in an asynchronous system with crash failures. Lamport originally described it as a protocol used by part-time legislators on the Greek island of Paxos — a presentation so allegorical that it took the community years to fully parse it.
Roles
🎯 Proposer
Proposes values to the acceptors. Any node can be a proposer. In practice, a distinguished proposer (leader) reduces conflicts.
✅ Acceptor
Votes on proposals. A value is chosen when accepted by a majority (quorum) of acceptors.
📖 Learner
Learns the chosen value. Doesn't participate in the protocol — just observes the outcome.
In practice, a single physical node often plays all three roles simultaneously.
The Two Phases of Basic Paxos
Basic Paxos decides on a single value. It proceeds in two phases, each with a request and response:
Phase 1: Prepare / Promise
- Prepare(n): A proposer selects a proposal number n (globally unique, monotonically increasing) and sends a Prepare(n) request to a majority of acceptors.
- Promise(n, v?): An acceptor receiving Prepare(n) checks:
  - If n is greater than any proposal number it has already responded to, it promises not to accept any proposal with a number less than n.
  - If it has already accepted a proposal (m, v), it includes that in its promise: "I promise, and by the way, I already accepted value v at proposal m."
  - If n is not greater, it ignores (or NACKs) the prepare.
Phase 2: Accept / Accepted
- Accept(n, v): Once the proposer receives promises from a majority, it sends Accept(n, v) where:
  - If any promise included a previously accepted value, the proposer must use the value from the highest-numbered accepted proposal. This is the critical safety rule.
  - If no promise included a previously accepted value, the proposer can choose any value.
- Accepted(n, v): An acceptor receiving Accept(n, v) accepts it unless it has already promised to a higher proposal number. If accepted, it notifies the learners.
Basic Paxos — Single Value Consensus:
Proposer Acceptor A1 Acceptor A2 Acceptor A3
| | | |
|--- Prepare(1) --→ | | |
|--- Prepare(1) ----------------→ | |
|--- Prepare(1) --------------------------------→ |
| | | |
|←-- Promise(1,⊥) --| | |
|←-- Promise(1,⊥) --------------| | |
| | | |
| (majority received — 2 of 3) | |
| | | |
|--- Accept(1,"X") → | | |
|--- Accept(1,"X") --------------→ | |
|--- Accept(1,"X") --------------------------------→
| | | |
|←-- Accepted(1,"X")| | |
|←-- Accepted(1,"X")------------| | |
| | | |
| Value "X" is CHOSEN (majority accepted)
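To make the acceptor's rules concrete, here is a minimal, single-process Go sketch of an acceptor's state and its Promise/Accept handling. The types and method names are illustrative, not drawn from any particular Paxos implementation.
```go
package main

import "fmt"

type Proposal struct {
	N     int    // proposal number
	Value string // proposed value
}

type Acceptor struct {
	promisedN int       // highest proposal number promised (0 = none)
	accepted  *Proposal // highest-numbered proposal accepted, if any
}

// Prepare handles Prepare(n): promise not to accept anything below n,
// and report any previously accepted proposal.
func (a *Acceptor) Prepare(n int) (ok bool, prev *Proposal) {
	if n <= a.promisedN {
		return false, nil // ignore / NACK
	}
	a.promisedN = n
	return true, a.accepted
}

// Accept handles Accept(n, v): accept unless we promised a higher number.
func (a *Acceptor) Accept(n int, v string) bool {
	if n < a.promisedN {
		return false
	}
	a.promisedN = n
	a.accepted = &Proposal{N: n, Value: v}
	return true
}

func main() {
	acc := &Acceptor{}
	ok, prev := acc.Prepare(1)
	fmt.Println(ok, prev)           // true <nil> — nothing accepted yet
	fmt.Println(acc.Accept(1, "X")) // true — value X accepted at proposal 1
	ok, prev = acc.Prepare(2)
	fmt.Println(ok, prev.Value)     // true X — a later proposer learns about X
}
```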
Why Paxos is Safe
The key insight is the Prepare phase constraint: if any value has already been accepted by a majority, any future proposer will discover that value during its Prepare phase (because any two majorities overlap in at least one member). The proposer is then forced to re-propose that value, ensuring the chosen value can never change.
With 2f+1 acceptors, any two majorities of f+1 share at least one member. This guarantees that a new proposer's Prepare phase will always intersect with any previously accepted majority.
Quorum Intersection Example (5 acceptors, majority = 3):
Majority 1: {A1, A2, A3} accepted value "X" at proposal 5
Majority 2: {A3, A4, A5} respond to Prepare(7)
A3 is in both majorities → A3 tells Proposer 2:
"I already accepted (5, 'X')"
Proposer 2 must re-propose "X" → safety preserved!
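The proposer-side half of that rule fits in a few lines: after Phase 1, re-propose the value attached to the highest-numbered accepted proposal among the promises, falling back to the proposer's own value only if no promise carried one. (The types below are hypothetical, for illustration.)
```go
package main

import "fmt"

type Promise struct {
	AcceptedN     int    // 0 if the acceptor had accepted nothing
	AcceptedValue string
}

// chooseValue implements the Phase 2 safety rule: use the value from the
// highest-numbered accepted proposal, or our own value if none exists.
func chooseValue(promises []Promise, own string) string {
	bestN, value := 0, own
	for _, p := range promises {
		if p.AcceptedN > bestN {
			bestN, value = p.AcceptedN, p.AcceptedValue
		}
	}
	return value
}

func main() {
	// A3 already accepted (5, "X"); the new proposer must re-propose "X".
	promises := []Promise{{0, ""}, {0, ""}, {5, "X"}}
	fmt.Println(chooseValue(promises, "Y")) // X
}
```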
Dueling Proposers (Livelock)
Basic Paxos has a liveness issue: two proposers can continuously preempt each other's Prepare phases, creating livelock:
Livelock scenario:
P1: Prepare(1) → gets promises → about to send Accept(1,X)
P2: Prepare(2) → preempts P1's promises
P1: sees rejection → Prepare(3) → preempts P2
P2: sees rejection → Prepare(4) → preempts P1
... neither ever reaches Accept phase
The standard fix is to elect a distinguished proposer (leader). Only the leader proposes, eliminating conflicts. If the leader crashes, a new leader is elected. This is exactly what Multi-Paxos and Raft formalize.
Multi-Paxos
Basic Paxos decides one value. To build a replicated log, we need to agree on a sequence of values — one Paxos instance per log slot. Multi-Paxos optimizes this by electing a stable leader who skips Phase 1 for subsequent slots:
Multi-Paxos Optimization:
Slot 1: Full Paxos (Phase 1 + Phase 2) — leader established
Slot 2: Phase 2 only (leader already has promises)
Slot 3: Phase 2 only
Slot 4: Phase 2 only
...
If leader crashes → new leader runs full Phase 1 for next slot
Steady-state: 1 round-trip per consensus decision
(vs 2 round-trips for basic Paxos)
Multi-Paxos is what systems like Google's Chubby and Spanner actually implement. However, Lamport's papers leave many practical details unspecified — leader election, log compaction, membership changes, snapshotting — which is why Raft was created.
Raft: Understandable Consensus
Raft was designed by Diego Ongaro and John Ousterhout (2014) with one overriding goal: understandability. They conducted a user study at Stanford showing that students learned Raft significantly faster and more correctly than Paxos. The key design choice was decomposition: Raft separates consensus into three relatively independent sub-problems.
- Leader Election — how to elect a leader, and what happens when it fails.
- Log Replication — how the leader replicates commands to followers.
- Safety — what guarantees ensure the log is never inconsistent.
Node States
Every Raft node is in exactly one of three states:
State Transitions:
┌──────────┐ election timeout ┌───────────┐ receives majority ┌────────┐
│ Follower │ ─────────────────→ │ Candidate │ ──────────────────→ │ Leader │
└──────────┘ └───────────┘ └────────┘
↑ │ ↑ │
│ discovers current leader │ │ election timeout │
│ or new term │ │ (split vote → retry) │
│←──────────────────────────────┘ └─────────────────────────────│
│ │
└────────────────── discovers higher term ───────────────────────┘
All nodes start as Followers. A follower passively receives RPCs from the leader. If it doesn't hear from a leader within the election timeout (randomized, typically 150–300ms), it becomes a Candidate and starts an election. If it wins, it becomes the Leader.
Terms
Raft divides time into terms, numbered with consecutive integers. Each term begins with an election. Terms act as a logical clock in Raft — they allow nodes to detect stale leaders and obsolete information.
Term Timeline:
Term 1 Term 2 Term 3 Term 4 Term 5
┌──────────────┐┌──────────┐┌──────────┐┌────────────┐┌────────→
│ Election → L1││Election→L2││Election ││Election→ L4││Election→L5
│ normal ops ││normal ops ││(no ││normal ops ││normal ops
│ ││ ││ winner) ││ ││
└──────────────┘└──────────┘└──────────┘└────────────┘└────────→
↑
split vote — term with no leader
Rules about terms (see the sketch after this list):
- Each term has at most one leader.
- If a node receives a message with a higher term, it immediately updates its term and reverts to Follower.
- If a node receives a message with a lower term, it rejects the message.
- Terms are the fundamental mechanism for leader deduplication — they ensure two nodes never believe they are leader for the same term.
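A minimal sketch of the term bookkeeping every RPC handler performs first — the Node type and names here are illustrative:
```go
package main

import "fmt"

type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Node struct {
	currentTerm int
	state       State
}

// observeTerm applies the two term rules: a higher term demotes us to
// Follower; a lower term means the message is stale and must be rejected.
func (n *Node) observeTerm(msgTerm int) (accept bool) {
	if msgTerm > n.currentTerm {
		n.currentTerm = msgTerm
		n.state = Follower
	}
	return msgTerm >= n.currentTerm
}

func main() {
	n := &Node{currentTerm: 5, state: Leader}
	fmt.Println(n.observeTerm(4), n.state) // false Leader — stale message rejected
	fmt.Println(n.observeTerm(6), n.state) // true Follower — newer term observed
}
```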
Leader Election
When a follower's election timeout expires without receiving an AppendEntries heartbeat from the leader, it starts an election:
- Increment current term by 1.
- Transition to Candidate state.
- Vote for itself.
- Send RequestVote RPCs to all other nodes, including:
  - term — the candidate's current term.
  - candidateId — the candidate's identifier.
  - lastLogIndex — index of the candidate's last log entry.
  - lastLogTerm — term of the candidate's last log entry.
Three outcomes are possible:
- Wins: Receives votes from a majority (including its own vote). Becomes Leader. Immediately sends heartbeats to all nodes to establish authority and prevent new elections.
- Loses: Receives an AppendEntries from a node with a term ≥ its own. Reverts to Follower — someone else won.
- Split vote: Neither wins. Election timeout expires again → increment term → start new election.
RequestVote RPC:
Arguments:
term — candidate's term
candidateId — candidate requesting vote
lastLogIndex — index of candidate's last log entry
lastLogTerm — term of candidate's last log entry
Results:
term — currentTerm, for candidate to update itself
voteGranted — true means candidate received vote
Receiver implementation:
1. Reply false if term < currentTerm
2. If votedFor is null or candidateId, AND candidate's log
is at least as up-to-date as receiver's log, grant vote
The Election Restriction (Safety)
A voter only grants its vote if the candidate's log is at least as up-to-date as the voter's log. "Up-to-date" is compared lexicographically: first by last log entry's term, then by log length.
Log comparison (who is more up-to-date?):
Node A: [T1:x, T1:y, T3:z] → lastLogTerm=3, lastLogIndex=3
Node B: [T1:x, T1:y, T2:w] → lastLogTerm=2, lastLogIndex=3
Node C: [T1:x, T1:y] → lastLogTerm=1, lastLogIndex=2
A > B (higher last term: 3 > 2)
B > C (higher last term: 2 > 1)
A > C (higher last term: 3 > 1)
Only A can become leader — it has the most up-to-date log.
This restriction guarantees the Leader Completeness Property: if a log entry is committed in a given term, that entry will be present in the logs of all leaders for all higher-numbered terms. This is Raft's key safety guarantee.
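A compact sketch of the follower-side vote logic, combining the receiver rules from the RPC description with the up-to-date comparison above. The types are illustrative, not taken from any specific library.
```go
package main

import "fmt"

type Entry struct{ Term int }

type Node struct {
	currentTerm int
	votedFor    string // "" = none
	log         []Entry
}

type RequestVoteArgs struct {
	Term         int
	CandidateID  string
	LastLogIndex int
	LastLogTerm  int
}

func (n *Node) lastLogTerm() int {
	if len(n.log) == 0 {
		return 0
	}
	return n.log[len(n.log)-1].Term
}

// upToDate reports whether the candidate's log is at least as up-to-date
// as ours: compare last terms first, then log length.
func (n *Node) upToDate(lastLogTerm, lastLogIndex int) bool {
	if lastLogTerm != n.lastLogTerm() {
		return lastLogTerm > n.lastLogTerm()
	}
	return lastLogIndex >= len(n.log)
}

func (n *Node) RequestVote(args RequestVoteArgs) bool {
	if args.Term < n.currentTerm {
		return false // stale candidate
	}
	if args.Term > n.currentTerm {
		n.currentTerm, n.votedFor = args.Term, "" // newer term: reset our vote
	}
	if (n.votedFor == "" || n.votedFor == args.CandidateID) &&
		n.upToDate(args.LastLogTerm, args.LastLogIndex) {
		n.votedFor = args.CandidateID
		return true
	}
	return false
}

func main() {
	// Voter's log ends in term 3; a candidate whose log ends in term 2 is refused.
	voter := &Node{currentTerm: 3, log: []Entry{{1}, {1}, {3}}}
	fmt.Println(voter.RequestVote(RequestVoteArgs{Term: 4, CandidateID: "B", LastLogIndex: 3, LastLogTerm: 2})) // false
	fmt.Println(voter.RequestVote(RequestVoteArgs{Term: 4, CandidateID: "A", LastLogIndex: 3, LastLogTerm: 3})) // true
}
```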
Split Vote Handling
Split votes occur when multiple nodes time out simultaneously and become candidates. Raft handles this with randomized election timeouts:
Split Vote Prevention:
Each node picks a random election timeout from [150ms, 300ms].
Scenario: Leader crashes at t=0
Node B timeout: 173ms → becomes Candidate at t=173ms
Node C timeout: 241ms → would become Candidate at t=241ms
Node D timeout: 198ms → would become Candidate at t=198ms
Node E timeout: 289ms → would become Candidate at t=289ms
Node B starts election first (t=173ms).
Network round-trip ≈ 15ms, so by t=188ms:
B has votes from B, C, D, E (majority reached before D times out)
B becomes leader at t=188ms.
D's timer would fire at t=198ms, but it has already received B's heartbeat → stays Follower.
Split votes are rare because the timeout range (150ms) is much
larger than the network round-trip time (~15ms).
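The mechanism itself is tiny. Here is a sketch using Go's standard library and the 150–300ms range from the example — a fresh timeout is drawn every time the election timer resets.
```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// electionTimeout draws a fresh randomized timeout from [150ms, 300ms).
func electionTimeout() time.Duration {
	return 150*time.Millisecond + time.Duration(rand.Int63n(int64(150*time.Millisecond)))
}

func main() {
	for _, node := range []string{"B", "C", "D", "E"} {
		fmt.Printf("node %s election timeout: %v\n", node, electionTimeout())
	}
}
```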
Log Replication
Once a leader is elected, it begins servicing client requests. Each client request contains a command to be executed by the replicated state machine. The leader appends the command to its log as a new entry, then replicates it to followers using AppendEntries RPCs.
AppendEntries RPC:
Arguments:
term — leader's term
leaderId — so followers can redirect clients
prevLogIndex — index of log entry immediately preceding new ones
prevLogTerm — term of prevLogIndex entry
entries[] — log entries to store (empty for heartbeat)
leaderCommit — leader's commitIndex
Results:
term — currentTerm, for leader to update itself
success — true if follower contained entry matching
prevLogIndex and prevLogTerm
Receiver implementation:
1. Reply false if term < currentTerm
2. Reply false if log doesn't contain entry at prevLogIndex
whose term matches prevLogTerm
3. If existing entry conflicts with new one (same index,
different term), delete it and all that follow
4. Append any new entries not already in the log
5. If leaderCommit > commitIndex, set commitIndex =
min(leaderCommit, index of last new entry)
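Here is a self-contained sketch of the follower side of AppendEntries, following the five receiver steps above. Indices are 1-based as in the Raft paper (log[i-1] holds the entry at index i); the types are illustrative.
```go
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

type Follower struct {
	currentTerm int
	log         []Entry
	commitIndex int
}

type AppendEntriesArgs struct {
	Term         int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []Entry
	LeaderCommit int
}

func (f *Follower) AppendEntries(a AppendEntriesArgs) bool {
	// 1. Reject stale leaders; otherwise adopt the newer term.
	if a.Term < f.currentTerm {
		return false
	}
	if a.Term > f.currentTerm {
		f.currentTerm = a.Term
	}
	// 2. Consistency check: we must have a matching entry at PrevLogIndex.
	if a.PrevLogIndex > 0 {
		if len(f.log) < a.PrevLogIndex || f.log[a.PrevLogIndex-1].Term != a.PrevLogTerm {
			return false
		}
	}
	// 3 + 4. Delete conflicting entries, then append any new ones.
	for i, e := range a.Entries {
		idx := a.PrevLogIndex + 1 + i // 1-based index of this entry
		if len(f.log) >= idx && f.log[idx-1].Term != e.Term {
			f.log = f.log[:idx-1] // conflict: truncate this entry and all after
		}
		if len(f.log) < idx {
			f.log = append(f.log, e)
		}
	}
	// 5. Advance commitIndex.
	if a.LeaderCommit > f.commitIndex {
		f.commitIndex = min(a.LeaderCommit, a.PrevLogIndex+len(a.Entries))
	}
	return true
}

func min(a, b int) int {
	if a < b {
		return a
	}
	return b
}

func main() {
	// Follower has an extra uncommitted entry {2,"f"} that the leader overwrites.
	f := &Follower{currentTerm: 2, log: []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "f"}}}
	ok := f.AppendEntries(AppendEntriesArgs{
		Term: 3, PrevLogIndex: 3, PrevLogTerm: 2,
		Entries: []Entry{{3, "d"}, {3, "e"}}, LeaderCommit: 3,
	})
	fmt.Println(ok, f.log, f.commitIndex) // true [{1 a} {1 b} {2 c} {3 d} {3 e}] 3
}
```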
Commit Rules
A log entry is committed once the leader has replicated it to a majority of nodes. Once committed, the entry is durable — it will eventually be applied to every node's state machine.
Commit Process:
Leader log: [T1:x] [T1:y] [T2:z] [T3:w]
✓ ✓ ✓ ?
committed committed committed uncommitted
Node A (Leader): [T1:x] [T1:y] [T2:z] [T3:w] commitIndex=3
Node B: [T1:x] [T1:y] [T2:z] [T3:w] commitIndex=3
Node C: [T1:x] [T1:y] [T2:z] commitIndex=3
Node D: [T1:x] [T1:y] [T2:z] [T3:w] commitIndex=3
Node E: [T1:x] [T1:y] commitIndex=2
Entry [T3:w] at index 4:
Replicated on: A, B, D = 3 out of 5 = majority ✓
→ Leader advances commitIndex to 4
→ Leader tells followers via next AppendEntries
→ All nodes apply [T3:w] to state machine
Figure 8 Scenario (Why the Commit Rule Matters):
Time Leader Action
──── ────── ──────
T2 S1 Replicates [T2:cmd] to S1, S2 only → NOT committed
T3 S5 Elected leader (votes from S3,S4,S5), appends [T3:cmd'] to its own log only, then crashes
T4 S1 Re-elected! Keeps replicating [T2:cmd] → now on S1,S2,S3
Can S1 commit [T2:cmd] just because it's on a majority now?
NO! S5 might still win election and overwrite it.
S1 must first replicate a NEW entry [T4:something] to a majority.
Once [T4:something] is committed, all entries before it
(including [T2:cmd]) are implicitly committed.
This is the "never commit entries from previous terms by counting
replicas" rule — one of the subtlest points in Raft.
Log Matching Property
Raft maintains the following invariant:
- If two entries in different logs have the same index and same term, they store the same command.
- If two entries in different logs have the same index and same term, then the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry per index per term. The second follows from the consistency check in AppendEntries: the leader includes prevLogIndex and prevLogTerm, and the follower rejects the RPC if its log doesn't match at that point. This creates an inductive proof that the logs are consistent.
Log Repair (Handling Inconsistencies)
After a leader crash, follower logs may diverge — they may have missing entries, extra uncommitted entries, or both. The leader repairs this by finding the latest point of agreement with each follower:
Log Repair with nextIndex:
Leader maintains nextIndex[i] for each follower i.
Initially set to leader's last log index + 1.
If AppendEntries fails (consistency check):
leader decrements nextIndex[i]
retries AppendEntries with earlier entries
Repeats until follower accepts → logs are now consistent.
Example:
Leader log: [1:a] [1:b] [2:c] [3:d] [3:e]
Follower: [1:a] [1:b] [2:c] [2:f] ← diverged at index 4
Step 1: Leader sends entry 5 with prevLogIndex=4, prevLogTerm=3
Follower: "I don't have term 3 at index 4" → REJECT
Step 2: Leader sends entries 4,5 with prevLogIndex=3, prevLogTerm=2
Follower: "I have term 2 at index 3!" → ACCEPT
Follower deletes [2:f], appends [3:d] [3:e]
Follower log now: [1:a] [1:b] [2:c] [3:d] [3:e] ✓
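The same repair sequence as a self-contained sketch: the "leader" backs off nextIndex one entry per rejection until the follower's consistency check passes, then overwrites the divergent suffix. All names are illustrative, and the follower logic here is deliberately simplified (it always rewrites the suffix once the check passes); the decrement-by-one back-off is the basic form, and real implementations often back up faster.
```go
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

// followerAppend performs the consistency check, then replaces everything
// after prevIndex with the leader's entries (simplified AppendEntries steps 2-4).
func followerAppend(log *[]Entry, prevIndex, prevTerm int, entries []Entry) bool {
	if prevIndex > 0 && (len(*log) < prevIndex || (*log)[prevIndex-1].Term != prevTerm) {
		return false // no matching entry at prevIndex → reject
	}
	*log = append((*log)[:prevIndex], entries...)
	return true
}

func main() {
	leader := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {3, "d"}, {3, "e"}}
	follower := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "f"}} // diverged at index 4

	nextIndex := len(leader) + 1 // start optimistic: leader's last index + 1
	for {
		prev := nextIndex - 1
		prevTerm := 0
		if prev > 0 {
			prevTerm = leader[prev-1].Term
		}
		if followerAppend(&follower, prev, prevTerm, leader[prev:]) {
			break // logs are now consistent
		}
		nextIndex-- // rejected: back up one entry and retry
	}
	fmt.Println(follower) // [{1 a} {1 b} {2 c} {3 d} {3 e}]
}
```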
Log Compaction & Snapshots
If the log grows without bound, it will eventually consume all storage and take forever to replay on restart. Raft uses snapshotting to compact the log:
- Each node independently takes a snapshot of its state machine state at a committed log index.
- All log entries up to that index are discarded.
- The snapshot includes:
- The state machine state.
- lastIncludedIndex — the last log entry included in the snapshot.
- lastIncludedTerm — the term of that entry.
Snapshot:
Before: Log = [1:a][1:b][2:c][2:d][3:e][3:f][3:g][4:h]
↑ commitIndex=8
After snapshot at index 5:
Snapshot = { state: {...}, lastIncludedIndex: 5, lastIncludedTerm: 3 }
Log = [3:f][3:g][4:h]
Memory/disk savings: 5 entries removed
If a follower is so far behind that the leader has already discarded the entries it needs, the leader sends an InstallSnapshot RPC instead of log entries. The follower replaces its state machine with the snapshot and resumes replication from there.
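A minimal sketch of the compaction step itself, assuming an illustrative Snapshot type and a log addressed by 1-based indices:
```go
package main

import "fmt"

type Entry struct {
	Term int
	Cmd  string
}

type Snapshot struct {
	State             map[string]string
	LastIncludedIndex int
	LastIncludedTerm  int
}

// compact snapshots the state machine up to index `upto` (which must be
// committed) and returns the snapshot plus the truncated log. firstIndex is
// the 1-based index of the first entry currently held in `log`.
func compact(state map[string]string, log []Entry, firstIndex, upto int) (Snapshot, []Entry) {
	snap := Snapshot{
		State:             state,
		LastIncludedIndex: upto,
		LastIncludedTerm:  log[upto-firstIndex].Term,
	}
	return snap, log[upto-firstIndex+1:]
}

func main() {
	log := []Entry{{1, "a"}, {1, "b"}, {2, "c"}, {2, "d"}, {3, "e"}, {3, "f"}, {3, "g"}, {4, "h"}}
	snap, rest := compact(map[string]string{"k": "v"}, log, 1, 5)
	fmt.Printf("snapshot: idx=%d term=%d, remaining log: %v\n",
		snap.LastIncludedIndex, snap.LastIncludedTerm, rest)
	// snapshot: idx=5 term=3, remaining log: [{3 f} {3 g} {4 h}]
}
```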
Cluster Membership Changes
Adding or removing nodes from a Raft cluster is handled via joint consensus — a two-phase approach where the cluster transitions through a configuration that requires majorities from both the old and new configurations:
Joint Consensus for Membership Change:
C_old = {A, B, C}
C_new = {A, B, C, D, E}
Phase 1: Leader commits C_old,new (joint configuration)
Decisions require majorities from BOTH C_old AND C_new.
e.g., need 2-of-{A,B,C} AND 3-of-{A,B,C,D,E}
Phase 2: Leader commits C_new
Now only C_new majority required.
This prevents "split-brain" — at no point can two independent
majorities exist that could elect different leaders.
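The quorum rule during the joint phase is easy to express directly — a small sketch with illustrative names:
```go
package main

import "fmt"

// majority reports whether the given votes cover a majority of cfg.
func majority(cfg []string, votes map[string]bool) bool {
	n := 0
	for _, node := range cfg {
		if votes[node] {
			n++
		}
	}
	return n > len(cfg)/2
}

// jointQuorum requires a majority of BOTH the old and the new configuration.
func jointQuorum(cOld, cNew []string, votes map[string]bool) bool {
	return majority(cOld, votes) && majority(cNew, votes)
}

func main() {
	cOld := []string{"A", "B", "C"}
	cNew := []string{"A", "B", "C", "D", "E"}

	// A and B alone: a majority of C_old, but only 2 of 5 in C_new → not enough.
	fmt.Println(jointQuorum(cOld, cNew, map[string]bool{"A": true, "B": true})) // false
	// A, B, D, E: 2-of-3 in C_old and 4-of-5 in C_new → decision allowed.
	fmt.Println(jointQuorum(cOld, cNew, map[string]bool{"A": true, "B": true, "D": true, "E": true})) // true
}
```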
Raft in Practice
etcd
etcd is a distributed key-value store built on Raft. It's the backbone of Kubernetes — storing all cluster state: pod definitions, service endpoints, ConfigMaps, Secrets, RBAC rules, and more.
# etcd cluster operations
# Write a key
$ etcdctl put /registry/services/nginx '{"port": 80, "replicas": 3}'
OK
# Read a key
$ etcdctl get /registry/services/nginx
/registry/services/nginx
{"port": 80, "replicas": 3}
# Watch for changes on a key
$ etcdctl watch /registry/services/nginx
PUT
/registry/services/nginx
{"port": 80, "replicas": 5}
# Check cluster health
$ etcdctl endpoint health --cluster
https://etcd1:2379 is healthy: committed proposal: took = 1.5ms
https://etcd2:2379 is healthy: committed proposal: took = 1.8ms
https://etcd3:2379 is healthy: committed proposal: took = 2.1ms
# Check who is the Raft leader
$ etcdctl endpoint status --cluster -w table
+------------------------+------------------+--------+---------+
| ENDPOINT | ID | LEADER | RAFT IDX|
+------------------------+------------------+--------+---------+
| https://etcd1:2379 | 8e9e05c52164 | true | 45892 |
| https://etcd2:2379 | 4b3e89d1a234 | false | 45892 |
| https://etcd3:2379 | 91c7b2ef8901 | false | 45890 |
+------------------------+------------------+--------+---------+
etcd implements Raft with several production-grade enhancements:
- Lease-based leader election: Prevents split-brain by requiring the leader to periodically renew its lease.
- Linearizable reads: By default, reads go through Raft (ReadIndex protocol) to guarantee linearizability. This adds ~1 RTT to reads but ensures you never read stale data.
- Watch API: Clients can subscribe to key changes — essential for Kubernetes controllers.
- MVCC storage: etcd stores all revisions of a key, enabling point-in-time queries and efficient watches.
- Snapshot + WAL: Combines periodic snapshots with a write-ahead log for durability and fast recovery.
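For programmatic access, the same operations look roughly like this with the official Go client — a sketch assuming the go.etcd.io/etcd/client/v3 package, placeholder endpoints, and trimmed error handling:
```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Connect to the cluster (endpoints are placeholders).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://etcd1:2379", "https://etcd2:2379", "https://etcd3:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	// The write goes through Raft: acknowledged only after a majority has it.
	if _, err := cli.Put(ctx, "/registry/services/nginx", `{"port": 80, "replicas": 3}`); err != nil {
		panic(err)
	}

	// Linearizable read (the default), served via the ReadIndex protocol.
	resp, err := cli.Get(ctx, "/registry/services/nginx")
	if err != nil {
		panic(err)
	}
	for _, kv := range resp.Kvs {
		fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
	}
}
```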
Consul
HashiCorp Consul uses Raft for its service mesh and KV store. Like etcd, it elects a leader among server nodes, which then replicates all state changes. Consul adds service discovery, health checking, and a Connect service mesh on top of Raft.
# Consul Raft operations
# Check Raft peers
$ consul operator raft list-peers
Node ID Address State Voter
dc1-s1 aedf4c8a-7b5c-8d2e-9f1a-3c4b5d6e7f8a 10.0.0.1:8300 leader true
dc1-s2 bf1e5d9b-8c6d-9e3f-a02b-4d5c6e7f8091 10.0.0.2:8300 follower true
dc1-s3 c0f26ea3-9d7e-af40-b13c-5e6d7f809102 10.0.0.3:8300 follower true
CockroachDB
CockroachDB uses Raft at a different granularity: each range (a contiguous slice of the keyspace, typically ~512MB) has its own Raft group. A single CockroachDB cluster may have thousands of independent Raft groups, each with its own leader. This design enables:
- Horizontal scalability: Leadership is distributed across nodes, avoiding a single bottleneck.
- Locality-aware replication: Ranges can have different replication topologies (e.g., pin certain ranges to specific regions for low-latency reads).
- Parallel consensus: Different ranges can commit independently, enabling high throughput.
CockroachDB Range-Level Raft:
Keyspace: [a ─────── g ─────── m ─────── s ─────── z]
Range 1 Range 2 Range 3 Range 4
Range 1: Raft group {Node1★, Node2, Node3} ★ = leader
Range 2: Raft group {Node2★, Node3, Node4}
Range 3: Raft group {Node3★, Node4, Node1}
Range 4: Raft group {Node4★, Node1, Node2}
Each range has its own Raft leader → leadership is distributed!
Transaction spanning Range 1 + Range 3 uses parallel Raft commits.
ZAB (ZooKeeper Atomic Broadcast)
ZAB is the consensus protocol used by Apache ZooKeeper. It was designed specifically for primary-backup systems and predates the Raft paper. While similar to Raft in many ways, ZAB has some key differences:
- Primary-ordered atomic broadcast: ZAB guarantees that all writes from a single primary are delivered in order, and that writes from different primaries are totally ordered across all replicas.
- Two modes: Recovery (leader election + synchronization) and Broadcast (normal operation).
- Epoch-based: Like Raft's terms, ZAB uses epochs. Each new leader increments the epoch, and all nodes adopt the new epoch during recovery.
ZAB Protocol Phases:
Phase 0: Leader Election
- Nodes exchange votes. Node with highest (epoch, zxid) wins.
- Similar to Raft's RequestVote but uses a different comparison.
Phase 1: Discovery
- New leader collects followers' histories.
- Determines the most up-to-date follower history.
Phase 2: Synchronization
- Leader sends its history to all followers.
- Followers update their state to match.
- Analogous to Raft's log repair but done eagerly on election.
Phase 3: Broadcast (normal operation)
- Leader proposes writes (PROPOSAL → ACK → COMMIT).
- 2-phase commit: wait for majority ACK, then COMMIT.
- Similar to Raft's AppendEntries.
Key differences from Raft:
- ZAB uses transaction IDs (zxids) — 64-bit values where the high 32 bits are the epoch and the low 32 bits are a counter. This embeds the "term" inside the transaction ID (see the sketch after this list).
- ZAB performs full synchronization on leader election (Phase 2). Raft repairs logs incrementally during normal operation.
- ZAB's broadcast is a two-phase commit (PROPOSAL → ACK → COMMIT), while Raft uses a single AppendEntries RPC with piggybacked commit information.
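A small sketch of that zxid layout — epoch in the high 32 bits, counter in the low 32 — so zxids compare correctly as plain integers:
```go
package main

import "fmt"

// makeZxid packs an epoch and a per-epoch counter into a single 64-bit zxid.
func makeZxid(epoch, counter uint32) uint64 {
	return uint64(epoch)<<32 | uint64(counter)
}

// splitZxid recovers the epoch and counter from a zxid.
func splitZxid(zxid uint64) (epoch, counter uint32) {
	return uint32(zxid >> 32), uint32(zxid & 0xFFFFFFFF)
}

func main() {
	z := makeZxid(3, 42) // 3rd epoch, 42nd transaction
	e, c := splitZxid(z)
	fmt.Printf("zxid=%#x epoch=%d counter=%d\n", z, e, c)

	// zxids from a later epoch always compare greater, regardless of counter.
	fmt.Println(makeZxid(4, 0) > makeZxid(3, 1_000_000)) // true
}
```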
Viewstamped Replication (VR)
Viewstamped Replication, designed by Brian Oki and Barbara Liskov (1988), is one of the earliest replicated state machine protocols. It predates both Paxos and Raft and introduced many of the ideas that Raft later adopted.
- Views: Equivalent to Raft's terms. Each view has a designated primary (leader). View changes are triggered by timeout-based failure detection.
- View Change Protocol: When the primary is suspected of failing, replicas initiate a view change — similar to Raft's leader election but more complex.
- Normal Operation: The primary receives client requests, assigns them operation numbers, and replicates them to backups. Once a majority acknowledges, the primary executes the operation and replies to the client.
Viewstamped Replication — Normal Operation:
Client → Primary: REQUEST(op, client-id, request-number)
Primary → Backups: PREPARE(view, op-number, op, commit-number)
Backups → Primary: PREPAREOK(view, op-number, replica-id)
Once majority PREPAREOK received:
Primary executes op → Client: REPLY(view, request-number, result)
Primary piggybacks commit-number in next PREPARE
Almost identical to Raft's AppendEntries flow!
VR's view change is more complex than Raft's leader election because VR doesn't use persistent storage — it recovers state from other replicas rather than from disk. The VR Revisited paper (2012) simplified this significantly.
Protocol Comparison
| Property | Paxos | Raft | ZAB | Viewstamped Replication |
|---|---|---|---|---|
| Year | 1989 (pub. 1998) | 2014 | 2008 | 1988 (revised 2012) |
| Author(s) | Lamport | Ongaro & Ousterhout | Junqueira et al. | Oki & Liskov |
| Leader Required? | Optional (Multi-Paxos yes) | Yes (strong leader) | Yes (primary) | Yes (primary) |
| Logical Clock | Proposal numbers | Terms | Epochs + zxid | View numbers |
| Phases per Decision | 2 (basic), 1 (Multi-Paxos steady state) | 1 (AppendEntries) | 2 (PROPOSAL→COMMIT) | 1 (PREPARE→PREPAREOK) |
| Log Repair | Not specified | Incremental (nextIndex decrement) | Eager (on leader election) | Eager (on view change) |
| Understandability | Low (notoriously hard) | High (designed for it) | Medium | Medium |
| Used By | Chubby, Spanner | etcd, Consul, CockroachDB, TiKV | ZooKeeper | Harp, PBFT (influenced) |
| Fault Tolerance | f of 2f+1 crash failures | f of 2f+1 crash failures | f of 2f+1 crash failures | f of 2f+1 crash failures |
| Persistent State | Required (voted-for, accepted) | Required (currentTerm, votedFor, log) | Required (epoch, history) | Not required (recovers from peers) |
Correctness Arguments
The Raft paper proves five key properties. Together they establish that Raft implements a correct replicated state machine:
| Property | Statement | Mechanism |
|---|---|---|
| Election Safety | At most one leader per term | Each node votes at most once per term. Majority required to win. Two majorities must overlap → at most one winner. |
| Leader Append-Only | A leader never overwrites or deletes entries in its log | Leader only appends new entries. Never modifies existing entries. |
| Log Matching | If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index | AppendEntries consistency check: prevLogIndex + prevLogTerm must match. Inductive argument from first entry. |
| Leader Completeness | If a log entry is committed in a given term, it will be present in the logs of all leaders for all higher terms | Election restriction: voter rejects candidates with less up-to-date logs. Committed entry is on a majority → any future majority overlaps → future leader has the entry. |
| State Machine Safety | If a node has applied a log entry at a given index to its state machine, no other node will ever apply a different log entry at that index | Follows from Log Matching + Leader Completeness. If entry is applied, it's committed. By Leader Completeness, all future leaders have it. By Log Matching, no different entry can exist at that index. |
Practical Considerations
Performance Characteristics
Raft Latency Breakdown:
Client write:
1. Client → Leader: ~0.5ms (within datacenter)
2. Leader appends to local log: ~0.1ms (fsync to WAL)
3. Leader → Followers: ~0.5ms (parallel RPCs)
4. Followers append + respond: ~0.6ms (fsync + network)
5. Leader commits + responds: ~0.1ms
────────────────────────────────────────
Total write latency: ~1.8ms (within DC)
Cross-region (multi-DC): ~50-150ms (WAN RTT dominates)
Client read (linearizable):
Option A: Read through Raft log: ~1.8ms (same as write)
Option B: ReadIndex (leader check): ~0.8ms (1 RTT to confirm leadership)
Option C: Lease-based reads: ~0.1ms (local read, requires clock sync)
Throughput:
Typical: 10,000-50,000 writes/sec (3-node cluster, SSD)
etcd benchmark: ~16,000 writes/sec (256-byte values, 3 nodes)
Cluster Sizing
Recommended Cluster Sizes:
3 nodes: Tolerates 1 failure. Minimum for production.
Best for: small services, dev/staging, low-latency requirements.
5 nodes: Tolerates 2 failures. Standard for critical services.
Best for: Kubernetes control plane, production databases.
7 nodes: Tolerates 3 failures. Rarely needed.
Best for: cross-region deployments (2 regions × 3 + 1 witness).
Why always odd?
Even numbers waste a node without improving fault tolerance:
4 nodes → tolerates 1 failure (same as 3!)
6 nodes → tolerates 2 failures (same as 5!)
Formula: f failures tolerated with 2f+1 nodes.
Common Pitfalls
- Clock skew: Raft doesn't depend on synchronized clocks for safety, but election timeouts behave poorly if clocks drift significantly. Lease-based reads do depend on bounded clock skew.
- Disk latency: Every Raft write requires an fsync before responding. Slow disks (spinning rust, overloaded EBS) directly increase commit latency. Use SSDs.
- Network partitions: A partitioned leader continues to accept writes that will never commit (no majority). Clients connected to the partitioned leader will experience timeouts. Implement client-side retries with leader rediscovery.
- Large snapshots: If the state machine is large (GBs), snapshot transfer to a lagging follower can saturate the network. Throttle snapshot transfers and use incremental approaches where possible.
- Thundering herd on leader failure: When the leader dies, all followers time out simultaneously. Randomized timeouts help, but in pathological cases you may see several failed elections before a leader is established.
Summary
Consensus is the foundation of reliable distributed systems. The FLP impossibility result tells us we can't have everything — determinism, asynchrony, and guaranteed termination — so practical protocols like Paxos, Raft, and ZAB trade pure asynchrony for timeout-based failure detection.
- Paxos is the theoretical foundation — provably correct, but notoriously difficult to implement.
- Raft is Paxos made understandable — same safety guarantees, cleaner decomposition, widely adopted in production systems.
- ZAB powers ZooKeeper with eager synchronization and epoch-based ordering.
- Viewstamped Replication introduced many ideas that Raft later formalized, including view-based leader election and replicated state machines.
In the next post, we explore vector clocks — the mechanism for tracking causality and detecting conflicts in systems that don't use consensus for every operation.