High Level Design Series · Building Blocks · Part 3

Replication & Consistency Models

Every distributed system eventually faces the same question: how do we keep data safe and accessible when machines fail? The answer is replication — maintaining copies of data across multiple nodes. But copies introduce a harder question: how consistent should those copies be? This post walks through every major replication strategy, the consistency models they enable, and the sharp trade-offs you'll navigate in system design interviews and real-world architectures.

1 — Why Replicate?

Replication serves three fundamental goals:

| Goal | Description | Example |
|---|---|---|
| High Availability | Survive node failures without downtime. If one replica dies, another serves requests. | A PostgreSQL primary crashes — a standby is promoted in seconds. |
| Read Scaling | Distribute read traffic across replicas. Writes go to fewer nodes, reads fan out. | An e-commerce catalog with a 100:1 read-to-write ratio adds 10 read replicas. |
| Geographic Distribution | Place data near users to reduce latency. A user in Tokyo reads from a Tokyo replica instead of crossing the Pacific. | Netflix caches region-specific content in local CDN/replica clusters. |
The Fundamental Trade-off. Replication introduces a tension between consistency (all replicas agree) and performance/availability (respond fast, even if replicas haven't caught up). The CAP theorem formalizes one dimension of this: during a network partition, you must choose between consistency and availability. The PACELC extension adds that even without partitions, there's a latency-vs-consistency trade-off.

Before we dive into models, here's the landscape of replication approaches:

```
┌─────────────────────────────────────────────────────────────┐
│                   Replication Approaches                     │
├────────────────────┬────────────────────┬───────────────────┤
│  Single-Leader     │  Multi-Leader      │  Leaderless       │
│                    │                    │                   │
│  1 writer, N       │  K writers, N      │  Any node writes  │
│  followers         │  followers         │  + reads          │
│                    │                    │                   │
│  PostgreSQL        │  CouchDB           │  Cassandra        │
│  MySQL             │  Galera Cluster    │  DynamoDB         │
│  MongoDB (RS)      │  Tungsten          │  Riak             │
└────────────────────┴────────────────────┴───────────────────┘
```

2 — Single-Leader Replication

This is the most common model. One node (the leader or primary) accepts all writes. It persists each change to local storage, then ships the write-ahead log (WAL) or binlog to its followers (also called replicas or standbys), which apply the changes in the same order.

Synchronous vs. Asynchronous Replication

| Mode | How It Works | Durability | Latency Impact |
|---|---|---|---|
| Synchronous | Leader waits for at least one follower to confirm the write before acknowledging the client. | Guaranteed: at least 2 nodes have the data. | Higher write latency (network round-trip to follower). |
| Asynchronous | Leader acknowledges immediately after the local write. Followers catch up eventually. | Risk of data loss if the leader fails before followers catch up. | Lowest write latency. |
| Semi-synchronous | One follower is synchronous, the rest are asynchronous. If the sync follower falls behind, another is promoted. | At least 2 copies guaranteed, without blocking on all followers. | Moderate: only one extra round-trip. |

PostgreSQL Streaming Replication Configuration

```ini
# postgresql.conf — Primary Server
wal_level = replica
max_wal_senders = 5                      # max number of replicas
synchronous_commit = on                  # 'on' = sync, 'off' = async
synchronous_standby_names = 'replica1'   # named sync replica
wal_keep_size = 1GB                      # retain WAL for slow replicas

# pg_hba.conf — Allow replication connections
host replication replicator 10.0.1.0/24 md5

# On Replica: recovery.conf / standby.signal
primary_conninfo = 'host=10.0.1.1 port=5432 user=replicator password=*** application_name=replica1'
primary_slot_name = 'replica1_slot'
```

Failover Process

When a leader fails, the system must promote a follower. This process is deceptively complex:

  1. Detect Failure. Typically via heartbeats or health checks. A follower that hasn't heard from the leader for wal_receiver_timeout (default 60s in PostgreSQL) suspects failure. Consensus among multiple watchers avoids false positives.
  2. Choose New Leader. The follower with the most up-to-date WAL position is the best candidate. In PostgreSQL, tools like pg_ctl promote or Patroni handle this automatically. The chosen follower stops replay, switches to read-write mode, and begins streaming WAL to the remaining followers.
  3. Redirect Clients. DNS, a virtual IP (VIP), or a proxy (PgBouncer, HAProxy) reroutes traffic to the new leader. Patroni updates etcd/ZooKeeper to advertise the new leader.
  4. Reconcile Old Leader. When the old leader recovers, it must rejoin as a follower. Any writes it accepted after the last replicated WAL position are lost or must be manually reconciled.
⚠ Split-Brain Problem. If the old leader comes back online before the system finishes failover, two nodes believe they are the leader. Both accept writes, causing divergent data. Fencing mechanisms (STONITH — "Shoot The Other Node In The Head") forcibly power off the old leader. Without fencing, split-brain can corrupt your entire dataset.
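Power fencing isn't the only option: the same guarantee can be enforced at the storage layer with fencing tokens, where each promotion hands out a higher epoch number and storage rejects writes bearing a stale one. Here is a minimal sketch (the FencedStore class and epoch values are hypothetical, for illustration):

```javascript
// Sketch of fencing with monotonically increasing epochs (illustrative;
// this complements STONITH rather than replacing it). The storage layer
// remembers the highest leadership epoch it has seen and rejects writes
// from any node presenting an older one.
class FencedStore {
  constructor() {
    this.highestEpoch = 0;
    this.data = new Map();
  }
  write(epoch, key, value) {
    if (epoch < this.highestEpoch) {
      throw new Error(`rejected: epoch ${epoch} < ${this.highestEpoch}`);
    }
    this.highestEpoch = epoch; // latch the newest epoch we've observed
    this.data.set(key, value);
  }
}

const store = new FencedStore();
store.write(1, "balance", 100);   // original leader, epoch 1
store.write(2, "balance", 120);   // promoted leader, epoch 2
try {
  store.write(1, "balance", 999); // deposed leader wakes up: rejected
} catch (e) {
  console.log(e.message);
}
```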


3 — Multi-Leader Replication

Multiple nodes accept writes independently. Each leader replicates to the others. This increases write availability and is essential when you need writes from multiple locations simultaneously.

When to Use Multi-Leader

  • Multi-datacenter deployments: each DC has a local leader, so writes avoid a cross-WAN round-trip.
  • Clients that operate offline: each device acts as a leader and syncs its changes when reconnected (calendar apps, CouchDB-style sync).
  • Collaborative editing: every participant's session accepts writes, and concurrent edits are merged afterward.

Conflict Resolution Strategies

The core problem: two leaders can modify the same row concurrently. When their changes are replicated to each other, there's a conflict.

| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Last-Write-Wins (LWW) | Attach a timestamp to each write. The latest timestamp wins; earlier writes are silently discarded. | Simple, deterministic. | Loses data. Clock skew across nodes can cause the wrong "last" write to win. |
| Merge | Concatenate or union conflicting values. E.g., two users add different tags — merge keeps both. | No data loss for additive operations. | Not meaningful for all data types (e.g., you can't merge two different email addresses). |
| Custom Application Logic | Write a conflict handler. On conflict, the application decides — maybe show the user both versions, or apply domain-specific rules. | Most flexible, domain-aware. | Complex to implement and test. |
| CRDTs | Data structures that mathematically guarantee convergence without coordination: G-Counter, PN-Counter, LWW-Register, OR-Set. | Automatic conflict-free merging. Mathematically proven convergence. | Limited to certain data types. Higher memory overhead (metadata for convergence). |

CRDTs: Conflict-free Replicated Data Types

CRDTs are algebraic data structures where all concurrent operations commute — meaning the order of application doesn't matter. Every replica arrives at the same state regardless of the order it receives updates.

G-Counter (Grow-Only Counter). Each node maintains its own counter; the merged value is the sum across all nodes.

```
Node A: [A:3, B:0, C:0] → Total = 3
Node B: [A:0, B:5, C:0] → Total = 5
Node C: [A:0, B:0, C:2] → Total = 2
Merged: [A:3, B:5, C:2] → Total = 10
```

PN-Counter (Positive-Negative Counter). Two G-Counters: one for increments, one for decrements. Value = P − N.

OR-Set (Observed-Remove Set). Each element is tagged with a unique ID on add. Remove only removes observed tags, so a concurrent add + remove results in the element being present (add wins).
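To make the merge rule concrete, here's a minimal sketch of the G-Counter above in JavaScript (illustrative, not any particular CRDT library's API). Merge takes the per-node maximum, which makes it commutative and idempotent: the two properties that guarantee convergence.

```javascript
// Minimal G-Counter sketch (illustrative). Each replica increments only
// its own slot; merge takes the per-node maximum, so merges commute and
// are idempotent, and replicas converge regardless of delivery order.
class GCounter {
  constructor(nodeId) {
    this.nodeId = nodeId;
    this.counts = {};            // nodeId -> count
  }
  increment(by = 1) {
    this.counts[this.nodeId] = (this.counts[this.nodeId] || 0) + by;
  }
  value() {
    return Object.values(this.counts).reduce((a, b) => a + b, 0);
  }
  merge(other) {
    for (const [node, count] of Object.entries(other.counts)) {
      this.counts[node] = Math.max(this.counts[node] || 0, count);
    }
  }
}

// Usage: two replicas increment independently, then gossip state.
const a = new GCounter("A"), b = new GCounter("B");
a.increment(3); b.increment(5);
a.merge(b); b.merge(a);
console.log(a.value(), b.value()); // 8 8 (converged)
```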

Real-world examples: CouchDB uses revision trees with application-level conflict resolution. Galera Cluster (MySQL) uses certification-based replication where conflicting transactions are rolled back and retried. Riak uses CRDTs natively for counters, sets, and maps.

4 — Leaderless Replication

In leaderless (Dynamo-style) systems, there is no designated leader. Any replica can accept both reads and writes. The client (or a coordinator) sends writes to multiple replicas in parallel and reads from multiple replicas, then resolves discrepancies.

Write and Read Protocol

```
Write path (N=3, W=2):
  Client → Node A ✓ (ack)
  Client → Node B ✓ (ack)      → W=2 acks received, write succeeds
  Client → Node C ✗ (down)     → ignored for now

Read path (N=3, R=2):
  Client ← Node A (v=42, timestamp=T5)
  Client ← Node B (v=42, timestamp=T5)  → R=2 responses, consistent
  Client ← Node C (timeout)             → ignored

Conflict scenario:
  Client ← Node A (v=42, timestamp=T5)
  Client ← Node B (v=40, timestamp=T3)  → stale! use v=42 (latest timestamp)
```
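The coordinator's write path is mechanical: fan out to all N replicas, succeed on W acks, fail once W acks become impossible. A minimal sketch, assuming each replica exposes a hypothetical async write(key, value) method (reads mirror this with R responses plus a timestamp comparison):

```javascript
// Sketch of a leaderless coordinator's write path (illustrative;
// `replica.write(key, value)` is an assumed async API, not a real client).
// Resolve as soon as W acks arrive; reject once W acks become impossible.
function quorumWrite(replicas, key, value, W) {
  return new Promise((resolve, reject) => {
    let acks = 0;
    let failures = 0;
    for (const replica of replicas) {
      replica.write(key, value).then(
        () => {
          acks += 1;
          if (acks === W) resolve(acks); // quorum reached: success
        },
        () => {
          failures += 1;
          // Too many failures: the remaining replicas can't reach W acks.
          if (failures > replicas.length - W) {
            reject(new Error("quorum not met"));
          }
        }
      );
    }
  });
}
```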

Key Mechanisms

| Mechanism | Purpose | How It Works |
|---|---|---|
| Sloppy Quorum | Increase availability during node failures. | If a designated replica is down, the write is sent to a different node temporarily. That node holds the data via hinted handoff and forwards it when the original node recovers. |
| Read Repair | Fix stale replicas on the read path. | When a read detects a stale response from one replica, the coordinator sends the latest value back to that replica. Repairs happen lazily, on read. |
| Anti-Entropy (Merkle Trees) | Background consistency repair. | Replicas periodically exchange Merkle trees (hash trees) of their data. Differing branches identify which key ranges are inconsistent. Only the inconsistent ranges are synced — this is efficient even for terabytes of data (see the sketch below). |
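To see why Merkle trees make anti-entropy cheap, here's a toy comparison in JavaScript (illustrative; it assumes both replicas hold the same key set, so the two trees have identical shape). Matching subtree hashes are skipped wholesale, so only differing ranges are ever visited:

```javascript
const crypto = require("crypto");
const sha = (s) => crypto.createHash("sha256").update(s).digest("hex");

// Build a binary Merkle tree over an ordered list of {key, hash} leaves.
function buildTree(leaves) {
  if (leaves.length === 1) return leaves[0];
  const mid = Math.ceil(leaves.length / 2);
  const left = buildTree(leaves.slice(0, mid));
  const right = buildTree(leaves.slice(mid));
  return { hash: sha(left.hash + right.hash), left, right };
}

// Walk two same-shaped trees; collect keys whose leaf hashes differ.
// Subtrees with matching hashes are skipped entirely.
function diffKeys(a, b, out = []) {
  if (a.hash === b.hash) return out;
  if (!a.left) { out.push(a.key); return out; } // differing leaf
  diffKeys(a.left, b.left, out);
  diffKeys(a.right, b.right, out);
  return out;
}

const toLeaves = (kvs) => kvs.map(([k, v]) => ({ key: k, hash: sha(k + v) }));
const replicaA = buildTree(toLeaves([["a", "1"], ["b", "2"], ["c", "3"], ["d", "4"]]));
const replicaB = buildTree(toLeaves([["a", "1"], ["b", "9"], ["c", "3"], ["d", "4"]]));
console.log(diffKeys(replicaA, replicaB)); // ["b"] (only this range needs syncing)
```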

Cassandra Consistency Levels

```sql
-- Cassandra CQL: tunable consistency per query

-- Strong consistency: W + R > N (e.g., QUORUM + QUORUM with RF=3)
CONSISTENCY QUORUM;
INSERT INTO users (id, name) VALUES (1, 'Alice');

-- Read with QUORUM: reads from ⌊3/2⌋+1 = 2 nodes
CONSISTENCY QUORUM;
SELECT * FROM users WHERE id = 1;

-- Fast writes, eventual reads
CONSISTENCY ONE;
INSERT INTO events (id, data) VALUES (uuid(), 'click');

-- Strongest: all replicas must respond
CONSISTENCY ALL;
SELECT * FROM users WHERE id = 1;

-- Cross-datacenter: LOCAL_QUORUM for same-DC strong consistency
CONSISTENCY LOCAL_QUORUM;
INSERT INTO orders (id, total) VALUES (uuid(), 99.99);
```

Dynamo-style systems: Amazon DynamoDB, Apache Cassandra, Riak, and Voldemort all follow this pattern. Each tunes the W/R/N knobs differently, and each adds its own conflict resolution (DynamoDB uses LWW by default, Riak supports CRDTs, Cassandra uses LWW with tombstones).

5 — Consistency Models

A consistency model defines the contract between the data store and the application: what are you guaranteed to see when you read? Models are ordered from strongest to weakest:

🔴 Strong Consistency (Linearizability)

Once a write completes, every subsequent read (from any node) sees the new value. The system behaves as if there's a single copy.

  • Example: You update your profile picture. Any user on any continent who loads your profile after the update sees the new picture.
  • Cost: High latency (must coordinate across replicas), lower availability during partitions.
  • Systems: ZooKeeper, etcd, Spanner (TrueTime).

🟠 Sequential Consistency

All nodes see operations in the same order, but that order may not match real-time wall-clock order.

  • Example: User A posts a message, User B posts after. All users see both messages in the same order, but B's message might appear before A's if that's the order the system chose.
  • Systems: Some configurations of ZooKeeper (ordered reads from a single server).

🟡 Causal Consistency

If operation A causally precedes B (A happened-before B), then every node sees A before B. Concurrent operations (no causal relation) may be seen in different orders by different nodes.

  • Example: Alice posts "What's for lunch?", Bob replies "Pizza." Every node sees Alice's post before Bob's reply. But Charlie's unrelated post can appear anywhere.
  • Systems: MongoDB (causal consistency sessions), COPS.

🟢 Eventual Consistency

If no new writes arrive, all replicas eventually converge to the same value. No guarantees about what you see in the meantime.

  • Example: You update your bio. For the next few seconds, some users see the old bio, others see the new one. Eventually all see the new one.
  • Systems: DynamoDB (default), Cassandra (ONE), DNS, CDN caches.

Session Guarantees (Practical Consistency)

Pure eventual consistency is often too weak. Most systems layer session guarantees on top:

| Guarantee | Promise | Example | Implementation |
|---|---|---|---|
| Read-Your-Writes | After you write a value, your subsequent reads see that value (or a newer one). | After posting a comment, you see your comment immediately (even if other users see it later). | Route the user's reads to the same replica that accepted the write, or track a logical timestamp (see the sketch below). |
| Monotonic Reads | Once you read a value at time T, you never see an older value in subsequent reads. | Refreshing a page never shows older data than what you saw before. | Pin the user to a single replica, or track the latest read timestamp per user. |
| Consistent Prefix Reads | If a sequence of writes occurs in order (A, B, C), you never see them out of order (e.g., B, A, C). | In a chat, you never see a reply before the message it's replying to. | Ensure causally related writes are routed to the same partition, or use causal ordering. |
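As a sketch of the timestamp-tracking approach to read-your-writes (illustrative; the client object, replica.read(), leader.read(), and lastWriteTs are hypothetical names, not a real driver API):

```javascript
// Read-your-writes via logical timestamps (sketch). The client remembers
// the timestamp of its last write; a replica's response counts only if
// that replica has applied at least that write.
async function readYourWrites(client, replicas, leader, key) {
  for (const replica of replicas) {
    const { value, appliedTs } = await replica.read(key);
    if (appliedTs >= client.lastWriteTs) return value; // replica caught up
  }
  // Every replica is still behind this client's last write: fall back
  // to the leader, which is authoritative by definition.
  return (await leader.read(key)).value;
}
```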
💡 Interview Tip. When asked "what consistency model would you choose?", don't default to "eventual consistency because it's fast." Instead, identify the minimum consistency your application needs. A banking ledger needs linearizability for balances. A social media feed can tolerate eventual consistency for likes but needs read-your-writes for the author's own posts.

6 — Quorum Consensus

Quorums are the mathematical backbone of leaderless replication. With N replicas, a write quorum W and a read quorum R, the key invariant is:

W + R > N

This guarantees that the set of nodes responding to a read overlaps with the set of nodes that acknowledged the write: the two sets must share at least W + R − N ≥ 1 nodes, so at least one node in the read set has the latest data.
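If you want to convince yourself, here's a brute-force check over every possible quorum combination for N=5 (illustrative):

```javascript
// Brute-force verification of the overlap argument for N=5: every possible
// write set of size W intersects every read set of size R exactly when
// W + R > N.
function subsets(arr, k) {
  if (k === 0) return [[]];
  if (arr.length < k) return [];
  const [head, ...rest] = arr;
  return [
    ...subsets(rest, k - 1).map((s) => [head, ...s]), // subsets using head
    ...subsets(rest, k),                              // subsets skipping head
  ];
}

const nodes = ["n1", "n2", "n3", "n4", "n5"];
const alwaysOverlap = (W, R) =>
  subsets(nodes, W).every((w) =>
    subsets(nodes, R).every((r) => r.some((n) => w.includes(n))));

console.log(alwaysOverlap(3, 3)); // true  (3 + 3 > 5)
console.log(alwaysOverlap(1, 1)); // false (1 + 1 ≤ 5: stale reads possible)
```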

Tunable Consistency

| Configuration (N=5) | W | R | W+R | Behavior |
|---|---|---|---|---|
| Fast writes | 1 | 5 | 6 > 5 ✓ | Writes are instant (ack from 1 node). Reads query all 5 nodes. Best for write-heavy, read-rare workloads. |
| Fast reads | 5 | 1 | 6 > 5 ✓ | Writes block until all 5 nodes ack. Reads are instant from any 1 node. Best for read-heavy, write-rare workloads. |
| Balanced | 3 | 3 | 6 > 5 ✓ | Both writes and reads contact a majority. Good all-around choice. Most common in practice (⌊N/2⌋+1). |
| Eventual (weak) | 1 | 1 | 2 ≤ 5 ✗ | No overlap guaranteed. Fast, but you may read stale data. Used when stale reads are acceptable. |


Sloppy Quorum

A strict quorum requires responses from the designated "home" nodes. A sloppy quorum relaxes this: if a home node is unreachable, the request is handled by a different node that is not one of the N designated replicas. This increases write availability but weakens the consistency guarantee — W + R > N no longer ensures that reads overlap with writes on the designated replicas.

When the original node recovers, the temporary stand-in sends the data back via hinted handoff. Riak calls this "sloppy quorum" explicitly and makes it configurable; Amazon's original Dynamo applied it transparently.

7 — Conflict Resolution

Whenever two or more nodes accept writes concurrently (multi-leader or leaderless), conflicts can arise. Here's a deep dive into each resolution strategy:

Last-Write-Wins (LWW)

✅ Pros
  • Simple to implement
  • Deterministic — all replicas converge
  • No application code needed
❌ Cons
  • Silently drops concurrent writes
  • Clock skew can pick the "wrong" winner
  • Not suitable for data that must not be lost

Use case: Session stores, caches, sensor data where only the latest reading matters. Cassandra uses LWW as its default conflict resolution.
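The clock-skew hazard is easy to demonstrate (illustrative sketch with made-up timestamps):

```javascript
// Illustrative LWW register: merge keeps the value with the larger
// timestamp. If node clocks are skewed, a logically later write can lose.
const lwwMerge = (a, b) => (b.ts > a.ts ? b : a);

// Node B's clock runs behind. Its write really happened second,
// but LWW silently discards it.
const fromA = { value: "alice@old.example", ts: 1700000030000 };
const fromB = { value: "alice@new.example", ts: 1700000005000 }; // skewed clock
console.log(lwwMerge(fromA, fromB).value);
// "alice@old.example" (the newer write was lost)
```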

Vector Clocks

A vector clock is a list of (node, counter) pairs. Each node increments its own counter on each write. When two vector clocks are incomparable (neither dominates the other), the writes are concurrent and must be resolved.

```
Example: Two concurrent writes
  Client 1 writes to Node A:  VC = {A:1}
  Client 2 writes to Node B:  VC = {B:1}
  Neither dominates → CONFLICT detected
  {A:1} vs {B:1} → concurrent writes
  After merge by application: VC = {A:1, B:1}

Example: Causal ordering
  Write 1 on A: VC = {A:1}
  Write 2 on A: VC = {A:2}        (A:2 dominates A:1 → no conflict)
  Write 3 on B: VC = {A:2, B:1}   (saw A:2, then wrote on B)
  {A:2, B:1} dominates {A:2} → causal, no conflict
```
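The dominance check itself is mechanical. A small JavaScript sketch, representing vector clocks as plain nodeId → counter objects:

```javascript
// Compare two vector clocks (plain objects: nodeId -> counter).
// Returns "before", "after", "equal", or "concurrent".
function compareVC(a, b) {
  let aLess = false, bLess = false;
  for (const node of new Set([...Object.keys(a), ...Object.keys(b)])) {
    const ca = a[node] || 0, cb = b[node] || 0;
    if (ca < cb) aLess = true;
    if (cb < ca) bLess = true;
  }
  if (aLess && bLess) return "concurrent"; // neither dominates: conflict
  if (aLess) return "before";
  if (bLess) return "after";
  return "equal";
}

console.log(compareVC({ A: 1 }, { B: 1 }));       // "concurrent"
console.log(compareVC({ A: 2 }, { A: 2, B: 1 })); // "before" (causal)
```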
✅ Pros
  • Detects true conflicts (no false positives)
  • Tracks causal relationships
  • No data loss (conflicts surfaced to app)
❌ Cons
  • Clock size grows with number of nodes
  • Application must handle conflict resolution
  • Pruning old entries can miss conflicts

Use case: Amazon's original Dynamo paper (2007), Riak (dotted version vectors, an optimization). Suitable when data loss is unacceptable and conflicts must be surfaced.

Application-Level Merge

The application provides a merge function that is called when conflicts are detected. This is the most flexible but most complex approach.

```javascript
// Example: CouchDB conflict resolution in JavaScript
function resolve(conflict) {
  const versions = conflict.getConflictingRevisions();
  // Domain-specific merge: union of shopping cart items
  const merged = {
    items: [...new Set(versions.flatMap(v => v.items))],
    updatedAt: new Date().toISOString()
  };
  conflict.resolve(merged);
}
```

Use case: Shopping carts (union of items), collaborative documents (operational transforms), any domain where a meaningful merge exists.

CRDTs (Revisited)

CRDTs eliminate the need for conflict resolution entirely by designing data structures where all operations commute. They are ideal for counters, sets, registers, and graphs in eventually consistent systems.

✅ Pros
  • Zero coordination needed
  • Mathematically guaranteed convergence
  • Works offline and across partitions
❌ Cons
  • Only certain data structures supported
  • Metadata overhead (tombstones, version info)
  • Complex to implement correctly

Use case: Riak data types, Redis CRDTs (Enterprise), Automerge for collaborative editing, Figma's multiplayer editing engine.
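Since the OR-Set's "add wins" behavior is the subtlest of the types above, here's a minimal sketch in JavaScript (illustrative, not a production implementation; real OR-Sets also garbage-collect removed tags):

```javascript
const { randomUUID } = require("crypto");

// Minimal OR-Set sketch. Every add creates a fresh unique tag; remove
// deletes only tags this replica has observed, so a concurrent add
// (whose tag the remover never saw) survives the merge: add wins.
class ORSet {
  constructor() {
    this.added = new Map();   // element -> Set of tags ever added
    this.removed = new Set(); // tags that have been removed
  }
  add(el) {
    if (!this.added.has(el)) this.added.set(el, new Set());
    this.added.get(el).add(randomUUID());
  }
  remove(el) {
    for (const tag of this.added.get(el) ?? []) this.removed.add(tag);
  }
  has(el) {
    return [...(this.added.get(el) ?? [])].some((t) => !this.removed.has(t));
  }
  merge(other) {
    for (const [el, tags] of other.added) {
      if (!this.added.has(el)) this.added.set(el, new Set());
      for (const t of tags) this.added.get(el).add(t);
    }
    for (const t of other.removed) this.removed.add(t);
  }
}

const a = new ORSet(), b = new ORSet();
a.add("pizza"); b.merge(a); // both replicas contain "pizza"
b.remove("pizza");          // B removes the tag it observed
a.add("pizza");             // concurrent re-add on A with a fresh tag
a.merge(b); b.merge(a);
console.log(a.has("pizza"), b.has("pizza")); // true true (add wins)
```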

8 — Replication Lag

Replication lag is the delay between a write being accepted by the leader (or quorum) and that write being visible on a follower (or all replicas). In asynchronous replication, lag is inevitable and variable.

Measuring Lag

```
# PostgreSQL: check replication lag on the primary
SELECT client_addr,
       application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), sent_lsn) AS send_lag_bytes,
       pg_wal_lsn_diff(sent_lsn, replay_lsn)           AS replay_lag_bytes,
       replay_lag                                      AS replay_lag_time
FROM pg_stat_replication;

# MySQL: check seconds_behind_master
SHOW SLAVE STATUS\G
# Look for: Seconds_Behind_Master: 3

# Cassandra: no built-in lag metric (leaderless)
# Monitor via read-repair counts and speculative retry rates
nodetool tablestats keyspace.table | grep repair
```

Impact of Lag

| Problem | Scenario | Mitigation |
|---|---|---|
| Stale Reads | User writes a comment, then reads from a lagging follower and doesn't see it. | Read-your-writes: route the author's reads to the leader (or a recent replica) for a time window after writing. |
| Non-monotonic Reads | User refreshes twice. First request hits an up-to-date follower (sees 10 comments). Second hits a lagging follower (sees 8 comments). Feels like data disappeared. | Monotonic reads: pin the user to a single replica using consistent hashing on the user ID (see the sketch below). |
| Cross-DC Lag | A write in US-East takes 150ms+ to reach an EU-West replica. During that window, EU users see stale data. | LOCAL_QUORUM writes + LOCAL_QUORUM reads within each DC. Accept that cross-DC reads may be eventually consistent. |
| Causality Violation | Alice writes A, Bob reads A and writes B (referencing A). A third user sees B but not A — the reply without the original message. | Causal consistency via vector clocks or logical timestamps. Ensure causally related writes become visible in order. |
Real-world lag numbers. Same-rack async replication: <1ms. Same-DC async: 1–5ms. Cross-DC (same continent): 20–50ms. Cross-ocean: 100–300ms. These are best-case; under load, lag can spike to seconds or minutes. Always design for the worst case, not the average.
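The monotonic-reads mitigation from the table above can be as small as deterministic replica selection. A sketch (this uses simple hash-mod pinning for brevity; production systems use consistent hashing so that adding a replica only remaps a fraction of users):

```javascript
const crypto = require("crypto");

// Monotonic reads via replica pinning (sketch): hash the user ID so the
// same user always reads from the same replica and never jumps to one
// that is further behind.
function replicaFor(userId, replicas) {
  const digest = crypto.createHash("md5").update(userId).digest();
  return replicas[digest.readUInt32BE(0) % replicas.length];
}

const replicas = ["10.0.1.2", "10.0.1.3", "10.0.1.4"];
console.log(replicaFor("user-42", replicas)); // same replica on every call
```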

Putting It All Together

Here's a decision guide for choosing a replication strategy:

| Requirement | Replication Model | Consistency | Example System |
|---|---|---|---|
| Simple OLTP, moderate scale | Single-leader (sync standby) | Strong (reads from leader) | PostgreSQL + Patroni |
| Multi-region writes, offline support | Multi-leader | Eventual + conflict resolution | CouchDB, Galera |
| Massive scale, tunable per-query | Leaderless (quorum) | Tunable (QUORUM → strong, ONE → eventual) | Cassandra, DynamoDB |
| Global strong consistency | Single-leader with Paxos/Raft | Linearizable | Google Spanner, CockroachDB |
| Collaborative real-time editing | Multi-leader + CRDTs | Causal + convergence | Figma, Automerge |
💡 System Design Interview Framework:
  1. Identify read:write ratio — high reads → single-leader with read replicas; balanced → leaderless quorum.
  2. Determine consistency requirements — banking → linearizable; social feed → eventual + read-your-writes.
  3. Consider geographic distribution — single DC → single-leader; multi-DC → multi-leader or leaderless with LOCAL_QUORUM.
  4. Plan for failure — how does failover work? What happens during a partition? What's the blast radius?
  5. Quantify lag tolerance — can the user wait 100ms for consistency? Or must reads always be fresh?