Replication & Consistency Models
Every distributed system eventually faces the same question: how do we keep data safe and accessible when machines fail? The answer is replication — maintaining copies of data across multiple nodes. But copies introduce a harder question: how consistent should those copies be? This post walks through every major replication strategy, the consistency models they enable, and the sharp trade-offs you'll navigate in system design interviews and real-world architectures.
1 — Why Replicate?
Replication serves three fundamental goals:
| Goal | Description | Example |
|---|---|---|
| High Availability | Survive node failures without downtime. If one replica dies, another serves requests. | A PostgreSQL primary crashes — a standby is promoted in seconds. |
| Read Scaling | Distribute read traffic across replicas. Writes go to fewer nodes, reads fan out. | An e-commerce catalog with 100:1 read-to-write ratio adds 10 read replicas. |
| Geographic Distribution | Place data near users to reduce latency. A user in Tokyo reads from a Tokyo replica instead of crossing the Pacific. | Netflix caches region-specific content in local CDN/replica clusters. |
Before we dive into the models, a quick map of the territory: replication approaches fall into three broad families (single-leader, multi-leader, and leaderless), each covered in turn below.
2 — Single-Leader Replication
This is the most common model. One node (the leader or primary) accepts all writes. It persists each change to its local storage, then ships the write-ahead log (WAL) or binlog to its followers (also called replicas or standbys). Followers apply these changes in the same order.
Synchronous vs. Asynchronous Replication
| Mode | How It Works | Durability | Latency Impact |
|---|---|---|---|
| Synchronous | Leader waits for at least one follower to confirm the write before acknowledging the client. | Guaranteed: at least 2 nodes have the data. | Higher write latency (network round-trip to follower). |
| Asynchronous | Leader acknowledges immediately after local write. Followers catch up eventually. | Risk of data loss if leader fails before followers catch up. | Lowest write latency. |
| Semi-synchronous | One follower is synchronous, rest are asynchronous. If the sync follower falls behind, another is promoted. | At least 2 copies guaranteed, without blocking on all followers. | Moderate. Only one extra round-trip. |
PostgreSQL Streaming Replication Configuration
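Exact settings vary by version and topology; as a sketch, a minimal semi-synchronous setup might look like this on the primary (the hostname and standby name below are placeholders):

```ini
# postgresql.conf on the primary (illustrative values)
wal_level = replica                     # emit enough WAL for streaming replicas
max_wal_senders = 5                     # max concurrent replication connections
synchronous_standby_names = 'standby1'  # this standby must ack before commit returns
synchronous_commit = on                 # wait for the sync standby's acknowledgment

# On each standby, primary_conninfo points back at the leader, e.g.:
# primary_conninfo = 'host=primary.example.com port=5432 user=replicator'
```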
Failover Process
When a leader fails, the system must promote a follower. This process is deceptively complex:
1. Detect failure. Typically via heartbeats or health checks. A follower that hasn't heard from the leader for `wal_receiver_timeout` (default 60s in PostgreSQL) suspects failure. Consensus among multiple watchers avoids false positives.
2. Choose a new leader. The follower with the most up-to-date WAL position is the best candidate. In PostgreSQL, `pg_ctl promote` or tools like Patroni handle this automatically. The chosen follower stops replay, switches to read-write mode, and begins serving WAL to the remaining followers.
3. Redirect clients. DNS, a virtual IP (VIP), or a proxy (PgBouncer, HAProxy) reroutes traffic to the new leader. Patroni updates etcd/ZooKeeper to advertise the new leader.
4. Reconcile the old leader. When the old leader recovers, it must rejoin as a follower. Any writes it accepted after the last replicated WAL position are lost or must be manually reconciled.
3 — Multi-Leader Replication
Multiple nodes accept writes independently. Each leader replicates to the others. This increases write availability and is essential when you need writes from multiple locations simultaneously.
When to Use Multi-Leader
- Multi-datacenter deployments: Each datacenter has its own leader. Writes are local (low latency), then replicated cross-DC asynchronously. If one DC goes down, others continue operating.
- Offline-capable clients: Think Google Docs or a notes app. Each device is a "leader" that accepts writes offline. When connectivity returns, changes sync bidirectionally.
- Collaborative editing: Multiple users editing the same document simultaneously — each user's client acts as a leader, and changes are merged in real time.
Conflict Resolution Strategies
The core problem: two leaders can modify the same row concurrently. When their changes are replicated to each other, there's a conflict.
| Strategy | How It Works | Pros | Cons |
|---|---|---|---|
| Last-Write-Wins (LWW) | Attach a timestamp to each write. The latest timestamp wins; earlier writes are silently discarded. | Simple, deterministic. | Loses data. Clock skew across nodes can cause wrong "last" write to win. |
| Merge | Concatenate or union conflicting values. E.g., two users add different tags — merge keeps both. | No data loss for additive operations. | Not meaningful for all data types (e.g., you can't merge two different email addresses). |
| Custom Application Logic | Write a conflict handler. On conflict, the application decides — maybe show the user both versions, or apply domain-specific rules. | Most flexible, domain-aware. | Complex to implement and test. |
| CRDTs | Data structures that mathematically guarantee convergence without coordination: G-Counter, PN-Counter, LWW-Register, OR-Set. | Automatic conflict-free merging. Mathematically proven convergence. | Limited to certain data types. Higher memory overhead (metadata for convergence). |
CRDTs: Conflict-free Replicated Data Types
CRDTs are algebraic data structures where all concurrent operations commute — meaning the order of application doesn't matter. Every replica arrives at the same state regardless of the order it receives updates.
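To make "all operations commute" concrete, here is a minimal grow-only counter (G-Counter) sketch in Python. It is illustrative, not a production CRDT library: each replica increments only its own slot, and merge takes the element-wise max, which is commutative, associative, and idempotent.

```python
class GCounter:
    """Grow-only counter CRDT: per-replica slots, merged by element-wise max."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.counts: dict[str, int] = {}

    def increment(self, n: int = 1) -> None:
        # A replica only ever advances its own slot.
        self.counts[self.node_id] = self.counts.get(self.node_id, 0) + n

    def value(self) -> int:
        return sum(self.counts.values())

    def merge(self, other: "GCounter") -> None:
        # max is commutative, associative, and idempotent, so merge order can't matter.
        for node, count in other.counts.items():
            self.counts[node] = max(self.counts.get(node, 0), count)

a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)      # concurrent increments on different replicas
a.merge(b); b.merge(a)              # merge in either order, any number of times
assert a.value() == b.value() == 5  # both replicas converge
```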
Real-world examples: CouchDB uses revision trees with application-level conflict resolution. Galera Cluster (MySQL) uses certification-based replication where conflicting transactions are rolled back and retried. Riak uses CRDTs natively for counters, sets, and maps.
4 — Leaderless Replication
In leaderless (Dynamo-style) systems, there is no designated leader. Any replica can accept both reads and writes. The client (or a coordinator) sends writes to multiple replicas in parallel and reads from multiple replicas, then resolves discrepancies.
Write and Read Protocol
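The flow is easiest to see as code. Below is a toy Python coordinator, assuming in-memory `Replica` objects and client-supplied versions; real systems add timeouts, retries, hinted handoff, and server-side versioning.

```python
# Toy Dynamo-style coordinator: write to all N replicas, succeed at W acks;
# read from R replicas and return the highest-versioned response.

class Replica:
    def __init__(self):
        self.store = {}  # key -> (version, value)

    def put(self, key, version, value):
        # Keep the incoming write only if it's newer than what we hold.
        if key not in self.store or version > self.store[key][0]:
            self.store[key] = (version, value)
        return True  # ack

    def get(self, key):
        return self.store.get(key, (0, None))

N, W, R = 3, 2, 2
replicas = [Replica() for _ in range(N)]

def quorum_write(key, version, value):
    acks = sum(r.put(key, version, value) for r in replicas)
    return acks >= W  # W only matters once some replicas fail or time out

def quorum_read(key):
    responses = [r.get(key) for r in replicas[:R]]
    return max(responses)  # highest (version, value) pair: newest write wins

quorum_write("user:42", version=1, value="alice")
print(quorum_read("user:42"))  # (1, 'alice')
```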
Key Mechanisms
| Mechanism | Purpose | How It Works |
|---|---|---|
| Sloppy Quorum | Increase availability during node failures. | If a designated replica is down, the write is sent to a different node temporarily. That node holds the data via hinted handoff and forwards it when the original node recovers. |
| Read Repair | Fix stale replicas on the read path. | When a read detects a stale response from one replica, the coordinator sends the latest value back to that replica. Repairs happen lazily, on read. |
| Anti-Entropy (Merkle Trees) | Background consistency repair. | Replicas periodically exchange Merkle trees (hash trees) of their data. Differing branches identify which key ranges are inconsistent. Only the inconsistent ranges are synced — this is efficient even for terabytes of data. |
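A full Merkle tree is beyond a blog snippet, but the flat version of the same idea, comparing per-range hashes so only differing ranges get synced, fits in a few lines of Python (everything here is illustrative):

```python
# Anti-entropy, flattened: hash each key range; matching hashes mean the
# range needs no sync. Merkle trees extend this hierarchically so replicas
# can drill down from a single root hash.
import hashlib

def bucket(key: str, num_ranges: int) -> int:
    # Deterministic key-range assignment (Python's builtin hash() is per-process salted).
    return int(hashlib.sha256(key.encode()).hexdigest(), 16) % num_ranges

def range_hashes(store: dict[str, str], num_ranges: int = 4) -> list[str]:
    buckets = [[] for _ in range(num_ranges)]
    for key in sorted(store):
        buckets[bucket(key, num_ranges)].append((key, store[key]))
    return [hashlib.sha256(repr(b).encode()).hexdigest() for b in buckets]

a = {"user:1": "alice", "user:2": "bob"}
b = {"user:1": "alice", "user:2": "bobby"}  # replica b diverged on user:2
diff = [i for i, (ha, hb) in enumerate(zip(range_hashes(a), range_hashes(b))) if ha != hb]
print("ranges to sync:", diff)  # only these ranges get exchanged
```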
Cassandra Consistency Levels
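Cassandra exposes this per query. Here's a sketch with the DataStax Python driver (the keyspace, table, and values are hypothetical):

```python
# Per-query tunable consistency with the DataStax Python driver.
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

session = Cluster(["127.0.0.1"]).connect("shop")

# QUORUM write: with replication factor N, a QUORUM write plus a QUORUM
# read satisfies W + R > N, so the read set overlaps the write set.
write = SimpleStatement(
    "UPDATE products SET stock = 41 WHERE id = 1001",
    consistency_level=ConsistencyLevel.QUORUM)
session.execute(write)

# ONE read: fastest, but may return stale data.
read = SimpleStatement(
    "SELECT stock FROM products WHERE id = 1001",
    consistency_level=ConsistencyLevel.ONE)
print(session.execute(read).one())
```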
Dynamo-style systems: Amazon DynamoDB, Apache Cassandra, Riak, and Voldemort all follow this pattern. Each tunes the W/R/N knobs differently, and each adds its own conflict resolution (DynamoDB uses LWW by default, Riak supports CRDTs, Cassandra uses LWW with tombstones).
5 — Consistency Models
A consistency model defines the contract between the data store and the application: what are you guaranteed to see when you read? Models are ordered from strongest to weakest:
🔴 Strong Consistency (Linearizability)
Once a write completes, every subsequent read (from any node) sees the new value. The system behaves as if there's a single copy. Example: You update your profile picture. Any user on any continent who loads your profile after the update sees the new picture. Cost: High latency (must coordinate across replicas), lower availability during partitions. Systems: ZooKeeper, etcd, Spanner (TrueTime).
🟠 Sequential Consistency
All nodes see operations in the same order, but that order may not match real-time wall-clock order. Example: User A posts a message, User B posts after. All users see both messages in the same order, but B's message might appear before A's if that's the order the system chose. Systems: Some configurations of ZooKeeper (ordered reads from a single server).
🟡 Causal Consistency
If operation A causally precedes B (A happened-before B), then every node sees A before B. Concurrent operations (no causal relation) may be seen in different orders by different nodes. Example: Alice posts "What's for lunch?", Bob replies "Pizza." Every node sees Alice's post before Bob's reply. But Charlie's unrelated post can appear anywhere. Systems: MongoDB (causal consistency sessions), COPS.
🟢 Eventual Consistency
If no new writes arrive, all replicas eventually converge to the same value. No guarantees about what you see in the meantime. Example: You update your bio. For the next few seconds, some users see the old bio, others see the new one. Eventually all see the new one. Systems: DynamoDB (default), Cassandra (ONE), DNS, CDN caches.
Session Guarantees (Practical Consistency)
Pure eventual consistency is often too weak. Most systems layer session guarantees on top:
| Guarantee | Promise | Example | Implementation |
|---|---|---|---|
| Read-Your-Writes | After you write a value, your subsequent reads see that value (or a newer one). | After posting a comment, you see your comment immediately (even if other users see it later). | Route the user's reads to the same replica that accepted the write, or track a logical timestamp. |
| Monotonic Reads | Once you read a value at time T, you never see an older value in subsequent reads. | Refreshing a page never shows older data than what you saw before. | Pin the user to a single replica, or track the latest read timestamp per user. |
| Consistent Prefix Reads | If a sequence of writes occurs in order (A, B, C), you never see them out of order (e.g., B, A, C). | In a chat, you never see a reply before the message it's replying to. | Ensure causally related writes are routed to the same partition, or use causal ordering. |
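The first two guarantees fall out of tracking positions per session. Here's a toy read-your-writes sketch, with in-memory stand-ins for the leader and a lagging follower (all names are illustrative):

```python
# Read-your-writes sketch: remember the log position (LSN) of this session's
# last write, and only read from replicas that have replayed past it.

class Leader:
    def __init__(self):
        self.log = []  # append-only (key, value) log; LSN = log length

    def write(self, key, value):
        self.log.append((key, value))
        return len(self.log)  # the LSN the client should remember

class Follower:
    def __init__(self, leader, replayed_lsn):
        self.leader, self.lsn = leader, replayed_lsn  # replayed this far

    def read(self, key):
        state = dict(self.leader.log[: self.lsn])  # replay what we've received
        return state.get(key)

def session_read(key, last_write_lsn, leader, followers):
    for f in followers:
        if f.lsn >= last_write_lsn:  # this replica has seen our write
            return f.read(key)
    # No replica is fresh enough: fall back to the leader.
    return Follower(leader, len(leader.log)).read(key)

leader = Leader()
lsn = leader.write("bio", "new bio")
lagging = Follower(leader, replayed_lsn=0)
print(session_read("bio", lsn, leader, [lagging]))  # 'new bio', via the leader
```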
6 — Quorum Consensus
Quorums are the mathematical backbone of leaderless replication. With N replicas, a write quorum W, and a read quorum R, the key invariant is:

W + R > N

This guarantees that the set of nodes responding to a read overlaps with the set of nodes that acknowledged the write: at least one node in the read set has the latest data. With N = 5 and W = R = 3, for example, any 3 nodes you read from must include at least one of the 3 that acknowledged the write (3 + 3 = 6 > 5).
Tunable Consistency
| Configuration (N=5) | W | R | W+R | Behavior |
|---|---|---|---|---|
| Fast writes | 1 | 5 | 6 > 5 ✓ | Writes are instant (ack from 1 node). Reads query all 5 nodes. Best for write-heavy, read-rare workloads. |
| Fast reads | 5 | 1 | 6 > 5 ✓ | Writes block until all 5 nodes ack. Reads are instant from any 1 node. Best for read-heavy, write-rare workloads. |
| Balanced | 3 | 3 | 6 > 5 ✓ | Both writes and reads contact a majority. Good all-around choice. Most common in practice (W = R = ⌊N/2⌋+1). |
| Eventual (weak) | 1 | 1 | 2 ≤ 5 ✗ | No overlap guaranteed. Fast, but you may read stale data. Used when stale reads are acceptable. |
Sloppy Quorum
A strict quorum requires responses from the designated "home" nodes for a key. A sloppy quorum relaxes this: if a home node is unreachable, the request is handled by a node outside the N designated replicas. This increases write availability but weakens the consistency guarantee: even when W + R > N, a read can miss the latest write, because that write may have landed on stand-in nodes the read never contacts.
When the original node recovers, the temporary stand-in sends the data back via hinted handoff. Riak calls this "sloppy quorum" explicitly and makes it configurable. DynamoDB uses it transparently.
7 — Conflict Resolution
Whenever two or more nodes accept writes concurrently (multi-leader or leaderless), conflicts can arise. Here's a deep dive into each resolution strategy:
Last-Write-Wins (LWW)
✅ Pros
- Simple to implement
- Deterministic — all replicas converge
- No application code needed
❌ Cons
- Silently drops concurrent writes
- Clock skew can pick the "wrong" winner
- Not suitable for data that must not be lost
Use case: Session stores, caches, sensor data where only the latest reading matters. Cassandra uses LWW as its default conflict resolution.
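A minimal LWW register sketch in Python, using a (timestamp, node id) pair as the winner (illustrative only; real systems such as Cassandra attach timestamps per column):

```python
import time

class LWWRegister:
    """Toy last-write-wins register: the highest (timestamp, node_id)
    pair wins on merge; the losing write is silently dropped."""

    def __init__(self, node_id: str):
        self.node_id = node_id
        self.value = None
        self.stamp = (0.0, node_id)  # (wall-clock time, node id as tiebreaker)

    def set(self, value):
        self.stamp = (time.time(), self.node_id)
        self.value = value

    def merge(self, other: "LWWRegister"):
        if other.stamp > self.stamp:  # clock skew can make the "wrong" write win
            self.value, self.stamp = other.value, other.stamp

r1, r2 = LWWRegister("n1"), LWWRegister("n2")
r1.set("alice@old.example"); r2.set("alice@new.example")  # concurrent writes
r1.merge(r2); r2.merge(r1)
assert r1.value == r2.value  # converged, but one write was lost
```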
Vector Clocks
A vector clock is a list of (node, counter) pairs. Each node increments its own counter on each write. When two vector clocks are incomparable (neither dominates the other), the writes are concurrent and must be resolved.
✅ Pros
- Detects true conflicts (no false positives)
- Tracks causal relationships
- No data loss (conflicts surfaced to app)
❌ Cons
- Clock size grows with number of nodes
- Application must handle conflict resolution
- Pruning old entries can miss conflicts
Use case: Amazon's original Dynamo paper (2007), Riak (dotted version vectors, an optimization). Suitable when data loss is unacceptable and conflicts must be surfaced.
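The comparison rule is short enough to show directly. A sketch, representing a clock as a dict of node to counter (illustrative, not Riak's actual encoding):

```python
# Vector clock comparison: a clock is a dict mapping node id -> counter.

def dominates(a: dict, b: dict) -> bool:
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a: dict, b: dict) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a happened after b"
    if dominates(b, a):
        return "b happened after a"
    return "concurrent"  # neither dominates: a true conflict to surface to the app

print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # a happened after b
print(compare({"n1": 2},          {"n2": 1}))           # concurrent
```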
Application-Level Merge
The application provides a merge function that is called when conflicts are detected. This is the most flexible but most complex approach.
Use case: Shopping carts (union of items), collaborative documents (operational transforms), any domain where a meaningful merge exists.
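As a sketch, a cart merge handler might union the items and keep the larger quantity seen for each (one reasonable domain policy, not the only one):

```python
# Application-level merge for a shopping cart: union of items,
# max quantity per item. Called when two replicas' carts conflict.

def merge_carts(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    merged = dict(a)
    for item, qty in b.items():
        merged[item] = max(merged.get(item, 0), qty)
    return merged

replica_1 = {"book": 1, "pen": 2}
replica_2 = {"book": 1, "mug": 1}
print(merge_carts(replica_1, replica_2))  # {'book': 1, 'pen': 2, 'mug': 1}
```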
CRDTs (Revisited)
CRDTs eliminate the need for conflict resolution entirely by designing data structures where all operations commute. They are ideal for counters, sets, registers, and graphs in eventually consistent systems.
✅ Pros
- Zero coordination needed
- Mathematically guaranteed convergence
- Works offline and across partitions
❌ Cons
- Only certain data structures supported
- Metadata overhead (tombstones, version info)
- Complex to implement correctly
Use case: Riak data types, Redis CRDTs (Enterprise), Automerge for collaborative editing, Figma's multiplayer editing engine.
8 — Replication Lag
Replication lag is the delay between a write being accepted by the leader (or quorum) and that write being visible on a follower (or all replicas). In asynchronous replication, lag is inevitable and variable.
Measuring Lag
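In PostgreSQL, lag can be read in time units on the replica or in bytes on the primary. A sketch with psycopg2 (connection strings are placeholders):

```python
# Measuring PostgreSQL replication lag via psycopg2.
import psycopg2

# On a standby: time since the last replayed transaction committed.
# (Reads high when the primary is idle, since nothing new is being replayed.)
replica = psycopg2.connect("host=replica.example.com dbname=app user=monitor")
cur = replica.cursor()
cur.execute("SELECT now() - pg_last_xact_replay_timestamp() AS replay_lag")
print("time lag:", cur.fetchone()[0])

# On the primary: byte lag per standby from pg_stat_replication.
primary = psycopg2.connect("host=primary.example.com dbname=app user=monitor")
cur = primary.cursor()
cur.execute("""
    SELECT client_addr,
           pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
    FROM pg_stat_replication
""")
for addr, lag_bytes in cur.fetchall():
    print(addr, lag_bytes)
```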
Impact of Lag
| Problem | Scenario | Mitigation |
|---|---|---|
| Stale Reads | User writes a comment, then reads from a lagging follower and doesn't see it. | Read-your-writes: route the author's reads to the leader (or a recent replica) for a time window after writing. |
| Non-monotonic Reads | User refreshes twice. First request hits an up-to-date follower (sees 10 comments). Second hits a lagging follower (sees 8 comments). Feels like data disappeared. | Monotonic reads: pin the user to a single replica using consistent hashing on the user ID. |
| Cross-DC Lag | A write in US-East takes 150ms+ to reach EU-West replica. During that window, EU users see stale data. | LOCAL_QUORUM writes + LOCAL_QUORUM reads within each DC. Accept that cross-DC reads may be eventually consistent. |
| Causality Violation | Alice writes A, Bob reads A and writes B (referencing A). A third user sees B but not A — the reply without the original message. | Causal consistency via vector clocks or logical timestamps. Ensure causally related writes are visible in order. |
Putting It All Together
Here's a decision guide for choosing a replication strategy:
| Requirement | Replication Model | Consistency | Example System |
|---|---|---|---|
| Simple OLTP, moderate scale | Single-leader (sync standby) | Strong (reads from leader) | PostgreSQL + Patroni |
| Multi-region writes, offline support | Multi-leader | Eventual + conflict resolution | CouchDB, Galera |
| Massive scale, tunable per-query | Leaderless (quorum) | Tunable (QUORUM → strong, ONE → eventual) | Cassandra, DynamoDB |
| Global strong consistency | Single-leader with Paxos/Raft | Linearizable | Google Spanner, CockroachDB |
| Collaborative real-time editing | Multi-leader + CRDTs | Causal + convergence | Figma, Automerge |
To choose, work through the following:
- Identify read:write ratio — high reads → single-leader with read replicas; balanced → leaderless quorum.
- Determine consistency requirements — banking → linearizable; social feed → eventual + read-your-writes.
- Consider geographic distribution — single DC → single-leader; multi-DC → multi-leader or leaderless with LOCAL_QUORUM.
- Plan for failure — how does failover work? What happens during a partition? What's the blast radius?
- Quantify lag tolerance — can the user wait 100ms for consistency? Or must reads always be fresh?