CAP Theorem & PACELC
If you have ever designed or evaluated a distributed data store, you have encountered a fundamental tension: you cannot have everything. The CAP theorem formalizes that tension, and the PACELC extension adds the nuance that real-world engineers need. This post will give you the rigorous background, the intuition, and the interview-ready vocabulary to reason about consistency and availability trade-offs in any distributed system.
The CAP Theorem
In the year 2000, Eric Brewer presented a conjecture at the ACM Symposium on Principles of Distributed Computing (PODC): it is impossible for a distributed data store to simultaneously provide more than two out of three guarantees — Consistency, Availability, and Partition Tolerance. Two years later, Seth Gilbert and Nancy Lynch of MIT published a formal proof, elevating the conjecture to a theorem.
The Three Guarantees
Consistency (C) — Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability (not to be confused with the "C" in ACID, which refers to database integrity constraints). Formally: for any execution, there exists a total order of operations such that each operation appears to take effect atomically at a single point in time between its invocation and response, and every read of a key returns the value written by the most recent preceding write to that key.
Availability (A) — Every request received by a non-failing node in the system must result in a response. There is no timeout or error allowed — the node must eventually return a meaningful result. Note the subtlety: availability is defined per non-failing node, not per client. A client whose network cable is cut is not relevant; a healthy server that refuses to answer violates availability.
Partition Tolerance (P) — The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes. A network partition divides the cluster into two (or more) groups of nodes that cannot communicate with each other.
Proof Intuition
Consider the simplest possible distributed system: two nodes, N1 and N2, each holding a copy of variable X (initially X = v0). A network partition separates them — no messages can get through.
- A client sends a write to N1: `X = v1`.
- N1 cannot replicate the write to N2 (the partition blocks it).
- Another client sends a read to N2.

N2 must now choose:

- Return `v0` (the stale value) → violates Consistency (the most recent write was `v1`).
- Wait / return an error → violates Availability (N2 is healthy but not responding).
There is no third option. This is the crux of the Gilbert-Lynch proof. It holds regardless of the number of nodes, the replication protocol, or the consensus algorithm. The fundamental impossibility stems from the speed of light: information cannot propagate instantaneously, and partitions can prevent it from propagating at all.
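The dilemma can be sketched as a toy simulation. This is illustrative only (the `Node` class and the `"CP"`/`"AP"` mode flag are inventions for this post, not any real system's API): a partitioned node either refuses the read or risks serving a stale value.

```python
from typing import Optional

class Node:
    """A toy replica that may be cut off from its peer by a partition."""
    def __init__(self, value: str):
        self.value = value
        self.partitioned = False

    def read(self, mode: str) -> Optional[str]:
        # If this node might be stale and cannot check with its peer,
        # it must pick a side of the CAP trade-off.
        if self.partitioned:
            if mode == "CP":
                return None      # refuse: sacrifice availability
            return self.value    # answer: risk returning a stale value
        return self.value

# N1 accepts X = v1; the partition prevents replication to N2.
n1, n2 = Node("v1"), Node("v0")
n2.partitioned = True

print(n2.read("CP"))  # None -> consistent but unavailable
print(n2.read("AP"))  # v0   -> available but stale
```

Either branch of the `if` violates one guarantee; there is no code path that satisfies both.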
Formal Statement
Let us state the theorem more precisely. Consider an asynchronous network model (no bound on message delivery time). It is impossible to implement a read/write data object that guarantees all three of the following properties:
- Atomicity (Linearizability) — equivalent to there being a single copy of the data.
- Availability — every request to a non-failing node eventually receives a response.
- Partition tolerance — the network may lose arbitrarily many messages between nodes.
In the partially synchronous model (bounded message delay), you can achieve all three when there is no partition, but you must sacrifice either C or A during a partition.
CP Systems
CP systems choose Consistency over Availability during a network partition. When a partition occurs, nodes that cannot confirm they have the latest data will refuse to serve requests rather than return stale results.
Behavior During a Partition
Imagine a 5-node cluster using Raft consensus. The leader is in partition A (with 2 followers), and 2 followers are isolated in partition B:
- Partition A (leader + 2 followers = 3/5 majority): Can still form a quorum. Writes succeed. Reads return the latest committed data. The system is consistent and available within this partition.
- Partition B (2 followers, no leader): Cannot form a quorum. Cannot elect a new leader. All reads and writes are rejected. The system sacrifices availability for these nodes to maintain consistency.
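The majority arithmetic behind this behavior fits in one function (a sketch of the quorum rule itself, not of Raft's actual implementation):

```python
def has_quorum(cluster_size: int, reachable: int) -> bool:
    """A partition side can make progress only with a strict majority."""
    return reachable > cluster_size // 2

# 5-node cluster split 3 / 2 by a partition:
print(has_quorum(5, 3))  # True  -> partition A keeps serving
print(has_quorum(5, 2))  # False -> partition B rejects reads and writes
```

Note the strict inequality: in a cluster with an even node count, an exact half is not a quorum, which is why two split halves can never both elect a leader.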
Examples of CP Systems
| System | Consistency Mechanism | Partition Behavior |
|---|---|---|
| ZooKeeper | ZAB protocol (majority quorum) | Minority partition becomes read-only, then unavailable |
| etcd | Raft consensus | Minority partition cannot serve reads or writes |
| HBase | Single RegionServer per region + ZK | Region unavailable if its server is partitioned |
| MongoDB | Majority read/write concern | Minority partition rejects writes; reads depend on config |
| Redis (Sentinel) | Single master per shard | Old master in minority stops accepting writes after timeout |
| Google Spanner | Paxos + TrueTime | Minority partition unavailable; majority continues |
Use Cases for CP
- Financial transactions: A bank transfer cannot afford to show inconsistent balances. Better to return an error than to double-spend money.
- Inventory management: Selling the last item in stock to two customers simultaneously is worse than temporarily showing "unavailable."
- Leader election / distributed locks: ZooKeeper and etcd power leader election for Kafka, HDFS, and Kubernetes. A wrong leader is catastrophic; brief unavailability is acceptable.
- Configuration management: When deploying feature flags or routing rules, every node must have the same view.
AP Systems
AP systems choose Availability over Consistency during a network partition. Every node continues to accept reads and writes, even if it cannot communicate with other nodes. The trade-off: clients may see stale or divergent data.
Behavior During a Partition
Consider a 3-node Cassandra cluster with replication factor 3 and `CONSISTENCY ONE`:
- A partition isolates Node C from Nodes A and B.
- A client writes `user.email = "new@example.com"` to Node A. Nodes A and B receive the write. Node C still has the old email.
- Another client reads from Node C and gets the old email — stale data. A read against Node A or B returns the new one.
- Both reads return a response (available), but they are inconsistent.
- When the partition heals, Cassandra's anti-entropy repair, read repair, and hinted handoff mechanisms reconcile the data using last-write-wins (LWW) timestamps.
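The LWW rule that reconciliation applies is simple to state in code. This is a sketch of the rule, not Cassandra's actual implementation, and it makes the pitfall visible: the write with the lower timestamp is silently discarded, so clock skew can lose data.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp_us: int  # writer-assigned wall-clock timestamp (microseconds)

def lww_merge(a: "Versioned", b: "Versioned") -> "Versioned":
    """Keep whichever write carries the higher timestamp.
    The 'losing' write is dropped, even if it was acknowledged."""
    return a if a.timestamp_us >= b.timestamp_us else b

node_a = Versioned("new@example.com", 1_700_000_002_000_000)
node_c = Versioned("old@example.com", 1_700_000_001_000_000)
print(lww_merge(node_a, node_c).value)  # new@example.com
```

If Node C's clock had been skewed ahead of Node A's, the stale email would have won the merge; this is exactly the clock-skew hazard raised in the interview questions below.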
Examples of AP Systems
| System | Conflict Resolution | Consistency Model |
|---|---|---|
| Cassandra | Last-write-wins (LWW) timestamps | Eventual; tunable per query |
| DynamoDB | LWW or application-level | Eventually consistent (default) or strong reads |
| CouchDB | MVCC + revision tree, manual conflict resolution | Eventual consistency |
| DNS | TTL-based expiry | Eventual (propagation delay) |
| Riak | Vector clocks / CRDTs | Eventual consistency |
| Amazon S3 | Internal reconciliation | Historically read-after-write for new PUTs, eventual for overwrites and DELETEs; strongly consistent since December 2020 |
Use Cases for AP
- Social media feeds: Showing a slightly stale feed is far better than showing an error page. Users tolerate seeing a post 2 seconds late.
- Product catalogs: An e-commerce site showing yesterday's price for 10 seconds is better than the page being down.
- Analytics / metrics: Losing a few data points or seeing slightly delayed counts is acceptable. Downtime is not.
- Shopping carts: Amazon's original Dynamo paper famously argued that a customer should always be able to add items to their cart, even during partitions.
- DNS: The entire internet's name resolution is an AP system. Stale DNS records are common and tolerated.
Why Not CA?
The CA combination — Consistency + Availability without Partition Tolerance — is not achievable in a real distributed system. Here is why:
Network Partitions Are Inevitable
In any system that spans more than one machine connected by a network, partitions will happen. This is not a theoretical possibility — it is an operational certainty:
- Hardware failures: Switches fail, NICs die, cables get accidentally unplugged.
- Software bugs: A firewall rule change, a misconfigured router, or a kernel network stack bug can isolate nodes.
- Cloud outages: Even within a single cloud region, availability zone (AZ) partitions occur. AWS, GCP, and Azure have all experienced them.
- GC pauses: A long garbage collection pause can look like a partition to other nodes (the paused node is "unreachable").
The Single-Node "Loophole"
A traditional single-node RDBMS (PostgreSQL on one server) appears to be CA: it provides strong consistency and is always available (as long as the single node is up). But this is a trivial case — there is no network between nodes, so the question of partition tolerance does not arise. The moment you replicate that PostgreSQL database to a second server, you re-enter CAP territory and must choose between CP and AP.
Common Misunderstanding
Many people misinterpret CAP as "pick any 2 of 3," as if all three combinations are equally valid. This is misleading. The correct mental model is:
- Accept that P is mandatory (partitions happen).
- When a partition occurs, choose between C and A.
- When there is no partition, you can (and should) have both C and A.
This nuance is exactly what the PACELC theorem captures.
The PACELC Extension
The PACELC theorem, introduced by Daniel Abadi in 2012, extends CAP by acknowledging that even when there is no partition, there is a trade-off between Latency and Consistency.
The Statement
If there is a Partition, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency.
This gives four possible combinations:
| Partition → A or C | Else → L or C | Notation | Examples |
|---|---|---|---|
| Availability (PA) | Latency (EL) | PA/EL | Cassandra, DynamoDB, Riak, CouchDB |
| Availability (PA) | Consistency (EC) | PA/EC | Rare — would mean "sloppy during partition, strict otherwise" |
| Consistency (PC) | Latency (EL) | PC/EL | Yahoo! PNUTS (historical), Cosmos DB (session consistency) |
| Consistency (PC) | Consistency (EC) | PC/EC | HBase, MongoDB (majority), Google Spanner, ZooKeeper, etcd |
Why Latency Matters
Even without a partition, enforcing strong consistency requires coordination:
- Synchronous replication: A write is not acknowledged until all (or a quorum of) replicas confirm it. This adds round-trip latency proportional to the distance between nodes.
- Consensus protocols: Raft/Paxos require at least one round trip between the leader and a majority of followers for every write.
- Google Spanner: Uses TrueTime (GPS + atomic clocks) to achieve external consistency, but write latency includes the "commit wait" (typically 7–10 ms) to account for clock uncertainty.
A PA/EL system like Cassandra with `CONSISTENCY ONE` acknowledges writes as soon as one local node persists them: extremely fast, but at the cost of consistency. Such a system can achieve single-digit-millisecond writes even across data centers.
Practical Implications
Most modern databases let you tune the trade-off per operation:
- Cassandra: `CONSISTENCY ONE` vs `CONSISTENCY QUORUM` vs `CONSISTENCY ALL` — you choose the PA/EL vs PC/EC spectrum per query.
- DynamoDB: Eventually consistent reads (default, PA/EL) vs strongly consistent reads (more latency, PC/EC behavior).
- MongoDB: `readConcern: "majority"` + `writeConcern: "majority"` = PC/EC. Lower concerns = PA/EL.
- Cosmos DB: Five consistency levels from Strong to Eventual, mapping to different PACELC points.
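These knobs all reduce to the same quorum arithmetic: with N replicas, a read of R replicas and a write of W replicas must overlap in at least one replica, which happens exactly when R + W > N. A minimal sketch of that rule:

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Read and write quorums intersect iff R + W > N, so every read
    contacts at least one replica holding the latest committed write."""
    return r + w > n

# N = 3 replicas (e.g. replication factor 3):
print(quorums_overlap(3, 1, 1))  # False: ONE/ONE       -> fast, eventual (PA/EL)
print(quorums_overlap(3, 2, 2))  # True:  QUORUM/QUORUM -> strong reads (PC/EC-like)
print(quorums_overlap(3, 1, 3))  # True:  ONE/ALL       -> strong, but any replica down blocks writes
```

Raising R or W buys consistency and pays for it in latency and availability, which is the PACELC trade-off expressed per operation.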
Consistency Models Deep Dive
CAP defines consistency as linearizability, but real systems implement a spectrum of consistency models. Understanding this spectrum is crucial for system design interviews and for making informed database choices.
Strong Consistency (Linearizability)
The gold standard. Every operation appears to take effect atomically at a single point in time. Once a write completes, all subsequent reads (from any node) return that value. Equivalent to having a single copy of the data.
Systems: Google Spanner, etcd, ZooKeeper, CockroachDB, single-node RDBMS.
Cost: High latency (coordination required). Low throughput under contention.
Sequential Consistency
All operations appear to execute in some sequential order that is consistent with the program order of each individual client. However, this order need not correspond to real-time ordering. Two clients may disagree about the order of operations from different clients, as long as each client's own operations appear in order.
Difference from linearizability: Linearizability requires the total order to respect real-time ordering. Sequential consistency only requires it to respect per-client ordering.
Systems: ZooKeeper reads, which are sequentially consistent by default rather than linearizable (the sync command upgrades a read to linearizable).
Causal Consistency
Operations that are causally related must be seen by all nodes in the same order. Concurrent operations (with no causal relationship) may be seen in different orders by different nodes.
Causality: if operation A could have influenced operation B (e.g., A wrote a value that B read before writing), then A must be ordered before B everywhere.
Systems: MongoDB (with causal consistency sessions), some CRDTs, COPS (research system).
Benefit: Much lower latency than linearizability while preserving intuitive ordering.
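Causal order is commonly tracked with vector clocks: each replica keeps one counter per node, and event A happened-before event B iff A's vector is ≤ B's in every component and strictly less in at least one. A sketch of that comparison (the dict-based clock representation is an illustration, not a specific system's wire format):

```python
def happened_before(a: dict, b: dict) -> bool:
    """True iff vector clock a causally precedes vector clock b."""
    nodes = set(a) | set(b)
    leq = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    strictly_less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return leq and strictly_less

write = {"A": 1}           # node A writes a value
reply = {"A": 1, "B": 1}   # node B read that value, then wrote a reply
other = {"C": 1}           # unrelated write on node C

print(happened_before(write, reply))  # True: must be ordered everywhere
print(happened_before(write, other))  # False } concurrent: nodes may
print(happened_before(other, write))  # False } apply them in any order
```

When neither direction holds, the events are concurrent, and a causally consistent store is free to expose them in different orders on different nodes.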
Eventual Consistency
If no new updates are made to a data item, eventually all accesses will return the last updated value. There is no bound on how long "eventually" takes, and in the interim, different nodes may return different values.
Systems: Cassandra (default), DynamoDB (default reads), DNS, Amazon S3 (historically, for overwrites and deletes).
Variants:
- Strong eventual consistency (SEC): Nodes that have received the same set of updates will have the same state, regardless of the order of updates. Achieved using CRDTs (Conflict-free Replicated Data Types) or operational transforms.
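SEC is easiest to see with the simplest CRDT, a grow-only counter: each node increments only its own slot, and merging takes the element-wise maximum, so the merge is commutative, associative, and idempotent, and replicas that have seen the same updates converge regardless of delivery order. A minimal sketch:

```python
def g_counter_merge(a: dict, b: dict) -> dict:
    """G-Counter merge: element-wise max over per-node increment counts."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def g_counter_value(c: dict) -> int:
    """The counter's value is the sum over all nodes' slots."""
    return sum(c.values())

# Two replicas diverge during a partition...
replica1 = {"A": 3, "B": 1}   # A incremented 3 times; saw 1 increment from B
replica2 = {"B": 2, "C": 5}   # B incremented again; C incremented 5 times

# ...and converge on merge, in either direction (no conflicts possible):
m1 = g_counter_merge(replica1, replica2)
m2 = g_counter_merge(replica2, replica1)
print(m1 == m2, g_counter_value(m1))  # True 10
```

Because the merge cannot conflict, no coordination or conflict-resolution policy is needed at all, which is what distinguishes SEC from plain eventual consistency.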
Read-Your-Writes Consistency
A client will always see its own writes. After a client writes a value, subsequent reads by that same client will return the written value (or a newer one). Other clients may still see stale data.
Systems: DynamoDB (with consistent reads), most web applications (session stickiness).
Implementation: Often achieved by routing reads to the same node that accepted the write, or by including a write timestamp in the read request.
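The timestamp-token approach can be sketched as follows (the `Replica` class and version numbers are illustrative, not any database's API): the client remembers the version of its last write, and a replica may answer only if it has caught up to that version.

```python
class Replica:
    """A replica that applies versioned writes, possibly with lag."""
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, value, version):
        if version > self.version:
            self.value, self.version = value, version

def read_your_writes(replica: Replica, min_version: int):
    """Serve the read only if this replica is at least as fresh as the
    client's last write; otherwise signal the caller to retry elsewhere."""
    if replica.version >= min_version:
        return replica.value
    return None

fresh, stale = Replica(), Replica()
fresh.apply("new-photo.jpg", version=7)   # client's own write was version 7
stale.apply("old-photo.jpg", version=6)   # this replica hasn't caught up

print(read_your_writes(fresh, min_version=7))  # new-photo.jpg
print(read_your_writes(stale, min_version=7))  # None -> retry on a fresher replica
```

Other clients carrying no token (min_version = 0) can still read from the lagging replica, which is exactly the guarantee's scope: only your own writes are protected.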
Monotonic Reads
If a client reads a value for a key, any subsequent read of the same key by that client will return the same value or a newer one. Time never "goes backward" for a single client.
Without monotonic reads: Client reads X=5 from Node A, then reads X=3 from Node B (which hasn't received the update yet). Confusing!
With monotonic reads: The system ensures the second read returns at least X=5.
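One way to enforce this is a client-side high-water mark: record the highest version observed and reject any answer below it. A sketch under those assumptions (the class and version numbers are illustrative):

```python
class MonotonicClient:
    """Client-side guard: never accept a read older than one already seen."""
    def __init__(self):
        self.high_water = 0   # highest version observed so far

    def accept(self, value, version: int):
        if version < self.high_water:
            return None        # time would go backward: reject / retry
        self.high_water = version
        return value

c = MonotonicClient()
print(c.accept(5, version=10))  # 5    (read from up-to-date Node A)
print(c.accept(3, version=8))   # None (Node B lags: stale read suppressed)
```

The guard is purely local to one client, which is why monotonic reads are so much cheaper than linearizability: no cross-node coordination is required.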
The Consistency Spectrum
From strongest to weakest:
Linearizability
↓
Sequential Consistency
↓
Causal Consistency
↓
Read-Your-Writes + Monotonic Reads
↓
Eventual Consistency
Each step down relaxes guarantees but improves latency, throughput, and availability. The art of system design is choosing the weakest model that your application can tolerate.
Real-World CAP Classification
The following table classifies popular databases by their CAP and PACELC behavior. Note that many databases are tunable — they can behave as CP or AP depending on configuration.
| Database | CAP | PACELC | Default Behavior | Notes |
|---|---|---|---|---|
| PostgreSQL (single node) | CA* | N/A (single node) | Strong consistency, always available | Not distributed; CA is trivial |
| PostgreSQL (streaming replication) | CP | PC/EC | Sync replication: consistent but follower may be unavailable | Async replication behaves more like AP |
| MySQL (Group Replication) | CP | PC/EC | Majority-based consensus; minority partitions become read-only | Single-primary mode is CP; multi-primary can have conflicts |
| MongoDB | CP | PC/EC (with majority concern) | Majority write concern ensures durability; minority unavailable | Tunable: lower write concern trades consistency for availability |
| Cassandra | AP | PA/EL | Always writable; LWW conflict resolution; eventual consistency | With QUORUM reads/writes, behaves more like CP |
| DynamoDB | AP | PA/EL | Eventually consistent reads by default; always available | Strongly consistent reads available (higher latency) |
| Redis (Cluster) | CP | PC/EL | Single master per slot; async replication can lose recent writes | During partition: minority partition stops serving the slot |
| CouchDB | AP | PA/EL | Multi-master; accepts writes anywhere; conflict detection on sync | Manual conflict resolution via revision trees |
| HBase | CP | PC/EC | One RegionServer per region; strong consistency within regions | Region unavailable during RegionServer failure until reassignment |
| etcd | CP | PC/EC | Raft consensus; linearizable reads and writes | Minority partition completely unavailable |
| Consul | CP | PC/EC | Raft consensus for KV store; strong consistency by default | Stale reads available for AP-like behavior |
| ZooKeeper | CP | PC/EC | ZAB protocol; writes are linearizable; reads are sequentially consistent | Sync command available for linearizable reads |
| Google Spanner | CP | PC/EC | External consistency via TrueTime; Paxos-based replication | Highest consistency guarantee of any globally distributed DB |
| CockroachDB | CP | PC/EC | Serializable isolation; Raft-based replication | Spanner-inspired; uses hybrid logical clocks |
CAP in System Design Interviews
When to Bring Up CAP
CAP is most relevant when you are:
- Choosing a database or data store for a distributed system.
- Designing replication strategy (sync vs async).
- Discussing trade-offs in a system that spans multiple data centers or regions.
- Explaining why you chose eventual consistency vs strong consistency.
Framework for Discussing CAP
- Identify the consistency requirement: "Does this feature need strong consistency, or can we tolerate stale data?"
- Assess the availability requirement: "What happens if this service returns an error for 30 seconds? Is that acceptable?"
- Choose the trade-off: "For this use case, I'll prefer availability and accept eventual consistency because..."
- Discuss PACELC: "Even without partitions, I need low latency for reads, so I'll choose PA/EL — Cassandra with CONSISTENCY ONE."
- Mention mitigation strategies: How you handle the downside of your choice (conflict resolution, retries, compensating transactions).
Example: Design a Chat System
Consider a WhatsApp-like messaging system. Different components have different CAP requirements:
| Component | CAP Choice | Reasoning |
|---|---|---|
| Message delivery | AP | Users should always be able to send messages. Duplicate delivery is better than message loss. Eventually consistent ordering is acceptable (messages may arrive out of order briefly). |
| Group membership | CP | Adding/removing members must be consistent. Two admins simultaneously removing each other creates paradoxes. Use a CP store (e.g., ZooKeeper or Raft-based service). |
| Read receipts | AP | Showing a read receipt 2 seconds late is fine. Showing "unread" when the message was read is annoying but not catastrophic. |
| User profiles | AP | Profile changes can propagate eventually. Seeing an old profile photo for a few seconds is acceptable. |
| End-to-end encryption keys | CP | Key exchange must be strongly consistent. Using a stale public key means the message is unreadable. |
Common Interview Questions
- "Is MongoDB CP or AP?" — MongoDB is CP by default (with majority read/write concern). But with lower write concerns, it can behave more like AP. The answer should mention tunability.
- "Can we have all three?" — No, but during normal operation (no partition) you can have both C and A. The sacrifice only happens during a partition. Mention PACELC for bonus points.
- "How does Cassandra handle conflicts?" — Last-write-wins using timestamps. This can lose writes if clocks are skewed. Alternatives: vector clocks (Riak), CRDTs, or application-level resolution.
- "What about microservices?" — Each microservice can make its own CAP trade-off. The order service might be CP (don't oversell), while the recommendation service is AP (stale recommendations are fine).
Anti-Patterns to Avoid
- "We'll just use a CA database." — Shows you don't understand that P is mandatory in a distributed system.
- Treating CAP as a one-time choice. — Different parts of the same system can (and should) have different trade-offs.
- Ignoring PACELC. — CAP only describes behavior during partitions. Most of the time, there are no partitions, and the latency-consistency trade-off matters more.
- Binary thinking. — Consistency and availability are not binary. There is a spectrum of consistency models and a spectrum of availability targets (99.9% vs 99.999%).
The CAP theorem is a starting point, not an endpoint. Real-world system design is about understanding the nuances — tunable consistency, partial failures, and the fact that most of the time, your system is operating normally (no partition), where the latency-consistency trade-off from PACELC is what really matters. In the next post, we will explore ACID vs BASE, two philosophies for data consistency that emerge directly from the CAP trade-offs we discussed here.