CAP Theorem & PACELC
If you have ever designed or evaluated a distributed data store, you have encountered a fundamental tension: you cannot have everything. The CAP theorem formalizes that tension, and the PACELC extension adds the nuance that real-world engineers need. This post will give you the rigorous background, the intuition, and the interview-ready vocabulary to reason about consistency and availability trade-offs in any distributed system.
The CAP Theorem
In the year 2000, Eric Brewer presented a conjecture at the ACM Symposium on Principles of Distributed Computing (PODC): it is impossible for a distributed data store to simultaneously provide more than two out of three guarantees — Consistency, Availability, and Partition Tolerance. Two years later, Seth Gilbert and Nancy Lynch of MIT published a formal proof, elevating the conjecture to a theorem.
The Three Guarantees
Consistency (C) — Every read receives the most recent write or an error. All nodes see the same data at the same time. This is linearizability (not to be confused with the "C" in ACID, which refers to database integrity constraints). Formally: for any execution, there exists a total order of operations such that each operation appears to take effect atomically at a single point in time between its invocation and response, and every read of a key returns the value written by the most recent preceding write to that key.
Availability (A) — Every request received by a non-failing node in the system must result in a response. There is no timeout or error allowed — the node must eventually return a meaningful result. Note the subtlety: availability is defined per non-failing node, not per client. A client whose network cable is cut is not relevant; a healthy server that refuses to answer violates availability.
Partition Tolerance (P) — The system continues to operate despite an arbitrary number of messages being dropped or delayed by the network between nodes. A network partition divides the cluster into two (or more) groups of nodes that cannot communicate with each other.
Proof Intuition
Consider the simplest possible distributed system: two nodes, N1 and N2, each holding a copy of variable X (initially X = v0). A network partition separates them — no messages can get through.
- A client sends a write to N1: `X = v1`.
- N1 cannot replicate the write to N2 (the partition blocks it).
- Another client sends a read to N2.

N2 must now choose:

- Return `v0` (the stale value) → violates Consistency (the most recent write was `v1`).
- Wait / return an error → violates Availability (N2 is healthy but not responding).
There is no third option. This is the crux of the Gilbert-Lynch proof. It holds regardless of the number of nodes, the replication protocol, or the consensus algorithm. The fundamental impossibility stems from the speed of light: information cannot propagate instantaneously, and partitions can prevent it from propagating at all.
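The dilemma can be sketched as a toy simulation. This is illustrative only (the `Node` class and the `"CP"`/`"AP"` mode flag are inventions for this post, not any real system's API): a partitioned node either refuses the read or risks serving a stale value.

```python
from typing import Optional

class Node:
    """A toy replica that may be cut off from its peer by a partition."""
    def __init__(self, value: str):
        self.value = value
        self.partitioned = False

    def read(self, mode: str) -> Optional[str]:
        # If this node might be stale and cannot check with its peer,
        # it must pick a side of the CAP trade-off.
        if self.partitioned:
            if mode == "CP":
                return None      # refuse: sacrifice availability
            return self.value    # answer: risk returning a stale value
        return self.value

# N1 accepts X = v1; the partition prevents replication to N2.
n1, n2 = Node("v1"), Node("v0")
n2.partitioned = True

print(n2.read("CP"))  # None -> consistent but unavailable
print(n2.read("AP"))  # v0   -> available but stale
```

Either branch of the `if` violates one guarantee; there is no code path that satisfies both.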
Formal Statement
Let us state the theorem more precisely. Consider an asynchronous network model (no bound on message delivery time). It is impossible to implement a read/write data object that guarantees all three of the following properties:
- Atomicity (Linearizability) — equivalent to there being a single copy of the data.
- Availability — every request to a non-failing node eventually receives a response.
- Partition tolerance — the network may lose arbitrarily many messages between nodes.
In the partially synchronous model (bounded message delay), you can achieve all three when there is no partition, but you must sacrifice either C or A during a partition.
CP Systems
CP systems choose Consistency over Availability during a network partition. When a partition occurs, nodes that cannot confirm they have the latest data will refuse to serve requests rather than return stale results.
Behavior During a Partition
Imagine a 5-node cluster using Raft consensus. The leader is in partition A (with 2 followers), and 2 followers are isolated in partition B:
- Partition A (leader + 2 followers = 3/5 majority): Can still form a quorum. Writes succeed. Reads return the latest committed data. The system is consistent and available within this partition.
- Partition B (2 followers, no leader): Cannot form a quorum. Cannot elect a new leader. All reads and writes are rejected. The system sacrifices availability for these nodes to maintain consistency.
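The majority arithmetic behind this behavior fits in one function (a sketch of the quorum rule itself, not of Raft's actual implementation):

```python
def has_quorum(cluster_size: int, reachable: int) -> bool:
    """A partition side can make progress only with a strict majority."""
    return reachable > cluster_size // 2

# 5-node cluster split 3 / 2 by a partition:
print(has_quorum(5, 3))  # True  -> partition A keeps serving
print(has_quorum(5, 2))  # False -> partition B rejects reads and writes
```

Note the strict inequality: in a cluster with an even node count, an exact half is not a quorum, which is why two split halves can never both elect a leader.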
Examples of CP Systems
| System | Consistency Mechanism | Partition Behavior |
|---|---|---|
| ZooKeeper | ZAB protocol (majority quorum) | Minority partition becomes read-only, then unavailable |
| etcd | Raft consensus | Minority partition cannot serve reads or writes |
| HBase | Single RegionServer per region + ZK | Region unavailable if its server is partitioned |
| MongoDB | Majority read/write concern | Minority partition rejects writes; reads depend on config |
| Redis (Sentinel) | Single master per shard | Old master in minority stops accepting writes after timeout |
| Google Spanner | Paxos + TrueTime | Minority partition unavailable; majority continues |
Use Cases for CP
- Financial transactions: A bank transfer cannot afford to show inconsistent balances. Better to return an error than to double-spend money.
- Inventory management: Selling the last item in stock to two customers simultaneously is worse than temporarily showing "unavailable."
- Leader election / distributed locks: ZooKeeper and etcd power leader election for Kafka, HDFS, and Kubernetes. A wrong leader is catastrophic; brief unavailability is acceptable.
- Configuration management: When deploying feature flags or routing rules, every node must have the same view.
AP Systems
AP systems choose Availability over Consistency during a network partition. Every node continues to accept reads and writes, even if it cannot communicate with other nodes. The trade-off: clients may see stale or divergent data.
Behavior During a Partition
Consider a 3-node Cassandra cluster with replication factor 3 and `CONSISTENCY ONE`:
- A partition isolates Node C from Nodes A and B.
- A client writes `user.email = "new@example.com"` to Node A. Nodes A and B receive the write. Node C still has the old email.
- Another client reads from Node C and gets the old email — stale data. A read against Node A or B returns the new one.
- Both reads return a response (available), but they are inconsistent.
- When the partition heals, Cassandra's anti-entropy repair, read repair, and hinted handoff mechanisms reconcile the data using last-write-wins (LWW) timestamps.
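The LWW rule that reconciliation applies is simple to state in code. This is a sketch of the rule, not Cassandra's actual implementation, and it makes the pitfall visible: the write with the lower timestamp is silently discarded, so clock skew can lose data.

```python
from dataclasses import dataclass

@dataclass
class Versioned:
    value: str
    timestamp_us: int  # writer-assigned wall-clock timestamp (microseconds)

def lww_merge(a: "Versioned", b: "Versioned") -> "Versioned":
    """Keep whichever write carries the higher timestamp.
    The 'losing' write is dropped, even if it was acknowledged."""
    return a if a.timestamp_us >= b.timestamp_us else b

node_a = Versioned("new@example.com", 1_700_000_002_000_000)
node_c = Versioned("old@example.com", 1_700_000_001_000_000)
print(lww_merge(node_a, node_c).value)  # new@example.com
```

If Node C's clock had been skewed ahead of Node A's, the stale email would have won the merge; this is exactly the clock-skew hazard raised in the interview questions below.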
Examples of AP Systems
| System | Conflict Resolution | Consistency Model |
|---|---|---|
| Cassandra | Last-write-wins (LWW) timestamps | Eventual; tunable per query |
| DynamoDB | LWW or application-level | Eventually consistent (default) or strong reads |
| CouchDB | MVCC + revision tree, manual conflict resolution | Eventual consistency |
| DNS | TTL-based expiry | Eventual (propagation delay) |
| Riak | Vector clocks / CRDTs | Eventual consistency |
| Amazon S3 | Internal reconciliation | Historically read-after-write for new PUTs, eventual for overwrites and DELETEs; strongly consistent since December 2020 |
Use Cases for AP
- Social media feeds: Showing a slightly stale feed is far better than showing an error page. Users tolerate seeing a post 2 seconds late.
- Product catalogs: An e-commerce site showing yesterday's price for 10 seconds is better than the page being down.
- Analytics / metrics: Losing a few data points or seeing slightly delayed counts is acceptable. Downtime is not.
- Shopping carts: Amazon's original Dynamo paper famously argued that a customer should always be able to add items to their cart, even during partitions.
- DNS: The entire internet's name resolution is an AP system. Stale DNS records are common and tolerated.
Why Not CA?
The CA combination — Consistency + Availability without Partition Tolerance — is not achievable in a real distributed system. Here is why:
Network Partitions Are Inevitable
In any system that spans more than one machine connected by a network, partitions will happen. This is not a theoretical possibility — it is an operational certainty:
- Hardware failures: Switches fail, NICs die, cables get accidentally unplugged.
- Software bugs: A firewall rule change, a misconfigured router, or a kernel network stack bug can isolate nodes.
- Cloud outages: Even within a single cloud region, availability zone (AZ) partitions occur. AWS, GCP, and Azure have all experienced them.
- GC pauses: A long garbage collection pause can look like a partition to other nodes (the paused node is "unreachable").
The Single-Node "Loophole"
A traditional single-node RDBMS (PostgreSQL on one server) appears to be CA: it provides strong consistency and is always available (as long as the single node is up). But this is a trivial case — there is no network between nodes, so the question of partition tolerance does not arise. The moment you replicate that PostgreSQL database to a second server, you re-enter CAP territory and must choose between CP and AP.
Common Misunderstanding
Many people misinterpret CAP as "pick any 2 of 3," as if all three combinations are equally valid. This is misleading. The correct mental model is:
- Accept that P is mandatory (partitions happen).
- When a partition occurs, choose between C and A.
- When there is no partition, you can (and should) have both C and A.
This nuance is exactly what the PACELC theorem captures.
The PACELC Extension
The PACELC theorem, introduced by Daniel Abadi in 2012, extends CAP by acknowledging that even when there is no partition, there is a trade-off between Latency and Consistency.
The Statement
If there is a Partition, choose between Availability and Consistency. Else (no partition), choose between Latency and Consistency.
This gives four possible combinations:
| Partition → A or C | Else → L or C | Notation | Examples |
|---|---|---|---|
| Availability (PA) | Latency (EL) | PA/EL | Cassandra, DynamoDB, Riak, CouchDB |
| Availability (PA) | Consistency (EC) | PA/EC | Rare — would mean "sloppy during partition, strict otherwise" |
| Consistency (PC) | Latency (EL) | PC/EL | Yahoo! PNUTS (historical), Cosmos DB (session consistency) |
| Consistency (PC) | Consistency (EC) | PC/EC | HBase, MongoDB (majority), Google Spanner, ZooKeeper, etcd |
Why Latency Matters
Even without a partition, enforcing strong consistency requires coordination:
- Synchronous replication: A write is not acknowledged until all (or a quorum of) replicas confirm it. This adds round-trip latency proportional to the distance between nodes.
- Consensus protocols: Raft/Paxos require at least one round trip between the leader and a majority of followers for every write.
- Google Spanner: Uses TrueTime (GPS + atomic clocks) to achieve external consistency, but write latency includes the "commit wait" (typically 7–10 ms) to account for clock uncertainty.
A PA/EL system like Cassandra with `CONSISTENCY ONE` acknowledges writes as soon as one local node persists them: extremely fast, but at the cost of consistency. Such a system can achieve single-digit-millisecond writes even across data centers.
Practical Implications
Most modern databases let you tune the trade-off per operation:
- Cassandra: `CONSISTENCY ONE` vs `CONSISTENCY QUORUM` vs `CONSISTENCY ALL` — you choose the PA/EL vs PC/EC spectrum per query.
- DynamoDB: Eventually consistent reads (default, PA/EL) vs strongly consistent reads (more latency, PC/EC behavior).
- MongoDB: `readConcern: "majority"` + `writeConcern: "majority"` = PC/EC. Lower concerns = PA/EL.
- Cosmos DB: Five consistency levels from Strong to Eventual, mapping to different PACELC points.
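These knobs all reduce to the same quorum arithmetic: with N replicas, a read of R replicas and a write of W replicas must overlap in at least one replica, which happens exactly when R + W > N. A minimal sketch of that rule:

```python
def quorums_overlap(n: int, r: int, w: int) -> bool:
    """Read and write quorums intersect iff R + W > N, so every read
    contacts at least one replica holding the latest committed write."""
    return r + w > n

# N = 3 replicas (e.g. replication factor 3):
print(quorums_overlap(3, 1, 1))  # False: ONE/ONE       -> fast, eventual (PA/EL)
print(quorums_overlap(3, 2, 2))  # True:  QUORUM/QUORUM -> strong reads (PC/EC-like)
print(quorums_overlap(3, 1, 3))  # True:  ONE/ALL       -> strong, but any replica down blocks writes
```

Raising R or W buys consistency and pays for it in latency and availability, which is the PACELC trade-off expressed per operation.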
Consistency Models Deep Dive
CAP defines consistency as linearizability, but real systems implement a spectrum of consistency models. Understanding this spectrum is crucial for system design interviews and for making informed database choices.
Strong Consistency (Linearizability)
The gold standard. Every operation appears to take effect atomically at a single point in time. Once a write completes, all subsequent reads (from any node) return that value. Equivalent to having a single copy of the data.
Systems: Google Spanner, etcd, ZooKeeper, CockroachDB, single-node RDBMS.
Cost: High latency (coordination required). Low throughput under contention.
Sequential Consistency
All operations appear to execute in some sequential order that is consistent with the program order of each individual client. However, this order need not correspond to real-time ordering. Two clients may disagree about the order of operations from different clients, as long as each client's own operations appear in order.
Difference from linearizability: Linearizability requires the total order to respect real-time ordering. Sequential consistency only requires it to respect per-client ordering.
Systems: ZooKeeper reads, which are sequentially consistent by default rather than linearizable (the sync command upgrades a read to linearizable).
Causal Consistency
Operations that are causally related must be seen by all nodes in the same order. Concurrent operations (with no causal relationship) may be seen in different orders by different nodes.
Causality: if operation A could have influenced operation B (e.g., A wrote a value that B read before writing), then A must be ordered before B everywhere.
Systems: MongoDB (with causal consistency sessions), some CRDTs, COPS (research system).
Benefit: Much lower latency than linearizability while preserving intuitive ordering.
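Causal order is commonly tracked with vector clocks: each replica keeps one counter per node, and event A happened-before event B iff A's vector is ≤ B's in every component and strictly less in at least one. A sketch of that comparison (the dict-based clock representation is an illustration, not a specific system's wire format):

```python
def happened_before(a: dict, b: dict) -> bool:
    """True iff vector clock a causally precedes vector clock b."""
    nodes = set(a) | set(b)
    leq = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    strictly_less = any(a.get(n, 0) < b.get(n, 0) for n in nodes)
    return leq and strictly_less

write = {"A": 1}           # node A writes a value
reply = {"A": 1, "B": 1}   # node B read that value, then wrote a reply
other = {"C": 1}           # unrelated write on node C

print(happened_before(write, reply))  # True: must be ordered everywhere
print(happened_before(write, other))  # False } concurrent: nodes may
print(happened_before(other, write))  # False } apply them in any order
```

When neither direction holds, the events are concurrent, and a causally consistent store is free to expose them in different orders on different nodes.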
Eventual Consistency
If no new updates are made to a data item, eventually all accesses will return the last updated value. There is no bound on how long "eventually" takes, and in the interim, different nodes may return different values.
Systems: Cassandra (default), DynamoDB (default reads), DNS, Amazon S3 (historically, for overwrites and deletes).
Variants:
- Strong eventual consistency (SEC): Nodes that have received the same set of updates will have the same state, regardless of the order of updates. Achieved using CRDTs (Conflict-free Replicated Data Types) or operational transforms.
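SEC is easiest to see with the simplest CRDT, a grow-only counter: each node increments only its own slot, and merging takes the element-wise maximum, so the merge is commutative, associative, and idempotent, and replicas that have seen the same updates converge regardless of delivery order. A minimal sketch:

```python
def g_counter_merge(a: dict, b: dict) -> dict:
    """G-Counter merge: element-wise max over per-node increment counts."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def g_counter_value(c: dict) -> int:
    """The counter's value is the sum over all nodes' slots."""
    return sum(c.values())

# Two replicas diverge during a partition...
replica1 = {"A": 3, "B": 1}   # A incremented 3 times; saw 1 increment from B
replica2 = {"B": 2, "C": 5}   # B incremented again; C incremented 5 times

# ...and converge on merge, in either direction (no conflicts possible):
m1 = g_counter_merge(replica1, replica2)
m2 = g_counter_merge(replica2, replica1)
print(m1 == m2, g_counter_value(m1))  # True 10
```

Because the merge cannot conflict, no coordination or conflict-resolution policy is needed at all, which is what distinguishes SEC from plain eventual consistency.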
Read-Your-Writes Consistency
A client will always see its own writes. After a client writes a value, subsequent reads by that same client will return the written value (or a newer one). Other clients may still see stale data.
Systems: DynamoDB (with consistent reads), most web applications (session stickiness).
Implementation: Often achieved by routing reads to the same node that accepted the write, or by including a write timestamp in the read request.
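The timestamp-token approach can be sketched as follows (the `Replica` class and version numbers are illustrative, not any database's API): the client remembers the version of its last write, and a replica may answer only if it has caught up to that version.

```python
class Replica:
    """A replica that applies versioned writes, possibly with lag."""
    def __init__(self):
        self.version = 0
        self.value = None

    def apply(self, value, version):
        if version > self.version:
            self.value, self.version = value, version

def read_your_writes(replica: Replica, min_version: int):
    """Serve the read only if this replica is at least as fresh as the
    client's last write; otherwise signal the caller to retry elsewhere."""
    if replica.version >= min_version:
        return replica.value
    return None

fresh, stale = Replica(), Replica()
fresh.apply("new-photo.jpg", version=7)   # client's own write was version 7
stale.apply("old-photo.jpg", version=6)   # this replica hasn't caught up

print(read_your_writes(fresh, min_version=7))  # new-photo.jpg
print(read_your_writes(stale, min_version=7))  # None -> retry on a fresher replica
```

Other clients carrying no token (min_version = 0) can still read from the lagging replica, which is exactly the guarantee's scope: only your own writes are protected.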
Monotonic Reads
If a client reads a value for a key, any subsequent read of the same key by that client will return the same value or a newer one. Time never "goes backward" for a single client.
Without monotonic reads: Client reads X=5 from Node A, then reads X=3 from Node B (which hasn't received the update yet). Confusing!
With monotonic reads: The system ensures the second read returns at least X=5.
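One way to enforce this is a client-side high-water mark: record the highest version observed and reject any answer below it. A sketch under those assumptions (the class and version numbers are illustrative):

```python
class MonotonicClient:
    """Client-side guard: never accept a read older than one already seen."""
    def __init__(self):
        self.high_water = 0   # highest version observed so far

    def accept(self, value, version: int):
        if version < self.high_water:
            return None        # time would go backward: reject / retry
        self.high_water = version
        return value

c = MonotonicClient()
print(c.accept(5, version=10))  # 5    (read from up-to-date Node A)
print(c.accept(3, version=8))   # None (Node B lags: stale read suppressed)
```

The guard is purely local to one client, which is why monotonic reads are so much cheaper than linearizability: no cross-node coordination is required.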
The Consistency Spectrum
From strongest to weakest:
Linearizability
↓
Sequential Consistency
↓
Causal Consistency
↓
Read-Your-Writes + Monotonic Reads
↓
Eventual Consistency
Each step down relaxes guarantees but improves latency, throughput, and availability. The art of system design is choosing the weakest model that your application can tolerate.
Real-World CAP Classification
The following table classifies popular databases by their CAP and PACELC behavior. Note that many databases are tunable — they can behave as CP or AP depending on configuration.
| Database | CAP | PACELC | Default Behavior | Notes |
|---|---|---|---|---|
| PostgreSQL (single node) | CA* | N/A (single node) | Strong consistency, always available | Not distributed; CA is trivial |
| PostgreSQL (streaming replication) | CP | PC/EC | Sync replication: consistent but follower may be unavailable | Async replication behaves more like AP |
| MySQL (Group Replication) | CP | PC/EC | Majority-based consensus; minority partitions become read-only | Single-primary mode is CP; multi-primary can have conflicts |
| MongoDB | CP | PC/EC (with majority concern) | Majority write concern ensures durability; minority unavailable | Tunable: lower write concern trades consistency for availability |
| Cassandra | AP | PA/EL | Always writable; LWW conflict resolution; eventual consistency | With QUORUM reads/writes, behaves more like CP |
| DynamoDB | AP | PA/EL | Eventually consistent reads by default; always available | Strongly consistent reads available (higher latency) |
| Redis (Cluster) | CP | PC/EL | Single master per slot; async replication can lose recent writes | During partition: minority partition stops serving the slot |
| CouchDB | AP | PA/EL | Multi-master; accepts writes anywhere; conflict detection on sync | Manual conflict resolution via revision trees |
| HBase | CP | PC/EC | One RegionServer per region; strong consistency within regions | Region unavailable during RegionServer failure until reassignment |
| etcd | CP | PC/EC | Raft consensus; linearizable reads and writes | Minority partition completely unavailable |
| Consul | CP | PC/EC | Raft consensus for KV store; strong consistency by default | Stale reads available for AP-like behavior |
| ZooKeeper | CP | PC/EC | ZAB protocol; writes are linearizable; reads are sequentially consistent | Sync command available for linearizable reads |
| Google Spanner | CP | PC/EC | External consistency via TrueTime; Paxos-based replication | Highest consistency guarantee of any globally distributed DB |
| CockroachDB | CP | PC/EC | Serializable isolation; Raft-based replication | Spanner-inspired; uses hybrid logical clocks |
CAP in System Design Interviews
When to Bring Up CAP
CAP is most relevant when you are:
- Choosing a database or data store for a distributed system.
- Designing replication strategy (sync vs async).
- Discussing trade-offs in a system that spans multiple data centers or regions.
- Explaining why you chose eventual consistency vs strong consistency.
Framework for Discussing CAP
- Identify the consistency requirement: "Does this feature need strong consistency, or can we tolerate stale data?"
- Assess the availability requirement: "What happens if this service returns an error for 30 seconds? Is that acceptable?"
- Choose the trade-off: "For this use case, I'll prefer availability and accept eventual consistency because..."
- Discuss PACELC: "Even without partitions, I need low latency for reads, so I'll choose PA/EL — Cassandra with CONSISTENCY ONE."
- Mention mitigation strategies: How you handle the downside of your choice (conflict resolution, retries, compensating transactions).
Example: Design a Chat System
Consider a WhatsApp-like messaging system. Different components have different CAP requirements:
| Component | CAP Choice | Reasoning |
|---|---|---|
| Message delivery | AP | Users should always be able to send messages. Duplicate delivery is better than message loss. Eventually consistent ordering is acceptable (messages may arrive out of order briefly). |
| Group membership | CP | Adding/removing members must be consistent. Two admins simultaneously removing each other creates paradoxes. Use a CP store (e.g., ZooKeeper or Raft-based service). |
| Read receipts | AP | Showing a read receipt 2 seconds late is fine. Showing "unread" when the message was read is annoying but not catastrophic. |
| User profiles | AP | Profile changes can propagate eventually. Seeing an old profile photo for a few seconds is acceptable. |
| End-to-end encryption keys | CP | Key exchange must be strongly consistent. Using a stale public key means the message is unreadable. |
Common Interview Questions
- "Is MongoDB CP or AP?" — MongoDB is CP by default (with majority read/write concern). But with lower write concerns, it can behave more like AP. The answer should mention tunability.
- "Can we have all three?" — No, but during normal operation (no partition) you can have both C and A. The sacrifice only happens during a partition. Mention PACELC for bonus points.
- "How does Cassandra handle conflicts?" — Last-write-wins using timestamps. This can lose writes if clocks are skewed. Alternatives: vector clocks (Riak), CRDTs, or application-level resolution.
- "What about microservices?" — Each microservice can make its own CAP trade-off. The order service might be CP (don't oversell), while the recommendation service is AP (stale recommendations are fine).
Anti-Patterns to Avoid
- "We'll just use a CA database." — Shows you don't understand that P is mandatory in a distributed system.
- Treating CAP as a one-time choice. — Different parts of the same system can (and should) have different trade-offs.
- Ignoring PACELC. — CAP only describes behavior during partitions. Most of the time, there are no partitions, and the latency-consistency trade-off matters more.
- Binary thinking. — Consistency and availability are not binary. There is a spectrum of consistency models and a spectrum of availability targets (99.9% vs 99.999%).
The CAP theorem is a starting point, not an endpoint. Real-world system design is about understanding the nuances — tunable consistency, partial failures, and the fact that most of the time, your system is operating normally (no partition), where the latency-consistency trade-off from PACELC is what really matters. In the next post, we will explore ACID vs BASE, two philosophies for data consistency that emerge directly from the CAP trade-offs we discussed here.