Two-Phase Commit (2PC)
When a single database transaction spans multiple machines — an order service debiting inventory, a payment service charging a credit card, and a shipping service reserving a slot — you need all of them to commit or all of them to roll back. There is no acceptable middle ground where payment succeeds but inventory is not deducted. The Two-Phase Commit protocol (2PC) is the classic solution to this atomic commitment problem in distributed systems. It is battle-tested, formally proven correct, and powers everything from traditional XA transactions to Google Spanner. It is also slow, blocking, and fragile in the face of certain failures. This post gives you the complete picture.
The Distributed Transaction Problem
On a single machine, atomicity is straightforward: the database writes to a write-ahead log (WAL), and either all changes are flushed or the entire transaction is rolled back from the log. The ACID guarantees hold because a single process controls the outcome.
In a distributed system, no single node controls all the data. Each participant holds a shard of the state, runs on separate hardware, and can fail independently. The fundamental challenge: every participant must reach the same commit-or-abort outcome even though any of them, or the network between them, can fail at any point.
Requirements for a Solution
- Agreement — All participants that reach a decision reach the same decision (commit or abort).
- Validity — If all participants vote YES and no failure occurs, the decision is COMMIT. If any participant votes NO, the decision is ABORT.
- Termination — Every non-failed participant eventually reaches a decision. (This is the one 2PC cannot fully guarantee — more on that later.)
- Durability — Once a participant commits, the changes survive crashes.
Why Not Just "Send Commit to Everyone"?
Consider a naïve approach: the coordinator sends COMMIT to all participants simultaneously.
- Participant A commits successfully.
- Participant B detects a constraint violation and cannot commit.
- Participant C never receives the message (network partition).
Result: A committed, B aborted, C is in an unknown state. The transaction is neither atomic nor consistent. We need a voting phase before the decision phase, and that is exactly what 2PC provides.
The 2PC Protocol
The protocol has two roles: a Coordinator (also called the Transaction Manager, or TM) and multiple Participants (also called Resource Managers, or RMs). The protocol proceeds in two phases:
Phase 1: Prepare (Voting)
1. The coordinator writes PREPARE to its own log and sends a PREPARE message to all participants.
2. Each participant receives PREPARE and decides whether it can commit. It checks constraints, acquires locks, writes all changes to its WAL (but does not commit yet), and forces the log to durable storage.
3. If the participant can commit, it responds YES (a vote to commit). If it cannot (constraint violation, resource exhaustion, timeout), it responds NO (a vote to abort) and can immediately roll back locally.
4. A YES vote is a promise: the participant guarantees it can commit the transaction at any point in the future, surviving crashes (because the changes are in the WAL).
Phase 2: Commit / Abort (Decision)
1. If every participant voted YES, the coordinator writes COMMIT to its log (force-write) and sends COMMIT to all participants.
2. If any participant voted NO or failed to respond, the coordinator writes ABORT to its log and sends ABORT to all participants.
3. Each participant applies the decision and sends an ACK to the coordinator.
4. Once all ACKs have arrived, the coordinator writes END to its log and garbage-collects the transaction state.
(Interactive demo, 2PC Happy Path: step through the normal 2PC flow where all participants vote YES.)
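The control flow is small enough to sketch end to end. Below is a minimal, illustrative coordinator in Python; the participant stubs with prepare()/commit()/abort(), and the log object with a blocking force(), are assumptions made for this sketch, not a real library API.

```python
# Minimal sketch of the coordinator side of 2PC (illustrative only).
# Assumes hypothetical participant stubs with prepare()/commit()/abort()
# and a log whose force() does not return until the record is durable.

class Coordinator:
    def __init__(self, log, participants):
        self.log = log
        self.participants = participants

    def run_transaction(self, txn_id):
        # Phase 1: voting
        self.log.force(f"PREPARE {txn_id}")
        votes = []
        for p in self.participants:
            try:
                votes.append(p.prepare(txn_id))   # True means a YES vote
            except Exception:                     # timeout, crash, or NO
                votes.append(False)

        # Phase 2: decision (the force-write below is the commit point)
        decision = "COMMIT" if all(votes) else "ABORT"
        self.log.force(f"{decision} {txn_id}")
        for p in self.participants:
            try:
                if decision == "COMMIT":
                    p.commit(txn_id)
                else:
                    p.abort(txn_id)
            except Exception:
                pass   # safe: the durable decision is re-sent on recovery

        self.log.append(f"END {txn_id}")          # lazy, no force needed
        return decision
```

Note that the decision is force-written before any participant hears about it; everything after that point is retryable.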
Formal State Machines
Coordinator FSM: INIT → WAITING (collecting votes) → COMMITTED (decision logged) or ABORTED (decision logged).
Participant FSM: INIT → PREPARED (voted YES, holding locks ⚠) → COMMITTED (locks released) or ABORTED (rolled back).
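For readers who prefer code to diagrams, here is one way to encode the legal transitions. The WAITING label for the coordinator's vote-collection state is my own; the rest are the state names used throughout this post.

```python
# Legal transitions of the two 2PC state machines (sketch).
COORDINATOR_TRANSITIONS = {
    "INIT":      {"WAITING"},                # PREPARE logged, votes requested
    "WAITING":   {"COMMITTED", "ABORTED"},   # decision force-written
    "COMMITTED": set(),
    "ABORTED":   set(),
}

PARTICIPANT_TRANSITIONS = {
    "INIT":      {"PREPARED", "ABORTED"},    # vote YES (locks held) or vote NO
    "PREPARED":  {"COMMITTED", "ABORTED"},   # only on the coordinator's decision
    "COMMITTED": set(),                      # locks released
    "ABORTED":   set(),                      # rolled back
}

def transition(table, state, new_state):
    """Apply a transition, rejecting anything the protocol forbids."""
    if new_state not in table[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```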
Write-Ahead Logging
Both coordinator and participants use force-write WAL at critical points to survive crashes. The log entries enable recovery:
| Role | Log Entry | When | Purpose on Recovery |
|---|---|---|---|
| Coordinator | PREPARE | Before sending PREPARE | Know that voting was initiated |
| Coordinator | COMMIT / ABORT | Before sending decision | Re-send decision to participants |
| Coordinator | END | After all ACKs | Safe to garbage-collect |
| Participant | YES / NO | Before sending vote | Know what was voted |
| Participant | COMMIT / ABORT | After receiving decision | Know final outcome after crash |
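Every row in the table relies on the same primitive: a force-write that does not return until the record is on durable storage. A minimal sketch in Python, with a file name and record format chosen purely for illustration:

```python
import os

class ForceLog:
    """Append-only log with force-writes (illustrative sketch)."""

    def __init__(self, path="2pc.log"):
        self.f = open(path, "ab")

    def force(self, record: str):
        self.f.write((record + "\n").encode())
        self.f.flush()                 # drain user-space buffers
        os.fsync(self.f.fileno())      # block until the OS has persisted it

    def append(self, record: str):
        # Non-forced write, e.g. the coordinator's END record.
        self.f.write((record + "\n").encode())
```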
Failure Mode Analysis
2PC handles many failure scenarios correctly. The problems arise in one specific window. Let us enumerate every case.
Participant Crashes Before Voting
⚠ Failure: Participant crashes before sending YES/NO
Effect: Coordinator times out waiting for this participant's vote. Since not all voted YES, coordinator decides ABORT and sends ABORT to all other participants.
Recovery: When the crashed participant restarts, it finds no YES record in its log, so it aborts locally. Handled correctly.
Participant Crashes After Voting YES
⚠ Failure: Participant crashes after voting YES
Effect: The coordinator may or may not have received the vote. If it already has all YES votes, it commits; otherwise it times out and aborts.
Recovery: When the crashed participant restarts, it reads its WAL, finds the YES record, and contacts the coordinator to learn the outcome. If the coordinator is up, it gets the answer and applies it. Handled correctly.
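A sketch of that restart logic, assuming a replayable log with records like "YES txn42" / "COMMIT txn42" and a hypothetical ask_coordinator RPC:

```python
def recover_participant(log_records, ask_coordinator):
    """Replay the WAL after a crash and resolve in-doubt transactions (sketch)."""
    voted_yes, outcome = set(), {}
    for rec in log_records:
        op, txn = rec.split()
        if op == "YES":
            voted_yes.add(txn)
        elif op in ("COMMIT", "ABORT"):
            outcome[txn] = op

    for txn in voted_yes:
        if txn not in outcome:
            # In doubt: we promised to commit but never learned the decision.
            outcome[txn] = ask_coordinator(txn)   # may block if the coordinator is down
    # Transactions with no YES record can simply be rolled back locally.
    return outcome
```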
Coordinator Crashes Before Decision
🔴 Failure: Coordinator crashes after sending PREPARE but before writing COMMIT/ABORT
Effect: All participants that voted YES are now stuck in the PREPARED state. They are holding locks and cannot unilaterally decide — another participant might have voted NO, in which case the decision should be ABORT. But they do not know.
Recovery: Participants must wait for the coordinator to recover. If the coordinator never recovers, the transaction is blocked indefinitely. BLOCKING — the critical weakness of 2PC.
Coordinator Crashes After Decision
⚠ Failure: Coordinator crashes after logging COMMIT but before sending to all participants
Effect: Some participants received the decision, others did not. Those that did not are stuck waiting.
Recovery: When the coordinator restarts, it reads its WAL and re-sends the decision to all participants that have not ACKed. Handled correctly.
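The coordinator-side restart is equally mechanical. A sketch, again with an assumed record format and a hypothetical deliver_decision RPC that participants treat as idempotent:

```python
def recover_coordinator(log_records, participants):
    """Re-drive unfinished transactions after a coordinator restart (sketch)."""
    decision, ended = {}, set()
    for rec in log_records:
        op, txn = rec.split()
        if op in ("COMMIT", "ABORT"):
            decision[txn] = op              # overrides the default set below
        elif op == "END":
            ended.add(txn)
        elif op == "PREPARE":
            # PREPARE with no later decision: no COMMIT was ever sent,
            # so deciding ABORT on recovery is safe.
            decision.setdefault(txn, "ABORT")

    for txn, d in decision.items():
        if txn not in ended:                # not yet fully acknowledged
            for p in participants:
                p.deliver_decision(txn, d)  # duplicates are ignored
```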
Network Partition
🔴 Failure: Network partition separates coordinator from participants
Effect: Identical to coordinator crash from the participants' perspective. If they are in the PREPARED state, they are blocked. They cannot distinguish "coordinator is slow" from "coordinator is dead" from "network partition."
Resolution: Participants can attempt to contact each other to learn the outcome (cooperative termination). If even one participant knows the decision, the others can learn it. But if no participant knows the decision (all voted YES, none received the final decision), the protocol is blocked. BLOCKING.
Coordinator + Participant Crash Simultaneously
🔴 Failure: Coordinator crashes after sending COMMIT to Participant A, then A also crashes
Effect: The only nodes that knew the decision (the coordinator and A) are both down. The remaining participants are in the PREPARED state with no way to learn the outcome. They must hold locks and wait.
Recovery: The system is blocked until either the coordinator or A recovers. BLOCKING.
(Interactive demo, 2PC Failure Scenarios: step through a NO vote causing an abort, then a coordinator crash leaving participants stuck.)
The Blocking Problem in Depth
To truly understand why 2PC blocks, consider the information asymmetry in the PREPARED state.
The Uncertainty Principle
When a participant is in the PREPARED state (voted YES, waiting for decision):
- It cannot commit unilaterally — another participant might have voted NO, requiring an abort.
- It cannot abort unilaterally — all participants might have voted YES, and the coordinator might have already committed. Aborting would violate the agreement property.
- It cannot release locks — releasing locks before knowing the outcome could allow other transactions to read uncommitted data.
This is the only state in 2PC where a participant is uncertain. In INIT, it can abort. After COMMITTED or ABORTED, it knows the outcome. But in PREPARED, it is completely dependent on external information — the coordinator's decision.
Cooperative Termination Protocol
A partial mitigation: when a participant times out waiting for the coordinator, it contacts the other participants:
| Other Participant's State | What the Querying Participant Learns | Action |
|---|---|---|
| COMMITTED | Decision was COMMIT | Commit locally |
| ABORTED | Decision was ABORT | Abort locally |
| INIT (never voted) | Can safely abort (this participant will vote NO) | Both abort |
| PREPARED | No information — the other participant is also stuck | Still blocked |
If all reachable participants are in the PREPARED state, no progress can be made. This is the worst case: everyone voted YES, the coordinator crashed before logging its decision, and now every participant is holding locks, unable to do anything.
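A sketch of that query loop, assuming each peer exposes a hypothetical query_state(txn) RPC returning one of the state names from the table:

```python
def cooperative_terminate(txn, peers):
    """Try to resolve an in-doubt transaction by asking other participants (sketch)."""
    for peer in peers:
        try:
            state = peer.query_state(txn)
        except Exception:
            continue                      # peer unreachable, try the next one
        if state in ("COMMITTED", "ABORTED"):
            return state                  # adopt the decision that peer already knows
        if state == "INIT":
            return "ABORTED"              # that peer has not voted, so abort is still safe
        # state == "PREPARED": no information, keep asking
    return None                           # everyone reachable is also in doubt: blocked
```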
Real-World Impact
In production systems, the blocking problem manifests as:
- Lock contention: Prepared transactions hold row-level or table-level locks. Other transactions that need those rows are blocked, causing a cascading pile-up.
- Resource exhaustion: Each prepared transaction consumes memory (undo logs, lock tables, connection slots). Enough blocked transactions can exhaust the system.
- Timeout cascades: Upstream services waiting for the blocked transaction time out, trigger retries, and create more load.
- Manual intervention: Often, a DBA must manually resolve in-doubt transactions after a coordinator crash — a process called heuristic resolution that can violate atomicity.
Three-Phase Commit (3PC)
In the early 1980s, Dale Skeen proposed a modification to 2PC that eliminates the blocking problem under certain failure models: the Three-Phase Commit protocol (3PC). The key insight is to add an extra phase that propagates the coordinator's decision to participants before asking them to commit.
The Three Phases
Phase 1: CanCommit (Voting) — Identical to 2PC's prepare phase. Coordinator sends CAN-COMMIT?, participants respond YES or NO.
Phase 2: PreCommit — If all participants voted YES, the coordinator sends PRE-COMMIT. Participants acknowledge receipt. This is the crucial addition: it acts as a buffer between voting and committing. All participants now know that every other participant voted YES.
Phase 3: DoCommit — The coordinator sends DO-COMMIT. Participants commit and ACK.
Why 3PC is Non-Blocking
The magic of the PRE-COMMIT phase: if the coordinator crashes after sending PRE-COMMIT to some participants but not others, a recovery coordinator can safely decide:
- If any participant received PRE-COMMIT, we know all voted YES → safe to commit.
- If no participant received PRE-COMMIT, we can safely abort.
There is no longer a state where participants are uncertain and cannot determine the outcome from each other. The PRE-COMMIT phase ensures that before any participant commits, all participants know the decision.
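The recovery rule is short enough to write down. A sketch, assuming a surviving node can collect the last known state of every reachable participant; this is also exactly the rule that misfires under a network partition, because the two sides can collect different views and reach opposite decisions:

```python
def recovery_decision(states):
    """3PC recovery rule (sketch): states maps participant -> last known state."""
    if any(s in ("PRE-COMMITTED", "COMMITTED") for s in states.values()):
        return "COMMIT"   # PRE-COMMIT was sent, so every participant voted YES
    return "ABORT"        # nobody saw PRE-COMMIT, so nobody has committed yet

# e.g. recovery_decision({"A": "PRE-COMMITTED", "B": "PREPARED"}) -> "COMMIT"
```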
3PC Trade-Offs
| Aspect | 2PC | 3PC |
|---|---|---|
| Message rounds | 2 (Prepare + Commit) | 3 (CanCommit + PreCommit + DoCommit) |
| Messages per txn (N participants) | 4N | 6N |
| Forced log writes | 3 (coord) + 2 (each participant) | 4 (coord) + 3 (each participant) |
| Blocking? | Yes (coordinator crash) | No (with reliable failure detector) |
| Latency | 2 round trips | 3 round trips |
| Network partitions | Blocks | Can violate safety — split-brain possible |
Google Spanner & TrueTime
Google Spanner is arguably the most ambitious distributed database ever built, and it uses 2PC as its distributed transaction protocol. How does it deal with the blocking problem?
Architecture Overview
Spanner combines several innovations:
- Paxos groups: Each shard is replicated across 5 data centers using Paxos consensus. The Paxos leader for each shard acts as the 2PC participant. If a leader fails, Paxos elects a new one in seconds — so the "participant" is effectively always available.
- 2PC over Paxos: The coordinator itself is a Paxos group. If the coordinator leader fails, a new coordinator leader is elected. This eliminates the single coordinator as a point of failure.
- TrueTime API: GPS and atomic clocks in every data center provide a global time API with bounded uncertainty: TT.now() returns an interval [earliest, latest] that is guaranteed to contain the true time. Typical uncertainty is 1–7 ms.
External Consistency via TrueTime
Spanner uses TrueTime to assign globally meaningful timestamps to transactions:
- After all participants vote YES (prepare), the coordinator picks a commit timestamp s ≥ TT.now().latest.
- Before sending the commit, the coordinator waits until TT.now().earliest > s (the so-called commit-wait).
- This guarantees that any transaction that starts after this one in real time will see a later timestamp, ensuring external consistency (equivalent to linearizability).
The commit-wait is typically 7–14 ms (twice the clock uncertainty), which adds latency but provides something no other large-scale system offers: externally consistent distributed transactions at global scale.
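A sketch of the commit-wait loop, with a hypothetical truetime_now() that returns an (earliest, latest) pair bounding the true time:

```python
import time

def commit_with_wait(truetime_now):
    """Pick a commit timestamp and wait out the clock uncertainty (sketch)."""
    _, latest = truetime_now()
    s = latest                       # commit timestamp: s >= TT.now().latest
    while True:
        earliest, _ = truetime_now()
        if earliest > s:             # true time is now provably past s
            return s                 # safe to make the commit visible
        time.sleep(0.001)            # commit-wait: roughly 2x the uncertainty
```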
Spanner's Answer to Blocking
Spanner does not eliminate 2PC's blocking — it makes it extremely unlikely:
- Coordinator crash → Paxos elects a new coordinator in ~10 seconds.
- Participant crash → Paxos elects a new shard leader with the prepared state already replicated.
- The "vulnerability window" where blocking could occur is reduced to the few seconds of Paxos leader election, rather than the hours or days of a crashed machine being replaced.
Performance Analysis
2PC is not a fast protocol. Let us quantify exactly how slow it is and where the time goes.
Latency Breakdown
| Phase | Network Round Trips | Forced Disk Writes | Description |
|---|---|---|---|
| Prepare | 1 (coord → participants → coord) | 1 coord + N participants | Vote collection |
| Commit | 1 (coord → participants → coord) | 1 coord + N participants | Decision + ACK |
| Total | 2 round trips | 2 + 2N forced writes | |
For a cross-data-center transaction with 50 ms RTT and 5 ms fsync:
- Network: 2 × 50 ms = 100 ms
- Disk (critical path): 2 × 5 ms (coordinator prepare + commit) = 10 ms
- Lock holding time: at least 110 ms (from prepare to commit acknowledgment)
Compare this to a local transaction that takes ~5 ms. The overhead is 20× or more.
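The arithmetic above as a tiny model; the 50 ms RTT and 5 ms fsync are the assumed inputs from this example, not measurements:

```python
rtt_ms, fsync_ms = 50, 5

prepare_phase = rtt_ms + fsync_ms     # round trip plus one forced write
commit_phase  = rtt_ms + fsync_ms     # round trip plus one forced write
lock_hold_ms  = prepare_phase + commit_phase

print(lock_hold_ms)                   # 110 ms of lock holding, vs ~5 ms locally
```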
Throughput Impact
The real cost is not just latency — it is the lock duration. For the entire 2 round trips + processing time, the participant holds exclusive locks on the affected rows. This means:
- Other transactions that need those rows are queued, reducing throughput.
- The longer the lock is held, the higher the probability of deadlocks with other distributed transactions.
- With N participants across data centers, the abort rate increases due to timeouts, further reducing effective throughput.
Common Optimizations
| Optimization | How It Helps | Trade-off |
|---|---|---|
| Presumed Abort | Coordinator does not force-write the ABORT decision. Participants that time out assume ABORT. Saves one forced write on the abort path. | Cannot distinguish crash from abort on recovery; must re-query. |
| Presumed Commit | After the commit point, the coordinator does not need ACKs before completing. Participants that miss the COMMIT will eventually ask. | Slightly more complex recovery. |
| Read-Only Optimization | If a participant has only performed reads (no writes), it votes READ-ONLY and releases locks immediately — no phase 2 needed. | Only applicable to read-only participants. |
| Last Agent Optimization | The coordinator sends the commit decision directly to the last participant, skipping the separate commit message. The last participant commits without waiting for the explicit commit. | Specific ordering requirement. |
| Group Commit | Batch multiple transaction log entries into a single disk write, amortizing fsync cost (see the sketch below). | Adds small latency for batching window. |
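Group commit is the easiest of these to sketch in isolation: callers queue their records and one fsync covers the whole batch. The batching window, file name, and API below are assumptions for illustration:

```python
import os
import queue
import threading

class GroupCommitLog:
    """Batch log records so one fsync covers many commits (sketch)."""

    def __init__(self, path="wal.log", window_s=0.002):
        self.f = open(path, "ab")
        self.q = queue.Queue()
        self.window_s = window_s
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: str):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                       # returns only once the batch is durable

    def _writer(self):
        while True:
            batch = [self.q.get()]        # block for the first record
            try:
                # Collect whatever else arrives within the batching window.
                while True:
                    batch.append(self.q.get(timeout=self.window_s))
            except queue.Empty:
                pass
            for rec, _ in batch:
                self.f.write((rec + "\n").encode())
            self.f.flush()
            os.fsync(self.f.fileno())     # one fsync amortized over the batch
            for _, done in batch:
                done.set()
```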
2PC vs Alternatives
2PC is one solution to distributed transactions. Here is how it compares to the major alternatives:
2PC (Two-Phase Commit)
- Consistency: Strong (ACID across nodes)
- Availability: Low (blocking on failures)
- Latency: 2 round trips + lock hold time
- Complexity: Medium
- Use when: Strong consistency required, low-latency network, limited participants
Saga Pattern
- Consistency: Eventual (compensating transactions)
- Availability: High (no global locks)
- Latency: Sum of individual steps
- Complexity: High (compensating logic)
- Use when: Long-lived transactions, microservices, availability > consistency
TCC (Try-Confirm-Cancel)
- Consistency: Eventual (with reservation)
- Availability: High
- Latency: 2 phases but no global locks
- Complexity: High (3 APIs per service)
- Use when: Need reservation semantics, flexible consistency
Outbox + Event Sourcing
- Consistency: Eventual
- Availability: Very high
- Latency: Low (async propagation)
- Complexity: Medium
- Use when: Decoupled services, event-driven architecture
Detailed Comparison Table
| Property | 2PC | 3PC | Saga | Paxos Commit |
|---|---|---|---|---|
| Atomicity | Yes | Yes | Eventual | Yes |
| Blocking | Yes | No* | No | No |
| Network Partition Safe | Blocks | Unsafe | Safe | Safe |
| Message Complexity | O(N) | O(N) | O(N) | O(N²) |
| Round Trips | 2 | 3 | N steps | 2 |
| Lock Duration | Long | Longer | None (no global locks) | Long |
| Practical Use | Common | Rare | Common | Spanner-like |
* 3PC is non-blocking only with reliable failure detectors, which are impossible to build in asynchronous networks (FLP impossibility).
When to Use 2PC
Good Fit ✓
- Traditional XA transactions across multiple resources: Enterprise systems using Java EE / JTA with multiple database resources. The latency overhead is acceptable because the databases are on the same LAN.
- Cross-shard transactions (same data center): Systems like Spanner, CockroachDB, and YugabyteDB use 2PC for cross-shard writes within a tightly controlled infrastructure.
- Financial transfers between accounts in the same bank: When atomicity is legally required and all participants are controlled by the same organization.
- Schema migrations across shards: Coordinating DDL changes that must be atomic.
- Short-lived transactions with 2–3 participants: The fewer participants, the lower the probability of failure during the vulnerability window.
Bad Fit ✗
- Microservices across organizations: Different teams control different services. Holding locks across service boundaries is impractical — a slow downstream service blocks everyone.
- Long-running business processes: Booking a trip (flight + hotel + car) should not hold database locks for seconds or minutes. Use Sagas instead.
- High-throughput event processing: Systems processing millions of events per second cannot afford 2 round trips and lock holding for each event.
- Cross-region / high-latency networks: With 200 ms RTT, the lock holding time becomes 400+ ms — unacceptable for most workloads.
- Systems requiring high availability: 2PC's blocking nature means any participant or coordinator failure can halt progress. Use eventual consistency patterns.
Real-World Implementations
| System | 2PC Variant | Key Innovation |
|---|---|---|
| XA / JTA | Standard 2PC | Industry standard interface (javax.transaction). Supported by most RDBMS and message brokers. |
| Google Spanner | 2PC over Paxos | TrueTime for external consistency. Paxos eliminates single coordinator SPOF. |
| CockroachDB | Parallel Commits | Writes commit record in parallel with the commit phase, reducing latency to 1 round trip for the common case. |
| TiDB / TiKV | Percolator-style | Inspired by Google Percolator. Uses a primary key as the coordinator — no separate coordinator node. |
| YugabyteDB | 2PC with Raft | Each shard is a Raft group. Similar to Spanner's 2PC-over-Paxos approach. |
| MySQL (InnoDB + Binlog) | Internal 2PC | Coordinates between InnoDB storage engine and binary log for crash-safe replication. |
| PostgreSQL | PREPARE TRANSACTION | Supports XA-style prepared transactions. Manual resolution required after coordinator crash. |
Summary & Key Takeaways
| Concept | Key Point |
|---|---|
| What 2PC solves | Atomic commitment across multiple nodes — all commit or all abort. |
| Phase 1 (Prepare) | Coordinator asks participants to vote YES/NO. YES is an irrevocable promise to commit. |
| Phase 2 (Commit/Abort) | Coordinator writes decision, broadcasts to all. The force-write is the commit point. |
| Critical weakness | Blocking: coordinator failure after PREPARE leaves participants stuck with locks held. |
| 3PC improvement | Adds PRE-COMMIT phase to eliminate blocking, but unsafe under network partitions. |
| Spanner approach | 2PC over Paxos + TrueTime: replicated coordinator eliminates SPOF; TrueTime gives external consistency. |
| Performance cost | 2 round trips + 2+2N forced writes + lock holding for entire duration. |
| When to use | Strong atomicity required, low-latency network, short transactions, few participants. |
| When to avoid | Microservices, cross-region, long-running transactions, high throughput, high availability needs. |