Two-Phase Commit (2PC)
When a single database transaction spans multiple machines — an order service debiting inventory, a payment service charging a credit card, and a shipping service reserving a slot — you need all of them to commit or all of them to roll back. There is no acceptable middle ground where payment succeeds but inventory is not deducted. The Two-Phase Commit protocol (2PC) is the classic solution to this atomic commitment problem in distributed systems. It is battle-tested, formally proven correct, and powers everything from traditional XA transactions to Google Spanner. It is also slow, blocking, and fragile in the face of certain failures. This post gives you the complete picture.
The Distributed Transaction Problem
On a single machine, atomicity is straightforward: the database writes to a write-ahead log (WAL), and either all changes are flushed or the entire transaction is rolled back from the log. The ACID guarantees hold because a single process controls the outcome.
In a distributed system, no single node controls all the data. Each participant holds a shard of the state, runs on separate hardware, and can fail independently. The fundamental challenge: every participant must reach the same commit-or-abort outcome even though any of them, or the network between them, can fail at any point.
Requirements for a Solution
- Agreement — All participants that reach a decision reach the same decision (commit or abort).
- Validity — If all participants vote YES and no failure occurs, the decision is COMMIT. If any participant votes NO, the decision is ABORT.
- Termination — Every non-failed participant eventually reaches a decision. (This is the one 2PC cannot fully guarantee — more on that later.)
- Durability — Once a participant commits, the changes survive crashes.
Why Not Just "Send Commit to Everyone"?
Consider a naïve approach: the coordinator sends COMMIT to all participants simultaneously.
- Participant A commits successfully.
- Participant B detects a constraint violation and cannot commit.
- Participant C never receives the message (network partition).
Result: A committed, B aborted, C is in an unknown state. The transaction is neither atomic nor consistent. We need a voting phase before the decision phase, and that is exactly what 2PC provides.
The 2PC Protocol
The protocol has two roles: a Coordinator (also called the Transaction Manager, or TM) and multiple Participants (also called Resource Managers, or RMs). The protocol proceeds in two phases:
Phase 1: Prepare (Voting)
1. The coordinator writes PREPARE to its own log and sends a PREPARE message to all participants.
2. Each participant receives PREPARE and decides whether it can commit. It checks constraints, acquires locks, writes all changes to its WAL (but does not commit yet), and forces the log to durable storage.
3. If the participant can commit, it responds YES (a vote to commit). If it cannot (constraint violation, resource exhaustion, timeout), it responds NO (a vote to abort) and can immediately roll back locally.
4. A YES vote is a promise: the participant guarantees it can commit the transaction at any point in the future, surviving crashes (because the changes are in the WAL).
Phase 2: Commit / Abort (Decision)
1. If every participant voted YES, the coordinator writes COMMIT to its log (force-write) and sends COMMIT to all participants.
2. If any participant voted NO or failed to respond, the coordinator writes ABORT to its log and sends ABORT to all participants.
3. Each participant applies the decision and sends an ACK to the coordinator.
4. Once all ACKs have arrived, the coordinator writes END to its log and garbage-collects the transaction state.
(Interactive demo, 2PC Happy Path: step through the normal 2PC flow where all participants vote YES.)
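The control flow is small enough to sketch end to end. Below is a minimal, illustrative coordinator in Python; the participant stubs with prepare()/commit()/abort(), and the log object with a blocking force(), are assumptions made for this sketch, not a real library API.

```python
# Minimal sketch of the coordinator side of 2PC (illustrative only).
# Assumes hypothetical participant stubs with prepare()/commit()/abort()
# and a log whose force() does not return until the record is durable.

class Coordinator:
    def __init__(self, log, participants):
        self.log = log
        self.participants = participants

    def run_transaction(self, txn_id):
        # Phase 1: voting
        self.log.force(f"PREPARE {txn_id}")
        votes = []
        for p in self.participants:
            try:
                votes.append(p.prepare(txn_id))   # True means a YES vote
            except Exception:                     # timeout, crash, or NO
                votes.append(False)

        # Phase 2: decision (the force-write below is the commit point)
        decision = "COMMIT" if all(votes) else "ABORT"
        self.log.force(f"{decision} {txn_id}")
        for p in self.participants:
            try:
                if decision == "COMMIT":
                    p.commit(txn_id)
                else:
                    p.abort(txn_id)
            except Exception:
                pass   # safe: the durable decision is re-sent on recovery

        self.log.append(f"END {txn_id}")          # lazy, no force needed
        return decision
```

Note that the decision is force-written before any participant hears about it; everything after that point is retryable.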
Formal State Machines
Coordinator FSM: INIT → WAITING (collecting votes) → COMMITTED (decision logged) or ABORTED (decision logged).
Participant FSM: INIT → PREPARED (voted YES, holding locks ⚠) → COMMITTED (locks released) or ABORTED (rolled back).
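For readers who prefer code to diagrams, here is one way to encode the legal transitions. The WAITING label for the coordinator's vote-collection state is my own; the rest are the state names used throughout this post.

```python
# Legal transitions of the two 2PC state machines (sketch).
COORDINATOR_TRANSITIONS = {
    "INIT":      {"WAITING"},                # PREPARE logged, votes requested
    "WAITING":   {"COMMITTED", "ABORTED"},   # decision force-written
    "COMMITTED": set(),
    "ABORTED":   set(),
}

PARTICIPANT_TRANSITIONS = {
    "INIT":      {"PREPARED", "ABORTED"},    # vote YES (locks held) or vote NO
    "PREPARED":  {"COMMITTED", "ABORTED"},   # only on the coordinator's decision
    "COMMITTED": set(),                      # locks released
    "ABORTED":   set(),                      # rolled back
}

def transition(table, state, new_state):
    """Apply a transition, rejecting anything the protocol forbids."""
    if new_state not in table[state]:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state
```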
Write-Ahead Logging
Both coordinator and participants use force-write WAL at critical points to survive crashes. The log entries enable recovery:
| Role | Log Entry | When | Purpose on Recovery |
|---|---|---|---|
| Coordinator | PREPARE | Before sending PREPARE | Know that voting was initiated |
| Coordinator | COMMIT / ABORT | Before sending decision | Re-send decision to participants |
| Coordinator | END | After all ACKs | Safe to garbage-collect |
| Participant | YES / NO | Before sending vote | Know what was voted |
| Participant | COMMIT / ABORT | After receiving decision | Know final outcome after crash |
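Every row in the table relies on the same primitive: a force-write that does not return until the record is on durable storage. A minimal sketch in Python, with a file name and record format chosen purely for illustration:

```python
import os

class ForceLog:
    """Append-only log with force-writes (illustrative sketch)."""

    def __init__(self, path="2pc.log"):
        self.f = open(path, "ab")

    def force(self, record: str):
        self.f.write((record + "\n").encode())
        self.f.flush()                 # drain user-space buffers
        os.fsync(self.f.fileno())      # block until the OS has persisted it

    def append(self, record: str):
        # Non-forced write, e.g. the coordinator's END record.
        self.f.write((record + "\n").encode())
```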
Failure Mode Analysis
2PC handles many failure scenarios correctly. The problems arise in one specific window. Let us enumerate every case.
Participant Crashes Before Voting
⚠ Failure: Participant crashes before sending YES/NO
Effect: Coordinator times out waiting for this participant's vote. Since not all voted YES, coordinator decides ABORT and sends ABORT to all other participants.
Recovery: When the crashed participant restarts, it finds no YES record in its log, so it aborts locally. Handled correctly.
Participant Crashes After Voting YES
⚠ Failure: Participant crashes after voting YES
Effect: The coordinator may or may not have received the vote. If it already has all YES votes, it commits; otherwise it times out and aborts.
Recovery: When the crashed participant restarts, it reads its WAL, finds the YES record, and contacts the coordinator to learn the outcome. If the coordinator is up, it gets the answer and applies it. Handled correctly.
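A sketch of that restart logic, assuming a replayable log with records like "YES txn42" / "COMMIT txn42" and a hypothetical ask_coordinator RPC:

```python
def recover_participant(log_records, ask_coordinator):
    """Replay the WAL after a crash and resolve in-doubt transactions (sketch)."""
    voted_yes, outcome = set(), {}
    for rec in log_records:
        op, txn = rec.split()
        if op == "YES":
            voted_yes.add(txn)
        elif op in ("COMMIT", "ABORT"):
            outcome[txn] = op

    for txn in voted_yes:
        if txn not in outcome:
            # In doubt: we promised to commit but never learned the decision.
            outcome[txn] = ask_coordinator(txn)   # may block if the coordinator is down
    # Transactions with no YES record can simply be rolled back locally.
    return outcome
```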
Coordinator Crashes Before Decision
🔴 Failure: Coordinator crashes after sending PREPARE but before writing COMMIT/ABORT
Effect: All participants that voted YES are now stuck in the PREPARED state. They are holding locks and cannot unilaterally decide — another participant might have voted NO, in which case the decision should be ABORT. But they do not know.
Recovery: Participants must wait for the coordinator to recover. If the coordinator never recovers, the transaction is blocked indefinitely. BLOCKING — the critical weakness of 2PC.
Coordinator Crashes After Decision
⚠ Failure: Coordinator crashes after logging COMMIT but before sending to all participants
Effect: Some participants received the decision, others did not. Those that did not are stuck waiting.
Recovery: When the coordinator restarts, it reads its WAL and re-sends the decision to all participants that have not ACKed. Handled correctly.
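The coordinator-side restart is equally mechanical. A sketch, again with an assumed record format and a hypothetical deliver_decision RPC that participants treat as idempotent:

```python
def recover_coordinator(log_records, participants):
    """Re-drive unfinished transactions after a coordinator restart (sketch)."""
    decision, ended = {}, set()
    for rec in log_records:
        op, txn = rec.split()
        if op in ("COMMIT", "ABORT"):
            decision[txn] = op              # overrides the default set below
        elif op == "END":
            ended.add(txn)
        elif op == "PREPARE":
            # PREPARE with no later decision: no COMMIT was ever sent,
            # so deciding ABORT on recovery is safe.
            decision.setdefault(txn, "ABORT")

    for txn, d in decision.items():
        if txn not in ended:                # not yet fully acknowledged
            for p in participants:
                p.deliver_decision(txn, d)  # duplicates are ignored
```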
Network Partition
🔴 Failure: Network partition separates coordinator from participants
Effect: Identical to coordinator crash from the participants' perspective. If they are in the PREPARED state, they are blocked. They cannot distinguish "coordinator is slow" from "coordinator is dead" from "network partition."
Resolution: Participants can attempt to contact each other to learn the outcome (cooperative termination). If even one participant knows the decision, the others can learn it. But if no participant knows the decision (all voted YES, none received the final decision), the protocol is blocked. BLOCKING.
Coordinator + Participant Crash Simultaneously
🔴 Failure: Coordinator crashes after sending COMMIT to Participant A, then A also crashes
Effect: The only nodes that knew the decision (the coordinator and A) are both down. The remaining participants are in the PREPARED state with no way to learn the outcome. They must hold locks and wait.
Recovery: The system is blocked until either the coordinator or A recovers. BLOCKING.
(Interactive demo, 2PC Failure Scenarios: step through a NO vote causing an abort, then a coordinator crash leaving participants stuck.)
The Blocking Problem in Depth
To truly understand why 2PC blocks, consider the information asymmetry in the PREPARED state.
The Uncertainty Principle
When a participant is in the PREPARED state (voted YES, waiting for decision):
- It cannot commit unilaterally — another participant might have voted NO, requiring an abort.
- It cannot abort unilaterally — all participants might have voted YES, and the coordinator might have already committed. Aborting would violate the agreement property.
- It cannot release locks — releasing locks before knowing the outcome could allow other transactions to read uncommitted data.
This is the only state in 2PC where a participant is uncertain. In INIT, it can abort. After COMMITTED or ABORTED, it knows the outcome. But in PREPARED, it is completely dependent on external information — the coordinator's decision.
Cooperative Termination Protocol
A partial mitigation: when a participant times out waiting for the coordinator, it contacts the other participants:
| Other Participant's State | What the Querying Participant Learns | Action |
|---|---|---|
| COMMITTED | Decision was COMMIT | Commit locally |
| ABORTED | Decision was ABORT | Abort locally |
| INIT (never voted) | Can safely abort (this participant will vote NO) | Both abort |
| PREPARED | No information — the other participant is also stuck | Still blocked |
If all reachable participants are in the PREPARED state, no progress can be made. This is the worst case: everyone voted YES, the coordinator crashed before logging its decision, and now every participant is holding locks, unable to do anything.
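A sketch of that query loop, assuming each peer exposes a hypothetical query_state(txn) RPC returning one of the state names from the table:

```python
def cooperative_terminate(txn, peers):
    """Try to resolve an in-doubt transaction by asking other participants (sketch)."""
    for peer in peers:
        try:
            state = peer.query_state(txn)
        except Exception:
            continue                      # peer unreachable, try the next one
        if state in ("COMMITTED", "ABORTED"):
            return state                  # adopt the decision that peer already knows
        if state == "INIT":
            return "ABORTED"              # that peer has not voted, so abort is still safe
        # state == "PREPARED": no information, keep asking
    return None                           # everyone reachable is also in doubt: blocked
```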
Real-World Impact
In production systems, the blocking problem manifests as:
- Lock contention: Prepared transactions hold row-level or table-level locks. Other transactions that need those rows are blocked, causing a cascading pile-up.
- Resource exhaustion: Each prepared transaction consumes memory (undo logs, lock tables, connection slots). Enough blocked transactions can exhaust the system.
- Timeout cascades: Upstream services waiting for the blocked transaction time out, trigger retries, and create more load.
- Manual intervention: Often, a DBA must manually resolve in-doubt transactions after a coordinator crash — a process called heuristic resolution that can violate atomicity.
Three-Phase Commit (3PC)
In the early 1980s, Dale Skeen proposed a modification to 2PC that eliminates the blocking problem under certain failure models: the Three-Phase Commit protocol (3PC). The key insight is to add an extra phase that propagates the coordinator's decision to participants before asking them to commit.
The Three Phases
Phase 1: CanCommit (Voting) — Identical to 2PC's prepare phase. Coordinator sends CAN-COMMIT?, participants respond YES or NO.
Phase 2: PreCommit — If all participants voted YES, the coordinator sends PRE-COMMIT. Participants acknowledge receipt. This is the crucial addition: it acts as a buffer between voting and committing. All participants now know that every other participant voted YES.
Phase 3: DoCommit — The coordinator sends DO-COMMIT. Participants commit and ACK.
Why 3PC is Non-Blocking
The magic of the PRE-COMMIT phase: if the coordinator crashes after sending PRE-COMMIT to some participants but not others, a recovery coordinator can safely decide:
- If any participant received PRE-COMMIT, we know all voted YES → safe to commit.
- If no participant received PRE-COMMIT, we can safely abort.
There is no longer a state where participants are uncertain and cannot determine the outcome from each other. The PRE-COMMIT phase ensures that before any participant commits, all participants know the decision.
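The recovery rule is short enough to write down. A sketch, assuming a surviving node can collect the last known state of every reachable participant; this is also exactly the rule that misfires under a network partition, because the two sides can collect different views and reach opposite decisions:

```python
def recovery_decision(states):
    """3PC recovery rule (sketch): states maps participant -> last known state."""
    if any(s in ("PRE-COMMITTED", "COMMITTED") for s in states.values()):
        return "COMMIT"   # PRE-COMMIT was sent, so every participant voted YES
    return "ABORT"        # nobody saw PRE-COMMIT, so nobody has committed yet

# e.g. recovery_decision({"A": "PRE-COMMITTED", "B": "PREPARED"}) -> "COMMIT"
```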
3PC Trade-Offs
| Aspect | 2PC | 3PC |
|---|---|---|
| Message rounds | 2 (Prepare + Commit) | 3 (CanCommit + PreCommit + DoCommit) |
| Messages per txn (N participants) | 4N | 6N |
| Forced log writes | 3 (coord) + 2 (each participant) | 4 (coord) + 3 (each participant) |
| Blocking? | Yes (coordinator crash) | No (with reliable failure detector) |
| Latency | 2 round trips | 3 round trips |
| Network partitions | Blocks | Can violate safety — split-brain possible |
Google Spanner & TrueTime
Google Spanner is arguably the most ambitious distributed database ever built, and it uses 2PC as its distributed transaction protocol. How does it deal with the blocking problem?
Architecture Overview
Spanner combines several innovations:
- Paxos groups: Each shard is replicated across 5 data centers using Paxos consensus. The Paxos leader for each shard acts as the 2PC participant. If a leader fails, Paxos elects a new one in seconds — so the "participant" is effectively always available.
- 2PC over Paxos: The coordinator itself is a Paxos group. If the coordinator leader fails, a new coordinator leader is elected. This eliminates the single coordinator as a point of failure.
- TrueTime API: GPS and atomic clocks in every data center provide a global time API with bounded uncertainty: TT.now() returns an interval [earliest, latest] that is guaranteed to contain the true time. Typical uncertainty is 1–7 ms.
External Consistency via TrueTime
Spanner uses TrueTime to assign globally meaningful timestamps to transactions:
- After all participants vote YES (prepare), the coordinator picks a commit timestamp s ≥ TT.now().latest.
- Before sending the commit, the coordinator waits until TT.now().earliest > s (the so-called commit-wait).
- This guarantees that any transaction that starts after this one in real time will see a later timestamp, ensuring external consistency (equivalent to linearizability).
The commit-wait is typically 7–14 ms (twice the clock uncertainty), which adds latency but provides something no other large-scale system offers: externally consistent distributed transactions at global scale.
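A sketch of the commit-wait loop, with a hypothetical truetime_now() that returns an (earliest, latest) pair bounding the true time:

```python
import time

def commit_with_wait(truetime_now):
    """Pick a commit timestamp and wait out the clock uncertainty (sketch)."""
    _, latest = truetime_now()
    s = latest                       # commit timestamp: s >= TT.now().latest
    while True:
        earliest, _ = truetime_now()
        if earliest > s:             # true time is now provably past s
            return s                 # safe to make the commit visible
        time.sleep(0.001)            # commit-wait: roughly 2x the uncertainty
```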
Spanner's Answer to Blocking
Spanner does not eliminate 2PC's blocking — it makes it extremely unlikely:
- Coordinator crash → Paxos elects a new coordinator in ~10 seconds.
- Participant crash → Paxos elects a new shard leader with the prepared state already replicated.
- The "vulnerability window" where blocking could occur is reduced to the few seconds of Paxos leader election, rather than the hours or days of a crashed machine being replaced.
Performance Analysis
2PC is not a fast protocol. Let us quantify exactly how slow it is and where the time goes.
Latency Breakdown
| Phase | Network Round Trips | Forced Disk Writes | Description |
|---|---|---|---|
| Prepare | 1 (coord → participants → coord) | 1 coord + N participants | Vote collection |
| Commit | 1 (coord → participants → coord) | 1 coord + N participants | Decision + ACK |
| Total | 2 round trips | 2 + 2N forced writes | |
For a cross-data-center transaction with 50 ms RTT and 5 ms fsync:
- Network: 2 × 50 ms = 100 ms
- Disk (critical path): 2 × 5 ms (coordinator prepare + commit) = 10 ms
- Lock holding time: at least 110 ms (from prepare to commit acknowledgment)
Compare this to a local transaction that takes ~5 ms. The overhead is 20× or more.
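The arithmetic above as a tiny model; the 50 ms RTT and 5 ms fsync are the assumed inputs from this example, not measurements:

```python
rtt_ms, fsync_ms = 50, 5

prepare_phase = rtt_ms + fsync_ms     # round trip plus one forced write
commit_phase  = rtt_ms + fsync_ms     # round trip plus one forced write
lock_hold_ms  = prepare_phase + commit_phase

print(lock_hold_ms)                   # 110 ms of lock holding, vs ~5 ms locally
```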
Throughput Impact
The real cost is not just latency — it is the lock duration. For the entire 2 round trips + processing time, the participant holds exclusive locks on the affected rows. This means:
- Other transactions that need those rows are queued, reducing throughput.
- The longer the lock is held, the higher the probability of deadlocks with other distributed transactions.
- With N participants across data centers, the abort rate increases due to timeouts, further reducing effective throughput.
Common Optimizations
| Optimization | How It Helps | Trade-off |
|---|---|---|
| Presumed Abort | Coordinator does not force-write the ABORT decision. Participants that time out assume ABORT. Saves one forced write on the abort path. | Cannot distinguish crash from abort on recovery; must re-query. |
| Presumed Commit | After the commit point, the coordinator does not need ACKs before completing. Participants that miss the COMMIT will eventually ask. | Slightly more complex recovery. |
| Read-Only Optimization | If a participant has only performed reads (no writes), it votes READ-ONLY and releases locks immediately — no phase 2 needed. | Only applicable to read-only participants. |
| Last Agent Optimization | The coordinator sends the commit decision directly to the last participant, skipping the separate commit message. The last participant commits without waiting for the explicit commit. | Specific ordering requirement. |
| Group Commit | Batch multiple transaction log entries into a single disk write, amortizing fsync cost (see the sketch below). | Adds small latency for batching window. |
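Group commit is the easiest of these to sketch in isolation: callers queue their records and one fsync covers the whole batch. The batching window, file name, and API below are assumptions for illustration:

```python
import os
import queue
import threading

class GroupCommitLog:
    """Batch log records so one fsync covers many commits (sketch)."""

    def __init__(self, path="wal.log", window_s=0.002):
        self.f = open(path, "ab")
        self.q = queue.Queue()
        self.window_s = window_s
        threading.Thread(target=self._writer, daemon=True).start()

    def commit(self, record: str):
        done = threading.Event()
        self.q.put((record, done))
        done.wait()                       # returns only once the batch is durable

    def _writer(self):
        while True:
            batch = [self.q.get()]        # block for the first record
            try:
                # Collect whatever else arrives within the batching window.
                while True:
                    batch.append(self.q.get(timeout=self.window_s))
            except queue.Empty:
                pass
            for rec, _ in batch:
                self.f.write((rec + "\n").encode())
            self.f.flush()
            os.fsync(self.f.fileno())     # one fsync amortized over the batch
            for _, done in batch:
                done.set()
```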
2PC vs Alternatives
2PC is one solution to distributed transactions. Here is how it compares to the major alternatives:
2PC (Two-Phase Commit)
- Consistency: Strong (ACID across nodes)
- Availability: Low (blocking on failures)
- Latency: 2 round trips + lock hold time
- Complexity: Medium
- Use when: Strong consistency required, low-latency network, limited participants
Saga Pattern
- Consistency: Eventual (compensating transactions)
- Availability: High (no global locks)
- Latency: Sum of individual steps
- Complexity: High (compensating logic)
- Use when: Long-lived transactions, microservices, availability > consistency
TCC (Try-Confirm-Cancel)
- Consistency: Eventual (with reservation)
- Availability: High
- Latency: 2 phases but no global locks
- Complexity: High (3 APIs per service)
- Use when: Need reservation semantics, flexible consistency
Outbox + Event Sourcing
- Consistency: Eventual
- Availability: Very high
- Latency: Low (async propagation)
- Complexity: Medium
- Use when: Decoupled services, event-driven architecture
Detailed Comparison Table
| Property | 2PC | 3PC | Saga | Paxos Commit |
|---|---|---|---|---|
| Atomicity | Yes | Yes | Eventual | Yes |
| Blocking | Yes | No* | No | No |
| Network Partition Safe | Blocks | Unsafe | Safe | Safe |
| Message Complexity | O(N) | O(N) | O(N) | O(N²) |
| Round Trips | 2 | 3 | N steps | 2 |
| Lock Duration | Long | Longer | None (no global locks) | Long |
| Practical Use | Common | Rare | Common | Spanner-like |
* 3PC is non-blocking only with reliable failure detectors, which are impossible to build in asynchronous networks (FLP impossibility).
When to Use 2PC
Good Fit ✓
- Traditional XA transactions across multiple resources: Enterprise systems using Java EE / JTA with multiple database resources. The latency overhead is acceptable because the databases are on the same LAN.
- Cross-shard transactions (same data center): Systems like Spanner, CockroachDB, and YugabyteDB use 2PC for cross-shard writes within a tightly controlled infrastructure.
- Financial transfers between accounts in the same bank: When atomicity is legally required and all participants are controlled by the same organization.
- Schema migrations across shards: Coordinating DDL changes that must be atomic.
- Short-lived transactions with 2–3 participants: The fewer participants, the lower the probability of failure during the vulnerability window.
Bad Fit ✗
- Microservices across organizations: Different teams control different services. Holding locks across service boundaries is impractical — a slow downstream service blocks everyone.
- Long-running business processes: Booking a trip (flight + hotel + car) should not hold database locks for seconds or minutes. Use Sagas instead.
- High-throughput event processing: Systems processing millions of events per second cannot afford 2 round trips and lock holding for each event.
- Cross-region / high-latency networks: With 200 ms RTT, the lock holding time becomes 400+ ms — unacceptable for most workloads.
- Systems requiring high availability: 2PC's blocking nature means any participant or coordinator failure can halt progress. Use eventual consistency patterns.
Real-World Implementations
| System | 2PC Variant | Key Innovation |
|---|---|---|
| XA / JTA | Standard 2PC | Industry standard interface (javax.transaction). Supported by most RDBMS and message brokers. |
| Google Spanner | 2PC over Paxos | TrueTime for external consistency. Paxos eliminates single coordinator SPOF. |
| CockroachDB | Parallel Commits | Writes commit record in parallel with the commit phase, reducing latency to 1 round trip for the common case. |
| TiDB / TiKV | Percolator-style | Inspired by Google Percolator. Uses a primary key as the coordinator — no separate coordinator node. |
| YugabyteDB | 2PC with Raft | Each shard is a Raft group. Similar to Spanner's 2PC-over-Paxos approach. |
| MySQL (InnoDB + Binlog) | Internal 2PC | Coordinates between InnoDB storage engine and binary log for crash-safe replication. |
| PostgreSQL | PREPARE TRANSACTION | Supports XA-style prepared transactions. Manual resolution required after coordinator crash. |
Summary & Key Takeaways
| Concept | Key Point |
|---|---|
| What 2PC solves | Atomic commitment across multiple nodes — all commit or all abort. |
| Phase 1 (Prepare) | Coordinator asks participants to vote YES/NO. YES is an irrevocable promise to commit. |
| Phase 2 (Commit/Abort) | Coordinator writes decision, broadcasts to all. The force-write is the commit point. |
| Critical weakness | Blocking: coordinator failure after PREPARE leaves participants stuck with locks held. |
| 3PC improvement | Adds PRE-COMMIT phase to eliminate blocking, but unsafe under network partitions. |
| Spanner approach | 2PC over Paxos + TrueTime: replicated coordinator eliminates SPOF; TrueTime gives external consistency. |
| Performance cost | 2 round trips + 2+2N forced writes + lock holding for entire duration. |
| When to use | Strong atomicity required, low-latency network, short transactions, few participants. |
| When to avoid | Microservices, cross-region, long-running transactions, high throughput, high availability needs. |