High Level Design Series · Distributed Systems · Part 4

Two-Phase Commit (2PC)

When a single database transaction spans multiple machines — an order service debiting inventory, a payment service charging a credit card, and a shipping service reserving a slot — you need all of them to commit or all of them to roll back. There is no acceptable middle ground where payment succeeds but inventory is not deducted. The Two-Phase Commit protocol (2PC) is the classic solution to this atomic commitment problem in distributed systems. It is battle-tested, formally proven correct, and powers everything from traditional XA transactions to Google Spanner. It is also slow, blocking, and fragile in the face of certain failures. This post gives you the complete picture.

The Distributed Transaction Problem

On a single machine, atomicity is straightforward: the database writes to a write-ahead log (WAL), and either all changes are flushed or the entire transaction is rolled back from the log. The ACID guarantees hold because a single process controls the outcome.

In a distributed system, no single node controls all the data. Each participant holds a shard of the state, runs on separate hardware, and can fail independently. The fundamental challenge is:

Atomic Commitment Problem: Given N participants that each must decide to commit or abort a transaction, how do we ensure that either all commit or all abort — even when participants can fail and the network can lose messages?

Requirements for a Solution

  1. Agreement — All participants that reach a decision reach the same decision (commit or abort).
  2. Validity — If all participants vote YES and no failure occurs, the decision is COMMIT. If any participant votes NO, the decision is ABORT.
  3. Termination — Every non-failed participant eventually reaches a decision. (This is the one 2PC cannot fully guarantee — more on that later.)
  4. Durability — Once a participant commits, the changes survive crashes.

Why Not Just "Send Commit to Everyone"?

Consider a naïve approach: the coordinator sends COMMIT to all participants simultaneously. Participant A applies it and commits. Participant B hits a constraint violation on arrival and aborts. The message to C is lost in the network.

Result: A committed, B aborted, C is in an unknown state. The transaction is neither atomic nor consistent. We need a voting phase before the decision phase, and that is exactly what 2PC provides.
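This failure is easy to see in a few lines of code. The sketch below is purely illustrative (the `Participant` dictionary and `delivered` map are hypothetical, not from any real library): it models a one-shot COMMIT broadcast where one participant cannot commit and one message never arrives.

```python
def naive_broadcast_commit(participants, delivered):
    """Send COMMIT to everyone; apply it only where the message arrives
    and the participant is still able to commit."""
    states = {}
    for name, can_commit in participants.items():
        if not delivered.get(name, True):
            states[name] = "unknown"      # message lost: participant never hears
        elif can_commit:
            states[name] = "committed"
        else:
            states[name] = "aborted"      # e.g. constraint violation on arrival
    return states

# A can commit, B hits a constraint violation, the message to C is lost.
outcome = naive_broadcast_commit(
    {"A": True, "B": False, "C": True},
    delivered={"A": True, "B": True, "C": False},
)
print(outcome)  # {'A': 'committed', 'B': 'aborted', 'C': 'unknown'}
```

Three participants, three different final states: exactly the non-atomic outcome 2PC exists to prevent.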

The 2PC Protocol

The protocol has two roles: a Coordinator (also called the Transaction Manager, or TM) and multiple Participants (also called Resource Managers, or RMs). The protocol proceeds in two phases:

Phase 1: Prepare (Voting)

Step 1: The coordinator writes PREPARE to its own log and sends a PREPARE message to all participants.
Step 2: Each participant receives PREPARE and decides whether it can commit. It checks constraints, acquires locks, writes all changes to its WAL (but does not commit yet), and forces the log to durable storage.
Step 3: If the participant can commit, it responds YES (a vote to commit). If it cannot (constraint violation, resource exhaustion, timeout), it responds NO (a vote to abort) and can immediately roll back locally.
Step 4: The coordinator collects all votes. A YES vote is a promise: the participant guarantees it can commit the transaction at any point in the future, surviving crashes (because the changes are in the WAL).
The YES vote is irrevocable. Once a participant votes YES, it enters an uncertain or prepared state. It has surrendered its right to abort unilaterally. It must hold all locks and wait for the coordinator's decision. This is the fundamental tension of 2PC.

Phase 2: Commit / Abort (Decision)

Step 5: If all participants voted YES, the coordinator writes COMMIT to its log (force-write) and sends COMMIT to all participants.
Step 5 (alt): If any participant voted NO (or timed out), the coordinator writes ABORT to its log and sends ABORT to all participants.
Step 6: Each participant receives the decision, applies it (commit or rollback), writes the outcome to its log, releases locks, and sends an ACK to the coordinator.
Step 7: After receiving ACKs from all participants, the coordinator writes END to its log and garbage-collects the transaction state.
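Steps 1-7 can be sketched as a single-process simulation. This is a minimal illustration, not a real implementation: timeouts, retries, and durable force-written logs are omitted, and the "log" is just an in-memory list.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name, self.can_commit = name, can_commit
        self.log, self.state = [], "INIT"

    def on_prepare(self):
        if self.can_commit:
            self.log.append("YES")        # force-write the vote before replying
            self.state = "PREPARED"       # locks held until the decision arrives
            return "YES"
        self.log.append("NO")
        self.state = "ABORTED"            # a NO voter may roll back immediately
        return "NO"

    def on_decision(self, decision):
        self.log.append(decision)
        self.state = "COMMITTED" if decision == "COMMIT" else "ABORTED"
        return "ACK"

class Coordinator:
    def __init__(self, participants):
        self.participants, self.log = participants, []

    def run(self):
        self.log.append("PREPARE")                            # Step 1
        votes = [p.on_prepare() for p in self.participants]   # Steps 2-3
        decision = "COMMIT" if all(v == "YES" for v in votes) else "ABORT"
        self.log.append(decision)                             # Step 5: the commit point
        acks = [p.on_decision(decision) for p in self.participants]  # Step 6
        if len(acks) == len(self.participants):
            self.log.append("END")                            # Step 7
        return decision

coord = Coordinator([Participant("inventory"), Participant("payment")])
print(coord.run())  # COMMIT: both participants voted YES
```

Swap in `Participant("payment", can_commit=False)` and the same run produces ABORT everywhere, which is the whole point of the voting phase.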


Formal State Machines

Coordinator FSM:

  INIT (begin txn)
    → WAIT (sent PREPARE, collecting votes)
      → COMMIT (all YES, decision logged)
      or → ABORT (any NO / timeout, decision logged)

Participant FSM:

  INIT (working) → PREPARED (voted YES, holding locks ⚠) → COMMITTED (applied, locks released)
  INIT (working) → ABORTED (voted NO, rolled back)
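The two state machines can also be written as explicit transition tables, where any transition not in the table is a protocol bug. A minimal sketch (the table names and `step` helper are illustrative):

```python
COORDINATOR_FSM = {
    "INIT": {"WAIT"},                      # begin txn, send PREPARE
    "WAIT": {"COMMIT", "ABORT"},           # all YES, or any NO / timeout
    "COMMIT": set(), "ABORT": set(),       # terminal once the decision is logged
}
PARTICIPANT_FSM = {
    "INIT": {"PREPARED", "ABORTED"},       # vote YES, or vote NO and roll back
    "PREPARED": {"COMMITTED", "ABORTED"},  # only on the coordinator's decision
    "COMMITTED": set(), "ABORTED": set(),
}

def step(fsm, state, next_state):
    if next_state not in fsm[state]:
        raise ValueError(f"illegal transition {state} -> {next_state}")
    return next_state

s = step(PARTICIPANT_FSM, "INIT", "PREPARED")
s = step(PARTICIPANT_FSM, s, "COMMITTED")
print(s)  # COMMITTED
```

Note what the table encodes: from PREPARED a participant may still end up in either terminal state, but only the coordinator's decision picks which one.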

Write-Ahead Logging

Both coordinator and participants use force-write WAL at critical points to survive crashes. The log entries enable recovery:

| Role | Log Entry | When | Purpose on Recovery |
|---|---|---|---|
| Coordinator | PREPARE | Before sending PREPARE | Know that voting was initiated |
| Coordinator | COMMIT / ABORT | Before sending decision | Re-send decision to participants |
| Coordinator | END | After all ACKs | Safe to garbage-collect |
| Participant | YES / NO | Before sending vote | Know what was voted |
| Participant | COMMIT / ABORT | After receiving decision | Know final outcome after crash |
Interview tip: The force-write to the WAL at the coordinator's COMMIT decision is the commit point — the exact moment the distributed transaction becomes irrevocably committed. Everything before this point can still be aborted. Everything after must be committed. This is the single most important detail of 2PC.
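The recovery rules in the table reduce to a small decision function: after a crash, each node inspects the last relevant record in its WAL and acts accordingly. A hypothetical sketch (not any real database's API):

```python
def recover(role, log):
    """Given the records found in a node's WAL after a crash,
    return the recovery action."""
    if role == "coordinator":
        if "END" in log:
            return "done"                 # transaction fully finished before crash
        if "COMMIT" in log or "ABORT" in log:
            return "resend decision"      # decision survived: re-broadcast it
        return "abort"                    # crashed before the commit point: safe to abort
    else:  # participant
        if "COMMIT" in log or "ABORT" in log:
            return "apply logged outcome"
        if "YES" in log:
            return "ask coordinator"      # uncertain: must learn the outcome
        return "abort"                    # never voted YES: unilateral abort is safe

print(recover("participant", ["YES"]))      # ask coordinator
print(recover("coordinator", ["PREPARE"]))  # abort
```

The "ask coordinator" branch is where all the trouble lives: it is the only case where recovery depends on another node being reachable.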

Failure Mode Analysis

2PC handles many failure scenarios correctly. The problems arise in one specific window. Let us enumerate every case.

Participant Crashes Before Voting

⚠ Failure: Participant crashes before sending YES/NO

Effect: Coordinator times out waiting for this participant's vote. Since not all voted YES, coordinator decides ABORT and sends ABORT to all other participants.

Recovery: When the crashed participant restarts, it finds no YES record in its log, so it aborts locally. Handled correctly.

Participant Crashes After Voting YES

⚠ Failure: Participant crashes after voting YES

Effect: Coordinator may or may not have received the vote. If coordinator already has all YES votes, it commits. Otherwise, it aborts.

Recovery: When the crashed participant restarts, it reads its WAL, finds the YES record, and contacts the coordinator to learn the outcome. If the coordinator is up, it gets the answer and applies it. Handled correctly.

Coordinator Crashes Before Decision

🔴 Failure: Coordinator crashes after sending PREPARE but before writing COMMIT/ABORT

Effect: All participants that voted YES are now stuck in the PREPARED state. They are holding locks and cannot unilaterally decide — another participant might have voted NO, in which case the decision should be ABORT. But they do not know.

Recovery: Participants must wait for the coordinator to recover. If the coordinator never recovers, the transaction is blocked indefinitely. BLOCKING — the critical weakness of 2PC.

Coordinator Crashes After Decision

⚠ Failure: Coordinator crashes after logging COMMIT but before sending to all participants

Effect: Some participants received the decision, others did not. Those that did not are stuck waiting.

Recovery: When the coordinator restarts, it reads its WAL and re-sends the decision to all participants that have not ACKed. Handled correctly.

Network Partition

🔴 Failure: Network partition separates coordinator from participants

Effect: Identical to coordinator crash from the participants' perspective. If they are in the PREPARED state, they are blocked. They cannot distinguish "coordinator is slow" from "coordinator is dead" from "network partition."

Resolution: Participants can attempt to contact each other to learn the outcome (cooperative termination). If even one participant knows the decision, the others can learn it. But if no participant knows the decision (all voted YES, none received the final decision), the protocol is blocked. BLOCKING.

Coordinator + Participant Crash Simultaneously

🔴 Failure: Coordinator crashes after sending COMMIT to Participant A, then A also crashes

Effect: The only nodes that knew the decision (the coordinator and A) are both down. The remaining participants are in PREPARED state with no way to learn the outcome. They must hold locks and wait.

Recovery: The system is blocked until either the coordinator or A recovers. BLOCKING.

The fundamental weakness of 2PC: There exists a vulnerability window between the moment participants vote YES and the moment they learn the decision. If the coordinator fails in this window, participants are stuck. The protocol is blocking — it cannot guarantee termination in the presence of certain failure patterns.


The Blocking Problem in Depth

To truly understand why 2PC blocks, consider the information asymmetry in the PREPARED state.

The Uncertainty Principle

When a participant is in the PREPARED state (voted YES, waiting for the decision), it cannot commit unilaterally (another participant may have voted NO) and it cannot abort unilaterally (the coordinator may have already logged COMMIT). It must hold its locks and wait.

This is the only state in 2PC where a participant is uncertain. In INIT, it can abort. After COMMITTED or ABORTED, it knows the outcome. But in PREPARED, it is completely dependent on external information — the coordinator's decision.

Cooperative Termination Protocol

A partial mitigation: when a participant times out waiting for the coordinator, it contacts the other participants:

| Other Participant's State | What the Querying Participant Learns | Action |
|---|---|---|
| COMMITTED | Decision was COMMIT | Commit locally |
| ABORTED | Decision was ABORT | Abort locally |
| INIT (never voted) | Can safely abort (this participant will vote NO) | Both abort |
| PREPARED | No information — the other participant is also stuck | Still blocked |

If all reachable participants are in the PREPARED state, no progress can be made. This is the worst case: everyone voted YES, the coordinator crashed before logging its decision, and now every participant is holding locks, unable to do anything.
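The table above collapses into a short resolution function. An illustrative sketch of what a blocked PREPARED participant can conclude from the peer states it manages to collect:

```python
def cooperative_terminate(peer_states):
    """Decide what a blocked PREPARED participant can do, given the
    states of the peers it could reach."""
    if "COMMITTED" in peer_states:
        return "COMMIT"        # a peer saw the decision: it was COMMIT
    if "ABORTED" in peer_states:
        return "ABORT"         # decision was ABORT (or a peer voted NO)
    if "INIT" in peer_states:
        return "ABORT"         # that peer never voted YES, so COMMIT is
                               # impossible; everyone can safely abort
    return "BLOCKED"           # all reachable peers are PREPARED too

print(cooperative_terminate(["PREPARED", "INIT"]))      # ABORT
print(cooperative_terminate(["PREPARED", "PREPARED"]))  # BLOCKED — worst case
```

The last branch is the worst case from the text: every reachable peer is PREPARED, so no new information exists anywhere in the system.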

Real-World Impact

In production systems, the blocking problem typically manifests as in-doubt transactions that hold locks indefinitely, lock wait queues backing up behind the stuck rows, connection pools draining as blocked requests pile up, and operators forced to resolve in-doubt transactions by hand (a heuristic commit or abort, which risks breaking atomicity).

Three-Phase Commit (3PC)

In 1983, Dale Skeen proposed a modification to 2PC that eliminates the blocking problem under certain failure models: the Three-Phase Commit protocol (3PC). The key insight is to add an extra phase that propagates the coordinator's decision to participants before asking them to commit.

The Three Phases

Phase 1: CanCommit (Voting) — Identical to 2PC's prepare phase. Coordinator sends CAN-COMMIT?, participants respond YES or NO.

Phase 2: PreCommit — If all participants voted YES, the coordinator sends PRE-COMMIT. Participants acknowledge receipt. This is the crucial addition: it acts as a buffer between voting and committing. All participants now know that every other participant voted YES.

Phase 3: DoCommit — The coordinator sends DO-COMMIT. Participants commit and ACK.

Why 3PC is Non-Blocking

The magic of the PRE-COMMIT phase: if the coordinator crashes after sending PRE-COMMIT to some participants but not others, a recovery coordinator can safely decide. If any surviving participant has received PRE-COMMIT (or has committed), then every participant must have voted YES and none can have aborted, so the decision is COMMIT. If no surviving participant has seen PRE-COMMIT, then no participant can have committed yet, so it is safe to ABORT.

There is no longer a state where participants are uncertain and cannot determine the outcome from each other. The PRE-COMMIT phase ensures that before any participant commits, all participants know the decision.
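The recovery rule can be stated in a few lines. A sketch of the decision a recovery coordinator makes after polling the surviving participants (function name and state labels are illustrative):

```python
def three_pc_recover(surviving_states):
    """3PC recovery rule: decide based on the states of surviving
    participants polled by a recovery coordinator."""
    # If anyone reached PRE-COMMIT or COMMITTED, every participant voted
    # YES and none can have aborted: committing is safe.
    if any(s in ("PRECOMMIT", "COMMITTED") for s in surviving_states):
        return "COMMIT"
    # Otherwise no participant can have committed, because committing
    # requires passing through PRE-COMMIT first: aborting is safe.
    return "ABORT"

print(three_pc_recover(["PREPARED", "PRECOMMIT"]))  # COMMIT
print(three_pc_recover(["PREPARED", "PREPARED"]))   # ABORT — nobody is stuck
```

Contrast this with 2PC, where a room full of PREPARED participants yields no decision at all.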

3PC Trade-Offs

| Aspect | 2PC | 3PC |
|---|---|---|
| Message rounds | 2 (Prepare + Commit) | 3 (CanCommit + PreCommit + DoCommit) |
| Messages per txn (N participants) | 4N | 6N |
| Forced log writes | 3 (coord) + 2 (each participant) | 4 (coord) + 3 (each participant) |
| Blocking? | Yes (coordinator crash) | No (with reliable failure detector) |
| Latency | 2 round trips | 3 round trips |
| Network partitions | Blocks | Can violate safety — split-brain possible |
Critical caveat: 3PC is non-blocking only under the crash-stop failure model (nodes crash and stop; they do not later rejoin with stale state) and with a reliable failure detector. In the presence of network partitions, 3PC can produce inconsistency: one partition might commit while the other aborts. This is why 3PC is rarely used in practice — network partitions are common, and 3PC trades the blocking problem of 2PC for a potential safety violation, which is usually worse.

Google Spanner & TrueTime

Google Spanner is arguably the most ambitious distributed database ever built, and it uses 2PC as its distributed transaction protocol. How does it deal with the blocking problem?

Architecture Overview

Spanner combines several innovations: data is sharded across many machines, every shard is replicated by a Paxos group, distributed transactions run 2PC across those Paxos groups (so both coordinator and participants are replicated consensus groups rather than single machines), and the TrueTime API exposes clock uncertainty as a bounded interval.

External Consistency via TrueTime

Spanner uses TrueTime to assign globally meaningful timestamps to transactions:

  1. After all participants vote YES (prepare), the coordinator picks a commit timestamp s ≥ TT.now().latest.
  2. Before sending the commit, the coordinator waits until TT.now().earliest > s — the so-called commit-wait.
  3. This guarantees that any transaction that starts after this one in real time will see a later timestamp, ensuring external consistency (equivalent to linearizability).

The commit-wait is typically 7–14 ms (twice the clock uncertainty), which adds latency but provides something no other large-scale system offers: externally consistent distributed transactions at global scale.
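Commit-wait is simple enough to sketch against a fake TrueTime interface. The real API is internal to Google, so the `tt_now()` function and the 4 ms uncertainty below are assumptions for illustration: `tt_now()` returns an interval `(earliest, latest)` guaranteed to contain real time.

```python
import time

EPSILON = 0.004  # assumed clock uncertainty in seconds (~4 ms)

def tt_now():
    """Fake TrueTime: an interval guaranteed to contain real time."""
    t = time.monotonic()
    return (t - EPSILON, t + EPSILON)   # (earliest, latest)

def commit_with_wait():
    s = tt_now()[1]                     # commit timestamp s = TT.now().latest
    while tt_now()[0] <= s:             # commit-wait: block until
        time.sleep(0.0005)              # TT.now().earliest > s
    return s                            # only now is the commit released

start = time.monotonic()
commit_with_wait()
waited = time.monotonic() - start
print(f"commit-wait took ~{waited * 1000:.1f} ms")  # roughly 2 * EPSILON
```

The wait is about twice the uncertainty because the interval must slide entirely past s: the clock's earliest bound starts EPSILON behind real time and s starts EPSILON ahead of it.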

Spanner's Answer to Blocking

Spanner does not eliminate 2PC's blocking — it makes it extremely unlikely:

Interview tip: When asked "How does Spanner avoid the problems of 2PC?", the answer is: "It doesn't avoid 2PC — it wraps 2PC with Paxos so that both coordinator and participants are replicated consensus groups. This makes the single point of failure (coordinator crash) survivable. TrueTime provides the clock guarantees needed for external consistency."

Performance Analysis

2PC is not a fast protocol. Let us quantify exactly how slow it is and where the time goes.

Latency Breakdown

| Phase | Network Round Trips | Forced Disk Writes | Description |
|---|---|---|---|
| Prepare | 1 (coord → participants → coord) | 1 coord + N participants | Vote collection |
| Commit | 1 (coord → participants → coord) | 1 coord + N participants | Decision + ACK |
| Total | 2 round trips | 2 + 2N forced writes | |

For a cross-data-center transaction with 50 ms RTT and 5 ms fsync: the prepare round trip costs 50 ms plus a participant fsync and the coordinator's decision fsync (5 ms each), and the commit round trip costs another 50 ms plus a participant fsync, for roughly 115 ms end-to-end.

Compare this to a local transaction that takes ~5 ms. The overhead is 20× or more.
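The arithmetic behind those numbers, using the 50 ms RTT and 5 ms fsync assumptions stated above and treating participant fsyncs within a phase as parallel:

```python
RTT_MS, FSYNC_MS = 50.0, 5.0

# Prepare: round trip + participant fsync (vote) + coordinator fsync (decision)
prepare_ms = RTT_MS + FSYNC_MS + FSYNC_MS
# Commit: round trip + participant fsync (outcome)
commit_ms = RTT_MS + FSYNC_MS
total_ms = prepare_ms + commit_ms
print(f"2PC latency ~= {total_ms:.0f} ms")  # 115 ms vs ~5 ms local

# Locks are held for the whole transaction, so a single hot row caps out at:
print(f"hot-row throughput <= {1000 / total_ms:.1f} txn/s")
```

That last line previews the throughput section below: at 115 ms of lock hold time, a row touched by every transaction serializes the system to under ten transactions per second.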

Throughput Impact

The real cost is not just latency — it is the lock duration. For the entire 2 round trips + processing time, the participant holds exclusive locks on the affected rows. This means: a hot row touched by every transaction supports at most 1 / (lock hold time) transactions per second (about 9 txn/s at 115 ms), concurrent transactions on the same rows queue behind the locks, and a single slow or stuck participant throttles every transaction that shares data with it.

Common Optimizations

| Optimization | How It Helps | Trade-off |
|---|---|---|
| Presumed Abort | Coordinator does not force-write the ABORT decision. Participants that time out assume ABORT. Saves one forced write on the abort path. | Cannot distinguish crash from abort on recovery; must re-query. |
| Presumed Commit | After the commit point, the coordinator does not need ACKs before completing. Participants that miss the COMMIT will eventually ask. | Slightly more complex recovery. |
| Read-Only Optimization | If a participant has only performed reads (no writes), it votes READ-ONLY and releases locks immediately — no phase 2 needed. | Only applicable to read-only participants. |
| Last Agent Optimization | The coordinator sends the commit decision directly to the last participant, skipping the separate commit message. The last participant commits without waiting for the explicit commit. | Specific ordering requirement. |
| Group Commit | Batch multiple transaction log entries into a single disk write, amortizing fsync cost. | Adds small latency for batching window. |

2PC vs Alternatives

2PC is one solution to distributed transactions. Here is how it compares to the major alternatives:

2PC (Two-Phase Commit)

  • Consistency: Strong (ACID across nodes)
  • Availability: Low (blocking on failures)
  • Latency: 2 round trips + lock hold time
  • Complexity: Medium
  • Use when: Strong consistency required, low-latency network, limited participants

Saga Pattern

  • Consistency: Eventual (compensating transactions)
  • Availability: High (no global locks)
  • Latency: Sum of individual steps
  • Complexity: High (compensating logic)
  • Use when: Long-lived transactions, microservices, availability > consistency

TCC (Try-Confirm-Cancel)

  • Consistency: Eventual (with reservation)
  • Availability: High
  • Latency: 2 phases but no global locks
  • Complexity: High (3 APIs per service)
  • Use when: Need reservation semantics, flexible consistency

Outbox + Event Sourcing

  • Consistency: Eventual
  • Availability: Very high
  • Latency: Low (async propagation)
  • Complexity: Medium
  • Use when: Decoupled services, event-driven architecture

Detailed Comparison Table

| Property | 2PC | 3PC | Saga | Paxos Commit |
|---|---|---|---|---|
| Atomicity | Yes | Yes | Eventual | Yes |
| Blocking | Yes | No* | No | No |
| Network Partition Safe | Blocks | Unsafe | Safe | Safe |
| Message Complexity | O(N) | O(N) | O(N) | O(N²) |
| Round Trips | 2 | 3 | N steps | 2 |
| Lock Duration | Long | Longer | None (global) | Long |
| Practical Use | Common | Rare | Common | Spanner-like |

* 3PC is non-blocking only with reliable failure detectors, which are impossible to build in asynchronous networks (FLP impossibility).

When to Use 2PC

Good Fit ✓

  • Strong atomicity is legally or business-critical (payments, ledgers, inventory)
  • All participants sit on the same low-latency network (single data center)
  • Transactions are short-lived with few participants
  • Participants support a prepared-transaction (XA-style) interface

Bad Fit ✗

  • Microservices and loosely coupled services
  • Cross-region or otherwise high-latency networks
  • Long-running transactions
  • High-throughput or high-availability requirements

Interview decision framework: Ask three questions: (1) Is strong atomicity legally or business-critical? (2) Are all participants on the same low-latency network? (3) Is the transaction short-lived with few participants? If all three are YES → 2PC is appropriate. If any is NO → consider Sagas, TCC, or event-driven patterns.

Real-World Implementations

| System | 2PC Variant | Key Innovation |
|---|---|---|
| XA / JTA | Standard 2PC | Industry standard interface (javax.transaction). Supported by most RDBMS and message brokers. |
| Google Spanner | 2PC over Paxos | TrueTime for external consistency. Paxos eliminates single coordinator SPOF. |
| CockroachDB | Parallel Commits | Writes commit record in parallel with the commit phase, reducing latency to 1 round trip for the common case. |
| TiDB / TiKV | Percolator-style | Inspired by Google Percolator. Uses a primary key as the coordinator — no separate coordinator node. |
| YugabyteDB | 2PC with Raft | Each shard is a Raft group. Similar to Spanner's 2PC-over-Paxos approach. |
| MySQL (InnoDB + Binlog) | Internal 2PC | Coordinates between InnoDB storage engine and binary log for crash-safe replication. |
| PostgreSQL | PREPARE TRANSACTION | Supports XA-style prepared transactions. Manual resolution required after coordinator crash. |

Summary & Key Takeaways

| Concept | Key Point |
|---|---|
| What 2PC solves | Atomic commitment across multiple nodes — all commit or all abort. |
| Phase 1 (Prepare) | Coordinator asks participants to vote YES/NO. YES is an irrevocable promise to commit. |
| Phase 2 (Commit/Abort) | Coordinator writes decision, broadcasts to all. The force-write is the commit point. |
| Critical weakness | Blocking: coordinator failure after PREPARE leaves participants stuck with locks held. |
| 3PC improvement | Adds PRE-COMMIT phase to eliminate blocking, but unsafe under network partitions. |
| Spanner approach | 2PC over Paxos + TrueTime: replicated coordinator eliminates SPOF; TrueTime gives external consistency. |
| Performance cost | 2 round trips + 2+2N forced writes + lock holding for entire duration. |
| When to use | Strong atomicity required, low-latency network, short transactions, few participants. |
| When to avoid | Microservices, cross-region, long-running transactions, high throughput, high availability needs. |
The bottom line: 2PC is a necessary evil for strong distributed atomicity. It is correct but blocking, slow but deterministic. The trend in modern systems is to either (a) wrap 2PC with consensus (Spanner/CockroachDB) to make it resilient, or (b) avoid it entirely with Sagas and eventual consistency for better availability. Know both approaches and when each is appropriate — that is what distinguishes a good distributed systems engineer.