Saga Pattern
Microservices are great for team autonomy and independent deployment, but they introduce a hard problem: how do you manage transactions that span multiple services? A monolithic application can wrap everything in a single ACID transaction, but in a distributed system where each service owns its own database, there is no global transaction coordinator. The Two-Phase Commit (2PC) protocol solves this theoretically, but it is blocking, fragile, and doesn't scale well across autonomous services. Enter the Saga pattern — a sequence of local transactions coordinated by events or a central orchestrator, with compensating transactions that undo work if something fails partway through.
The Distributed Transaction Problem
Consider an e-commerce order flow. When a customer clicks "Place Order," several things must happen:
- Order Service creates the order record.
- Payment Service charges the customer's credit card.
- Inventory Service reserves the items from stock.
- Shipping Service schedules the delivery.
In a monolith, you would wrap all four operations in a single database transaction. If the shipping step fails, the entire transaction rolls back — the payment is never charged, the inventory is never reserved, the order is never created. Simple and safe.
But in a microservices architecture, each service has its own database. There is no shared transaction log. You cannot do a BEGIN TRANSACTION that spans four different databases owned by four different services. The fundamental reasons are:
- No shared ACID boundary: Each service uses its own database (possibly even different database engines — PostgreSQL here, MongoDB there). There is no global transaction manager.
- Network unreliability: Services communicate over the network. Messages can be delayed, duplicated, or lost. A coordinator cannot guarantee it can reach all participants simultaneously.
- Autonomy violation: Locking resources across services for the duration of a distributed transaction defeats the purpose of microservices. The Payment Service shouldn't be blocked because the Inventory Service is slow.
- Scalability: 2PC requires synchronous coordination. At high throughput, this becomes a bottleneck.
Why Not Just Use 2PC?
Two-Phase Commit (covered in the previous post) guarantees atomicity across distributed participants, but it has critical drawbacks in a microservices context:
| 2PC Drawback | Impact on Microservices |
|---|---|
| Blocking protocol | All participants hold locks until the coordinator commits. A slow Payment Service blocks the Inventory Service and Shipping Service. |
| Single point of failure | If the coordinator crashes between prepare and commit, all participants are stuck in an uncertain state, holding locks indefinitely. |
| Tight coupling | Every participant must implement the 2PC protocol. Adding a new service to the transaction requires coordinating with the transaction manager. |
| Latency | Two round-trips (prepare + commit) across all participants. Latency ≈ 2 × the slowest participant's response time, because each phase waits for every participant. |
| Not supported across heterogeneous systems | If Order uses PostgreSQL and Payment uses Stripe's API, there is no XA-compatible interface for Stripe. |
The Saga pattern was introduced by Hector Garcia-Molina and Kenneth Salem in their 1987 paper as an alternative to long-lived transactions. The core idea: break a long-lived transaction into a sequence of shorter transactions, each of which can be compensated if a later step fails.
Saga Pattern: Core Concepts
Definition
A Saga is a sequence of local transactions T1, T2, ..., Tn where:
- Each Ti updates the local database of service Si and publishes an event or message to trigger the next step.
- Each Ti has a corresponding compensating transaction Ci that semantically undoes the effect of Ti.
- If all transactions T1...Tn succeed, the saga completes successfully.
- If any transaction Tj fails, the saga executes compensating transactions Cj-1, Cj-2, ..., C1 in reverse order.
For example, if T3 (Reserve Stock) fails, the saga runs C2 (refund the payment) and then C1 (cancel the order).
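The definition can be sketched as a small in-memory runner. This is only an illustration under assumptions: `runSaga`, the step names, and the no-op actions are hypothetical, and a real saga would persist its progress and call remote services. Here T3 (reserveStock) fails, so C2 and C1 run in reverse order.

```javascript
// Minimal in-memory saga runner (illustrative, not a production coordinator).
// Each step has an `action` (Ti) and an optional `compensation` (Ci).
function runSaga(steps) {
  const log = [];
  const completed = [];
  for (const step of steps) {
    try {
      step.action();
      log.push(`T:${step.name}`);
      completed.push(step);
    } catch (err) {
      log.push(`FAILED:${step.name}`);
      // Undo the already-committed steps in reverse order: Cj-1, ..., C1.
      for (const done of completed.reverse()) {
        if (done.compensation) {
          done.compensation();
          log.push(`C:${done.name}`);
        }
      }
      return { status: 'COMPENSATED', log };
    }
  }
  return { status: 'COMPLETED', log };
}

const noop = () => {};
const steps = [
  { name: 'createOrder', action: noop, compensation: noop },
  { name: 'chargePayment', action: noop, compensation: noop },
  { name: 'reserveStock', action: () => { throw new Error('out of stock'); }, compensation: noop },
  { name: 'scheduleShipment', action: noop },
];

const result = runSaga(steps);
console.log(result.status, result.log.join(' -> '));
```

The failed step itself is not compensated, because its local transaction never committed.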
Compensating Transactions
A compensating transaction is not the same as a database rollback. It is a new, forward-moving transaction that semantically undoes the effect of a prior transaction. This is a crucial distinction:
| Original Transaction (Ti) | Compensating Transaction (Ci) | Why Not a Simple Rollback? |
|---|---|---|
| Create order (status: PENDING) | Update order status to CANCELLED | Order may have been visible to user; audit trail needed |
| Charge credit card $99.99 | Issue refund of $99.99 | Payment already left the system; can't "un-charge" |
| Reserve 2 units of SKU-1234 | Release 2 units of SKU-1234 | Other orders may have seen the updated stock count |
| Send shipping notification email | Send cancellation email | Email is already sent; can't unsend |
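As a rough illustration of "forward-moving" compensation, the sketch below models the payment row of the table with an append-only ledger. The `ledger`, `charge`, `refund`, and `balance` names are hypothetical, and amounts are in cents to avoid floating-point rounding. The refund adds a new entry instead of deleting the charge, so the audit trail stays intact.

```javascript
// Hypothetical append-only payment ledger: entries are never deleted.
const ledger = [];

function charge(orderId, amountCents) {
  ledger.push({ type: 'CHARGE', orderId, amountCents });
}

// Compensation for charge(): a new REFUND entry, not a deletion.
// Idempotent: a second call for the same order adds nothing.
function refund(orderId) {
  const alreadyRefunded = ledger.some(e => e.type === 'REFUND' && e.orderId === orderId);
  const original = ledger.find(e => e.type === 'CHARGE' && e.orderId === orderId);
  if (alreadyRefunded || !original) return false;
  ledger.push({ type: 'REFUND', orderId, amountCents: -original.amountCents });
  return true;
}

function balance(orderId) {
  return ledger
    .filter(e => e.orderId === orderId)
    .reduce((sum, e) => sum + e.amountCents, 0);
}

charge('ORD-001', 9999);
const firstRefund = refund('ORD-001');
const duplicateRefund = refund('ORD-001'); // duplicate delivery: no effect
console.log(ledger.length, balance('ORD-001'));
```

Both the charge and the refund remain visible afterwards, which is exactly what a database rollback would not give you.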
Types of Saga Transactions
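Not all steps in a saga are created equal. Saga steps are commonly classified into three categories, a taxonomy that comes from later work on sagas (for example, Chris Richardson's Microservices Patterns) rather than from the original 1987 paper: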
| Type | Definition | Example |
|---|---|---|
| Compensatable | Can be undone by a compensating transaction. These are the normal saga steps. | Reserve inventory (can release), charge payment (can refund) |
| Pivot | The point of no return. If the pivot succeeds, the saga will run to completion. If it fails, compensation begins. | Charge credit card, in a saga designed so that every later step is retriable (if the charge succeeds, we commit to the order) |
| Retriable | Steps after the pivot that are guaranteed to eventually succeed (with retries). They never need compensation because they always complete. | Send confirmation email (can always be retried), update analytics |
The ordering is always: [Compensatable*] → Pivot → [Retriable*]. You design the saga so that once the pivot transaction succeeds, all remaining steps are retriable — they don't need compensations because they will eventually succeed.
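One way to enforce this ordering is a small design-time check over the saga definition. The sketch below is an assumption-laden example: `validateSagaOrder`, the `type` field, and the sample step lists are hypothetical and not part of any framework.

```javascript
// Checks that step types follow: compensatable* -> pivot (at most one) -> retriable*.
const PHASE = { compensatable: 0, pivot: 1, retriable: 2 };

function validateSagaOrder(steps) {
  let phase = 0;
  let pivots = 0;
  for (const step of steps) {
    const p = PHASE[step.type];
    if (p === undefined) {
      return { ok: false, reason: `unknown type: ${step.type}` };
    }
    if (p < phase) {
      return { ok: false, reason: `${step.name} (${step.type}) comes after a later phase` };
    }
    if (step.type === 'pivot') {
      pivots += 1;
      if (pivots > 1) return { ok: false, reason: 'more than one pivot' };
    }
    phase = p;
  }
  return { ok: true };
}

// Hypothetical variant of the order saga in which payment is the pivot.
const wellFormed = validateSagaOrder([
  { name: 'createOrder', type: 'compensatable' },
  { name: 'reserveStock', type: 'compensatable' },
  { name: 'chargePayment', type: 'pivot' },
  { name: 'sendConfirmationEmail', type: 'retriable' },
]);

const malformed = validateSagaOrder([
  { name: 'createOrder', type: 'compensatable' },
  { name: 'sendConfirmationEmail', type: 'retriable' },
  { name: 'reserveStock', type: 'compensatable' },
]);

console.log(wellFormed, malformed);
```

Running a check like this in a unit test catches a compensatable step that was accidentally placed after the pivot.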
Choreography-Based Saga
In a choreography-based saga, there is no central coordinator. Each service listens for events, performs its local transaction, and publishes a new event. The saga emerges from the interaction of independently reacting services — like dancers in a ballet who each know their part without a choreographer directing them in real time.
How It Works
- Order Service creates an order (status: PENDING) and publishes OrderCreated.
- Payment Service listens for OrderCreated, charges the card, publishes PaymentCompleted.
- Inventory Service listens for PaymentCompleted, reserves stock, publishes StockReserved.
- Shipping Service listens for StockReserved, schedules delivery, publishes ShipmentScheduled.
- Order Service listens for ShipmentScheduled, updates order to CONFIRMED.
Failure & Compensation
If the Shipping Service fails (e.g., address is unreachable):
- Shipping Service publishes ShippingFailed.
- Inventory Service listens for ShippingFailed, releases reserved stock, publishes StockReleased.
- Payment Service listens for StockReleased, refunds the charge, publishes PaymentRefunded.
- Order Service listens for PaymentRefunded, updates order to CANCELLED.
Code Example: Choreography with Events
// Payment Service: listens for OrderCreated.
// Assumes `db`, `eventBus`, and a configured stripe-node client `stripe` are in scope.
async function onOrderCreated(event) {
  // Stripe expects `amount` in the smallest currency unit (cents for USD)
  const { orderId, customerId, amount } = event.payload;

  // Idempotency: skip if a payment was already recorded for this order
  const existing = await db.payments.findOne({ orderId });
  if (existing) return;

  let charge;
  try {
    charge = await stripe.charges.create(
      { amount, currency: 'usd', customer: customerId },
      // stripe-node takes the idempotency key as a request option, not a parameter
      { idempotencyKey: `order-${orderId}` }
    );
  } catch (err) {
    // Only a failed charge is reported as PaymentFailed
    await eventBus.publish('PaymentFailed', {
      orderId, reason: err.message
    });
    return;
  }

  // If the insert or publish below throws, the event is redelivered and the
  // same idempotency key makes Stripe return the original charge.
  await db.payments.insert({
    orderId, chargeId: charge.id, amount, status: 'COMPLETED'
  });
  await eventBus.publish('PaymentCompleted', {
    orderId, chargeId: charge.id, amount
  });
}

// Compensating handler: listens for StockReleased (during rollback)
async function onStockReleased(event) {
  const { orderId } = event.payload;
  const payment = await db.payments.findOne({ orderId });
  // Nothing to undo, or already undone (idempotent)
  if (!payment || payment.status === 'REFUNDED') return;
  await stripe.refunds.create({ charge: payment.chargeId });
  await db.payments.update(
    { orderId },
    { $set: { status: 'REFUNDED' } }
  );
  await eventBus.publish('PaymentRefunded', { orderId });
}
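To see the whole event chain in one place, the sketch below wires simplified versions of all four services to a synchronous in-memory bus. Everything here is an assumption for illustration: `createBus`, `simulate`, and the `addressReachable` flag are hypothetical, and a real system would use a message broker with asynchronous, idempotent handlers.

```javascript
// Synchronous in-memory event bus: a stand-in for a real broker.
function createBus() {
  const handlers = {};
  const trace = [];
  return {
    trace,
    subscribe(type, fn) {
      handlers[type] = handlers[type] || [];
      handlers[type].push(fn);
    },
    publish(type, payload) {
      trace.push(type);
      (handlers[type] || []).forEach(fn => fn(payload));
    },
  };
}

function simulate(addressReachable) {
  const bus = createBus();
  const state = { order: 'PENDING', payment: null, stock: null, shipment: null };

  // Forward path
  bus.subscribe('OrderCreated', e => { state.payment = 'CHARGED'; bus.publish('PaymentCompleted', e); });
  bus.subscribe('PaymentCompleted', e => { state.stock = 'RESERVED'; bus.publish('StockReserved', e); });
  bus.subscribe('StockReserved', e => {
    if (addressReachable) {
      state.shipment = 'SCHEDULED';
      bus.publish('ShipmentScheduled', e);
    } else {
      state.shipment = 'FAILED';
      bus.publish('ShippingFailed', e);
    }
  });
  bus.subscribe('ShipmentScheduled', () => { state.order = 'CONFIRMED'; });

  // Compensation path, in reverse order of the forward path
  bus.subscribe('ShippingFailed', e => { state.stock = 'RELEASED'; bus.publish('StockReleased', e); });
  bus.subscribe('StockReleased', e => { state.payment = 'REFUNDED'; bus.publish('PaymentRefunded', e); });
  bus.subscribe('PaymentRefunded', () => { state.order = 'CANCELLED'; });

  bus.publish('OrderCreated', { orderId: 'ORD-001' });
  return { state, trace: bus.trace };
}

const happy = simulate(true);
const failed = simulate(false);
console.log(happy.trace.join(' -> '));
console.log(failed.trace.join(' -> '));
```

Note that no component holds the full flow: it only becomes visible by reading the trace, which is the debugging cost of choreography.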
Choreography: Pros & Cons
| Pros | Cons |
|---|---|
| Simple — no central coordinator to build/maintain | Hard to understand the full saga flow (spread across services) |
| Loosely coupled — services only know about events | Cyclic dependencies possible if services listen to each other |
| Easy to add new services that react to events | No single place to see saga status — debugging is difficult |
| No single point of failure (no coordinator) | Risk of "event spaghetti" as sagas grow in complexity |
| Good for simple, linear sagas (3-4 steps) | Difficult to implement complex business rules and branching |
Orchestration-Based Saga
In an orchestration-based saga, a central Saga Orchestrator (sometimes called the Saga Execution Coordinator or SEC) directs the saga. It tells each participant what to do, waits for the response, and decides the next step. Think of it as a conductor directing an orchestra — each musician (service) plays their part when told.
How It Works
- The Orchestrator sends a CreateOrder command to the Order Service.
- Order Service creates the order, replies with success.
- Orchestrator sends ChargePayment command to the Payment Service.
- Payment Service charges the card, replies with success.
- Orchestrator sends ReserveStock command to the Inventory Service.
- Inventory Service reserves stock, replies with success.
- Orchestrator sends ScheduleShipment command to the Shipping Service.
- Shipping Service schedules delivery, replies with success.
- Orchestrator marks the saga as COMPLETED.
Failure & Compensation
If the Shipping Service fails:
- Shipping Service replies with failure.
- Orchestrator sends ReleaseStock to Inventory Service.
- Orchestrator sends RefundPayment to Payment Service.
- Orchestrator sends CancelOrder to Order Service.
- Orchestrator marks the saga as COMPENSATED.
Saga Execution Coordinator (SEC)
The SEC is the heart of an orchestration-based saga. It is a stateful component that:
- Persists saga state: Stores the current step, the status of each participant, and the saga's overall state in a durable log (database or event store).
- Manages transitions: Implements a state machine — for each step, it knows what command to send next on success, and what compensating command to send on failure.
- Handles retries: If a participant doesn't respond within a timeout, the SEC retries the command (participants must be idempotent).
- Manages concurrent sagas: Each saga instance has a unique ID. The SEC can run thousands of sagas concurrently.
// Saga Orchestrator: state machine definition
const orderSagaDefinition = {
  name: 'OrderSaga',
  steps: [
    {
      name: 'createOrder',
      action: { service: 'order', command: 'CreateOrder' },
      compensation: { service: 'order', command: 'CancelOrder' },
    },
    {
      name: 'chargePayment',
      action: { service: 'payment', command: 'ChargePayment' },
      compensation: { service: 'payment', command: 'RefundPayment' },
    },
    {
      name: 'reserveStock',
      action: { service: 'inventory', command: 'ReserveStock' },
      compensation: { service: 'inventory', command: 'ReleaseStock' },
    },
    {
      name: 'scheduleShipment',
      action: { service: 'shipping', command: 'ScheduleShipment' },
      // No compensation: this is the last step; if it fails, we compensate prior steps
    },
  ],
};
class SagaOrchestrator {
  // `store` persists saga state. `sendCommand`, `sendCommandWithRetry`, and
  // `generateId` are assumed to be implemented elsewhere (e.g., on a message broker).
  constructor(store) {
    this.store = store;
  }

  async execute(sagaDef, payload) {
    const sagaId = generateId();
    const saga = await this.store.create({
      id: sagaId, definition: sagaDef.name,
      currentStep: 0, status: 'RUNNING', payload,
      completedSteps: [],
    });
    for (let i = 0; i < sagaDef.steps.length; i++) {
      const step = sagaDef.steps[i];
      let result;
      try {
        result = await this.sendCommand(
          step.action.service, step.action.command, { sagaId, ...payload }
        );
      } catch (err) {
        // Step i failed: record it, then compensate steps i-1 down to 0
        await this.store.update(sagaId, { status: 'COMPENSATING', failedStep: i });
        await this.compensate(sagaDef, saga, i - 1);
        return;
      }
      // Persist progress outside the try block so that a store error is not
      // mistaken for a failed step
      saga.completedSteps.push({ step: i, result });
      await this.store.update(sagaId, { currentStep: i + 1, completedSteps: saga.completedSteps });
    }
    await this.store.update(sagaId, { status: 'COMPLETED' });
  }

  async compensate(sagaDef, saga, fromStep) {
    // Walk backwards through the completed steps
    for (let i = fromStep; i >= 0; i--) {
      const step = sagaDef.steps[i];
      if (step.compensation) {
        await this.sendCommandWithRetry(
          step.compensation.service, step.compensation.command,
          { sagaId: saga.id, ...saga.payload }
        );
      }
    }
    await this.store.update(saga.id, { status: 'COMPENSATED' });
  }
}
Orchestration: Pros & Cons
| Pros | Cons |
|---|---|
| Easy to understand — saga flow is explicit in the orchestrator | Central coordinator is a potential single point of failure |
| No cyclic dependencies between services | More coupling — orchestrator must know about all participants |
| Easy to implement complex business rules and branching | Risk of centralizing too much logic in the orchestrator |
| Single place to observe saga status and debug | Orchestrator code can become a "god class" if not carefully designed |
| Good for complex sagas with many steps and branches | Additional infrastructure (saga store, message broker) required |
Choreography vs Orchestration
| Aspect | Choreography | Orchestration |
|---|---|---|
| Coordination | Decentralized — event-driven | Centralized — command-driven |
| Coupling | Loose (services know events, not each other) | Tighter (orchestrator knows all participants) |
| Visibility | Hard to trace — saga spread across services | Easy — orchestrator has full saga state |
| Complexity | Simple for linear sagas; messy for branching | Handles complex branching well |
| Failure point | No SPOF (no coordinator) | Orchestrator is a SPOF (mitigated by HA deployment) |
| Testing | Integration tests across services | Unit test the orchestrator's state machine |
| Adding steps | Add a new listener — minimal changes | Modify orchestrator definition |
| Best for | 2-4 step linear sagas | 5+ step sagas with branching/conditions |
E-Commerce Saga: Complete Example
Let's walk through a complete e-commerce order saga with all the details — the happy path, the failure path, the state transitions, and the edge cases.
Services Involved
| Service | Local Transaction (Ti) | Compensating Transaction (Ci) | Events Published |
|---|---|---|---|
| Order | Create order (PENDING) | Set order to CANCELLED | OrderCreated / OrderCancelled |
| Payment | Charge credit card | Refund credit card | PaymentCompleted / PaymentRefunded |
| Inventory | Reserve stock (decrement available) | Release stock (increment available) | StockReserved / StockReleased |
| Shipping | Schedule shipment | Cancel shipment (if not yet dispatched) | ShipmentScheduled / ShippingFailed |
Happy Path: State Transitions
/* Saga State Machine: Happy Path */
Order:         PENDING ────────────────────────────────────────→ CONFIRMED
Payment:           PENDING → CHARGED ─────────────────────────
Inventory:                     AVAILABLE → RESERVED ──────────
Shipping:                                              → SCHEDULED
Timeline:  ────T1────────────T2────────────T3────────────T4──────→ DONE
               │             │             │             │
         OrderCreated   PaymentDone  StockReserved ShipScheduled
Failure Path: Shipping Fails
/* Saga State Machine: Failure at T4 (Shipping) */
Order:         PENDING ──────────────────────────────────────────────────────→ CANCELLED
Payment:       PENDING → CHARGED ──────────────────────────────────→ REFUNDED
Inventory:                         RESERVED ─────────────→ RELEASED
Shipping:                                    ✗ FAILED
Timeline:  ────T1────────T2────────T3────────T4 FAIL───────C3────────C2────────C1────→ COMPENSATED
               │         │         │         │             │         │         │
            Created   Charged  Reserved     Fail       Released  Refunded  Cancelled
Edge Cases to Handle
- Payment fails: Only Order needs compensation (cancel the order). No need to compensate inventory or shipping — they haven't executed yet.
- Compensation fails: Retry with exponential backoff. Compensating transactions must be idempotent. If all retries fail, flag for manual intervention (dead letter queue).
- Duplicate events: Due to at-least-once delivery, a service might receive PaymentCompleted twice. Each handler must check whether it has already processed the event (using orderId + event type as an idempotency key).
- Out-of-order events: StockReleased may arrive before StockReserved is processed. Services should handle this by checking the current state before acting.
- Timeout: If the Inventory Service doesn't respond within 30 seconds, the orchestrator assumes failure and begins compensation. Because the reservation may have succeeded despite the missing reply, the orchestrator should also send ReleaseStock, which must be a safe no-op when nothing was reserved.
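The "compensation fails" case above can be sketched as a retry loop with exponential backoff and a dead-letter list. The names `compensateWithRetry`, `deadLetters`, and the option defaults are hypothetical, and the delays are kept tiny so the example finishes quickly; production values would be much larger and usually include jitter.

```javascript
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Retries `compensate` with exponential backoff. After maxAttempts the saga
// is parked in `deadLetters` so that a human can resolve it.
async function compensateWithRetry(compensate, sagaId, deadLetters,
    { maxAttempts = 4, baseDelayMs = 10 } = {}) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await compensate(sagaId);
      return { ok: true, attempts: attempt };
    } catch (err) {
      if (attempt === maxAttempts) {
        deadLetters.push({ sagaId, error: err.message });
        return { ok: false, attempts: attempt };
      }
      // Delay doubles on every failed attempt: base, 2x base, 4x base, ...
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
}

async function demo() {
  let calls = 0;
  // Fails twice, then succeeds on the third call
  const flakyRefund = async () => { if (++calls < 3) throw new Error('gateway timeout'); };
  const alwaysFails = async () => { throw new Error('refund rejected'); };
  const deadLetters = [];
  const recovered = await compensateWithRetry(flakyRefund, 'saga-1', deadLetters, { baseDelayMs: 1 });
  const parked = await compensateWithRetry(alwaysFails, 'saga-2', deadLetters, { maxAttempts: 3, baseDelayMs: 1 });
  return { recovered, parked, deadLetters };
}

const demoResult = demo();
demoResult.then(r => console.log(JSON.stringify(r)));
```

Retrying is only safe because the compensation is assumed to be idempotent: a refund that succeeded but whose reply was lost will be attempted again.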
Isolation Challenges
One of the biggest trade-offs of the Saga pattern is the lack of isolation. In a traditional ACID transaction, the "I" (Isolation) guarantees that concurrent transactions don't interfere with each other. A saga has no such guarantee because intermediate states are visible to other transactions.
Anomalies Without Isolation
| Anomaly | Description | E-Commerce Example |
|---|---|---|
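| Lost updates | One saga overwrites the update of another without seeing it. | Two sagas both read inventory = 10 and each reserves 8 items by writing back 10 − 8 = 2. Final inventory = 2, although 16 units were promised against 10 in stock; the second reservation should have been rejected. |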
| Dirty reads | A saga reads data that is later compensated (rolled back). | Saga B reads order as CONFIRMED (during Saga A's execution). Saga A later fails and compensates to CANCELLED. Saga B acted on stale/wrong state. |
| Fuzzy / non-repeatable reads | A saga reads the same data twice and gets different values because another saga modified it. | Inventory Service reads stock = 10, then moments later reads stock = 3 because another saga reserved 7 units in between. |
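The lost-update row can be reproduced in a few lines. This is a simplified single-process simulation with hypothetical names (`reserveNaive`, `reserveAtomic`); in a real database the guarded version would typically be a single conditional UPDATE statement rather than in-memory code.

```javascript
// Naive reservation: decide from a stale snapshot, then write an absolute value.
function reserveNaive(snapshot, qty) {
  return snapshot >= qty ? { ok: true, newStock: snapshot - qty } : { ok: false };
}

let stock = 10;
const readA = stock;            // saga A reads 10
const readB = stock;            // saga B reads 10 before A writes
const a = reserveNaive(readA, 8);
stock = a.newStock;             // A writes 2
const b = reserveNaive(readB, 8);
stock = b.newStock;             // B overwrites with 2: A's update is lost
const naiveStock = stock;       // both reservations "succeeded"

// Guarded alternative: check and decrement against the current value in one step.
function reserveAtomic(inventory, qty) {
  if (inventory.stock < qty) return false;
  inventory.stock -= qty;
  return true;
}

const inventory = { stock: 10 };
const first = reserveAtomic(inventory, 8);
const second = reserveAtomic(inventory, 8); // rejected: only 2 units left

console.log(naiveStock, a.ok && b.ok, inventory.stock, first, second);
```

In the naive run 16 units are promised while the counter still reads 2; in the guarded run the second reservation is refused.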
Countermeasures for Isolation
Since sagas cannot provide ACID isolation, you apply countermeasures — design techniques that mitigate the anomalies:
| Countermeasure | How It Works | Example |
|---|---|---|
| Semantic locking | Use application-level flags to indicate a resource is being processed by a saga. Other sagas see the flag and wait or skip. | Order status = APPROVAL_PENDING instead of PENDING. Other sagas that need to modify this order see the lock flag and defer. |
| Commutative updates | Design operations so the order of execution doesn't matter. Use increments/decrements instead of absolute values. | Instead of SET stock = 8, use stock = stock - 2. Two concurrent reservations of 2 each produce the same result regardless of order. |
| Pessimistic view | Reorder saga steps to minimize the business risk of dirty reads: steps that are likely to fail run early, so fewer updates are exposed to other sagas and later compensated. | Validate payment before reserving inventory. If payment fails, no other saga ever saw a stock reservation that had to be released. |
| Reread value | Before committing, reread the value to check if it has been modified by another saga since the original read. | Before reserving stock, reread the current stock level. If it changed, re-evaluate whether the reservation is still possible. |
| Version file | Record the operations on a record so that they can be reordered if they arrive out of order. | Attach a version number to each inventory update. If an update with version 3 arrives before version 2, buffer it and apply in order. |
| By value (risk-based) | Use saga for low-risk transactions; use 2PC or manual review for high-value ones. | Orders under $100 use saga. Orders over $10,000 use 2PC or require manager approval. |
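As an example of one countermeasure, the version file row can be sketched as a record that buffers early updates and applies them strictly in version order. `VersionedRecord` and its fields are hypothetical names, and the sketch assumes versions start at 1 and have no gaps.

```javascript
// Applies versioned updates in order, buffering any that arrive early.
class VersionedRecord {
  constructor(initialValue) {
    this.value = initialValue;
    this.version = 0;         // last applied version
    this.pending = new Map(); // version -> update function
  }

  receive(version, update) {
    if (version <= this.version) return; // duplicate or stale: ignore
    this.pending.set(version, update);
    // Drain every consecutive version that is now available.
    while (this.pending.has(this.version + 1)) {
      const next = this.pending.get(this.version + 1);
      this.pending.delete(this.version + 1);
      this.value = next(this.value);
      this.version += 1;
    }
  }
}

const stockRecord = new VersionedRecord(10);
stockRecord.receive(2, v => v - 3); // arrives early: buffered, value stays 10
const afterEarly = stockRecord.value;
stockRecord.receive(1, () => 8);    // version 1 sets stock to 8, then version 2 applies
stockRecord.receive(1, () => 100);  // late duplicate: ignored

console.log(afterEarly, stockRecord.value, stockRecord.version);
```

Applied in arrival order the two updates would leave the stock at 8; applied in version order the result is 5.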
Implementation Patterns
Saga + Event Sourcing
Event sourcing pairs naturally with sagas. Each service stores its state as a sequence of events. The saga events become first-class citizens in the event store:
// Event store for Order Service
[
  { type: 'OrderCreated', orderId: 'ORD-001', timestamp: '...', data: { items: [...], total: 99.99 } },
  { type: 'PaymentConfirmed', orderId: 'ORD-001', timestamp: '...', data: { chargeId: 'ch_xxx' } },
  { type: 'StockReserved', orderId: 'ORD-001', timestamp: '...', data: { warehouse: 'WH-East' } },
  { type: 'ShippingFailed', orderId: 'ORD-001', timestamp: '...', data: { reason: 'address unreachable' } },
  { type: 'StockReleased', orderId: 'ORD-001', timestamp: '...', data: {} },
  { type: 'PaymentRefunded', orderId: 'ORD-001', timestamp: '...', data: { refundId: 're_xxx' } },
  { type: 'OrderCancelled', orderId: 'ORD-001', timestamp: '...', data: { reason: 'shipping failed' } },
]

// Rebuild current state by replaying events
function rebuildOrderState(events) {
  let state = { status: 'UNKNOWN' };
  for (const event of events) {
    switch (event.type) {
      case 'OrderCreated': state = { ...state, ...event.data, status: 'PENDING' }; break;
      case 'PaymentConfirmed': state.status = 'PAYMENT_DONE'; break;
      case 'StockReserved': state.status = 'STOCK_RESERVED'; break;
      case 'OrderCancelled': state.status = 'CANCELLED'; break;
    }
  }
  return state;
}
Saga + Transactional Outbox
A critical implementation detail: how do you atomically update the local database and publish an event? If the service crashes after the DB commit but before publishing the event, the saga stalls. The Transactional Outbox pattern solves this:
- Within the same local transaction, write the business data and an event record to an outbox table.
- A separate relay process (or CDC, Change Data Capture) reads unpublished events from the outbox table and publishes them to the message broker.
- Once published, the relay marks the outbox record as sent.
-- Single local transaction in the Payment Service
BEGIN;
INSERT INTO payments (order_id, charge_id, amount, status)
VALUES ('ORD-001', 'ch_xxx', 99.99, 'COMPLETED');
INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
VALUES (
  uuid(),  -- MySQL's UUID(); on PostgreSQL 13+ use gen_random_uuid()
  'Payment',
  'ORD-001',
  'PaymentCompleted',
  '{"orderId":"ORD-001","chargeId":"ch_xxx","amount":99.99}'
);
COMMIT;
-- Relay process (Debezium CDC or polling) picks up the outbox row
-- and publishes it to Kafka / RabbitMQ / SNS
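The relay side of the pattern might look like the polling sketch below. It uses in-memory stand-ins, all of which are assumptions: the `outbox` array plays the outbox table, `broker.publish` plays the broker client, and `sentAt` is a hypothetical column for marking rows as sent.

```javascript
// One polling pass: publish unsent rows in order, marking each row only
// after the broker accepted it.
function relayOnce(outbox, broker) {
  let published = 0;
  for (const row of outbox) {
    if (row.sentAt) continue; // already relayed
    try {
      broker.publish(row.eventType, JSON.parse(row.payload));
      row.sentAt = Date.now();
      published += 1;
    } catch (err) {
      break; // keep ordering; the row is retried on the next poll
    }
  }
  return published;
}

const outbox = [
  { id: 1, eventType: 'PaymentCompleted', payload: '{"orderId":"ORD-001"}', sentAt: null },
  { id: 2, eventType: 'PaymentCompleted', payload: '{"orderId":"ORD-002"}', sentAt: null },
];

const delivered = [];
let brokerUp = false;
const broker = {
  publish(type, payload) {
    if (!brokerUp) throw new Error('broker unavailable');
    delivered.push({ type, orderId: payload.orderId });
  },
};

const firstPoll = relayOnce(outbox, broker);  // broker down: nothing published
brokerUp = true;
const secondPoll = relayOnce(outbox, broker); // both rows go out, in order
const thirdPoll = relayOnce(outbox, broker);  // nothing left to send

console.log(firstPoll, secondPoll, thirdPoll, delivered.map(d => d.orderId));
```

If the relay crashed after publishing but before marking the row, the row would be published again on the next poll, which is why the outbox gives at-least-once delivery and consumers must be idempotent.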
Ensuring Idempotency
Every saga participant must handle duplicate messages gracefully. Common approaches:
// Pattern 1: Idempotency key in the database
async function handleReserveStock(command) {
  const existing = await db.reservations.findOne({
    sagaId: command.sagaId,
    step: 'reserve-stock',
  });
  if (existing) {
    // Already processed: return the same result (idempotent)
    return existing.result;
  }
  // Caveat: two concurrent duplicates can both pass the check above.
  // Pattern 2 closes that race.
  const result = await doReserveStock(command.items);
  await db.reservations.insert({
    sagaId: command.sagaId,
    step: 'reserve-stock',
    result,
    processedAt: new Date(),
  });
  return result;
}

// Pattern 2: Database unique constraint
// CREATE UNIQUE INDEX idx_reservation_saga ON reservations(saga_id, step);
// INSERT will fail on duplicate: catch the error and return success
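Pattern 2 is only described in comments, so here is a hedged in-memory sketch of the idea. `ReservationTable` stands in for a table with a unique index on (saga_id, step), and the `DUPLICATE_KEY` error code is hypothetical; real drivers report constraint violations with their own codes.

```javascript
// In-memory stand-in for a table with UNIQUE (saga_id, step).
// insert() throws on a duplicate key, like a real unique index would.
class ReservationTable {
  constructor() {
    this.rows = new Map();
  }

  insert(row) {
    const key = `${row.sagaId}:${row.step}`;
    if (this.rows.has(key)) {
      const err = new Error('duplicate key');
      err.code = 'DUPLICATE_KEY'; // hypothetical error code
      throw err;
    }
    this.rows.set(key, row);
  }
}

function handleReserveStock(table, inventory, command) {
  try {
    // Claim the (sagaId, step) slot first; the constraint rejects duplicates.
    table.insert({ sagaId: command.sagaId, step: 'reserve-stock' });
  } catch (err) {
    if (err.code === 'DUPLICATE_KEY') return 'ALREADY_PROCESSED';
    throw err;
  }
  // In a real service this update and the insert above would share one
  // local transaction, so a crash cannot leave a claim without a reservation.
  inventory.stock -= command.qty;
  return 'RESERVED';
}

const table = new ReservationTable();
const inventory = { stock: 10 };
const command = { sagaId: 'saga-42', qty: 2 };
const firstResult = handleReserveStock(table, inventory, command);
const secondResult = handleReserveStock(table, inventory, command); // redelivered duplicate

console.log(firstResult, secondResult, inventory.stock);
```

Unlike the check-then-insert approach, the constraint is enforced by the database itself, so concurrent duplicates cannot both succeed.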
Saga vs 2PC vs Outbox Pattern
| Aspect | Saga | 2PC (Two-Phase Commit) | Transactional Outbox |
|---|---|---|---|
| Purpose | Coordinate distributed transactions across services | Ensure atomicity across distributed databases | Reliably publish events after local DB commit |
| Scope | Cross-service business workflows | Cross-database atomic commits | Single-service: DB write + event publish |
| Isolation | None — intermediate states visible | Full ACID — locked until commit | Local only — single DB transaction |
| Consistency | Eventual (via compensations) | Strong (immediate) | At-least-once delivery (idempotent consumers) |
| Blocking? | No | Yes — participants hold locks | No |
| Scalability | High — async, non-blocking | Low — synchronous, lock-heavy | High — local transaction only |
| Failure handling | Compensating transactions | Coordinator recovery log | Retry relay until published |
| Complexity | High (compensation logic, idempotency) | Medium (protocol is well-defined) | Low-Medium (outbox table + relay) |
| Use case | Microservice workflows (e.g., order processing) | Homogeneous databases, short transactions | Reliable event publishing within a single service |
| Often combined? | Yes — saga + outbox pattern is the standard approach | Rarely combined with saga (different philosophies) | Yes — outbox is a building block inside saga participants |
Real-World Saga Implementations
Frameworks & Libraries
| Framework | Language | Type | Notable Features |
|---|---|---|---|
| Temporal | Go, Java, Python, TS | Orchestration | Durable workflow engine; built-in retries and timeouts; saga is modeled as a workflow with compensations |
| Axon Framework | Java/Kotlin | Both | Event sourcing + saga; SagaManager coordinates; integrates with Axon Server |
| MassTransit | C# (.NET) | Both | State machine-based sagas; supports RabbitMQ, Azure Service Bus, Amazon SQS |
| Eventuate Tram | Java | Both | By Chris Richardson (author of Microservices Patterns); transactional outbox built-in |
| NServiceBus | C# (.NET) | Orchestration | Enterprise-grade; saga persistence; built-in message retry and dead-letter |
| AWS Step Functions | Any (via Lambda) | Orchestration | Serverless saga orchestrator; visual workflow designer; built-in error handling |
Example: Saga with Temporal
// Temporal workflow (Go): saga with explicit compensation.
// Needs imports "time" and "go.temporal.io/sdk/workflow".
// NewSaga, AddCompensation, and Compensate are assumed to be application-defined
// helpers, not Temporal SDK calls; the activities are defined elsewhere.
func OrderSagaWorkflow(ctx workflow.Context, order Order) error {
	// Activity options must set StartToCloseTimeout or ScheduleToCloseTimeout
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: time.Minute,
	})
	saga := NewSaga()

	// Step 1: Create Order
	err := workflow.ExecuteActivity(ctx, CreateOrder, order).Get(ctx, nil)
	if err != nil {
		return err
	}
	saga.AddCompensation(ctx, CancelOrder, order.ID)

	// Step 2: Charge Payment
	var chargeResult ChargeResult
	err = workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &chargeResult)
	if err != nil {
		saga.Compensate(ctx) // Runs CancelOrder
		return err
	}
	saga.AddCompensation(ctx, RefundPayment, chargeResult.ChargeID)

	// Step 3: Reserve Inventory
	err = workflow.ExecuteActivity(ctx, ReserveStock, order.Items).Get(ctx, nil)
	if err != nil {
		saga.Compensate(ctx) // Runs RefundPayment, then CancelOrder
		return err
	}
	saga.AddCompensation(ctx, ReleaseStock, order.Items)

	// Step 4: Schedule Shipping
	err = workflow.ExecuteActivity(ctx, ScheduleShipping, order).Get(ctx, nil)
	if err != nil {
		saga.Compensate(ctx) // Runs ReleaseStock, RefundPayment, CancelOrder
		return err
	}
	return nil // Saga completed successfully
}
Best Practices
- Make all participants idempotent. With at-least-once message delivery, every handler will receive duplicates. Use idempotency keys, unique constraints, or state checks.
- Use the transactional outbox pattern. Never rely on "commit DB then publish event" — the process can crash between the two. Use an outbox table with CDC or polling relay.
- Design compensations carefully. A compensation is not always a simple "undo." Think about what side effects have occurred (emails sent, webhooks fired, external APIs called) and handle each one.
- Keep sagas short. The longer a saga runs, the larger the window for isolation anomalies and the more complex the compensation logic. Aim for 3–5 steps.
- Order steps by failure probability. Put the most likely-to-fail step early in the saga. If payment validation often fails, do it before inventory reservation — fewer compensations needed.
- Implement timeouts. If a participant doesn't respond within a deadline, treat it as a failure and begin compensation. Don't let sagas hang indefinitely.
- Use semantic locking. Flag resources being processed by a saga (e.g., order status = PROCESSING) so other sagas don't interfere.
- Monitor and alert. Track saga durations, failure rates, compensation rates, and stuck sagas. A saga that has been in COMPENSATING state for 10 minutes needs attention.
- Dead letter queue for failed compensations. If a compensating transaction fails after all retries, don't silently swallow the failure. Send it to a dead letter queue for manual intervention.
- Use saga IDs for correlation. Every message and log entry should include the saga ID so you can trace the entire flow across services.
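The timeout practice can be sketched with `Promise.race`. The `withTimeout` helper and the simulated replies are hypothetical, and the deadlines are a few milliseconds only so that the example runs quickly.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms} ms`)), ms);
  });
  // Whichever settles first wins; the timer is cleared either way.
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Simulated participant replies
const fastReply = Promise.resolve('RESERVED');
const slowReply = new Promise(resolve => setTimeout(() => resolve('RESERVED'), 50));

const timeoutDemo = Promise.all([
  withTimeout(fastReply, 10).catch(err => err.message),
  withTimeout(slowReply, 10).catch(err => err.message),
]);

timeoutDemo.then(results => console.log(results));
```

A timeout only tells the caller that no reply arrived in time. The slow participant in this example still completes its work, so the compensation that follows a timeout has to cope with a step that actually succeeded.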
Saga Pattern in System Design Interviews
When to Bring Up Sagas
Sagas are relevant whenever you're designing a microservices system with operations that span multiple services:
- E-commerce order processing (order → payment → inventory → shipping)
- Travel booking (flight → hotel → car rental)
- Banking (transfer money between accounts at different banks)
- User registration (create account → verify email → provision resources)
- Subscription management (activate plan → charge billing → provision features)
Framework for Answering
- Identify the distributed transaction: "Placing an order involves Order, Payment, Inventory, and Shipping services."
- Explain why 2PC won't work: "2PC is blocking and doesn't scale for microservices with heterogeneous databases."
- Choose saga type: "I'd use orchestration because we have 4+ steps with potential branching."
- Define steps and compensations: List each Ti and Ci.
- Address isolation: "We use semantic locking and commutative updates to mitigate isolation anomalies."
- Address reliability: "Each service uses the transactional outbox pattern for atomic DB + event publishing. All handlers are idempotent."
- Address failure: "If a step fails, the orchestrator runs compensations in reverse. Failed compensations go to a dead letter queue."
Common Interview Questions
- "What happens if a compensating transaction fails?" — Retry with exponential backoff. Compensations must be idempotent. After max retries, send to DLQ for manual resolution. Log everything for audit.
- "How do you guarantee exactly-once processing?" — You can't in a distributed system. Instead, design for at-least-once delivery with idempotent handlers. Use unique constraints or idempotency keys.
- "What about performance?" — Sagas are asynchronous and non-blocking, so throughput is high. Latency is higher than a single ACID transaction (due to multiple network hops), but acceptable for most workflows.
- "When would you NOT use a saga?" — When strong isolation is required (e.g., financial ledger updates within a single database), when the transaction involves only 2 services and can use 2PC, or when the business logic is simple enough that a single service can own the entire workflow.
Summary
| Concept | Key Takeaway |
|---|---|
| Saga | Sequence of local transactions + compensating transactions for distributed workflows |
| Choreography | Event-driven, decentralized; best for simple, linear sagas |
| Orchestration | Central coordinator directs steps; best for complex, branching sagas |
| Compensation | Semantic undo (not rollback); must be idempotent |
| Isolation | No ACID isolation; use semantic locking, commutative updates, pessimistic view |
| Outbox | Building block for reliable event publishing within each saga participant |
| SEC | Saga Execution Coordinator — stateful state machine managing the saga lifecycle |
The Saga pattern is the de facto standard for managing distributed transactions in microservices. It trades the strong guarantees of ACID for the availability, scalability, and loose coupling that microservices demand. The key to a successful saga implementation is careful compensation design, idempotent participants, and reliable event delivery via the transactional outbox pattern. In the next post, we will explore the Circuit Breaker pattern — another essential tool for building resilient distributed systems.