High Level Design Series · Distributed Systems · Part 5

Saga Pattern

April 2026 · 20 min read

Microservices are great for team autonomy and independent deployment, but they introduce a hard problem: how do you manage transactions that span multiple services? A monolithic application can wrap everything in a single ACID transaction, but in a distributed system where each service owns its own database, there is no global transaction coordinator. The Two-Phase Commit (2PC) protocol solves this theoretically, but it is blocking, fragile, and doesn't scale well across autonomous services. Enter the Saga pattern — a sequence of local transactions coordinated by events or a central orchestrator, with compensating transactions that undo work if something fails partway through.

Key insight: A saga gives up the isolation and all-at-once atomicity of a single ACID transaction in exchange for availability. Instead of one big transaction, you get a series of small ones, each locally atomic, connected by a coordination mechanism. If any step fails, previously completed steps are compensated (logically reversed) rather than rolled back.

The Distributed Transaction Problem

Consider an e-commerce order flow. When a customer clicks "Place Order," several things must happen:

Order Service creates the order record.
Payment Service charges the customer's credit card.
Inventory Service reserves the items from stock.
Shipping Service schedules the delivery.

In a monolith, you would wrap all four operations in a single database transaction. If the shipping step fails, the entire transaction rolls back — the payment is never charged, the inventory is never reserved, the order is never created. Simple and safe.

But in a microservices architecture, each service has its own database. There is no shared transaction log. You cannot do a BEGIN TRANSACTION that spans four different databases owned by four different services. The fundamental reasons are:

No shared ACID boundary: Each service uses its own database (possibly even different database engines — PostgreSQL here, MongoDB there). There is no global transaction manager.
Network unreliability: Services communicate over the network. Messages can be delayed, duplicated, or lost. A coordinator cannot guarantee it can reach all participants simultaneously.
Autonomy violation: Locking resources across services for the duration of a distributed transaction defeats the purpose of microservices. The Payment Service shouldn't be blocked because the Inventory Service is slow.
Scalability: 2PC requires synchronous coordination. At high throughput, this becomes a bottleneck.

Why Not Just Use 2PC?

Two-Phase Commit (covered in the previous post) guarantees atomicity across distributed participants, but it has critical drawbacks in a microservices context:

2PC Drawback	Impact on Microservices
Blocking protocol	All participants hold locks until the coordinator commits. A slow Payment Service blocks the Inventory Service and Shipping Service.
Single point of failure	If the coordinator crashes between prepare and commit, all participants are stuck in an uncertain state, holding locks indefinitely.
Tight coupling	Every participant must implement the 2PC protocol. Adding a new service to the transaction requires coordinating with the transaction manager.
Latency	At least two round trips (prepare, then commit) to every participant; each phase finishes only when the slowest participant has answered.
Not supported across heterogeneous systems	If Order uses PostgreSQL and Payment uses Stripe's API, there is no XA-compatible interface for Stripe.

The Saga pattern was introduced by Hector Garcia-Molina and Kenneth Salem in their 1987 paper as an alternative to long-lived transactions. The core idea: break a long-lived transaction into a sequence of shorter transactions, each of which can be compensated if a later step fails.

Saga Pattern: Core Concepts

Definition

A Saga is a sequence of local transactions T1, T2, ..., Tn where:

Each Ti updates the local database of service Si and publishes an event or message to trigger the next step.
Each Ti has a corresponding compensating transaction Ci that semantically undoes the effect of Ti.
If all transactions T1...Tn succeed, the saga completes successfully.
If any transaction Tj fails, the saga executes compensating transactions Cj-1, Cj-2, ..., C1 in reverse order.

T1: Create Order → T2: Charge Payment → T3: Reserve Stock → T4: Schedule Shipping → ✓ Complete

If T3 (Reserve Stock) fails:

T1 ✓ → T2 ✓ → T3 ✗ FAIL → C2: Refund Payment → C1: Cancel Order
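
The definition above can be sketched as a small in-memory runner. This is only an illustration under simplifying assumptions: runSaga and the step objects are hypothetical names, the steps are plain synchronous functions rather than calls to real services, and compensations are assumed never to fail.

```javascript
// Minimal in-memory saga runner (illustrative; all names are hypothetical).
// Each step has an action (Ti) and an optional compensation (Ci).
function runSaga(steps) {
  const log = [];
  const completed = [];
  for (const step of steps) {
    try {
      step.action();
      log.push(`${step.name} ok`);
      completed.push(step);
    } catch (err) {
      log.push(`${step.name} failed`);
      // Compensate the completed steps in reverse order: Cj-1, ..., C1
      for (const done of completed.reverse()) {
        if (done.compensate) {
          done.compensate();
          log.push(`${done.name} compensated`);
        }
      }
      return { status: 'COMPENSATED', log };
    }
  }
  return { status: 'COMPLETED', log };
}

const result = runSaga([
  { name: 'createOrder',   action: () => {}, compensate: () => {} },
  { name: 'chargePayment', action: () => {}, compensate: () => {} },
  { name: 'reserveStock',  action: () => { throw new Error('out of stock'); }, compensate: () => {} },
  { name: 'scheduleShipping', action: () => {} },
]);
console.log(result.status); // COMPENSATED
console.log(result.log);
```

The failed step itself is not compensated here because a local transaction that fails is assumed to have left no trace; only the steps that committed are undone.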

Compensating Transactions

A compensating transaction is not the same as a database rollback. It is a new, forward-moving transaction that semantically undoes the effect of a prior transaction. This is a crucial distinction:

Original Transaction (Ti)	Compensating Transaction (Ci)	Why Not a Simple Rollback?
Create order (status: PENDING)	Update order status to CANCELLED	Order may have been visible to user; audit trail needed
Charge credit card $99.99	Issue refund of $99.99	Payment already left the system; can't "un-charge"
Reserve 2 units of SKU-1234	Release 2 units of SKU-1234	Other orders may have seen the updated stock count
Send shipping notification email	Send cancellation email	Email is already sent; can't unsend

Important: Compensating transactions must be idempotent — if the compensation is retried (due to network failure), it should produce the same result. This typically means using idempotency keys, checking current state before acting, and designing for at-least-once delivery.
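
As a rough illustration of the state-check approach, the sketch below shows a refund compensation that can be retried safely. The payments map, the refundsIssued counter, and the return values are stand-ins invented for this example; a real service would check a database row and call its payment provider.

```javascript
// Idempotent refund compensation (illustrative; in-memory stand-ins).
const payments = new Map([['ORD-001', { amount: 9999, status: 'CHARGED' }]]);
let refundsIssued = 0; // counts calls to the (pretend) payment provider

function refundPayment(orderId) {
  const payment = payments.get(orderId);
  // Nothing was charged, so there is nothing to undo.
  if (!payment) return 'NOTHING_TO_REFUND';
  // State check: a retried compensation must not refund twice.
  if (payment.status === 'REFUNDED') return 'ALREADY_REFUNDED';

  refundsIssued += 1;          // the external side effect happens once
  payment.status = 'REFUNDED';
  return 'REFUNDED';
}

console.log(refundPayment('ORD-001')); // REFUNDED
console.log(refundPayment('ORD-001')); // ALREADY_REFUNDED
console.log(refundsIssued);            // 1
```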

Types of Saga Transactions

Not all steps in a saga are created equal. A common classification, popularized by Chris Richardson's Microservices Patterns, sorts saga steps into three categories:

Type	Definition	Example
Compensatable	Can be undone by a compensating transaction. These are the normal saga steps.	Reserve inventory (can release), charge payment (can refund)
Pivot	The point of no return. If the pivot succeeds, the saga will run to completion. If it fails, compensation begins.	Schedule shipment in the order flow above (the last step that can fail). Charging the card is the pivot only in sagas where every later step is retriable.
Retriable	Steps after the pivot that are guaranteed to eventually succeed (with retries). They never need compensation because they always complete.	Send confirmation email (can always be retried), update analytics

The ordering is always: [Compensatable*] → Pivot → [Retriable*]. You design the saga so that once the pivot transaction succeeds, all remaining steps are retriable — they don't need compensations because they will eventually succeed.
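
One way to keep a saga definition honest about this ordering is a small check at definition time. The function below is a hypothetical helper written for this article, assuming each step is tagged with one of the three type names used in the table.

```javascript
// Checks that step types follow Compensatable* -> Pivot -> Retriable*.
// isValidSagaOrder is a hypothetical helper, not part of any framework.
function isValidSagaOrder(types) {
  const rank = new Map([['compensatable', 0], ['pivot', 1], ['retriable', 2]]);
  let pivots = 0;
  let previous = 0;
  for (const type of types) {
    if (!rank.has(type)) return false;            // unknown step type
    if (rank.get(type) < previous) return false;  // e.g. compensatable after the pivot
    if (type === 'pivot') pivots += 1;
    previous = rank.get(type);
  }
  return pivots === 1; // exactly one point of no return
}

console.log(isValidSagaOrder(['compensatable', 'compensatable', 'pivot', 'retriable'])); // true
console.log(isValidSagaOrder(['compensatable', 'pivot', 'compensatable']));              // false
```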

Choreography-Based Saga

In a choreography-based saga, there is no central coordinator. Each service listens for events, performs its local transaction, and publishes a new event. The saga emerges from the interaction of independently reacting services — like dancers in a ballet who each know their part without a choreographer directing them in real time.

How It Works

Order Service creates an order (status: PENDING) and publishes OrderCreated.
Payment Service listens for OrderCreated, charges the card, publishes PaymentCompleted.
Inventory Service listens for PaymentCompleted, reserves stock, publishes StockReserved.
Shipping Service listens for StockReserved, schedules delivery, publishes ShipmentScheduled.
Order Service listens for ShipmentScheduled, updates order to CONFIRMED.

Failure & Compensation

If the Shipping Service fails (e.g., address is unreachable):

Shipping Service publishes ShippingFailed.
Inventory Service listens for ShippingFailed, releases reserved stock, publishes StockReleased.
Payment Service listens for StockReleased, refunds the charge, publishes PaymentRefunded.
Order Service listens for PaymentRefunded, updates order to CANCELLED.
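
The two event chains above can be simulated in one process to see the cascade. This is a toy sketch: subscribe and publish form a made-up synchronous bus standing in for a real broker, each "service" is a one-line handler, and the addressReachable flag is an invented way to trigger the shipping failure.

```javascript
// Tiny synchronous event bus (illustrative stand-in for a message broker).
const handlers = {};
const trail = [];
function subscribe(type, handler) {
  handlers[type] = handlers[type] || [];
  handlers[type].push(handler);
}
function publish(type, payload) {
  trail.push(type);
  for (const handler of handlers[type] || []) handler(payload);
}

const order = { id: 'ORD-001', status: 'PENDING' };

// Forward path: each service reacts to the previous service's event.
subscribe('OrderCreated',     (e) => publish('PaymentCompleted', e)); // Payment
subscribe('PaymentCompleted', (e) => publish('StockReserved', e));    // Inventory
subscribe('StockReserved',    (e) =>                                  // Shipping
  publish(e.addressReachable ? 'ShipmentScheduled' : 'ShippingFailed', e));
subscribe('ShipmentScheduled', () => { order.status = 'CONFIRMED'; }); // Order

// Compensation path: the same services react to failure events in reverse.
subscribe('ShippingFailed',  (e) => publish('StockReleased', e));    // Inventory
subscribe('StockReleased',   (e) => publish('PaymentRefunded', e));  // Payment
subscribe('PaymentRefunded', () => { order.status = 'CANCELLED'; }); // Order

publish('OrderCreated', { orderId: order.id, addressReachable: false });
console.log(trail.join(' -> '));
console.log(order.status); // CANCELLED
```

Note that no handler knows the whole flow; the saga exists only as the sum of the subscriptions, which is exactly what makes choreography hard to trace as it grows.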

Code Example: Choreography with Events

// Payment Service: listens for OrderCreated
async function onOrderCreated(event) {
  const { orderId, customerId, amount } = event.payload;

  // Idempotency: skip if a payment was already recorded for this order
  const existing = await db.payments.findOne({ orderId });
  if (existing) return;

  let charge;
  try {
    charge = await stripe.charges.create(
      { amount, currency: 'usd', customer: customerId },  // amount in cents
      { idempotencyKey: `order-${orderId}` }  // request option, not a charge field
    );
  } catch (err) {
    // Only a failed charge means the payment failed
    await eventBus.publish('PaymentFailed', { orderId, reason: err.message });
    return;
  }

  // Errors below propagate so the message is redelivered; the idempotency
  // key makes the repeated charge call safe. In production, write the row and
  // the event in one local transaction (see Saga + Transactional Outbox).
  await db.payments.insert({
    orderId, chargeId: charge.id, amount, status: 'COMPLETED'
  });
  await eventBus.publish('PaymentCompleted', {
    orderId, chargeId: charge.id, amount
  });
}

// Compensating handler: listens for StockReleased (during compensation)
async function onStockReleased(event) {
  const { orderId } = event.payload;
  const payment = await db.payments.findOne({ orderId });
  if (!payment || payment.status === 'REFUNDED') return;

  // The idempotency key keeps a retried refund safe if we crash before the status update
  await stripe.refunds.create(
    { charge: payment.chargeId },
    { idempotencyKey: `refund-${orderId}` }
  );
  await db.payments.update(
    { orderId },
    { $set: { status: 'REFUNDED' } }
  );
  await eventBus.publish('PaymentRefunded', { orderId });
}

Choreography: Pros & Cons

Pros	Cons
Simple — no central coordinator to build/maintain	Hard to understand the full saga flow (spread across services)
Loosely coupled — services only know about events	Cyclic dependencies possible if services listen to each other
Easy to add new services that react to events	No single place to see saga status — debugging is difficult
No single point of failure (no coordinator)	Risk of "event spaghetti" as sagas grow in complexity
Good for simple, linear sagas (3-4 steps)	Difficult to implement complex business rules and branching

Orchestration-Based Saga

In an orchestration-based saga, a central Saga Orchestrator (sometimes called the Saga Execution Coordinator or SEC) directs the saga. It tells each participant what to do, waits for the response, and decides the next step. Think of it as a conductor directing an orchestra — each musician (service) plays their part when told.

How It Works

The Orchestrator sends a CreateOrder command to the Order Service.
Order Service creates the order, replies with success.
Orchestrator sends ChargePayment command to the Payment Service.
Payment Service charges the card, replies with success.
Orchestrator sends ReserveStock command to the Inventory Service.
Inventory Service reserves stock, replies with success.
Orchestrator sends ScheduleShipment command to the Shipping Service.
Shipping Service schedules delivery, replies with success.
Orchestrator marks the saga as COMPLETED.

Failure & Compensation

If the Shipping Service fails:

Shipping Service replies with failure.
Orchestrator sends ReleaseStock to Inventory Service.
Orchestrator sends RefundPayment to Payment Service.
Orchestrator sends CancelOrder to Order Service.
Orchestrator marks the saga as COMPENSATED.

Saga Execution Coordinator (SEC)

The SEC is the heart of an orchestration-based saga. It is a stateful component that:

Persists saga state: Stores the current step, the status of each participant, and the saga's overall state in a durable log (database or event store).
Manages transitions: Implements a state machine — for each step, it knows what command to send next on success, and what compensating command to send on failure.
Handles retries: If a participant doesn't respond within a timeout, the SEC retries the command (participants must be idempotent).
Manages concurrent sagas: Each saga instance has a unique ID. The SEC can run thousands of sagas concurrently.

// Saga Orchestrator — state machine definition
const orderSagaDefinition = {
  name: 'OrderSaga',
  steps: [
    {
      name: 'createOrder',
      action:      { service: 'order',     command: 'CreateOrder' },
      compensation: { service: 'order',     command: 'CancelOrder' },
    },
    {
      name: 'chargePayment',
      action:      { service: 'payment',   command: 'ChargePayment' },
      compensation: { service: 'payment',   command: 'RefundPayment' },
    },
    {
      name: 'reserveStock',
      action:      { service: 'inventory', command: 'ReserveStock' },
      compensation: { service: 'inventory', command: 'ReleaseStock' },
    },
    {
      name: 'scheduleShipment',
      action:      { service: 'shipping',  command: 'ScheduleShipment' },
      // No compensation — last step; if it fails, we compensate prior steps
    },
  ],
};

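class SagaOrchestrator {
  // store: durable saga log; sendCommand / sendCommandWithRetry: messaging
  // helpers supplied by the host application (not shown here)
  constructor({ store, sendCommand, sendCommandWithRetry }) {
    this.store = store;
    this.sendCommand = sendCommand;
    this.sendCommandWithRetry = sendCommandWithRetry;
  }

  async execute(sagaDef, payload) {
    const sagaId = generateId();  // any unique ID generator, e.g. a UUID
    const saga = await this.store.create({
      id: sagaId, definition: sagaDef.name,
      currentStep: 0, status: 'RUNNING', payload,
      completedSteps: [],
    });

    for (let i = 0; i < sagaDef.steps.length; i++) {
      const step = sagaDef.steps[i];
      try {
        const result = await this.sendCommand(
          step.action.service, step.action.command, { sagaId, ...payload }
        );
        saga.completedSteps.push({ step: i, result });
        await this.store.update(sagaId, { currentStep: i + 1, completedSteps: saga.completedSteps });
      } catch (err) {
        // Step i failed: record it, then compensate completed steps i-1 ... 0
        await this.store.update(sagaId, { status: 'COMPENSATING', failedStep: i });
        await this.compensate(sagaDef, saga, i - 1);
        return 'COMPENSATED';
      }
    }
    await this.store.update(sagaId, { status: 'COMPLETED' });
    return 'COMPLETED';
  }

  // Runs compensations in reverse order, from fromStep down to step 0
  async compensate(sagaDef, saga, fromStep) {
    for (let i = fromStep; i >= 0; i--) {
      const step = sagaDef.steps[i];
      if (step.compensation) {
        await this.sendCommandWithRetry(
          step.compensation.service, step.compensation.command,
          { sagaId: saga.id, ...saga.payload }
        );
      }
    }
    await this.store.update(saga.id, { status: 'COMPENSATED' });
  }
}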
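
The orchestrator above calls sendCommandWithRetry without defining it. One possible shape for the retry part is sketched below, assuming the actual delivery is wrapped in an async function passed in as send; the name sendWithRetry, the option names, and the default values are all illustrative choices, not taken from any library.

```javascript
// Retry helper with exponential backoff (illustrative sketch).
// `send` is any async function that delivers one command and rejects on failure.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function sendWithRetry(send, { maxAttempts = 5, baseDelayMs = 100 } = {}) {
  let lastError;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await send(attempt);
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) {
        // Wait baseDelayMs, then 2x, 4x, ... before the next attempt
        await sleep(baseDelayMs * 2 ** (attempt - 1));
      }
    }
  }
  // Out of retries: surface the failure so the caller can park the saga
  // (for example in a dead letter queue) instead of swallowing it.
  throw lastError;
}

// Usage: a command that times out twice, then succeeds on the third attempt.
let calls = 0;
const flakyRelease = async () => {
  calls += 1;
  if (calls < 3) throw new Error('timeout');
  return 'released';
};
sendWithRetry(flakyRelease, { baseDelayMs: 1 }).then((reply) => console.log(reply, calls));
```

Because the same command may be delivered more than once, this only works if the participant handles duplicates idempotently.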

Orchestration: Pros & Cons

Pros	Cons
Easy to understand — saga flow is explicit in the orchestrator	Central coordinator is a potential single point of failure
No cyclic dependencies between services	More coupling — orchestrator must know about all participants
Easy to implement complex business rules and branching	Risk of centralizing too much logic in the orchestrator
Single place to observe saga status and debug	Orchestrator code can become a "god class" if not carefully designed
Good for complex sagas with many steps and branches	Additional infrastructure (saga store, message broker) required

Choreography vs Orchestration

Aspect	Choreography	Orchestration
Coordination	Decentralized — event-driven	Centralized — command-driven
Coupling	Loose (services know events, not each other)	Tighter (orchestrator knows all participants)
Visibility	Hard to trace — saga spread across services	Easy — orchestrator has full saga state
Complexity	Simple for linear sagas; messy for branching	Handles complex branching well
Failure point	No SPOF (no coordinator)	Orchestrator is a SPOF (mitigated by HA deployment)
Testing	Integration tests across services	Unit test the orchestrator's state machine
Adding steps	Add a new listener — minimal changes	Modify orchestrator definition
Best for	2-4 step linear sagas	5+ step sagas with branching/conditions

Interview tip: When asked "choreography or orchestration?", don't pick one blindly. Say: "For simple, linear workflows (e.g., 3 services), choreography keeps things decoupled. For complex workflows with branching, conditional steps, or many participants, orchestration provides clarity and easier debugging. In practice, many systems use both — orchestration within a bounded context, choreography across bounded contexts."

E-Commerce Saga: Complete Example

Let's walk through a complete e-commerce order saga with all the details — the happy path, the failure path, the state transitions, and the edge cases.

Services Involved

Service	Local Transaction (Ti)	Compensating Transaction (Ci)	Events Published
Order	Create order (PENDING)	Set order to CANCELLED	`OrderCreated` / `OrderCancelled`
Payment	Charge credit card	Refund credit card	`PaymentCompleted` / `PaymentRefunded`
Inventory	Reserve stock (decrement available)	Release stock (increment available)	`StockReserved` / `StockReleased`
Shipping	Schedule shipment	Cancel shipment (if not yet dispatched)	`ShipmentScheduled` / `ShippingFailed`

Happy Path: State Transitions

/* Saga State Machine — Happy Path */

Order:     PENDING ──────────────────────────────────────────── → CONFIRMED
Payment:            PENDING → CHARGED ───────────────────────────────────
Inventory:                              AVAILABLE → RESERVED ────────────
Shipping:                                                    → SCHEDULED

Timeline:  ─────T1──────────T2──────────T3──────────T4───────→ DONE
                 │           │           │           │
           OrderCreated  PaymentDone  StockReserved  ShipScheduled

Failure Path: Shipping Fails

/* Saga State Machine — Failure at T4 (Shipping) */

Order:     PENDING ──────────────────────────────────── → CANCELLED
Payment:            PENDING → CHARGED ──────── → REFUNDED
Inventory:                              RESERVED → RELEASED
Shipping:                                        ✗ FAILED

Timeline:  ─────T1──────T2──────T3──────T4 FAIL──C3──────C2──────C1───→ COMPENSATED
                 │       │       │       │        │       │       │
           Created   Charged  Reserved  Fail  Released  Refunded Cancelled

Edge Cases to Handle

Payment fails: Only Order needs compensation (cancel the order). No need to compensate inventory or shipping — they haven't executed yet.
Compensation fails: Retry with exponential backoff. Compensating transactions must be idempotent. If all retries fail, flag for manual intervention (dead letter queue).
Duplicate events: Due to at-least-once delivery, a service might receive PaymentCompleted twice. Each handler must check if it has already processed the event (using orderId + event type as an idempotency key).
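Out-of-order events: StockReleased can arrive before StockReserved has been processed. Services should check the current state before acting, and remember the early release so that the late reservation is ignored.
Timeout: If the Inventory Service doesn't respond within a deadline (say, 30 seconds), the orchestrator assumes failure and begins compensation. A timeout does not prove the step failed, so the timed-out step's own compensation should also run and must tolerate finding nothing to undo.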
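
The duplicate and out-of-order cases can both be handled with a per-order state check. The sketch below is one hypothetical way to do it for the Inventory Service: the reservations map and the return values are invented for illustration, and stock-level validation is left out to keep the focus on ordering.

```javascript
// Inventory handlers that tolerate duplicates and out-of-order delivery.
const reservations = new Map(); // orderId -> 'RESERVED' | 'RELEASED'
let available = 10;

function onReserveStock(orderId, qty) {
  const state = reservations.get(orderId);
  if (state === 'RESERVED') return 'DUPLICATE_IGNORED';
  // The release already arrived: the saga is being compensated, so do not reserve.
  if (state === 'RELEASED') return 'SKIPPED_ALREADY_RELEASED';
  available -= qty;
  reservations.set(orderId, 'RESERVED');
  return 'RESERVED';
}

function onReleaseStock(orderId, qty) {
  const state = reservations.get(orderId);
  if (state === 'RELEASED') return 'DUPLICATE_IGNORED';
  if (state === 'RESERVED') available += qty;
  // If nothing was reserved yet, this acts as a tombstone so a late reserve is skipped.
  reservations.set(orderId, 'RELEASED');
  return 'RELEASED';
}

// The release overtakes the reservation for ORD-2:
console.log(onReleaseStock('ORD-2', 3)); // RELEASED (nothing to give back)
console.log(onReserveStock('ORD-2', 3)); // SKIPPED_ALREADY_RELEASED
console.log(available);                  // 10
```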

Isolation Challenges

One of the biggest trade-offs of the Saga pattern is the lack of isolation. In a traditional ACID transaction, the "I" (Isolation) guarantees that concurrent transactions don't interfere with each other. A saga has no such guarantee because intermediate states are visible to other transactions.

Anomalies Without Isolation

Anomaly	Description	E-Commerce Example
Lost updates	One saga overwrites the update of another without seeing it.	Two sagas both read inventory = 10 and both reserve 8 items by writing inventory = 2. Final inventory = 2, yet 16 units were promised from a stock of 10 (oversold).
Dirty reads	A saga reads data that is later compensated (rolled back).	Saga B reads order as CONFIRMED (during Saga A's execution). Saga A later fails and compensates to CANCELLED. Saga B acted on stale/wrong state.
Fuzzy / non-repeatable reads	A saga reads the same data twice and gets different values because another saga modified it.	Inventory Service reads stock = 10, then moments later reads stock = 3 because another saga reserved 7 units in between.

Countermeasures for Isolation

Since sagas cannot provide ACID isolation, you apply countermeasures — design techniques that mitigate the anomalies:

Countermeasure	How It Works	Example
Semantic locking	Use application-level flags to indicate a resource is being processed by a saga. Other sagas see the flag and wait or skip.	Order status = `APPROVAL_PENDING` instead of `PENDING`. Other sagas that need to modify this order see the lock flag and defer.
Commutative updates	Design operations so the order of execution doesn't matter. Use increments/decrements instead of absolute values.	Instead of `SET stock = 8`, use `stock = stock - 2`. Two concurrent reservations of 2 each produce the same result regardless of order.
Pessimistic view	Reorder saga steps to minimize the business risk of a dirty read, so that updates other sagas might act on happen as late as possible.	In an order-cancellation saga, release the customer's credit in the last step. No other saga can then spend credit that would have to be taken back if the cancellation fails.
Reread value	Before committing, reread the value to check if it has been modified by another saga since the original read.	Before reserving stock, reread the current stock level. If it changed, re-evaluate whether the reservation is still possible.
Version file	Record the operations on a record so that they can be reordered if they arrive out of order.	Attach a version number to each inventory update. If an update with version 3 arrives before version 2, buffer it and apply in order.
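By value (risk-based)	Choose the mechanism per request based on business risk: sagas for low-risk transactions, 2PC or manual review for high-value ones.	Orders under $100 use a saga. Orders over $10,000 use 2PC or require manager approval. Where the line falls in between is a business decision.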
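
The lost-update anomaly and the decrement-based countermeasure can be reproduced in a few lines. This is a single-process illustration with made-up function names; in a real Inventory Service the guarded decrement would be one atomic database statement.

```javascript
// Lost update vs. guarded decrement (single-process illustration).
function reserveWithStaleRead(stock, staleRead, qty) {
  // Writes an absolute value computed from an earlier read.
  stock.available = staleRead - qty;
}

function reserveAtomically(stock, qty) {
  // Mirrors: UPDATE stock SET available = available - :qty WHERE available >= :qty
  if (stock.available < qty) return false;
  stock.available -= qty;
  return true;
}

const naive = { available: 10 };
const readA = naive.available; // saga A reads 10
const readB = naive.available; // saga B reads 10
reserveWithStaleRead(naive, readA, 8);
reserveWithStaleRead(naive, readB, 8);
console.log(naive.available); // 2, although 16 units were promised

const safe = { available: 10 };
console.log(reserveAtomically(safe, 8)); // true
console.log(reserveAtomically(safe, 8)); // false: only 2 left
console.log(safe.available);             // 2
```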

Real-world wisdom: In practice, most e-commerce systems tolerate the reduced isolation of sagas. The window of vulnerability (between a local commit and the next step) is usually short, and the probability of two sagas conflicting on the same resource inside that window is low. When a conflict does happen, the countermeasures above and, as a last resort, a compensating transaction or manual fix clean it up. The business impact of a rare oversell or double-charge followed by an automatic refund is usually much lower than the cost of a blocking 2PC protocol.

Implementation Patterns

Saga + Event Sourcing

Event sourcing pairs naturally with sagas. Each service stores its state as a sequence of events. The saga events become first-class citizens in the event store:

// Event store for Order Service
[
  { type: 'OrderCreated',   orderId: 'ORD-001', timestamp: '...', data: { items: [...], total: 99.99 } },
  { type: 'PaymentCompleted', orderId: 'ORD-001', timestamp: '...', data: { chargeId: 'ch_xxx' } },
  { type: 'StockReserved',   orderId: 'ORD-001', timestamp: '...', data: { warehouse: 'WH-East' } },
  { type: 'ShippingFailed',  orderId: 'ORD-001', timestamp: '...', data: { reason: 'address unreachable' } },
  { type: 'StockReleased',   orderId: 'ORD-001', timestamp: '...', data: {} },
  { type: 'PaymentRefunded', orderId: 'ORD-001', timestamp: '...', data: { refundId: 're_xxx' } },
  { type: 'OrderCancelled',  orderId: 'ORD-001', timestamp: '...', data: { reason: 'shipping failed' } },
]

// Rebuild current state by replaying events
function rebuildOrderState(events) {
  let state = { status: 'UNKNOWN' };
  for (const event of events) {
    switch (event.type) {
      case 'OrderCreated':    state = { ...state, ...event.data, status: 'PENDING' }; break;
      case 'PaymentCompleted': state.status = 'PAYMENT_DONE'; break;
      case 'StockReserved':    state.status = 'STOCK_RESERVED'; break;
      case 'OrderCancelled':  state.status = 'CANCELLED'; break;
    }
  }
  return state;
}

Saga + Transactional Outbox

A critical implementation detail: how do you atomically update the local database and publish an event? If the service crashes after the DB commit but before publishing the event, the saga stalls. The Transactional Outbox pattern solves this:

Within the same local transaction, write the business data and an event record to an outbox table.
A separate relay process (or CDC — Change Data Capture) reads unpublished events from the outbox table and publishes them to the message broker.
Once published, the relay marks the outbox record as sent.

-- Single local transaction in the Payment Service
BEGIN;
  INSERT INTO payments (order_id, charge_id, amount, status)
    VALUES ('ORD-001', 'ch_xxx', 99.99, 'COMPLETED');

  INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload)
    VALUES (
      gen_random_uuid(),  -- PostgreSQL 13+; in MySQL use UUID()
      'Payment',
      'ORD-001',
      'PaymentCompleted',
      '{"orderId":"ORD-001","chargeId":"ch_xxx","amount":99.99}'
    );
COMMIT;

-- Relay process (Debezium CDC or polling) picks up the outbox row
-- and publishes it to Kafka / RabbitMQ / SNS
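
The polling variant of the relay can be sketched as follows. Everything here is a stand-in: the outbox is an in-memory array rather than a table, makeBroker is a hypothetical broker client whose failOn parameter simulates an outage, and relayOnce represents one polling pass.

```javascript
// Polling outbox relay (illustrative; the outbox table is an in-memory array).
const outbox = [
  { id: 1, eventType: 'PaymentCompleted', payload: { orderId: 'ORD-001' }, sent: false },
  { id: 2, eventType: 'PaymentRefunded',  payload: { orderId: 'ORD-002' }, sent: false },
];
const published = [];

// Hypothetical broker client: `failOn` simulates a broker outage for one row.
function makeBroker(failOn) {
  return {
    publish(row) {
      if (row.id === failOn) throw new Error('broker unavailable');
      published.push(row.id);
    },
  };
}

function relayOnce(broker) {
  for (const row of outbox.filter((r) => !r.sent)) {
    try {
      broker.publish(row);
      row.sent = true; // mark only after the broker accepted the event
    } catch (err) {
      break; // leave the row unsent and stop, preserving order; retry next poll
    }
  }
}

relayOnce(makeBroker(2));    // row 1 is published, row 2 fails
relayOnce(makeBroker(null)); // next poll: row 2 is retried
console.log(published);      // [ 1, 2 ]
```

If the relay crashed after publishing but before marking the row as sent, the next poll would publish the same event again. That is why the outbox gives at-least-once delivery and consumers still need to be idempotent.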

Ensuring Idempotency

Every saga participant must handle duplicate messages gracefully. Common approaches:

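// Pattern 1: Idempotency key in the database
// Note: check-then-insert can race when two duplicates arrive at once;
// back it with the unique index from Pattern 2.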
async function handleReserveStock(command) {
  const existing = await db.reservations.findOne({
    sagaId: command.sagaId,
    step: 'reserve-stock',
  });
  if (existing) {
    // Already processed — return same result (idempotent)
    return existing.result;
  }

  const result = await doReserveStock(command.items);

  await db.reservations.insert({
    sagaId: command.sagaId,
    step: 'reserve-stock',
    result,
    processedAt: new Date(),
  });

  return result;
}

// Pattern 2: Database unique constraint
// CREATE UNIQUE INDEX idx_reservation_saga ON reservations(saga_id, step);
// INSERT will fail on duplicate — catch the error and return success
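
Pattern 2 might look like the sketch below. The unique index is imitated with a Map and a custom DuplicateKeyError, both invented for this example; with a real database you would catch the driver's unique-violation error instead, and run the insert and the stock update in the same local transaction.

```javascript
// Pattern 2 sketch: let a unique key reject duplicates (in-memory stand-in).
class DuplicateKeyError extends Error {}

const reservationsTable = new Map(); // key "sagaId:step" plays the unique index
function insertReservation(sagaId, step, result) {
  const key = `${sagaId}:${step}`;
  if (reservationsTable.has(key)) throw new DuplicateKeyError(key);
  reservationsTable.set(key, result);
}

let reserveCalls = 0;
function handleReserveStockOnce(command) {
  const key = `${command.sagaId}:reserve-stock`;
  try {
    // Claim the (sagaId, step) slot first; a duplicate fails here.
    insertReservation(command.sagaId, 'reserve-stock', { reserved: command.items });
  } catch (err) {
    if (err instanceof DuplicateKeyError) return reservationsTable.get(key);
    throw err; // anything else is a real failure
  }
  reserveCalls += 1; // the real reservation work would run here, once
  return reservationsTable.get(key);
}

const first = handleReserveStockOnce({ sagaId: 'saga-1', items: ['SKU-1234'] });
const second = handleReserveStockOnce({ sagaId: 'saga-1', items: ['SKU-1234'] });
console.log(reserveCalls);     // 1
console.log(first === second); // true
```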

Saga vs 2PC vs Outbox Pattern

Aspect	Saga	2PC (Two-Phase Commit)	Transactional Outbox
Purpose	Coordinate distributed transactions across services	Ensure atomicity across distributed databases	Reliably publish events after local DB commit
Scope	Cross-service business workflows	Cross-database atomic commits	Single-service: DB write + event publish
Isolation	None — intermediate states visible	Full ACID — locked until commit	Local only — single DB transaction
Consistency	Eventual (via compensations)	Strong (immediate)	At-least-once delivery (idempotent consumers)
Blocking?	No	Yes — participants hold locks	No
Scalability	High — async, non-blocking	Low — synchronous, lock-heavy	High — local transaction only
Failure handling	Compensating transactions	Coordinator recovery log	Retry relay until published
Complexity	High (compensation logic, idempotency)	Medium (protocol is well-defined)	Low-Medium (outbox table + relay)
Use case	Microservice workflows (e.g., order processing)	Homogeneous databases, short transactions	Reliable event publishing within a single service
Often combined?	Yes — saga + outbox pattern is the standard approach	Rarely combined with saga (different philosophies)	Yes — outbox is a building block inside saga participants

Key relationship: The Outbox pattern is not an alternative to sagas — it is a building block used within each saga participant. Each service uses the outbox pattern to atomically update its local DB and publish events. The saga pattern then coordinates the sequence of these local transactions. Think of outbox as "how a single service reliably publishes events" and saga as "how multiple services coordinate a workflow."

Real-World Saga Implementations

Frameworks & Libraries

Framework	Language	Type	Notable Features
Temporal	Go, Java, Python, TS	Orchestration	Durable workflow engine; built-in retries and timeouts; saga is modeled as a workflow with compensations
Axon Framework	Java/Kotlin	Both	Event sourcing + saga; SagaManager coordinates; integrates with Axon Server
MassTransit	C# (.NET)	Both	State machine-based sagas; supports RabbitMQ, Azure Service Bus, Amazon SQS
Eventuate Tram	Java	Both	By Chris Richardson (author of Microservices Patterns); transactional outbox built-in
NServiceBus	C# (.NET)	Orchestration	Enterprise-grade; saga persistence; built-in message retry and dead-letter
AWS Step Functions	Any (via Lambda)	Orchestration	Serverless saga orchestrator; visual workflow designer; built-in error handling

Example: Saga with Temporal

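// Temporal workflow (Go): saga with explicit compensation.
// NewSaga is assumed here to be a small application-level helper (hypothetical)
// that records compensations and runs them in reverse order.
// Imports assumed: "time" and "go.temporal.io/sdk/workflow".
func OrderSagaWorkflow(ctx workflow.Context, order Order) error {
    // Activities need a timeout set through ActivityOptions before they can run.
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: time.Minute,
    })
    saga := NewSaga()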

    // Step 1: Create Order
    err := workflow.ExecuteActivity(ctx, CreateOrder, order).Get(ctx, nil)
    if err != nil { return err }
    saga.AddCompensation(ctx, CancelOrder, order.ID)

    // Step 2: Charge Payment
    var chargeResult ChargeResult
    err = workflow.ExecuteActivity(ctx, ChargePayment, order).Get(ctx, &chargeResult)
    if err != nil {
        saga.Compensate(ctx)  // Runs CancelOrder
        return err
    }
    saga.AddCompensation(ctx, RefundPayment, chargeResult.ChargeID)

    // Step 3: Reserve Inventory
    err = workflow.ExecuteActivity(ctx, ReserveStock, order.Items).Get(ctx, nil)
    if err != nil {
        saga.Compensate(ctx)  // Runs RefundPayment, then CancelOrder
        return err
    }
    saga.AddCompensation(ctx, ReleaseStock, order.Items)

    // Step 4: Schedule Shipping
    err = workflow.ExecuteActivity(ctx, ScheduleShipping, order).Get(ctx, nil)
    if err != nil {
        saga.Compensate(ctx)  // Runs ReleaseStock, RefundPayment, CancelOrder
        return err
    }

    return nil  // Saga completed successfully
}

Best Practices

Make all participants idempotent. With at-least-once message delivery, every handler will receive duplicates. Use idempotency keys, unique constraints, or state checks.
Use the transactional outbox pattern. Never rely on "commit DB then publish event" — the process can crash between the two. Use an outbox table with CDC or polling relay.
Design compensations carefully. A compensation is not always a simple "undo." Think about what side effects have occurred (emails sent, webhooks fired, external APIs called) and handle each one.
Keep sagas short. The longer a saga runs, the larger the window for isolation anomalies and the more complex the compensation logic. Aim for 3–5 steps.
Order steps by failure probability. Put the most likely-to-fail step early in the saga. If payment validation often fails, do it before inventory reservation — fewer compensations needed.
Implement timeouts. If a participant doesn't respond within a deadline, treat it as a failure and begin compensation. Don't let sagas hang indefinitely.
Use semantic locking. Flag resources being processed by a saga (e.g., order status = PROCESSING) so other sagas don't interfere.
Monitor and alert. Track saga durations, failure rates, compensation rates, and stuck sagas. A saga that has been in COMPENSATING state for 10 minutes needs attention.
Dead letter queue for failed compensations. If a compensating transaction fails after all retries, don't silently swallow the failure. Send it to a dead letter queue for manual intervention.
Use saga IDs for correlation. Every message and log entry should include the saga ID so you can trace the entire flow across services.

Saga Pattern in System Design Interviews

When to Bring Up Sagas

Sagas are relevant whenever you're designing a microservices system with operations that span multiple services:

E-commerce order processing (order → payment → inventory → shipping)
Travel booking (flight → hotel → car rental)
Banking (transfer money between accounts at different banks)
User registration (create account → verify email → provision resources)
Subscription management (activate plan → charge billing → provision features)

Interview tip: When the interviewer asks "how do you handle distributed transactions?", start with: "In microservices, we avoid 2PC because it's blocking and doesn't scale. Instead, we use the Saga pattern — a sequence of local transactions with compensating transactions for rollback. I'd choose between choreography and orchestration based on the complexity of the workflow."

Framework for Answering

Identify the distributed transaction: "Placing an order involves Order, Payment, Inventory, and Shipping services."
Explain why 2PC won't work: "2PC is blocking and doesn't scale for microservices with heterogeneous databases."
Choose saga type: "I'd use orchestration because we have 4+ steps with potential branching."
Define steps and compensations: List each Ti and Ci.
Address isolation: "We use semantic locking and commutative updates to mitigate isolation anomalies."
Address reliability: "Each service uses the transactional outbox pattern for atomic DB + event publishing. All handlers are idempotent."
Address failure: "If a step fails, the orchestrator runs compensations in reverse. Failed compensations go to a dead letter queue."

Common Interview Questions

"What happens if a compensating transaction fails?" — Retry with exponential backoff. Compensations must be idempotent. After max retries, send to DLQ for manual resolution. Log everything for audit.
"How do you guarantee exactly-once processing?" — You can't in a distributed system. Instead, design for at-least-once delivery with idempotent handlers. Use unique constraints or idempotency keys.
"What about performance?" — Sagas are asynchronous and non-blocking, so throughput is high. Latency is higher than a single ACID transaction (due to multiple network hops), but acceptable for most workflows.
"When would you NOT use a saga?" — When strong isolation is required (e.g., financial ledger updates within a single database), when the transaction involves only 2 services and can use 2PC, or when the business logic is simple enough that a single service can own the entire workflow.

Summary

Concept	Key Takeaway
Saga	Sequence of local transactions + compensating transactions for distributed workflows
Choreography	Event-driven, decentralized; best for simple, linear sagas
Orchestration	Central coordinator directs steps; best for complex, branching sagas
Compensation	Semantic undo (not rollback); must be idempotent
Isolation	No ACID isolation; use semantic locking, commutative updates, pessimistic view
Outbox	Building block for reliable event publishing within each saga participant
SEC	Saga Execution Coordinator — stateful state machine managing the saga lifecycle

The Saga pattern is the de facto standard for managing distributed transactions in microservices. It trades the strong guarantees of ACID for the availability, scalability, and loose coupling that microservices demand. The key to a successful saga implementation is careful compensation design, idempotent participants, and reliable event delivery via the transactional outbox pattern. In the next post, we will explore the Circuit Breaker pattern — another essential tool for building resilient distributed systems.