The Saga Pattern: Distributed Transactions Without the Pain

When a single database holds all your data, transactions are straightforward: begin, commit, rollback. But in a microservices architecture, where each service owns its database, a single business operation such as placing an order might need to update data across three or four separate services. Two-phase commit (2PC) is theoretically possible but rarely practical: it couples every participant to a shared coordinator, holds locks across services until the slowest participant has voted, and leaves participants blocked if the coordinator fails mid-protocol.

The Saga pattern is the standard answer. Instead of one big distributed transaction, a saga breaks the operation into a sequence of local transactions, each affecting only one service’s database. If something fails partway through, the saga runs compensating transactions to undo what came before. A saga is not fully ACID: other requests can observe intermediate states, and the system is only eventually consistent. In exchange, no lock is ever held across services, which is why the approach holds up at scale.

There are two fundamentally different ways to coordinate a saga: choreography (event-driven, decentralized) and orchestration (command-driven, centralized). Choosing the wrong one creates either spaghetti event chains or a bottleneck orchestrator. Here’s how both work, when to use each, and what the Go implementation looks like.

How a Saga Works

Consider an e-commerce order flow that spans three services:

  • Order Service — creates the order
  • Payment Service — charges the customer’s credit card
  • Inventory Service — reserves stock for the items

In a saga, this becomes a sequence of local transactions: create order (PENDING) → reserve credit → reserve inventory → approve order. The payment step places a hold on the customer’s credit rather than charging the card outright, so it can be undone. If the inventory reservation fails (out of stock), the saga runs compensating transactions in reverse order: release the credit hold, then cancel the order. Each step is a normal database transaction within a single service, followed by a message published to a broker such as NATS, Kafka, or RabbitMQ.
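
The forward-then-backward behaviour described above can be sketched in a single process. This is a minimal illustration, not a production design: the Step type, RunSaga, and orderSaga are hypothetical names, and plain function calls stand in for the broker messages and per-service database transactions.

```go
package main

import (
    "errors"
    "fmt"
)

// Step pairs a local transaction with the compensation that undoes it.
// The names and signatures are illustrative, not taken from any library.
type Step struct {
    Name string
    Do   func() error
    Undo func() error // nil when there is nothing to undo
}

// RunSaga executes steps in order. On the first failure it runs Undo for
// every step that already succeeded, newest first, and returns the original
// error. The returned log records what ran, purely for illustration.
func RunSaga(steps []Step) (log []string, err error) {
    for i, s := range steps {
        if err = s.Do(); err != nil {
            log = append(log, "failed: "+s.Name)
            for j := i - 1; j >= 0; j-- {
                if steps[j].Undo == nil {
                    continue
                }
                if uerr := steps[j].Undo(); uerr != nil {
                    // A real system would retry; here we only record it.
                    log = append(log, "undo failed: "+steps[j].Name)
                    continue
                }
                log = append(log, "undone: "+steps[j].Name)
            }
            return log, err
        }
        log = append(log, "done: "+s.Name)
    }
    return log, nil
}

var errOutOfStock = errors.New("out of stock")

// orderSaga builds the example flow; inStock toggles the failure path.
func orderSaga(inStock bool) []Step {
    ok := func() error { return nil }
    reserveInventory := func() error {
        if !inStock {
            return errOutOfStock
        }
        return nil
    }
    return []Step{
        {Name: "create order", Do: ok, Undo: ok},   // undo = cancel order
        {Name: "reserve credit", Do: ok, Undo: ok}, // undo = release hold
        {Name: "reserve inventory", Do: reserveInventory, Undo: ok},
        {Name: "approve order", Do: ok},
    }
}

func main() {
    log, err := RunSaga(orderSaga(false))
    for _, line := range log {
        fmt.Println(line)
    }
    fmt.Println("result:", err)
}
```

With inStock set to false, the log shows two completed steps, the failed inventory reservation, and then the credit release followed by the order cancellation.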

The critical constraint: every saga step, and every compensating transaction in particular, must be idempotent. Brokers typically deliver at least once, so sooner or later a message will arrive twice, and releasing a credit hold twice shouldn’t fail or corrupt data. Design every operation so that running it multiple times produces the same result as running it once.
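
One way to get that property is to key the undo on the order ID and on stored state, rather than trusting the amount carried in the message. The sketch below contrasts the two; the Ledger type and its methods are hypothetical, and a map stands in for the payment service’s database.

```go
package main

import "fmt"

// Ledger is a hypothetical in-memory stand-in for the payment service's data.
type Ledger struct {
    Available int
    Holds     map[string]int // order ID -> amount on hold
}

// Hold reserves credit for an order. A repeated call for the same order is
// a no-op, so a redelivered command cannot reserve twice.
func (l *Ledger) Hold(orderID string, amount int) {
    if _, exists := l.Holds[orderID]; exists {
        return
    }
    l.Available -= amount
    l.Holds[orderID] = amount
}

// NaiveRelease trusts the message: every delivery adds the amount back.
func (l *Ledger) NaiveRelease(amount int) { l.Available += amount }

// Release is the idempotent compensation. If no hold exists for the order
// (already released, or never placed), it changes nothing.
func (l *Ledger) Release(orderID string) {
    amount, exists := l.Holds[orderID]
    if !exists {
        return
    }
    l.Available += amount
    delete(l.Holds, orderID)
}

func main() {
    naive := &Ledger{Available: 100, Holds: map[string]int{}}
    naive.Hold("order-1", 40)
    naive.NaiveRelease(40)
    naive.NaiveRelease(40)       // redelivered message
    fmt.Println(naive.Available) // prints 140: credit that never existed

    safe := &Ledger{Available: 100, Holds: map[string]int{}}
    safe.Hold("order-1", 40)
    safe.Release("order-1")
    safe.Release("order-1")     // redelivered message: no effect
    fmt.Println(safe.Available) // prints 100
}
```

This version does not cover a release that arrives before its hold (out-of-order delivery); handling that would likely mean remembering released order IDs as well.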

Choreography: Events All the Way Down

In a choreographed saga, there’s no central coordinator. Each service listens for events and reacts. The order service emits OrderCreated, the payment service picks it up and emits CreditReserved or CreditLimitExceeded, the inventory service reacts to CreditReserved, and so on. Each service knows only about its immediate predecessor and successor in the chain.

// Choreography: Order Service reacts to events from other services
func (h *OrderHandler) HandleCreditReserved(ctx context.Context, evt CreditReserved) error {
    // Move order to next step — try reserving inventory
    order, err := h.repo.GetByID(ctx, evt.OrderID)
    if err != nil {
        return err
    }
    order.Status = "CREDIT_RESERVED"
    if err := h.repo.Save(ctx, order); err != nil {
        return err
    }
    // Publish event for the next service in the chain
    return h.bus.Publish(ctx, "order.credit_reserved", OrderCreditReserved{
        OrderID: order.ID,
        Items:   order.Items,
    })
}

func (h *OrderHandler) HandleInventoryReserved(ctx context.Context, evt InventoryReserved) error {
    order, err := h.repo.GetByID(ctx, evt.OrderID)
    if err != nil {
        return err
    }
    order.Status = "APPROVED"
    return h.repo.Save(ctx, order)
}

// Compensating handler: if inventory fails, undo credit reservation
func (h *OrderHandler) HandleInventoryReservationFailed(ctx context.Context, evt InventoryReservationFailed) error {
    order, err := h.repo.GetByID(ctx, evt.OrderID)
    if err != nil {
        return err
    }
    order.Status = "CANCELLED"
    order.CancelReason = "out of stock"
    if err := h.repo.Save(ctx, order); err != nil {
        return err
    }
    // Tell payment service to release the credit hold
    return h.bus.Publish(ctx, "order.cancelled", OrderCancelled{
        OrderID: order.ID,
    })
}

Choreography is attractive because it’s loosely coupled: services don’t call each other directly. But the coupling doesn’t disappear; it shifts into the event flow itself. To understand what happens when an order is placed, you have to trace events across all three services. For simple flows (3-4 steps), this is manageable. For complex workflows with branching logic, it becomes a maintenance nightmare.
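
To make the “coupling shifts into the event flow” point concrete, here is a toy wiring of the order flow. It is a sketch under loud assumptions: the Bus type is a synchronous in-memory stand-in (a real broker is asynchronous and delivers at least once), and every topic name except order.credit_reserved and order.cancelled is invented for the example.

```go
package main

import "fmt"

// Bus is a toy synchronous event bus, assumed for illustration only.
type Bus struct {
    handlers map[string][]func(orderID string)
    Trail    []string // every topic published, in order
}

func NewBus() *Bus {
    return &Bus{handlers: make(map[string][]func(orderID string))}
}

func (b *Bus) Subscribe(topic string, h func(orderID string)) {
    b.handlers[topic] = append(b.handlers[topic], h)
}

func (b *Bus) Publish(topic, orderID string) {
    b.Trail = append(b.Trail, topic)
    for _, h := range b.handlers[topic] {
        h(orderID)
    }
}

// wire registers each service's reactions. No single function describes the
// whole flow; it exists only as the sum of these subscriptions.
func wire(b *Bus, inStock bool) {
    // Payment service
    b.Subscribe("order.created", func(id string) { b.Publish("payment.credit_reserved", id) })
    b.Subscribe("order.cancelled", func(id string) { b.Publish("payment.credit_released", id) })

    // Order service
    b.Subscribe("payment.credit_reserved", func(id string) { b.Publish("order.credit_reserved", id) })
    b.Subscribe("inventory.reserved", func(id string) { b.Publish("order.approved", id) })
    b.Subscribe("inventory.reservation_failed", func(id string) { b.Publish("order.cancelled", id) })

    // Inventory service
    b.Subscribe("order.credit_reserved", func(id string) {
        if inStock {
            b.Publish("inventory.reserved", id)
        } else {
            b.Publish("inventory.reservation_failed", id)
        }
    })
}

func main() {
    b := NewBus()
    wire(b, false) // simulate the out-of-stock path
    b.Publish("order.created", "order-1")
    for _, topic := range b.Trail {
        fmt.Println(topic)
    }
}
```

The printed trail is the only place the end-to-end flow becomes visible, which is roughly what tracing a choreographed saga in production feels like.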

Orchestration: A Central Conductor

In an orchestrated saga, a dedicated saga orchestrator (typically running inside the service that initiates the operation) tells each participant what to do and tracks the overall state. Instead of services chaining events to each other, the orchestrator sends commands and receives replies.

type SagaStep struct {
    Name       string
    Action     string // command to send
    Compensate string // compensating command if something fails later
}

type CreateOrderSaga struct {
    OrderID string
    Steps   []SagaStep
    Current int
}

func NewCreateOrderSaga(orderID string) *CreateOrderSaga {
    return &CreateOrderSaga{
        OrderID: orderID,
        Steps: []SagaStep{
            {Name: "reserve_credit", Action: "ReserveCredit", Compensate: "ReleaseCredit"},
            {Name: "reserve_inventory", Action: "ReserveInventory", Compensate: "ReleaseInventory"},
            {Name: "approve_order", Action: "ApproveOrder"},
        },
        Current: 0,
    }
}

func (s *CreateOrderSaga) HandleReply(ctx context.Context, reply Reply, bus MessageBus) error {
    if !reply.Success {
        // Run compensating transactions for all completed steps, in reverse
        return s.compensate(ctx, bus)
    }

    s.Current++
    if s.Current >= len(s.Steps) {
        return nil // saga complete
    }

    // Send next command
    return bus.SendCommand(ctx, s.Steps[s.Current].Action, s.OrderID)
}

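// compensate undoes every completed step in reverse order. s.Current is the
// index of the step that failed, so completed steps are 0..s.Current-1.
// Requires the errors package (errors.Join needs Go 1.20+).
func (s *CreateOrderSaga) compensate(ctx context.Context, bus MessageBus) error {
    var errs []error
    for i := s.Current - 1; i >= 0; i-- {
        if s.Steps[i].Compensate == "" {
            continue // this step has nothing to undo
        }
        if err := bus.SendCommand(ctx, s.Steps[i].Compensate, s.OrderID); err != nil {
            errs = append(errs, err)
        }
    }
    if len(errs) > 0 {
        // Wrap all failures, not just the first one.
        return fmt.Errorf("compensation had %d errors: %w", len(errs), errors.Join(errs...))
    }
    return nil
}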

The orchestrator holds the entire workflow in one place. When a new developer joins the team, they can read the saga definition and understand the full order flow in a single file. The tradeoff: the orchestrator becomes a central piece of infrastructure that needs to be highly available and recoverable.
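
To see which commands the orchestrator actually emits, the sketch below compresses CreateOrderSaga into a few lines and drives it with scripted replies. The recordingBus type, the Start method, and the simulate helper are assumptions added for illustration; the listing above does not show how the first command is sent or what MessageBus looks like.

```go
package main

import (
    "context"
    "fmt"
)

type step struct{ action, compensate string }

// saga mirrors the shape of CreateOrderSaga in compressed form.
type saga struct {
    orderID string
    steps   []step
    current int
}

// recordingBus is a hypothetical bus that only remembers the commands sent.
type recordingBus struct{ sent []string }

func (b *recordingBus) SendCommand(_ context.Context, cmd, orderID string) error {
    b.sent = append(b.sent, cmd)
    return nil
}

// Start kicks the saga off by sending the first command.
func (s *saga) Start(ctx context.Context, bus *recordingBus) error {
    return bus.SendCommand(ctx, s.steps[0].action, s.orderID)
}

// HandleReply advances on success and compensates completed steps on failure.
func (s *saga) HandleReply(ctx context.Context, ok bool, bus *recordingBus) error {
    if !ok {
        for i := s.current - 1; i >= 0; i-- {
            if c := s.steps[i].compensate; c != "" {
                if err := bus.SendCommand(ctx, c, s.orderID); err != nil {
                    return err
                }
            }
        }
        return nil
    }
    s.current++
    if s.current >= len(s.steps) {
        return nil // saga complete
    }
    return bus.SendCommand(ctx, s.steps[s.current].action, s.orderID)
}

// simulate feeds scripted participant replies into a fresh saga and returns
// every command the orchestrator sent.
func simulate(replies ...bool) []string {
    ctx := context.Background()
    bus := &recordingBus{}
    s := &saga{orderID: "order-1", steps: []step{
        {action: "ReserveCredit", compensate: "ReleaseCredit"},
        {action: "ReserveInventory", compensate: "ReleaseInventory"},
        {action: "ApproveOrder"},
    }}
    _ = s.Start(ctx, bus)
    for _, ok := range replies {
        _ = s.HandleReply(ctx, ok, bus)
    }
    return bus.sent
}

func main() {
    fmt.Println(simulate(true, false))      // credit ok, inventory fails
    fmt.Println(simulate(true, true, true)) // happy path
}
```

In the first run the inventory reservation fails, so the only compensation sent is ReleaseCredit: ReserveInventory never succeeded and therefore has nothing to undo.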

Persistence: Don’t Lose the Saga State

If the orchestrator crashes mid-saga, it must be able to pick up where it left off. This means persisting saga state to a durable store — typically the same database the initiating service already uses. A simple table tracks each saga instance:

CREATE TABLE saga_instances (
    id             UUID PRIMARY KEY,
    saga_type      VARCHAR(100) NOT NULL,
    correlation_id UUID NOT NULL,
    current_step   INT NOT NULL DEFAULT 0,
    status         VARCHAR(20) NOT NULL, -- PENDING, COMPLETED, COMPENSATING, FAILED
    payload        JSONB,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    updated_at     TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

When a reply arrives, the handler loads the saga instance, advances the state machine, and saves it back — all within a local transaction. If the service restarts, it queries for sagas in PENDING or COMPENSATING status and re-sends the pending commands. This is where the idempotency requirement pays off: re-sending a command that was already processed is safe because the receiving service handles duplicates gracefully.
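
The load, advance, save, and recover cycle might look roughly like this. It is a sketch with an in-memory map standing in for the saga_instances table; SagaRecord, Store, OnReply, and Recover are hypothetical names, and the transition from COMPENSATING to FAILED (once compensation replies arrive) is left out for brevity.

```go
package main

import "fmt"

// SagaRecord mirrors a row of saga_instances (payload and timestamps omitted).
type SagaRecord struct {
    ID          string
    CurrentStep int
    Status      string // PENDING, COMPLETED, COMPENSATING, FAILED
}

// Store is an in-memory stand-in for the table. In a real service, the load
// and save inside OnReply would share one database transaction.
type Store struct{ rows map[string]SagaRecord }

var actions = []string{"ReserveCredit", "ReserveInventory", "ApproveOrder"}
var compensations = []string{"ReleaseCredit", "ReleaseInventory", ""}

// OnReply loads the record, advances the state machine, and saves it back.
func (st *Store) OnReply(id string, success bool) {
    rec := st.rows[id]
    switch {
    case !success:
        rec.Status = "COMPENSATING"
    case rec.CurrentStep+1 >= len(actions):
        rec.Status = "COMPLETED"
    default:
        rec.CurrentStep++
    }
    st.rows[id] = rec
}

// Recover runs at startup. It returns the commands to re-send for every
// unfinished saga, relying on participants to ignore duplicates.
func (st *Store) Recover() map[string][]string {
    resend := make(map[string][]string)
    for id, rec := range st.rows {
        switch rec.Status {
        case "PENDING":
            resend[id] = []string{actions[rec.CurrentStep]}
        case "COMPENSATING":
            for i := rec.CurrentStep - 1; i >= 0; i-- {
                if compensations[i] != "" {
                    resend[id] = append(resend[id], compensations[i])
                }
            }
        }
    }
    return resend
}

// demo builds three sagas in different states, then simulates a restart.
func demo() map[string][]string {
    st := &Store{rows: map[string]SagaRecord{
        "saga-a": {ID: "saga-a", Status: "PENDING"},
        "saga-b": {ID: "saga-b", Status: "PENDING"},
        "saga-c": {ID: "saga-c", Status: "PENDING"},
    }}
    st.OnReply("saga-a", true) // credit reserved, waiting on inventory
    st.OnReply("saga-b", true)
    st.OnReply("saga-b", false) // inventory failed, credit must be released
    for i := 0; i < 3; i++ {
        st.OnReply("saga-c", true) // runs to completion
    }
    // Simulated crash and restart: only the stored rows survive.
    return st.Recover()
}

func main() {
    fmt.Println(demo())
}
```

After the simulated restart, saga-a gets its ReserveInventory command re-sent, saga-b gets ReleaseCredit, and the completed saga-c is left alone.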

Choosing Between Choreography and Orchestration

The decision isn’t binary — many systems use both. Here’s a practical guideline:

  • Use choreography when the flow is simple (2-3 steps), services are owned by different teams, and you want maximum decoupling. Example: a user signup flow that sends a welcome email and creates a default workspace.
  • Use orchestration when the flow has conditional branching, timeouts, or more than 4-5 steps. The orchestrator gives you a single place to reason about the workflow, handle failures, and add observability. Example: an order fulfillment pipeline that reserves credit, checks inventory, calculates shipping, applies loyalty discounts, and notifies the warehouse.
  • Hybrid approach: use orchestration for the main business flow, but let individual services publish domain events for side effects (sending notifications, updating read models) that don’t require compensation.

Pitfalls That Will Bite You

  • Semantic locking: A saga holds a “soft lock” on resources (order in PENDING, credit reserved but not charged). If another saga tries to reserve the same credit, it needs to check the PENDING state. This is not a database lock — it’s a business-level check that you must implement explicitly.
  • Missing compensations: Every step that modifies state needs a compensation. Forgetting one leaves your system in an inconsistent state when failures happen. Audit your saga definitions with this question: “If the process crashes after this step, can we cleanly undo it?”
  • Observability gap: In a choreographed saga, tracing the flow requires correlation IDs propagated through every event. Without them, debugging a failed order means grepping logs across multiple services. In an orchestrated saga, the orchestrator’s log gives you the flow, but you still need tracing to see what happened inside each participant.
  • Saga isolation: Two sagas operating on the same entity can produce conflicting intermediate states. This is the “lack of isolation” problem inherent to sagas. Countermeasures include semantic locks (PENDING states), commutative updates, or pessimistic locks at the saga level — but they all add complexity.
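
The semantic-locking item above is easier to reason about with code in front of you. In this sketch, pending holds count as already spent when a second saga asks for credit. The Account type and its methods are hypothetical, and a real implementation would do the check and the write inside one local transaction.

```go
package main

import (
    "errors"
    "fmt"
)

var ErrInsufficientCredit = errors.New("insufficient credit")

// Account is a hypothetical payment-service record. Pending holds belong to
// sagas that have reserved credit but not yet completed or compensated.
type Account struct {
    Limit   int
    Charged int
    Pending map[string]int // saga ID -> amount on hold
}

// Available treats pending holds as already spent. That is the semantic
// lock: no database lock is held, yet a concurrent saga cannot use the
// same credit.
func (a *Account) Available() int {
    free := a.Limit - a.Charged
    for _, amount := range a.Pending {
        free -= amount
    }
    return free
}

// Reserve is the explicit business-level check.
func (a *Account) Reserve(sagaID string, amount int) error {
    if _, held := a.Pending[sagaID]; held {
        return nil // duplicate command for the same saga
    }
    if amount > a.Available() {
        return ErrInsufficientCredit
    }
    a.Pending[sagaID] = amount
    return nil
}

// Commit turns the hold into a charge when the saga completes.
func (a *Account) Commit(sagaID string) {
    a.Charged += a.Pending[sagaID]
    delete(a.Pending, sagaID)
}

// Release drops the hold when the saga compensates.
func (a *Account) Release(sagaID string) { delete(a.Pending, sagaID) }

func main() {
    acct := &Account{Limit: 100, Pending: map[string]int{}}
    fmt.Println(acct.Reserve("saga-1", 70)) // <nil>
    fmt.Println(acct.Reserve("saga-2", 50)) // insufficient credit
    acct.Release("saga-1")                  // saga-1 compensates
    fmt.Println(acct.Reserve("saga-2", 50)) // <nil>
}
```

Whether a blocked saga should fail immediately, as here, or wait and retry is a business decision the pattern does not make for you.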

When to Avoid Sagas Entirely

Sagas add significant complexity. Before reaching for one, ask whether you actually need it. If two services can share a database (not always wrong, especially early on), a local transaction is simpler. If the operation is rarely used and manual compensation is acceptable (a human reverses the charge), the engineering effort of a saga may not be worth it. And if you can model the operation as a single local transaction in one service, which records its events alongside the state change and publishes them afterwards for downstream processing (the Outbox Pattern), you avoid the saga entirely.
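
For readers who haven’t met the Outbox Pattern, a rough in-memory sketch follows. The DB, OutboxRow, PlaceOrder, and Relay names are invented for this example; in a real service the two writes in PlaceOrder would be one SQL transaction touching an orders table and an outbox table, and Relay would be a background poller or a change-data-capture feed.

```go
package main

import (
    "errors"
    "fmt"
)

var errBrokerDown = errors.New("broker unavailable")

// OutboxRow is an event waiting to be published.
type OutboxRow struct {
    ID        int
    Topic     string
    OrderID   string
    Published bool
}

// DB is an in-memory stand-in for one service's database.
type DB struct {
    Orders map[string]string // order ID -> status
    Outbox []OutboxRow
}

// PlaceOrder writes the order and the event describing it together, so
// there is never an order without its event or an event without its order.
func (db *DB) PlaceOrder(orderID string) {
    db.Orders[orderID] = "PENDING"
    db.Outbox = append(db.Outbox, OutboxRow{
        ID:      len(db.Outbox) + 1,
        Topic:   "order.created",
        OrderID: orderID,
    })
}

// Relay publishes rows that have not gone out yet and marks each one only
// after its publish succeeds. A crash between those two steps causes a
// re-publish, so downstream consumers still need to be idempotent.
func (db *DB) Relay(publish func(OutboxRow) error) (sent int) {
    for i := range db.Outbox {
        row := &db.Outbox[i]
        if row.Published {
            continue
        }
        if err := publish(*row); err != nil {
            break // preserve ordering; retry on the next pass
        }
        row.Published = true
        sent++
    }
    return sent
}

func main() {
    db := &DB{Orders: map[string]string{}}
    db.PlaceOrder("order-1")
    db.PlaceOrder("order-2")

    brokerUp := false
    publish := func(r OutboxRow) error {
        if !brokerUp {
            return errBrokerDown
        }
        fmt.Println("published", r.Topic, r.OrderID)
        return nil
    }

    fmt.Println("sent:", db.Relay(publish)) // broker down: nothing sent
    brokerUp = true
    fmt.Println("sent:", db.Relay(publish)) // both rows go out
}
```

Nothing here needs compensation: downstream services react to order.created on their own schedule, and a broker outage only delays the events.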

Sagas are the right tool when you have genuine multi-service transactions that happen frequently, require automated compensation, and can’t be modeled as a simpler event-driven flow. For everything else, simpler patterns win.

Wrapping Up

The Saga pattern trades ACID’s simplicity for distributed-systems reality. Choreography keeps services independent but scatters the workflow logic across event handlers. Orchestration centralizes control but introduces a stateful component you need to persist and recover. Both approaches require idempotent operations, compensating transactions, and correlation IDs for observability.

Start with the simplest approach that works — often choreography for the first 2-3 cross-service flows — and graduate to orchestration when the event chains become hard to follow. The key insight: sagas are not just a pattern, they’re a design discipline. Every step must be undoable, every operation idempotent, and every failure mode planned for. Skip any of those and you’ll be debugging inconsistent state at 2 AM.
