The Saga Pattern in Distributed Systems: Orchestrating Transactions Across Microservices

Every developer who has worked with microservices eventually hits the same wall: a single business operation needs to touch multiple services, and if any step fails, the whole thing needs to unwind cleanly. A user places an order and the payment service charges the card, but the inventory service can’t reserve stock. Now what? You can’t just pretend it didn’t happen: the payment already went through.

This is the distributed transaction problem, and it doesn’t have a clean solution the way monolithic databases do with ACID transactions. The Saga pattern is the industry’s answer: a sequence of local transactions where each step has a compensating action that reverses it if something downstream fails.

In this post, I’ll walk through both saga coordination styles — choreography and orchestration — with working Go code you can adapt to your own services.

The Problem: Why ACID Doesn’t Work Across Services

In a monolith, creating an order might insert a row into an orders table, deduct from inventory, and charge a credit card — all in one database transaction. If the card charge fails, the entire transaction rolls back. Clean.

Split these into separate microservices with their own databases and that single transaction vanishes. Each service can only guarantee consistency within its own boundary. The Saga pattern fills this gap by replacing one atomic transaction with a sequence of smaller, compensatable ones.

The contract is simple: every forward step must have a corresponding compensating action. If step 3 of 4 fails, you execute compensating actions for steps 2 and 1 to undo the partial work.

Choreography vs. Orchestration

Sagas come in two flavors, and choosing between them shapes your architecture:

Choreography — Each service emits events and reacts to events from other services. There’s no central coordinator; services are decoupled and communicate through a message broker. When Service A completes its work, it publishes an event. Service B listens and acts, then publishes its own event, and so on. If a service publishes a failure event, previous participants listen and execute their compensating actions.

Orchestration — A central saga coordinator (sometimes called a “saga manager”) tells each participant what to do next. It sends a command to Service A, waits for the result, then sends a command to Service B. If any step fails, the coordinator sends compensation commands to all previously completed steps.

Choreography works well for simple flows with few participants. It avoids single points of failure and keeps services loosely coupled. But as the flow grows, understanding the end-to-end behavior requires tracing events across multiple services, and adding a new step means updating multiple listeners.

Orchestration gives you a clear view of the entire flow in one place. Adding, removing, or reordering steps only requires changing the coordinator. The tradeoff is tighter coupling to the coordinator, which can become a bottleneck or a single point of failure.

As a rough rule of thumb, orchestration suits flows with four or more steps, and choreography suits simpler two-to-three-step chains. Many systems end up using both.

A Practical Example: Order Fulfillment

Let’s walk through an order fulfillment flow with three services: Order, Payment, and Inventory. The happy path is straightforward — create the order, charge payment, reserve inventory. But we need to handle failures at each step and compensate correctly.

Orchestrated Saga in Go

Here’s a concrete implementation using an orchestration approach. The saga coordinator manages the sequence and handles compensation:

package saga

import (
	"context"
	"errors"
	"fmt"
	"log"
)

// SagaStep defines a single step in the saga with its compensating action.
type SagaStep struct {
	Name       string
	Action     func(ctx context.Context) error
	Compensate func(ctx context.Context) error
}

// Orchestrator executes saga steps in order and compensates on failure.
type Orchestrator struct {
	steps []SagaStep
}

func NewOrchestrator(steps []SagaStep) *Orchestrator {
	return &Orchestrator{steps: steps}
}

func (o *Orchestrator) Run(ctx context.Context) error {
	// Track which steps completed so we can compensate in reverse.
	completed := make([]SagaStep, 0, len(o.steps))

	for _, step := range o.steps {
		log.Printf("[saga] executing step: %s", step.Name)
		if err := step.Action(ctx); err != nil {
			log.Printf("[saga] step %s failed: %v; compensating", step.Name, err)
			// Wrap the original error so callers can still inspect the root cause.
			cause := fmt.Errorf("saga rolled back: step %s failed: %w", step.Name, err)
			return o.compensate(ctx, completed, cause)
		}
		completed = append(completed, step)
	}

	log.Printf("[saga] all %d steps completed successfully", len(o.steps))
	return nil
}

// compensate undoes the completed steps and returns cause, extended with any
// compensation errors. Note that if ctx is already cancelled, compensations
// that honor it will fail too; a production version may want a fresh context.
func (o *Orchestrator) compensate(ctx context.Context, completed []SagaStep, cause error) error {
	// Compensate in reverse order: undo the most recent step first.
	var errs []error
	for i := len(completed) - 1; i >= 0; i-- {
		step := completed[i]
		if err := step.Compensate(ctx); err != nil {
			log.Printf("[saga] compensation failed for %s: %v", step.Name, err)
			errs = append(errs, fmt.Errorf("compensate %s: %w", step.Name, err))
		} else {
			log.Printf("[saga] compensated step: %s", step.Name)
		}
	}

	if len(errs) > 0 {
		// errors.Join and multiple %w verbs require Go 1.20 or later.
		return fmt.Errorf("%w; %d compensation(s) also failed: %w", cause, len(errs), errors.Join(errs...))
	}
	return cause
}

Now let’s wire up the order fulfillment saga:

package main

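import (
	"context"
	"errors"
	"log"
	"math/rand"

	// Hypothetical module path; point this at wherever the saga package lives.
	"example.com/shop/saga"
)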

func main() {
	saga := saga.NewOrchestrator([]saga.SagaStep{
		{
			Name: "create-order",
			Action: func(ctx context.Context) error {
				log.Println("Order service: creating order...")
				return nil // Assume this always succeeds
			},
			Compensate: func(ctx context.Context) error {
				log.Println("Order service: cancelling order")
				return nil
			},
		},
		{
			Name: "charge-payment",
			Action: func(ctx context.Context) error {
				log.Println("Payment service: charging customer...")
				// Simulate occasional payment failures
				if rand.Intn(10) == 0 {
					return errors.New("payment declined")
				}
				return nil
			},
			Compensate: func(ctx context.Context) error {
				log.Println("Payment service: refunding charge")
				return nil
			},
		},
		{
			Name: "reserve-inventory",
			Action: func(ctx context.Context) error {
				log.Println("Inventory service: reserving stock...")
				if rand.Intn(8) == 0 {
					return errors.New("insufficient stock")
				}
				return nil
			},
			Compensate: func(ctx context.Context) error {
				log.Println("Inventory service: releasing reserved stock")
				return nil
			},
		},
	})

	if err := saga.Run(context.Background()); err != nil {
		log.Printf("Saga failed: %v", err)
	}

	// Check saga state and notify the user accordingly
}

If the payment step fails, the orchestrator compensates the order creation. If inventory reservation fails, it compensates both the payment (issue a refund) and the order (cancel it). The compensation always runs in reverse order, which matters — you refund before you cancel the order, not after.

Choreography with Event Publishing

The same flow, implemented as choreography, uses events instead of direct calls. Each service subscribes to relevant events and acts independently:

package saga

import (
	"context"
	"log"
)

// Event represents a saga lifecycle event.
type Event struct {
	SagaID  string
	Type    string
	Payload interface{}
}

// EventHandler processes saga events.
type EventHandler struct {
	broker *EventBroker
}

func (h *EventHandler) StartOrderSaga(ctx context.Context, sagaID string) {
	// Step 1: Create the order and publish event
	log.Println("Order service: creating order")
	h.broker.Publish(Event{
		SagaID:  sagaID,
		Type:    "order_created",
		Payload: map[string]string{"order_id": "ORD-123"},
	})
}

// PaymentService listens for order_created and processes payment.
func (h *EventHandler) OnOrderCreated(event Event) {
	log.Println("Payment service: processing payment...")
	// Simulate payment processing
	paymentErr := processPayment(event.Payload)

	if paymentErr != nil {
		h.broker.Publish(Event{
			SagaID: event.SagaID,
			Type:   "payment_failed",
			Payload: map[string]string{
				"reason": paymentErr.Error(),
			},
		})
		return
	}

	h.broker.Publish(Event{
		SagaID: event.SagaID,
		Type:   "payment_succeeded",
		Payload: map[string]string{"payment_id": "PAY-456"},
	})
}

// InventoryService listens for payment_succeeded and reserves stock.
func (h *EventHandler) OnPaymentSucceeded(event Event) {
	log.Println("Inventory service: reserving stock...")
	if stockErr := reserveStock(event.Payload); stockErr != nil {
		// Notify payment service to refund
		h.broker.Publish(Event{
			SagaID: event.SagaID,
			Type:   "inventory_failed",
			Payload: map[string]string{
				"reason": stockErr.Error(),
			},
		})
		return
	}

	h.broker.Publish(Event{
		SagaID: event.SagaID,
		Type:   "inventory_reserved",
	})
}

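// Compensating handlers listen for failure events.
func (h *EventHandler) OnPaymentFailed(event Event) {
	log.Println("Order service: cancelling order due to payment failure")
}

// OnInventoryFailed stands in for two subscribers. In a real deployment the
// payment and order services would each register their own handler.
func (h *EventHandler) OnInventoryFailed(event Event) {
	log.Println("Payment service: refunding payment due to inventory failure")
	log.Println("Order service: cancelling order due to inventory failure")
}

// processPayment and reserveStock are placeholders for the real service
// calls, which this listing leaves undefined. They always succeed here.
func processPayment(payload interface{}) error { return nil }

func reserveStock(payload interface{}) error { return nil }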

// EventBroker is a simplified message broker interface.
type EventBroker struct {
	handlers map[string][]func(Event)
}

func NewEventBroker() *EventBroker {
	return &EventBroker{handlers: make(map[string][]func(Event))}
}

func (b *EventBroker) On(eventType string, handler func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], handler)
}

func (b *EventBroker) Publish(event Event) {
	for _, handler := range b.handlers[event.Type] {
		handler(event)
	}
}

Notice how each service only knows about the events it cares about. The payment service doesn’t know about inventory, and inventory doesn’t know about orders — they communicate through events. This is the key advantage of choreography, and also its biggest challenge: the flow logic is distributed across all participants, making it harder to reason about the complete sequence.
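The listing defines the handlers but never subscribes them to the broker. A minimal sketch of that wiring, assuming a synchronous in-memory broker like the one shown, might look like the following. The Broker type with its trace field, the stockAvailable switch, and the payment_refunded and order_cancelled events are hypothetical names introduced here for illustration; they are not part of the original listing.

```go
package main

import "fmt"

// Event and Broker are pared-down stand-ins for the types in the listing.
type Event struct {
	SagaID string
	Type   string
}

type Broker struct {
	handlers map[string][]func(Event)
	trace    []string // event types in publish order, kept for inspection
}

func NewBroker() *Broker {
	return &Broker{handlers: make(map[string][]func(Event))}
}

func (b *Broker) On(eventType string, h func(Event)) {
	b.handlers[eventType] = append(b.handlers[eventType], h)
}

func (b *Broker) Publish(e Event) {
	b.trace = append(b.trace, e.Type)
	for _, h := range b.handlers[e.Type] {
		h(e)
	}
}

// RunOrderFlow wires the three services together and starts one saga.
// stockAvailable is a hypothetical switch that forces the inventory outcome.
func RunOrderFlow(stockAvailable bool) []string {
	b := NewBroker()

	// Payment service reacts to new orders (payment always succeeds here).
	b.On("order_created", func(e Event) {
		b.Publish(Event{SagaID: e.SagaID, Type: "payment_succeeded"})
	})
	// Inventory service reacts to successful payments.
	b.On("payment_succeeded", func(e Event) {
		if !stockAvailable {
			b.Publish(Event{SagaID: e.SagaID, Type: "inventory_failed"})
			return
		}
		b.Publish(Event{SagaID: e.SagaID, Type: "inventory_reserved"})
	})
	// Each compensating service subscribes separately to the failure event.
	b.On("inventory_failed", func(e Event) {
		b.Publish(Event{SagaID: e.SagaID, Type: "payment_refunded"})
	})
	b.On("inventory_failed", func(e Event) {
		b.Publish(Event{SagaID: e.SagaID, Type: "order_cancelled"})
	})

	b.Publish(Event{SagaID: "saga-1", Type: "order_created"})
	return b.trace
}

func main() {
	fmt.Println(RunOrderFlow(true))
	fmt.Println(RunOrderFlow(false))
}
```

Registering two separate subscribers for inventory_failed keeps the refund and the cancellation in their own services. With a real broker the two would run concurrently, so neither should depend on the other having finished first.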

Handling the Hard Cases

Compensation That Fails

Compensating actions can fail too — a refund API might be temporarily down, or a message queue might lose an event. This is the uncomfortable reality of sagas: you trade the certainty of ACID transactions for eventual consistency and the need for manual reconciliation or retry queues.

Practical strategies include wrapping compensating actions in retry loops with exponential backoff, persisting saga state to a database so you can resume interrupted compensations, and implementing alerting that triggers when a compensation enters a failed state for manual intervention.
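The retry strategy could be sketched roughly as follows. RetryCompensation and its parameters are hypothetical names, and the sketch assumes the compensating action is safe to call repeatedly. It leaves out jitter and persistence, which a production version would likely want.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// RetryCompensation calls fn until it succeeds, the attempt budget is spent,
// or ctx is cancelled. The delay doubles after every failed attempt.
func RetryCompensation(ctx context.Context, attempts int, base time.Duration, fn func(context.Context) error) error {
	if attempts < 1 {
		attempts = 1
	}
	delay := base
	var err error
	for i := 1; i <= attempts; i++ {
		if err = fn(ctx); err == nil {
			return nil
		}
		if i == attempts {
			break // no point sleeping after the final attempt
		}
		select {
		case <-time.After(delay):
			delay *= 2
		case <-ctx.Done():
			return fmt.Errorf("compensation abandoned after %d attempts: %w", i, ctx.Err())
		}
	}
	// At this point a real system would raise an alert for manual intervention.
	return fmt.Errorf("compensation failed after %d attempts: %w", attempts, err)
}

// demo simulates a refund API that is down for the first two calls.
func demo() (int, error) {
	calls := 0
	refund := func(ctx context.Context) error {
		calls++
		if calls < 3 {
			return errors.New("refund API unavailable")
		}
		return nil
	}
	err := RetryCompensation(context.Background(), 5, 10*time.Millisecond, refund)
	return calls, err
}

func main() {
	calls, err := demo()
	fmt.Printf("refund attempts: %d, error: %v\n", calls, err)
}
```

When the attempt budget runs out, the returned error is the signal to park the saga in a failed state and alert someone, rather than retrying forever.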

Idempotency Is Non-Negotiable

Every saga step and every compensating action must be idempotent. Network retries, duplicate message delivery, and coordinator restarts can cause the same step to execute multiple times. If your “reserve inventory” step isn’t idempotent, a retry could double-reserve stock.

The simplest approach is to use a unique saga ID, combined with the step name, as an idempotency key, so that an action and its compensation are tracked separately. Before processing, check whether that key has already been handled. This works whether you store the key in a database, a Redis set, or the message broker’s deduplication features.
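As a rough illustration of the idempotency check, here is a sketch that keys on the saga ID plus a step name. The step name is an assumption added here so that a step and its compensation do not collide on the same key. IdempotencyStore and Once are hypothetical names, and the in-memory map stands in for whatever durable store you would actually use.

```go
package main

import (
	"fmt"
	"sync"
)

// IdempotencyStore remembers which (saga, step) pairs have been processed.
// It is in-memory for illustration only; a real one must be durable and
// shared across service instances.
type IdempotencyStore struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewIdempotencyStore() *IdempotencyStore {
	return &IdempotencyStore{seen: make(map[string]bool)}
}

// Once runs fn only if the (sagaID, step) key has not completed before.
// It reports whether fn ran to completion on this call. The lock is held
// while fn runs, which keeps the sketch simple but serializes all callers.
func (s *IdempotencyStore) Once(sagaID, step string, fn func() error) (bool, error) {
	key := sagaID + "/" + step
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.seen[key] {
		return false, nil // duplicate delivery: skip the work
	}
	if err := fn(); err != nil {
		return false, err // not recorded, so a later retry runs fn again
	}
	s.seen[key] = true
	return true, nil
}

func main() {
	store := NewIdempotencyStore()
	reserved := 0
	reserve := func() error {
		reserved++
		return nil
	}

	// Simulate the broker delivering the same message twice.
	for i := 0; i < 2; i++ {
		ran, err := store.Once("saga-42", "reserve-inventory", reserve)
		fmt.Printf("delivery %d: ran=%v err=%v\n", i+1, ran, err)
	}
	fmt.Println("units reserved:", reserved)
}
```

Recording the key only after fn succeeds means a crash between the work and the record can still replay the step once. Closing that gap usually requires writing the key in the same local transaction as the business change.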

Persistent State

The orchestrator should persist saga state at each step — which steps completed, which compensations ran, and which are pending. This way, if the coordinator crashes mid-saga, it can resume from where it left off rather than starting over or leaving the saga in an ambiguous state. A simple approach is a saga_instances table tracking the current step, status, and any needed context for each step.
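One way to picture the resume behavior is the sketch below. SagaState, StateStore, memStore, Step, and Resume are hypothetical names, and the in-memory map stands in for the saga_instances table. Compensation is left out to keep the focus on resumption, and the failing step simulates a coordinator crash rather than a business failure.

```go
package main

import (
	"errors"
	"fmt"
)

// SagaState mirrors a hypothetical row in a saga_instances table.
type SagaState struct {
	SagaID   string
	NextStep int    // index of the first step not yet completed
	Status   string // "running" or "completed"
}

// StateStore abstracts persistence; a real one would write to a database.
type StateStore interface {
	Load(sagaID string) (SagaState, bool)
	Save(state SagaState) error
}

type memStore struct{ rows map[string]SagaState }

func (m *memStore) Load(id string) (SagaState, bool) {
	s, ok := m.rows[id]
	return s, ok
}

func (m *memStore) Save(s SagaState) error {
	m.rows[s.SagaID] = s
	return nil
}

type Step struct {
	Name   string
	Action func() error
}

// Resume runs steps from the persisted position, saving after each success.
func Resume(store StateStore, sagaID string, steps []Step) error {
	state, ok := store.Load(sagaID)
	if !ok {
		state = SagaState{SagaID: sagaID, Status: "running"}
	}
	for i := state.NextStep; i < len(steps); i++ {
		if err := steps[i].Action(); err != nil {
			return fmt.Errorf("step %s: %w", steps[i].Name, err)
		}
		state.NextStep = i + 1
		if err := store.Save(state); err != nil {
			return fmt.Errorf("persist after %s: %w", steps[i].Name, err)
		}
	}
	state.Status = "completed"
	return store.Save(state)
}

// demo interrupts the saga at the second step, then resumes it.
func demo() ([]string, SagaState) {
	store := &memStore{rows: make(map[string]SagaState)}
	var ran []string
	crash := true
	step := func(name string, failOnCrash bool) Step {
		return Step{Name: name, Action: func() error {
			if failOnCrash && crash {
				return errors.New("simulated coordinator crash")
			}
			ran = append(ran, name)
			return nil
		}}
	}
	steps := []Step{
		step("create-order", false),
		step("charge-payment", true),
		step("reserve-inventory", false),
	}
	_ = Resume(store, "saga-42", steps) // stops at charge-payment
	crash = false
	_ = Resume(store, "saga-42", steps) // picks up at charge-payment
	final, _ := store.Load("saga-42")
	return ran, final
}

func main() {
	ran, final := demo()
	fmt.Println(ran, final.Status)
}
```

Because the state is saved only after a step succeeds, a crash between the action and the save replays that step on resume. That is one more reason every step needs to be idempotent.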

When to Use the Saga Pattern

Sagas are worth the complexity when you have business operations that span multiple services and can’t tolerate partial completion. Not every multi-service call needs a saga — if an operation is read-only or failure simply means showing an error message to the user with no side effects to undo, a simple synchronous call chain is fine.

The pattern really earns its keep in workflows with financial transactions, inventory operations, provisioning, and any domain where “it half-worked” is worse than “it fully failed.” If you’re building e-commerce, booking systems, payment processing, or resource provisioning — you almost certainly need sagas.

Start with orchestration for new flows — the centralized coordinator makes the flow visible and debuggable. Introduce choreography only when you have a specific reason for it, like extremely high throughput where a central coordinator becomes a bottleneck, or when you need services to evolve independently without coordinating schema changes.

For production sagas, consider established tools like Temporal (which provides durable execution, a natural fit for sagas), Watermill (a Go library for working with message streams), or Apache Kafka as the broker for event-driven choreography. These take care of much of the persistence, retry, and delivery-guarantee plumbing that is tedious to build from scratch, though the guarantees differ by tool and your handlers generally still need to be idempotent.

The Saga pattern isn’t free: it adds significant operational overhead. But for the right use case, it’s the difference between a system that handles failure gracefully and one that leaves orphaned payments, phantom reservations, and confused customers.
