Event-driven architecture is the default for most microservices systems today. Services communicate by publishing events to a message broker — Kafka, RabbitMQ, NATS, you name it. It decouples producers from consumers, enables asynchronous processing, and makes the whole system more resilient to individual service failures. But there’s a subtle problem that catches teams off guard every time: how do you publish an event and update your database reliably in the same operation?
This is the dual-write problem, and the Outbox Pattern is the cleanest solution. Let’s walk through why it matters and how to implement it in Go.
The Dual-Write Problem
Imagine an order service. When a customer places an order, you need to:
1. Insert the order into your PostgreSQL database
2. Publish an OrderCreated event to Kafka so the inventory and notification services can react
The naive implementation does both in sequence:
func (s *OrderService) CreateOrder(ctx context.Context, order Order) error {
	// Step 1: Save to database
	if err := s.db.Insert(ctx, order); err != nil {
		return err
	}

	// Step 2: Publish event
	if err := s.kafka.Publish(ctx, "order-created", order.Event()); err != nil {
		return err // Order is in the DB, but no event was published
	}
	return nil
}
This looks fine until you think about failure modes. If the Kafka publish fails after the database write succeeds, your order exists in the database but the inventory service never learns about it. The customer’s items are reserved in your system but never deducted from the warehouse. You can try compensating with retries or sagas, but now you’re building complex recovery logic for what should be a simple operation.
Reverse the order and you have the opposite problem: the event is published but the database write fails. Now downstream services try to process an order that doesn’t exist.
You could wrap both operations in a distributed transaction, but that introduces tight coupling between your database and message broker, kills performance, and is fragile under partitions. There’s a better way.
The Outbox Pattern
The idea is simple: instead of writing to the database and the message broker separately, write the event to an outbox table in the same database as your business data, within the same transaction. Then, a separate process reads from the outbox table and publishes events to the message broker.
Since the database insert and the outbox write happen in a single transaction, they either both succeed or both fail. No dual-write, no inconsistency.
Implementing the Outbox in Go
First, define the outbox table schema:
CREATE TABLE outbox_events (
    id             BIGSERIAL PRIMARY KEY,
    aggregate_type VARCHAR(255) NOT NULL,
    aggregate_id   VARCHAR(255) NOT NULL,
    event_type     VARCHAR(255) NOT NULL,
    payload        JSONB NOT NULL,
    created_at     TIMESTAMPTZ NOT NULL DEFAULT now(),
    published_at   TIMESTAMPTZ NULL
);

CREATE INDEX idx_outbox_unpublished
    ON outbox_events (created_at)
    WHERE published_at IS NULL;
The published_at column with the partial index is key — it lets the poller efficiently find only events that haven’t been sent yet.
Now the order service writes both the order and the event in a single transaction:
func (s *OrderService) CreateOrder(ctx context.Context, order Order) error {
	tx, err := s.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	// Insert the order
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO orders (id, customer_id, total, status, created_at)
		 VALUES ($1, $2, $3, $4, now())`,
		order.ID, order.CustomerID, order.Total, "pending",
	); err != nil {
		return err
	}

	// Write the event to the outbox — same transaction
	eventPayload, err := json.Marshal(map[string]interface{}{
		"order_id":    order.ID,
		"customer_id": order.CustomerID,
		"total":       order.Total,
		"items":       order.Items,
	})
	if err != nil {
		return err
	}
	if _, err := tx.ExecContext(ctx,
		`INSERT INTO outbox_events (aggregate_type, aggregate_id, event_type, payload)
		 VALUES ($1, $2, $3, $4)`,
		"order", order.ID, "OrderCreated", eventPayload,
	); err != nil {
		return err
	}

	return tx.Commit()
}
If either operation fails, the transaction rolls back and nothing is partially committed. The system stays consistent.
The Relay: Publishing Events
The second piece is a background process that polls the outbox table and publishes events to Kafka. Here's an implementation with batching and graceful shutdown:
// OutboxEvent mirrors a row of the outbox_events table.
type OutboxEvent struct {
	ID            int64
	AggregateType string
	AggregateID   string
	EventType     string
	Payload       []byte
}

type OutboxRelay struct {
	db        *sql.DB
	producer  kafka.Producer
	interval  time.Duration
	batchSize int
	stopCh    chan struct{}
}

func NewOutboxRelay(db *sql.DB, producer kafka.Producer) *OutboxRelay {
	return &OutboxRelay{
		db:        db,
		producer:  producer,
		interval:  time.Second,
		batchSize: 100,
		stopCh:    make(chan struct{}),
	}
}

func (r *OutboxRelay) Start(ctx context.Context) {
	ticker := time.NewTicker(r.interval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-r.stopCh:
			return
		case <-ticker.C:
			r.publishBatch(ctx)
		}
	}
}

func (r *OutboxRelay) Stop() {
	close(r.stopCh)
}

func (r *OutboxRelay) publishBatch(ctx context.Context) {
	rows, err := r.db.QueryContext(ctx,
		`SELECT id, aggregate_type, aggregate_id, event_type, payload
		 FROM outbox_events
		 WHERE published_at IS NULL
		 ORDER BY created_at
		 LIMIT $1`, r.batchSize,
	)
	if err != nil {
		slog.Error("outbox poll failed", "error", err)
		return
	}
	defer rows.Close()

	var eventIDs []int64
	for rows.Next() {
		var e OutboxEvent
		if err := rows.Scan(&e.ID, &e.AggregateType, &e.AggregateID,
			&e.EventType, &e.Payload); err != nil {
			slog.Error("scan failed", "error", err)
			continue
		}

		topic := fmt.Sprintf("%s.%s", e.AggregateType, e.EventType)
		if err := r.producer.Publish(ctx, topic, e.Payload); err != nil {
			slog.Error("publish failed", "event_id", e.ID, "error", err)
			continue
		}
		eventIDs = append(eventIDs, e.ID)
	}
	if err := rows.Err(); err != nil {
		slog.Error("row iteration failed", "error", err)
	}

	// Mark published events
	if len(eventIDs) > 0 {
		_, err := r.db.ExecContext(ctx,
			`UPDATE outbox_events SET published_at = now()
			 WHERE id = ANY($1)`, pq.Array(eventIDs),
		)
		if err != nil {
			slog.Error("mark published failed", "error", err)
		}
	}
}
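To tie it together, here is one way the relay might be wired into a service's entry point. This is a sketch: the signal handling, the Postgres driver, and the newProducer helper are assumptions, not part of the relay above.

```go
// Hypothetical wiring: run the relay alongside the service and shut it
// down cleanly on SIGINT/SIGTERM. Driver registration and producer
// construction are placeholders.
package main

import (
	"context"
	"database/sql"
	"log/slog"
	"os"
	"os/signal"
	"syscall"

	_ "github.com/lib/pq" // assumed Postgres driver
)

func main() {
	// ctx is cancelled when the process receives SIGINT or SIGTERM.
	ctx, stop := signal.NotifyContext(context.Background(), syscall.SIGINT, syscall.SIGTERM)
	defer stop()

	db, err := sql.Open("postgres", os.Getenv("DATABASE_URL"))
	if err != nil {
		slog.Error("db open failed", "error", err)
		os.Exit(1)
	}
	defer db.Close()

	relay := NewOutboxRelay(db, newProducer()) // newProducer is hypothetical
	go relay.Start(ctx)

	<-ctx.Done() // block until a shutdown signal arrives
	relay.Stop()
}
```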
A few design decisions worth noting:
Batching — Fetching and publishing events in batches reduces database round-trips and Kafka producer overhead. A batch size of 50–200 is usually a good starting point.
Polling interval — One second is a reasonable default. For systems that need lower latency, you can drop to 100–200ms, but watch your database load. Some teams combine polling with LISTEN/NOTIFY for near-instant delivery without hammering the database.
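As a sketch of that hybrid approach, the relay's loop could listen on a Postgres notification channel via github.com/lib/pq and keep a slow poll as a safety net. The outbox_wakeup channel name, and the database trigger that runs NOTIFY after each outbox insert, are assumptions:

```go
// runWithNotify drains the outbox as soon as Postgres signals a new row,
// falling back to a slow poll in case a notification is dropped. Assumes
// a trigger on outbox_events that executes `NOTIFY outbox_wakeup`.
func (r *OutboxRelay) runWithNotify(ctx context.Context, dsn string) error {
	listener := pq.NewListener(dsn, 10*time.Second, time.Minute, nil)
	defer listener.Close()
	if err := listener.Listen("outbox_wakeup"); err != nil {
		return err
	}

	ticker := time.NewTicker(5 * time.Second) // safety-net poll
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-listener.Notify: // near-instant wake-up on insert
			r.publishBatch(ctx)
		case <-ticker.C: // catches any missed notifications
			r.publishBatch(ctx)
		}
	}
}
```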
At-least-once delivery — If the relay crashes after publishing to Kafka but before marking events as published, those events will be re-published on the next poll. Your consumers need to be idempotent. This is standard for event-driven systems and usually handled with deduplication keys or idempotency checks on the consumer side.
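On the consumer side, one common shape for that idempotency check is a processed_events table written in the same transaction as the business update, so a duplicate delivery is detected and skipped atomically. A minimal sketch, where InventoryConsumer and applyInventoryChange are hypothetical names:

```go
// Handle processes one delivery at-least-once-safely: the event ID is
// recorded in processed_events inside the same transaction as the
// inventory update. If the insert affects zero rows, the event was
// already handled, so we ack without reapplying it.
func (c *InventoryConsumer) Handle(ctx context.Context, eventID int64, payload []byte) error {
	tx, err := c.db.BeginTx(ctx, nil)
	if err != nil {
		return err
	}
	defer tx.Rollback()

	res, err := tx.ExecContext(ctx,
		`INSERT INTO processed_events (event_id) VALUES ($1)
		 ON CONFLICT (event_id) DO NOTHING`, eventID)
	if err != nil {
		return err
	}
	if n, _ := res.RowsAffected(); n == 0 {
		return nil // duplicate delivery: already processed
	}

	// applyInventoryChange is a placeholder for the real business logic.
	if err := c.applyInventoryChange(ctx, tx, payload); err != nil {
		return err
	}
	return tx.Commit()
}
```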
CDC Alternative: Debezium
If polling doesn’t fit your latency requirements, there’s an alternative approach: use Change Data Capture (CDC). Debezium connects to your database’s write-ahead log (PostgreSQL WAL, MySQL binlog) and streams row-level changes directly to Kafka. When a new row appears in the outbox table, Debezium captures it and publishes it as a Kafka message — no polling needed.
The tradeoff is operational complexity. Debezium requires running Kafka Connect, managing connector configurations, and monitoring lag. For most teams starting out, the polling approach is simpler to implement and debug. Move to CDC when you have a concrete latency requirement that polling can’t meet.
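For a rough sense of what that setup involves, a Debezium connector for this outbox table might look like the following, using Debezium's outbox event router transform. Treat the property values (hostnames, credentials, topic prefix) as placeholders to adapt, not a drop-in config:

```json
{
  "name": "orders-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "orders",
    "topic.prefix": "orderdb",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.route.by.field": "aggregate_type",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.table.field.event.payload": "payload"
  }
}
```

The EventRouter transform reads each captured outbox row and routes it to a Kafka topic derived from the aggregate type, which maps cleanly onto the schema defined earlier.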
Cleaning Up Old Events
The outbox table grows indefinitely if you don’t clean it up. A simple retention policy works well:
-- Delete successfully published events older than 7 days
DELETE FROM outbox_events
WHERE published_at IS NOT NULL
  AND published_at < now() - interval '7 days';
Run this as a daily cron job or a scheduled task within your application. Keep enough history for debugging (7 days is typical) but don’t let the table grow to millions of rows.
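If you'd rather keep the cleanup inside the application than in cron, a minimal sketch of a scheduled goroutine running the same DELETE:

```go
// StartOutboxCleanup runs the retention DELETE once a day until the
// context is cancelled. Interval and retention window are assumptions.
func StartOutboxCleanup(ctx context.Context, db *sql.DB) {
	ticker := time.NewTicker(24 * time.Hour)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			res, err := db.ExecContext(ctx,
				`DELETE FROM outbox_events
				 WHERE published_at IS NOT NULL
				   AND published_at < now() - interval '7 days'`)
			if err != nil {
				slog.Error("outbox cleanup failed", "error", err)
				continue
			}
			if n, err := res.RowsAffected(); err == nil {
				slog.Info("outbox cleanup done", "deleted", n)
			}
		}
	}
}
```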
When to Use the Outbox Pattern
The outbox pattern shines in these scenarios:
Event-driven microservices — Any time a service needs to update its own state and notify other services atomically. This covers most order, payment, and inventory systems.
Audit trails — The outbox table doubles as an event log. You can replay events, debug issues, and reconstruct state from the stored payloads.
Event sourcing — The outbox is essentially a lightweight form of event sourcing. If you’re already writing events to a table, you’re halfway there.
The pattern does add some complexity — you need the relay process, the outbox table, and idempotent consumers. But compared to the alternative (distributed transactions, manual compensation logic, or living with data inconsistency), it’s a straightforward trade-off that pays off quickly.
Wrapping Up
The dual-write problem is one of those issues that’s easy to overlook until it causes a production incident at 3 AM. The outbox pattern eliminates it by turning two distributed writes into a single local transaction plus an asynchronous relay. It’s battle-tested, relatively simple to implement, and works with any combination of database and message broker.
If you’re building event-driven microservices and you’re not using an outbox (or CDC), you almost certainly have a latent consistency bug waiting to surface. Add the outbox table, wire up the relay, and sleep better at night.