I’ve spent the better part of two decades building distributed systems, and if there’s one pattern that has consistently separated scalable architectures from fragile ones, it’s event-driven design. Yet I still see teams reaching for synchronous HTTP calls for everything — order processing, notification dispatch, inventory updates — only to hit a wall when their system needs to scale beyond a few hundred requests per second.
Today, I want to walk you through building a practical event-driven system in Go. Not the academic version with UML diagrams and abstract concepts — the real kind you can deploy next week. We’ll cover the core patterns, show working code, and address the pitfalls that trip up even experienced teams.
Why Event-Driven Architecture Matters Now
Event-driven architecture (EDA) has been around for decades, but three things have made it essential in 2026. First, AI workloads are inherently asynchronous — you submit a prompt and wait. Second, microservice sprawl has made synchronous call chains brittle and slow. Third, the tooling has finally matured. Apache Kafka, NATS, Redis Streams, and cloud-native offerings like AWS EventBridge and Google Cloud Pub/Sub make it genuinely practical to adopt EDA without building your own message broker.
The core idea is simple: instead of Service A calling Service B directly, Service A publishes an event (“OrderCreated”), and any interested service reacts to it. This decouples producers from consumers, enables natural horizontal scaling, and makes your system far more resilient to failures.
The Three Patterns You Actually Need
Most event-driven systems can be built with three fundamental patterns. Don’t overcomplicate things with saga choreography and complex event sourcing until you’ve mastered these.
1. Simple Event Notification
This is the workhorse. One service publishes an event, one or more services consume it. Think of it as a fire-and-forget notification. A user signs up, you publish a UserRegistered event, and your email service picks it up to send a welcome message. No coordination needed.
2. Event-Carried State Transfer
Instead of consumers calling back to the producer to get full details, embed the relevant state directly in the event payload. This eliminates a whole class of distributed queries and reduces coupling. When you publish OrderPlaced, include the order items, customer ID, and total — don’t force the inventory service to make another HTTP call to fetch them.
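To make this concrete, here is a sketch of what an event-carried payload might look like. The struct and field names (OrderItem, OrderPlaced, ReserveStock) are illustrative, not a fixed schema — the point is that the consumer acts using only what the event carries:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// OrderItem and OrderPlaced embed everything a downstream service needs,
// so no callback query to the order service is required.
type OrderItem struct {
	SKU      string  `json:"sku"`
	Quantity int     `json:"quantity"`
	Price    float64 `json:"price"`
}

type OrderPlaced struct {
	OrderID    string      `json:"order_id"`
	CustomerID string      `json:"customer_id"`
	Items      []OrderItem `json:"items"`
	Total      float64     `json:"total"`
}

// ReserveStock is a stand-in inventory consumer: it works entirely from
// the event's embedded state and returns the units to reserve.
func ReserveStock(raw []byte) (int, error) {
	var ev OrderPlaced
	if err := json.Unmarshal(raw, &ev); err != nil {
		return 0, err
	}
	units := 0
	for _, it := range ev.Items {
		units += it.Quantity
	}
	return units, nil
}

func main() {
	payload, _ := json.Marshal(OrderPlaced{
		OrderID:    "ord_1",
		CustomerID: "cust_123",
		Items:      []OrderItem{{SKU: "sku_a", Quantity: 2, Price: 19.99}},
		Total:      39.98,
	})
	units, _ := ReserveStock(payload)
	fmt.Println("units to reserve:", units)
}
```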
3. Request-Reply via Correlation IDs
Sometimes you genuinely need a response. The trick is to never block synchronously. Instead, publish a command event with a unique correlation ID, and have the responder publish a result event with that same ID. The original caller listens for its specific correlation ID. This gives you the semantics of a request-response while keeping everything asynchronous and resilient.
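The pattern is easiest to see stripped of any broker. The in-memory sketch below stands in for a real bus — with NATS, the pending map would be a subscription on a reply subject instead. Broker, Message, and Request are illustrative names, not a library API:

```go
package main

import (
	"fmt"
	"sync"
)

// Message is a minimal envelope; CorrelationID ties a reply to its request.
type Message struct {
	CorrelationID string
	Body          string
}

// Broker keeps one reply channel per outstanding correlation ID.
type Broker struct {
	mu      sync.Mutex
	pending map[string]chan Message
}

func NewBroker() *Broker {
	return &Broker{pending: make(map[string]chan Message)}
}

// Request registers a reply channel under the correlation ID, lets the
// responder run concurrently, and waits only on its own channel — nothing
// blocks the broker itself.
func (b *Broker) Request(corrID, body string, handle func(Message) Message) Message {
	replyCh := make(chan Message, 1)
	b.mu.Lock()
	b.pending[corrID] = replyCh
	b.mu.Unlock()

	// The "responder" publishes a result carrying the same correlation ID.
	go func() {
		reply := handle(Message{CorrelationID: corrID, Body: body})
		b.mu.Lock()
		ch := b.pending[corrID]
		delete(b.pending, corrID)
		b.mu.Unlock()
		ch <- reply
	}()

	return <-replyCh
}

func main() {
	b := NewBroker()
	reply := b.Request("req-42", "ping", func(m Message) Message {
		return Message{CorrelationID: m.CorrelationID, Body: "pong"}
	})
	fmt.Println(reply.CorrelationID, reply.Body)
}
```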
A Working Example in Go
Let’s build a minimal but production-ready event bus using NATS JetStream, which I’ve found to be the sweet spot for most teams — simpler than Kafka, more capable than Redis Streams, and backed by an excellent Go SDK.
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"
	"os"
	"os/signal"
	"syscall"
	"time"

	"github.com/nats-io/nats.go"
)
// Event is the envelope for all messages on the bus.
type Event struct {
	ID        string          `json:"id"`
	Type      string          `json:"type"`
	Timestamp time.Time       `json:"timestamp"`
	Payload   json.RawMessage `json:"payload"`
}
// EventBus wraps a NATS connection and its JetStream context.
type EventBus struct {
	nc *nats.Conn
	js nats.JetStreamContext
}

// NewEventBus connects to NATS and returns a ready-to-use event bus.
func NewEventBus(natsURL string) (*EventBus, error) {
	nc, err := nats.Connect(natsURL,
		nats.ReconnectWait(2*time.Second),
		nats.MaxReconnects(5),
	)
	if err != nil {
		return nil, fmt.Errorf("nats connect: %w", err)
	}
	js, err := nc.JetStream(nats.PublishAsyncMaxPending(256))
	if err != nil {
		nc.Close()
		return nil, fmt.Errorf("jetstream: %w", err)
	}
	return &EventBus{nc: nc, js: js}, nil
}

// Close drains the connection, flushing buffered messages before exit.
func (eb *EventBus) Close() error {
	return eb.nc.Drain()
}
// Publish sends an event to the specified subject. Passing the event ID as
// the JetStream message ID lets the server deduplicate retried publishes
// within the stream's duplicate-tracking window.
func (eb *EventBus) Publish(ctx context.Context, subject string, event *Event) error {
	data, err := json.Marshal(event)
	if err != nil {
		return fmt.Errorf("marshal event: %w", err)
	}
	_, err = eb.js.Publish(subject, data, nats.Context(ctx), nats.MsgId(event.ID))
	if err != nil {
		return fmt.Errorf("publish to %s: %w", subject, err)
	}
	return nil
}
// Subscribe registers a handler for events on a given subject. Members of
// the same queue group share the work; each message goes to one member.
func (eb *EventBus) Subscribe(subject, durable string, handler func(*Event)) error {
	_, err := eb.js.QueueSubscribe(subject, durable, func(msg *nats.Msg) {
		var event Event
		if err := json.Unmarshal(msg.Data, &event); err != nil {
			// Term, not Nak: a malformed message will never parse,
			// so redelivering it would just loop forever.
			log.Printf("malformed event on %s: %v", subject, err)
			msg.Term()
			return
		}
		handler(&event)
		msg.Ack()
	},
		nats.Durable(durable),
		nats.ManualAck(),
		nats.DeliverAll(),
		nats.AckExplicit(),
	)
	return err
}
func main() {
	bus, err := NewEventBus("nats://localhost:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer bus.Close()
	// JetStream needs a stream covering the subjects before you can
	// publish or subscribe durably; AddStream is a no-op if the stream
	// already exists with the same configuration.
	if _, err := bus.js.AddStream(&nats.StreamConfig{
		Name:     "ORDERS",
		Subjects: []string{"orders.>"},
	}); err != nil {
		log.Fatal(err)
	}
	// Subscribe to order events
	err = bus.Subscribe("orders.>", "order-service", func(event *Event) {
		log.Printf("received %s: %s", event.Type, string(event.Payload))
	})
	if err != nil {
		log.Fatal(err)
	}
	// Publish a sample event
	event := &Event{
		ID:        "ord_550e8400",
		Type:      "OrderPlaced",
		Timestamp: time.Now().UTC(),
		Payload:   json.RawMessage(`{"customer_id":"cust_123","total":89.99}`),
	}
	if err := bus.Publish(context.Background(), "orders.created", event); err != nil {
		log.Fatal(err)
	}
	log.Println("event bus running, press Ctrl+C to stop")
	sig := make(chan os.Signal, 1)
	signal.Notify(sig, syscall.SIGINT, syscall.SIGTERM)
	<-sig
}
This is roughly a hundred lines of code that give you a durable, replayable event bus with acknowledgment semantics. JetStream handles message persistence, redelivery on consumer failure, and consumer groups out of the box. No Kafka cluster setup required.
The Pitfalls Nobody Warns You About
After deploying event-driven systems across dozens of production environments, here are the gotchas that consistently cause outages:
Poison Messages
One malformed message can crash your consumer in a loop, blocking the entire queue. Always deserialize into a struct before processing, and implement a dead-letter queue (DLQ) for messages that fail more than N times. In NATS, configure MaxDeliver on the consumer, then listen for the $JS.EVENT.ADVISORY.CONSUMER.MAX_DELIVERIES advisory and republish the offending messages to a DLQ subject yourself — JetStream won’t move them for you.
Event Ordering Guarantees
This is the silent killer. You publish OrderCreated and then OrderCancelled — but the consumer sees them in the wrong order. If ordering matters for a specific entity, partition by entity ID. In Kafka, that means using the entity ID as the message key, so all of an entity’s events land on one partition. NATS has no partitions: JetStream preserves order within a stream, but a queue group spreads messages across workers, so use a subject hierarchy like orders.<customer_id>.<event_type> and route each customer’s events through a single consumer when strict per-entity ordering matters.
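Keying by entity ID is just a stable hash. A rough sketch of the idea — note that Kafka’s default partitioner actually uses murmur2 on the key bytes; FNV-1a here only illustrates the principle:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// partitionFor maps an entity ID to the same partition every time, so all
// events for one entity land on one partition and keep their relative order.
func partitionFor(entityID string, numPartitions int) int {
	h := fnv.New32a()
	h.Write([]byte(entityID))
	return int(h.Sum32() % uint32(numPartitions))
}

func main() {
	for _, id := range []string{"cust_123", "cust_123", "cust_456"} {
		fmt.Printf("%s -> partition %d\n", id, partitionFor(id, 8))
	}
}
```

The same two events for cust_123 always hash to the same partition, which is exactly the guarantee that keeps OrderCreated ahead of OrderCancelled.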
Schema Evolution
Six months in, someone adds a required field to the OrderPlaced event, and old consumers crash. Use a schema registry (Confluent Schema Registry works with NATS too) or adopt backward-compatible serialization like Protocol Buffers. At minimum, make all new fields optional and version your event types.
Observability is Hard
Distributed tracing across event boundaries is non-trivial. Inject OpenTelemetry trace context into your event headers so consumers can continue the trace. Without this, debugging a slow request that spans three services and two message queues becomes a guessing game. Every event should carry a trace_id in its metadata.
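Here is a minimal sketch of carrying trace context in event headers, using the W3C traceparent format. The header plumbing is hand-rolled purely for illustration — in practice you would use OpenTelemetry’s propagation API rather than slicing strings:

```go
package main

import "fmt"

// TracedEvent carries metadata across the async boundary alongside the payload.
type TracedEvent struct {
	Headers map[string]string
	Payload []byte
}

// InjectTrace writes the caller's trace ID into the headers before publishing,
// in W3C traceparent form: version-traceid-parentid-flags. The parent span ID
// here is a fixed placeholder for the sketch.
func InjectTrace(ev *TracedEvent, traceID string) {
	if ev.Headers == nil {
		ev.Headers = make(map[string]string)
	}
	ev.Headers["traceparent"] = "00-" + traceID + "-0000000000000001-01"
}

// ExtractTrace recovers the trace ID on the consumer side so the next span
// can join the same trace.
func ExtractTrace(ev *TracedEvent) string {
	tp := ev.Headers["traceparent"]
	if len(tp) < 36 {
		return ""
	}
	return tp[3:35] // the 32-hex-char trace-id field
}

func main() {
	ev := &TracedEvent{Payload: []byte(`{"order_id":"ord_1"}`)}
	InjectTrace(ev, "4bf92f3577b34da6a3ce929d0e0e4736")
	fmt.Println("consumer sees trace:", ExtractTrace(ev))
}
```

With NATS, these headers map directly onto nats.Msg headers, so the producer injects before Publish and the consumer extracts inside the handler.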
When NOT to Use Events
Event-driven architecture is not a universal hammer. If you’re building a CRUD API with five endpoints and two services, use synchronous calls. The operational complexity of running a message broker, managing consumer groups, handling retries, and debugging async flows is only worth it when you genuinely need the decoupling and scalability benefits. The sweet spot is typically three or more services that need to react to the same state changes independently.
Getting Started Tomorrow
Here’s my recommended progression for teams adopting EDA:
Week 1: Start with Redis Streams for a single event flow. It’s already in your stack, requires zero new infrastructure, and gives you a taste of pub/sub without commitment.
Week 3: Migrate to NATS JetStream for durability and replay. The migration is straightforward — the mental model is nearly identical, but you gain persistence and consumer groups.
Month 2: Add a schema registry, implement dead-letter queues, and wire up OpenTelemetry tracing across your event flows.
Month 3: Only then consider Kafka — and only if you need exactly-once semantics across partitions, massive throughput (millions of events per second), or integration with an existing Kafka ecosystem.
Event-driven architecture is one of the most powerful tools in a system designer’s toolkit. Start simple, iterate, and let the complexity grow with your actual needs — not with what you imagine you might need someday. Your future self (and your on-call rotation) will thank you.