The Circuit Breaker Pattern in Go: Preventing Cascading Failures in Distributed Systems

When one service in a distributed system starts failing, the failure can cascade to every service that calls it. A slow dependency exhausts a connection pool, in-flight HTTP requests pile up behind it, retries amplify the load, and suddenly your entire mesh is unresponsive. The circuit breaker pattern is the architectural answer to this problem: it stops the bleeding before it spreads.

The concept is borrowed directly from electrical engineering: a circuit breaker monitors the flow and “trips” open when something goes wrong, preventing damage to the rest of the system. In software, it wraps an external call and watches for failures. When failures exceed a threshold, the breaker opens and short-circuits subsequent calls — returning an error immediately instead of waiting for a timeout. After a cooldown period, it enters a half-open state, tentatively letting a small number of requests through to test whether the downstream service has recovered.

The Three States

A circuit breaker operates as a state machine with three states:

Closed (normal) — Requests flow through normally. The breaker counts successes and failures. When the failure count crosses a configurable threshold, it transitions to the open state.

Open (failing) — All requests are immediately rejected with an error. No actual calls to the downstream service are made. This is the protective state that gives the failing service room to recover. After a configured timeout, the breaker transitions to half-open.

Half-open (probing) — A limited number of requests are allowed through. If they succeed, the breaker closes again. If they fail, it reopens. This probing mechanism means recovery is automatic — no manual intervention needed.
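
To make the three states concrete, here is a minimal sketch of the state machine using only the standard library. It is an illustration under simplifying assumptions, not how any particular library implements it: the names Breaker, NewBreaker, ErrOpen and demoStates are invented for this example, the breaker trips on consecutive failures only, and it does not cap the number of concurrent probes in the half-open state.

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// State is one of the three breaker states.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is returned when a call is rejected without being attempted.
var ErrOpen = errors.New("circuit breaker is open")

// Breaker is a teaching sketch, not a production implementation.
type Breaker struct {
	mu        sync.Mutex
	state     State
	failures  int              // consecutive failures since the last success
	successes int              // consecutive successes while half-open
	threshold int              // failures that trip the breaker
	probes    int              // successes needed to close again
	cooldown  time.Duration    // how long to stay open before probing
	openedAt  time.Time        // when the breaker last opened
	now       func() time.Time // injectable clock so tests can control time
}

func NewBreaker(threshold, probes int, cooldown time.Duration) *Breaker {
	return &Breaker{threshold: threshold, probes: probes, cooldown: cooldown, now: time.Now}
}

// State reports the current state.
func (b *Breaker) State() State {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.state
}

// Call runs fn if the breaker allows it and records the outcome.
func (b *Breaker) Call(fn func() error) error {
	b.mu.Lock()
	if b.state == Open {
		if b.now().Sub(b.openedAt) < b.cooldown {
			b.mu.Unlock()
			return ErrOpen // still cooling down: reject without calling fn
		}
		b.state, b.successes = HalfOpen, 0 // cooldown elapsed: start probing
	}
	b.mu.Unlock()

	err := fn() // run outside the lock so slow calls don't block other callers

	b.mu.Lock()
	defer b.mu.Unlock()
	if err != nil {
		b.failures++
		if b.state == HalfOpen || b.failures >= b.threshold {
			b.state, b.openedAt, b.failures = Open, b.now(), 0
		}
		return err
	}
	b.failures = 0
	if b.state == HalfOpen {
		b.successes++
		if b.successes >= b.probes {
			b.state = Closed
		}
	}
	return nil
}

// demoStates drives a breaker through closed, open, half-open and closed using a fake clock.
func demoStates() (states []State, rejected bool) {
	clock := time.Unix(0, 0)
	b := NewBreaker(2, 1, 30*time.Second)
	b.now = func() time.Time { return clock }
	boom := errors.New("boom")
	b.Call(func() error { return boom })
	b.Call(func() error { return boom }) // second failure reaches the threshold
	states = append(states, b.State())
	rejected = errors.Is(b.Call(func() error { return nil }), ErrOpen)
	states = append(states, b.State())
	clock = clock.Add(31 * time.Second) // the cooldown has passed
	b.Call(func() error { return nil }) // the single probe succeeds
	states = append(states, b.State())
	return states, rejected
}
```

In demoStates the breaker opens after two failures, rejects a call during the cooldown, and closes again once the probe succeeds. Injecting the clock keeps the example deterministic; a real breaker would simply use time.Now.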

Implementing It in Go with gobreaker

Sony’s gobreaker is a widely used Go implementation with a clean API and zero dependencies beyond the standard library. It supports generic typed circuit breakers, rolling window failure counting, and configurable trip conditions.

Install it with:

go get github.com/sony/gobreaker/v2

Here’s a basic setup wrapping an HTTP call:

package main

import (
	"fmt"
	"io"
	"net/http"
	"time"

	"github.com/sony/gobreaker/v2"
)

var cb = gobreaker.NewCircuitBreaker[[]byte](gobreaker.Settings{
	Name:        "PaymentService",
	MaxRequests: 3,               // max requests in half-open state
	Interval:    10 * time.Second,  // clear counts every 10s in closed state
	Timeout:     30 * time.Second,  // how long to stay open before half-open
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		// Trip after more than 5 consecutive failures, or when more than half
		// of the requests have failed once more than 10 have been counted
		return counts.ConsecutiveFailures > 5 ||
			(counts.Requests > 10 && float64(counts.TotalFailures)/float64(counts.Requests) > 0.5)
	},
	OnStateChange: func(name string, from, to gobreaker.State) {
		fmt.Printf("Circuit breaker '%s' changed from %s to %s\n", name, from, to)
	},
})

func FetchPaymentURL(orderID string) ([]byte, error) {
	return cb.Execute(func() ([]byte, error) {
		resp, err := http.Get("https://payments.internal/api/orders/" + orderID)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()

		if resp.StatusCode >= 500 {
			return nil, fmt.Errorf("server error: %d", resp.StatusCode)
		}

		return io.ReadAll(resp.Body)
	})
}

The ReadyToTrip Function: Your Decision Logic

The ReadyToTrip callback is where the real tuning happens. It receives the current Counts struct and returns whether to trip the breaker. If you leave it unset, the default trips once there are more than 5 consecutive failures, but production systems typically need more nuance.

The Counts struct gives you several signals to work with:

type Counts struct {
	Requests             uint32
	TotalSuccesses       uint32
	TotalFailures        uint32
	TotalExclusions      uint32  // errors ignored via IsExcluded
	ConsecutiveSuccesses uint32
	ConsecutiveFailures  uint32
}

A useful pattern combines consecutive failures with an overall failure rate threshold. Consecutive failures catch sudden outages fast, while the rate-based check catches gradual degradation:

ReadyToTrip: func(counts gobreaker.Counts) bool {
	failureRate := float64(counts.TotalFailures) / float64(counts.Requests)
	return counts.ConsecutiveFailures > 5 || 
		(counts.Requests >= 20 && failureRate > 0.6)
},
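
To see where the boundaries of that rule fall, the same logic can be pulled into a plain function and checked against a few sample counts. This is a sketch: tripCounts is a hypothetical stand-in that mirrors only the three Counts fields the rule reads, so the example compiles without the library.

```go
package main

// tripCounts mirrors only the Counts fields this rule reads (hypothetical type).
type tripCounts struct {
	Requests            uint32
	TotalFailures       uint32
	ConsecutiveFailures uint32
}

// shouldTrip applies the combined rule: more than 5 consecutive failures,
// or a failure rate above 60% once at least 20 requests have been seen.
func shouldTrip(c tripCounts) bool {
	if c.ConsecutiveFailures > 5 {
		return true // sudden outage: trip regardless of sample size
	}
	if c.Requests < 20 {
		return false // too few samples for the rate to mean much
	}
	return float64(c.TotalFailures)/float64(c.Requests) > 0.6
}
```

With these thresholds, 6 failures in a row trip the breaker while 5 do not, and 13 failures out of 20 requests (65%) trip it while 12 out of 20 (exactly 60%) do not, because both comparisons are strict. Checking the request count before dividing also keeps the rate from being computed on an empty window.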

Excluding Non-Critical Errors

Not every error should count against the circuit breaker. Client-side timeouts from cancelled requests, 4xx validation errors, or rate-limit responses (429) don’t indicate that the downstream service is broken. gobreaker provides IsExcluded and IsSuccessful callbacks for exactly this purpose:

var cb = gobreaker.NewCircuitBreaker[[]byte](gobreaker.Settings{
	Name:    "InventoryService",
	Timeout: 30 * time.Second,
	IsExcluded: func(err error) bool {
		// Don't count caller cancellations or caller-imposed deadlines as failures.
		// Requires the "context" and "errors" imports.
		return errors.Is(err, context.Canceled) ||
			errors.Is(err, context.DeadlineExceeded)
	},
	IsSuccessful: func(err error) bool {
		// Treat 429 Too Many Requests as success (service is alive, just busy).
		// APIError is assumed to be your own error type with a StatusCode field;
		// errors.As also matches it when it has been wrapped with %w.
		var apiErr *APIError
		if errors.As(err, &apiErr) {
			return apiErr.StatusCode == 429
		}
		return err == nil
	},
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures > 5
	},
})
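
The policy described in this section can also be written as one classification function, which makes it easy to check against representative errors before wiring it into callbacks. The sketch below uses only the standard library. APIError, outcome, classify and demoClassify are hypothetical names for this example, and the order in which the checks run is this sketch's own choice rather than a statement about how the library orders its callbacks.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// APIError is a hypothetical application error that carries an HTTP status code.
type APIError struct{ StatusCode int }

func (e *APIError) Error() string { return fmt.Sprintf("api error: status %d", e.StatusCode) }

// outcome says how a breaker should treat one finished call.
type outcome string

const (
	success  outcome = "success"  // counts in the service's favour
	excluded outcome = "excluded" // ignored by the failure counters
	failure  outcome = "failure"  // counts against the service
)

// classify applies the policy from this section to a single error.
func classify(err error) outcome {
	if err == nil {
		return success
	}
	if errors.Is(err, context.Canceled) {
		return excluded // the caller gave up; this says nothing about the service
	}
	var apiErr *APIError
	if errors.As(err, &apiErr) { // also matches errors wrapped with %w
		switch {
		case apiErr.StatusCode == 429:
			return success // alive, just busy
		case apiErr.StatusCode >= 400 && apiErr.StatusCode < 500:
			return excluded // the request was bad, not the service
		}
	}
	return failure
}

// demoClassify runs classify over a handful of representative errors.
func demoClassify() []outcome {
	errs := []error{
		nil,
		fmt.Errorf("fetch stock: %w", context.Canceled),
		fmt.Errorf("fetch stock: %w", &APIError{StatusCode: 429}),
		&APIError{StatusCode: 404},
		&APIError{StatusCode: 503},
		errors.New("connection refused"),
	}
	out := make([]outcome, len(errs))
	for i, err := range errs {
		out[i] = classify(err)
	}
	return out
}
```

Using errors.Is and errors.As rather than direct comparison or a type assertion matters in practice, because errors are usually wrapped with extra context by the time they reach the breaker.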

Rolling Windows vs Fixed Windows

By default, gobreaker uses a fixed window strategy: when Interval is set, it resets the counters at the end of each Interval period (with Interval left at zero, the counts are never cleared while the breaker is closed). A burst of failures right before the interval boundary therefore gets silently cleared, potentially masking a problem.

Setting BucketPeriod switches to a rolling window where counts are tracked in smaller time buckets that expire independently. This gives smoother, more accurate failure rate tracking:

var cb = gobreaker.NewCircuitBreaker[[]byte](gobreaker.Settings{
	Name:         "ShippingService",
	Interval:     30 * time.Second,
	BucketPeriod: 5 * time.Second,   // 6 buckets in a 30s window
	Timeout:      15 * time.Second,
	ReadyToTrip: func(counts gobreaker.Counts) bool {
		return counts.ConsecutiveFailures > 3 ||
			(counts.Requests > 10 && float64(counts.TotalFailures)/float64(counts.Requests) > 0.5)
	},
})

With a 5-second bucket period and 30-second interval, failures from 25 seconds ago still count. The Interval is automatically adjusted to be a multiple of BucketPeriod, so the library handles the alignment for you.
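
The bucket idea is easier to see in code. The following is a simplified, standard-library-only sketch of a bucketed rolling window. It is not gobreaker's internal implementation: rollingWindow, bucket and demoWindow are names invented here, the sketch is not safe for concurrent use, and it assumes a positive bucket period.

```go
package main

import "time"

// bucket holds the counts for one period-sized slice of time.
type bucket struct {
	slot     int64 // index of the period this bucket currently represents
	requests uint32
	failures uint32
}

// rollingWindow is a ring of buckets covering the most recent interval.
type rollingWindow struct {
	period  time.Duration
	buckets []bucket
}

// newRollingWindow assumes period > 0 and uses interval/period buckets.
func newRollingWindow(interval, period time.Duration) *rollingWindow {
	n := int(interval / period)
	if n < 1 {
		n = 1
	}
	return &rollingWindow{period: period, buckets: make([]bucket, n)}
}

// record adds one call outcome to the bucket covering now.
func (w *rollingWindow) record(now time.Time, failed bool) {
	slot := now.UnixNano() / int64(w.period)
	b := &w.buckets[slot%int64(len(w.buckets))]
	if b.slot != slot {
		*b = bucket{slot: slot} // the ring wrapped around: drop the expired counts
	}
	b.requests++
	if failed {
		b.failures++
	}
}

// totals sums only the buckets that still fall inside the window ending at now.
func (w *rollingWindow) totals(now time.Time) (requests, failures uint32) {
	current := now.UnixNano() / int64(w.period)
	oldest := current - int64(len(w.buckets)) + 1
	for _, b := range w.buckets {
		if b.slot >= oldest && b.slot <= current {
			requests += b.requests
			failures += b.failures
		}
	}
	return requests, failures
}

// demoWindow records 3 failures at t=0s and 1 success at t=26s, then reads
// the totals at t=26s and again at t=31s.
func demoWindow() [4]uint32 {
	w := newRollingWindow(30*time.Second, 5*time.Second)
	base := time.Unix(1000000000, 0) // falls exactly on a 5s boundary
	for i := 0; i < 3; i++ {
		w.record(base, true)
	}
	w.record(base.Add(26*time.Second), false)
	r1, f1 := w.totals(base.Add(26 * time.Second))
	r2, f2 := w.totals(base.Add(31 * time.Second))
	return [4]uint32{r1, f1, r2, f2}
}
```

In demoWindow the three early failures are still counted at the 26-second mark (4 requests, 3 failures) and have aged out by the 31-second mark (1 request, 0 failures). Only the oldest bucket expires at each step, so there is no single boundary at which all history disappears.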

Integrating with HTTP Middleware

In microservice architectures, you’ll often want to apply circuit breakers at the HTTP client level rather than wrapping individual functions. Here’s a pattern for building a resilient HTTP client with per-host circuit breakers:

type ResilientClient struct {
	client    *http.Client
	breakers  sync.Map // map[string]*gobreaker.CircuitBreaker[*http.Response]
}

func NewResilientClient() *ResilientClient {
	return &ResilientClient{
		client: &http.Client{Timeout: 10 * time.Second},
	}
}

func (rc *ResilientClient) getBreaker(host string) *gobreaker.CircuitBreaker[*http.Response] {
	if v, ok := rc.breakers.Load(host); ok {
		return v.(*gobreaker.CircuitBreaker[*http.Response])
	}
	cb := gobreaker.NewCircuitBreaker[*http.Response](gobreaker.Settings{
		Name:        "client-" + host,
		MaxRequests: 3,
		Interval:    10 * time.Second,
		Timeout:     30 * time.Second,
		ReadyToTrip: func(c gobreaker.Counts) bool {
			return c.ConsecutiveFailures > 5
		},
	})
	actual, _ := rc.breakers.LoadOrStore(host, cb)
	return actual.(*gobreaker.CircuitBreaker[*http.Response])
}

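func (rc *ResilientClient) Do(req *http.Request) (*http.Response, error) {
	cb := rc.getBreaker(req.URL.Host)
	return cb.Execute(func() (*http.Response, error) {
		resp, err := rc.client.Do(req)
		if err == nil && resp.StatusCode >= 500 {
			// Surface 5xx as an error so the breaker counts it as a failure.
			resp.Body.Close()
			return nil, fmt.Errorf("server error: %d", resp.StatusCode)
		}
		return resp, err
	})
}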

Each downstream host gets its own circuit breaker. If the payment service goes down, the breaker for that host trips — but calls to the inventory service continue flowing normally through its own breaker.
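
The lazy per-host lookup is worth isolating, because the Load followed by LoadOrStore sequence has a subtle property: two goroutines can both build a candidate for a new host, but only one is ever kept. Here is a small generic sketch of that lookup using only the standard library. The names registry, hostBreaker and demoRegistry are hypothetical, and hostBreaker stands in for a real breaker type.

```go
package main

import "sync"

// hostBreaker is a placeholder for a real breaker type; only the name matters here.
type hostBreaker struct{ name string }

// registry lazily creates one value per key and is safe for concurrent use.
type registry[T any] struct {
	items   sync.Map // effectively map[string]*T
	factory func(key string) *T
}

// get returns the value for key, creating it on first use.
func (r *registry[T]) get(key string) *T {
	if v, ok := r.items.Load(key); ok {
		return v.(*T) // fast path: already created
	}
	// Two goroutines can both reach this point for a new key. LoadOrStore keeps
	// whichever value was stored first, and the other candidate is discarded.
	actual, _ := r.items.LoadOrStore(key, r.factory(key))
	return actual.(*T)
}

// demoRegistry requests one host from many goroutines and reports how many
// distinct values were handed out.
func demoRegistry() int {
	r := &registry[hostBreaker]{factory: func(host string) *hostBreaker {
		return &hostBreaker{name: "client-" + host}
	}}
	results := make([]*hostBreaker, 50)
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = r.get("payments.internal")
		}(i)
	}
	wg.Wait()
	distinct := map[*hostBreaker]bool{}
	for _, b := range results {
		distinct[b] = true
	}
	return len(distinct)
}
```

Every caller ends up sharing the same instance per host, which is what matters for a breaker: its counters only mean something if all traffic to that host flows through one object. The cost is that the factory may occasionally run for a value that is thrown away, so it should be cheap and free of side effects.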

Fallbacks: What to Do When the Breaker Trips

An open breaker returns gobreaker.ErrOpenState. How you handle that depends on your domain. For a non-critical recommendation engine, returning cached results or empty data is fine. For a core payment flow, you might want to return a clear error to the user rather than silently degrading.

func GetRecommendations(userID string) ([]Product, error) {
	products, err := recommendationCB.Execute(func() ([]Product, error) {
		return fetchFromService(userID)
	})
	if err != nil {
		if errors.Is(err, gobreaker.ErrOpenState) {
			// Service is down — serve from cache
			return cache.Get(userID)
		}
		if errors.Is(err, gobreaker.ErrTooManyRequests) {
			// Half-open, but probe slots full — return stale cache
			return cache.GetStale(userID)
		}
		return nil, fmt.Errorf("recommendation service failed: %w", err)
	}
	return products, nil
}

Tuning Guidelines

Getting circuit breaker thresholds right requires understanding your service’s normal behavior:

Consecutive failure threshold — Start with 5. Lower it (3) for critical paths where you want to fail fast. Raise it (10+) for services with known flakiness that you want to tolerate before tripping.

Open timeout — This controls how quickly you probe for recovery. Too short (5s) and you hammer a genuinely broken service. Too long (120s) and your service stays degraded even after the downstream recovers. A reasonable default is 30–60 seconds.

Half-open max requests — Keep this low (1–3). The point of half-open is to test the waters, not to let a flood through. If 3 probes succeed, the breaker closes and normal traffic resumes. If even one fails, it reopens immediately.

Window strategy — Use rolling windows (BucketPeriod) when you need smooth failure rate tracking. Use fixed windows (default) for simpler setups where the reset boundary behavior is acceptable.

Observability

The OnStateChange callback is your hook for alerting. Wire it into your metrics pipeline to get visibility into circuit breaker state transitions:

OnStateChange: func(name string, from, to gobreaker.State) {
	// Emit a metric for your monitoring system
	stateGauge.WithLabelValues(name).Set(float64(to))
	
	// Alert when a breaker opens
	if to == gobreaker.StateOpen {
		slog.Warn("circuit breaker opened",
			"breaker", name,
			"previous_state", from.String(),
		)
	}
},

Track the state as a gauge metric and alert on transitions to open state. A breaker opening is a signal that something downstream needs attention — it’s not just an error to log, it’s a system health indicator.
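
If you want to test alerting logic without a metrics backend, the same hook can feed a small in-memory recorder. This is a sketch with invented names (transitionLog, record, snapshot), and it takes states as plain strings so that it compiles without the library; in a real callback you would pass the state's string form.

```go
package main

import "sync"

// transitionLog keeps the latest state and the number of opens per breaker.
type transitionLog struct {
	mu     sync.Mutex
	latest map[string]string // last reported state per breaker
	opens  map[string]int    // how many times each breaker has opened
}

func newTransitionLog() *transitionLog {
	return &transitionLog{latest: map[string]string{}, opens: map[string]int{}}
}

// record has the same shape as a state-change hook: breaker name, old state, new state.
func (l *transitionLog) record(name, from, to string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.latest[name] = to
	if to == "open" {
		l.opens[name]++ // every transition into open is worth counting
	}
}

// snapshot returns the latest state and the open count for one breaker.
func (l *transitionLog) snapshot(name string) (state string, opens int) {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.latest[name], l.opens[name]
}
```

A breaker that goes closed, open, half-open and then open again shows up here as two opens, which is a useful signal in its own right: repeated opens in a short period suggest the downstream service is flapping rather than suffering a single outage.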

When Not to Use a Circuit Breaker

Circuit breakers add complexity and shouldn’t be applied everywhere. Local database queries don’t need one — if your database is down, the breaker won’t help much since your entire service likely can’t function. Read-only cache lookups with fast timeouts are better served by simple timeout policies. Apply circuit breakers specifically at the boundaries where you make calls to services you don’t control: third-party APIs, other microservices, and external infrastructure that can fail independently of your application.

The circuit breaker pattern isn’t new, but it remains one of the most effective tools for building resilient distributed systems. With libraries like gobreaker, the implementation overhead is minimal — a few lines of configuration and you’ve added a protective layer that can prevent cascading failures across your entire service mesh. The real engineering work is in tuning the thresholds and designing appropriate fallbacks for your specific domain.
