Rate Limiting Strategies for Production APIs: From Token Buckets to Distributed Throttling

Every public API has a ceiling. Whether you’re running a SaaS platform, a microservice cluster, or a simple REST endpoint, uncontrolled request volume will eventually degrade performance for everyone. Rate limiting isn’t just about protecting your infrastructure — it’s about fairness, cost predictability, and providing a reliable experience to every consumer.

The challenge is that “add rate limiting” is deceptively simple to say and surprisingly nuanced to get right. Different algorithms handle burst traffic differently. Distributed systems need coordination. And the right strategy depends heavily on your traffic pattern — a background job queue has entirely different constraints than a user-facing search API.

Let’s walk through the major rate limiting algorithms, implement them in Go, and explore the patterns that actually work in production environments — from single-process middleware to Redis-backed distributed throttling.

The Four Classic Algorithms

Before writing any code, it’s worth understanding the landscape. There are four rate limiting algorithms that show up repeatedly in production systems, each with distinct trade-offs:

Fixed Window Counter

The simplest approach: divide time into fixed windows (one minute, one hour) and count requests per window. If the counter exceeds the limit, reject. The problem is the boundary burst: with a limit of 100 per minute, a client that sends 100 requests at second 59 of one window and 100 more at second 0 of the next gets 200 requests through in roughly one second.
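To make the boundary burst concrete, here is a minimal sketch of a fixed window counter. The type and function names are illustrative, and the current time is passed in as Unix milliseconds (an assumption made so the behavior is deterministic; real code would pass time.Now().UnixMilli() and guard the struct with a mutex).

```go
package main

// FixedWindow counts requests in consecutive, non-overlapping windows.
// It is not safe for concurrent use; add a mutex before sharing it.
type FixedWindow struct {
	limit       int
	windowMs    int64
	windowStart int64 // start of the window the counter belongs to
	count       int
}

func NewFixedWindow(limit int, windowMs int64) *FixedWindow {
	return &FixedWindow{limit: limit, windowMs: windowMs}
}

// Allow reports whether a request arriving at nowMs fits in its window.
func (f *FixedWindow) Allow(nowMs int64) bool {
	start := nowMs - nowMs%f.windowMs // align to the window boundary
	if start != f.windowStart {
		// A new window has begun: reset the counter.
		f.windowStart = start
		f.count = 0
	}
	if f.count >= f.limit {
		return false
	}
	f.count++
	return true
}
```

With a limit of 100 and a 60,000 ms window, 100 calls at nowMs = 59000 and 100 more at nowMs = 60000 are all accepted, which is exactly the boundary burst described above.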

Sliding Window Log

Store a timestamp for each request. To check the limit, count requests within the last N seconds. Accurate but memory-intensive — tracking every request individually doesn’t scale well for high-volume APIs.
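A sliding window log can be sketched in a few lines. As before, the names are illustrative and time is passed in as Unix milliseconds (assumed to be non-decreasing between calls). Note that memory grows with the limit, since every accepted request keeps a timestamp until it ages out.

```go
package main

// SlidingLog keeps one timestamp per accepted request.
// It is not safe for concurrent use; add a mutex before sharing it.
type SlidingLog struct {
	limit    int
	windowMs int64
	stamps   []int64 // accepted request times, oldest first
}

func NewSlidingLog(limit int, windowMs int64) *SlidingLog {
	return &SlidingLog{limit: limit, windowMs: windowMs}
}

// Allow reports whether a request at nowMs keeps the count of requests
// in the last windowMs milliseconds within the limit.
func (s *SlidingLog) Allow(nowMs int64) bool {
	cutoff := nowMs - s.windowMs

	// Drop timestamps that have fallen out of the window.
	i := 0
	for i < len(s.stamps) && s.stamps[i] <= cutoff {
		i++
	}
	s.stamps = s.stamps[i:]

	if len(s.stamps) >= s.limit {
		return false
	}
	s.stamps = append(s.stamps, nowMs)
	return true
}
```

Unlike the fixed window, 100 requests at 59000 ms block any further request until 119000 ms, when the first batch leaves the 60-second window.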

Sliding Window Counter

A hybrid approach: use the fixed window counter’s memory efficiency but weight recent windows more heavily than older ones. If the current window is 30% elapsed, calculate the effective count as 70% of the previous window’s count plus the current window’s count. This gives a reasonable approximation of the sliding log without storing individual timestamps.
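The weighting can be sketched as follows. This is a minimal illustration with hypothetical names, again taking the time as Unix milliseconds; it stores only two counters per key.

```go
package main

// SlidingCounter approximates a sliding window using the counts of the
// current and previous fixed windows. Not safe for concurrent use.
type SlidingCounter struct {
	limit     int
	windowMs  int64
	curStart  int64
	curCount  int
	prevCount int
}

func NewSlidingCounter(limit int, windowMs int64) *SlidingCounter {
	return &SlidingCounter{limit: limit, windowMs: windowMs}
}

// roll advances the window bookkeeping to the window containing nowMs.
func (s *SlidingCounter) roll(nowMs int64) {
	start := nowMs - nowMs%s.windowMs
	switch {
	case start == s.curStart:
		// Still inside the current window.
	case start == s.curStart+s.windowMs:
		// Moved to the next window: current becomes previous.
		s.prevCount, s.curCount, s.curStart = s.curCount, 0, start
	default:
		// More than one window has passed: nothing recent to carry over.
		s.prevCount, s.curCount, s.curStart = 0, 0, start
	}
}

// Estimate returns the weighted request count at nowMs:
// previous count * (1 - elapsed fraction) + current count.
func (s *SlidingCounter) Estimate(nowMs int64) float64 {
	s.roll(nowMs)
	elapsed := float64(nowMs-s.curStart) / float64(s.windowMs)
	return float64(s.prevCount)*(1-elapsed) + float64(s.curCount)
}

// Allow admits the request if the estimate, including it, stays within the limit.
func (s *SlidingCounter) Allow(nowMs int64) bool {
	if s.Estimate(nowMs)+1 > float64(s.limit) {
		return false
	}
	s.curCount++
	return true
}
```

With 100 requests in the previous window, 20 in the current one, and the current window 30% elapsed, Estimate returns approximately 0.7 × 100 + 20 = 90.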

Token Bucket

Tokens are added to a bucket at a fixed rate, up to a maximum capacity. Each request consumes a token. If the bucket is empty, the request is rejected (or waits). The key property is burst capacity: a bucket that’s been idle accumulates tokens, allowing short bursts above the sustained rate. Many production systems use it because it matches real traffic well, since APIs naturally see bursts followed by quiet periods.
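Before reaching for a library, it can help to see how small the core of a token bucket is. The sketch below uses illustrative names and takes the time as Unix milliseconds; it refills lazily on each call rather than with a background timer, which is one common way to implement it.

```go
package main

// TokenBucket refills lazily: tokens are topped up on each call based
// on the time elapsed since the previous call. Not safe for concurrent use.
type TokenBucket struct {
	capacity    float64
	refillPerMs float64
	tokens      float64
	lastMs      int64
}

// NewTokenBucket returns a full bucket that refills at perSecond tokens
// per second, starting its clock at nowMs.
func NewTokenBucket(capacity int, perSecond float64, nowMs int64) *TokenBucket {
	return &TokenBucket{
		capacity:    float64(capacity),
		refillPerMs: perSecond / 1000,
		tokens:      float64(capacity),
		lastMs:      nowMs,
	}
}

// Allow consumes one token if available and reports whether it did.
func (b *TokenBucket) Allow(nowMs int64) bool {
	if nowMs > b.lastMs {
		b.tokens += float64(nowMs-b.lastMs) * b.refillPerMs
		if b.tokens > b.capacity {
			b.tokens = b.capacity // idle time never accumulates past capacity
		}
		b.lastMs = nowMs
	}
	if b.tokens < 1 {
		return false
	}
	b.tokens--
	return true
}
```

A bucket with capacity 5 and a rate of 5 per second accepts five back-to-back requests, rejects the sixth, and accepts one more once enough time has passed to refill a token.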

In-Process Rate Limiting with Go

The Go project maintains a well-tested token bucket implementation in golang.org/x/time/rate. It lives in the golang.org/x repositories rather than the standard library, so you add it as a module dependency, and it’s widely adopted across Go projects.

Here’s a practical middleware that limits requests per IP address:

package ratelimit

import (
    "net"
    "net/http"
    "sync"
    "time"

    "golang.org/x/time/rate"
)

// visitor pairs a limiter with the last time its client was seen.
type visitor struct {
    limiter  *rate.Limiter
    lastSeen int64 // Unix seconds
}

// IPRateLimiter keeps one token bucket per client IP.
type IPRateLimiter struct {
    mu       sync.Mutex
    visitors map[string]*visitor
    rate     rate.Limit
    burst    int
}

// NewIPRateLimiter starts a background goroutine that evicts idle clients.
func NewIPRateLimiter(r rate.Limit, b int) *IPRateLimiter {
    l := &IPRateLimiter{
        visitors: make(map[string]*visitor),
        rate:     r,
        burst:    b,
    }
    go l.cleanupStaleVisitors()
    return l
}

// getLimiter returns the limiter for ip, creating it on first use.
func (l *IPRateLimiter) getLimiter(ip string) *rate.Limiter {
    l.mu.Lock()
    defer l.mu.Unlock()

    if v, exists := l.visitors[ip]; exists {
        v.lastSeen = time.Now().Unix()
        return v.limiter
    }

    limiter := rate.NewLimiter(l.rate, l.burst)
    l.visitors[ip] = &visitor{limiter: limiter, lastSeen: time.Now().Unix()}
    return limiter
}

func (l *IPRateLimiter) Middleware(next http.Handler) http.Handler {
    return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
        // RemoteAddr is "IP:port"; strip the port so every connection
        // from the same client shares one bucket.
        ip, _, err := net.SplitHostPort(r.RemoteAddr)
        if err != nil {
            ip = r.RemoteAddr
        }
        if !l.getLimiter(ip).Allow() {
            http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
            return
        }
        next.ServeHTTP(w, r)
    })
}

// cleanupStaleVisitors removes clients idle for more than three minutes
// so the map does not grow without bound.
func (l *IPRateLimiter) cleanupStaleVisitors() {
    for {
        time.Sleep(3 * time.Minute)
        l.mu.Lock()
        for ip, v := range l.visitors {
            if time.Now().Unix()-v.lastSeen > 180 {
                delete(l.visitors, ip)
            }
        }
        l.mu.Unlock()
    }
}

The rate.Limit type accepts events-per-second values. rate.Every(100 * time.Millisecond) is equivalent to 10 requests per second. The burst parameter controls how many tokens the bucket can hold — this is what lets legitimate bursts through while still enforcing a sustained rate.

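// Imports omitted for brevity: log, net/http, time, golang.org/x/time/rate,
// and the ratelimit package defined above.
func main() {
    // One token every 200ms (5 per second), with room for a burst of 5.
    limiter := ratelimit.NewIPRateLimiter(rate.Every(200*time.Millisecond), 5)

    mux := http.NewServeMux()
    mux.HandleFunc("/api/items", handleItems) // handleItems is your own handler

    handler := limiter.Middleware(mux)
    log.Fatal(http.ListenAndServe(":8080", handler))
}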

This allows a burst of up to 5 requests with a sustained rate of 5 requests per second (one token every 200ms). Once the burst is spent, further requests are admitted only as tokens refill, one every 200ms.

Distributed Rate Limiting with Redis

The in-process approach works well for a single server. But most production APIs run behind load balancers with multiple instances, and each instance maintains its own counters, so a client whose requests are spread across N instances can get up to N times the intended limit. You need a shared counter, and Redis is a common answer.

The sliding window log approach maps cleanly to Redis. Use a sorted set where the score is the request timestamp and the member is a unique ID. Each request trims entries older than the window, adds its own entry, and checks the resulting count:

package redislimit

import (
    "context"
    "fmt"
    "math/rand"
    "time"

    "github.com/redis/go-redis/v9"
)

// RedisRateLimiter allows at most limit requests per key in any window.
type RedisRateLimiter struct {
    client *redis.Client
    limit  int64
    window time.Duration
}

func NewRedisRateLimiter(client *redis.Client, limit int64, window time.Duration) *RedisRateLimiter {
    return &RedisRateLimiter{
        client: client,
        limit:  limit,
        window: window,
    }
}

// Allow checks if the key is within the rate limit.
// Returns true if the request should be allowed.
func (r *RedisRateLimiter) Allow(ctx context.Context, key string) (bool, error) {
    // Microsecond timestamps fit exactly in a float64 score;
    // nanosecond timestamps would be rounded.
    now := time.Now().UnixMicro()
    windowStart := now - r.window.Microseconds()
    // A random suffix keeps members unique when two requests share a timestamp.
    member := fmt.Sprintf("%d-%d", now, rand.Int63())

    pipe := r.client.Pipeline()
    // Remove entries older than the window
    pipe.ZRemRangeByScore(ctx, key, "0", fmt.Sprintf("%d", windowStart))
    // Add the current request
    pipe.ZAdd(ctx, key, redis.Z{Score: float64(now), Member: member})
    // Count requests in the window
    countCmd := pipe.ZCard(ctx, key)
    // Set expiry so the key auto-cleans
    pipe.Expire(ctx, key, r.window+time.Second)

    _, err := pipe.Exec(ctx)
    if err != nil {
        return false, err
    }

    // The count already includes this request, so compare with <=.
    return countCmd.Val() <= r.limit, nil
}

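The pipeline sends all four commands in a single round trip. The sorted set acts as a sliding window log, and the Expire call ensures abandoned keys don’t accumulate indefinitely. Two caveats: a plain pipeline is not atomic, so commands from other clients can interleave with yours (wrap the commands in MULTI/EXEC or a Lua script if you need strict atomicity), and every request is recorded even when it is rejected, so a client that keeps retrying while over the limit stays blocked until it backs off. This implementation uses the go-redis client, which provides a clean API for pipeline operations.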

Here’s how it plugs into an HTTP handler:

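// RateLimitMiddleware applies the Redis limiter per client IP.
// It needs the "net" and "net/http" imports in addition to those above.
func RateLimitMiddleware(limiter *RedisRateLimiter) func(http.Handler) http.Handler {
    return func(next http.Handler) http.Handler {
        return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            // RemoteAddr is "IP:port"; key on the IP alone so all
            // connections from one client share a limit. Behind a proxy
            // or load balancer, RemoteAddr is the proxy's address, so
            // derive the client IP from a trusted forwarding header instead.
            ip, _, err := net.SplitHostPort(r.RemoteAddr)
            if err != nil {
                ip = r.RemoteAddr
            }
            key := "rl:" + ip

            allowed, err := limiter.Allow(r.Context(), key)
            if err != nil {
                // Fail open on Redis errors: don't block all traffic
                // if the rate limiter is unavailable
                next.ServeHTTP(w, r)
                return
            }

            if !allowed {
                w.Header().Set("Retry-After", "1")
                http.Error(w, "Too Many Requests", http.StatusTooManyRequests)
                return
            }

            next.ServeHTTP(w, r)
        })
    }
}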
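The middleware above sends a fixed Retry-After of one second. With a sliding window log, a more informative value is the time until the oldest logged request leaves the window. The helper below is a hypothetical sketch of that arithmetic; it assumes you have already fetched the oldest score (for example with the Redis ZRANGE command and WITHSCORES) and that all three arguments use the same unit, here nanoseconds.

```go
package main

// retryAfterSeconds returns how many whole seconds remain until the
// oldest logged request (at oldestNs) falls out of the window, rounded
// up. It returns 0 if that entry has already expired.
func retryAfterSeconds(oldestNs, nowNs, windowNs int64) int64 {
	waitNs := oldestNs + windowNs - nowNs
	if waitNs <= 0 {
		return 0
	}
	const second = int64(1_000_000_000)
	return (waitNs + second - 1) / second // round up to a whole second
}
```

Treat the result as a lower bound: if the log holds more entries than the limit, one entry expiring may not be enough to free a slot.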

Multi-Tier Rate Limiting

Production APIs rarely use a single rate limit across all endpoints. A more realistic setup applies different limits at different levels:

type Tier struct {
    PerSecond rate.Limit
    Burst     int
}

var tiers = map[string]Tier{
    "anonymous":     {PerSecond: rate.Every(500 * time.Millisecond), Burst: 5},
    "authenticated": {PerSecond: rate.Limit(20), Burst: 30},
    "premium":       {PerSecond: rate.Limit(100), Burst: 150},
}

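// tierFromRequest picks a tier from the request headers. It only checks
// that a header is present; run it after authentication middleware has
// validated the credentials, or any client can claim a higher tier.
func tierFromRequest(r *http.Request) string {
    if r.Header.Get("X-API-Key") != "" {
        return "premium"
    }
    if r.Header.Get("Authorization") != "" {
        return "authenticated"
    }
    return "anonymous"
}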

Anonymous traffic gets strict limits to protect against abuse. Authenticated users get a generous burst for interactive workflows. Premium API keys get significantly higher throughput. The middleware selects the tier and creates a limiter for each unique identity within that tier.

You can extend this further with per-endpoint limits — a search endpoint might have different constraints than a file upload endpoint, since the computational cost differs dramatically.
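One way to express per-endpoint limits is a lookup table keyed by route, with a default for everything else. The routes, numbers, and names below are hypothetical and only illustrate the shape; a real setup would feed the chosen values into whichever limiter you use.

```go
package main

// limit is a sustained rate plus a burst size.
type limit struct {
	PerSecond float64
	Burst     int
}

// endpointLimits holds overrides for expensive routes (hypothetical values).
var endpointLimits = map[string]limit{
	"/api/search": {PerSecond: 5, Burst: 10},
	"/api/upload": {PerSecond: 0.5, Burst: 2},
}

// defaultLimit applies to any route without an override.
var defaultLimit = limit{PerSecond: 20, Burst: 30}

// limitFor returns the limit that applies to path.
func limitFor(path string) limit {
	if l, ok := endpointLimits[path]; ok {
		return l
	}
	return defaultLimit
}

// limiterKey combines identity and path so each caller gets a separate
// bucket per endpoint.
func limiterKey(identity, path string) string {
	return identity + "|" + path
}
```

Keying the limiter map on both identity and path means a burst of uploads from one account does not eat into that account’s search budget.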

Production Pitfalls

Getting the algorithm right is only half the battle. Here are the issues that tend to surface once you’ve deployed rate limiting to production:

Shared IPs and NATs. Rate limiting by IP penalizes users behind corporate proxies, CDNs, or NAT gateways. A single corporate IP could represent thousands of individual users. The solution is to limit by authenticated identity when available and fall back to IP only for unauthenticated traffic. If you must limit by IP, use generous burst sizes.
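The identity-first, IP-fallback rule can be sketched as a small key function. The function names and key prefixes are illustrative, and the sketch assumes the API key has already been validated by earlier middleware.

```go
package main

import (
	"net"
	"net/http"
)

// limitKey returns the identity to rate limit on: the API key when one
// is present, otherwise the client IP with any port stripped.
func limitKey(apiKey, remoteAddr string) string {
	if apiKey != "" {
		return "key:" + apiKey
	}
	host, _, err := net.SplitHostPort(remoteAddr)
	if err != nil {
		host = remoteAddr // no port present; use the value as-is
	}
	return "ip:" + host
}

// keyFromRequest applies limitKey to an incoming request.
// It assumes X-API-Key was validated before this point.
func keyFromRequest(r *http.Request) string {
	return limitKey(r.Header.Get("X-API-Key"), r.RemoteAddr)
}
```

The prefixes keep the two namespaces apart, so an API key that happens to look like an IP address cannot collide with an anonymous client’s bucket.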

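Clock skew in distributed systems. If you’re limiting based on server-local timestamps across multiple instances, clock drift will cause inconsistent behavior. A shared Redis store helps, but the Allow method above still takes its timestamps from each application server’s clock, so skew between instances shifts the window edges slightly. Keeping servers synchronized with NTP is usually enough; for stricter consistency, read the time from Redis itself (the TIME command) inside a Lua script. The in-process approach has no skew problem, but each instance enforces its limit independently, so accept that it’s approximate and use it as a first line of defense.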

Failing open vs. failing closed. When your Redis connection drops, should you allow all requests or reject them? For most APIs, failing open is the right call — blocking all traffic because a rate limiter is unhealthy makes the problem worse. The middleware above does this: Redis errors fall through to the next handler.
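If some routes should fail closed (for example, an expensive or security-sensitive endpoint), the choice can be made explicit rather than buried in the middleware. The sketch below uses hypothetical names and shows only the decision logic.

```go
package main

import "errors"

// errBackendDown is a hypothetical sentinel standing in for a Redis error.
var errBackendDown = errors.New("rate limit backend unavailable")

// FailureMode says what to do when the limiter itself is unhealthy.
type FailureMode int

const (
	FailOpen   FailureMode = iota // let requests through on limiter errors
	FailClosed                    // reject requests on limiter errors
)

// decide turns the limiter's answer into a final allow/deny decision.
// When err is non-nil the limiter's answer is ignored and the mode decides.
func decide(allowed bool, err error, mode FailureMode) bool {
	if err != nil {
		return mode == FailOpen
	}
	return allowed
}
```

A middleware could call decide(allowed, err, FailOpen) for most routes and pass FailClosed only where letting unmetered traffic through is worse than a brief outage.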

Missing rate limit headers. Clients need visibility into their limits. The widely used (though not formally standardized) X-RateLimit-Limit, X-RateLimit-Remaining, and X-RateLimit-Reset headers let consumers back off proactively instead of discovering limits through 429 responses. The Go rate.Limiter exposes Tokens() and Reserve() methods that provide the values you need to populate them.
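A small helper keeps the header logic in one place. The function name is illustrative, and the sketch assumes X-RateLimit-Reset carries a Unix timestamp in seconds; some APIs send seconds-until-reset instead, so document whichever convention you choose.

```go
package main

import (
	"net/http"
	"strconv"
)

// setRateLimitHeaders writes the three conventional rate limit headers.
// resetUnix is assumed to be a Unix timestamp in seconds.
func setRateLimitHeaders(h http.Header, limit, remaining int, resetUnix int64) {
	if remaining < 0 {
		remaining = 0 // never advertise a negative budget
	}
	h.Set("X-RateLimit-Limit", strconv.Itoa(limit))
	h.Set("X-RateLimit-Remaining", strconv.Itoa(remaining))
	h.Set("X-RateLimit-Reset", strconv.FormatInt(resetUnix, 10))
}
```

In a handler you would call it as setRateLimitHeaders(w.Header(), ...) before writing the response, on both successful and 429 responses.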

Not nesting limits. Global per-account limits prevent any single consumer from dominating your capacity, but they don’t prevent a single endpoint from being overwhelmed. Layer global limits with per-endpoint limits. The outer layer protects your infrastructure; the inner layer protects individual expensive operations.
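When layering limits, one detail is easy to miss: a request rejected by the inner limit should not consume budget from the outer one. The sketch below uses a deliberately tiny stand-in for a real limiter (a plain remaining-request counter, a hypothetical type) to show the check-both-then-consume pattern.

```go
package main

// budget is a minimal stand-in for a real limiter: a count of requests
// that may still be admitted. Not safe for concurrent use.
type budget struct {
	remaining int
}

// allowNested admits a request only if both the global and the
// per-endpoint budget have room, and consumes from both only then.
func allowNested(global, endpoint *budget) bool {
	if global.remaining <= 0 || endpoint.remaining <= 0 {
		return false
	}
	global.remaining--
	endpoint.remaining--
	return true
}
```

With a global budget of 10 and an endpoint budget of 2, the third request is rejected and the global budget still shows 8 remaining, so other endpoints are unaffected by the rejection.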

Choosing the Right Approach

For a single server or a service behind a sticky-session load balancer, golang.org/x/time/rate is usually sufficient. It has no dependencies beyond the standard library, is well-tested, and handles burst traffic naturally through the token bucket algorithm.

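For distributed deployments, the Redis sliding window approach gives you consistent limits across instances at the cost of one network round trip per request, which is typically small when Redis runs close to your application servers. The pipeline pattern keeps round trips minimal, and trimming old entries on every request, together with the key expiry, keeps stale data from piling up.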

In practice, many production systems use both: an in-process limiter as the first check (catching the easy rejections without a network call) and a Redis-backed limiter for the authoritative per-account limits. The in-process layer handles bulk traffic spikes; the Redis layer ensures fairness across instances.

The key is to start simple — add a basic token bucket per IP, ship it, and observe how your traffic actually behaves. Rate limits that are too tight frustrate users; limits that are too loose leave you exposed. The data from your first deployment will tell you exactly where to refine.