Every production service eventually faces the same question: how do you protect it from being overwhelmed? Whether it’s a sudden traffic spike, a misbehaving client retrying in a tight loop, or a viral moment that sends 100x your normal traffic — rate limiting is the safety net that keeps your systems alive.
The problem sounds simple — “just reject requests when there are too many.” But the implementation details matter enormously. Pick the wrong algorithm and you get bursty traffic patterns, wasted capacity, or worse — a rate limiter that lets through the exact traffic you wanted to block. Let’s walk through the four algorithms every engineer should understand, with working Go implementations.
1. Fixed Window Counter
The simplest approach: divide time into fixed windows (e.g., 1-minute intervals) and count requests within each window. When the count exceeds the limit, reject until the window resets.
```go
// Shared imports for every snippet in this post
// ("math" is only needed by the token bucket).
import (
	"math"
	"sync"
	"time"
)

type FixedWindowLimiter struct {
	mu          sync.Mutex
	limit       int
	window      time.Duration
	count       int
	windowStart time.Time
}

func (l *FixedWindowLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	if now.Sub(l.windowStart) >= l.window {
		// Window expired: start a new one.
		l.windowStart = now
		l.count = 0
	}
	if l.count >= l.limit {
		return false
	}
	l.count++
	return true
}
```
Pros: Simple, minimal memory (one counter per key).
Cons: Traffic bursts at window boundaries. If your limit is 100 req/min, a client can send 100 requests at 11:59:59 and another 100 at 12:00:01 — 200 requests in 2 seconds. This “boundary burst” is the algorithm’s fatal flaw.
2. Sliding Window Log
Instead of counting in fixed intervals, keep a log of timestamps, one per request. To decide whether a new request is allowed, count how many timestamps fall within the last N seconds. Old timestamps get pruned.
```go
type SlidingLogLimiter struct {
	mu     sync.Mutex
	limit  int
	window time.Duration
	logs   []time.Time
}

func (l *SlidingLogLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	cutoff := now.Add(-l.window)
	// Prune entries older than the rolling window, reusing the
	// backing array to avoid an allocation.
	valid := l.logs[:0]
	for _, t := range l.logs {
		if t.After(cutoff) {
			valid = append(valid, t)
		}
	}
	l.logs = valid
	if len(l.logs) >= l.limit {
		return false
	}
	l.logs = append(l.logs, now)
	return true
}
```
Pros: Accurate. No boundary bursts — it truly enforces “N requests per rolling window.”
Cons: Memory grows linearly with request count. For high-throughput services, storing a timestamp per request is expensive. This works fine for per-user limits with moderate traffic but doesn’t scale to high-volume keys.
3. Token Bucket
This is the algorithm used by AWS API Gateway and many other production systems. Imagine a bucket that holds tokens. Tokens are added at a fixed rate (the “refill rate”). Each request consumes one token. If the bucket is empty, the request is rejected. The bucket has a maximum capacity, allowing controlled bursts.
```go
type TokenBucketLimiter struct {
	mu         sync.Mutex
	tokens     float64
	maxTokens  float64
	refillRate float64 // tokens per second
	lastRefill time.Time
}

func NewTokenBucket(maxTokens, refillRate float64) *TokenBucketLimiter {
	return &TokenBucketLimiter{
		tokens:     maxTokens,
		maxTokens:  maxTokens,
		refillRate: refillRate,
		lastRefill: time.Now(),
	}
}

func (l *TokenBucketLimiter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	// Lazily refill based on elapsed time; no background goroutine needed.
	now := time.Now()
	elapsed := now.Sub(l.lastRefill).Seconds()
	l.tokens = math.Min(l.maxTokens, l.tokens+elapsed*l.refillRate)
	l.lastRefill = now
	if l.tokens < 1 {
		return false
	}
	l.tokens--
	return true
}
```
Pros: Allows controlled bursts (up to maxTokens), memory-efficient (just a counter + timestamp), and the math is O(1) per request. This is the gold standard for most APIs.
Cons: The burst behavior can be surprising. A client that’s been idle can immediately consume all tokens. If you need strict “never more than N per second” semantics, combine with a secondary check.
4. Sliding Window Counter (Hybrid)
This combines fixed windows with a weighted calculation from the previous window. It gives you near-sliding-window accuracy with fixed-window memory usage — the best of both worlds for distributed systems.
```go
type SlidingWindowCounter struct {
	mu           sync.Mutex
	limit        int
	window       time.Duration
	currentCount int
	prevCount    int
	windowStart  time.Time
}

func (l *SlidingWindowCounter) Allow() bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	now := time.Now()
	elapsed := now.Sub(l.windowStart)
	if elapsed >= l.window {
		// Shift windows. If more than one full window has passed idle,
		// the previous count is stale and must be zeroed, not carried over.
		if elapsed >= 2*l.window {
			l.prevCount = 0
		} else {
			l.prevCount = l.currentCount
		}
		l.currentCount = 0
		l.windowStart = now
		elapsed = 0
	}
	// Weighted estimate: previous window scaled by how much of it still
	// overlaps the rolling window, plus the current window's count.
	overlapRatio := 1.0 - elapsed.Seconds()/l.window.Seconds()
	estimated := float64(l.prevCount)*overlapRatio + float64(l.currentCount)
	if int(estimated) >= l.limit {
		return false
	}
	l.currentCount++
	return true
}
```
Pros: O(1) memory, smooth rate enforcement, no boundary burst problem. This is what Cloudflare uses in production for their global rate limiting.
Cons: The estimation isn’t perfect — it’s a weighted average, not exact. Close enough for virtually all real-world use cases.
Which One Should You Use?
- Fixed Window: Quick prototyping, internal services where boundary bursts are acceptable
- Sliding Log: When you need exact limits and traffic per-key is moderate
- Token Bucket: Public APIs, payment gateways — anywhere you need burst tolerance with a sustained rate
- Sliding Window Counter: Distributed rate limiting at scale (Redis-backed), high-throughput public APIs
Production Considerations
Rate limiting in a single process is straightforward. In production, you need to think about distributed state. The standard approach is Redis with INCR + EXPIRE for fixed windows, or the redis-cell module (CL.THROTTLE) which implements the Generic Cell Rate Algorithm (GCRA) — a token-bucket variant that provides rolling time windows without a background drip process. If you’re on Kubernetes, the Envoy proxy has built-in rate limiting that integrates with a global Redis-backed gRPC rate limit service.
Two things most teams learn the hard way:
- Always return rate limit headers (X-RateLimit-Remaining, X-RateLimit-Reset). Clients that can self-throttle are far better than clients that blindly retry.
- Use HTTP 429 with Retry-After. Don't return 500 or 503; those mean something different. 429, defined in RFC 6585 Section 4, tells the client "you're going too fast, try later."
Wrapping Up
Rate limiting isn’t just about protecting your servers — it’s a contract with your users. A well-designed rate limiter makes your API predictable, fair, and resilient. Start with token bucket for per-user limits, graduate to sliding window counter when you need distributed consistency, and always, always return the headers.
The code in this post is intentionally simple (no Redis, no goroutine pools) to make the algorithms clear. Production implementations add distributed state, hierarchical limits (per-second AND per-minute), and graceful degradation when the rate limiting backend is unavailable. But the core math is exactly what you see here.