Rate Limiting Strategies — Token Bucket, Sliding Window, and the 429 Retry-After Contract

An API without rate limiting is one buggy client or one motivated attacker away from an outage. This guide explains why you need rate limiting, the five classic algorithms (fixed window, sliding window log, sliding window counter, token bucket, leaky bucket), how to scope limits by IP, user, or API key, and the response-header contract clients expect.

Why rate limiting

Cost protection — database queries, third-party APIs, and LLM calls cost real money per request. A single client sending 1,000 requests per minute can exhaust the budget.
Stability — one bad actor can't degrade everyone else. Noisy-neighbor isolation.
Security — slows brute-force, credential stuffing, scraping.
Fairness — different limits for free vs paid tiers.

The five classic algorithms

1. Fixed Window — simplest

A counter per time window (e.g. one minute). Resets at window rollover.

# Redis-style pseudo-code
key = `rl:${userId}:${Math.floor(Date.now() / 60000)}`;
count = INCR key
EXPIRE key 60
if count > 100: return 429

Pros: simplest. One counter.
Cons: edge bursts — 100 hits in the last second of one window plus 100 in the first second of the next = 200 over two seconds.
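
The counter above can be sketched in memory to make the edge burst concrete. This is a minimal single-process sketch, not a production limiter: `FixedWindowLimiter` is a hypothetical class, the clock is passed in so the boundary case is reproducible, and old window keys are never evicted.

```javascript
// Hypothetical in-memory fixed-window limiter (single process only).
class FixedWindowLimiter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.counters = new Map(); // "caller:windowIndex" -> count (never evicted in this sketch)
  }

  // Returns true if the request is allowed at time nowMs.
  allow(callerId, nowMs = Date.now()) {
    const windowIndex = Math.floor(nowMs / this.windowMs);
    const key = `${callerId}:${windowIndex}`;
    const count = (this.counters.get(key) ?? 0) + 1;
    this.counters.set(key, count);
    return count <= this.limit;
  }
}

// Reproduce the edge burst: 100 hits just before the rollover, 100 just after.
const fixed = new FixedWindowLimiter(100, 60_000);
let allowedAcrossBoundary = 0;
for (let i = 0; i < 100; i++) if (fixed.allow("u1", 59_900)) allowedAcrossBoundary++;
for (let i = 0; i < 100; i++) if (fixed.allow("u1", 60_100)) allowedAcrossBoundary++;
// allowedAcrossBoundary is 200 within 200 ms, although the nominal limit is 100 per minute.
```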

2. Sliding Window Log — most accurate

Store every request timestamp; count within the window.

# Redis sorted set
ZADD rl:user123 ${Date.now()} ${uuid}
ZREMRANGEBYSCORE rl:user123 0 ${Date.now() - 60000}
ZCARD rl:user123  # → current window count

Pros: no edge bursts. Exact.
Cons: memory grows with request count. Expensive at scale.
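
The same idea without Redis, as a rough single-process sketch. `SlidingWindowLog` is a hypothetical class; unlike the sorted-set snippet above, it records only accepted requests, which is one possible design choice rather than the only one.

```javascript
// Hypothetical in-memory sliding-window log (single process only).
class SlidingWindowLog {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.logs = new Map(); // callerId -> ascending array of timestamps
  }

  allow(callerId, nowMs = Date.now()) {
    const log = this.logs.get(callerId) ?? [];
    // Drop timestamps that have left the window (the ZREMRANGEBYSCORE step).
    const cutoff = nowMs - this.windowMs;
    while (log.length > 0 && log[0] <= cutoff) log.shift();
    // What remains is the exact count for the trailing window (the ZCARD step).
    const allowed = log.length < this.limit;
    if (allowed) log.push(nowMs);
    this.logs.set(callerId, log);
    return allowed;
  }
}
```

Memory per caller is bounded by the limit here, which is why this approach is practical mainly for low limits.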

3. Sliding Window Counter — the balanced pick

Keep counters for two adjacent fixed windows and weight the older one: estimate = current count + previous count × (fraction of the previous window that still overlaps the sliding window).

prevCount = 60   # previous minute
currCount = 30   # current minute
elapsedInCurrent = 30s   # 30 seconds into current window
weightedCount = prevCount * (1 - elapsedInCurrent/60) + currCount
              = 60 * 0.5 + 30
              = 60

Pros: minimal edge bursting + tiny memory. Most practical.
Cons: slight inaccuracy (assumes uniform distribution).
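
The weighted estimate can be wrapped in a small limiter. The sketch below is illustrative only: `weightedCount` and `SlidingWindowCounter` are hypothetical names, state lives in one process, and only accepted requests are counted.

```javascript
// Weighted estimate: the previous window contributes only the part still in view.
function weightedCount(prevCount, currCount, elapsedMs, windowMs) {
  const prevWeight = 1 - elapsedMs / windowMs;
  return prevCount * prevWeight + currCount;
}

// Hypothetical in-memory sliding-window counter (single process only).
class SlidingWindowCounter {
  constructor(limit, windowMs) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.state = new Map(); // callerId -> { index, curr, prev }
  }

  allow(callerId, nowMs = Date.now()) {
    const index = Math.floor(nowMs / this.windowMs);
    let s = this.state.get(callerId) ?? { index, curr: 0, prev: 0 };
    if (index !== s.index) {
      // Roll over: the old count becomes "previous" only if the windows are adjacent.
      s = { index, curr: 0, prev: index === s.index + 1 ? s.curr : 0 };
    }
    const elapsedMs = nowMs - index * this.windowMs;
    const estimate = weightedCount(s.prev, s.curr, elapsedMs, this.windowMs);
    const allowed = estimate + 1 <= this.limit;
    if (allowed) s.curr += 1;
    this.state.set(callerId, s);
    return allowed;
  }
}

// The worked example above: 60 previous, 30 current, 30 s into a 60 s window.
const estimate = weightedCount(60, 30, 30_000, 60_000); // 60
```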

4. Token Bucket — allows bursts

A bucket holds N tokens. R tokens are added per unit time. Each request consumes one. Empty bucket → reject.

# capacity=100, refill=10 tokens/sec
# Idle users have a full bucket → burst of 100 OK
# After that, sustained 10 req/sec

Pros: allows occasional bursts while bounding average rate. User-friendly.
Cons: the state is two values (token count and last-refill timestamp) that must be read and updated atomically. With Redis that usually means a Lua script.

Stripe has publicly described using a token bucket for its API rate limiters. A quiet client can burst briefly because it has banked tokens.
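
A minimal sketch of the refill-and-consume logic, assuming one bucket per caller held in process memory. `TokenBucket` and `tryConsume` are hypothetical names; tokens are refilled lazily on each call rather than by a timer.

```javascript
// Hypothetical in-memory token bucket (one instance per caller).
class TokenBucket {
  constructor(capacity, refillPerSec, nowMs = Date.now()) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.tokens = capacity; // an idle caller starts with a full bucket
    this.lastRefillMs = nowMs;
  }

  // Returns true and spends one token if the request is allowed.
  tryConsume(nowMs = Date.now()) {
    // Lazy refill: credit the tokens earned since the last call, capped at capacity.
    const elapsedSec = Math.max(0, nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefillMs = nowMs;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}

// capacity=100, refill=10 tokens/sec, as in the comment above.
const bucket = new TokenBucket(100, 10, 0);
```

In a distributed setup the two fields (`tokens`, `lastRefillMs`) would live in shared storage and be updated in one atomic step.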

5. Leaky Bucket — enforces steady rate

A queue. Requests join the tail; processed at a fixed rate. Full queue → reject.

Pros: smooth output rate.
Cons: adds queue latency. Unsuitable for synchronous APIs.

Rarely used as the primary API rate limiter. Occasionally for outbound rate limits (calling a third party).
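
For the outbound case, the queue can be sketched as follows. This is an assumption-laden illustration: `LeakyBucket`, `offer`, and `poll` are hypothetical names, and the caller is expected to invoke `poll` from its own timer or loop.

```javascript
// Hypothetical leaky bucket: a bounded queue drained at a fixed rate.
class LeakyBucket {
  constructor(capacity, leakPerSec) {
    this.capacity = capacity;
    this.leakIntervalMs = 1000 / leakPerSec;
    this.queue = [];
    this.lastLeakMs = -Infinity; // the first job may leave immediately
  }

  // Join the tail of the queue; returns false when the queue is full.
  offer(job) {
    if (this.queue.length >= this.capacity) return false;
    this.queue.push(job);
    return true;
  }

  // Release at most one job, and only if a full leak interval has passed.
  poll(nowMs = Date.now()) {
    if (this.queue.length === 0) return undefined;
    if (nowMs - this.lastLeakMs < this.leakIntervalMs) return undefined;
    this.lastLeakMs = nowMs;
    return this.queue.shift();
  }
}
```

Because jobs wait in the queue until `poll` releases them, the added latency mentioned above is visible directly in the design.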

Picking the algorithm

Algorithm	When it fits	Example
Fixed Window	Edge bursts don't matter; simplicity wins	Login attempts (5/min)
Sliding Window Counter	General API limits	1000 reads/hour
Token Bucket	Allow occasional bursts	Stripe API, AI inference endpoints
Sliding Window Log	Strict accuracy at low limits	Signups (3/hour)

Scoping — what counts as "the same caller"

Same algorithm, different grouping changes the system entirely.

Per-IP

The default for unauthenticated endpoints.
Caveats: corporate NAT puts many users behind one IP. Mobile users change IPs frequently.
X-Forwarded-For is only trustworthy when a trusted proxy (Cloudflare, AWS ALB) is in front.
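
One way to apply that rule is to read X-Forwarded-For only when the direct peer is a known proxy, and then take the rightmost address that is not itself a trusted proxy. The sketch below assumes a hypothetical `clientIp` helper and a `trustedProxies` set of exact addresses (no CIDR matching).

```javascript
// Hypothetical helper: pick the IP to rate limit on.
// remoteAddr is the socket peer; trustedProxies is a Set of proxy addresses you control.
function clientIp(remoteAddr, xForwardedFor, trustedProxies) {
  // Direct connection or unknown peer: the header is client-controlled, so ignore it.
  if (!trustedProxies.has(remoteAddr) || !xForwardedFor) return remoteAddr;
  const hops = xForwardedFor.split(",").map((h) => h.trim()).filter(Boolean);
  // Walk from the right: entries appended by trusted proxies are reliable,
  // anything further left could have been forged by the client.
  for (let i = hops.length - 1; i >= 0; i--) {
    if (!trustedProxies.has(hops[i])) return hops[i];
  }
  return remoteAddr;
}
```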

Per-user

The standard once authenticated.
Predictable limits per user. Accurate.

Per-API-key

External integrations. Tier-based limits.
Free / pro / enterprise plans with different ceilings.

Per-endpoint

Expensive endpoints (search, AI inference) get tighter limits. Cheap GETs get more.
Compose keys like rl:${userId}:/api/search.
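
The scopes above can be combined in one key builder. This is a sketch under assumptions: `rateLimitKey`, the request shape, and the `TIER_LIMITS` numbers are all hypothetical, and a scope prefix is added so that an IP and a user ID can never collide.

```javascript
// Hypothetical per-tier ceilings (requests per minute); placeholder numbers only.
const TIER_LIMITS = { free: 60, pro: 600, enterprise: 6000 };

// Build the counter key from the most specific identity available.
function rateLimitKey({ apiKey, userId, ip, path }) {
  let scope;
  if (apiKey) scope = `key:${apiKey}`;       // external integrations
  else if (userId) scope = `user:${userId}`; // authenticated users
  else scope = `ip:${ip}`;                   // unauthenticated fallback
  return `rl:${scope}:${path}`;
}

// Unknown tiers fall back to the free ceiling.
function limitForTier(tier) {
  return TIER_LIMITS[tier] ?? TIER_LIMITS.free;
}
```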

Response headers — the client contract

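The RateLimit and RateLimit-Policy fields come from the IETF HTTPAPI working group's Internet-Draft "RateLimit header fields for HTTP" (draft-ietf-httpapi-ratelimit-headers). It is not a published RFC, and the field syntax has changed between draft revisions, so check the revision you target; the examples below follow one of them. Before the draft, providers used custom names.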

# Within limit
HTTP/1.1 200 OK
RateLimit: limit=100, remaining=75, reset=30
RateLimit-Policy: 100;w=60

# Exceeded
HTTP/1.1 429 Too Many Requests
RateLimit: limit=100, remaining=0, reset=42
RateLimit-Policy: 100;w=60
Retry-After: 42

Key headers:

RateLimit — current limit / remaining / reset in seconds.
RateLimit-Policy — the policy (limit;w=window).
Retry-After — when the client may try again (delay in seconds or an HTTP-date). RFC 6585 makes it optional on 429, but always send it.

Many providers, GitHub among them, still use their own X-RateLimit-* headers. New systems can adopt the RateLimit* form, keeping in mind that it is still a draft.
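
On the server side, the headers in the examples above could be produced by a small helper. This is only a sketch: `rateLimitResponse` and its input shape are hypothetical, and the field syntax mirrors the examples in this guide rather than any particular draft revision.

```javascript
// Hypothetical helper: turn a limiter decision into a status code and headers.
function rateLimitResponse({ allowed, limit, windowSec, remaining, resetSec }) {
  const headers = {
    "RateLimit": `limit=${limit}, remaining=${Math.max(0, remaining)}, reset=${resetSec}`,
    "RateLimit-Policy": `${limit};w=${windowSec}`,
  };
  // Tell throttled clients exactly how long to wait.
  if (!allowed) headers["Retry-After"] = String(resetSec);
  return { status: allowed ? 200 : 429, headers };
}
```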

Why 429?

See HTTP Status Codes for the 4xx class. 429 Too Many Requests is the right code.

503 implies a transient server issue, not quota — avoid.
403 is for permissions, not throughput — wrong fit.

Client behavior — exponential backoff with jitter

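async function fetchWithRetry(url, opts, max = 5) {
  for (let i = 0; i <= max; i++) {
    const res = await fetch(url, opts);
    if (res.status !== 429) return res;
    if (i === max) break; // out of retries; don't sleep before failing
    // Retry-After is either delay-seconds or an HTTP-date.
    const header = res.headers.get("Retry-After");
    const seconds = Number(header);
    const serverWait = header === null ? NaN
      : Number.isFinite(seconds) ? seconds * 1000
      : Date.parse(header) - Date.now();
    // Honor the server's value (capped at 60 s); otherwise back off exponentially.
    const wait = serverWait > 0 ? Math.min(serverWait, 60_000) : 1000 * 2 ** i;
    // Up to 1 s of random jitter so clients don't retry in lockstep.
    await new Promise((r) => setTimeout(r, wait + Math.random() * 1000));
  }
  throw new Error("Rate limit exhausted");
}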

Notes:

Honor Retry-After when present — the server's authoritative value.
Otherwise exponential backoff (powers of two).
Always add jitter. Many clients retrying simultaneously = thundering herd.

Build retry test URLs in cURL Builder. Combine with UUID / ULID Generator to attach an idempotency key so retries are safe.

Distributed counting — Redis

When N servers must share one counter, Redis is the usual choice.

-- Lua script (atomic)
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])

local current = redis.call("INCR", key)
if current == 1 then
  redis.call("EXPIRE", key, window)
end

if current > limit then
  return {0, current, redis.call("TTL", key)}
else
  return {1, current, redis.call("TTL", key)}
end

Running INCR and EXPIRE inside one Lua script makes them atomic, so no request can leave the key without a TTL. Libraries: express-rate-limit, @upstash/ratelimit, rate-limiter-flexible.
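
Calling the script from Node might look like the sketch below. It assumes a client whose `eval(script, numKeys, ...keysAndArgs)` method follows the ioredis calling convention; other clients pass keys and arguments differently. `checkLimit` and its return shape are hypothetical.

```javascript
// The Lua script from above, sent to Redis as a string.
const FIXED_WINDOW_SCRIPT = `
local key = KEYS[1]
local limit = tonumber(ARGV[1])
local window = tonumber(ARGV[2])
local current = redis.call("INCR", key)
if current == 1 then
  redis.call("EXPIRE", key, window)
end
if current > limit then
  return {0, current, redis.call("TTL", key)}
else
  return {1, current, redis.call("TTL", key)}
end
`;

// Hypothetical wrapper; `client` is assumed to expose an ioredis-style eval().
async function checkLimit(client, key, limit, windowSec) {
  const [allowed, current, ttl] = await client.eval(
    FIXED_WINDOW_SCRIPT, 1, key, limit, windowSec
  );
  return {
    allowed: allowed === 1,
    remaining: Math.max(0, limit - current),
    resetSec: ttl, // seconds until the window key expires
  };
}
```

The returned fields map directly onto the RateLimit and Retry-After values described earlier.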

Edge / serverless variants

Redis from Cloudflare Workers / Vercel Edge adds latency. Alternatives:

Cloudflare Rate Limiting — infrastructure layer, configured as a WAF rule.
Durable Objects — Cloudflare's stateful Workers primitive. Each object is a single addressable instance, so one object can own the counter for a given key.
Upstash Redis — REST API with edge-local regions.

Common pitfalls

1. Omitting `Retry-After`

429 alone leaves the client guessing. The standard header is the contract.

2. IP-only scoping

Users behind a shared NAT get throttled together (false positives), while clients whose IP changes often, such as mobile users, slip past the limit. Once authenticated, use user IDs.

3. Stampede at window rollover

With fixed windows, every throttled client is unblocked at the same instant, so they all retry together at the boundary. Use a sliding window, or add jitter to client retries.

4. Blindly trusting `X-Forwarded-For`

Without a trusted proxy, clients can set any value. Honor it only when terminating behind a proxy you control.

5. Race conditions in Redis

Separate INCR + EXPIRE calls leave a window where a crashed process leaves a permanent key. Bundle into a Lua script.

6. Global-only rate limiting

"10,000 requests/min" with no per-user split — one bad actor consumes the whole budget. Layer global + per-user.

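7. No limit on auth endpoints

Login and signup endpoints are the main targets for brute force and credential stuffing, so they need the strictest limits of all (for example, 5 login attempts per minute).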

Summary

Algorithm: general APIs → Sliding Window Counter; burst-friendly → Token Bucket; strict accuracy → Sliding Window Log.
Scoping: authenticated → per-user; public → per-IP; B2B → per-API-key. Mix per-endpoint when needed.
Response: 429 + Retry-After + RateLimit-* headers.
Client: respect Retry-After, exponential backoff, jitter.
Distributed: Redis Lua scripts for atomicity. Edge uses Durable Objects or Upstash.
Auth endpoints get the strictest limits — first defense against brute force.