How Caching Actually Works

Phil Karlton is credited with the line "There are only two hard things in Computer Science: cache invalidation and naming things"; a popular later variant appends "and off-by-one errors". It sounds like a joke; it isn't. Caching is often the biggest performance lever and one of the biggest sources of bugs. This guide covers the cache hierarchy, four write patterns, the stampede trap, why TTL alone won't save you, and cache key design.

Cache Hierarchy — Closer Is Faster

User browser
  ↓
Browser cache (HTTP cache)            ~ 1ms (local disk)
  ↓
CDN edge cache (CloudFront, Cloudflare) ~ 10-50ms (geographic)
  ↓
Reverse proxy cache (Varnish, nginx)   ~ 1ms (data center)
  ↓
Application cache (in-memory, Redis)   ~ 1-5ms
  ↓
Database buffer cache (Postgres, MySQL) ~ 10-50ms
  ↓
Database disk                          ~ 100ms+

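→ A hit at any layer skips every layer below it, which cuts both latency and backend cost.
→ But every layer is one more copy that can go stale, and that is where the invalidation game starts.
→ The latencies above are rough orders of magnitude, not benchmarks.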
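
The lookup order above can be sketched as a walk through a list of layers, fastest first. This is a minimal illustration, assuming each layer is a plain dict and `origin` is a hypothetical callable standing in for the database; real layers (browser, CDN, proxy) are separate systems with their own expiry rules.

```python
def layered_get(key, layers, origin):
    """Look up `key` in each layer, fastest first; fall back to `origin`.

    Returns (value, depth), where depth is the index of the layer that hit,
    or len(layers) when the value had to come from the origin.
    """
    for depth, layer in enumerate(layers):
        if key in layer:
            value = layer[key]
            # Back-fill the faster layers so the next read stops earlier.
            for upper in layers[:depth]:
                upper[key] = value
            return value, depth
    value = origin(key)
    for layer in layers:
        layer[key] = value
    return value, len(layers)
```

Back-filling the upper layers on a lower-layer hit is what makes the next read cheaper, and it is also how a stale value spreads upward if a lower layer was not invalidated.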

Four Write Patterns

1. Cache-aside (most common)

read:
  data = cache.get(key)
  if data is null:
    data = db.query(key)
    cache.set(key, data, ttl=5min)
  return data

write:
  db.update(key, value)
  cache.delete(key)   ← or cache.set(new value)

Pros: simple. Application controls caching logic.
Cons:
- Latency penalty on miss (round-trip to DB)
- Race between a write and a concurrent read: a reader that missed just before the write can re-cache the old value right after the delete
- Thundering herd: when a hot key expires, N users go to the DB at once

2. Write-through

write:
  cache.set(key, value)
  db.update(key, value)
  → both must succeed to commit

Pros: cache and DB agree after every successful write
Cons:
- Write latency = cache + DB combined
- Caches things rarely read (cache pollution)
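
A minimal write-through sketch, following the cache-then-DB order shown above. The `WriteThroughStore` class and its `db_update` callable are hypothetical; the rollback on failure is one simple way to honor "both must succeed", not the only one.

```python
class WriteThroughStore:
    """Write-through sketch: every write goes to the cache and the DB together."""

    def __init__(self, db_update):
        self.cache = {}
        self._db_update = db_update  # callable(key, value); may raise

    def write(self, key, value):
        had_old = key in self.cache
        old = self.cache.get(key)
        self.cache[key] = value
        try:
            self._db_update(key, value)
        except Exception:
            # The DB write failed: undo the cache write so the two do not diverge.
            if had_old:
                self.cache[key] = old
            else:
                del self.cache[key]
            raise


db = {}
store = WriteThroughStore(db.__setitem__)
store.write("user:42", "Alice")
```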

3. Write-behind (write-back)

write:
  cache.set(key, value)
  queue.push("update DB later")
  → returns immediately (DB write is async)

Pros: very fast writes
Cons:
- Queue / cache dies → data loss
- Read-after-write consistency can break
- Rarely used (only special high-write workloads)
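
A write-behind sketch using an in-process queue and a worker thread. Everything here (`WriteBehindStore`, the dict used as a database, the `flush` helper) is an assumption for illustration; a production version would need a durable queue, which is exactly the data-loss risk listed above.

```python
import queue
import threading


class WriteBehindStore:
    """Write-behind sketch: writes land in the cache, the DB is updated later."""

    def __init__(self):
        self.cache = {}
        self.db = {}  # hypothetical database
        self._pending = queue.Queue()
        worker = threading.Thread(target=self._flush_loop, daemon=True)
        worker.start()

    def write(self, key, value):
        self.cache[key] = value
        self._pending.put((key, value))  # returns without waiting for the DB

    def _flush_loop(self):
        while True:
            key, value = self._pending.get()
            self.db[key] = value         # the deferred DB write
            self._pending.task_done()

    def flush(self):
        """Block until every queued write has reached the DB."""
        self._pending.join()


store = WriteBehindStore()
store.write("counter:views", 101)
```

If the process dies between `write` and the worker picking the item up, the update is gone, since the queue lives only in memory.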

4. Refresh-ahead

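A background worker refreshes entries before their TTL expires:
  cron: every minute, inspect expiry of all cache keys
        if expiry < 90s → re-fetch and update cache
        (the threshold must exceed the 1-minute interval, or a key with 45s left expires before the next run)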

Pros: reads always hit cache (no penalty)
Cons:
- Refreshes keys that won't be read (waste)
- Complex to implement and monitor
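
One pass of such a refresh worker might look like the sketch below. It assumes the cache is a dict of `key -> (value, expires_at)` and takes the current time as an argument so the logic is easy to test; `refresh_expiring` and `loader` are hypothetical names.

```python
def refresh_expiring(cache, loader, now, threshold, ttl):
    """Re-fetch every entry that expires within `threshold` seconds of `now`.

    `cache` maps key -> (value, expires_at). Returns the refreshed keys.
    Run this on a schedule whose interval is shorter than `threshold`.
    """
    refreshed = []
    for key, (_, expires_at) in list(cache.items()):
        if expires_at - now <= threshold:
            cache[key] = (loader(key), now + ttl)
            refreshed.append(key)
    return refreshed
```

Scanning every key on each pass is the wasteful part; tracking only recently read keys is a common way to limit it.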

Stampede / Thundering Herd

Popular page's cache TTL = 5 min. The moment after expiry:

   t=0     t=5min     t=5min+ε
   ┌───┐   ┌────────┐  ┌────────────────┐
   │OK │   │ EXPIRE │  │ 1000 users hit │
   └───┘   └────────┘  │ DB at once!    │
                        └────────────────┘

→ 1000 concurrent DB queries → DB overload → additional users also fail (cascade)

Fix 1 — Lock (single-flight):
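  on cache miss:
    token = random_id()   ← unique per caller, so only the owner releases the lock
    if redis.set("lock:" + key, token, nx=True, ex=30):   ← SET with NX + EX is atomic; SETNX takes no expiry
      data = db.query(...)
      cache.set(key, data, ttl=5min)
      if redis.get("lock:" + key) == token: redis.del("lock:" + key)   ← ideally one atomic script
    else:
      sleep(50ms); retry cache  ← wait until another worker fills it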
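
The same idea within a single process, as a sketch: a per-key `threading.Lock` replaces the Redis lock, and `slow_db_query` is a hypothetical stand-in for the expensive query. Across several processes or hosts a shared lock like the Redis one above is still needed.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

cache = {}
db_calls = 0
_locks = {}
_locks_guard = threading.Lock()


def slow_db_query(key):
    global db_calls
    db_calls += 1      # only ever reached while holding the per-key lock
    time.sleep(0.05)   # simulate an expensive query
    return f"value-for-{key}"


def get_single_flight(key):
    value = cache.get(key)
    if value is not None:
        return value
    with _locks_guard:  # create at most one lock object per key
        lock = _locks.setdefault(key, threading.Lock())
    with lock:
        value = cache.get(key)  # re-check: another thread may have filled it
        if value is None:
            value = slow_db_query(key)
            cache[key] = value
    return value


# 20 concurrent readers miss at the same moment; only one reaches the DB.
with ThreadPoolExecutor(max_workers=20) as pool:
    results = list(pool.map(get_single_flight, ["page:home"] * 20))
```

The re-check inside the lock is the important line: without it, every waiting thread would run the query as soon as it got the lock.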

Fix 2 — stale-while-revalidate:
  Keep serving the stale value after expiry; the first request that sees it stale triggers one background refresh
  → No user waits on the DB; stale data is shown briefly, until the refresh lands
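
A sketch of stale-while-revalidate in one process. `SWRCache` and `loader` are hypothetical names, and the background refresh is a plain daemon thread; this version serves stale data indefinitely, whereas real implementations usually cap how long a stale value may be served.

```python
import threading
import time


class SWRCache:
    """Serve stale values immediately and refresh them in the background."""

    def __init__(self, loader, ttl):
        self._loader = loader
        self._ttl = ttl
        self._entries = {}        # key -> (value, fresh_until)
        self._refreshing = set()  # keys with a refresh already in flight
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                value, fresh_until = entry
                stale = time.monotonic() >= fresh_until
                if stale and key not in self._refreshing:
                    # First caller to see the stale entry starts one refresh.
                    self._refreshing.add(key)
                    threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    ).start()
                return value      # fresh or stale, the caller never waits
        # Cold miss: nothing to serve yet, so load synchronously.
        return self._refresh(key)

    def _refresh(self, key):
        try:
            value = self._loader(key)
            with self._lock:
                self._entries[key] = (value, time.monotonic() + self._ttl)
        finally:
            with self._lock:
                self._refreshing.discard(key)
        return value
```

Cold misses are not deduplicated here; combining this with a single-flight lock covers that case.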

Fix 3 — Random jitter:
  TTL = 5 min ± random(60s)
  → Expirations spread out, fewer simultaneous misses
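
Jitter is a one-liner. The sketch below assumes the 5-minute base and 60-second spread from the text; `jittered_ttl` is a hypothetical helper, and `rng` is injectable only to make it testable.

```python
import random


def jittered_ttl(base_seconds=300, jitter_seconds=60, rng=random):
    """Return the base TTL shifted by a uniform random offset in [-jitter, +jitter]."""
    return base_seconds + rng.uniform(-jitter_seconds, jitter_seconds)
```

Keys written at the same moment (for example after a deploy or a cache flush) then expire over a two-minute window instead of all at once.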

Why TTL Alone Isn't Enough

Scenario: a user changes their name (Alice → A.)

t=0   : user GET /me  → "Alice" (cache + DB)
t=1   : user PATCH /me  → "A."
        DB updated, but cache still has "Alice"
t=2   : user GET /me  → "Alice" ← stale!
        Auto-update after 5 min — but user is confused in the meantime.

Fixes:
- Explicit invalidation on write (cache.del("user:42"))
- ETag / version so clients can detect stale data
- Short TTL (10-30s): shrinks the stale window but does not close it
- Pub/sub: broadcast invalidations across caches (Redis Pub/Sub, NATS)
- Event-sourcing projection: change events automatically refresh the cache (see separate guide)
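
The explicit-invalidation and pub/sub fixes combine naturally: the write path publishes the changed key, and every cache that holds a copy drops it. The sketch below is in-process only; `InvalidationBus` is a hypothetical stand-in for a real message bus, and the `db` dict stands in for the database.

```python
class InvalidationBus:
    """In-process stand-in for a pub/sub channel carrying invalidated keys."""

    def __init__(self):
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, key):
        for callback in self._subscribers:
            callback(key)


class LocalCache:
    """Per-node cache that drops a key whenever the bus announces a change."""

    def __init__(self, bus):
        self.data = {}
        bus.subscribe(self.invalidate)

    def invalidate(self, key):
        self.data.pop(key, None)


db = {"user:42": "Alice"}  # hypothetical database
bus = InvalidationBus()
node_a, node_b = LocalCache(bus), LocalCache(bus)


def read(node, key):
    if key not in node.data:  # cache-aside read on this node
        node.data[key] = db[key]
    return node.data[key]


def update_name(key, value):
    db[key] = value           # write the source of truth
    bus.publish(key)          # then tell every node to drop its copy
```

With a real bus, delivery is asynchronous and can fail, so a TTL is still worth keeping as a backstop.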

Cache Key Design

Naming — one of the two hard problems.

❌ Bad:
  key = "user_data"   ← no per-user separation
  key = "users"       ← what's stale when "users" is stale?

✓ Good:
  key = "user:42"
  key = "user:42:posts:page:1"
  key = "session:abc123"
  key = "config:v3"  ← versioned, auto-invalidates on new deploy

Rules:
- prefix:id:variant pattern (separators / namespaces)
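- Include a version or schema hash, so a deploy that changes the data shape never reads old-format entries (the cost is a cold cache right after the deploy)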
- Encode locale / user variables explicitly (one user's data must not be served to another)
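
These rules can be enforced by building every key through one helper. The sketch below is an assumption, not a standard: `cache_key` and `SCHEMA_VERSION` are hypothetical, the version is placed first, and each part is percent-encoded so a `:` inside an ID cannot collide with the separator.

```python
from urllib.parse import quote

SCHEMA_VERSION = "v3"  # hypothetical; bump when the cached data shape changes


def cache_key(prefix, entity_id, *variants, version=SCHEMA_VERSION):
    """Build a key of the form version:prefix:id[:variant...]."""
    parts = [version, prefix, str(entity_id), *map(str, variants)]
    # safe="" also encodes ":" and "/", so parts cannot break the structure.
    return ":".join(quote(part, safe="") for part in parts)


profile_key = cache_key("user", 42)
posts_key = cache_key("user", 42, "posts", "page", 1)
```

Funnelling all keys through one function also gives a single place to add locale or tenant components later.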

Common Pitfalls

Forgetting cache invalidation: stale data is shown after an update. Review the invalidation on every write path.
No negative caching: cache "not found" results too, with a short TTL. Otherwise every lookup of a missing key hits the DB.
Shared cache for per-user data: User A sees User B's data. Always include user_id in the key.
Unbounded cache size: set a memory limit and an eviction policy (LRU, LFU). Otherwise memory exhaustion ends in a crash and a cold restart.
Treating cache as source of truth: Redis down = full outage. The cache is an optimization; the DB is the source of truth, and reads must be able to fall back to it.

Wrap-up

At its heart, caching means copying data to multiple places and keeping the copies in sync. The performance wins are big, but invalidation is the hard problem. Give each layer a clear purpose and a TTL that matches it.

In practice: cache-aside + stale-while-revalidate + versioned keys is a solid default. Add explicit invalidation on write, stampede defenses (locks or jitter), and monitoring (hit rate, eviction count).