Skip to content
yutils

How CPU Caches Actually Work

L1 vs L2 vs L3, cache lines (64 bytes), why row-major iteration is 10× faster than column-major, false sharing in multi-threaded code, the prefetcher, NUMA, and why the same algorithm has wildly different real performance.

~9 min read

Same algorithm, same big-O — yet one is 10× faster. The difference is usually CPU cache usage. The biggest performance factor on a modern CPU, but invisible in code. This guide covers L1/L2/L3 structure, cache lines, false sharing, and prefetchers.

Why Cache Is Necessary — The Speed Gap

Access latency (approximate, modern x86):

L1 cache (32-64 KB)         : 1 ns  (~4 cycles)
L2 cache (256 KB - 1 MB)    : 3-5 ns (~12 cycles)
L3 cache (8-64 MB, shared)  : 12-20 ns (~40 cycles)
Main memory (RAM)           : 100 ns (~300 cycles)
NVMe SSD                    : 10 μs (~30,000 cycles)
HDD                         : 10 ms (~30,000,000 cycles)

During 100 cycles of RAM access, the CPU could run 100 instructions — all wasted.

If the CPU went to RAM for every instruction, you'd have a 1 GHz processor. Cache bridges the gap.

Cache Hierarchy

         CPU core 1                    CPU core 2
       ┌──────────┐                   ┌──────────┐
       │ L1 (32K) │                   │ L1 (32K) │  ← per core
       ├──────────┤                   ├──────────┤
       │ L2 (256K)│                   │ L2 (256K)│  ← per core
       └─────┬────┘                   └─────┬────┘
             │                              │
             └──────────┬───────────────────┘
                        │
                 ┌──────────────┐
                 │  L3 (16 MB)  │  ← shared across cores
                 └──────┬───────┘
                        │
                 ┌──────────────┐
                 │   RAM (16 GB)│
                 └──────────────┘

Cache Line — The Memory Transfer Unit

When the CPU fetches from RAM, it doesn't fetch 1 byte but 64 bytes at a time (a cache line).

Access array[0]:
- Fetch array[0..15] (16 ints, 64 bytes) into L1 at once

Access array[1]:
- Already in L1! Zero additional cost.

→ "Spatial locality" exploited

Row-major vs Column-major — 10× Difference

int matrix[1000][1000];

// Row-major traversal (rows outer)
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        sum += matrix[i][j];
// → matrix[0][0], matrix[0][1], ... contiguous memory
// → cache lines well-used → fast

// Column-major traversal (cols outer)
for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        sum += matrix[i][j];
// → matrix[0][0], matrix[1][0], matrix[2][0] ... 1000×4 bytes apart
// → new cache line each access → slow (5-10× difference)

Same big-O O(N²), same instruction count. Yet 5-10× real performance difference. All from cache-line usage.

False Sharing — The Invisible Multithreaded Trap

struct {
    int counter1;   // only thread A updates
    int counter2;   // only thread B updates
} shared;  // both vars on the same cache line (64 bytes)

// Logically no race (different vars).
// But in practice:

Thread A: counter1++  → cache line invalidated in B's L1
Thread B: counter2++  → cache line invalidated in A's L1
Thread A: counter1++  → cache miss, refetch from RAM (slow!)
...

→ "False sharing" — no logical conflict, but cache coherence acts like one.
   Fix: align to separate cache lines (padding)

struct {
    alignas(64) int counter1;
    alignas(64) int counter2;
};

Prefetcher — The CPU Predicts the Future

The CPU predicts the next memory access and fetches in advance (hardware prefetcher).

Sequential pattern detection:
- array[0], array[1], array[2] → "probably array[3] next"
- Prefetch ahead → no wait, instant hit

Stride pattern detection:
- a[0], a[64], a[128] (64-byte stride) → prefetch with that stride

Random pattern:
- Unpredictable, prefetch fails

That's why linked lists and other random-pointer-chasing data structures are far slower than arrays. Cache misses + prefetcher useless.

Layout — Array of Structs vs Struct of Arrays

// Array of Structs (AoS) — natural OOP
struct Point { float x, y, z; };
Point points[1000];

for (auto& p : points) p.x += 1;
// → x, y, z, x, y, z, ... all pulled into cache (y, z unused)

// Struct of Arrays (SoA) — data-oriented design
struct Points { float xs[1000], ys[1000], zs[1000]; };
Points points;

for (int i = 0; i < 1000; i++) points.xs[i] += 1;
// → xs[0..15] fills a cache line — 4× more useful

→ For hot paths like "frequently update x only", SoA wins on cache use.
   Mainstream in games / ML / SIMD work.

Measuring — perf, vtune, dtrace

# Linux perf
perf stat -e cache-misses,cache-references ./program
# → check cache miss rate

perf record -e cache-misses ./program
perf report  # → which function causes the most misses

# Rule of thumb: cache-miss rate < 5%
# > 10% suggests room for algorithm/data-structure improvement

NUMA — Memory Locality on Multi-CPU Systems

Server (2 sockets):
   CPU 0 ──── RAM A (near, fast)
     │
     │ (interconnect, slow)
     │
   CPU 1 ──── RAM B (near, fast)

If a thread on CPU 0 accesses RAM B:
→ cross interconnect → 2× latency

Fix:
- Thread affinity (pin a thread to a CPU)
- numactl --cpunodebind=0 --membind=0 ./program
- Important for DBs / large JVMs

Common Pitfalls

  • Linked-list overuse — for small N, arrays (vectors) almost always win on cache.
  • std::map (red-black tree) cache misses std::unordered_map + sequential layout is often faster.
  • Large object arrays — sizeof(Big) = 200 B means one object per cache line. Consider SoA.
  • Not using thread-local counters — shared counters in multi-threaded code = cache-line ping-pong. Thread-local + periodic sum.
  • Large stack variables — fragments cache lines. Put big arrays on the heap.

Wrap-up

Cache is invisible but the biggest performance variable. Same big-O, different cache pattern = 10× difference. "Linked list" is often the wrong answer when arrays win.

Modern CPU = a game of "RAM is far, cache is near". Designing cache-friendly data structures and traversal orders yields 5-10× performance on the same code.

Back to guides