Same algorithm, same big-O — yet one is 10× faster. The difference is usually CPU cache usage. The biggest performance factor on a modern CPU, but invisible in code. This guide covers L1/L2/L3 structure, cache lines, false sharing, and prefetchers.
Why Cache Is Necessary — The Speed Gap
Access latency (approximate, modern x86):
L1 cache (32-64 KB) : 1 ns (~4 cycles)
L2 cache (256 KB - 1 MB) : 3-5 ns (~12 cycles)
L3 cache (8-64 MB, shared) : 12-20 ns (~40 cycles)
Main memory (RAM) : 100 ns (~300 cycles)
NVMe SSD : 10 μs (~30,000 cycles)
HDD : 10 ms (~30,000,000 cycles)
During 100 cycles of RAM access, the CPU could run 100 instructions — all wasted.If the CPU went to RAM for every instruction, you'd have a 1 GHz processor. Cache bridges the gap.
Cache Hierarchy
CPU core 1 CPU core 2
┌──────────┐ ┌──────────┐
│ L1 (32K) │ │ L1 (32K) │ ← per core
├──────────┤ ├──────────┤
│ L2 (256K)│ │ L2 (256K)│ ← per core
└─────┬────┘ └─────┬────┘
│ │
└──────────┬───────────────────┘
│
┌──────────────┐
│ L3 (16 MB) │ ← shared across cores
└──────┬───────┘
│
┌──────────────┐
│ RAM (16 GB)│
└──────────────┘Cache Line — The Memory Transfer Unit
When the CPU fetches from RAM, it doesn't fetch 1 byte but 64 bytes at a time (a cache line).
Access array[0]:
- Fetch array[0..15] (16 ints, 64 bytes) into L1 at once
Access array[1]:
- Already in L1! Zero additional cost.
→ "Spatial locality" exploitedRow-major vs Column-major — 10× Difference
int matrix[1000][1000];
// Row-major traversal (rows outer)
for (int i = 0; i < 1000; i++)
for (int j = 0; j < 1000; j++)
sum += matrix[i][j];
// → matrix[0][0], matrix[0][1], ... contiguous memory
// → cache lines well-used → fast
// Column-major traversal (cols outer)
for (int j = 0; j < 1000; j++)
for (int i = 0; i < 1000; i++)
sum += matrix[i][j];
// → matrix[0][0], matrix[1][0], matrix[2][0] ... 1000×4 bytes apart
// → new cache line each access → slow (5-10× difference)Same big-O O(N²), same instruction count. Yet 5-10× real performance difference. All from cache-line usage.
False Sharing — The Invisible Multithreaded Trap
struct {
int counter1; // only thread A updates
int counter2; // only thread B updates
} shared; // both vars on the same cache line (64 bytes)
// Logically no race (different vars).
// But in practice:
Thread A: counter1++ → cache line invalidated in B's L1
Thread B: counter2++ → cache line invalidated in A's L1
Thread A: counter1++ → cache miss, refetch from RAM (slow!)
...
→ "False sharing" — no logical conflict, but cache coherence acts like one.
Fix: align to separate cache lines (padding)
struct {
alignas(64) int counter1;
alignas(64) int counter2;
};Prefetcher — The CPU Predicts the Future
The CPU predicts the next memory access and fetches in advance (hardware prefetcher).
Sequential pattern detection:
- array[0], array[1], array[2] → "probably array[3] next"
- Prefetch ahead → no wait, instant hit
Stride pattern detection:
- a[0], a[64], a[128] (64-byte stride) → prefetch with that stride
Random pattern:
- Unpredictable, prefetch failsThat's why linked lists and other random-pointer-chasing data structures are far slower than arrays. Cache misses + prefetcher useless.
Layout — Array of Structs vs Struct of Arrays
// Array of Structs (AoS) — natural OOP
struct Point { float x, y, z; };
Point points[1000];
for (auto& p : points) p.x += 1;
// → x, y, z, x, y, z, ... all pulled into cache (y, z unused)
// Struct of Arrays (SoA) — data-oriented design
struct Points { float xs[1000], ys[1000], zs[1000]; };
Points points;
for (int i = 0; i < 1000; i++) points.xs[i] += 1;
// → xs[0..15] fills a cache line — 4× more useful
→ For hot paths like "frequently update x only", SoA wins on cache use.
Mainstream in games / ML / SIMD work.Measuring — perf, vtune, dtrace
# Linux perf
perf stat -e cache-misses,cache-references ./program
# → check cache miss rate
perf record -e cache-misses ./program
perf report # → which function causes the most misses
# Rule of thumb: cache-miss rate < 5%
# > 10% suggests room for algorithm/data-structure improvementNUMA — Memory Locality on Multi-CPU Systems
Server (2 sockets):
CPU 0 ──── RAM A (near, fast)
│
│ (interconnect, slow)
│
CPU 1 ──── RAM B (near, fast)
If a thread on CPU 0 accesses RAM B:
→ cross interconnect → 2× latency
Fix:
- Thread affinity (pin a thread to a CPU)
- numactl --cpunodebind=0 --membind=0 ./program
- Important for DBs / large JVMsCommon Pitfalls
- Linked-list overuse — for small N, arrays (vectors) almost always win on cache.
- std::map (red-black tree) cache misses — std::unordered_map + sequential layout is often faster.
- Large object arrays — sizeof(Big) = 200 B means one object per cache line. Consider SoA.
- Not using thread-local counters — shared counters in multi-threaded code = cache-line ping-pong. Thread-local + periodic sum.
- Large stack variables — fragments cache lines. Put big arrays on the heap.
Wrap-up
Cache is invisible but the biggest performance variable. Same big-O, different cache pattern = 10× difference. "Linked list" is often the wrong answer when arrays win.
Modern CPU = a game of "RAM is far, cache is near". Designing cache-friendly data structures and traversal orders yields 5-10× performance on the same code.