CPU 캐시는 어떻게 동작할까?

같은 알고리즘, 같은 big-O. 그런데 한쪽이 10 배 빠르다. 차이는 보통 CPU cache 사용 패턴. modern CPU 의 가장 중요한 성능 결정 요소이지만 코드에서는 보이지 않는다. 이 가이드는 L1/L2/L3 의 구조, cache line, false sharing, prefetcher 의 동작을 정리한다.

왜 cache 가 필요한가 — 성능 격차

접근 latency (대략, modern x86):

L1 cache (32-64 KB)         : 1 ns  (~4 cycle)
L2 cache (256 KB - 1 MB)    : 3-5 ns (~12 cycle)
L3 cache (8-64 MB, 공유)     : 12-20 ns (~40 cycle)
Main memory (RAM)           : 100 ns (~300 cycle)
NVMe SSD                    : 10 μs (~30,000 cycle)
HDD                         : 10 ms (~30,000,000 cycle)

100 cycle 의 RAM access 동안 CPU 는 100 명령 실행 가능 — 다 wait.

CPU 가 매 명령마다 RAM access 한다면 1 GHz 짜리. cache 가 그 격차를 메워준다.

Cache hierarchy

         CPU core 1                    CPU core 2
       ┌──────────┐                   ┌──────────┐
       │ L1 (32K) │                   │ L1 (32K) │  ← per core
       ├──────────┤                   ├──────────┤
       │ L2 (256K)│                   │ L2 (256K)│  ← per core
       └─────┬────┘                   └─────┬────┘
             │                              │
             └──────────┬───────────────────┘
                        │
                 ┌──────────────┐
                 │  L3 (16 MB)  │  ← 모든 core 공유
                 └──────┬───────┘
                        │
                 ┌──────────────┐
                 │   RAM (16 GB)│
                 └──────────────┘

Cache Line — 메모리 전송 단위

CPU 가 RAM 에서 데이터 가져올 때 1 byte 가 아니라 64 byte 단위 (cache line)로 읽음.

array[0] 접근:
- RAM 에서 array[0..15] 까지 16 개 int (64 byte) 한꺼번에 L1 로

array[1] 접근:
- 이미 L1 에 있음! 0 ns 추가 비용.

→ "공간 지역성" 활용

Row-major vs Column-major — 10× 차이

int matrix[1000][1000];

// row-major 순회 (행 우선)
for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        sum += matrix[i][j];
// → matrix[0][0], matrix[0][1], matrix[0][2] ... 연속 메모리
// → cache line 잘 활용 → 빠름

// column-major 순회 (열 우선)
for (int j = 0; j < 1000; j++)
    for (int i = 0; i < 1000; i++)
        sum += matrix[i][j];
// → matrix[0][0], matrix[1][0], matrix[2][0] ... 1000×4 byte 떨어진 곳
// → 매 접근마다 새 cache line → 느림 (5-10× 차이)

같은 big-O O(N²), 같은 명령 수. 그런데 실 성능 5-10 배 차이. cache line 활용 차이.

False Sharing — 멀티스레드의 invisible 함정

struct {
    int counter1;   // thread A 만 update
    int counter2;   // thread B 만 update
} shared;  // 두 변수가 같은 cache line (64 byte)

// 논리적으로는 race 0 (다른 변수).
// 그런데 실제로는:

Thread A: counter1++  → cache line invalidate (B 의 L1 무효화)
Thread B: counter2++  → cache line invalidate (A 의 L1 무효화)
Thread A: counter1++  → cache miss, RAM 에서 다시 (느림!)
...

→ "False sharing" — 논리적 충돌 없는데 cache coherence 가 충돌처럼 처리.
   해결: cache line 정렬로 분리 (padding)

struct {
    alignas(64) int counter1;
    alignas(64) int counter2;
};

Prefetcher — CPU 의 미래 예측

CPU 는 다음에 어떤 메모리를 access 할지 예측해서 미리 cache 에 가져옴 (hardware prefetcher).

sequential pattern 감지:
- array[0], array[1], array[2] 접근 → "next 도 array[3] 일 것"
- 미리 fetch → wait 없이 즉시 hit

stride pattern 감지:
- a[0], a[64], a[128] (64 byte 간격) → 그 stride 로 prefetch

random pattern:
- 예측 불가, prefetch 실패

그래서 linked list 같은 random pointer 따라가는 자료구조가 array 보다 훨씬 느림. cache miss + prefetcher 무용.

구조 선택 — array of struct vs struct of array

// Array of Struct (AoS) — 자연스러운 OOP
struct Point { float x, y, z; };
Point points[1000];

for (auto& p : points) p.x += 1;
// → x, y, z, x, y, z, ... 모두 cache 로 (y, z 안 쓰는데도)

// Struct of Array (SoA) — data-oriented design
struct Points { float xs[1000], ys[1000], zs[1000]; };
Points points;

for (int i = 0; i < 1000; i++) points.xs[i] += 1;
// → xs[0..15] cache line 가득 — 4 배 효율

→ "x 만 빈번히 update" 같은 hot path 에서 SoA 가 cache 효율 압승.
   game / ML / SIMD 영역의 mainstream 패턴.

측정 — perf, vtune, dtrace

# Linux perf
perf stat -e cache-misses,cache-references ./program
# → cache miss rate 확인

perf record -e cache-misses ./program
perf report  # → 어느 함수에서 cache miss 가장 많은지

# 일반적 목표: cache miss rate < 5%
# 10% 이상이면 알고리즘/자료구조 개선 여지 큼

NUMA — 멀티 CPU 의 memory locality

서버 (2 socket):
   CPU 0 ──── RAM A (가까움, 빠름)
     │
     │ (interconnect, 느림)
     │
   CPU 1 ──── RAM B (가까움, 빠름)

만약 CPU 0 에 thread 가 RAM B 의 데이터 access:
→ interconnect 통과 → 2× latency

해결:
- thread affinity (특정 thread 를 특정 CPU 에 pin)
- numactl --cpunodebind=0 --membind=0 ./program
- DB / 큰 JVM 에서 중요

흔한 함정

Linked list 남용 — 작은 N 에서는 array (vector) 가 cache 효율로 거의 항상 승.
std::map (red-black tree) 의 cache miss — std::unordered_map + sequential layout 이 더 빠른 경우 많음.
큰 객체 array — sizeof(Big) = 200 byte 면 한 cache line 에 한 객체. SoA 검토.
thread-local counter 미사용 — 멀티스레드 카운터 공유는 cache line ping-pong. thread-local + 주기적 합산.
큰 stack 변수 — cache line 단편화. 큰 array 는 heap 또는 외부.

마무리

Cache 는 보이지 않지만 가장 큰 성능 변수다. big-O 가 같아도 cache pattern 이 다르면 10× 차이. "linked list 가 정답" 인 줄 알았는데 array 가 빠른 경우가 흔하다.

Modern CPU = "RAM 은 멀다, cache 는 가깝다" 의 게임. 자료구조와 순회 순서를 cache friendly 하게 설계하면 동일 코드의 5-10 배 성능을 얻는다.