How does a compiler work?

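A single gcc hello.c runs a lexer, parser, semantic analyzer, optimizer, and code generator in sequence. Compilers are among the most complex pieces of software, yet their structure is clear. This guide walks through those stages, what LLVM IR and Go's SSA form mean, and what -O2 actually changes.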

The compiler pipeline

Source code  →  Lexer       →  Parser  →  Semantic      →  IR          →  Optimization  →  Codegen     →  Machine code
                (tokenize)     (AST)      (type check)     (LLVM/SSA)     (passes)         (assembly)

  "int x = 1+2;"
        │
        ▼
  [INT, ID(x), EQ, INT_LIT(1), PLUS, INT_LIT(2), SEMI]
        │ (Lexer)
        ▼
       =
      / \
    x    +
        / \
       1   2
       │ (Parser → AST)
       ▼
   int x = 3;  ← constant folding (semantic + optimization)
       │
       ▼
   mov eax, 3
   mov [rbp-4], eax  ← codegen

Stage 1: Lexer (tokenization)

input: "int x = 42;"
output: [
  { kind: KEYWORD, text: "int" },
  { kind: IDENT,   text: "x" },
  { kind: EQ },
  { kind: INT_LIT, value: 42 },
  { kind: SEMI }
]

A regex or finite automaton turns the source text into a token stream. Whitespace and comments are dropped here.
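
The scanning loop can be sketched in C as below. This is a minimal illustration, not how GCC or Clang implement their lexers: the TokKind and Token names are hypothetical, only the keyword int is recognized, and comments are not handled.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, mirroring the listing above. */
typedef enum { TOK_KEYWORD, TOK_IDENT, TOK_EQ, TOK_INT_LIT, TOK_SEMI, TOK_ERROR } TokKind;

typedef struct {
    TokKind kind;
    char text[32];   /* lexeme for keywords and identifiers */
    long value;      /* numeric value for integer literals */
} Token;

/* Scans src into out[0..max) and returns the number of tokens written.
   Whitespace is skipped here; comments are not handled in this sketch. */
size_t lex(const char *src, Token *out, size_t max) {
    size_t n = 0;
    while (*src != '\0' && n < max) {
        unsigned char c = (unsigned char)*src;
        if (isspace(c)) { src++; continue; }
        Token t = { TOK_ERROR, "", 0 };
        if (isalpha(c) || c == '_') {
            /* identifier or keyword: [A-Za-z_][A-Za-z0-9_]* */
            size_t len = 0;
            while (isalnum((unsigned char)*src) || *src == '_') {
                if (len + 1 < sizeof t.text) t.text[len++] = *src;
                src++;
            }
            t.text[len] = '\0';
            t.kind = strcmp(t.text, "int") == 0 ? TOK_KEYWORD : TOK_IDENT;
        } else if (isdigit(c)) {
            /* integer literal: [0-9]+ (overflow is not checked) */
            while (isdigit((unsigned char)*src)) {
                t.value = t.value * 10 + (*src - '0');
                src++;
            }
            t.kind = TOK_INT_LIT;
        } else {
            if (c == '=') t.kind = TOK_EQ;
            else if (c == ';') t.kind = TOK_SEMI;
            src++;   /* any other character stays TOK_ERROR */
        }
        out[n++] = t;
    }
    return n;
}
```

Calling lex("int x = 42;", toks, 8) on an 8-element Token array should fill five tokens in the same order as the listing above.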

Stage 2: Parser (building the AST)

token stream → Abstract Syntax Tree

"x = 1 + 2 * 3" →
        =
       / \
      x   +
         / \
        1   *
           / \
          2   3

Operator precedence (* binds tighter than +) is encoded in the shape of the tree.

Parser types:
- Recursive descent (hand-written; GCC and Clang)
- LL(k) / LR(k) via parser generators (yacc and bison are LALR(1), ANTLR is LL(*))
- PEG (a more recent approach)
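
To make the precedence point concrete, here is a small recursive-descent sketch in C. It is an assumption-laden toy: the grammar is hypothetical, it handles only non-negative integers with + and *, and it evaluates while parsing instead of allocating AST nodes. The call structure still mirrors the tree: parse_expr calls parse_term, so * groups before +.

```c
#include <ctype.h>

/* Grammar (hypothetical, for illustration):
     expr   := term   ('+' term)*
     term   := factor ('*' factor)*
     factor := INTEGER                    */

static void skip_spaces(const char **p) {
    while (isspace((unsigned char)**p)) (*p)++;
}

/* factor: a run of decimal digits (no error handling in this sketch) */
static long parse_factor(const char **p) {
    long v = 0;
    skip_spaces(p);
    while (isdigit((unsigned char)**p)) {
        v = v * 10 + (**p - '0');
        (*p)++;
    }
    return v;
}

/* term: factors joined by '*', so multiplication groups first */
static long parse_term(const char **p) {
    long v = parse_factor(p);
    for (;;) {
        skip_spaces(p);
        if (**p != '*') return v;
        (*p)++;
        v *= parse_factor(p);
    }
}

/* expr: terms joined by '+', the lowest-precedence level */
long parse_expr(const char **p) {
    long v = parse_term(p);
    for (;;) {
        skip_spaces(p);
        if (**p != '+') return v;
        (*p)++;
        v += parse_term(p);
    }
}
```

A real parser would return a node from each function instead of a value; the recursion pattern stays the same.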

Stage 3: Semantic analysis

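The compiler walks the AST and checks:
- is every variable declared?
- do the types match? (int + int → int is OK, int + string → error)
- does each function call pass the right number of arguments?
- scope resolution

Many error messages originate here: "undefined variable 'foo'", "type mismatch", and so on.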
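
The type-matching check can be sketched as a recursive walk. The Type, ExprKind, and Expr names are hypothetical, and the only rule modeled is the one from the list above: int + int is int, anything else is an error.

```c
#include <stddef.h>

typedef enum { TY_INT, TY_STRING, TY_ERROR } Type;
typedef enum { EX_INT_LIT, EX_STR_LIT, EX_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    const struct Expr *lhs, *rhs;   /* used only by EX_ADD */
} Expr;

/* Returns the type of e, or TY_ERROR on a mismatch. */
Type check(const Expr *e) {
    switch (e->kind) {
    case EX_INT_LIT: return TY_INT;
    case EX_STR_LIT: return TY_STRING;
    case EX_ADD: {
        Type l = check(e->lhs);
        Type r = check(e->rhs);
        if (l == TY_INT && r == TY_INT) return TY_INT;
        return TY_ERROR;            /* e.g. int + string */
    }
    }
    return TY_ERROR;
}
```

A production checker would also record where the mismatch occurred so it can print a message with a source location.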

Stage 4: IR (Intermediate Representation)

Optimizing the AST directly is awkward, so the compiler lowers it to a flatter IR. LLVM's SSA-based IR is the best-known example:

source:
  int add(int a, int b) {
    int c = a + b;
    return c;
  }

LLVM IR:
  define i32 @add(i32 %a, i32 %b) {
    %c = add i32 %a, %b
    ret i32 %c
  }

SSA (Static Single Assignment) means each variable is assigned exactly once, which makes dataflow analysis much simpler. Nearly every modern compiler uses an SSA form.
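
The idea can be imitated in plain C. The function names below are hypothetical, and this is only an analogy: real SSA lives in the IR, and a phi node selects a value based on which branch ran rather than computing both sides up front.

```c
/* Ordinary form: x is assigned on two different paths. */
int abs_diff(int a, int b) {
    int x;
    if (a > b) x = a - b;
    else       x = b - a;
    return x;
}

/* SSA-style rewrite: every name is assigned exactly once.
   The conditional expression plays the role of a phi node,
   merging the two definitions where the paths join. */
int abs_diff_ssa(int a, int b) {
    const int x1 = a - b;
    const int x2 = b - a;
    const int x3 = (a > b) ? x1 : x2;   /* x3 = phi(x1, x2) */
    return x3;
}
```

Because x1, x2, and x3 each have a single definition, "where did this value come from" has exactly one answer, which is what makes dataflow analysis cheap.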

Stage 5: Optimization passes

-O2 runs dozens of passes that transform the IR (the exact number depends on the compiler and version). Representative ones:

Constant folding

int x = 1 + 2 * 3;  →  int x = 7;
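
A folding pass can be sketched as a bottom-up rewrite of an expression tree. The Node type here is hypothetical, and the sketch ignores integer overflow, which a real compiler has to handle according to the language rules.

```c
#include <stddef.h>

typedef enum { N_CONST, N_VAR, N_ADD, N_MUL } NodeKind;

typedef struct Node {
    NodeKind kind;
    int value;                 /* valid when kind == N_CONST */
    struct Node *lhs, *rhs;    /* valid for N_ADD and N_MUL */
} Node;

/* Folds constant subtrees in place, children first. */
void fold(Node *n) {
    if (n->kind != N_ADD && n->kind != N_MUL) return;
    fold(n->lhs);
    fold(n->rhs);
    if (n->lhs->kind == N_CONST && n->rhs->kind == N_CONST) {
        n->value = (n->kind == N_ADD) ? n->lhs->value + n->rhs->value
                                      : n->lhs->value * n->rhs->value;
        n->kind = N_CONST;     /* the operator node becomes a literal */
        n->lhs = n->rhs = NULL;
    }
}
```

Applied to the tree for 1 + 2 * 3, the inner multiplication folds to 6 first, then the addition folds to 7. A node with a variable operand is left alone.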

Dead code elimination

if (false) { foo(); }  →  (removed)
int unused = 42;        →  (removed)

Inlining

int square(int x) { return x * x; }
int main() { return square(5); }

→ (after inline)
int main() { return 5 * 5; }
→ (after const fold)
int main() { return 25; }

Loop unrolling

for (int i = 0; i < 4; i++) sum += a[i];

→
sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];

→ removes the per-iteration compare-and-branch overhead
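
When the trip count is not a compile-time constant, the unrolled body needs a remainder loop. The hand-written version below is a sketch of what an unroll-by-4 might look like; the function name and the factor of 4 are assumptions, and a compiler chooses its own factor.

```c
#include <stddef.h>

/* Sums n ints, unrolled by 4, with a scalar loop for the leftovers. */
long sum_unrolled(const int *a, size_t n) {
    long sum = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {   /* one branch per 4 elements */
        sum += a[i];
        sum += a[i + 1];
        sum += a[i + 2];
        sum += a[i + 3];
    }
    for (; i < n; i++) {           /* remainder: 0 to 3 elements */
        sum += a[i];
    }
    return sum;
}
```

Unrolling by hand is rarely worth it in application code; the point is to recognize the shape when it shows up in compiler output.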

Vectorization (SIMD)

for (int i = 0; i < 1024; i++) c[i] = a[i] + b[i];

→ with AVX, one instruction handles 8 elements (assuming a, b, c are float arrays)
vmovups ymm0, [a]
vaddps  ymm0, ymm0, [b]
vmovups [c], ymm0
... (only 128 iterations)

Strength reduction

x * 2  →  x << 1
x * 4  →  x << 2
x / 2  →  x >> 1 (unsigned)
x % 2  →  x & 1 (unsigned)
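
The "(unsigned)" qualifier matters. The sketch below (function names are hypothetical) shows the unsigned rewrites and why signed values need extra care: C division truncates toward zero, so -3 / 2 is -1 and -3 % 2 is -1, while -3 & 1 is 1.

```c
#include <stdint.h>

/* Unsigned: the shift/mask forms are exact replacements. */
uint32_t mul4_shift(uint32_t x) { return x << 2; }   /* x * 4 */
uint32_t div2_shift(uint32_t x) { return x >> 1; }   /* x / 2 */
uint32_t mod2_mask(uint32_t x)  { return x & 1u; }   /* x % 2 */

/* Signed: left as ordinary division and remainder. The compiler
   still avoids a divide instruction, but it has to emit a sign
   fix-up around the shift to match truncating semantics. */
int32_t div2_signed(int32_t x) { return x / 2; }
int32_t mod2_signed(int32_t x) { return x % 2; }
```

In practice, write x / 2 and let the optimizer pick the instruction sequence; hand-written shifts on signed values are an easy way to introduce off-by-one bugs for negative inputs.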

Stage 6: Code generation

IR → assembly for the target architecture. The output differs for x86, ARM, and RISC-V.

LLVM IR:
  %c = add i32 %a, %b

x86-64 codegen (System V ABI):
  add edi, esi    ; %a is in edi, %b is in esi (calling convention)
  mov eax, edi    ; the return value goes in eax

AArch64 (64-bit ARM) codegen:
  add w0, w0, w1  ; w0 = w0 + w1 (return in w0)

Register allocation: the hardest stage

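The IR's unbounded supply of virtual registers has to be mapped onto the CPU's physical registers (x86-64 has 16 general-purpose ones). This is a graph coloring problem; optimal coloring is NP-hard in general, so allocators rely on heuristics.

100 source variables → 16 CPU registers
→ some are "spilled" to the stack
→ a spilled value needs a load/store on each access: a few cycles on an L1 cache hit, ~100+ cycles if it misses all the way to RAM

→ the quality of the register allocator has a large effect on performance.
   LLVM defaults to its greedy allocator (the successor to its older linear scan); GCC uses the graph-coloring-based IRA followed by LRA.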
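
The coloring view can be sketched with a naive greedy pass over an interference graph. This is not LLVM's or GCC's algorithm: the function name, the fixed NVARS size, and the "first free register, else spill" policy are all assumptions chosen to keep the example short.

```c
#define NVARS 5   /* hypothetical number of virtual registers */

/* Greedy coloring of an interference graph.
   interferes[i][j] != 0 means virtual registers i and j are live at
   the same time and cannot share a physical register.
   Writes the chosen register (0..k-1) into reg[i], or -1 for a spill.
   Returns the number of spilled virtual registers. */
int color_registers(int interferes[NVARS][NVARS], int k, int reg[NVARS]) {
    int spills = 0;
    for (int i = 0; i < NVARS; i++) {
        reg[i] = -1;
        for (int c = 0; c < k; c++) {
            int taken = 0;
            /* only earlier nodes have been assigned so far */
            for (int j = 0; j < i; j++) {
                if (interferes[i][j] && reg[j] == c) { taken = 1; break; }
            }
            if (!taken) { reg[i] = c; break; }
        }
        if (reg[i] < 0) spills++;
    }
    return spills;
}
```

With a triangle of mutually interfering values and k = 2, one of the three has to spill; with k = 3 none do. Real allocators add spill-cost estimates, live-range splitting, and coalescing on top of this basic picture.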

-O0 vs -O2 vs -O3

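level	characteristics	use case
-O0	no optimization, fast compile	debug builds (every variable visible)
-O1	basic optimizations, still fairly debuggable	development
-O2 (recommended)	most optimizations; debugging gets harder	production (the usual release choice)
-O3	more aggressive inlining, vectorization, unrolling	compute-heavy code
-Os	optimize for size	embedded
-Ofast	-O3 plus options that break IEEE floating-point rules	numerical code (risky)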

Why -O2 is hard to debug

gdb at -O0:
  (gdb) print x
  $1 = 42       ← the source variable is visible as-is

gdb at -O2:
  (gdb) print x
  $1 = <optimized out>    ← the value lives only in a register or was eliminated
  (gdb) step
  Single stepping until exit from function foo,
  which has no line number information.    ← e.g. the function was inlined away or built without -g

→ For production, keep a separate -O0 debug build, or use -Og (debug-friendly optimization).

Compiler quirks by language

Rust (rustc → LLVM)

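The borrow checker runs in the front end (on MIR): ownership and lifetime rules are enforced at compile time, so memory safety costs no runtime checks for them
The LLVM back end is shared, so Rust gets the same optimization passes as clang
Hence compile times are long, but runtime performance is very good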

Go (gc compiler)

Own back end (no LLVM), prioritizing fast compilation
Escape analysis decides stack vs heap allocation automatically
Generics (1.18+) use GC-shape stenciling with dictionaries (a trade-off against Rust's monomorphization)

Swift (swiftc → LLVM)

Two-step lowering: SIL (Swift Intermediate Language) → LLVM IR
For ARC (Automatic Reference Counting), the compiler inserts retain/release calls automatically

JIT (V8, JVM HotSpot, JavaScriptCore)

Optimizes during execution, guided by runtime profiles
tier 1 (interpreter) → tier 2 (baseline JIT) → tier 3 (optimizing JIT)
Deoptimization: when an assumption is invalidated, execution falls back to a lower tier

Common pitfalls

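The myth that -O3 is always faster: it is often no faster than -O2 and sometimes slower (larger code → more I-cache misses).
How UB affects optimization: signed overflow is UB in C/C++, so the compiler may assume it never happens and delete code that depends on it. This is the "why did my check disappear" mystery.
Misunderstanding volatile: it only stops the compiler from caching the value in a register or eliding the access. It does not make accesses atomic (see the previous guide).
Inline asm: if the optimizer is not given accurate constraints, the result is a silent bug.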
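
The signed-overflow pitfall is easiest to see side by side. Both function names are hypothetical. The first check relies on wraparound that the language does not define; the second tests against the limits before adding, so it stays valid at any optimization level.

```c
#include <limits.h>
#include <stdbool.h>

/* Fragile: if a + b overflows, behavior is undefined, so an optimizer
   may assume the sum never wraps and reduce this test to "b < 0". */
bool add_overflows_fragile(int a, int b) {
    return a + b < a;
}

/* Well-defined: compare against the limits before adding. */
bool add_overflows(int a, int b) {
    if (b > 0) return a > INT_MAX - b;
    if (b < 0) return a < INT_MIN - b;
    return false;
}
```

GCC and Clang also provide __builtin_add_overflow for this purpose; the portable version above avoids relying on a compiler extension.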

Wrapping up

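A compiler is not a simple "source → machine code" translation but a pipeline of dozens of transformation stages and hundreds of optimizations. Thanks to shared IRs such as LLVM's, many modern languages build on the same back end.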

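Understanding how the optimizer behaves lets you predict the effect of a code change. Knowing what volatile, branch hints, and inlining actually mean before using them is what shapes the performance of production code.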