Inside gcc hello.c: lexer, parser, semantic analyzer, optimizer, code generator — all in sequence. Compilers are among the most complex software, but the structure is clear. This guide covers the stages, LLVM IR / Go SSA, and what -O2 actually changes.
Compiler Pipeline
Source → Lexer → Parser → Semantic → IR → Optimization → Codegen → Machine code
tokens AST (typechk) (LLVM/SSA) passes assembly
"int x = 1+2;"
│
▼
[INT, ID(x), EQ, INT_LIT(1), PLUS, INT_LIT(2), SEMI]
│ (Lexer)
▼
=
/ \
x +
/ \
1 2
│ (Parser → AST)
▼
int x = 3; ← constant folding (semantic + optimization)
│
▼
mov eax, 3
mov [rbp-4], eax ← codegenStage 1 — Lexer (tokenization)
input: "int x = 42;"
output: [
{ kind: KEYWORD, text: "int" },
{ kind: IDENT, text: "x" },
{ kind: EQ },
{ kind: INT_LIT, value: 42 },
{ kind: SEMI }
]Convert source into a token stream via regex / finite automaton. Whitespace and comments are dropped here.
Stage 2 — Parser (build AST)
token stream → Abstract Syntax Tree
"x = 1 + 2 * 3" →
=
/ \
x +
/ \
1 *
/ \
2 3
Operator precedence (* binds tighter than +) reflected in tree shape.
Parser styles:
- Recursive descent (hand-written, GCC/Clang)
- LL(k) / LR(k) (yacc, bison, ANTLR)
- PEG (modern)Stage 3 — Semantic Analysis
Walk the AST and check:
- Variables declared?
- Types match? (int + int → int OK, int + string → error)
- Function arity / arg types?
- Scope resolution
Source of error messages: "undefined variable 'foo'", "type mismatch".Stage 4 — IR (Intermediate Representation)
AST is hard to optimize directly — convert to a flatter IR. LLVM's SSA IR is the canonical example:
source:
int add(int a, int b) {
int c = a + b;
return c;
}
LLVM IR:
define i32 @add(i32 %a, i32 %b) {
%c = add i32 %a, %b
ret i32 %c
}SSA (Static Single Assignment) = each variable assigned exactly once. Makes dataflow analysis tractable. Almost all modern compilers use SSA.
Stage 5 — Optimization Passes
-O2 = ~50 passes mutating the IR. Highlights:
Constant Folding
int x = 1 + 2 * 3; → int x = 7;Dead Code Elimination
if (false) { foo(); } → (removed)
int unused = 42; → (removed)Inlining
int square(int x) { return x * x; }
int main() { return square(5); }
→ (after inline)
int main() { return 5 * 5; }
→ (after const fold)
int main() { return 25; }Loop Unrolling
for (int i = 0; i < 4; i++) sum += a[i];
→
sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];
→ branch overhead eliminatedVectorization (SIMD)
for (int i = 0; i < 1024; i++) c[i] = a[i] + b[i];
→ AVX instructions processing 8 at once
movaps ymm0, [a]
addps ymm0, [b]
movaps [c], ymm0
... (only 128 iterations)Strength Reduction
x * 2 → x << 1
x * 4 → x << 2
x / 2 → x >> 1 (unsigned)
x % 2 → x & 1Stage 6 — Code Generation
IR → assembly for the target architecture. Different per x86, ARM, RISC-V.
LLVM IR:
%c = add i32 %a, %b
x86 codegen (System V ABI):
add edi, esi ; %a in edi, %b in esi (calling convention)
mov eax, edi ; return value in eax
ARM codegen:
add w0, w0, w1 ; w0 = w0 + w1 (return in w0)Register Allocation — The Hardest Stage
Map IR's "infinite virtual registers" to the CPU's ~16 physical registers. Graph coloring. NP-hard, so heuristics.
100 source variables → 16 CPU registers
→ some "spill" to stack
→ spill = RAM round-trip per access (~100 cycles)
→ Register allocator quality has huge perf impact.
LLVM: greedy + linear scan; GCC: similar.-O0 vs -O2 vs -O3
| level | characteristics | use |
|---|---|---|
| -O0 | no opt, fast compile | debug build (all vars visible) |
| -O1 | basic opt, debuggable | development |
| -O2 (recommended) | most opts, debuggable | production (default) |
| -O3 | aggressive inline, vectorize, unroll | compute-heavy |
| -Os | size-optimized | embedded |
| -Ofast | -O3 + break IEEE float rules | numerical (risky) |
Why -O2 Is Hard to Debug
-O0 in gdb:
(gdb) print x
$1 = 42 ← source variable visible
-O2 in gdb:
(gdb) print x
$1 = <optimized out> ← variable in register or gone
(gdb) step
Single stepping until exit from function foo,
which has no line number information. ← function inlined away
→ Keep -O0 debug builds, or use -Og (debug-friendly opt).Per-Language Compiler Notes
Rust (rustc → LLVM)
- Borrow checker on MIR — memory safety without runtime checks
- LLVM backend → same optimizer as clang
- Hence long compile time, very fast runtime
Go (gc compiler)
- Own backend (no LLVM) — fast compile is the priority
- Escape analysis auto-decides stack vs heap
- Generics (1.18+) use a dictionary approach (trade-off vs Rust's monomorphization)
Swift (Swift LLVM frontend)
- Two-step lowering: SIL (Swift IR) → LLVM IR
- ARC (Automatic Reference Counting) injected by the compiler
JIT (V8, JVM HotSpot, JavaScriptCore)
- Optimize based on runtime profiles
- Tier 1 (interpreter) → tier 2 (baseline JIT) → tier 3 (optimizing JIT)
- Deoptimization — fall back if assumptions break
Common Pitfalls
- "-O3 is always faster" myth — often the same as -O2 or worse (code size ↑ → I-cache miss).
- UB impact on optimization — C/C++ signed overflow is UB → compiler assumes "never happens" and may remove checks. "Where did my check go?" mystery.
- volatile misunderstanding — only blocks compiler register caching. No atomicity guarantee.
- inline asm — bad constraints = silent bugs.
Wrap-up
A compiler is not a simple "source → machine code" translator but a pipeline of dozens of stages and hundreds of optimizations. Shared IR (LLVM) lets modern languages build atop a common backend.
Understanding the optimizer lets you predict what code changes will do. Knowing what volatile / branch hints / inlining mean is what makes performance-aware production code.