How Compilers Actually Work

Inside gcc hello.c: lexer, parser, semantic analyzer, optimizer, code generator — all in sequence. Compilers are among the most complex software, but the structure is clear. This guide covers the stages, LLVM IR / Go SSA, and what -O2 actually changes.

Compiler Pipeline

Source  →  Lexer  →  Parser  →  Semantic   →  IR         →  Optimization  →  Codegen  →  Machine code
            tokens     AST       (typechk)    (LLVM/SSA)      passes           assembly

  "int x = 1+2;"
        │
        ▼
  [INT, ID(x), EQ, INT_LIT(1), PLUS, INT_LIT(2), SEMI]
        │ (Lexer)
        ▼
       =
      / \
    x    +
        / \
       1   2
       │ (Parser → AST)
       ▼
   int x = 3;  ← constant folding (semantic + optimization)
       │
       ▼
   mov eax, 3
   mov [rbp-4], eax  ← codegen

Stage 1 — Lexer (tokenization)

input: "int x = 42;"
output: [
  { kind: KEYWORD, text: "int" },
  { kind: IDENT,   text: "x" },
  { kind: EQ },
  { kind: INT_LIT, value: 42 },
  { kind: SEMI }
]

Convert source into a token stream via regex / finite automaton. Whitespace and comments are dropped here.

Stage 2 — Parser (build AST)

token stream → Abstract Syntax Tree

"x = 1 + 2 * 3" →
        =
       / \
      x   +
         / \
        1   *
           / \
          2   3

Operator precedence (* binds tighter than +) reflected in tree shape.

Parser styles:
- Recursive descent (hand-written, GCC/Clang)
- LL(k) / LR(k) (yacc, bison, ANTLR)
- PEG (modern)

Stage 3 — Semantic Analysis

Walk the AST and check:
- Variables declared?
- Types match? (int + int → int OK, int + string → error)
- Function arity / arg types?
- Scope resolution

Source of error messages: "undefined variable 'foo'", "type mismatch".

Stage 4 — IR (Intermediate Representation)

AST is hard to optimize directly — convert to a flatter IR. LLVM's SSA IR is the canonical example:

source:
  int add(int a, int b) {
    int c = a + b;
    return c;
  }

LLVM IR:
  define i32 @add(i32 %a, i32 %b) {
    %c = add i32 %a, %b
    ret i32 %c
  }

SSA (Static Single Assignment) = each variable assigned exactly once. Makes dataflow analysis tractable. Almost all modern compilers use SSA.

Stage 5 — Optimization Passes

-O2 = ~50 passes mutating the IR. Highlights:

Constant Folding

int x = 1 + 2 * 3;  →  int x = 7;

Dead Code Elimination

if (false) { foo(); }  →  (removed)
int unused = 42;        →  (removed)

Inlining

int square(int x) { return x * x; }
int main() { return square(5); }

→ (after inline)
int main() { return 5 * 5; }
→ (after const fold)
int main() { return 25; }

Loop Unrolling

for (int i = 0; i < 4; i++) sum += a[i];

→
sum += a[0];
sum += a[1];
sum += a[2];
sum += a[3];

→ branch overhead eliminated

Vectorization (SIMD)

for (int i = 0; i < 1024; i++) c[i] = a[i] + b[i];

→ AVX instructions processing 8 at once
movaps  ymm0, [a]
addps   ymm0, [b]
movaps  [c], ymm0
... (only 128 iterations)

Strength Reduction

x * 2  →  x << 1
x * 4  →  x << 2
x / 2  →  x >> 1 (unsigned)
x % 2  →  x & 1

Stage 6 — Code Generation

IR → assembly for the target architecture. Different per x86, ARM, RISC-V.

LLVM IR:
  %c = add i32 %a, %b

x86 codegen (System V ABI):
  add edi, esi    ; %a in edi, %b in esi (calling convention)
  mov eax, edi    ; return value in eax

ARM codegen:
  add w0, w0, w1  ; w0 = w0 + w1 (return in w0)

Register Allocation — The Hardest Stage

Map IR's "infinite virtual registers" to the CPU's ~16 physical registers. Graph coloring. NP-hard, so heuristics.

100 source variables → 16 CPU registers
→ some "spill" to stack
→ spill = RAM round-trip per access (~100 cycles)

→ Register allocator quality has huge perf impact.
   LLVM: greedy + linear scan; GCC: similar.

-O0 vs -O2 vs -O3

level	characteristics	use
-O0	no opt, fast compile	debug build (all vars visible)
-O1	basic opt, debuggable	development
-O2 (recommended)	most opts, debuggable	production (default)
-O3	aggressive inline, vectorize, unroll	compute-heavy
-Os	size-optimized	embedded
-Ofast	-O3 + break IEEE float rules	numerical (risky)

Why -O2 Is Hard to Debug

-O0 in gdb:
  (gdb) print x
  $1 = 42       ← source variable visible

-O2 in gdb:
  (gdb) print x
  $1 = <optimized out>    ← variable in register or gone
  (gdb) step
  Single stepping until exit from function foo,
  which has no line number information.    ← function inlined away

→ Keep -O0 debug builds, or use -Og (debug-friendly opt).

Per-Language Compiler Notes

Rust (rustc → LLVM)

Borrow checker on MIR — memory safety without runtime checks
LLVM backend → same optimizer as clang
Hence long compile time, very fast runtime

Go (gc compiler)

Own backend (no LLVM) — fast compile is the priority
Escape analysis auto-decides stack vs heap
Generics (1.18+) use a dictionary approach (trade-off vs Rust's monomorphization)

Swift (Swift LLVM frontend)

Two-step lowering: SIL (Swift IR) → LLVM IR
ARC (Automatic Reference Counting) injected by the compiler

JIT (V8, JVM HotSpot, JavaScriptCore)

Optimize based on runtime profiles
Tier 1 (interpreter) → tier 2 (baseline JIT) → tier 3 (optimizing JIT)
Deoptimization — fall back if assumptions break

Common Pitfalls

"-O3 is always faster" myth — often the same as -O2 or worse (code size ↑ → I-cache miss).
UB impact on optimization — C/C++ signed overflow is UB → compiler assumes "never happens" and may remove checks. "Where did my check go?" mystery.
volatile misunderstanding — only blocks compiler register caching. No atomicity guarantee.
inline asm — bad constraints = silent bugs.

Wrap-up

A compiler is not a simple "source → machine code" translator but a pipeline of dozens of stages and hundreds of optimizations. Shared IR (LLVM) lets modern languages build atop a common backend.

Understanding the optimizer lets you predict what code changes will do. Knowing what volatile / branch hints / inlining mean is what makes performance-aware production code.