How SQL Parsing Works (and Why Injection Is Dangerous)

SELECT * FROM users WHERE id = 42 takes a four-stage trip — lexer → parser → planner → executor — before any row comes back. SQL injection is dangerous because an attacker can manipulate the very first step, where the parser draws the line between code and data. This guide walks through the SQL parser internals, what the query planner does, the mechanics of injection, and exactly how prepared statements block it.

The four-stage SQL pipeline

SQL string
   ↓
1. Lexer    — tokens (SELECT, FROM, identifier, number, string ...)
   ↓
2. Parser   — AST (abstract syntax tree)
   ↓
3. Planner  — query plan (which index, join order, ...)
   ↓
4. Executor — actual row fetching
   ↓
results

Implementations vary, but the skeleton is the same in PostgreSQL, MySQL, and SQLite.

1. Lexer — characters → tokens

input: SELECT name, age FROM users WHERE id = 42

tokens:
  KEYWORD     SELECT
  IDENT       name
  PUNCT       ,
  IDENT       age
  KEYWORD     FROM
  IDENT       users
  KEYWORD     WHERE
  IDENT       id
  OP          =
  NUMBER      42

The lexer skips whitespace and groups characters into tokens. A state machine, character by character:

Letter or _ → IDENT or KEYWORD state
Digit → NUMBER state
' → STRING state until the next '
-- → comment state until end of line

Keyword recognition — after collecting an IDENT, look it up in a reserved-word table. SELECT is a keyword; users is an identifier.

2. Parser — tokens → AST

Most engines use LALR(1) parser generators (yacc/bison). PostgreSQL's grammar is ~2,000 lines of bison.

Example AST:
  SelectStmt
  ├── targets: [name, age]
  ├── from: [users]
  └── where: BinaryOp
      ├── op: =
      ├── left: ColumnRef(id)
      └── right: Constant(42)

The key property — structure and values live in separate AST nodes. The comparison = is a node; the literal 42 is a Constant child. That separation is the safety boundary.

3. Planner — choosing an execution plan

The same AST can be executed multiple ways. WHERE id = 42 could be:

Sequential scan — read every row and check
Index scan — use the B-tree on users.id
Bitmap scan — combine multiple indexes

The planner uses statistics (row counts, distinct values, index cardinality) to estimate cost and pick the cheapest plan. EXPLAIN ANALYZE shows the choice.

JOIN gets harder:

Nested loop — good for small tables
Hash join — good when there's no usable index
Merge join — both inputs already sorted

For N tables, there are N! join orderings. Planners use dynamic programming (or in some cases genetic search) to find a good one.

4. Executor — running the plan

Plan tree, top-down. Each node is a row iterator — "give me the next row." Results flow into a buffer or cursor and back to the client.

SQL Injection — breaking the parser's invariants

The vulnerable pattern — string concatenation:

// JavaScript (intentionally unsafe)
const userInput = req.query.id;
const sql = "SELECT * FROM users WHERE id = " + userInput;
db.query(sql);

// Normal user: ?id=42
//   → SELECT * FROM users WHERE id = 42

// Attacker: ?id=1 OR 1=1 --
//   → SELECT * FROM users WHERE id = 1 OR 1=1 --
//   → exposes every user (1=1 is always true, -- comments out the rest)

What just happened — the parser cheerfully accepted 1 OR 1=1 as valid SQL. The data/code boundary was destroyed at the moment of string concatenation, before the parser ever ran.

Even nastier variants

?id=1; DROP TABLE users--
→ SELECT * FROM users WHERE id = 1; DROP TABLE users--
→ Two statements (on DBs that allow multi-statement)

?username=admin&password=' OR '1'='1
→ SELECT * FROM users WHERE username = 'admin' AND password = '' OR '1'='1'
→ Bypasses the password check

?id=1 UNION SELECT credit_card FROM payments
→ Pulls rows from a different table into the result set

Prepared statements — how they neutralize injection

Prepared statements split SQL into two phases:

// 1. prepare the template (with placeholders)
db.prepare("SELECT * FROM users WHERE id = $1");

// 2. execute later, passing values separately
db.execute([userInput]);

The crucial property — the DB parses the template once and caches the AST. Subsequent executions plug values into placeholder slots in the cached AST. The parser never sees the user input as SQL, so it cannot be interpreted as code.

// Attempted injection
userInput = "1 OR 1=1"

// The DB binds this as a value, not code:
//   → SELECT * FROM users WHERE id = '1 OR 1=1'
//   → looks for rows with id literally equal to "1 OR 1=1"
//   → returns 0 rows
//   → safe

That's why prepared statements are structurally safer than escape functions. Escape can be skipped or done wrong. Prepared is "the parser doesn't run twice" — a structural guarantee.

Stored procedures aren't automatically safe

Stored procedures that build dynamic SQL with EXEC reintroduce the same risk:

CREATE PROCEDURE search(@name VARCHAR(50)) AS
  EXEC('SELECT * FROM users WHERE name = ''' + @name + '''')
-- @name = ' OR '1'='1 → injected.

Use parameterized queries inside stored procedures too.

Parser quirks per DB

PostgreSQL — most spec-compliant

Largest reserved-word table (SQL:2016 + extensions)
:: cast (id::text) is non-standard; the standard form is CAST(id AS text)
$1, $2 placeholders
Double-quoted identifiers preserve case: "User"

MySQL — loosest grammar

-- comments require a trailing space (a security-flavored choice). Older clients also accept #
Backtick-quoted identifiers: `User`
Coerces numbers and strings (WHERE id = '42' works) — quiet sources of bugs
? placeholders

SQLite — minimal

Dynamic typing — one column can hold strings and ints
Accepts both "User" and `User` for identifier quoting
Tiny parser, suitable for embedded use

SQL formatters are parser by-products

An SQL formatter (auto-indent) also runs a parser:

Parse SQL into an AST
Pretty-print the AST with indentation rules

SQL Formatter lets you pick PostgreSQL / MySQL / SQLite / Standard. The same SQL can format differently because reserved-word tables differ per dialect.

EXPLAIN — see the query plan

Every major DB supports EXPLAIN:

EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'a@b.c';

QUERY PLAN
─────────────────────────────────────────────────────────
Index Scan using users_email_idx on users
  (cost=0.42..8.44 rows=1 width=72)
  (actual time=0.029..0.030 rows=1 loops=1)
  Index Cond: (email = 'a@b.c')
Planning Time: 0.092 ms
Execution Time: 0.052 ms

cost = planner estimate. actual time = what really happened. When the two diverge significantly, statistics are stale — ANALYZE table_name.

Sequential Scan on a large table is a missing-index signal. Hash Join with mismatched build/probe sides burns memory.

Common pitfalls

1. ORM raw-query escape hatches

Sequelize's query, Prisma's $queryRaw — these can bypass the prepared-statement path if used wrong. Check the docs for each method.

2. % and _ injection in LIKE

?search=%
→ WHERE name LIKE '%%%'  ← matches everything (unexpected)

If the LIKE pattern comes from user input, escape %and _: str.replace(/[%_]/g, "\\$&") plus the LIKE ESCAPE clause.

3. ORDER BY with user-supplied column

?sort=name; DROP TABLE users--
→ ORDER BY name; DROP TABLE users--

ORDER BY can't be parameterized (column names are syntactic). Whitelist allowed values (e.g. ["name", "created_at"]).

4. Dynamic IN clauses

// Bad
const ids = req.query.ids.split(",");  // "1,2; DROP TABLE--"
db.query(`SELECT * FROM users WHERE id IN (${ids.join(",")})`);

// Good
db.query("SELECT * FROM users WHERE id = ANY($1)", [ids]);
// Postgres arrays, or generate N placeholders programmatically

5. Error message leaks

Raw DB errors expose table and column names to attackers. Always return a generic error to the client and log the detail server-side.

References

PostgreSQL — query processing overview — official docs
OWASP — SQL Injection — OWASP
Bobby Tables (prepared statements per language) — bobby-tables.com
PostgreSQL grammar source (gram.y) — GitHub

Summary

The SQL pipeline: lexer → parser → planner → executor.
Lexer emits tokens (KEYWORD / IDENT / NUMBER / STRING). Parser builds an AST that separates structure from values.
Planner picks the cheapest execution from many candidates for the same AST.
SQL injection happens because concatenation collapses the code/data boundary before parsing. Inputs like 1 OR 1=1 -- become SQL syntax.
Prepared statements parse once, cache the AST, and bind values into placeholder slots — the parser never sees the user input as code.
Places placeholders can't reach (ORDER BY, IN-cardinality, LIKE pattern, dynamic SQL) need whitelisting or other structural defenses.
DB parser quirks: PostgreSQL is strict, MySQL is lax, SQLite is tiny.
Try it: SQL Formatter uses dialect-aware parsers to format and normalize keyword case.