SELECT * FROM users WHERE id = 42 takes a four-stage trip — lexer → parser → planner → executor — before any row comes back. SQL injection is dangerous because an attacker can manipulate the very first step, where the parser draws the line between code and data. This guide walks through the SQL parser internals, what the query planner does, the mechanics of injection, and exactly how prepared statements block it.
The four-stage SQL pipeline
SQL string
↓
1. Lexer — tokens (SELECT, FROM, identifier, number, string ...)
↓
2. Parser — AST (abstract syntax tree)
↓
3. Planner — query plan (which index, join order, ...)
↓
4. Executor — actual row fetching
↓
resultsImplementations vary, but the skeleton is the same in PostgreSQL, MySQL, and SQLite.
1. Lexer — characters → tokens
input: SELECT name, age FROM users WHERE id = 42
tokens:
KEYWORD SELECT
IDENT name
PUNCT ,
IDENT age
KEYWORD FROM
IDENT users
KEYWORD WHERE
IDENT id
OP =
NUMBER 42The lexer skips whitespace and groups characters into tokens. A state machine, character by character:
- Letter or _ → IDENT or KEYWORD state
- Digit → NUMBER state
'→ STRING state until the next'--→ comment state until end of line
Keyword recognition — after collecting an IDENT, look it up in a reserved-word table. SELECT is a keyword; users is an identifier.
2. Parser — tokens → AST
Most engines use LALR(1) parser generators (yacc/bison). PostgreSQL's grammar is ~2,000 lines of bison.
Example AST:
SelectStmt
├── targets: [name, age]
├── from: [users]
└── where: BinaryOp
├── op: =
├── left: ColumnRef(id)
└── right: Constant(42)The key property — structure and values live in separate AST nodes. The comparison = is a node; the literal 42 is a Constant child. That separation is the safety boundary.
3. Planner — choosing an execution plan
The same AST can be executed multiple ways. WHERE id = 42 could be:
- Sequential scan — read every row and check
- Index scan — use the B-tree on
users.id - Bitmap scan — combine multiple indexes
The planner uses statistics (row counts, distinct values, index cardinality) to estimate cost and pick the cheapest plan. EXPLAIN ANALYZE shows the choice.
JOIN gets harder:
- Nested loop — good for small tables
- Hash join — good when there's no usable index
- Merge join — both inputs already sorted
For N tables, there are N! join orderings. Planners use dynamic programming (or in some cases genetic search) to find a good one.
4. Executor — running the plan
Plan tree, top-down. Each node is a row iterator — "give me the next row." Results flow into a buffer or cursor and back to the client.
SQL Injection — breaking the parser's invariants
The vulnerable pattern — string concatenation:
// JavaScript (intentionally unsafe)
const userInput = req.query.id;
const sql = "SELECT * FROM users WHERE id = " + userInput;
db.query(sql);
// Normal user: ?id=42
// → SELECT * FROM users WHERE id = 42
// Attacker: ?id=1 OR 1=1 --
// → SELECT * FROM users WHERE id = 1 OR 1=1 --
// → exposes every user (1=1 is always true, -- comments out the rest)What just happened — the parser cheerfully accepted 1 OR 1=1 as valid SQL. The data/code boundary was destroyed at the moment of string concatenation, before the parser ever ran.
Even nastier variants
?id=1; DROP TABLE users--
→ SELECT * FROM users WHERE id = 1; DROP TABLE users--
→ Two statements (on DBs that allow multi-statement)
?username=admin&password=' OR '1'='1
→ SELECT * FROM users WHERE username = 'admin' AND password = '' OR '1'='1'
→ Bypasses the password check
?id=1 UNION SELECT credit_card FROM payments
→ Pulls rows from a different table into the result setPrepared statements — how they neutralize injection
Prepared statements split SQL into two phases:
// 1. prepare the template (with placeholders)
db.prepare("SELECT * FROM users WHERE id = $1");
// 2. execute later, passing values separately
db.execute([userInput]);The crucial property — the DB parses the template once and caches the AST. Subsequent executions plug values into placeholder slots in the cached AST. The parser never sees the user input as SQL, so it cannot be interpreted as code.
// Attempted injection
userInput = "1 OR 1=1"
// The DB binds this as a value, not code:
// → SELECT * FROM users WHERE id = '1 OR 1=1'
// → looks for rows with id literally equal to "1 OR 1=1"
// → returns 0 rows
// → safeThat's why prepared statements are structurally safer than escape functions. Escape can be skipped or done wrong. Prepared is "the parser doesn't run twice" — a structural guarantee.
Stored procedures aren't automatically safe
Stored procedures that build dynamic SQL with EXEC reintroduce the same risk:
CREATE PROCEDURE search(@name VARCHAR(50)) AS
EXEC('SELECT * FROM users WHERE name = ''' + @name + '''')
-- @name = ' OR '1'='1 → injected.Use parameterized queries inside stored procedures too.
Parser quirks per DB
PostgreSQL — most spec-compliant
- Largest reserved-word table (SQL:2016 + extensions)
::cast (id::text) is non-standard; the standard form isCAST(id AS text)$1,$2placeholders- Double-quoted identifiers preserve case:
"User"
MySQL — loosest grammar
--comments require a trailing space (a security-flavored choice). Older clients also accept#- Backtick-quoted identifiers:
`User` - Coerces numbers and strings (
WHERE id = '42'works) — quiet sources of bugs ?placeholders
SQLite — minimal
- Dynamic typing — one column can hold strings and ints
- Accepts both
"User"and`User`for identifier quoting - Tiny parser, suitable for embedded use
SQL formatters are parser by-products
An SQL formatter (auto-indent) also runs a parser:
- Parse SQL into an AST
- Pretty-print the AST with indentation rules
SQL Formatter lets you pick PostgreSQL / MySQL / SQLite / Standard. The same SQL can format differently because reserved-word tables differ per dialect.
EXPLAIN — see the query plan
Every major DB supports EXPLAIN:
EXPLAIN ANALYZE SELECT * FROM users WHERE email = 'a@b.c';
QUERY PLAN
─────────────────────────────────────────────────────────
Index Scan using users_email_idx on users
(cost=0.42..8.44 rows=1 width=72)
(actual time=0.029..0.030 rows=1 loops=1)
Index Cond: (email = 'a@b.c')
Planning Time: 0.092 ms
Execution Time: 0.052 mscost = planner estimate. actual time = what really happened. When the two diverge significantly, statistics are stale — ANALYZE table_name.
Sequential Scan on a large table is a missing-index signal. Hash Join with mismatched build/probe sides burns memory.
Common pitfalls
1. ORM raw-query escape hatches
Sequelize's query, Prisma's $queryRaw — these can bypass the prepared-statement path if used wrong. Check the docs for each method.
2. % and _ injection in LIKE
?search=%
→ WHERE name LIKE '%%%' ← matches everything (unexpected)If the LIKE pattern comes from user input, escape %and _: str.replace(/[%_]/g, "\\$&") plus the LIKE ESCAPE clause.
3. ORDER BY with user-supplied column
?sort=name; DROP TABLE users--
→ ORDER BY name; DROP TABLE users--ORDER BY can't be parameterized (column names are syntactic). Whitelist allowed values (e.g. ["name", "created_at"]).
4. Dynamic IN clauses
// Bad
const ids = req.query.ids.split(","); // "1,2; DROP TABLE--"
db.query(`SELECT * FROM users WHERE id IN (${ids.join(",")})`);
// Good
db.query("SELECT * FROM users WHERE id = ANY($1)", [ids]);
// Postgres arrays, or generate N placeholders programmatically5. Error message leaks
Raw DB errors expose table and column names to attackers. Always return a generic error to the client and log the detail server-side.
References
- PostgreSQL — query processing overview — official docs
- OWASP — SQL Injection — OWASP
- Bobby Tables (prepared statements per language) — bobby-tables.com
- PostgreSQL grammar source (gram.y) — GitHub
Summary
- The SQL pipeline: lexer → parser → planner → executor.
- Lexer emits tokens (KEYWORD / IDENT / NUMBER / STRING). Parser builds an AST that separates structure from values.
- Planner picks the cheapest execution from many candidates for the same AST.
- SQL injection happens because concatenation collapses the code/data boundary before parsing. Inputs like
1 OR 1=1 --become SQL syntax. - Prepared statements parse once, cache the AST, and bind values into placeholder slots — the parser never sees the user input as code.
- Places placeholders can't reach (ORDER BY, IN-cardinality, LIKE pattern, dynamic SQL) need whitelisting or other structural defenses.
- DB parser quirks: PostgreSQL is strict, MySQL is lax, SQLite is tiny.
- Try it: SQL Formatter uses dialect-aware parsers to format and normalize keyword case.