How Query Planners Actually Work

A single query that returns the same rows can be executed dozens of ways: scan the whole table or ride an index, join in this order or that. Picking among them is the job of the query planner (optimizer). It generates many equivalent plans and runs the one it estimates to be cheapest. The catch: "cheapest" is an estimate. When the estimate is wrong, a 0.5 ms query becomes 30 seconds. This guide walks through the planner's pipeline, its cost model, statistics, cardinality estimation, scan and join choices, and how to read EXPLAIN to see what it decided.

The pipeline — from query to plan

SQL text
   ↓ parse        syntax check → parse tree (see sibling guide)
   ↓ rewrite      expand views, flatten subqueries, fold constants
   ↓ plan/optimize  enumerate physical plans → estimate cost → pick min
   ↓ execute      actually run the chosen plan
   result rows

Parsing turns text into a tree — SQL grammar itself is covered in the how-sql-parsing-works guide. Here we focus on what comes after: rewrite → plan → execute.

Rewrite is a set of meaning-preserving transforms. A view is replaced by its defining query, some subqueries are flattened into joins, and constant conditions like WHERE 1=1 are folded away. The planner receives this tidied query and begins optimization in earnest.

Logical plan vs physical plan

A logical plan says what to do — "join these two tables, filter on this predicate". It captures only the meaning of the operations. A physical plan says how — "use a hash join, seq-scan table A, index-scan table B". One logical plan yields many physical ways to execute it.

Logical:  σ(status='paid')  orders ⋈ users
                              user_id

Same logical plan → candidate physical plans:
  A) Seq Scan(orders) ⋈hash Seq Scan(users)
  B) Index Scan(orders) ⋈nestloop Index Scan(users)
  C) Seq Scan(users) ⋈merge Index Scan(orders)
  ... many more

The planner's job = cost each candidate, pick the minimum.

The candidate count explodes as you add tables (join order alone grows factorially). So the planner prunes the search space with dynamic programming (PostgreSQL's join search) or heuristics (a genetic algorithm once there are many tables).

Cost-based optimization — assigning a number

Most modern RDBMSs use a cost-based optimizer (CBO). Each plan gets a single number (cost) and the minimum wins. Cost is not milliseconds — it is an abstract unit expressing "how many times the cost of reading one page sequentially this operation takes".

PostgreSQL cost parameters (defaults):

seq_page_cost     = 1.0    read one page sequentially (the base unit)
random_page_cost  = 4.0    read one page at a random location (HDD bias)
cpu_tuple_cost    = 0.01   process one row
cpu_index_tuple_cost = 0.005  process one index entry
cpu_operator_cost = 0.0025 one operator / function call

Rough plan cost =
  (pages read × page_cost)
+ (rows processed × cpu_tuple_cost)
+ (index entries × cpu_index_tuple_cost)
+ ...

The key assumption is that random_page_cost is 4× seq_page_cost. An index scan fetches rows from random locations, so each page is expensive; a seq scan sweeps the disk in order, so each page is cheap. That is why, when many rows match, "many expensive random reads" loses to "one cheap sequential sweep" — the core reason the planner sometimes ignores your index.

On SSD / NVMe, random reads are not that expensive. A common tuning move is lowering random_page_cost to around 1.1 to nudge the planner toward using indexes more aggressively.

Statistics — the planner's eyes

To compute cost the planner must know "how many rows are in this table" and "how many distinct values does this column have". That information is the statistics. PostgreSQL's ANALYZE (or autovacuum's auto-analyze) samples a table and collects them.

ANALYZE orders;   -- refresh statistics for orders

-- What it collects (inspect via the pg_stats view):
reltuples    estimated row count of the table
n_distinct   number (or fraction) of distinct values
most_common_vals / most_common_freqs   top frequent values + freq (MCV)
histogram_bounds   distribution histogram of the remaining values
null_frac    fraction of NULLs
correlation  physical order vs value order (gauges index-scan efficiency)

The histogram drives selectivity estimates for range queries. By splitting column values into equal-frequency buckets, the planner can estimate what fraction passes WHERE age > 40 from the bucket position. MCV (most-common-values) shines on skewed columns — if 99% of status is 'done', that value is recorded separately with its true frequency for an accurate estimate.

Why stale stats cause bad plans: statistics are a snapshot from the last ANALYZE. If a table held 10K rows yesterday and you loaded 10M today without analyzing, the planner still believes there are 10K and picks a "small-table" plan like a nested loop. In reality it loops ten million times and the query appears to hang.

Cardinality / selectivity — the #1 culprit

Cardinality = the estimated number of rows an operation will produce. Selectivity = the fraction a filter lets through (0–1). Every cost computation is built on top of these estimates. And the #1 cause of bad plans is cardinality misestimation.

Say orders has 1,000,000 rows, status has 5 distinct values.

WHERE status = 'paid'
→ planner: selectivity ≈ 1/5 = 0.2 → estimate 200,000 rows
   (uniform assumption; with MCV it uses the real frequency)

WHERE user_id = 42
→ if user_id has 100,000 distinct values
   selectivity ≈ 1/100,000 → estimate 10 rows → picks an index scan

The trouble starts when multiple conditions are correlated. By default the planner assumes conditions are independent and multiplies their selectivities.

WHERE city = 'Seoul' AND country = 'Korea'

planner assumes: independent → P(Seoul) × P(Korea)
reality: Seoul almost always implies Korea → the two are really one

→ the planner badly underestimates the row count
→ picks an over-optimistic plan (nested loop) → it blows up

To fix this, PostgreSQL lets you build multi-column (extended) statistics with CREATE STATISTICS — it stores the real number of distinct combinations for a correlated column group, breaking the multiplication assumption. When you suspect a misestimate, the first thing to check is the estimated-vs-actual rows gap in EXPLAIN ANALYZE.

Access methods — how to reach the table

Even for a single table there are several ways to fetch rows. The planner picks based on the estimated number of matching rows (selectivity).

Seq Scan — sweep the whole table in order. Cheapest when many rows match (low selectivity) or the table is small, because sequential reads are cheaper per page than an index's random fetches.
Index Scan — find row pointers via the index, then fetch rows from the table. Wins when few rows match (high selectivity); one random page read per match.
Index Only Scan — if every needed column lives in the index (covering), the table is never touched. Among the fastest.
Bitmap Heap Scan — build a bitmap of matching pages via the index, then fetch those pages in near-sequential order. Fits the middle ground: "too many matches for an index scan, too few for a seq scan". Can also combine multiple indexes.

Why the planner deliberately ignores your index: even with an index, if matches are a large fraction of the table (say 30%+), an index scan repeats that many random page fetches. Once that cost exceeds sweeping the table once sequentially, the planner correctly picks a seq scan. "I added an index, why isn't it used?" usually answers to "because using it would be more expensive".

-- Few matches → Index Scan
EXPLAIN SELECT * FROM orders WHERE user_id = 42;
  Index Scan using idx_user on orders  (cost=0.42..8.4 rows=9 ...)

-- Matches half the table → Seq Scan (ignoring the index is correct)
EXPLAIN SELECT * FROM orders WHERE status = 'done';
  Seq Scan on orders  (cost=0..18334 rows=512000 ...)
  Filter: (status = 'done')

Index-scan efficiency also depends on the statistics correlation — when index order matches physical storage order (high correlation), random fetches become nearly sequential and cheaper. For index internals see the how-database-indexes-work guide.

Join algorithms — three weapons

Three standard ways to combine two tables. The planner chooses based on each input's estimated size, whether it is sorted, and the available indexes.

Nested Loop — for each row of the outer table, look up the inner table. Best when the inner side has a good index and the outer side is small. Cost ≈ outer rows × inner lookup. Disastrous if the outer side is large.
Hash Join — build a hash table in memory from the smaller side, then stream the larger side through it to probe. Strong for equality joins between big tables; each side is read once (build + probe). If the hash exceeds memory (work_mem) it spills to disk.
Merge Join — sort both sides on the join key, then zip them together. Wins when both inputs are already sorted (e.g. in index order) or the sort cost is affordable. Strong for large sorted inputs.

Cost intuition (outer N rows, inner M rows):

Nested Loop  ≈ N × (one inner lookup)
             without an inner index → N × M → blows up

Hash Join    ≈ N + M  (build + probe, one scan each)
             default for big × big equality joins

Merge Join   ≈ N + M  + sort cost (0 if already sorted)
             likes sorted inputs / range-friendly

Intuition: small ⋈ big-with-index → nested loop. big ⋈ big equality → hash. already sorted → merge. If a cardinality misestimate makes an input "look small" when it is actually huge, the planner picks a nested loop and it explodes — once again, statistics are the crux.

Reading EXPLAIN / EXPLAIN ANALYZE

EXPLAIN prints only the estimated plan (it does not run the query). EXPLAIN ANALYZE actually runs it and shows estimated vs actual side by side — the core debugging tool.

EXPLAIN ANALYZE
SELECT * FROM orders o
JOIN users u ON u.id = o.user_id
WHERE u.country = 'Korea' AND o.status = 'paid';

QUERY PLAN
─────────────────────────────────────────────────────────────────
Hash Join  (cost=33.50..2150.75 rows=120 width=88)
           (actual time=0.42..58.3 rows=48000 loops=1)
  Hash Cond: (o.user_id = u.id)
  ->  Seq Scan on orders o  (cost=0..1800 rows=50000 width=72)
        (actual time=0.01..21.4 rows=50000 loops=1)
        Filter: (status = 'paid')
  ->  Hash  (cost=20..20 rows=800 width=16)
        ->  Index Scan using idx_country on users u
              (cost=0.3..20 rows=800 ...)
              (actual time=0.03..2.1 rows=790 loops=1)

Planning Time: 0.31 ms
Execution Time: 71.9 ms

How to read it:

Read from the innermost (most-indented) node outward. Above: orders seq scan + users index scan, combined by a hash join.
cost=start..total — cost to the first row .. cost to completion (abstract units). actual time=start..total is real milliseconds.
Most important: rows=120 (estimate) vs actual ... rows=48000. A 400× misestimate. The planner built the plan expecting 120 join rows but got 48,000 — the textbook case of the multiplication assumption collapsing on correlated predicates.
loops=N — this node ran N times (e.g. inside a nested loop). actual time is the per-loop average, so total time ≈ time × loops.
A big cost with small time (or vice versa) signals the estimate diverging from reality. Go further with EXPLAIN (ANALYZE, BUFFERS) to see real page reads.

Field method: find the node with the largest estimated-vs-actual rows gap. That is the source of the misestimate, and very likely the cause of a wrong join choice and the blow-up.

Plan caching / prepared statements

Re-planning the same query on every call is wasteful. Prepared statements reuse a plan. PostgreSQL toggles between two modes:

Custom plan — re-optimize each execution with the actual parameter values. Useful when the best plan depends on the value (one value is rare, another is common).
Generic plan — plan once using average selectivity for the parameters and reuse it, saving planning time. PostgreSQL watches the first few custom-plan costs and, if a generic plan is about as cheap, locks it in.

The trap: on a skewed column, a generic plan can get cached that is fine for rare values but terrible for common ones, then reused. plan_cache_mode can force either behavior.

References

PostgreSQL — Using EXPLAIN — postgresql.org
PostgreSQL — Planner Cost Constants — postgresql.org
PostgreSQL — Row Estimation Examples — postgresql.org
Use The Index, Luke — Markus Winand — use-the-index-luke.com

Practical takeaways

The planner's job = pick the cheapest among equivalent physical plans. Cost is an abstract unit, not time.
After big changes (bulk INSERT / delete / shifted column distribution) always run ANALYZE. Stale stats are a common root of bad plans.
The #1 cause of bad plans is cardinality misestimation. Fix correlated columns with CREATE STATISTICS.
When many rows match, the planner ignoring your index and picking a seq scan is the correct call — random fetches cost more than a sequential sweep.
Join intuition — small ⋈ big-with-index = nested loop, big ⋈ big = hash, already sorted = merge.
In EXPLAIN ANALYZE, find the node with the biggest estimated-vs-actual rows gap — that is the epicenter.
Don't fight the planner blindly (forcing index hints, etc.). Usually the right move is fixing statistics or rewriting the query. Related — how-database-indexes-work, how-sql-parsing-works.