The same query takes 0.5 ms on one column and 5,000 ms on another. The difference is the index. But "just add an index" isn't a silver bullet — which index for which query? Why can't LIKE '%foo' use an index? Why does the column order of a composite index matter? This guide walks through B-tree internals, how the query planner picks an index, and the common anti-patterns.
Why indexes exist
-- users table, 1 million rows
SELECT * FROM users WHERE email = 'alice@example.com';
WITHOUT index:
- Full table scan (1M rows) ≈ 1-5 seconds
WITH index on email:
- B-tree lookup → row pointer → fetch row ≈ < 1 msAn index is a book index. Don't read every page — jump straight to the entry.
B-tree — the standard index structure
Balanced tree. Default in PostgreSQL / MySQL / SQLite / Oracle:
[30, 70]
/ | \
[10, 20] [40, 50, 60] [80, 90, 100]
↓ ↓ ↓
leaf node leaf node leaf node
(pointers to actual rows)
search "55":
1. root [30, 70] → 30 < 55 < 70 → middle child
2. [40, 50, 60] → 50 < 55 < 60 → pointer after 60
3. leaf → pointer to the row with 55For 1 million rows, B-tree depth ≈ 3 page reads. Very fast.
B-tree's superpower — range queries are also efficient. Leaves form a sorted linked list, so WHERE age BETWEEN 20 AND 30 finds the first match and reads sequentially.
Hash index — equality only
- Equality (
WHERE id = 42) — O(1) - No range, sorting, or partial matches — hashes aren't ordered
- PostgreSQL hash indexes got WAL support in 9.6; older versions were unreliable
- Most systems — B-tree default; hash is niche
Why LIKE '%foo' can't use an index
-- index on name
SELECT * FROM users WHERE name LIKE 'foo%'; ✓ uses the index
SELECT * FROM users WHERE name LIKE '%foo'; ✗ full scan
Why:
B-tree lookups are prefix-based. 'foo%' matches names starting with
'foo' — alphabetically contiguous, so a range scan from 'foo' to
'foo~' works.
'%foo' has no fixed prefix — every value must be inspected.Workarounds — full-text search (PostgreSQL tsvector + GIN, MySQL FULLTEXT) or store a reversed string with an index, then query LIKE 'oof%'.
Composite index — column order is everything
CREATE INDEX idx ON users (last_name, first_name, age);
Works for these queries:
✓ WHERE last_name = 'Kim' (prefix)
✓ WHERE last_name = 'Kim' AND first_name = 'Sumi' (prefix)
✓ WHERE last_name = 'Kim' AND first_name = 'Sumi' AND age = 30 (full)
✓ WHERE last_name = 'Kim' AND age = 30 (range scan + filter)
Doesn't help these:
✗ WHERE first_name = 'Sumi' (no prefix)
✗ WHERE age = 30 (no prefix)
✗ WHERE first_name = 'Sumi' AND age = 30 (no prefix)Composite indexes follow the "left prefix" rule. Skip the first column and the index is useless. Put the most selective + most often filtered column first.
Practical example:
# Common query
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';
# Good index:
CREATE INDEX idx ON orders (user_id, status);
# Bad index (user_id is more selective):
CREATE INDEX idx ON orders (status, user_id);Covering index — answer from the index alone
-- Regular index
CREATE INDEX idx_email ON users (email);
SELECT name FROM users WHERE email = 'a@b.c';
↓
1. idx_email matches email → row pointer
2. Fetch the row from the table → extract name
Two steps ─ two I/Os
-- Covering index
CREATE INDEX idx_email_with_name ON users (email) INCLUDE (name);
-- (PostgreSQL INCLUDE, or composite (email, name))
SELECT name FROM users WHERE email = 'a@b.c';
↓
1. Index matches email → name is right there → return immediately
One step ─ no table accessVery fast, but the index grows. Worth it only for hot queries.
The cost of indexes
- Storage — 1M rows + 4 indexes = extra GBs
- Write amplification — INSERT / UPDATE / DELETE must update every related index. One row write = N+1 disk writes.
- VACUUM / OPTIMIZE — time to defragment
- Planner overhead — too many indexes slow the planner's choice
Trade-off — write-heavy tables: fewer indexes. Read-heavy tables: more.
Query planner — choosing an index
EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 42 AND created_at > NOW() - INTERVAL '7 days';
QUERY PLAN
─────────────────────────────────────────────────────────────────
Index Scan using idx_user_created on orders
(cost=0.43..8.45 rows=1 width=72)
(actual time=0.025..0.027 rows=3 loops=1)
Index Cond: (user_id = 42 AND created_at > now() - '7 days'::interval)
cost = planner's estimate
actual time = what really happenedThe planner uses statistics (row counts, distinct values, distribution) to pick the cheapest plan:
- Seq Scan — full table. Small tables or many matching rows.
- Index Scan — index → row pointer → row. Moderate matches.
- Index Only Scan — covering index. No row fetch.
- Bitmap Index Scan — combine multiple indexes.
Run ANALYZE table_name after big inserts (or rely on autovacuum).
Patterns that defeat indexes
1. Function on the indexed column
# Bad — wrapping the column kills the index
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';
# Good — lowercase on the input
SELECT * FROM users WHERE email = LOWER('Alice@Example.com');
# Or functional index
CREATE INDEX idx_email_lower ON users (LOWER(email));2. Implicit type conversion
# id column is VARCHAR
SELECT * FROM users WHERE id = 42; ← 42 is a number; needs conversion
→ may bypass the index
SELECT * FROM users WHERE id = '42'; ← string, uses the index3. OR across columns
SELECT * FROM users WHERE email = '...' OR phone = '...';
Even with indexes on both:
- Old planners: full scan (OR is hard)
- Modern planners: Bitmap Index Scan unions both indexes
Prefer UNION when clarity matters:
SELECT * FROM users WHERE email = '...'
UNION
SELECT * FROM users WHERE phone = '...';4. NOT IN / !=
Negation is bad for indexes — most rows match → full scan. When possible, rewrite as a positive filter.
5. Small tables
At 100 rows the difference between index and scan is negligible. Don't force-index everything — the planner might pick scan and be right.
Beyond B-tree
- GIN (Generalized Inverted Index, PostgreSQL) — full-text, JSONB keys, arrays
- GiST (Generalized Search Tree, PostgreSQL) — geometric (PostGIS), tsvector
- BRIN (Block Range, PostgreSQL) — append-only huge tables. Tiny, lossy
- Bitmap (Oracle, data warehouses) — low distinct values (gender, status)
- Spatial / R-tree — geographic
Unique indexes and constraints
CREATE UNIQUE INDEX idx_email ON users (email);
ALTER TABLE users ADD CONSTRAINT users_email_key UNIQUE (email);
Nearly the same — adding a UNIQUE constraint creates a unique index.
Constraints declare schema intent; indexes implement it.
Primary key = unique index + NOT NULL.Partial indexes — condition-aware
-- Most queries touch active users
CREATE INDEX idx_active_email ON users (email) WHERE status = 'active';
Effects:
1. Smaller index (only active users)
2. Faster matching queries (smaller B-tree)
3. Used when WHERE status = 'active' is part of the queryMonitoring indexes
# PostgreSQL — find unused indexes
SELECT schemaname, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;
# MySQL
SELECT * FROM sys.schema_unused_indexes;
# Drop indexes that get no scans — they only cost on writes.Common pitfalls
1. Indexing every column
"Just in case" → write overhead. Check pg_stat_user_indexes and drop unused.
2. Composite-index column order
Most selective + most-filtered column first; equality before range:
# Preferred — equality first, range second
INDEX (user_id, created_at)
WHERE user_id = 42 AND created_at > '...'
# Worse — range first
INDEX (created_at, user_id)
WHERE user_id = 42 AND created_at > '...'
↑ Range scan on created_at then filter user_id — slower3. Stale statistics
Big inserts without ANALYZE → planner picks a bad plan. Tune autovacuum.
4. UUID primary keys cause fragmentation
UUID v4 is random → every insert hits a different B-tree page → splits. Use UUID v7 (timestamp prefix) or ULID for sequential inserts.
5. NULL behavior varies
PostgreSQL and MySQL InnoDB index NULL. Oracle excludes NULL from B-tree by default. Test WHERE col IS NULL per engine.
References
- Use The Index, Luke — Markus Winand — use-the-index-luke.com
- PostgreSQL — Indexes — postgresql.org
- MySQL 8 — Optimization and Indexes — mysql.com
- B-tree visualization — USF CS
Summary
- Indexes are book indexes — direct lookups instead of full scans. B-tree is the default.
- B-tree handles equality + range. Hash only equality.
LIKE 'foo%'uses the index;LIKE '%foo'doesn't. Use full-text or reverse- string indexes for the latter.- Composite indexes follow the left-prefix rule. Selective + equality columns first.
- Covering indexes skip the table fetch. Use for hot queries.
- Index cost = storage + write amplification + VACUUM + planner overhead.
- Always EXPLAIN. Refresh statistics with ANALYZE / autovacuum.
- Specialized indexes — GIN (full-text), GiST (geometric), BRIN (huge log tables), partial (condition-aware).