Skip to content
yutils

How Database Indexes Actually Work

B-tree, hash indexes, covering indexes, why LIKE '%foo' can't use the index, why composite-index column order matters, and how EXPLAIN reveals when the planner is actually using your index.

~9 min read

The same query takes 0.5 ms on one column and 5,000 ms on another. The difference is the index. But "just add an index" isn't a silver bullet — which index for which query? Why can't LIKE '%foo' use an index? Why does the column order of a composite index matter? This guide walks through B-tree internals, how the query planner picks an index, and the common anti-patterns.

Why indexes exist

-- users table, 1 million rows
SELECT * FROM users WHERE email = 'alice@example.com';

WITHOUT index:
- Full table scan (1M rows) ≈ 1-5 seconds

WITH index on email:
- B-tree lookup → row pointer → fetch row ≈ < 1 ms

An index is a book index. Don't read every page — jump straight to the entry.

B-tree — the standard index structure

Balanced tree. Default in PostgreSQL / MySQL / SQLite / Oracle:

              [30, 70]
             /    |    \
          [10, 20] [40, 50, 60] [80, 90, 100]
           ↓          ↓               ↓
       leaf node   leaf node      leaf node
       (pointers to actual rows)

search "55":
1. root [30, 70] → 30 < 55 < 70 → middle child
2. [40, 50, 60] → 50 < 55 < 60 → pointer after 60
3. leaf → pointer to the row with 55

For 1 million rows, B-tree depth ≈ 3 page reads. Very fast.

B-tree's superpower — range queries are also efficient. Leaves form a sorted linked list, so WHERE age BETWEEN 20 AND 30 finds the first match and reads sequentially.

Hash index — equality only

  • Equality (WHERE id = 42) — O(1)
  • No range, sorting, or partial matches — hashes aren't ordered
  • PostgreSQL hash indexes got WAL support in 9.6; older versions were unreliable
  • Most systems — B-tree default; hash is niche

Why LIKE '%foo' can't use an index

-- index on name
SELECT * FROM users WHERE name LIKE 'foo%';   ✓ uses the index
SELECT * FROM users WHERE name LIKE '%foo';   ✗ full scan

Why:
B-tree lookups are prefix-based. 'foo%' matches names starting with
'foo' — alphabetically contiguous, so a range scan from 'foo' to
'foo~' works.

'%foo' has no fixed prefix — every value must be inspected.

Workarounds — full-text search (PostgreSQL tsvector + GIN, MySQL FULLTEXT) or store a reversed string with an index, then query LIKE 'oof%'.

Composite index — column order is everything

CREATE INDEX idx ON users (last_name, first_name, age);

Works for these queries:
✓ WHERE last_name = 'Kim'                                  (prefix)
✓ WHERE last_name = 'Kim' AND first_name = 'Sumi'          (prefix)
✓ WHERE last_name = 'Kim' AND first_name = 'Sumi' AND age = 30  (full)
✓ WHERE last_name = 'Kim' AND age = 30                     (range scan + filter)

Doesn't help these:
✗ WHERE first_name = 'Sumi'                  (no prefix)
✗ WHERE age = 30                              (no prefix)
✗ WHERE first_name = 'Sumi' AND age = 30      (no prefix)

Composite indexes follow the "left prefix" rule. Skip the first column and the index is useless. Put the most selective + most often filtered column first.

Practical example:

# Common query
SELECT * FROM orders WHERE user_id = 42 AND status = 'pending';

# Good index:
CREATE INDEX idx ON orders (user_id, status);

# Bad index (user_id is more selective):
CREATE INDEX idx ON orders (status, user_id);

Covering index — answer from the index alone

-- Regular index
CREATE INDEX idx_email ON users (email);

SELECT name FROM users WHERE email = 'a@b.c';
↓
1. idx_email matches email → row pointer
2. Fetch the row from the table → extract name
Two steps ─ two I/Os

-- Covering index
CREATE INDEX idx_email_with_name ON users (email) INCLUDE (name);
-- (PostgreSQL INCLUDE, or composite (email, name))

SELECT name FROM users WHERE email = 'a@b.c';
↓
1. Index matches email → name is right there → return immediately
One step ─ no table access

Very fast, but the index grows. Worth it only for hot queries.

The cost of indexes

  • Storage — 1M rows + 4 indexes = extra GBs
  • Write amplification — INSERT / UPDATE / DELETE must update every related index. One row write = N+1 disk writes.
  • VACUUM / OPTIMIZE — time to defragment
  • Planner overhead — too many indexes slow the planner's choice

Trade-off — write-heavy tables: fewer indexes. Read-heavy tables: more.

Query planner — choosing an index

EXPLAIN ANALYZE
SELECT * FROM orders
WHERE user_id = 42 AND created_at > NOW() - INTERVAL '7 days';

QUERY PLAN
─────────────────────────────────────────────────────────────────
Index Scan using idx_user_created on orders
  (cost=0.43..8.45 rows=1 width=72)
  (actual time=0.025..0.027 rows=3 loops=1)
  Index Cond: (user_id = 42 AND created_at > now() - '7 days'::interval)

cost = planner's estimate
actual time = what really happened

The planner uses statistics (row counts, distinct values, distribution) to pick the cheapest plan:

  • Seq Scan — full table. Small tables or many matching rows.
  • Index Scan — index → row pointer → row. Moderate matches.
  • Index Only Scan — covering index. No row fetch.
  • Bitmap Index Scan — combine multiple indexes.

Run ANALYZE table_name after big inserts (or rely on autovacuum).

Patterns that defeat indexes

1. Function on the indexed column

# Bad — wrapping the column kills the index
SELECT * FROM users WHERE LOWER(email) = 'alice@example.com';

# Good — lowercase on the input
SELECT * FROM users WHERE email = LOWER('Alice@Example.com');

# Or functional index
CREATE INDEX idx_email_lower ON users (LOWER(email));

2. Implicit type conversion

# id column is VARCHAR
SELECT * FROM users WHERE id = 42;   ← 42 is a number; needs conversion
                                         → may bypass the index

SELECT * FROM users WHERE id = '42';  ← string, uses the index

3. OR across columns

SELECT * FROM users WHERE email = '...' OR phone = '...';

Even with indexes on both:
- Old planners: full scan (OR is hard)
- Modern planners: Bitmap Index Scan unions both indexes

Prefer UNION when clarity matters:
SELECT * FROM users WHERE email = '...'
UNION
SELECT * FROM users WHERE phone = '...';

4. NOT IN / !=

Negation is bad for indexes — most rows match → full scan. When possible, rewrite as a positive filter.

5. Small tables

At 100 rows the difference between index and scan is negligible. Don't force-index everything — the planner might pick scan and be right.

Beyond B-tree

  • GIN (Generalized Inverted Index, PostgreSQL) — full-text, JSONB keys, arrays
  • GiST (Generalized Search Tree, PostgreSQL) — geometric (PostGIS), tsvector
  • BRIN (Block Range, PostgreSQL) — append-only huge tables. Tiny, lossy
  • Bitmap (Oracle, data warehouses) — low distinct values (gender, status)
  • Spatial / R-tree — geographic

Unique indexes and constraints

CREATE UNIQUE INDEX idx_email ON users (email);
ALTER TABLE users ADD CONSTRAINT users_email_key UNIQUE (email);

Nearly the same — adding a UNIQUE constraint creates a unique index.
Constraints declare schema intent; indexes implement it.

Primary key = unique index + NOT NULL.

Partial indexes — condition-aware

-- Most queries touch active users
CREATE INDEX idx_active_email ON users (email) WHERE status = 'active';

Effects:
1. Smaller index (only active users)
2. Faster matching queries (smaller B-tree)
3. Used when WHERE status = 'active' is part of the query

Monitoring indexes

# PostgreSQL — find unused indexes
SELECT schemaname, indexname, idx_scan
FROM pg_stat_user_indexes
WHERE idx_scan = 0;

# MySQL
SELECT * FROM sys.schema_unused_indexes;

# Drop indexes that get no scans — they only cost on writes.

Common pitfalls

1. Indexing every column

"Just in case" → write overhead. Check pg_stat_user_indexes and drop unused.

2. Composite-index column order

Most selective + most-filtered column first; equality before range:

# Preferred — equality first, range second
INDEX (user_id, created_at)
WHERE user_id = 42 AND created_at > '...'

# Worse — range first
INDEX (created_at, user_id)
WHERE user_id = 42 AND created_at > '...'
↑ Range scan on created_at then filter user_id — slower

3. Stale statistics

Big inserts without ANALYZE → planner picks a bad plan. Tune autovacuum.

4. UUID primary keys cause fragmentation

UUID v4 is random → every insert hits a different B-tree page → splits. Use UUID v7 (timestamp prefix) or ULID for sequential inserts.

5. NULL behavior varies

PostgreSQL and MySQL InnoDB index NULL. Oracle excludes NULL from B-tree by default. Test WHERE col IS NULL per engine.

References

Summary

  • Indexes are book indexes — direct lookups instead of full scans. B-tree is the default.
  • B-tree handles equality + range. Hash only equality.
  • LIKE 'foo%' uses the index; LIKE '%foo' doesn't. Use full-text or reverse- string indexes for the latter.
  • Composite indexes follow the left-prefix rule. Selective + equality columns first.
  • Covering indexes skip the table fetch. Use for hot queries.
  • Index cost = storage + write amplification + VACUUM + planner overhead.
  • Always EXPLAIN. Refresh statistics with ANALYZE / autovacuum.
  • Specialized indexes — GIN (full-text), GiST (geometric), BRIN (huge log tables), partial (condition-aware).
Back to guides