Skip to content
yutils

How Database Storage Engines Actually Work

Pages, the buffer pool, WAL and fsync, B-tree vs LSM-tree engines, compaction, and the write/read/space amplification trade-offs that decide which engine fits.

~10 min read

Give two databases the same data and one absorbs hundreds of thousands of writes per second while the other makes the disk scream under the same load. The difference is the storage engine — how it lays rows on disk and gets them back. Engines split into two camps: those that edit in place, the B-tree engines (Postgres, InnoDB, SQLite), and those that append first and tidy up later, the LSM-tree engines (RocksDB, Cassandra, LevelDB). This guide walks from pages, WAL, and the buffer pool through the read/write/space amplification trade-offs that decide which engine fits your workload.

It all starts with pages

Databases don't write rows one at a time. They read and write fixed-size blocks called pages, because the page is the smallest unit of disk I/O:

PostgreSQL  : 8KB page (default)
MySQL InnoDB: 16KB page
SQLite      : 4KB page (default)
SQL Server  : 8KB page

heap file = an array of pages
┌────────┬────────┬────────┬────────┐
│ page 0 │ page 1 │ page 2 │ page 3 │ ...
└────────┴────────┴────────┴────────┘
  ↑ each page holds many rows + a page header + free space

Even if a row is 100 bytes, reading it means reading the entire 8KB page it lives in. So "read one row" is really "read one page." Which rows share a page decides performance (related: how-database-indexes-work, where a covering index exploits exactly this page locality).

The buffer pool — disk I/O dominates

A memory read is ~100ns, an SSD random read is ~100µs (1,000×), an HDD seek is ~10ms (100,000×). So the database's first goal is to not touch the disk:

SELECT * FROM users WHERE id = 42;

1. Is the page holding id=42 already in the buffer pool (memory)?
   └ yes (cache hit)  → return from memory immediately (~100ns)
   └ no  (cache miss) → read page from disk → load into buffer pool (~100µs)

buffer pool = cache of recently used pages in memory (LRU-ish policy)
- PostgreSQL: shared_buffers
- MySQL InnoDB: innodb_buffer_pool_size (often 50-75% of RAM)

This is why tuning an OLTP database starts with checking that the buffer pool hit rate clears 99%. A 1% miss pulls in a disk access that is 1,000× slower. The OS page cache (related: how-filesystems-work) sits one layer below that.

The write path and durability — WAL

The problem: writing straight to a page's home location on disk is slow (random I/O), and a crash mid-write leaves a half-written page (a torn page). The fix is the Write-Ahead Log (WAL, a.k.a. redo log):

UPDATE accounts SET balance = balance - 100 WHERE id = 7;

1. Append the change to the WAL (sequential write at the log's end — fast)
2. fsync the WAL → guarantees "this change is durably recorded"
3. Reply COMMIT OK to the client  ← durable at this point
4. Modify the real data page in the buffer pool, in memory only (dirty page)
5. Later, a checkpoint flushes dirty pages to their home on disk

Key idea: in-place disk writes happen lazily, in batches. Durability
is the job of the WAL's sequential append + fsync.

Why it's fast — instead of random writes (editing a page in place) it does sequential writes (appending to the end of the log). Disks are far faster at sequential than random writes (dramatically on HDDs, and still on SSDs).

Why it's durablefsync flushes the OS buffers so the data has physically reached the disk. After a crash, restart replays the WAL from the last checkpoint, recovering changes that committed but hadn't been written to their home location yet.

Checkpoints and fsync

  • Checkpoint — periodically (by time or WAL size) flush dirty pages to their home in the data file. WAL records older than that point can be discarded (no longer needed for recovery).
  • The cost of fsync — fsync is a disk round trip, so it's expensive. Engines batch many transactions into a single fsync — group commit — to raise throughput.
  • Relaxing durability — Postgres synchronous_commit = off or MySQL innodb_flush_log_at_trx_commit = 2 loosen the fsync discipline for speed, at the risk of losing the last few ms of commits on a crash. It's a trade-off.

B-tree storage engines — update in place

The approach in Postgres, MySQL InnoDB, and SQLite. Keep the data (or a clustered index) as sorted B-tree pages, and write by modifying that page in place (update-in-place):

INSERT INTO users (id, ...) VALUES (55, ...);

           [30 | 70]            ← internal page
          /    |    \
    [10|20] [40|50|60] [80|90]  ← leaf pages (sorted rows)
                ↑
   55 belongs in leaf [40|50|60] → read that page → insert 55

What if the page is full?
[40|50|55|60] can't fit more → PAGE SPLIT
split into [40|50] and [55|60] + add a new pointer in the parent page

Page splits and write amplification — a single insert that triggers a split rewrites several pages. Worse, if the sort order is random (e.g. a UUID v4 primary key) splits happen constantly and the disk fragments (related: how-database-indexes-work, the UUID fragmentation section).

  • Strengths — staying sorted makes range scans (BETWEEN, ORDER BY) efficient, and a read usually finishes in one tree traversal + page fetch (low read amplification). The standard for read-heavy OLTP.
  • Weaknesses — writes are random I/O plus page splits, which hurts under very write-heavy ingest.

LSM-tree storage engines — append then tidy

The approach in RocksDB, Cassandra, LevelDB, ScyllaDB, and HBase. Give up in-place edits: buffer all writes in memory, dump them to sorted files, and merge those files in the background:

write:
1. Append to the WAL (durability) + insert into the memtable
   (in memory, a sorted structure)
2. When the memtable hits a threshold (say 64MB) → flush to disk as an SSTable
   (SSTable = Sorted String Table, a sorted immutable file)
3. Start a fresh memtable and keep accepting writes

The disk accumulates many SSTables:
[memtable] (in memory, newest)
   ↓ flush
[SSTable-4] [SSTable-3] [SSTable-2] [SSTable-1]  (oldest)

Key idea: every write is an in-memory insert + a sequential append.
No in-place edits, no random I/O → write throughput is overwhelming.

updates and deletes are also appends:
- UPDATE = append a new version of the same key (the old one just lingers)
- DELETE = append a "tombstone" marker

The LSM read path — memtable + SSTables + bloom filters

Cheap writes come at the cost of expensive reads. The same key can live in the memtable and in several SSTables, so finding the latest version means looking in many places:

GET key="user:42"

1. Check the memtable (newest) → return if found
2. Otherwise search SSTables newest → oldest
   SSTable-4 → SSTable-3 → SSTable-2 → ...

Problem: with 10 SSTables, does one read mean 10 disk accesses?
→ read amplification

Fix: BLOOM FILTERS
- a small in-memory bloom filter per SSTable
- quickly answers "this key is definitely NOT in this SSTable"
  (false positives only, never false negatives)
- a "no" lets the engine skip that SSTable's disk read entirely
→ removes most of the unnecessary disk accesses

Still, if SSTables pile up forever, reads slow down and the disk fills with garbage (tombstones for deleted rows, stale versions of updated ones). That's why you need compaction.

Compaction — the heart of an LSM

A background thread reads several SSTables, merges them while dropping old versions and tombstones of the same key, and rewrites them as fewer, larger SSTables. Two merge strategies dominate:

SIZE-TIERED (Cassandra default, write-friendly)
- when N SSTables of similar size accumulate, merge them into one bigger one
- low write/compaction overhead
- downside: the same key spans many tiers → high read & space amplification

LEVELED (RocksDB/LevelDB default, read-friendly)
- levels L0, L1, L2 ... each ~10× the size of the previous
- within a level, SSTables have non-overlapping key ranges
  → a key exists in at most one SSTable per level
- low read amplification (you only check a few levels)
- downside: frequent, wide merges → high write amplification

Compaction isn't free — it burns disk I/O and CPU, and if mistuned it competes with foreground writes for resources (a write stall). Half of operating an LSM is tuning compaction.

The three amplifications

Three axes for judging a storage engine. You can't minimize all three at once — shrinking one grows another:

  • Write amplification — how many bytes a logical 1-byte write actually causes to hit the disk. B-tree: page splits + rewriting pages in place. LSM: compaction rewrites the same data at each level (especially leveled).
  • Read amplification — how many disk pages one logical read touches. B-tree: usually low (tree depth ≈ 3). LSM: higher, because it checks several SSTables plus bloom filters.
  • Space amplification — how much disk the data occupies versus its logical size. B-tree: low-ish (but watch fragmentation / fill factor). LSM: can be higher because old versions and tombstones linger until compaction (size-tiered especially).
Rough tendencies (not absolutes):

                write amp   read amp   space amp
B-tree           medium       low        low
LSM (leveled)    high         medium     medium
LSM (size-tiered) low         high       high

Which engine to pick

  • Write-heavy ingest → LSM — time series, logs, IoT, message/event ingestion, bulk INSERTs. Sequential appends give overwhelming write throughput. Cassandra / RocksDB / ScyllaDB.
  • Read-heavy + range-query OLTP → B-tree — ordinary web/app backends, transactions, queries heavy on ORDER BY / BETWEEN, anywhere low read latency matters. PostgreSQL / MySQL InnoDB.
  • Need both — MySQL can swap InnoDB for MyRocks (RocksDB underneath), and Postgres setups split workloads instead. RocksDB itself is an embeddable storage engine powering many databases under the hood.

Contrast: column-oriented storage (OLAP)

Everything so far stored data by row — row-oriented (OLTP) engines. Analytics (OLAP) does the opposite and stores by column:

row store (all columns of one row together):
[id=1,name=Kim,age=30][id=2,name=Lee,age=25] ...
→ "all fields of this user" sit on one page. Good for OLTP.

column store (all values of one column together):
id:   [1, 2, 3, ...]
name: [Kim, Lee, Park, ...]
age:  [30, 25, 41, ...]
→ "average age across all users" reads only the age column.
   Unneeded columns never touch the disk + same type compresses well.

e.g. ClickHouse, DuckDB, Parquet, Redshift, BigQuery, Snowflake

For analytical queries that scan a few columns over millions of rows and aggregate, a column store crushes a row store. Conversely, for OLTP that reads and writes a whole row's fields, the row store wins. The workload picks the layout.

Practical takeaways

  • The unit of disk I/O is the page (usually 4-16KB), not the row. Buffer pool hit rate dominates OLTP performance.
  • The WAL delivers durability and fast writes at once — sequential append + fsync before commit, with lazy in-place flushing left to the checkpoint.
  • B-tree = update in place, low read/space amp, strong range scans → read-heavy OLTP (Postgres / InnoDB).
  • LSM-tree = memtable → SSTable → compaction, low write amp, high write throughput → write-heavy ingest (RocksDB / Cassandra).
  • LSM reads use bloom filters to skip unnecessary SSTable accesses. Compaction is leveled (read-friendly) vs size-tiered (write-friendly).
  • Write / read / space amplification can't all be minimized — choosing an engine means matching that trade-off to your workload.
  • Analytics (OLAP) wants a column store — it reads only the needed columns and compresses well. Different purpose than a row store (OLTP).
Back to guides