How Filesystems Actually Work

What does open("/etc/passwd") actually do inside the OS? Why does rm finish almost instantly even on a huge file, while a secure delete takes minutes? What exactly does fsync guarantee? The filesystem is a thick layer between your code and the disk. This guide unpacks it.

The Core Concept — inode

Directory entry:  "passwd"  →  inode #1024
inode #1024:
  ├── permissions (rwx)
  ├── owner / group
  ├── size, atime, mtime, ctime
  ├── link count (how many directory entries point here)
  └── data block pointers:
       direct[0] → block 5000
       direct[1] → block 5001
       ...
       direct[11] → block 5011
       indirect  → block (containing 1024 block pointers, assuming 4 KB blocks and 4-byte pointers)
       double_indirect → ...
       triple_indirect → ...

Filenames live only in directories (which are themselves inodes plus data). The inode itself carries no name, only a number. A hard link is simply a second directory entry pointing to the same inode. Unlinking one name just decrements the link count; once it reaches 0 and no process still has the file open, the inode and its blocks are freed.
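
The link-count behaviour can be observed from user space with stat(), link() and unlink(). The following is a minimal sketch, not a definitive tool: the helper names link_count and hard_link_demo and the scratch file name are made up for illustration, and it assumes a POSIX system and a writable directory on a filesystem that supports hard links.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

/* Link count of the inode behind `path`, or -1 if stat() fails. */
long link_count(const char *path)
{
    struct stat st;
    assert(path != NULL);
    return stat(path, &st) == 0 ? (long)st.st_nlink : -1;
}

/* Creates a scratch file in `dir`, adds a second name for it, and checks
 * that both names share one inode. Returns 0 if every check passed. */
int hard_link_demo(const char *dir)
{
    char a[512], b[520];
    struct stat sa, sb;

    assert(dir != NULL);
    snprintf(a, sizeof a, "%s/inode_demo_%ld", dir, (long)getpid());
    snprintf(b, sizeof b, "%s.link", a);

    int fd = open(a, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    int ok = link_count(a) == 1                 /* one name so far */
          && link(a, b) == 0                    /* second directory entry */
          && stat(a, &sa) == 0 && stat(b, &sb) == 0
          && sa.st_dev == sb.st_dev
          && sa.st_ino == sb.st_ino             /* same inode, two names */
          && link_count(a) == 2
          && unlink(b) == 0                     /* drop one name */
          && link_count(a) == 1;                /* inode survives */

    unlink(b);   /* fails harmlessly on success; cleans up after a failed check */
    unlink(a);   /* last name gone: link count reaches 0 */
    return ok ? 0 : -1;
}
```

Calling hard_link_demo("/tmp") should return 0 on a typical Linux system. The same numbers are visible in the shell through ls -li, which prints the inode number and link count for each name.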

block — The Unit of Disk I/O

Filesystems address the disk in fixed-size blocks (typically 4 KB), so even a 1-byte file occupies a full 4 KB block. Large files are sequences of blocks.

100-byte file:
  inode size = 100
  direct[0] = block 5000 (only 100 bytes of 4 KB used, rest wasted)

10 MB file:
  inode size = 10485760
  direct[0..11] = block 5000..5011 (48 KB)
  indirect → block 6000 → [pointer × 1024, 4 bytes each] → next 4 MB
  double_indirect → ... (the remaining ~6 MB)
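
The arithmetic behind these layouts can be sketched in a few lines. This is an illustration under stated assumptions, not any real filesystem's code: it assumes 12 direct slots as in the diagram, takes the number of pointers per indirect block as a parameter, and uses the hypothetical names blocks_needed and pointer_level.

```c
#include <assert.h>
#include <stdint.h>

enum { DIRECT_SLOTS = 12 };   /* direct[0..11], as in the inode diagram */

/* Number of blocks needed to hold `size` bytes, rounding up. */
uint64_t blocks_needed(uint64_t size, uint64_t block_size)
{
    assert(block_size > 0);
    return (size + block_size - 1) / block_size;
}

/* Which pointer tree holds logical block `index`:
 * 0 = direct, 1 = indirect, 2 = double indirect, 3 = triple indirect,
 * -1 = beyond what this layout can address. */
int pointer_level(uint64_t index, uint64_t ptrs_per_block)
{
    uint64_t span = 1;

    assert(ptrs_per_block > 0);
    if (index < DIRECT_SLOTS)
        return 0;
    index -= DIRECT_SLOTS;
    for (int level = 1; level <= 3; level++) {
        span *= ptrs_per_block;   /* blocks reachable through this level */
        if (index < span)
            return level;
        index -= span;
    }
    return -1;
}
```

For the 10 MB file, blocks_needed(10485760, 4096) gives 2560 blocks, and pointer_level on the last block index (2559) reports the double-indirect tree, which matches the layout sketched above.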

open() Lifecycle

int fd = open("/etc/passwd", O_RDONLY);

Internally:
1. Path resolve: "/" → root inode → "etc" → etc inode → "passwd" → passwd inode
   (each directory lookup is an O(N) scan, or O(log N) with an index such as ext4's htree)
2. Permission check
3. Add entry to the process's file descriptor table
   → points to a row in system-wide open-file table
   → which points to the inode
4. Return fd (small integer)

read(fd, buf, 4096):
1. fd → open-file entry → current offset (e.g. 0)
2. inode → block pointers → disk I/O
3. update offset (0 → 4096)
4. copy data to buf
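
The offset bookkeeping in the read() steps can be made visible with lseek(fd, 0, SEEK_CUR), which reports the current offset without moving it. A minimal sketch, assuming a POSIX system; the function name offset_after_read is hypothetical.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens `path`, reads up to `n` bytes into `buf`, and returns the offset the
 * kernel recorded for this descriptor afterwards, or -1 on any error. */
off_t offset_after_read(const char *path, void *buf, size_t n)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_RDONLY);            /* path walk + permission check */
    if (fd < 0)
        return -1;

    off_t before = lseek(fd, 0, SEEK_CUR);    /* a fresh descriptor starts at 0 */
    ssize_t got = read(fd, buf, n);           /* copies data, advances offset */
    off_t after = lseek(fd, 0, SEEK_CUR);
    close(fd);

    if (before != 0 || got < 0 || after != (off_t)got)
        return -1;
    return after;
}
```

The offset lives in the open-file entry, not in the inode, which is why two independent open() calls on the same file each start at 0.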

VFS — Virtual File System

Every Linux filesystem (ext4, btrfs, xfs, NFS, FUSE...) implements the same VFS interface. That's why cat /proc/cpuinfo and cat /etc/passwd go through exactly the same syscalls, even though one file lives on disk and the other is generated in memory.

/dev/sda1 (ext4)  /dev/sda2 (btrfs)  NFS server  procfs (in-memory)
       │                │                │             │
       └────────────────┴────────────────┴─────────────┘
                              │
                          VFS layer
                              │
                       syscalls (open/read/write/close)
                              │
                       application
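
From the application's side, the VFS layer means one piece of code works against any backing filesystem. The sketch below is illustrative only (count_bytes is a made-up helper name): it reads a file to the end using nothing but open/read/close and never asks what kind of filesystem is underneath.

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Reads `path` to end-of-file with plain read() calls and returns the number
 * of bytes seen, or -1 on error. Nothing here depends on the filesystem type. */
long count_bytes(const char *path)
{
    char buf[4096];
    long total = 0;
    ssize_t n;

    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        total += n;
    close(fd);
    return n < 0 ? -1 : total;
}
```

On a typical Linux system, count_bytes("/proc/cpuinfo") returns a positive count even though stat() usually reports a size of 0 for that file, because procfs generates the bytes at read time.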

Why rm Is Fast

Even rm on a 100 GB file returns almost immediately, because the data isn't actually erased; only metadata is updated.

unlink("/foo/bar"):
1. Remove "bar" entry from /foo directory
2. Decrement inode link count
3. If link count = 0 and no process has it open → free inode + return
   blocks to free list
4. The actual disk bytes remain (no overwrite)

Secure delete (shred) overwrites every byte of every block with random data, which is why it is slow.

This is also why file recovery tools can often restore accidentally rm'd files, as long as the freed blocks have not yet been reused by a new file.
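
Step 3 of the unlink sequence (nothing is freed while a process still has the file open) can be checked directly. A hedged sketch, assuming a POSIX system; the function name unlink_while_open and the message text are illustrative only.

```c
#define _POSIX_C_SOURCE 200809L   /* expose pread() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates `path`, unlinks it while the descriptor is still open, and checks
 * that the data stays readable until close(). Returns the link count seen
 * after unlink (expected to be 0), or -1 on any failure. */
long unlink_while_open(const char *path)
{
    const char msg[] = "still here";
    char buf[sizeof msg] = {0};
    struct stat st;

    assert(path != NULL);
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg ||
        unlink(path) != 0 ||                 /* the name disappears at once */
        fstat(fd, &st) != 0 ||               /* inode still reachable via fd */
        pread(fd, buf, sizeof buf, 0) != (ssize_t)sizeof buf ||
        memcmp(buf, msg, sizeof msg) != 0) {
        close(fd);
        unlink(path);
        return -1;
    }
    close(fd);                               /* last reference dropped here */
    return (long)st.st_nlink;
}
```

This is the same mechanism behind the common surprise where df shows no space reclaimed after deleting a large log file: some process still holds it open, so the blocks are not released until that descriptor is closed.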

fsync — Did It Really Reach Disk?

write(fd, buf, 4096);     // page cache only (RAM)
                          // disk hasn't seen it

// crash now → data loss possible.

fsync(fd);                // flush page cache → real disk
                          // when this returns, disk has it.

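Database commits and WAL flushes all depend on fsync. That's why SQLite's synchronous=OFF is fast but dangerous: it skips those fsync calls.

Drive write caches and controllers that acknowledge a flush before the data is durable complicate things, but from the OS's point of view a successful fsync return is the guarantee. Always check the return value: a failed fsync means the data may not be on disk.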
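
Putting the pieces together, a "write it and make sure it is on disk" helper might look like the sketch below. The name durable_write is hypothetical. One step goes beyond the snippet above and is an addition of this sketch: for a newly created file the directory entry lives in the parent directory's data, so the sketch also calls fsync on the directory.

```c
#define _POSIX_C_SOURCE 200809L   /* expose openat() and fsync() under strict -std modes */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/* Writes `len` bytes to dir/name and returns 0 only after both the file and
 * its directory entry have been fsync'ed; returns -1 on any error. */
int durable_write(const char *dir, const char *name, const void *data, size_t len)
{
    assert(dir != NULL && name != NULL && data != NULL);

    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0)
        return -1;

    int rc = -1;
    int fd = openat(dfd, name, O_CREAT | O_TRUNC | O_WRONLY, 0600);
    if (fd >= 0) {
        const char *p = data;
        size_t left = len;
        while (left > 0) {                    /* write() may be partial */
            ssize_t n = write(fd, p, left);
            if (n < 0)
                break;
            p += n;
            left -= (size_t)n;
        }
        /* fsync(fd) covers the file's data and inode; fsync(dfd) covers the
         * new directory entry, which is stored in a different inode. */
        if (left == 0 && fsync(fd) == 0 && fsync(dfd) == 0)
            rc = 0;
        close(fd);
    }
    close(dfd);
    return rc;
}
```

A call such as durable_write("/tmp", "state.bin", buf, len) returns 0 only when every step succeeded; any other result should be treated as "not safely stored". The sketch does not retry on EINTR, which production code would likely want to handle.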

Journaling — Crash Consistency

What if a multi-step write is interrupted by crash?

Legacy ext2: dir entry add + inode update + block alloc interrupted
             → entry exists but link count wrong → fsck takes ages

ext3/4 (journaling):
  1. Write "these changes are coming" to the journal (sequential, fast)
  2. Mark commit in journal
  3. Apply the actual changes in place (checkpoint)

  On post-crash mount:
    - Replay committed entries
    - Ignore uncommitted

  → fsck is short, and metadata stays consistent (file contents are journaled only in data=journal mode).
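
The recovery rule (replay committed entries, ignore uncommitted ones) can be modelled in a few lines. This is a toy simulation, not how ext4's journal is actually laid out: the "disk" is an int array, and the record layout and the names journal_rec and journal_replay are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: these are hypothetical structures, not jbd2's. */
enum rec_type { REC_WRITE, REC_COMMIT };

struct journal_rec {
    enum rec_type type;
    int txn;        /* transaction id */
    size_t block;   /* target block (REC_WRITE only) */
    int value;      /* new contents (REC_WRITE only) */
};

/* Returns 1 if the journal contains a COMMIT record for `txn`. */
static int is_committed(const struct journal_rec *j, size_t n, int txn)
{
    for (size_t i = 0; i < n; i++)
        if (j[i].type == REC_COMMIT && j[i].txn == txn)
            return 1;
    return 0;
}

/* Crash recovery: redo the writes of committed transactions in log order
 * and skip everything else. Returns the number of writes applied. */
size_t journal_replay(const struct journal_rec *j, size_t n,
                      int *disk, size_t nblocks)
{
    size_t applied = 0;

    assert(j != NULL && disk != NULL);
    for (size_t i = 0; i < n; i++) {
        if (j[i].type != REC_WRITE || j[i].block >= nblocks)
            continue;
        if (!is_committed(j, n, j[i].txn))
            continue;                 /* torn transaction: ignore it */
        disk[j[i].block] = j[i].value;
        applied++;
    }
    return applied;
}
```

If the log holds two writes and a commit for transaction 1, followed by a lone write for transaction 2 (the crash hit before its commit), replay applies the first two writes and leaves the block targeted by transaction 2 untouched. Replaying twice gives the same result, which is what makes recovery safe to repeat after a second crash.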

Copy-on-Write — btrfs / ZFS / APFS

Traditional (in-place):
  Modify data at block 5000 → overwrite the same block

Copy-on-Write:
  Copy block 5000 to a new block 6000, modify there
  Atomically update metadata pointer 5000 → 6000

  Pros:
  - Crash mid-operation = metadata sees either old or new — no corruption
  - Snapshots are essentially free (just keep the old pointers)
  - Dedup is natural

  Cons:
  - Fragmentation
  - Free-space accounting is complex
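
Why snapshots are nearly free under copy-on-write becomes clearer with a toy model. The sketch below is an assumption-laden illustration: real CoW filesystems use trees and reference counts, while this one uses a fixed block pool, a bump allocator, and flat pointer tables. All names (volume, vol_init, vol_snapshot, vol_write) are hypothetical.

```c
#include <assert.h>
#include <string.h>

enum { NBLOCKS = 8, BLOCK_SIZE = 16, NPTRS = 2 };

/* Toy volume: a pool of blocks plus tables of pointers into it. */
struct volume {
    char blocks[NBLOCKS][BLOCK_SIZE];
    int next_free;          /* bump allocator; never reuses a block */
    int live[NPTRS];        /* current file: logical index -> physical block */
    int snapshot[NPTRS];    /* frozen copy of the pointers */
};

void vol_init(struct volume *v)
{
    assert(v != NULL);
    memset(v, 0, sizeof *v);
    for (int i = 0; i < NPTRS; i++)
        v->live[i] = v->snapshot[i] = v->next_free++;
}

/* A snapshot copies the pointers, not the data. */
void vol_snapshot(struct volume *v)
{
    assert(v != NULL);
    memcpy(v->snapshot, v->live, sizeof v->live);
}

/* CoW write: copy the old block to a fresh one, modify the copy, then swing
 * the live pointer. Returns the new physical block, or -1 on error. */
int vol_write(struct volume *v, int idx, int off, const void *data, int len)
{
    assert(v != NULL && data != NULL);
    if (idx < 0 || idx >= NPTRS || off < 0 || len < 0 ||
        off + len > BLOCK_SIZE || v->next_free >= NBLOCKS)
        return -1;

    int fresh = v->next_free++;
    memcpy(v->blocks[fresh], v->blocks[v->live[idx]], BLOCK_SIZE);
    memcpy(v->blocks[fresh] + off, data, (size_t)len);
    v->live[idx] = fresh;   /* stands in for the atomic metadata update */
    return fresh;
}
```

After a vol_snapshot followed by a vol_write, the snapshot table still points at the old block with the old contents, while the live table points at the new one. The never-reused allocator also hints at the cons listed above: old blocks pile up until something works out that no snapshot references them.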

Filesystem Comparison

FS     Approach                   Strengths                          Notes
ext4   journaling                 Stable, widely deployed            Default on many Linux distributions
xfs    journaling, B+ tree dirs   Large files, high concurrency      RHEL default
btrfs  CoW                        Snapshots, dedup, RAID             Some RAID modes (RAID 5/6) still unstable
ZFS    CoW                        Data integrity (checksums), scale  Solaris origin; also on FreeBSD and Linux
APFS   CoW                        SSD-optimized, clones              macOS default
NTFS   journaling                 ACLs, alternate data streams       Windows default

Large-Directory Traps

Directory entries in a linear list make a 1M-file lookup O(N).

ext4: htree (hashed B-tree) → O(log N)
xfs: B+ tree → O(log N)
ext2 (legacy): linear list → O(N), ls on 1M files takes minutes

→ Use a modern FS for huge directories (e.g. /var/spool).
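
To find out whether a directory has grown into the range where this matters, it helps to count entries without sorting or stat'ing them, which is where much of the cost of a plain ls goes. A small sketch using opendir/readdir; the helper name count_entries is made up for illustration.

```c
#include <assert.h>
#include <dirent.h>
#include <string.h>

/* Counts the entries in `path`, excluding "." and "..", or returns -1 if the
 * directory cannot be opened. Entries are streamed, not sorted or stat'ed. */
long count_entries(const char *path)
{
    assert(path != NULL);

    DIR *d = opendir(path);
    if (d == NULL)
        return -1;

    long n = 0;
    struct dirent *e;
    while ((e = readdir(d)) != NULL)
        if (strcmp(e->d_name, ".") != 0 && strcmp(e->d_name, "..") != 0)
            n++;
    closedir(d);
    return n;
}
```

This walk is O(N) in the number of entries no matter how the directory is indexed; the htree and B+ tree structures speed up lookups by name, not full listings.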

Common Pitfalls

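Inode exhaustion: blocks are free but every inode is used, so file creation fails with "No space left on device". Check with df -i. Creating millions of small files triggers this.
Skipping fsync: recent writes can be lost on power failure. Databases and other important data must fsync.
O_DIRECT misuse: it bypasses the page cache and requires aligned buffers, offsets, and lengths. Buffered I/O (the default) is faster for most apps; only databases and cache managers that do their own caching really need O_DIRECT.
mmap SIGBUS: if a mmapped file is truncated, touching pages beyond the new end of file raises SIGBUS. Don't truncate a file while it is mapped, or unmap and remap after the size changes.
Metadata overhead of many small files: 10K files means 10K directory entries, 10K inodes, and 10K block allocations. Bundling them into a single tar archive is much faster to write and copy.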
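
The inode numbers that df -i prints are also available programmatically through statvfs(). A minimal sketch, assuming a POSIX system; inode_use_percent is a hypothetical helper name, and filesystems that allocate inodes dynamically may report a total of 0, which the sketch treats as "unknown".

```c
#include <assert.h>
#include <stddef.h>
#include <sys/statvfs.h>

/* Percentage of inodes in use on the filesystem holding `path` (0..100),
 * or -1 if statvfs() fails or the filesystem reports no fixed inode total. */
int inode_use_percent(const char *path)
{
    struct statvfs s;

    assert(path != NULL);
    if (statvfs(path, &s) != 0 || s.f_files == 0)
        return -1;

    /* f_files = total inodes, f_ffree = free inodes */
    unsigned long long used = (unsigned long long)s.f_files - s.f_ffree;
    return (int)(used * 100 / s.f_files);
}
```

A monitoring job could call inode_use_percent("/var") and alert well before the value approaches 100, since block-based disk usage alarms will not fire in this situation.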

Wrap-up

Filesystems are more than "data storage": they juggle concurrency, crash consistency, efficient lookup, and safe permissions all at once. Understanding this layer lets you answer questions like "why is our backup so slow?" and "why did SQLite suddenly speed up?"