What does open("/etc/passwd") actually call in the OS? Why does rm finish instantly even on a huge file, while secure delete takes minutes? What exactly does fsync guarantee? Filesystems are a thick layer between your code and the disk. This guide unpacks it.
The Core Concept — inode
Directory entry: "passwd" → inode #1024

inode #1024:
├── permissions (rwx)
├── owner / group
├── size, atime, mtime, ctime
├── link count (how many directory entries point here)
└── data block pointers:
    direct[0] → block 5000
    direct[1] → block 5001
    ...
    direct[11] → block 5011
    indirect → block holding block pointers (1024 of them with 4 KB blocks and 4-byte pointers)
    double_indirect → ...
    triple_indirect → ...

Filenames live only in directories (which are themselves inodes plus data). The inode stores no name; it is identified only by its number. A hard link is two directory entries pointing to the same inode. Unlinking one of them just decrements the link count; when the count reaches 0 and no process still has the file open, the inode is freed.
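The link-count behaviour can be observed from user space. The following is a minimal sketch, assuming a POSIX system and a directory whose filesystem supports hard links; the function name hard_link_demo and the scratch file names are hypothetical, chosen only for illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a scratch file in `dir`, gives it a second name with link(),
 * and reports what stat() sees. Returns 0 on success, -1 on any error.
 * *same_inode is 1 if both names resolve to the same inode;
 * *links_before is st_nlink while two names exist;
 * *links_after is st_nlink after one name was unlinked. */
int hard_link_demo(const char *dir, int *same_inode,
                   long *links_before, long *links_after)
{
    assert(dir && same_inode && links_before && links_after);
    char a[4096], b[4096 + 8];
    struct stat sa, sb;

    snprintf(a, sizeof a, "%s/inode-demo-XXXXXX", dir);
    int fd = mkstemp(a);                 /* new file: one directory entry */
    if (fd < 0) return -1;
    close(fd);

    snprintf(b, sizeof b, "%s.link", a);
    if (link(a, b) != 0) { unlink(a); return -1; }  /* second directory entry */

    if (stat(a, &sa) != 0 || stat(b, &sb) != 0) {
        unlink(a); unlink(b);
        return -1;
    }
    *same_inode = (sa.st_ino == sb.st_ino && sa.st_dev == sb.st_dev);
    *links_before = (long)sa.st_nlink;

    unlink(a);                           /* drop one name; the inode survives */
    if (stat(b, &sb) != 0) { unlink(b); return -1; }
    *links_after = (long)sb.st_nlink;

    unlink(b);                           /* last name removed */
    return 0;
}
```

Called on a directory such as /tmp, this should typically report the same inode for both names, a link count of 2 while both exist, and 1 after the first unlink.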
block — The Unit of Disk I/O
Filesystems treat the disk in block units (typically 4 KB). A 1-byte file occupies 4 KB. Large files are sequences of blocks.
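The block arithmetic is simple enough to write down. This sketch assumes the classic layout from the inode diagram (12 direct pointers, then single, double, and triple indirect blocks); the helper names and the ptrs_per_block parameter are illustrative assumptions, not part of any real filesystem API.

```c
#include <assert.h>
#include <stdint.h>

#define N_DIRECT 12  /* direct pointers per inode, as in the inode diagram */

/* Number of whole blocks a file of `size` bytes occupies (ceiling division). */
uint64_t blocks_needed(uint64_t size, uint64_t block_size)
{
    assert(block_size > 0);
    return (size + block_size - 1) / block_size;
}

/* Bytes allocated but unused in the file's last block. */
uint64_t slack_bytes(uint64_t size, uint64_t block_size)
{
    return blocks_needed(size, block_size) * block_size - size;
}

/* Which pointer level covers the byte at `offset`:
 * 0 = direct, 1 = single indirect, 2 = double indirect, 3 = triple indirect
 * (offsets beyond the triple-indirect range are also reported as 3). */
int indirection_level(uint64_t offset, uint64_t block_size,
                      uint64_t ptrs_per_block)
{
    assert(block_size > 0 && ptrs_per_block > 0);
    uint64_t idx = offset / block_size;   /* logical block index in the file */
    if (idx < N_DIRECT) return 0;
    idx -= N_DIRECT;
    if (idx < ptrs_per_block) return 1;
    idx -= ptrs_per_block;
    if (idx < ptrs_per_block * ptrs_per_block) return 2;
    return 3;
}
```

With 4 KB blocks, slack_bytes(100, 4096) is 3996, and a 10 MB file needs blocks_needed(10485760, 4096) = 2560 blocks.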
100-byte file:
    inode size = 100
    direct[0] = block 5000 (only 100 bytes of the 4 KB used, rest wasted)

10 MB file:
    inode size = 10485760
    direct[0..11] = block 5000..5011 (48 KB)
    indirect → block 6000 → [pointer × 1024, one 4 KB block of 4-byte pointers] → 4 MB
    double_indirect → ...

open() Lifecycle
int fd = open("/etc/passwd", O_RDONLY);
Internally:
1. Path resolve: "/" → root inode → "etc" → etc inode → "passwd" → passwd inode
   (each directory lookup is a linear scan, O(N), or an indexed lookup such as htree)
2. Permission check
3. Add an entry to the process's file descriptor table
   → points to a row in the system-wide open-file table
   → which points to the inode
4. Return fd (a small integer)

read(fd, buf, 4096):
1. fd → open-file entry → current offset (e.g. 0)
2. inode → block pointers → disk I/O (skipped if the data is already in the page cache)
3. Copy data to buf
4. Update offset (0 → 4096)

VFS — Virtual File System
Every Linux filesystem (ext4, btrfs, xfs, NFS, FUSE, ...) implements the same VFS interface. That's why cat /proc/cpuinfo and cat /etc/passwd go through exactly the same syscalls.
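Because every filesystem sits behind the same calls, one helper can read from any of them. The sketch below is a hypothetical read_prefix function (the name and signature are assumptions for illustration) that walks the open/read/close path and also reports the offset the kernel keeps for the descriptor.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads up to `n` bytes from the start of `path` into `buf`.
 * Returns the number of bytes read, or -1 on error. If `offset_after`
 * is non-NULL, it receives the descriptor's file offset after the read. */
ssize_t read_prefix(const char *path, char *buf, size_t n, off_t *offset_after)
{
    assert(path != NULL && buf != NULL);
    int fd = open(path, O_RDONLY);     /* path resolution + permission check */
    if (fd < 0) return -1;

    ssize_t got = read(fd, buf, n);    /* copies data and advances the offset */
    if (got >= 0 && offset_after != NULL)
        *offset_after = lseek(fd, 0, SEEK_CUR);  /* query offset, don't move it */

    close(fd);
    return got;
}
```

The same function should work unchanged whether the path is a regular file such as /etc/passwd or a procfs entry such as /proc/cpuinfo; only the filesystem code behind the VFS differs.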
/dev/sda1 (ext4)   /dev/sda2 (btrfs)   NFS server   procfs (in-memory)
       │                   │               │                │
       └───────────────────┴─────┬─────────┴────────────────┘
                                 │
                             VFS layer
                                 │
                 syscalls (open/read/write/close)
                                 │
                            application

Why rm Is Fast
Even rm on a 100 GB file finishes almost instantly, because only metadata changes; the data itself isn't erased.

unlink("/foo/bar"):
1. Remove the "bar" entry from the /foo directory
2. Decrement the inode link count
3. If link count = 0 and no process has it open → free the inode and return its
   blocks to the free list
4. The actual disk bytes remain (no overwrite)

Secure delete (shred) overwrites every byte of every block with random data, which is why it is slow.
This is also why file recovery tools can often restore accidentally rm'd files, as long as the blocks have not yet been reused by a new file.
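The "no process has it open" condition in the unlink steps can be checked directly. A minimal sketch, assuming a POSIX system; the function name unlink_while_open and the scratch file name are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a file in `dir`, writes `len` bytes of `msg`, unlinks it while the
 * descriptor is still open, then reads the data back through the descriptor.
 * Returns the link count fstat() reports after the unlink, or -1 on error.
 * `out` must have room for `len` bytes. */
long unlink_while_open(const char *dir, const char *msg, size_t len, char *out)
{
    assert(dir != NULL && msg != NULL && out != NULL);
    char path[4096];
    struct stat st;

    snprintf(path, sizeof path, "%s/unlink-demo-XXXXXX", dir);
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd); unlink(path);
        return -1;
    }

    if (unlink(path) != 0) { close(fd); return -1; }    /* the name is gone */

    if (fstat(fd, &st) != 0) { close(fd); return -1; }  /* inode still exists */
    if (pread(fd, out, len, 0) != (ssize_t)len) { close(fd); return -1; }

    close(fd);  /* last reference dropped: inode and blocks can now be freed */
    return (long)st.st_nlink;
}
```

On typical Linux filesystems the returned link count is 0 while the data is still readable, which is the same mechanism that keeps disk space occupied when a deleted log file is still held open by a running process.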
fsync — Did It Really Reach Disk?
write(fd, buf, 4096);   // page cache only (RAM)
                        // disk hasn't seen it
                        // crash now → data loss possible.
fsync(fd);              // flush page cache → real disk
                        // when this returns 0, the data has been handed to stable storage.

Database commits and WAL flushes all depend on fsync. That's why SQLite's PRAGMA synchronous=OFF is fast but dangerous.
Volatile write caches on SSDs and disk controllers that lie about flush completion complicate things, but as far as the OS is concerned, a successful fsync return is the guarantee.
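A durable write usually needs a little more than one fsync call. The sketch below is one possible shape, assuming Linux/POSIX semantics: it loops over partial writes, checks the fsync result, and also fsyncs the containing directory, since fsync on the file alone does not necessarily make a newly created directory entry durable. The name write_durable and its parameters are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Writes `len` bytes to the file `name` inside directory `dir`.
 * Returns 0 only if the data was written and fsync() succeeded on both
 * the file and the directory; returns -1 otherwise. */
int write_durable(const char *dir, const char *name,
                  const void *data, size_t len)
{
    assert(dir != NULL && name != NULL && (data != NULL || len == 0));
    int dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd < 0) return -1;
    int fd = openat(dfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { close(dfd); return -1; }

    const char *p = data;
    while (len > 0) {                     /* write() may be partial */
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR) continue; /* interrupted: retry */
            close(fd); close(dfd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    int rc = 0;
    if (fsync(fd) != 0) rc = -1;          /* file data and inode metadata */
    if (close(fd) != 0) rc = -1;
    if (fsync(dfd) != 0) rc = -1;         /* the new directory entry */
    close(dfd);
    return rc;
}
```

A caller would treat a return of -1 as "not committed" and must not acknowledge the write to anyone who relies on it surviving a crash.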
Journaling — Crash Consistency
What if a multi-step write is interrupted by a crash?

Legacy ext2: dir entry add + inode update + block alloc interrupted
→ entry exists but link count wrong → fsck must scan the whole filesystem, which takes ages

ext3/4 (journaling):
1. Write "these changes are coming" to the journal (sequential, fast)
2. Write a commit record to the journal
3. Apply the actual changes to their final locations

On post-crash mount:
- Replay committed entries
- Ignore uncommitted ones
→ fsck is short, and metadata consistency is guaranteed (ext4's default mode journals metadata only).

Copy-on-Write — btrfs / ZFS / APFS
Traditional (in-place):
    Modify data at block 5000 → overwrite the same block

Copy-on-Write:
    Copy block 5000 to a new block 6000, modify there
    Atomically update metadata pointer 5000 → 6000

Pros:
- Crash mid-operation = metadata sees either old or new, so no corruption
- Snapshots are essentially free (just keep the old pointers)
- Dedup is natural

Cons:
- Fragmentation
- Free-space accounting is complex

Filesystem Comparison
| FS | Approach | Strengths | Notes |
|---|---|---|---|
| ext4 | journaling | Stable, mature | Default on many Linux distributions |
| xfs | journaling, B-tree dirs | Large, high concurrency | RHEL default |
| btrfs | CoW | Snapshots, dedup, RAID | RAID 5/6 modes still considered unstable |
| ZFS | CoW | Data integrity (checksums), scale | Solaris/FreeBSD origin, Linux too |
| APFS | CoW | SSD-optimized, clone | macOS default |
| NTFS | journaling | ACLs, alternate data streams | Windows default |
Large-Directory Traps
Directory entries kept in a linear list make every lookup in a 1M-file directory O(N).
    ext4: htree (hashed B-tree) → O(log N)
    xfs: B+ tree → O(log N)
    ext2 (legacy): linear list → O(N) per lookup, ls on 1M files takes minutes
→ Use a modern FS for huge directories (e.g. /var/spool).

Common Pitfalls
- Inode exhaustion: blocks are free but inodes are used up. Check df -i. Creating millions of small files triggers this.
- Skipping fsync: data loss on power loss. Databases and other important data must fsync.
- O_DIRECT misuse: it bypasses the page cache. The default buffered path is faster for most apps. Only databases and cache managers that do their own caching really need O_DIRECT.
- mmap SIGBUS: if a mmaped file is truncated, accessing the truncated region raises SIGBUS. Don't shrink a file while it is mapped, or munmap the affected range first.
- Metadata overhead of many small file writes: 10K files = 10K directory entries + 10K inodes + 10K block allocations. Packing them into a single tar archive is much faster.
Wrap-up
Filesystems are more than "data storage" — they juggle concurrency, crash consistency, efficient lookup, and safe permissions all at once. Understanding the layer lets you answer "why is our backup so slow?" / "why did SQLite suddenly speed up?"