Somewhere inside printf, write(1, ...) is eventually called, and that call is the entry into the kernel. Crossing from user mode into kernel mode (a syscall) costs on the order of 100 ns. This guide covers why, and how to reduce that cost (batching, mmap, vDSO, io_uring).
User Mode vs Kernel Mode
CPU protection rings:
Ring 0 — kernel (all instructions, all memory)
Rings 1, 2 — rarely used
Ring 3 — user application (restricted)
User mode cannot:
- Directly access I/O ports (disk, NIC)
- Read/write arbitrary physical memory
- Change interrupt masks
- Touch other processes' memory
→ must ask the kernel = syscall
The Actual syscall Instruction
write(1, "hi", 2);
Internally (x86-64 Linux):
mov rax, 1 ; syscall number (1 = write)
mov rdi, 1 ; arg1 = fd (1 = stdout)
mov rsi, msg ; arg2 = buffer pointer
mov rdx, 2 ; arg3 = size
syscall ; ← one CPU instruction
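The same request can be made from C without hand-written assembly. The sketch below assumes Linux with glibc, where syscall(2) places the syscall number and arguments in the registers listed above and then executes the instruction. raw_write is a hypothetical helper name used only for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/syscall.h>  /* SYS_write */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: bypasses the write() wrapper and issues the
 * write syscall directly. Returns bytes written, or -1 with errno set. */
long raw_write(int fd, const char *msg)
{
    assert(msg != NULL);
    return syscall(SYS_write, fd, msg, strlen(msg));
}
```

Calling raw_write(1, "hi") should behave like write(1, "hi", 2). Running the program under strace shows the same write line either way.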
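The "syscall" instruction:
1. Switches user mode → kernel mode (ring 3 → ring 0)
2. Saves the user rip in rcx and rflags in r11
3. Jumps to the kernel's syscall entry point, whose address is held in the IA32_LSTAR MSR
It does not save rsp or switch stacks. The kernel's entry code does that next:
4. Switches to the kernel stack
5. Saves the remaining user registers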
Kernel handler:
- Dispatches via rax to the syscall table
- Calls sys_write()
- Puts result in rax
- Returns to user via "sysret"
Why 100 ns?
A function call costs roughly 1 ns; a syscall costs roughly 100 ns. Where does the 100× come from?
- Privilege transition: the CPU changes rings, and with KPTI it also reloads CR3 (the page-table base)
- Stack switch: user stack → kernel stack
- Register save/restore: user registers are saved on entry and restored on exit
- Cache / TLB pollution: fetching kernel code causes I-cache misses; kernel data access causes D-cache misses
- Meltdown/Spectre mitigations (such as KPTI): since 2018 the kernel swaps page tables on every entry and exit on affected CPUs, which can nearly double syscall cost
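The per-call figure is easy to check on your own machine. The following is a minimal sketch, assuming Linux with glibc. It times a loop of getppid calls issued through syscall(2), so each iteration really enters the kernel. The helper names are hypothetical, and the result varies with CPU, kernel version, and enabled mitigations.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC timestamps. */
double elapsed_ns(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1e9 + (double)(b.tv_nsec - a.tv_nsec);
}

/* Hypothetical helper: average cost in ns of one trivial syscall. */
double ns_per_syscall(long iters)
{
    struct timespec t0, t1;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++)
        syscall(SYS_getppid);   /* does almost no work inside the kernel */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1) / (double)iters;
}
```

Printing ns_per_syscall(1000000) gives a rough lower bound for any syscall on that system, since getppid does almost nothing beyond the entry and exit path.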
strace — What Syscalls Are You Making?
$ strace -c ls
% time seconds usecs/call calls errors syscall
------ -------- ---------- ----- ------ ----------
60.32 0.000045 0 89 read
19.46 0.000015 0 37 write
9.32 0.000007 0 10 openat
...
→ Summarizes time, call count, and errors per syscall, sorted by time by default. Check the call counts on your hot path.
Specific syscalls only (modern glibc issues openat rather than open):
$ strace -e trace=openat,read ./program
Attach to running process:
$ strace -p $(pidof program)
vDSO: Syscall Without the Syscall
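Functions that only read kernel state, such as gettimeofday() and clock_gettime(), are called often, so paying for a full syscall each time is wasteful. The fix: the kernel maps a small shared object (the vDSO), containing code plus read-only data, into every process, so these calls run entirely in user space.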
$ grep vdso /proc/self/maps
7ffd...000-7ffd...000 r-xp 00000000 00:00 0 [vdso]
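The same mapping can be located from inside a process. This sketch assumes Linux with glibc, where getauxval(AT_SYSINFO_EHDR) returns the vDSO base address passed by the kernel in the auxiliary vector. vdso_is_mapped is a hypothetical helper name.

```c
#include <assert.h>
#include <string.h>
#include <sys/auxv.h>   /* getauxval, AT_SYSINFO_EHDR */

/* The auxv value is an unsigned long that is treated as a pointer below. */
static_assert(sizeof(unsigned long) >= sizeof(void *),
              "auxv value must be wide enough to hold a pointer");

/* Hypothetical helper: returns 1 if a vDSO is mapped into this process
 * and the mapping starts with the ELF magic bytes, otherwise 0. */
int vdso_is_mapped(void)
{
    const unsigned char *base =
        (const unsigned char *)getauxval(AT_SYSINFO_EHDR);
    if (base == NULL)
        return 0;   /* the kernel provided no vDSO */
    return memcmp(base, "\177ELF", 4) == 0;
}
```

The ELF magic check reflects that the vDSO is an ordinary shared object image, which is why libc can look up symbols such as clock_gettime in it.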
Functions in vDSO:
- clock_gettime
- gettimeofday
- getcpu
- time
glibc clock_gettime implementation (simplified):
if (vdso_clock_gettime != NULL)
    return vdso_clock_gettime(clk, tp);          /* normal function call, ~5 ns */
else
    return syscall(SYS_clock_gettime, clk, tp);  /* fallback, ~100 ns */
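Both paths can be timed side by side. The sketch below assumes x86-64 Linux with glibc. It compares the libc clock_gettime(), which uses the vDSO when available, against the same request forced through syscall(2). The helper names are hypothetical, and on some virtual machines the vDSO itself falls back to a real syscall, so the gap may shrink or vanish.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Nanoseconds elapsed between two CLOCK_MONOTONIC timestamps. */
double elapsed_ns(struct timespec a, struct timespec b)
{
    return (double)(b.tv_sec - a.tv_sec) * 1e9 + (double)(b.tv_nsec - a.tv_nsec);
}

/* Hypothetical helper: average ns per clock read.
 * use_raw == 0: libc clock_gettime() (vDSO path when available).
 * use_raw != 0: a real syscall on every iteration. */
double ns_per_clock_read(long iters, int use_raw)
{
    struct timespec t0, t1, tmp;
    assert(iters > 0);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < iters; i++) {
        if (use_raw)
            syscall(SYS_clock_gettime, CLOCK_MONOTONIC, &tmp);
        else
            clock_gettime(CLOCK_MONOTONIC, &tmp);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1) / (double)iters;
}
```

Comparing ns_per_clock_read(1000000, 0) with ns_per_clock_read(1000000, 1) shows how much the vDSO saves on a given machine.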
→ Same function, roughly 20× faster by these numbers.
Reducing Syscall Cost: Design Patterns
1. Batch / Vectored I/O
100 × write(fd, &b, 1) → 100 syscalls = 10 μs
1 × write(fd, buf, 100) → 1 syscall = 100 ns
→ 100× faster. stdio's (printf) buffering is exactly this.
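The call counts can be made concrete with a small sketch. write_in_chunks is a hypothetical helper that writes a buffer in pieces of a chosen size and reports how many write() calls were needed. It treats any short or failed write as an error for simplicity, which a production version would not.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: writes len bytes to fd in pieces of at most
 * `chunk` bytes. Returns the number of write() calls made, or -1 on error. */
long write_in_chunks(int fd, const char *buf, size_t len, size_t chunk)
{
    long calls = 0;
    assert(chunk > 0);
    while (len > 0) {
        size_t want = len < chunk ? len : chunk;
        ssize_t n = write(fd, buf, want);
        if (n <= 0)
            return -1;
        buf += n;
        len -= (size_t)n;
        calls++;
    }
    return calls;
}
```

With a 100-byte buffer, a chunk size of 1 takes 100 calls and a chunk size of 100 takes one. The kernel does the same amount of copying in both cases, so the difference is almost entirely syscall overhead.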
writev() / readv() — multiple buffers in one syscall:
struct iovec iov[3] = {
    {hdr, hdr_len},
    {body, body_len},
    {footer, footer_len}
};
writev(fd, iov, 3); // send 3 regions in one syscall
2. mmap: Memory-Map Instead of read
Traditional:
for (...) read(fd, buf, size); // syscall per read
mmap:
char* p = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
// one syscall maps the whole file into memory
// subsequent p[i] access = just memory (kernel only on page fault)
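A fuller sketch of the same pattern, assuming a regular file that fits in the address space. count_byte_mmap is a hypothetical helper that scans a file through a mapping instead of a read() loop.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: counts occurrences of byte c in the file at path.
 * Returns the count, or -1 on error. */
long count_byte_mmap(const char *path, char c)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    if (st.st_size == 0)    { close(fd); return 0; }  /* mmap rejects length 0 */

    char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                  /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)
        if (p[i] == c)          /* plain memory access; kernel runs only on page faults */
            count++;

    munmap(p, (size_t)st.st_size);
    return count;
}
```

The whole scan costs a fixed handful of syscalls (open, fstat, mmap, close, munmap) regardless of file size, although each first touch of a page still takes a page fault.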
→ Great for random access on large files.
3. io_uring (Linux 5.1+): Batched Async Syscalls
Traditional:
read(fd1, b1, n); // 100 ns syscall
read(fd2, b2, n); // 100 ns syscall
read(fd3, b3, n); // 100 ns syscall
→ 300 ns + 3 user/kernel transitions
io_uring:
Submit 3 reads on the submission queue (a write to shared memory, no syscall)
io_uring_enter() once (single syscall)
Kernel processes the 3 reads asynchronously and posts results to the completion queue
→ roughly 100 ns of syscall overhead + 1 user/kernel transition
→ Big wins for disk/network-heavy workloads.
Modern Measurement: eBPF + bpftrace
# Latency histogram per syscall number, system-wide, right now
sudo bpftrace -e '
tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    @ns[args->id] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
}'
→ Live production measurement without kernel changes.
Per-OS Syscall ABI
| OS | Instruction | Arg passing | Notes |
|---|---|---|---|
| Linux x86-64 | syscall | rdi, rsi, rdx, r10, r8, r9 | Stable ABI |
| macOS x86-64 | syscall | Same registers as Linux | Bypassing libSystem is discouraged |
| Windows | syscall (varies) | Not documented | Kernel ABI is unofficial; only calling via ntdll is safe |
| BSD | syscall | Similar to Linux | |
WebAssembly + WASI
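WebAssembly has no syscall instruction of its own: file and network I/O must be supplied by the host. WASI standardizes that interface as a set of sandboxed, capability-based system calls. Runtimes such as Wasmtime and platforms such as Cloudflare Workers implement WASI.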
Common Pitfalls
- Syscall in a tight loop: 1M calls of even the simplest syscall take about 100 ms. Profile and you may find a surprising hot spot.
- fork() cost: fork copies the parent's page tables even with CoW. On x86-64 that is about 8 bytes of page-table entry per 4 KiB page, so forking a process with a very large address space copies a lot of metadata.
- Limited calls in signal handlers: only async-signal-safe functions may be called (write is; printf is not).
- strace overhead: strace is built on ptrace and stops the traced process at every syscall, which can slow it down severely. In production, prefer perf or eBPF.
- Missing EINTR handling: read can return -1 with errno set to EINTR if a signal arrives. A retry loop is required.
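The EINTR retry loop from the last item can be sketched as follows. read_retry is a hypothetical wrapper name. It restarts read() only when the call was interrupted before transferring any data, and passes every other result through unchanged.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: like read(), but retries when a signal interrupts
 * the call. Returns bytes read, 0 at end of file, or -1 on a real error. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    assert(buf != NULL);
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

Installing handlers with sigaction and SA_RESTART removes many of these interruptions, but not all of them, so a wrapper like this is still a common defensive habit.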
Wrap-up
Syscalls are the contract between user space and the kernel. 100 ns per call adds up, and a meaningful share of run time can go there. Batching, mmap, vDSO, and io_uring are all ways to make fewer syscalls.
Where to start performance analysis: strace -c. See the syscall count and types; if excessive, consider batching.