How System Calls Actually Work

Deep inside printf, libc eventually calls write(1, ...), and that call is the entry into the kernel. Crossing from user mode to kernel mode (a syscall) costs on the order of 100 ns. This guide covers where that cost comes from and how to reduce it (batching, mmap, vDSO, io_uring).

User Mode vs Kernel Mode

CPU protection rings:

Ring 0  — kernel (all instructions, all memory)
Rings 1, 2 — rarely used
Ring 3  — user application (restricted)

User mode cannot:
- Directly access I/O ports (disk, NIC)
- Read/write arbitrary physical memory
- Change interrupt masks
- Touch other processes' memory
→ must ask the kernel = syscall

The Actual syscall Instruction

write(1, "hi", 2);

Internally (x86-64 Linux):
  mov rax, 1       ; syscall number (1 = write)
  mov rdi, 1       ; arg1 = fd (1 = stdout)
  mov rsi, msg     ; arg2 = buffer pointer
  mov rdx, 2       ; arg3 = size
  syscall          ; ← one CPU instruction

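The "syscall" instruction itself:
1. Switches the CPU from ring 3 to ring 0
2. Saves the user rip in rcx and the user rflags in r11 (which is why a syscall clobbers both registers)
3. Loads rip from the IA32_LSTAR MSR, i.e. jumps to the kernel's syscall entry point

It does not touch rsp. The kernel's entry code saves the user rsp and switches to the kernel stack in software.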

Kernel handler:
- Dispatches via rax to the syscall table
- Calls sys_write()
- Puts result in rax
- Returns to user via "sysret"
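
The register setup above can be reproduced from C without writing assembly. The following is a minimal sketch, assuming Linux with glibc: syscall() is glibc's generic wrapper that loads the syscall number and arguments into registers and executes the instruction. The name raw_write is hypothetical, chosen for illustration.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>  /* SYS_write */
#include <unistd.h>       /* syscall() */

/* Issue write(2) through the generic syscall() wrapper instead of the
 * write() wrapper. Returns bytes written, or -1 with errno set. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    assert(buf != NULL || len == 0);
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "hi", 2) should behave like write(1, "hi", 2); on x86-64, SYS_write expands to 1, matching the rax value in the listing.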

Why 100 ns?

A plain function call costs roughly 1 ns; a minimal syscall costs roughly 100 ns. Where does the 100× come from?

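Privilege transition: ring 3 → ring 0 and back, plus a CR3 (page table base) reload when KPTI is active
Stack switch: user stack → kernel stack
Register save/restore: the kernel backs up the user registers on entry and restores them on exit
Cache / TLB pollution: fetching kernel code causes I-cache misses; kernel data access causes D-cache misses
Meltdown mitigation (KPTI): since 2018, kernels on affected CPUs switch page tables on every kernel entry and exit, which can nearly double the cost of a minimal syscall. Spectre mitigations add further entry overhead.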
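
The 100 ns figure is easy to check on your own machine. Below is a rough microbenchmark sketch, assuming Linux with glibc; now_ns and ns_per_syscall are hypothetical helper names. The result depends heavily on CPU, kernel version, and enabled mitigations, so treat it as an order-of-magnitude estimate.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Average wall-clock cost of one getpid syscall over `iters` calls.
 * syscall(SYS_getpid) is used rather than getpid() so that nothing in
 * libc can answer without entering the kernel. */
double ns_per_syscall(long iters)
{
    assert(iters > 0);
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);
    return (double)(now_ns() - start) / (double)iters;
}
```

Printing ns_per_syscall(1000000) from a small main gives a per-call estimate that includes a little loop overhead. getpid does almost no work in the kernel, so the number is close to pure entry/exit cost.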

strace — What Syscalls Are You Making?

$ strace -c ls

% time  seconds   usecs/call  calls  errors  syscall
------  --------  ----------  -----  ------  ----------
 60.32  0.000045         0      89          read
 19.46  0.000015         0      37          write
  9.32  0.000007         0      10          openat
  ...

→ Summarizes time, call count, and errors per syscall, sorted by time by default (use -S calls to sort by count). Check hot-path syscall counts.

Specific syscalls only:
$ strace -e trace=openat,read ./program    # modern glibc implements open() via openat

Attach to running process:
$ strace -p $(pidof program)

vDSO — Syscall Without the Syscall

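Functions that only read kernel state, like gettimeofday() and clock_gettime(), are called often enough that a full syscall per call is wasteful. Fix: the kernel maps a small shared object (code plus a read-only data page that it keeps up to date) into every process, so these calls run entirely in user space.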

$ grep vdso /proc/self/maps
7ffd...000-7ffd...000  r-xp 00000000 00:00 0  [vdso]

Functions in vDSO:
- clock_gettime
- gettimeofday
- getcpu
- time

glibc clock_gettime implementation:
  if (vdso_clock_gettime != NULL)
    return vdso_clock_gettime(clk, tp);   ← normal function call, ~5 ns
  else
    return syscall(SYS_clock_gettime, clk, tp);  ← fallback, ~100 ns

→ Same function, roughly 20× faster by the numbers above (~5 ns vs ~100 ns).
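
To see the two paths side by side, the sketch below reads the same clock once through libc and once through a forced kernel entry. It assumes x86-64 Linux with glibc; time_via_libc and time_via_syscall are hypothetical names. Whether the libc path really stays in user space depends on the system: if the kernel cannot use a vDSO-capable clocksource, the vDSO itself falls back to a real syscall.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <time.h>
#include <unistd.h>

/* Goes through libc, which normally dispatches to the vDSO. */
int time_via_libc(struct timespec *ts)
{
    assert(ts != NULL);
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Bypasses the libc wrapper and always enters the kernel. */
int time_via_syscall(struct timespec *ts)
{
    assert(ts != NULL);
    return (int)syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ts);
}
```

Timing a loop of each function shows the gap directly, and running the program under strace should show only the second variant as clock_gettime entries when the vDSO path is active.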

Reducing Syscall Cost — Design Patterns

1. Batch / Vectored I/O

100 × write(fd, &b, 1) → 100 syscalls = 10 μs
1 × write(fd, buf, 100)  → 1 syscall = 100 ns

→ 100× faster. stdio's (printf) buffering is exactly this.
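
The idea behind stdio-style buffering fits in a few lines. This is a minimal sketch, not how glibc implements FILE: struct outbuf and its functions are hypothetical, the buffer size is arbitrary, and EINTR handling is left out. The syscalls field exists only to make the saving visible.

```c
#include <assert.h>
#include <unistd.h>

/* Hypothetical minimal output buffer, in the spirit of stdio's FILE. */
struct outbuf {
    int    fd;
    size_t used;          /* bytes currently buffered */
    long   syscalls;      /* number of write() calls issued so far */
    char   data[4096];
};

/* Write out everything buffered so far. Returns 0 on success, -1 on error. */
int outbuf_flush(struct outbuf *b)
{
    size_t off = 0;
    while (off < b->used) {
        ssize_t n = write(b->fd, b->data + off, b->used - off);
        b->syscalls++;
        if (n < 0)
            return -1;
        off += (size_t)n;   /* short write: loop and send the rest */
    }
    b->used = 0;
    return 0;
}

/* Append one byte; the kernel is entered only when the buffer is full. */
int outbuf_putc(struct outbuf *b, char c)
{
    assert(b->used <= sizeof b->data);
    if (b->used == sizeof b->data && outbuf_flush(b) < 0)
        return -1;
    b->data[b->used++] = c;
    return 0;
}
```

With a zero-initialized struct outbuf pointing at a valid descriptor, 100 calls to outbuf_putc followed by one outbuf_flush typically issue a single write() instead of 100.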

writev() / readv() — multiple buffers in one syscall:
  struct iovec iov[3] = {
    {hdr, hdr_len},
    {body, body_len},
    {footer, footer_len}
  };
  writev(fd, iov, 3);   // send 3 regions in one syscall
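
A complete version of the fragment above, as a sketch: send_three and writev_demo are hypothetical names, and a pipe stands in for whatever descriptor you would really write to. writev can return a short count on pipes and sockets; this sketch does not retry.

```c
#include <assert.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send three NUL-terminated strings with a single writev() call.
 * Returns total bytes written, or -1 with errno set. */
ssize_t send_three(int fd, const char *hdr, const char *body, const char *footer)
{
    assert(hdr && body && footer);
    struct iovec iov[3] = {
        { .iov_base = (void *)hdr,    .iov_len = strlen(hdr)    },
        { .iov_base = (void *)body,   .iov_len = strlen(body)   },
        { .iov_base = (void *)footer, .iov_len = strlen(footer) },
    };
    return writev(fd, iov, 3);
}

/* Usage: push three pieces through a pipe and check that they arrive
 * concatenated and in order. Returns 1 on success, 0 otherwise. */
int writev_demo(void)
{
    int p[2];
    char got[32] = {0};
    if (pipe(p) < 0)
        return 0;
    ssize_t sent = send_three(p[1], "HDR|", "body", "|END");
    ssize_t n = read(p[0], got, sizeof got - 1);
    close(p[0]);
    close(p[1]);
    return sent == 12 && n == 12 && strcmp(got, "HDR|body|END") == 0;
}
```

The alternative is either three write() calls or a memcpy into one staging buffer; writev avoids both the extra syscalls and the extra copy.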

2. mmap — Memory-Map Instead of read

Traditional:
  for (...) read(fd, buf, size);   // syscall per read

mmap:
  char *p = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { /* handle error: mmap does not return NULL on failure */ }
  // one syscall maps the whole file into memory
  // subsequent p[i] access = just memory (kernel only on page fault)

→ Great for random access on large files.
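
A fuller sketch of the mmap pattern, with the error handling the fragment omits. count_byte_mmap is a hypothetical helper; it scans the whole file, which is the simplest thing to verify, although the pattern pays off most with random access.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Count occurrences of byte `c` in a file by mapping it instead of
 * calling read() in a loop. Returns the count, or -1 on error. */
long count_byte_mmap(const char *path, char c)
{
    assert(path != NULL);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {            /* mmap rejects zero-length mappings */
        close(fd);
        return 0;
    }

    size_t len = (size_t)st.st_size;
    char *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                        /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < len; i++)  /* plain loads; kernel only on page faults */
        if (p[i] == c)
            count++;

    munmap(p, len);
    return count;
}
```

For example, count_byte_mmap(path, '\n') counts lines using a fixed handful of syscalls (open, fstat, mmap, close, munmap) regardless of file size.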

3. io_uring (Linux 5.1+) — Batched Async Syscalls

Traditional:
  read(fd1, b1, n);     // 100 ns syscall
  read(fd2, b2, n);     // 100 ns syscall
  read(fd3, b3, n);     // 100 ns syscall
  → 3 syscalls: ~300 ns of entry/exit overhead, 3 user/kernel transitions

io_uring:
  Submit 3 reads on the submission queue (memory write, no syscall)
  io_uring_enter() once (single syscall)
  Kernel processes 3 asynchronously → completion queue
  → 1 syscall: ~100 ns of entry/exit overhead, 1 user/kernel transition

→ Big wins for disk/network-heavy workloads.

Modern Measurement — eBPF + bpftrace

# Latency histogram per syscall number, system-wide
sudo bpftrace -e '
  tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
  tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    @ns[args->id] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
  }'

→ Live production measurement without kernel changes.

Per-OS Syscall ABI

OS	Instruction	Arg passing	Notes
Linux x86-64	syscall	rdi, rsi, rdx, r10, r8, r9	Stable ABI
macOS x86-64	syscall	Same as Linux	Going around libSystem is discouraged
Windows	syscall (varies)	Unofficial	Only calls via ntdll are safe; the kernel ABI is unofficial
BSD	syscall	Similar to Linux	-

WebAssembly + WASI

WebAssembly has no syscalls of its own: file and network I/O must be provided by the host. WASI standardizes that interface as a set of sandboxed, syscall-like functions. Wasmtime, Cloudflare Workers, and other runtimes implement WASI.

Common Pitfalls

Syscall in a tight loop: 1M calls of even the simplest syscall = 100 ms. Profile and you may find a surprising hot spot.
fork() cost: fork copies the parent's page tables. Even with CoW, a process with a large address space pays for a large metadata copy on every fork.
Limited calls in signal handlers: only async-signal-safe functions are allowed (write is, printf is not).
strace overhead: strace is built on ptrace, which stops the traced process at every syscall entry and exit and can slow it down dramatically. In production, use perf or eBPF instead.
Missing EINTR handling: read can return -1 with errno set to EINTR if a signal arrives. A retry loop is required.
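
The EINTR retry loop from the last item can be sketched as follows. read_full is a hypothetical helper name; it also handles short reads, which need the same loop.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

/* Read exactly `len` bytes unless EOF or a real error intervenes.
 * Returns bytes read (less than len only at EOF), or -1 with errno set. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    assert(buf != NULL || len == 0);
    size_t done = 0;
    while (done < len) {
        ssize_t n = read(fd, (char *)buf + done, len - done);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted by a signal: retry */
            return -1;          /* real error */
        }
        if (n == 0)
            break;              /* EOF */
        done += (size_t)n;      /* short read: keep going */
    }
    return (ssize_t)done;
}
```

Installing handlers with SA_RESTART makes the kernel restart many interrupted calls automatically, but not all of them, so the explicit loop is the safer default.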

Wrap-up

Syscalls are the contract between user space and the kernel. At roughly 100 ns each they add up, and a meaningful share of runtime can go there. Batching, mmap, vDSO, and io_uring are all ways to make fewer of them.

Where to start performance analysis: strace -c. See the syscall count and types; if excessive, consider batching.