How System Calls Actually Work

User mode vs kernel mode, the syscall instruction (x86 syscall, ARM svc), what strace shows, why syscalls cost 100 ns minimum, vDSO (clock_gettime without a syscall), and io_uring as the future of syscall amortization.

Deep inside printf, libc eventually calls write(1, ...), and that call is the entry into the kernel. Crossing from user mode into kernel mode and back (a syscall) costs on the order of 100 ns on typical hardware. This guide covers where that cost comes from and how to reduce it (batching, vDSO, io_uring).

User Mode vs Kernel Mode

CPU protection rings (x86; ARM uses exception levels EL0 and EL1 for the same user/kernel split):

Ring 0  — kernel (all instructions, all memory)
Rings 1, 2 — rarely used
Ring 3  — user application (restricted)

User mode cannot:
- Directly access I/O ports (disk, NIC)
- Read/write arbitrary physical memory
- Change interrupt masks
- Touch other processes' memory
→ must ask the kernel = syscall

The Actual syscall Instruction

write(1, "hi", 2);

Internally (x86-64 Linux):
  mov rax, 1       ; syscall number (1 = write)
  mov rdi, 1       ; arg1 = fd (1 = stdout)
  mov rsi, msg     ; arg2 = buffer pointer
  mov rdx, 2       ; arg3 = size
  syscall          ; ← one CPU instruction

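The "syscall" instruction itself:
1. Switches user mode → kernel mode (ring 3 → ring 0)
2. Saves the user rip in rcx and rflags in r11 (so both registers are clobbered)
3. Jumps to the kernel's syscall entry point (address held in the IA32_LSTAR MSR)

Kernel entry code and handler:
- Switches to the kernel stack and saves the user registers (the instruction does not do this; rsp still points at the user stack on entry)
- Dispatches via rax to the syscall table
- Calls sys_write()
- Puts the result in rax (a negative errno value on failure)
- Returns to user mode via "sysret" (or "iret" on some paths)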
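
The same request can be made from C without writing assembly. The sketch below uses libc's generic syscall() wrapper, which places the syscall number and arguments in the registers the platform ABI expects and then executes the trap instruction. The helper name raw_write is hypothetical and exists only for illustration.

```c
#define _GNU_SOURCE
#include <sys/syscall.h>  /* SYS_write */
#include <unistd.h>       /* syscall() */

/* Hypothetical helper: issue write(2) by syscall number, bypassing the
 * libc write() wrapper. Returns the byte count, or -1 with errno set. */
long raw_write(int fd, const void *buf, unsigned long len)
{
    return syscall(SYS_write, fd, buf, len);
}
```

Calling raw_write(1, "hi", 2) should show up in strace as an ordinary write(1, "hi", 2) line, since the kernel only sees the syscall number and arguments.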

Why 100 ns?

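A plain function call costs about 1 ns; a minimal syscall costs about 100 ns. Where does the 100× come from?

  • Privilege transition: mode switch through the syscall entry path; with KPTI also a CR3 (page table base) update
  • Stack switch: user stack → kernel stack
  • Register save/restore: user registers are saved on entry and restored on exit
  • Cache / TLB pollution: fetching kernel code → I-cache misses; kernel data access → D-cache misses
  • Meltdown mitigation (KPTI): since 2018 the kernel switches page tables on every entry and exit on affected CPUs, which can nearly double the cost of a minimal syscall. Spectre mitigations add further overhead on the entry path.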
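
The gap is easy to measure. The following is a rough micro-benchmark sketch, not a rigorous one: it times a loop of getpid syscalls against a loop of calls to a trivial non-inlined function. All function names are hypothetical, and the numbers depend heavily on CPU, kernel version, and enabled mitigations.

```c
#define _GNU_SOURCE
#include <stdint.h>
#include <sys/syscall.h>  /* SYS_getpid */
#include <time.h>         /* clock_gettime */
#include <unistd.h>       /* syscall() */

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/* Trivial function kept out of line so the call is really made. */
__attribute__((noinline)) long plain_call(long x)
{
    __asm__ volatile("");   /* stop the compiler from folding the call away */
    return x + 1;
}

/* Average ns per getpid syscall. syscall(SYS_getpid) is used rather than
 * getpid() so that every iteration really enters the kernel. */
double avg_syscall_ns(long iters)
{
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++)
        syscall(SYS_getpid);
    return (double)(now_ns() - start) / (double)iters;
}

/* Average ns per ordinary function call, for comparison. */
double avg_call_ns(long iters)
{
    volatile long sink = 0;
    uint64_t start = now_ns();
    for (long i = 0; i < iters; i++)
        sink = plain_call(i);
    (void)sink;
    return (double)(now_ns() - start) / (double)iters;
}
```

Printing avg_syscall_ns(1000000) next to avg_call_ns(1000000) gives a feel for the ratio on a given machine; running it with and without mitigations enabled shows how much of the cost they account for.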

strace — What Syscalls Are You Making?

$ strace -c ls

% time  seconds   usecs/call  calls  errors  syscall
------  --------  ----------  -----  ------  ----------
 60.32  0.000045         0      89          read
 19.46  0.000015         0      37          write
  9.32  0.000007         0      10          openat
  ...

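→ Summarizes time, call count, and errors per syscall, sorted by time by default (-S calls sorts by call count). Check hot-path syscall counts.

Specific syscalls only (modern glibc implements open() via openat):
$ strace -e trace=openat,read ./program

Attach to a running process:
$ strace -p "$(pidof program)"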

vDSO — Syscall Without the Syscall

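Functions that only read kernel state, such as gettimeofday() and clock_gettime(), are called far too often to pay the full syscall cost each time. Fix: the kernel maps a small shared object (the vDSO), together with read-only data that it keeps up to date, into every process, so these functions run entirely in user mode.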

$ cat /proc/self/maps | grep vdso
7ffd...000-7ffd...000  r-xp 00000000 00:00 0  [vdso]

Functions in vDSO:
- clock_gettime
- gettimeofday
- getcpu
- time

glibc clock_gettime implementation (simplified):
  if (vdso_clock_gettime != NULL)
    return vdso_clock_gettime(clk, tp);   ← normal function call, typically under 20 ns
  else
    return syscall(SYS_clock_gettime, clk, tp);  ← fallback, ~100 ns

→ Same function, several times faster, with no kernel entry.
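
Both paths can be exercised from a small program. The sketch below assumes a 64-bit Linux system where SYS_clock_gettime is defined; the function names are hypothetical. getauxval(AT_SYSINFO_EHDR) reports the address where the kernel mapped the vDSO, or 0 if there is none.

```c
#define _GNU_SOURCE
#include <sys/auxv.h>     /* getauxval, AT_SYSINFO_EHDR */
#include <sys/syscall.h>  /* SYS_clock_gettime */
#include <time.h>         /* clock_gettime, struct timespec */
#include <unistd.h>       /* syscall() */

/* Address at which the kernel mapped the vDSO, or 0 if it is absent. */
unsigned long vdso_base(void)
{
    return getauxval(AT_SYSINFO_EHDR);
}

/* Normal path: libc uses the vDSO implementation when one is available. */
int time_via_libc(struct timespec *ts)
{
    return clock_gettime(CLOCK_MONOTONIC, ts);
}

/* Forced kernel entry: same result, but always a real syscall. */
int time_via_syscall(struct timespec *ts)
{
    return (int)syscall(SYS_clock_gettime, CLOCK_MONOTONIC, ts);
}
```

Under strace, calls to time_via_syscall appear as clock_gettime lines, while calls to time_via_libc normally produce no output at all, because nothing crosses into the kernel.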

Reducing Syscall Cost — Design Patterns

1. Batch / Vectored I/O

100 × write(fd, &b, 1) → 100 syscalls = 10 μs
1 × write(fd, buf, 100)  → 1 syscall = 100 ns

→ 100× faster. stdio's (printf) buffering is exactly this.
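
To make the buffering idea concrete, here is a minimal sketch of a user-space output buffer in the spirit of stdio. The struct and function names are hypothetical, the 4096-byte size is an arbitrary assumption, and EINTR handling is left out for brevity.

```c
#include <stddef.h>
#include <unistd.h>   /* write, ssize_t */

/* Hypothetical minimal output buffer. */
struct outbuf {
    int    fd;
    size_t used;          /* bytes currently buffered */
    long   syscalls;      /* number of write() calls issued so far */
    char   data[4096];
};

/* Push all buffered bytes to the fd. Returns 0 on success, -1 on error. */
int outbuf_flush(struct outbuf *b)
{
    size_t off = 0;
    while (off < b->used) {
        ssize_t n = write(b->fd, b->data + off, b->used - off);
        b->syscalls++;
        if (n < 0)
            return -1;
        off += (size_t)n;   /* handle short writes by continuing */
    }
    b->used = 0;
    return 0;
}

/* Append one byte; the kernel is entered only when the buffer is full. */
int outbuf_putc(struct outbuf *b, char c)
{
    if (b->used == sizeof b->data && outbuf_flush(b) < 0)
        return -1;
    b->data[b->used++] = c;
    return 0;
}
```

With this in place, 100 calls to outbuf_putc followed by one outbuf_flush issue a single write() of 100 bytes instead of 100 one-byte writes.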

writev() / readv() — multiple buffers in one syscall:
  struct iovec iov[3] = {
    {hdr, hdr_len},
    {body, body_len},
    {footer, footer_len}
  };
  writev(fd, iov, 3);   // send 3 regions in one syscall
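
A complete version of the fragment above might look like the following sketch. send_message is a hypothetical helper that takes NUL-terminated strings for simplicity; real protocol code would usually carry explicit lengths.

```c
#include <string.h>    /* strlen */
#include <sys/uio.h>   /* writev, struct iovec */
#include <unistd.h>    /* ssize_t, pipe/read for callers */

/* Send header, body, and footer with one syscall instead of three.
 * Returns the total number of bytes written, or -1 on error. */
ssize_t send_message(int fd, const char *hdr, const char *body,
                     const char *footer)
{
    struct iovec iov[3] = {
        { .iov_base = (void *)hdr,    .iov_len = strlen(hdr)    },
        { .iov_base = (void *)body,   .iov_len = strlen(body)   },
        { .iov_base = (void *)footer, .iov_len = strlen(footer) },
    };
    return writev(fd, iov, 3);
}
```

Like write(), writev() may write fewer bytes than requested, so production code should check the return value against the total length and resume from the right offset.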

2. mmap — Memory-Map Instead of read

Traditional:
  for (...) read(fd, buf, size);   // syscall per read

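mmap:
  char* p = mmap(NULL, file_size, PROT_READ, MAP_SHARED, fd, 0);
  if (p == MAP_FAILED) { /* handle error */ }
  // one syscall maps the whole file into memory
  // subsequent p[i] access = just memory (kernel only on page fault)

→ Great for random access on large files. For a single sequential pass, large read() calls are often just as fast, because every page fault also enters the kernel.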
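
As a fuller illustration, the sketch below maps a file read-only and scans it without any read() calls. count_byte_mmap is a hypothetical helper, and error handling is reduced to returning -1.

```c
#include <fcntl.h>      /* open */
#include <stddef.h>
#include <sys/mman.h>   /* mmap, munmap */
#include <sys/stat.h>   /* fstat */
#include <unistd.h>     /* close */

/* Count occurrences of `needle` in the file at `path` by mapping it.
 * Returns the count, or -1 on error. */
long count_byte_mmap(const char *path, char needle)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return -1;
    }
    if (st.st_size == 0) {          /* mmap rejects a zero-length mapping */
        close(fd);
        return 0;
    }

    const char *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping stays valid after close */
    if (p == MAP_FAILED)
        return -1;

    long count = 0;
    for (off_t i = 0; i < st.st_size; i++)   /* plain memory reads */
        if (p[i] == needle)
            count++;

    munmap((void *)p, (size_t)st.st_size);
    return count;
}
```

The syscall count here is fixed (open, fstat, mmap, close, munmap) regardless of file size; the remaining kernel involvement comes from page faults as the loop touches each page for the first time.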

3. io_uring (Linux 5.1+) — Batched Async Syscalls

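Traditional:
  read(fd1, b1, n);     // 100 ns syscall
  read(fd2, b2, n);     // 100 ns syscall
  read(fd3, b3, n);     // 100 ns syscall
  → 300 ns of entry/exit overhead, 3 user/kernel transitions

io_uring:
  Queue 3 reads on the submission queue (shared-memory write, no syscall)
  io_uring_enter() once (single syscall)
  Kernel processes the 3 requests asynchronously → results appear on the completion queue
  → 100 ns of entry/exit overhead, 1 user/kernel transition

→ Big wins for disk/network-heavy workloads. In SQPOLL mode a kernel thread polls the submission queue, so steady-state submission needs no syscall at all.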

Modern Measurement — eBPF + bpftrace

# Latency histogram per syscall number, system-wide
sudo bpftrace -e '
  tracepoint:raw_syscalls:sys_enter { @start[tid] = nsecs; }
  tracepoint:raw_syscalls:sys_exit /@start[tid]/ {
    @ns[args->id] = hist(nsecs - @start[tid]);
    delete(@start[tid]);
  }'

→ Live production measurement without kernel changes. Histograms are keyed by syscall number (e.g. 1 = write on x86-64) and include time spent blocked inside the kernel.

Per-OS Syscall ABI

OS            Instruction       Arg passing                 Notes
------------  ----------------  --------------------------  -----------------------------------------
Linux x86-64  syscall           rdi, rsi, rdx, r10, r8, r9  Stable ABI
macOS x86-64  syscall           Same                        But going around libSystem is discouraged
Windows       syscall (varies)  Only via ntdll is safe      Kernel ABI is unofficial
BSD           syscall           Similar to Linux

WebAssembly + WASI

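WebAssembly modules cannot issue syscalls themselves; file and network I/O must be provided by the host. WASI standardizes that host interface as a set of sandboxed, syscall-like functions. Wasmtime, Cloudflare Workers, and other runtimes implement WASI.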

Common Pitfalls

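  • Syscall in a tight loop: 1M calls of even the simplest syscall ≈ 100 ms. Profile and you may find a surprising hot spot.
  • fork() cost: copies the parent's page tables. Even with CoW, the copy grows with the amount of mapped memory (roughly 2 MB of page tables per GB mapped with 4 KiB pages), so forking a very large process is slow.
  • Limited functions in signal handlers: only async-signal-safe functions may be called (write is, printf is not).
  • strace overhead: strace is built on ptrace and stops the traced process at every syscall entry and exit, which can slow syscall-heavy programs dramatically. In production use perf / eBPF.
  • Missing EINTR handling: read can return -1 with errno set to EINTR if a signal arrives. Retry loop required.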
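
The EINTR retry loop from the last item can be sketched as follows. read_full is a hypothetical helper name; the same pattern applies to write() and most other blocking calls.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>   /* read, ssize_t */

/* Read up to `len` bytes, retrying when a signal interrupts the call and
 * continuing after short reads. Returns the number of bytes read (less
 * than `len` only at end of file), or -1 on a real error. */
ssize_t read_full(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: just retry */
            return -1;           /* genuine error, errno is set */
        }
        if (n == 0)
            break;               /* end of file */
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```

Installing signal handlers with SA_RESTART makes the kernel restart many interrupted calls automatically, but not all of them, so an explicit loop like this remains the safer default.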

Wrap-up

Syscalls are the main contract between user space and the kernel. At roughly 100 ns each they add up, and in I/O-heavy programs a meaningful share of runtime goes there. Batching, mmap, vDSO, and io_uring are all ways to make fewer syscalls.

Where to start performance analysis: strace -c. Look at the syscall counts and types; if they are excessive, consider batching.
