"Containers are lightweight VMs" is the common pitch — and it's wrong. VMs ship their own kernel; containers share the host's. So how does ps in one container hide processes from another, and how does rm -rf / in a container leave the host intact? Linux namespaces + cgroups. This guide walks through what's actually happening when you docker run.
A container is just a process with extras
# On the host
$ docker run -d nginx
$ ps aux | grep nginx
root 12345 /usr/sbin/nginx ← actually a host process
host PID = 12345
# Inside the container
$ docker exec -it <id> ps aux
root 1 /usr/sbin/nginx ← same process, container sees PID 1
One process, two PIDs — 12345 on the host, 1 inside. The trick is the PID namespace.
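The kernel reports this double identity itself. On Linux 4.1 and later, /proc/<pid>/status has an NSpid line that lists the process's PID in each nested PID namespace, outermost first. Below is a minimal Python sketch that parses that line. The helper name parse_nspid and the sample text are illustrative assumptions, not output captured from a real host.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line of /proc/<pid>/status, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # kernels older than 4.1 do not report NSpid


# Hypothetical status excerpt for the nginx process above, as the host would see it.
sample = "Name:\tnginx\nPid:\t12345\nNSpid:\t12345\t1\n"
print(parse_nspid(sample))  # [12345, 1]
```

On a real Linux host, feeding it the contents of /proc/12345/status for a containerized process should give the host PID followed by the in-container PID.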
Namespaces — eight dimensions of isolation
The Linux kernel maintains a separate view of certain resources per namespace:
| Namespace | Isolates | Effect |
|---|---|---|
| PID | process tree | ps inside only shows the container's own processes |
| NET | network stack (interfaces, routes, iptables) | each container has its own lo, eth0 |
| MNT | mount points | each container sees its own / |
| UTS | hostname + domain | each container has its own hostname |
| IPC | SysV / POSIX IPC | shared memory / semaphores isolated |
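| USER | UID / GID maps | container root can map to an unprivileged UID on the host (when enabled) |
| CGROUP | cgroup view | container sees its own cgroup as the root of the hierarchy (added 2016, Linux 4.6) |
| TIME | boot and monotonic clocks | per-namespace clock offsets; wall-clock time stays shared (added 2020, Linux 5.6) |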
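Each namespace a process belongs to shows up as a symlink under /proc/<pid>/ns, with a target like pid:[4026531836]. Two processes share a namespace exactly when those inode numbers match. The sketch below reads the links in Python. The function names are hypothetical, and it assumes a Linux host with /proc mounted.

```python
import os
import re


def parse_ns_link(link):
    """Split a /proc/<pid>/ns symlink target such as 'pid:[4026531836]' into (type, inode)."""
    match = re.fullmatch(r"(\w+):\[(\d+)\]", link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Map each entry under /proc/<pid>/ns to its namespace inode number (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result
```

Comparing namespaces_of() with namespaces_of(some_container_pid) on the host would show different inodes for pid, net, mnt and the rest. Reading another user's process usually needs root.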
Linux experiment:
$ sudo unshare --pid --fork --mount-proc bash
# New PID namespace. Bash becomes PID 1.
$ ps aux
USER PID ...
root 1 bash
root 2 ps
# Host's other processes are invisible.
cgroups — resource limits
Namespaces control what a process can see; cgroups (control groups) control how much it can use:
- CPU — cap usage (--cpus=2)
- Memory — RAM cap + OOM behavior (--memory=512m)
- Block I/O — disk read/write throttle
- Network I/O — no bandwidth cap of its own; cgroup v1's net_cls tags a container's packets so tc can shape them
- PIDs — cap the number of processes (--pids-limit), which stops fork bombs
Implemented as files under /sys/fs/cgroup/:
$ cat /sys/fs/cgroup/docker/abc123.../memory.max
536870912 ← 512 MB
$ cat /sys/fs/cgroup/docker/abc123.../cpu.max
200000 100000 ← 200ms of CPU per 100ms window (2 cores)
docker run flags like --memory and --cpus end up writing to these files.
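The two values in the listing above follow from simple arithmetic, sketched here in Python. This mirrors the calculation only and is not Docker's own parsing code. The function names are made up for illustration, and the 100 ms period is assumed to be the default.

```python
import re

CPU_PERIOD_US = 100_000  # assumed default CFS period: 100 ms in microseconds

_UNITS = {"": 1, "b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}


def cpus_to_cpu_max(cpus, period_us=CPU_PERIOD_US):
    """Render a --cpus value as the '<quota> <period>' string stored in cpu.max."""
    quota_us = round(cpus * period_us)
    return f"{quota_us} {period_us}"


def memory_to_bytes(value):
    """Convert a --memory value such as '512m' into the byte count stored in memory.max."""
    match = re.fullmatch(r"(\d+)([bkmg]?)", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory value: {value!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]


print(cpus_to_cpu_max(2))       # 200000 100000
print(memory_to_bytes("512m"))  # 536870912
```

The units are binary, so 512m means 512 × 1024 × 1024 bytes, which is why memory.max reads 536870912 rather than 512000000.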
Union filesystem — why images are small
A Docker image is a stack of layers:
Image: my-app:v1
├── Layer 4: COPY ./app /app (5 MB)
├── Layer 3: RUN npm install (200 MB)
├── Layer 2: COPY package.json /app (1 KB)
├── Layer 1: WORKDIR /app (0 bytes)
└── Layer 0: FROM node:20 (200 MB, base image)
Total: 405 MB
Add my-app:v2 (only the app code changed):
├── NEW Layer 4' (5 MB)
├── Layer 3 (shared) (reuses, 0 bytes)
├── Layer 2 (shared) (reuses)
├── Layer 1 (shared) (reuses)
└── Layer 0 (shared) (reuses)
Disk increase: just 5 MB
Layers are read-only. Containers add a writable layer on top with copy-on-write semantics:
Container starts:
┌─────────────────────────┐
│ Writable layer (RW) │ ← container's changes live here
├─────────────────────────┤
│ Image Layer 4 (RO) │
├─────────────────────────┤
│ Image Layer 3 (RO) │
├─────────────────────────┤
│ ... (RO) │
└─────────────────────────┘
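When the container modifies a file that lives in an image layer:
- The file is copied from the RO layer into the writable layer
- The copy is modified; the original in the image layer stays untouched
- Reads search top-down and return the first match
Modern Docker uses overlayfs, Linux's union filesystem, through the overlay2 storage driver, with layer data stored under /var/lib/docker/overlay2/.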
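The copy-on-write behaviour described above can be modelled in a few lines of Python. This is a toy, not how overlayfs is implemented: layers are plain dictionaries, the class name OverlaySketch is invented, and deletions (which real overlayfs records as whiteout entries) are left out.

```python
class OverlaySketch:
    """Toy union filesystem: read-only image layers plus one writable upper layer."""

    def __init__(self, image_layers):
        # image_layers is ordered bottom to top; each maps path -> file content.
        self.lower = [dict(layer) for layer in image_layers]
        self.upper = {}  # the container's writable layer

    def read(self, path):
        # Search top-down: writable layer first, then image layers from the top.
        for layer in [self.upper, *reversed(self.lower)]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        # Copy-up: the first write copies the file into the writable layer.
        if path not in self.upper:
            self.upper[path] = self.read(path)
        self.upper[path] += data


container = OverlaySketch([{"/etc/motd": "hello\n"}, {"/app/server.js": "listen(80)\n"}])
container.append("/etc/motd", "patched\n")
print(repr(container.read("/etc/motd")))      # 'hello\npatched\n' (from the writable layer)
print(repr(container.lower[0]["/etc/motd"]))  # 'hello\n' (image layer unchanged)
```

Because only the upper dictionary ever changes, any number of containers could share the same lower layers, which is the property that keeps images cheap to reuse.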
Networking — four common modes
- bridge (default) — Docker creates a virtual bridge (docker0). Each container has a veth pair. NAT for outbound.
- host — uses the host network namespace directly. No isolation, fastest.
- none — network isolated, no interfaces. Most secure.
- overlay — multi-host networking for Swarm clusters, built on VXLAN tunnels. Kubernetes solves the same problem with CNI plugins rather than this driver.
bridge mode flow:
container1 (172.17.0.2) ── veth1 ── docker0 (172.17.0.1) ── NAT ── eth0 ── Internet
container2 (172.17.0.3) ── veth2 ─┘
Inter-container traffic — via docker0 directly
What the Docker daemon actually does
user → docker CLI → dockerd (daemon, REST API)
↓
containerd (high-level runtime)
↓
runc (low-level runtime, OCI spec)
↓
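Linux kernel (namespaces + cgroups + overlayfs)
The actual container creation happens in runc, the OCI reference runtime. runc creates the namespaces, sets up the cgroups, switches the root filesystem (pivot_root by default rather than a plain chroot), and exec()s the container's entrypoint. Docker is a UX layer on top.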
Alternatives:
- Podman — daemon-less, rootless. Same OCI images.
- containerd — Kubernetes's default runtime. Lighter than Docker.
- CRI-O — Red Hat. Kubernetes-only.
Dockerfile build steps
FROM node:20
WORKDIR /app
COPY package.json .
RUN npm install
COPY . .
CMD ["node", "server.js"]
Each instruction is a build step; the ones that change the filesystem add a layer, and the rest only record metadata:
1. Pull node:20 base
2. Create /app (0 bytes)
3. Copy package.json
4. Run npm install (deps)
5. Copy the rest of the code
6. Metadata (CMD)
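These steps form a chain, and that chain is what makes build caching work: a step can be reused only if it and every step before it are unchanged. The Python model below is a deliberate simplification (real builders such as BuildKit compute their cache keys differently). The input_digest strings stand in for hashes of the copied files and are invented for the example.

```python
import hashlib


def layer_keys(steps):
    """Return one cache key per build step.

    steps is a list of (instruction, input_digest) pairs; input_digest stands in
    for the content of any files the instruction copies ('' if it copies none).
    Each key also hashes the parent key, so a change invalidates every later step.
    """
    keys, parent = [], ""
    for instruction, input_digest in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{input_digest}".encode()).hexdigest()
        keys.append(parent)
    return keys


def cache_hits(old_steps, new_steps):
    """For each step of the new build, report whether its key matches the old build's."""
    old, new = layer_keys(old_steps), layer_keys(new_steps)
    return [i < len(old) and old[i] == key for i, key in enumerate(new)]


v1 = [("FROM node:20", ""), ("WORKDIR /app", ""), ("COPY package.json .", "pkg-v1"),
      ("RUN npm install", ""), ("COPY . .", "src-v1"), ('CMD ["node", "server.js"]', "")]
# Only application code changed: the last COPY gets a different input digest.
v2 = v1[:4] + [("COPY . .", "src-v2")] + v1[5:]
print(cache_hits(v1, v2))  # [True, True, True, True, False, False]
```

Moving COPY . . above RUN npm install in this model would turn the npm install step into a miss on every code change, since its parent key would change each time.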
Cache usage:
- A step is reused only if it and every step before it are unchanged
- If package.json hasn't changed, RUN npm install is a cache hit
- That's why copying package.json before RUN npm install is best practice
Multi-stage builds — slim final images
# Stage 1 — builder
FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm install
# Produces dist/
RUN npm run build
# Drop devDependencies so only runtime deps are copied into the final stage
RUN npm prune --omit=dev
# Stage 2 — final runtime
# Alpine is tiny (60 MB vs 1 GB)
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist /app/dist
COPY --from=builder /app/node_modules /app/node_modules
CMD ["node", "dist/server.js"]
# Result — no dev deps or source code in the final image
Compared to VMs
| | VM (VirtualBox, VMware) | Container (Docker) |
|---|---|---|
| Kernel | Own (guest OS) | Shared with host |
| Size | GB | MB |
| Boot | Minutes | Seconds (often ms) |
| Isolation | Very strong (hypervisor) | Kernel-namespace level |
| OS choice | Any (Linux on Mac etc.) | Same kernel ABI as host |
| Overhead | 10-20% | ~1-2% |
Docker Desktop on Mac / Windows actually runs a hidden Linux VM (Apple's Virtualization framework or the older HyperKit on macOS; WSL2 or Hyper-V on Windows). Containers on macOS still need a Linux kernel.
Container isolation isn't VM isolation
Risks:
- Kernel exploits — a host kernel bug affects every container. VMs are shielded by the hypervisor.
- Root in container ≈ root on host — without USER namespaces, container root maps to host root. Breakout risk.
- Shared resources — /dev, parts of /proc. Wrong mounts leak host info.
Mitigations:
- --user for non-root
- Rootless Docker / Podman
- gVisor (Google) — application kernel for extra isolation
- Kata Containers — micro-VMs with container UX
Common pitfalls
1. PID 1 responsibilities
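PID 1 in Linux is special: it adopts orphaned processes and must reap them, and the kernel delivers a signal to it only if it has installed a handler for that signal, so an unhandled SIGTERM is ignored rather than fatal. A shell wrapper such as CMD ["bash", "-c", "node server.js"] can leave bash as PID 1, and bash does not forward signals to node, so docker stop waits out its grace period and then sends SIGKILL. Use tini (docker run --init) or exec the process directly.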
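To make the signal-forwarding half of that job concrete, here is a rough Python sketch of what an init shim does. It is written in the spirit of tini but is not tini's implementation: the name run_as_init is invented, and it does not reap adopted orphans, which a real init must also do.

```python
import signal
import subprocess
import sys


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, and return its exit code."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Pass the signal on to the real workload instead of swallowing it.
        child.send_signal(signum)

    # Install the forwarding handlers, remembering what was there before.
    previous = {sig: signal.signal(sig, forward) for sig in (signal.SIGTERM, signal.SIGINT)}
    try:
        return child.wait()  # also reaps the child, so it never lingers as a zombie
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)


# The child's exit status is passed through unchanged.
print(run_as_init([sys.executable, "-c", "raise SystemExit(3)"]))  # 3
```

The design point is that the shim owns a handler for each signal it cares about, so the kernel's special treatment of PID 1 no longer causes SIGTERM to be dropped.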
2. Layer explosion
# Bad — each RUN is a new layer
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y vim
RUN rm -rf /var/lib/apt/lists/*
# Result — 4 layers, and intermediate cache stays in the image
# Good — chained
RUN apt-get update && \
apt-get install -y curl vim && \
rm -rf /var/lib/apt/lists/*
# One layer, no leftovers
3. Missing .dockerignore
COPY . . pulls in .git, node_modules, .env, etc. Use .dockerignore:
# .dockerignore
node_modules
.git
.env
*.log
.DS_Store
4. Mounting host paths and UID mismatches
# host UID 1000; the alpine image runs as root (UID 0) by default
docker run -v "$(pwd)":/data alpine touch /data/file
↓
file owned by UID 0 (root)
host user 1000 can't modify or delete it: permission issue
Fix:
docker run -u $(id -u):$(id -g) -v "$(pwd)":/data ...
5. Shipping dev images to production
FROM node:20 = 1 GB. FROM node:20-alpine = 60 MB. Production should use distroless or Alpine bases. Even if dev uses bulky images, the final stage should be slim.
Summary
- Containers are normal processes — host kernel shared, plus Linux namespaces and cgroups around them.
- Eight namespaces (PID / NET / MNT / UTS / IPC / USER / CGROUP / TIME) provide view isolation.
- cgroups cap CPU / memory / I/O / PID. Controlled via files under /sys/fs/cgroup/.
- An image is a stack of read-only layers; containers add a writable layer with copy-on-write via overlayfs.
- Docker daemon = UX over runc + containerd. Alternatives: Podman, containerd, CRI-O.
- Multi-stage builds + .dockerignore shrink images dramatically.
- Containers aren't VMs — same kernel = light but less isolated. Use gVisor / Kata for stricter boundaries.