How Docker Containers Actually Work

"Containers are lightweight VMs" is the common pitch — and it's wrong. VMs ship their own kernel; containers share the host's. So how does ps in one container hide processes from another, and how does rm -rf / in a container leave the host intact? Linux namespaces + cgroups. This guide walks through what's actually happening when you docker run.

A container is just a process with extras

# On the host
$ docker run -d nginx
$ ps aux | grep nginx
root  12345  /usr/sbin/nginx     ← actually a host process
                                   host PID = 12345
# Inside the container (needs ps in the image; the stock nginx image may not ship it)
$ docker exec -it <id> ps aux
root  1      /usr/sbin/nginx     ← same process, container sees PID 1

One process, two PIDs — 12345 on the host, 1 inside. The trick is the PID namespace.
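On reasonably recent kernels the two PIDs can be read from one place: the NSpid line of /proc/<pid>/status lists the process's PID in each nested PID namespace, outermost first. The sketch below parses that line; the status excerpt is a made-up sample matching the example above, not captured output.

```python
def parse_nspid(status_text: str) -> list[int]:
    """Return the PIDs from the NSpid line of /proc/<pid>/status.

    The kernel lists one PID per nested PID namespace, outermost first.
    """
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []


# Hypothetical excerpt of /proc/12345/status as read on the host.
sample_status = "Name:\tnginx\nPid:\t12345\nNSpid:\t12345\t1\n"

host_pid, container_pid = parse_nspid(sample_status)
print(f"host PID = {host_pid}, PID inside the container = {container_pid}")
```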

Namespaces — eight dimensions of isolation

The Linux kernel maintains a separate view of certain resources per namespace:

Namespace	Isolates	Effect
PID	process tree	ps inside only shows the container's own processes
NET	network stack (interfaces, routes, iptables)	each container has its own lo, eth0
MNT	mount points	each container sees its own /
UTS	hostname + domain	each container has its own hostname
IPC	SysV / POSIX IPC	shared memory / semaphores isolated
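USER	UID / GID maps	container root can map to a non-root UID on the host (not enabled by default in Docker)
CGROUP	cgroup hierarchy view	container sees its own cgroup as the root (added 2016, Linux 4.6)
TIME	monotonic and boot-time clocks	per-namespace clock offsets; wall-clock time stays shared (added 2020, Linux 5.6)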
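Each process's namespace membership is visible as symlinks under /proc/<pid>/ns, with targets of the form type:[inode]. Two processes that show the same inode for a type share that namespace. A minimal sketch for reading them, assuming a Linux host; the helper names are illustrative:

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a link target such as 'pid:[4026531836]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map each entry in /proc/<pid>/ns to its namespace inode (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        _ns_type, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for name, inode in namespaces_of().items():
        print(f"{name:20} {inode}")
```

Running namespaces_of() on the host and again inside a container should show different inodes for every namespace type the container does not share with the host.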

Linux experiment:

$ sudo unshare --pid --fork --mount-proc bash
# New PID namespace (creating one needs root, hence sudo). Bash becomes PID 1.

$ ps aux
USER  PID  ...
root    1  bash
root    2  ps
# Host's other processes are invisible.

cgroups — resource limits

While namespaces isolate "view," cgroups (control groups) isolate "resource allocation":

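CPU: cap usage (--cpus=2)
Memory: RAM cap + OOM behavior (--memory=512m)
Block I/O: disk read/write throttle (--device-read-bps, --device-write-bps)
Network I/O: cgroups have no bandwidth limiter of their own; traffic is classified per cgroup and shaped with tc
PIDs: cap the process count (--pids-limit), which contains fork bombs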

Implemented as files under /sys/fs/cgroup/ (cgroup v2 file names shown; the directory layout depends on the cgroup driver):

$ cat /sys/fs/cgroup/docker/abc123.../memory.max
536870912    ← 512 MB

$ cat /sys/fs/cgroup/docker/abc123.../cpu.max
200000 100000  ← 200ms of CPU per 100ms window (2 cores)

Docker run flags like --memory and --cpus end up writing to these files.
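The file contents are easy to decode by hand. As a small sketch, assuming the cgroup v2 formats shown above (cpu.max holds a quota and a period in microseconds, memory.max holds bytes, and either may read max for "unlimited"):

```python
from typing import Optional


def cpus_from_cpu_max(text: str) -> Optional[float]:
    """Convert the contents of cpu.max ('<quota> <period>') to a core count.

    Both numbers are in microseconds; a quota of 'max' means unlimited.
    """
    quota, period = text.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


def mib_from_memory_max(text: str) -> Optional[float]:
    """Convert the contents of memory.max (bytes, or 'max') to MiB."""
    value = text.strip()
    if value == "max":
        return None
    return int(value) / (1024 * 1024)


print(cpus_from_cpu_max("200000 100000"))   # 2.0
print(mib_from_memory_max("536870912"))     # 512.0
print(cpus_from_cpu_max("max 100000"))      # None
```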

Union filesystem — why images are small

A Docker image is a stack of layers:

Image: my-app:v1
├── Layer 4: COPY ./app /app          (5 MB)
├── Layer 3: RUN npm install           (200 MB)
├── Layer 2: COPY package.json /app    (1 KB)
├── Layer 1: WORKDIR /app              (0 bytes)
└── Layer 0: FROM node:20              (200 MB, base image)

Total: 405 MB

Add my-app:v2 (only the app code changed):
├── NEW Layer 4'                       (5 MB)
├── Layer 3 (shared)                   (reuses, 0 bytes)
├── Layer 2 (shared)                   (reuses)
├── Layer 1 (shared)                   (reuses)
└── Layer 0 (shared)                   (reuses)

Disk increase: just 5 MB
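The arithmetic behind that 5 MB can be sketched as a set union: a layer that appears in several images is stored once. The layer names below are made up for readability (real layers are identified by content digests), and the sizes are the illustrative ones from the trees above.

```python
# Layer name -> size in MB, using the illustrative sizes from the trees above.
# Real layers are identified by content digests; these names are made up.
V1_LAYERS = {"node:20": 200, "workdir": 0, "package.json": 0.001,
             "npm-install": 200, "app-v1": 5}
V2_LAYERS = {"node:20": 200, "workdir": 0, "package.json": 0.001,
             "npm-install": 200, "app-v2": 5}


def disk_usage(*images: dict[str, float]) -> float:
    """Total size when each distinct layer is stored only once."""
    unique: dict[str, float] = {}
    for image in images:
        unique.update(image)
    return sum(unique.values())


v1_only = disk_usage(V1_LAYERS)
both = disk_usage(V1_LAYERS, V2_LAYERS)
print(f"v1 alone: {v1_only:.0f} MB")
print(f"v1 + v2:  {both:.0f} MB (extra {both - v1_only:.0f} MB)")
```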

Layers are read-only. Containers add a writable layer on top with copy-on-write semantics:

Container starts:
┌─────────────────────────┐
│ Writable layer (RW)     │  ← container's changes live here
├─────────────────────────┤
│ Image Layer 4 (RO)      │
├─────────────────────────┤
│ Image Layer 3 (RO)      │
├─────────────────────────┤
│ ... (RO)                │
└─────────────────────────┘

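When the container modifies a file that lives in an image layer:
- The file is first copied up from the RO layer into the writable layer
- The write is applied to that copy; the RO layer is never touched

Reads search the layers top-down and return the first match, so the modified copy shadows the original.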
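The lookup and copy-up rules can be modeled in a few lines. This is a toy model with dictionaries standing in for layers, not how overlayfs is implemented; it also leaves out deletion, which the real filesystem handles with whiteout entries in the writable layer.

```python
class OverlayView:
    """Toy model of a union mount: read-only layers plus one writable layer."""

    def __init__(self, image_layers: list[dict[str, str]]):
        # image_layers[0] is the bottom layer; each maps path -> file content.
        self.image_layers = image_layers
        self.upper: dict[str, str] = {}

    def read(self, path: str) -> str:
        # Search top-down: writable layer first, then image layers from the top.
        for layer in [self.upper, *reversed(self.image_layers)]:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path: str, text: str) -> None:
        # Copy-up: the first modification copies the file into the writable layer.
        if path not in self.upper:
            self.upper[path] = self.read(path)
        self.upper[path] += text


base = {"/etc/motd": "hello\n"}
app = {"/app/server.js": "console.log('v1')\n"}
view = OverlayView([base, app])

view.append("/etc/motd", "edited in container\n")
print(repr(view.read("/etc/motd")))   # the container sees the modified copy
print(repr(base["/etc/motd"]))        # the image layer is unchanged
```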

Modern Docker uses the overlay2 storage driver, built on OverlayFS (Linux's union filesystem), with layer data stored under /var/lib/docker/overlay2/.

Networking — four common modes

bridge (default): Docker creates a virtual bridge (docker0). Each container gets a veth pair, and outbound traffic is NATed.
host: uses the host network namespace directly. No network isolation and no NAT overhead.
none: the container gets only a loopback interface. Most isolated.
overlay: multi-host networking for Swarm, built on VXLAN tunnels. Kubernetes uses CNI plugins for the same job.

bridge mode flow:
container1 (172.17.0.2) ── veth1 ── docker0 (172.17.0.1) ── NAT ── eth0 ── Internet
container2 (172.17.0.3) ── veth2 ─┘

Inter-container traffic — via docker0 directly
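The addresses in that flow follow from the bridge subnet. The sketch below uses the 172.17.0.0/16 range shown above and assumes addresses are handed out in order, which is a simplification of Docker's address management; it only illustrates which destinations are on-link and which must leave through the gateway.

```python
import ipaddress
from itertools import islice

# docker0 subnet as shown in the flow above.
bridge_subnet = ipaddress.ip_network("172.17.0.0/16")

hosts = iter(bridge_subnet.hosts())
gateway = next(hosts)                      # first usable address: docker0
containers = list(islice(hosts, 2))        # the next two addresses

print("docker0 gateway:", gateway)
for i, addr in enumerate(containers, start=1):
    print(f"container{i}: {addr} (on-link via docker0: {addr in bridge_subnet})")

# A destination outside the subnet leaves through the gateway and is NATed.
external = ipaddress.ip_address("93.184.216.34")
print(f"{external} on-link: {external in bridge_subnet}")
```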

What the Docker daemon actually does

user → docker CLI → dockerd (daemon, REST API)
                          ↓
                       containerd (high-level runtime)
                          ↓
                       runc (low-level runtime, OCI spec)
                          ↓
                       Linux kernel (namespaces + cgroups + overlayfs)

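The actual container creation happens in runc, the OCI reference runtime. runc creates the namespaces, sets up cgroups, switches the root filesystem (pivot_root by default rather than chroot), and exec()s the container's process. Docker is a UX layer on top.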
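What runc receives is a bundle: a root filesystem plus a config.json that names the process to run and the namespaces to create. The sketch below builds a pared-down dictionary in that shape. It is not a complete or validated config; the helper name, the rootfs path, and the choice of namespaces are assumptions for illustration.

```python
import json

# The eight Linux namespace types, spelled as the OCI runtime spec names them.
NAMESPACE_TYPES = ["pid", "network", "mount", "ipc", "uts", "user", "cgroup", "time"]


def minimal_oci_config(args: list[str], rootfs: str, namespaces: list[str]) -> dict:
    """Build a pared-down config.json-style dict (not a complete, valid spec)."""
    unknown = set(namespaces) - set(NAMESPACE_TYPES)
    if unknown:
        raise ValueError(f"unknown namespace types: {sorted(unknown)}")
    return {
        "ociVersion": "1.1.0",
        "process": {"args": args, "cwd": "/"},
        "root": {"path": rootfs, "readonly": False},
        "linux": {"namespaces": [{"type": ns} for ns in namespaces]},
    }


config = minimal_oci_config(
    args=["nginx", "-g", "daemon off;"],
    rootfs="rootfs",  # hypothetical bundle-relative path
    namespaces=["pid", "network", "mount", "ipc", "uts"],
)
print(json.dumps(config, indent=2))
```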

Alternatives:

Podman: daemon-less, rootless. Same OCI images.
containerd: usable without Docker, and the most common Kubernetes runtime. Lighter than Docker.
CRI-O: Red Hat. Kubernetes-only.

Dockerfile build steps

FROM node:20
WORKDIR /app
COPY package.json .
RUN npm install
COPY . .
CMD ["node", "server.js"]

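Each instruction is a build step. RUN, COPY, and ADD add filesystem layers; CMD and similar instructions only record metadata: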
1. Pull node:20 base
2. Create /app (0 bytes)
3. Copy package.json
4. Run npm install (deps)
5. Copy the rest of the code
6. Metadata (CMD)

Cache usage:
- If package.json hasn't changed, RUN npm install is a cache hit
- Once one step misses the cache, every step after it is rebuilt
- That's why copying package.json before RUN npm install is best practice
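The chaining behavior can be illustrated with a toy model in which each step's cache key is a hash of its parent's key, the instruction text, and a digest of any files it copies. Docker's real cache keys are computed differently; the function names and the key format here are assumptions, and only the dependency structure is the point.

```python
import hashlib


def digest(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()


def build_steps(package_json: str, app_code: str) -> list[tuple[str, str]]:
    """Each step is (instruction, digest of the files it copies, if any)."""
    return [
        ("FROM node:20", ""),
        ("WORKDIR /app", ""),
        ("COPY package.json .", digest(package_json)),
        ("RUN npm install", ""),
        ("COPY . .", digest(package_json + app_code)),
        ('CMD ["node", "server.js"]', ""),
    ]


def cache_keys(steps: list[tuple[str, str]]) -> list[str]:
    """One key per step; each key depends on the key of the step before it."""
    keys = []
    parent = ""
    for instruction, inputs_digest in steps:
        parent = digest(f"{parent}|{instruction}|{inputs_digest}")[:12]
        keys.append(parent)
    return keys


steps = build_steps('{"deps": 1}', "v1")
before = cache_keys(steps)
after = cache_keys(build_steps('{"deps": 1}', "v2"))   # only app code changed

for (instruction, _), old, new in zip(steps, before, after):
    print(f"{'HIT ' if old == new else 'MISS'} {instruction}")
```

With only the app code changed, the first four steps keep their keys and the final COPY and everything after it miss. Changing package.json instead would move the first miss up to COPY package.json, taking RUN npm install with it.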

Multi-stage builds — slim final images

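# Stage 1: builder
FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm install
# Produces dist/
RUN npm run build
# Drop devDependencies so the node_modules copied below is production-only
RUN npm prune --omit=dev

# Stage 2: final runtime
# The Alpine variant is much smaller (60 MB vs 1 GB).
# Caveat: native addons compiled against glibc in the builder may not load on Alpine's musl.
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist /app/dist
COPY --from=builder /app/node_modules /app/node_modules
CMD ["node", "dist/server.js"]

# Result: no dev deps or source code in the final image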

Compared to VMs

	VM (VirtualBox, VMware)	Container (Docker)
Kernel	Own (guest OS)	Shared with host
Size	GB	MB
Boot	Minutes	Seconds (often ms)
Isolation	Very strong (hypervisor)	Kernel-namespace level
OS choice	Any (Linux on Mac etc.)	Same kernel ABI as host
Overhead	10-20%	~1-2%

Docker Desktop on Mac / Windows actually runs a hidden Linux VM (HyperKit or Apple's Virtualization framework on macOS, WSL2 or Hyper-V on Windows). Containers on macOS still need a Linux kernel.

Container isolation isn't VM isolation

Risks:

Kernel exploits — a host kernel bug affects every container. VMs are shielded by the hypervisor.
Root in container ≈ root on host — without USER namespaces, container root maps to host root. Breakout risk.
Shared resources — /dev, parts of /proc. Wrong mounts leak host info.

Mitigations:

--user for non-root
Rootless Docker / Podman
gVisor (Google) — application kernel for extra isolation
Kata Containers — micro-VMs with container UX

Common pitfalls

1. PID 1 responsibilities

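PID 1 in Linux is special: it is expected to reap orphaned child processes, and the kernel does not apply default signal actions to it, so a SIGTERM is ignored unless the process installs a handler. A shell wrapper such as CMD ["bash", "-c", "node server.js"] can leave bash as PID 1, and bash doesn't forward signals to node. Use tini (docker run --init) or exec the process directly with CMD ["node", "server.js"].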
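To make those two duties concrete, here is a minimal sketch of what an init wrapper does: start the real process, relay termination signals to it, and wait for it so it is reaped. It is far simpler than tini (it only reaps its direct child, not adopted orphans), and run_as_init is a name invented for this example.

```python
import signal
import subprocess
import sys


def run_as_init(argv: list[str]) -> int:
    """Run argv as a child, forward SIGTERM/SIGINT to it, return its exit code."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        # Relay the signal instead of letting it stop at PID 1.
        child.send_signal(signum)

    previous = {
        signum: signal.signal(signum, forward)
        for signum in (signal.SIGTERM, signal.SIGINT)
    }
    try:
        # wait() also reaps the child so it does not linger as a zombie.
        return child.wait()
    finally:
        # Put the original handlers back once the child is gone.
        for signum, handler in previous.items():
            signal.signal(signum, handler if handler is not None else signal.SIG_DFL)


exit_code = run_as_init([sys.executable, "-c", "import sys; sys.exit(7)"])
print("child exited with", exit_code)
```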

2. Layer explosion

# Bad — each RUN is a new layer
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y vim
RUN rm -rf /var/lib/apt/lists/*

# Result — 4 layers, and intermediate cache stays in the image

# Good — chained
RUN apt-get update && \
    apt-get install -y curl vim && \
    rm -rf /var/lib/apt/lists/*

# One layer, no leftovers

3. Missing .dockerignore

COPY . . pulls in .git, node_modules, .env, etc. Use .dockerignore:

# .dockerignore
node_modules
.git
.env
*.log
.DS_Store
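A rough way to reason about which paths those patterns exclude is sketched below. It is an approximation only: it matches segment by segment from the build context root and treats everything under a matched directory as excluded, but it does not implement the ** wildcard or ! exceptions that real .dockerignore files support. The function names are illustrative.

```python
from fnmatch import fnmatchcase

# The patterns from the .dockerignore above.
IGNORE_PATTERNS = ["node_modules", ".git", ".env", "*.log", ".DS_Store"]


def _matches(path: str, pattern: str) -> bool:
    """Match segment by segment so that '*' never crosses a '/'."""
    path_parts, pattern_parts = path.split("/"), pattern.split("/")
    return len(path_parts) == len(pattern_parts) and all(
        fnmatchcase(part, pat) for part, pat in zip(path_parts, pattern_parts)
    )


def is_ignored(path: str, patterns: list[str] = IGNORE_PATTERNS) -> bool:
    """True if the path, or any directory above it, matches a pattern."""
    parts = path.split("/")
    prefixes = ["/".join(parts[: i + 1]) for i in range(len(parts))]
    return any(_matches(prefix, pattern) for prefix in prefixes for pattern in patterns)


for candidate in ["node_modules/express/index.js", ".env", "debug.log", "src/server.js"]:
    print(f"{candidate:32} ignored={is_ignored(candidate)}")
```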

4. Mounting host paths and UID mismatches

# host user is UID 1000; the container process runs as root (UID 0) by default
docker run -v "$(pwd)":/data alpine touch /data/file
                                      ↓
                                      file is owned by UID 0 (root)
                                      host user 1000 can't write to it: permission issue

Fix:
docker run -u "$(id -u):$(id -g)" -v "$(pwd)":/data ...
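When the command is launched from a script rather than typed by hand, the same fix can be assembled programmatically. The sketch below only builds and prints the argument list and does not invoke Docker; docker_run_as_me is a hypothetical helper, and --rm is added so the container is removed on exit.

```python
import os
import shlex


def docker_run_as_me(image: str, command: list[str], mount_to: str = "/data") -> list[str]:
    """Build a docker run argv that runs as the calling user's UID:GID."""
    return [
        "docker", "run", "--rm",
        "-u", f"{os.getuid()}:{os.getgid()}",      # match the host user
        "-v", f"{os.getcwd()}:{mount_to}",         # bind-mount the current directory
        image, *command,
    ]


argv = docker_run_as_me("alpine", ["touch", "/data/file"])
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would execute it, and files created under the mount would then be owned by the calling user.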

5. Shipping dev images to production

FROM node:20 = 1 GB. FROM node:20-alpine = 60 MB. Production should use distroless or Alpine bases. Even if dev uses bulky images, the final stage should be slim.

Summary

Containers are normal processes — host kernel shared, plus Linux namespaces and cgroups around them.
Eight namespaces (PID / NET / MNT / UTS / IPC / USER / CGROUP / TIME) provide view isolation.
cgroups cap CPU / memory / I/O / PID. Controlled via files under /sys/fs/cgroup/.
An image is a stack of read-only layers; containers add a writable layer with copy-on-write via overlayfs.
Docker daemon = UX over runc + containerd. Alternatives: Podman, containerd, CRI-O.
Multi-stage builds + .dockerignore shrink images dramatically.
Containers aren't VMs — same kernel = light but less isolated. Use gVisor / Kata for stricter boundaries.