How Docker Containers Actually Work

Containers aren't lightweight VMs; they're Linux processes wrapped in namespaces and cgroups. This guide walks through namespaces, cgroups, the overlay filesystem, image layers, and what the Docker daemon actually does.

"Containers are lightweight VMs" is the common pitch, and it's wrong. VMs ship their own kernel; containers share the host's. So how does ps in one container hide processes from another, and how does rm -rf / in a container leave the host intact? The answer is Linux namespaces, cgroups, and a layered filesystem. This guide walks through what's actually happening when you docker run.

A container is just a process with extras

# On the host
$ docker run -d nginx
$ ps aux | grep nginx
root  12345  /usr/sbin/nginx     ← actually a host process
                                   host PID = 12345
# Inside the container
$ docker exec -it <id> ps aux
root  1      /usr/sbin/nginx     ← same process, container sees PID 1

One process, two PIDs — 12345 on the host, 1 inside. The trick is the PID namespace.
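
The two PIDs can be read from the host side as well. On kernels 4.1 and later, /proc/<pid>/status has an NSpid line listing the process's PID in each nested PID namespace, outermost first. Below is a minimal sketch that parses that line; the sample text is a hypothetical excerpt, not output captured from a real system.

```python
def parse_nspid(status_text: str) -> list[int]:
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # older kernels do not expose NSpid


# Hypothetical excerpt of /proc/12345/status, as read on the host.
sample_status = "Name:\tnginx\nPid:\t12345\nNSpid:\t12345\t1\n"

host_pid, container_pid = parse_nspid(sample_status)
print(host_pid, container_pid)  # prints: 12345 1
```

On a real host you would pass in the contents of /proc/12345/status instead of the sample string.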

Namespaces — eight dimensions of isolation

The Linux kernel maintains a separate view of certain resources per namespace:

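| Namespace | Isolates | Effect |
|---|---|---|
| PID | process tree | ps inside only shows the container's own processes |
| NET | network stack (interfaces, routes, firewall rules) | each container has its own lo, eth0 |
| MNT | mount points | each container sees its own / |
| UTS | hostname + domain | each container has its own hostname |
| IPC | SysV / POSIX IPC | shared memory / semaphores isolated |
| USER | UID / GID maps | container root can map to an unprivileged UID on the host (not enabled by default in Docker) |
| CGROUP | cgroup hierarchy view | container sees its own cgroup as the root (added in Linux 4.6, 2016) |
| TIME | monotonic and boot-time clocks | per-namespace clock offsets; wall-clock time stays shared (added in Linux 5.6, 2020) |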

Try it on a Linux host (creating a PID namespace needs root):

$ sudo unshare --pid --fork --mount-proc bash
# New PID namespace. Bash becomes PID 1.

$ ps aux
USER  PID  ...
root    1  bash
root    2  ps
# Host's other processes are invisible.
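
To see which namespaces two processes share, compare the symlinks under /proc/<pid>/ns/. Each link reads like pid:[4026531836], and two processes are in the same namespace when the type and inode number both match. The sketch below compares two sets of link texts; the inode numbers are made up for illustration, and shared_namespaces is a hypothetical helper name.

```python
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link: str) -> tuple[str, int]:
    """Split a link such as 'pid:[4026531836]' into ('pid', 4026531836)."""
    match = NS_LINK.match(link)
    if match is None:
        raise ValueError(f"not a namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def shared_namespaces(a: dict[str, str], b: dict[str, str]) -> set[str]:
    """Names of the namespaces in which both processes point at the same inode."""
    return {
        name
        for name in a.keys() & b.keys()
        if parse_ns_link(a[name]) == parse_ns_link(b[name])
    }


# Illustrative values: a host shell versus the bash started by
# `unshare --pid --fork --mount-proc` (new PID and mount namespaces only).
host_shell = {
    "pid": "pid:[4026531836]",
    "mnt": "mnt:[4026531841]",
    "net": "net:[4026531840]",
    "uts": "uts:[4026531838]",
}
unshared_shell = {
    "pid": "pid:[4026532201]",
    "mnt": "mnt:[4026532199]",
    "net": "net:[4026531840]",
    "uts": "uts:[4026531838]",
}

print(sorted(shared_namespaces(host_shell, unshared_shell)))  # prints: ['net', 'uts']
```

On a live system the link text for one entry comes from os.readlink(f"/proc/{pid}/ns/pid"), which usually needs root for other users' processes.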

cgroups — resource limits

Namespaces control what a process can see; cgroups (control groups) control how much of each resource it can use:

  • CPU — cap usage (--cpus=2)
  • Memory — RAM cap + OOM behavior (--memory=512m)
  • Block I/O — disk read/write throttle
  • Network I/O — no direct bandwidth limit in cgroups; shaping is done with tc, optionally keyed on cgroup membership
  • PID — prevent fork bombs

Limits are plain files under /sys/fs/cgroup/ (cgroup v2 file names shown; the exact directory depends on the cgroup driver):

$ cat /sys/fs/cgroup/docker/abc123.../memory.max
536870912    ← 512 MB

$ cat /sys/fs/cgroup/docker/abc123.../cpu.max
200000 100000  ← 200ms of CPU per 100ms window (2 cores)

Docker run flags like --memory and --cpus end up writing to these files.
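
The mapping between the flags and the file contents is simple arithmetic. The sketch below converts in both directions under two assumptions: the period is the 100000 µs shown above, and memory suffixes are powers of 1024. The function names are hypothetical, not part of Docker.

```python
from typing import Optional


def cpu_max_to_cores(cpu_max: str) -> Optional[float]:
    """Interpret the contents of cpu.max ('<quota> <period>' in microseconds)."""
    quota, period = cpu_max.split()
    if quota == "max":
        return None  # no CPU limit set
    return int(quota) / int(period)


def cores_to_cpu_max(cores: float, period_us: int = 100_000) -> str:
    """Build the cpu.max line that corresponds to --cpus=<cores>."""
    return f"{round(cores * period_us)} {period_us}"


def memory_flag_to_bytes(value: str) -> int:
    """Convert a --memory value such as '512m' into the byte count in memory.max."""
    units = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
    suffix = value[-1].lower()
    if suffix in units:
        return int(value[:-1]) * units[suffix]
    return int(value)  # a bare number is already in bytes


print(cores_to_cpu_max(2))                # prints: 200000 100000
print(cpu_max_to_cores("200000 100000"))  # prints: 2.0
print(memory_flag_to_bytes("512m"))       # prints: 536870912
```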

Union filesystem — why images are small

A Docker image is a stack of layers (the sizes below are illustrative):

Image: my-app:v1
├── Layer 4: COPY ./app /app          (5 MB)
├── Layer 3: RUN npm install           (200 MB)
├── Layer 2: COPY package.json /app    (1 KB)
├── Layer 1: WORKDIR /app              (0 bytes)
└── Layer 0: FROM node:20              (200 MB, base image)

Total: 405 MB

Add my-app:v2 (only the app code changed):
├── NEW Layer 4'                       (5 MB)
├── Layer 3 (shared)                   (reuses, 0 bytes)
├── Layer 2 (shared)                   (reuses)
├── Layer 1 (shared)                   (reuses)
└── Layer 0 (shared)                   (reuses)

Disk increase: just 5 MB
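
The disk arithmetic above amounts to storing each distinct layer once. Here is a minimal sketch that models each image as a mapping from a layer identifier to its size. Real Docker identifies layers by content digest; the readable keys and decimal units here are simplifications for illustration.

```python
KB = 1
MB = 1000 * KB  # decimal units keep the arithmetic easy to follow

shared_layers = {
    "FROM node:20": 200 * MB,
    "WORKDIR /app": 0,
    "COPY package.json /app": 1 * KB,
    "RUN npm install": 200 * MB,
}
image_v1 = {**shared_layers, "COPY ./app /app (v1 code)": 5 * MB}
image_v2 = {**shared_layers, "COPY ./app /app (v2 code)": 5 * MB}


def disk_usage(*images: dict[str, int]) -> int:
    """Total size in KB when every distinct layer is stored only once."""
    unique: dict[str, int] = {}
    for image in images:
        unique.update(image)
    return sum(unique.values())


v1_only = disk_usage(image_v1)
both = disk_usage(image_v1, image_v2)
print(v1_only // MB, "MB for v1")        # prints: 405 MB for v1
print((both - v1_only) // MB, "MB added by v2")  # prints: 5 MB added by v2
```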

Layers are read-only. Containers add a writable layer on top with copy-on-write semantics:

Container starts:
┌─────────────────────────┐
│ Writable layer (RW)     │  ← container's changes live here
├─────────────────────────┤
│ Image Layer 4 (RO)      │
├─────────────────────────┤
│ Image Layer 3 (RO)      │
├─────────────────────────┤
│ ... (RO)                │
└─────────────────────────┘

When the container writes:
- Copies the file from the RO layer into the writable layer
- Modifies the copy
- Reads search top-down for the first match
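
The lookup and copy-up rules can be modeled with plain dictionaries. This is a toy sketch of the semantics, not how overlayfs is implemented: layers are dicts from path to content, and a marker object stands in for the whiteout entries overlayfs uses to hide a deleted lower-layer file. The class and method names are hypothetical.

```python
WHITEOUT = object()  # marks a path deleted in the writable layer


class OverlayView:
    def __init__(self, lower_layers):
        self.lower = lower_layers  # read-only dicts, ordered top to bottom
        self.upper = {}            # the container's writable layer

    def read(self, path):
        # Search top-down and return the first match.
        for layer in [self.upper, *self.lower]:
            if path in layer:
                if layer[path] is WHITEOUT:
                    break
                return layer[path]
        raise FileNotFoundError(path)

    def append(self, path, data):
        # Copy-up on first write, then modify only the copy.
        if path not in self.upper or self.upper[path] is WHITEOUT:
            self.upper[path] = self.read(path)
        self.upper[path] += data

    def delete(self, path):
        self.read(path)  # raises FileNotFoundError if the path is not visible
        self.upper[path] = WHITEOUT


image_layers = [{"/app/server.js": "v1"}, {"/etc/hostname": "base"}]
view = OverlayView(image_layers)
view.append("/etc/hostname", "-edited")
print(view.read("/etc/hostname"))        # prints: base-edited
print(image_layers[1]["/etc/hostname"])  # prints: base
```

The second print shows that the image layer is untouched, which is why many containers can share one image.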

Modern Docker uses overlayfs — Linux's union filesystem — stored under /var/lib/docker/overlay2/.

Networking — four common modes

  • bridge (default) — Docker creates a virtual bridge (docker0). Each container has a veth pair. NAT for outbound.
  • host — uses the host network namespace directly. No isolation, fastest.
  • none — network isolated, no interfaces. Most secure.
  • overlay — multi-host networking for Swarm clusters, built on VXLAN tunnels. (Kubernetes uses its own CNI plugins rather than Docker's overlay driver.)

Bridge mode flow:
container1 (172.17.0.2) ── veth1 ── docker0 (172.17.0.1) ── NAT ── eth0 ── Internet
container2 (172.17.0.3) ── veth2 ─┘

Traffic between containers on the same bridge goes through docker0 directly, without NAT.
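
The addresses in the diagram follow from the bridge subnet. The sketch below uses Python's ipaddress module and assumes the common default subnet 172.17.0.0/16 with purely sequential assignment. Docker's real address management is more involved, so treat this as an illustration of the layout only.

```python
import ipaddress
from itertools import islice


def bridge_addresses(subnet: str, containers: int):
    """Return (gateway, container addresses) for a bridge network.

    Assumes the first usable host address goes to the bridge itself and
    the following ones are handed to containers in order.
    """
    hosts = iter(ipaddress.ip_network(subnet).hosts())
    gateway = next(hosts)
    return gateway, list(islice(hosts, containers))


gateway, assigned = bridge_addresses("172.17.0.0/16", 2)
print(gateway)                       # prints: 172.17.0.1
print([str(ip) for ip in assigned])  # prints: ['172.17.0.2', '172.17.0.3']
```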

What the Docker daemon actually does

user → docker CLI → dockerd (daemon, REST API)
                          ↓
                       containerd (high-level runtime)
                          ↓
                       runc (low-level runtime, OCI spec)
                          ↓
                       Linux kernel (namespaces + cgroups + overlayfs)

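The container itself is created by runc, the reference implementation of the OCI runtime spec. runc creates the namespaces, configures cgroups, switches the root filesystem (pivot_root by default, not a plain chroot), and exec()s the container's entrypoint. Docker is a UX layer on top.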
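
What runc receives is a bundle: a root filesystem plus a config.json in the OCI runtime format. The sketch below builds a heavily trimmed config to show where namespaces and cgroup limits appear. The field names follow the OCI runtime spec, but a real config (for example the one generated by runc spec) carries many more fields, so this fragment is not expected to be accepted by runc as-is. minimal_oci_config is a hypothetical helper.

```python
import json


def minimal_oci_config(args, rootfs="rootfs"):
    """Trimmed illustration of an OCI runtime config.json."""
    return {
        "ociVersion": "1.0.2",
        "process": {"args": list(args), "cwd": "/"},
        "root": {"path": rootfs, "readonly": True},
        "linux": {
            # One entry per namespace the runtime should create.
            "namespaces": [
                {"type": kind}
                for kind in ("pid", "network", "mount", "ipc", "uts")
            ],
            # Limits the runtime turns into cgroup settings.
            "resources": {
                "memory": {"limit": 536870912},
                "cpu": {"quota": 200000, "period": 100000},
            },
        },
    }


config = minimal_oci_config(["nginx", "-g", "daemon off;"])
print(json.dumps(config, indent=2))
```

The memory and CPU numbers are the same ones used for --memory=512m and --cpus=2 in the cgroups section.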

Alternatives:

  • Podman — daemon-less, rootless. Same OCI images.
  • containerd — the most widely used Kubernetes runtime. Lighter than the full Docker stack.
  • CRI-O — built only for Kubernetes, originally by Red Hat.

Dockerfile build steps

FROM node:20
WORKDIR /app
COPY package.json .
RUN npm install
COPY . .
CMD ["node", "server.js"]

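Each instruction is a build step. Steps that change the filesystem add a layer; CMD only records metadata:
1. Pull the node:20 base image
2. Create /app and make it the working directory (empty layer)
3. Copy package.json
4. Run npm install (dependencies)
5. Copy the rest of the code
6. Record CMD (metadata only)

Cache usage:
- If package.json hasn't changed, RUN npm install is a cache hit
- A change at any step invalidates the cache for every step after it
- That's why copying package.json before RUN npm install is best practice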
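
The cache behavior can be modeled as a chain of hashes in which each step's key depends on its parent's key, the instruction text, and a fingerprint of any files it copies. This is a simplified model for illustration, not Docker's actual cache key algorithm; the fingerprints such as "src-v1" are placeholders.

```python
import hashlib


def layer_keys(steps):
    """steps: list of (instruction, fingerprint of copied files). Returns chained keys."""
    keys, parent = [], ""
    for instruction, content in steps:
        parent = hashlib.sha256(f"{parent}|{instruction}|{content}".encode()).hexdigest()
        keys.append(parent)
    return keys


def cache_hits(previous, current):
    """True for each step whose key matches the previous build."""
    return [old == new for old, new in zip(layer_keys(previous), layer_keys(current))]


def build(package_json, source):
    return [
        ("FROM node:20", ""),
        ("WORKDIR /app", ""),
        ("COPY package.json .", package_json),
        ("RUN npm install", ""),
        ("COPY . .", source),
        ('CMD ["node", "server.js"]', ""),
    ]


# Only application code changed: npm install is still a cache hit.
print(cache_hits(build("pkg-v1", "src-v1"), build("pkg-v1", "src-v2")))
# prints: [True, True, True, True, False, False]
```

If package.json changes instead, the first mismatch moves up to the COPY package.json step and npm install runs again.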

Multi-stage builds — slim final images

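# Stage 1: builder
FROM node:20 AS builder
WORKDIR /app
COPY . .
RUN npm install
# Produces dist/
RUN npm run build
# Drop devDependencies so they are not copied into the final stage
RUN npm prune --omit=dev

# Stage 2: final runtime
# The Alpine base is far smaller (roughly 60 MB vs 1 GB).
# Note: native addons compiled in the Debian-based builder may not load on Alpine (musl).
FROM node:20-alpine
WORKDIR /app
COPY --from=builder /app/dist /app/dist
COPY --from=builder /app/node_modules /app/node_modules
CMD ["node", "dist/server.js"]

# Result: no dev deps or source code in the final image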

Compared to VMs

| | VM (VirtualBox, VMware) | Container (Docker) |
|---|---|---|
| Kernel | Own (guest OS) | Shared with host |
| Size | GB | MB |
| Boot | Minutes | Seconds (often ms) |
| Isolation | Very strong (hypervisor) | Kernel-namespace level |
| OS choice | Any (Linux on Mac etc.) | Same kernel ABI as host |
| Overhead | 10-20% | ~1-2% |

Docker Desktop on Mac and Windows runs a hidden Linux VM (HyperKit or Apple's Virtualization framework on macOS, WSL 2 on Windows). Containers on macOS still need a Linux kernel.

Container isolation isn't VM isolation

Risks:

  • Kernel exploits — a host kernel bug affects every container. VMs are shielded by the hypervisor.
  • Root in container ≈ root on host — without USER namespaces, container root maps to host root. Breakout risk.
  • Shared resources — /dev, parts of /proc. Wrong mounts leak host info.
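
Whether container root is host root can be checked from /proc/self/uid_map. Each line reads "inside-ID outside-ID length"; a line starting "0 0" means UID 0 inside is UID 0 outside. A minimal sketch follows, with sample contents that are typical but hypothetical. It only looks one namespace level up, so nested user namespaces would need the check repeated.

```python
def root_maps_to_host_root(uid_map_text: str) -> bool:
    """True if UID 0 inside the namespace maps to UID 0 in the parent namespace."""
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside_start, outside_start, _length = map(int, fields)
        if inside_start == 0:
            return outside_start == 0
    return False  # UID 0 is not mapped at all


# Typical contents without user namespace remapping (identity map).
no_userns = "         0          0 4294967295\n"
# Typical contents with remapping to a subordinate UID range.
remapped = "         0     100000      65536\n"

print(root_maps_to_host_root(no_userns))  # prints: True
print(root_maps_to_host_root(remapped))   # prints: False
```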

Mitigations:

  • --user for non-root
  • Rootless Docker / Podman
  • gVisor (Google) — application kernel for extra isolation
  • Kata Containers — micro-VMs with container UX

Common pitfalls

1. PID 1 responsibilities

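PID 1 is special in Linux: it has to reap orphaned child processes, and the kernel does not apply default signal actions to it, so a SIGTERM with no handler installed is simply ignored. A wrapper such as CMD ["bash", "-c", "node server.js"] can leave bash as PID 1, and bash does not forward signals to node. Use the exec form (CMD ["node", "server.js"]) with a SIGTERM handler in the app, or run a small init such as tini (docker run --init).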
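
The two jobs of an init process, forwarding signals and reaping children, fit in a few lines. The sketch below is a toy illustration for POSIX systems, not a replacement for tini: it reaps leftover children only after the main child exits, whereas a real init reaps continuously. run_as_init is a hypothetical name, and it must be called from the main thread because it installs signal handlers.

```python
import os
import signal
import subprocess


def run_as_init(argv):
    """Run argv as a child, forward SIGTERM/SIGINT to it, then reap leftovers."""
    child = subprocess.Popen(argv)

    def forward(signum, _frame):
        child.send_signal(signum)  # pass the signal on to the real workload

    forwarded = (signal.SIGTERM, signal.SIGINT)
    previous = {sig: signal.signal(sig, forward) for sig in forwarded}
    try:
        exit_code = child.wait()
        # Reap any orphans that were re-parented to this process.
        while True:
            try:
                pid, _status = os.waitpid(-1, os.WNOHANG)
            except ChildProcessError:
                break  # no children left
            if pid == 0:
                break  # children remain, but none has exited yet
        return exit_code
    finally:
        # Restore whatever handlers were installed before.
        for sig, handler in previous.items():
            signal.signal(sig, handler if handler is not None else signal.SIG_DFL)
```

Used as an entrypoint, a wrapper script would call run_as_init(["node", "server.js"]) and exit with the returned code.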

2. Layer explosion

# Bad — each RUN is a new layer
RUN apt-get update
RUN apt-get install -y curl
RUN apt-get install -y vim
RUN rm -rf /var/lib/apt/lists/*

# Result — 4 layers, and intermediate cache stays in the image

# Good — chained
RUN apt-get update && \
    apt-get install -y curl vim && \
    rm -rf /var/lib/apt/lists/*

# One layer, no leftovers

3. Missing .dockerignore

COPY . . pulls in .git, node_modules, .env, etc. Use .dockerignore:

# .dockerignore
node_modules
.git
.env
*.log
.DS_Store

4. Mounting host paths and UID mismatches

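# host user is UID 1000; the alpine image runs as root (UID 0) by default
docker run -v "$(pwd)":/data alpine touch /data/file
                                    ↓
                                    file is created by UID 0
                                    host sees a root-owned file that UID 1000 cannot modify

Fix: run the container with the caller's UID and GID
docker run -u "$(id -u):$(id -g)" -v "$(pwd)":/data ...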
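
When containers are launched from a script, the same fix can be applied by building the argument list from the caller's IDs. This is a minimal sketch: docker_run_as_caller is a hypothetical helper, the /data mount point mirrors the example above, and it only builds the command without running it.

```python
import os
import shlex


def docker_run_as_caller(image, command, host_dir, uid=None, gid=None):
    """Build a `docker run` argv that runs as the calling user's UID:GID."""
    uid = os.getuid() if uid is None else uid
    gid = os.getgid() if gid is None else gid
    return [
        "docker", "run",
        "-u", f"{uid}:{gid}",
        "-v", f"{host_dir}:/data",
        image, *command,
    ]


argv = docker_run_as_caller("alpine", ["touch", "/data/file"], os.getcwd())
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv, check=True) would execute it; files created under /data are then owned by the host user.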

5. Shipping dev images to production

FROM node:20 is roughly 1 GB; FROM node:20-alpine is roughly 60 MB. Production images should use distroless or Alpine bases. Even if development uses bulky images, the final stage should be slim.

Summary

  • Containers are normal processes — host kernel shared, plus Linux namespaces and cgroups around them.
  • Eight namespaces (PID / NET / MNT / UTS / IPC / USER / CGROUP / TIME) provide view isolation.
  • cgroups cap CPU / memory / I/O / PID. Controlled via files under /sys/fs/cgroup/.
  • An image is a stack of read-only layers; containers add a writable layer with copy-on-write via overlayfs.
  • Docker daemon = UX over runc + containerd. Alternatives: Podman, containerd, CRI-O.
  • Multi-stage builds + .dockerignore shrink images dramatically.
  • Containers aren't VMs — same kernel = light but less isolated. Use gVisor / Kata for stricter boundaries.