How Load Balancers Actually Work

One server can't handle the traffic, so you run several and put a load balancer in front to spread requests across them. Simple to say, but in practice "L4 or L7", "which algorithm", "must a session stick to one server", "how do you remove a dead server", and "what if the load balancer itself dies" are all decisions you have to make. This guide walks through each one.

Why You Need One

One server: CPU/memory cap → connection cap → it dies, everything down

Many servers + LB:
                 ┌──→ server A
client → [ LB ] ─┼──→ server B
                 └──→ server C

What you get:
- Horizontal scale (scale out): add servers → more throughput
- High availability: if one dies, the LB stops sending to it
- Zero-downtime deploys: drain one server → deploy → re-add (rolling)

L4 vs L7 — Which Layer Decides

The biggest fork: at which OSI layer does the LB inspect traffic and decide?

L4 (transport, TCP/UDP):
  - Forwards based on IP + port only
  - Can't read content (HTTP path, headers, cookies): the payload is never parsed
  - Pins a connection to one server and passes it through (passthrough)
  - Fast and cheap; TLS usually flows through untouched (no termination)
  e.g. AWS NLB, IPVS, HAProxy(tcp mode)

L7 (application, HTTP/HTTPS):
  - Actually parses the HTTP request — path, header, cookie, method
  - Enables routing like "/api → backend A, /img → backend B"
  - TLS termination, compression, header injection, retries, caching
  - Smarter but costs CPU (parse every request)
  e.g. AWS ALB, Nginx, HAProxy(http mode), Envoy

In short: L4 delivers by address only, L7 reads the contents before delivering. They're often combined — L4 absorbs bandwidth at the edge, an inner L7 does fine-grained routing.
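
As a rough illustration of the difference, the sketch below picks a backend two ways: from the connection's addresses and ports only (the L4 view), and from the parsed request path (the L7 view). The pool addresses, backend names, and function names are all hypothetical, and hashing the address tuple is only one possible way to pin a connection; real L4 balancers typically keep a connection table instead.

```python
import hashlib

# Hypothetical pools, named for illustration only.
L4_POOL = ["10.0.0.1:443", "10.0.0.2:443", "10.0.0.3:443"]
L7_ROUTES = [("/api", "backend-a"), ("/img", "backend-b")]
L7_DEFAULT = "backend-default"


def l4_pick(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> str:
    """L4 view: only addresses and ports are visible, so hash them."""
    key = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(key).digest()
    return L4_POOL[int.from_bytes(digest[:8], "big") % len(L4_POOL)]


def l7_pick(path: str) -> str:
    """L7 view: the parsed HTTP path is visible, so route by prefix."""
    for prefix, backend in L7_ROUTES:
        if path == prefix or path.startswith(prefix + "/"):
            return backend
    return L7_DEFAULT
```

Note that `l4_pick` has no `path` argument at all: at that layer the information simply is not available.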

Distribution Algorithms

The algorithm is the rule for picking which server gets the next request. The right choice depends on the shape of your traffic.

round-robin:
  A → B → C → A → B → C ...
  Simple/fair. But only fair if servers are identical and request
  costs are similar. One 10s request and one 10ms request → imbalance.

weighted round-robin:
  A(weight 3) → A → A → B(weight 1) → A → A → A → B ...
  For unequal hardware. 8-core:4-core = weight 2:1.
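
Both rotations above can be sketched in a few lines. This is a minimal version that simply repeats each server according to its weight, which reproduces the A, A, A, B pattern shown; production balancers often interleave the picks more smoothly. The function names are made up for the example.

```python
from itertools import cycle, islice


def round_robin(servers):
    """Yield servers in a fixed rotation, forever."""
    return cycle(servers)


def weighted_round_robin(weights):
    """weights maps server -> positive integer weight.

    Each server is repeated `weight` times per cycle.
    """
    expanded = [server for server, w in weights.items() for _ in range(w)]
    return cycle(expanded)


if __name__ == "__main__":
    print(list(islice(round_robin(["A", "B", "C"]), 6)))
    print(list(islice(weighted_round_robin({"A": 3, "B": 1}), 8)))
```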

least-connections:
  Send to the server with the fewest active connections right now.
  Balances better than round-robin when request lengths vary.
  Especially good for long-lived connections (websockets, etc.).
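
A minimal sketch of least-connections, assuming the balancer is told when each connection opens and closes. The class and method names are hypothetical, and ties go to whichever server was listed first.

```python
class LeastConnections:
    def __init__(self, servers):
        # Active connection count per server.
        self.active = {server: 0 for server in servers}

    def acquire(self):
        """Pick the server with the fewest active connections."""
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server):
        """Call when a connection to `server` closes."""
        self.active[server] -= 1
```

After A, B, and C each hold one connection and B's connection closes, the next `acquire()` goes to B, regardless of whose turn it would have been in a rotation.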

least-response-time:
  Combines connection count + average response time. Auto-avoids a
  server that's gotten slow.
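
One way to combine the two signals is to score each server as (active connections + 1) times a moving average of its response time and pick the lowest score. That scoring formula, the smoothing factor `alpha`, and all names below are assumptions for this sketch; real products each use their own formula.

```python
class LeastResponseTime:
    def __init__(self, servers, alpha=0.3):
        self.alpha = alpha  # weight given to the newest latency sample
        self.active = {s: 0 for s in servers}
        self.avg_ms = {s: 0.0 for s in servers}
        self.seen = set()  # servers with at least one latency sample

    def _score(self, server):
        return (self.active[server] + 1) * self.avg_ms[server]

    def start(self):
        """Pick the lowest-scoring server and count the new request."""
        server = min(self.active, key=self._score)
        self.active[server] += 1
        return server

    def finish(self, server, latency_ms):
        """Record completion and fold the latency into the moving average."""
        self.active[server] -= 1
        if server in self.seen:
            old = self.avg_ms[server]
            self.avg_ms[server] = (1 - self.alpha) * old + self.alpha * latency_ms
        else:
            self.avg_ms[server] = latency_ms
            self.seen.add(server)
```

A server with no samples yet scores zero, so it gets tried first; a server whose average climbs drops out of favor without any explicit health check.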

IP hash:
  hash(client IP) % server_count → always the same server.
  Pins a client to one server (one way to do sticky sessions, below).
  Changing server count remaps most clients (consistent hashing eases it).
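
To make the remapping problem concrete, the sketch below compares plain modulo hashing with a simple consistent-hash ring when a fourth server is added. The ring with 100 virtual nodes per server is an illustrative choice, not a recommendation, and the helper names are invented for the example.

```python
import bisect
import hashlib


def h(value: str) -> int:
    """Stable 64-bit hash (Python's built-in hash() is salted per process)."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


def modulo_pick(client_ip, servers):
    return servers[h(client_ip) % len(servers)]


class HashRing:
    def __init__(self, servers, vnodes=100):
        # Each server owns `vnodes` points scattered around the ring.
        self.points = sorted((h(f"{s}#{i}"), s) for s in servers for i in range(vnodes))
        self.keys = [point for point, _ in self.points]

    def pick(self, client_ip):
        # First ring point at or after the client's hash, wrapping around.
        idx = bisect.bisect_left(self.keys, h(client_ip)) % len(self.points)
        return self.points[idx][1]


def moved_fraction(pick_before, pick_after, clients):
    moved = sum(1 for c in clients if pick_before(c) != pick_after(c))
    return moved / len(clients)
```

Going from 3 to 4 servers, modulo hashing is expected to move roughly three quarters of clients, while the ring moves only the clients that the new server takes over.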

Health Checks — Removing Dead Servers

The LB's core job: periodically verify each backend is alive and, if not, pull it from the pool so no traffic goes there.

Passive health check:
  Observe failures in real traffic — N consecutive 5xx/timeout → eject.
  No extra requests (lightweight). But real users absorb the first N
  failures before the server is ejected.

Active health check:
  The LB probes periodically (e.g. GET /healthz every 5s)
  - N consecutive failures → unhealthy → remove from pool
  - M consecutive successes again → healthy → re-add
  Detects trouble before user requests hit it.
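
The N-failures / M-successes rule is a small state machine. Here is a minimal sketch, where `fall` plays the role of N and `rise` the role of M; the class name and default thresholds are assumptions for the example.

```python
from dataclasses import dataclass


@dataclass
class HealthState:
    fall: int = 3          # consecutive failures before marking unhealthy
    rise: int = 2          # consecutive successes before marking healthy
    healthy: bool = True
    _streak: int = 0       # consecutive results contradicting current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the server is in the pool."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current state
        else:
            self._streak += 1
            threshold = self.rise if probe_ok else self.fall
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy
```

A single failed probe between successes resets the streak, so one dropped packet does not eject a server. The same object could be fed outcomes of real requests instead of probes, which would make it a passive check.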

Designing a good /healthz:
  - Not just a bare 200 — check dependencies too (shallow vs deep)
  - But too deep means a slightly wobbly DB marks ALL servers
    unhealthy → total outage
  - Have it answer only "can I serve a request right now?" (that is
    readiness; liveness only asks "is the process alive?")
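
One way to keep the two questions apart is to expose them as separate endpoints. The sketch below is a bare WSGI app using hypothetical paths `/livez` and `/readyz` (the text above uses a single `/healthz`); `critical_checks` is an assumed dict of dependency checks that this server truly cannot work without.

```python
def _safe(check) -> bool:
    """Run one dependency check; treat any exception as a failure."""
    try:
        return bool(check())
    except Exception:
        return False


def make_app(critical_checks):
    """critical_checks maps a name to a zero-argument callable returning bool."""

    def app(environ, start_response):
        path = environ.get("PATH_INFO", "")
        if path == "/livez":
            # Liveness: the process is up and able to answer at all.
            status, body = "200 OK", b"alive"
        elif path == "/readyz":
            # Readiness: only dependencies required to serve a request.
            failed = [name for name, check in critical_checks.items() if not _safe(check)]
            if failed:
                status, body = "503 Service Unavailable", ", ".join(failed).encode()
            else:
                status, body = "200 OK", b"ready"
        else:
            status, body = "404 Not Found", b"not found"
        start_response(status, [("Content-Type", "text/plain")])
        return [body]

    return app
```

Keeping the list of critical checks short is the point: an optional dependency that is slow should degrade a feature, not pull the server out of the pool.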

Sticky Sessions (Session Affinity)

Sticky sessions route a given client's requests to the same server every time. They are needed when a server holds session state in its own memory.

Why?
  Login session stored in server A's memory → next request lands on B
  → "not logged in" → broken.

How:
  1) IP hash — pin by client IP. But thousands behind one NAT IP skew it.
  2) Cookie-based — the LB injects a cookie naming the server (e.g. AWSALB).
     Most accurate. Unaffected by NAT.
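
The cookie-based approach can be sketched as follows. The cookie name `lb_server` and the plain server name as its value are assumptions for readability; real balancers generally use an opaque or encoded value so that backend names are not exposed to clients.

```python
import random

COOKIE_NAME = "lb_server"  # hypothetical cookie name


def route(request_cookies, healthy_servers, pick=random.choice):
    """Return (server, set_cookie_header_or_None).

    request_cookies: dict of cookie name -> value from the request.
    healthy_servers: list of servers currently in the pool.
    """
    pinned = request_cookies.get(COOKIE_NAME)
    if pinned in healthy_servers:
        return pinned, None  # honor the existing pin
    # No cookie, or the pinned server is gone: choose again and re-pin.
    server = pick(healthy_servers)
    return server, f"{COOKIE_NAME}={server}; Path=/; HttpOnly"
```

The second branch is where the cost shows up: when the pinned server has left the pool, the client is re-pinned elsewhere and whatever that server held in memory is lost.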

Cost:
  - Breaks balance (heavy users pile onto one server)
  - If that server dies → every session pinned to it is gone

The better path — stateless servers:
  Keep sessions in shared storage (Redis/DB) or a JWT, not server memory.
  Then sticky is unnecessary → any server can serve → better balance and
  fault tolerance.
  Practical advice: design toward eliminating stickiness when you can.
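
To show why a token removes the need for stickiness, here is a minimal signed-token sketch using only the standard library. It is a simplified stand-in for a JWT, not a JWT implementation, and the secret and function names are placeholders. Any server holding the same secret can verify the session without having seen the client before.

```python
import base64
import hashlib
import hmac
import json

SECRET = b"demo-secret-change-me"  # placeholder; shared by every server


def issue_token(session: dict) -> str:
    """Serialize the session and append an HMAC-SHA256 signature."""
    payload = base64.urlsafe_b64encode(json.dumps(session, sort_keys=True).encode())
    signature = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return f"{payload.decode()}.{signature}"


def verify_token(token: str):
    """Return the session dict if the signature is valid, else None."""
    payload, _, signature = token.rpartition(".")
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature.encode(), expected.encode()):
        return None
    return json.loads(base64.urlsafe_b64decode(payload))
```

A real deployment would also need expiry and key rotation, which this sketch leaves out.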

Making the LB Itself Highly Available

A load balancer that is a single point of failure defeats the purpose, so the LB itself must be redundant.

active-passive (failover):
  LB1(active) + LB2(standby) share one virtual IP (VIP).
  - VRRP/keepalived heartbeat between them
  - active dies → passive takes over the VIP → traffic continues
  - standby usually sits idle (cost)
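
The takeover logic on the standby side can be sketched as a heartbeat timeout. This is a toy model with explicit timestamps, not VRRP itself; the class name, the 3-second timeout, and the choice to hand the VIP back as soon as the active node is heard again are all assumptions.

```python
class Standby:
    def __init__(self, dead_after: float = 3.0):
        self.dead_after = dead_after   # seconds of silence before takeover
        self.last_heartbeat = 0.0
        self.holds_vip = False

    def on_heartbeat(self, now: float):
        """The active node was heard from; yield the VIP if we held it."""
        self.last_heartbeat = now
        self.holds_vip = False

    def tick(self, now: float) -> bool:
        """Periodic check; claim the VIP if the active node has gone silent."""
        if now - self.last_heartbeat > self.dead_after:
            self.holds_vip = True
        return self.holds_vip
```

The timeout is the trade-off: shorter means faster failover, but also more risk of both nodes claiming the VIP during a brief network hiccup.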

active-active:
  Multiple LBs take traffic at once. Needs upstream distribution →
  usually DNS round-robin or anycast.

anycast:
  The same IP is advertised by LBs in many regions at once (BGP).
  The network routes to the "nearest" LB automatically.
  - A region's LB/link dies → BGP reconverges to the next nearest
  - Lower latency + fault isolation. The standard for CDNs/large services.

The Whole Picture

               DNS (returns one anycast IP)
                      │
        ┌─────────────┼─────────────┐
   [LB region A]  [LB region B]  [LB region C]   ← anycast, nearest wins
        │
   L7 routing + health checks + (no sticky if possible)
        │
   ┌────┼────┐
  srv  srv  srv   ← stateless, sessions in Redis/JWT
        │
   shared session store / DB

Common Pitfalls

- Health check too deep: a briefly slow DB drops every server into unhealthy and takes the whole service down. Separate liveness (am I alive) from readiness (ready to take traffic).
- Relying on sticky sessions, then a server dies: every pinned session evaporates. Move sessions to shared storage or a token so any server can take over.
- least-connections when connection ≠ load: some requests hold a connection but sit idle (websockets), others are short but CPU-heavy. Verify that connection count reflects real load.
- Removing a server without graceful shutdown: in-flight requests get cut. Add a "draining" phase on deploy (stop sending new requests, let existing ones finish).
- Thundering herd on recovery: a server comes back, is marked healthy, and immediately receives its full share of traffic → it is overwhelmed and dies again. Ease it in with slow start (gradual weight ramp).
- Confusion over the TLS termination point: terminate at L7 and the LB→server hop may be plaintext. Decide by policy whether to re-encrypt internally.
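
Two of the fixes above, draining and slow start, are small enough to sketch. The linear ramp, the 30-second default, and all names here are illustrative assumptions; actual balancers expose their own knobs for both.

```python
def slow_start_weight(full_weight, seconds_since_healthy, ramp_seconds=30.0, floor=1):
    """Linearly ramp a recovered server's weight up to full_weight."""
    if seconds_since_healthy >= ramp_seconds:
        return full_weight
    fraction = max(0.0, seconds_since_healthy) / ramp_seconds
    return max(floor, int(full_weight * fraction))


class DrainingBackend:
    def __init__(self):
        self.draining = False
        self.in_flight = 0

    def try_begin(self) -> bool:
        """Accept a new request unless the backend is draining."""
        if self.draining:
            return False
        self.in_flight += 1
        return True

    def end(self):
        """Call when a request finishes."""
        self.in_flight -= 1

    def start_drain(self):
        self.draining = True

    def safe_to_stop(self) -> bool:
        """True once draining has begun and nothing is still in flight."""
        return self.draining and self.in_flight == 0
```

A deploy script under these assumptions would call `start_drain()`, wait until `safe_to_stop()` is true (with an upper time limit), and only then stop the process.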

Wrap-up

A load balancer isn't just "spread the requests"; it's a bundle of decisions about L4/L7, algorithm, health checks, session strategy, and its own redundancy. Hold onto one principle: the more stateless your servers, the simpler and stronger the LB. Then sticky sessions, and the "that server died, the sessions are gone" failure mode, simply disappear.