One server can't handle the traffic. So you run several and put a load balancer in front to spread requests across them. Simple to say, but in practice "L4 or L7", "which algorithm", "must a session stick to one server", "how do you remove a dead server", and "what if the load balancer itself dies" are all decisions. This guide walks through each one.
Why You Need One
One server: CPU/memory cap → connection cap → it dies, everything down
Many servers + LB:
                 ┌──→ server A
client → [ LB ] ─┼──→ server B
                 └──→ server C
What you get:
- Horizontal scale (scale out): add servers → more throughput
- High availability: if one dies, the LB stops sending to it
- Zero-downtime deploys: drain one server → deploy → re-add (rolling)
L4 vs L7: Which Layer Decides
The biggest fork: at which OSI layer does the LB inspect traffic and decide?
L4 (transport, TCP/UDP):
- Forwards based on IP + port only
- Cannot read content (HTTP path, header, cookie)
- Pins a connection to one server and passes it through (passthrough)
- Fast and cheap, TLS flows through untouched (no termination)
e.g. AWS NLB, IPVS, HAProxy (tcp mode)
L7 (application, HTTP/HTTPS):
- Actually parses the HTTP request — path, header, cookie, method
- Enables routing like "/api → backend A, /img → backend B"
- TLS termination, compression, header injection, retries, caching
- Smarter but costs CPU (parse every request)
e.g. AWS ALB, Nginx, HAProxy (http mode), Envoy
In short: L4 delivers by address only, L7 reads the contents before delivering. They're often combined: L4 absorbs bandwidth at the edge, and an inner L7 does fine-grained routing.
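As a rough illustration of the "/api → backend A, /img → backend B" idea, the sketch below picks a backend pool by longest matching path prefix. The pool names, the `ROUTES` table, and the default pool are hypothetical, and real L7 balancers express this in their own config syntax rather than in application code.

```python
# Hypothetical routing table: path prefix -> backend pool name.
ROUTES = [
    ("/api", "backend-a"),
    ("/img", "backend-b"),
]
DEFAULT_POOL = "backend-default"  # assumed fallback, not from the text


def route(path: str) -> str:
    """Return the pool whose prefix is the longest match for `path`."""
    best_prefix, best_pool = "", DEFAULT_POOL
    for prefix, pool in ROUTES:
        # Match "/api" and "/api/...", but not "/apiary".
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best_prefix):
            best_prefix, best_pool = prefix, pool
    return best_pool


print(route("/api/users"))     # backend-a
print(route("/img/logo.png"))  # backend-b
print(route("/apiary"))        # backend-default
```

An L4 balancer cannot make this decision at all, because the path only becomes visible once the HTTP request has been parsed.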
Distribution Algorithms
The rule for picking "which server". The right answer depends on the shape of your traffic.
round-robin:
A → B → C → A → B → C ...
Simple/fair. But only fair if servers are identical and request
costs are similar. One 10s request and one 10ms request → imbalance.
weighted round-robin:
A(weight 3) → A → A → B(weight 1) → A → A → A → B ...
For unequal hardware. 8-core:4-core = weight 2:1.
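A minimal sketch of weighted round-robin, assuming the simplest possible schedule: each server is repeated `weight` times per cycle, which reproduces the A, A, A, B pattern above. The function name is hypothetical. Production balancers often interleave the picks more smoothly so that one server does not receive a burst.

```python
import itertools
from typing import Dict, Iterator


def weighted_round_robin(weights: Dict[str, int]) -> Iterator[str]:
    """Return an endless iterator; each server appears `weight` times per cycle."""
    # Expand {"A": 3, "B": 1} into ["A", "A", "A", "B"] and loop over it.
    schedule = [name for name, w in weights.items() for _ in range(w)]
    if not schedule:
        raise ValueError("at least one server needs a positive weight")
    return itertools.cycle(schedule)


picker = weighted_round_robin({"A": 3, "B": 1})
first_eight = [next(picker) for _ in range(8)]
print(first_eight)  # ['A', 'A', 'A', 'B', 'A', 'A', 'A', 'B']
```

With all weights set to 1 this degenerates to plain round-robin.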
least-connections:
Send to the server with the fewest active connections right now.
Balances better than round-robin when request lengths vary.
Especially good for long-lived connections (websockets, etc.).
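Least-connections can be sketched as a counter per server, assuming the balancer is told when each request starts and finishes. The class and method names are hypothetical.

```python
from typing import Dict, Iterable


class LeastConnections:
    def __init__(self, servers: Iterable[str]):
        # Active (in-flight) connection count per server.
        self.active: Dict[str, int] = {s: 0 for s in servers}

    def acquire(self) -> str:
        """Pick the server with the fewest active connections and count one more."""
        # On ties, min() keeps the first server in insertion order.
        server = min(self.active, key=self.active.get)
        self.active[server] += 1
        return server

    def release(self, server: str) -> None:
        """Call when the connection to `server` closes."""
        self.active[server] -= 1


lb = LeastConnections(["A", "B", "C"])
first, second, third = lb.acquire(), lb.acquire(), lb.acquire()
lb.release(second)    # B's request finishes early
fourth = lb.acquire()
print(first, second, third, fourth)  # A B C B
```

Because B frees up first, it receives the next request, whereas round-robin would have gone back to A regardless of load.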
least-response-time:
Combines connection count + average response time. Auto-avoids a
server that's gotten slow.
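One way to combine the two signals, shown here as a sketch under assumptions: score each backend as (active connections + 1) × smoothed response time, and pick the lowest score. The formula, the smoothing factor `alpha`, and all names are illustrative choices, not a description of any particular product.

```python
class Backend:
    def __init__(self, name: str, alpha: float = 0.2):
        self.name = name
        self.active = 0     # in-flight requests
        self.avg_ms = 0.0   # smoothed response time; 0.0 until the first sample
        self.alpha = alpha  # assumed weight given to the newest sample

    def record(self, elapsed_ms: float) -> None:
        """Fold one observed response time into the moving average."""
        if self.avg_ms == 0.0:
            self.avg_ms = elapsed_ms
        else:
            self.avg_ms = self.alpha * elapsed_ms + (1 - self.alpha) * self.avg_ms

    def score(self) -> float:
        # Lower is better. A backend with no samples scores 0 and is tried first.
        return (self.active + 1) * self.avg_ms


def pick(backends):
    return min(backends, key=lambda b: b.score())


fast, slow = Backend("A"), Backend("B")
fast.record(10.0)
slow.record(500.0)
fast.active = 3                  # busier, but still far quicker overall
print(pick([fast, slow]).name)   # A
```

The slow server is avoided even though it has fewer connections, which is the behaviour the description above is after.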
IP hash:
hash(client IP) % server_count → always the same server.
Pins a client to one server (one way to do sticky sessions, below).
Changing server count remaps most clients (consistent hashing eases it).
Health Checks: Removing Dead Servers
The LB's core job: periodically verify each backend is alive and, if not, pull it from the pool so no traffic goes there.
Passive health check:
Observe failures in real traffic — N consecutive 5xx/timeout → eject.
No extra requests (lightweight). But real users absorb the first N failures.
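A passive check can be sketched as a failure streak per server, assuming the balancer sees the status of every proxied response. The threshold of 3 and all names are hypothetical.

```python
class PassiveCheck:
    def __init__(self, threshold: int = 3):
        self.threshold = threshold  # the "N" above; 3 is an assumed value
        self.failures = {}          # server -> current consecutive-failure count
        self.ejected = set()

    def observe(self, server: str, status: int, timed_out: bool = False) -> None:
        """Record the outcome of one real request sent to `server`."""
        if timed_out or status >= 500:
            self.failures[server] = self.failures.get(server, 0) + 1
            if self.failures[server] >= self.threshold:
                self.ejected.add(server)
        else:
            self.failures[server] = 0  # any success resets the streak


check = PassiveCheck(threshold=3)
for status in (502, 503, 200, 500, 500):
    check.observe("A", status)
print("A" in check.ejected)   # False: the 200 broke the first streak
check.observe("A", 0, timed_out=True)
print("A" in check.ejected)   # True: three failures in a row
```

This sketch never re-admits an ejected server; in practice that usually needs a timer or an active probe.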
Active health check:
The LB probes periodically (e.g. GET /healthz every 5s)
- N consecutive failures → unhealthy → remove from pool
- M consecutive successes again → healthy → re-add
Detects trouble before user requests hit it.
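The N-failures / M-successes rule is a small state machine. The sketch below assumes probe results are fed in as booleans and leaves out the actual `GET /healthz` call and the 5-second timer. The parameter names `fall` and `rise` mirror the terms HAProxy uses for the same thresholds, and the default values are arbitrary.

```python
class ActiveCheckState:
    def __init__(self, fall: int = 3, rise: int = 2):
        self.fall = fall     # N consecutive failures -> unhealthy
        self.rise = rise     # M consecutive successes -> healthy again
        self.healthy = True
        self.streak = 0      # consecutive probes contradicting the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the server is in the pool."""
        if probe_ok == self.healthy:
            self.streak = 0  # result agrees with current state
        else:
            self.streak += 1
            needed = self.fall if self.healthy else self.rise
            if self.streak >= needed:
                self.healthy = not self.healthy
                self.streak = 0
        return self.healthy


state = ActiveCheckState(fall=3, rise=2)
probes = [False, False, True, False, False, False, True, True]
history = [state.record(ok) for ok in probes]
print(history)
# [True, True, True, True, True, False, False, True]
```

Requiring consecutive results in both directions keeps a single dropped probe from flapping the server in and out of the pool.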
Designing a good /healthz:
- Not just a bare 200 — check dependencies too (shallow vs deep)
- But too deep means a slightly wobbly DB marks ALL servers
unhealthy → total outage
- Have it answer only "can I serve a request?" (liveness vs readiness)
Sticky Sessions (Session Affinity)
Routing a given client's requests to the same server every time. Needed when a server holds session state in its own memory.
Why?
Login session stored in server A's memory → next request lands on B
→ "not logged in" → broken.
How:
1) IP hash — pin by client IP. But thousands behind one NAT IP skew it.
2) Cookie-based — the LB injects a cookie naming the server (e.g. AWSALB).
Most accurate. Unaffected by NAT.
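Option 1 (IP hash) can be sketched as follows, assuming a stable digest of the client IP taken modulo the server count. The function name and the synthetic client addresses are hypothetical. The second half measures the remapping cost noted earlier under IP hash: how many clients land on a different server once a fourth server is added.

```python
import hashlib
from typing import List


def pick_by_ip(client_ip: str, servers: List[str]) -> str:
    """Map a client IP to one server; same IP and same list give the same answer."""
    # Use a stable digest: Python's built-in hash() is salted per process for str.
    digest = hashlib.sha256(client_ip.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]


# 1000 synthetic client addresses (illustrative only).
clients = [f"10.0.{i // 256}.{i % 256}" for i in range(1000)]
before = {ip: pick_by_ip(ip, ["A", "B", "C"]) for ip in clients}
after = {ip: pick_by_ip(ip, ["A", "B", "C", "D"]) for ip in clients}
moved = sum(before[ip] != after[ip] for ip in clients) / len(clients)
print(f"{moved:.0%} of clients changed server after adding D")
```

For a uniform hash, about three quarters of clients are expected to move when going from 3 to 4 servers, since `h % 3 == h % 4` only when `h % 12` is 0, 1, or 2. Every moved client loses its in-memory session, which is the motivation for consistent hashing or for cookie-based pinning.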
Cost:
- Breaks balance (heavy users pile onto one server)
- If that server dies → every session pinned to it is gone
The better path — stateless servers:
Keep sessions in shared storage (Redis/DB) or a JWT, not server memory.
Then sticky is unnecessary → any server can serve → better balance and
fault tolerance.
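The stateless approach can be sketched with two app servers that share one session store. The `SessionStore` class here is a plain dict standing in for Redis or a database, and every name is hypothetical.

```python
import secrets
from typing import Optional


class SessionStore:
    """Stand-in for shared storage such as Redis; here just an in-process dict."""

    def __init__(self):
        self._data = {}

    def create(self, user: str) -> str:
        session_id = secrets.token_urlsafe(16)  # unguessable session ID
        self._data[session_id] = {"user": user}
        return session_id

    def get(self, session_id: str) -> Optional[dict]:
        return self._data.get(session_id)


class AppServer:
    def __init__(self, name: str, store: SessionStore):
        self.name = name
        self.store = store  # the server keeps no session state of its own

    def login(self, user: str) -> str:
        return self.store.create(user)

    def whoami(self, session_id: str) -> Optional[str]:
        session = self.store.get(session_id)
        return session["user"] if session else None


store = SessionStore()
server_a, server_b = AppServer("A", store), AppServer("B", store)
sid = server_a.login("alice")   # login handled by A
print(server_b.whoami(sid))     # alice: B can serve the next request
```

Since either server can resolve the session, the balancer is free to use any algorithm, and losing server A loses no sessions.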
Practical advice: design toward eliminating stickiness when you can.
Making the LB Itself Highly Available
An LB that is a single point of failure defeats the purpose, so the LB itself must be made redundant.
active-passive (failover):
LB1(active) + LB2(standby) share one virtual IP (VIP).
- VRRP/keepalived heartbeat between them
- active dies → passive takes over the VIP → traffic continues
- standby usually sits idle (cost)
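The takeover logic on the standby can be sketched as a heartbeat timeout. This is a toy model with explicit timestamps, not VRRP itself: the 3-second timeout and all names are assumptions, and failback to the original active node is not modelled.

```python
class StandbyLB:
    def __init__(self, dead_after: float = 3.0):
        self.dead_after = dead_after  # assumed seconds of silence before takeover
        self.last_heartbeat = 0.0
        self.holds_vip = False

    def on_heartbeat(self, now: float) -> None:
        """Called whenever a heartbeat from the active LB arrives."""
        self.last_heartbeat = now

    def tick(self, now: float) -> bool:
        """Periodic check; claim the VIP if the active LB has gone quiet."""
        if not self.holds_vip and now - self.last_heartbeat > self.dead_after:
            self.holds_vip = True  # real systems would now announce the VIP
        return self.holds_vip


standby = StandbyLB(dead_after=3.0)
standby.on_heartbeat(now=1.0)
print(standby.tick(now=2.0))  # False: heartbeat seen 1s ago
print(standby.tick(now=4.5))  # True: 3.5s of silence, standby takes the VIP
```

The timeout trades detection speed against false takeovers: too short and a brief network hiccup causes both nodes to claim the VIP.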
active-active:
Multiple LBs take traffic at once. Needs upstream distribution →
usually DNS round-robin or anycast.
anycast:
The same IP is advertised by LBs in many regions at once (BGP).
The network routes to the "nearest" LB automatically.
- A region's LB/link dies → BGP reconverges to the next nearest
- Lower latency + fault isolation. The standard for CDNs/large services.
The Whole Picture
              DNS (returns one anycast IP)
                            │
              ┌─────────────┼─────────────┐
        [LB region A] [LB region B] [LB region C]  ← anycast, nearest wins
                            │
  L7 routing + health checks + (no sticky if possible)
                            │
                       ┌────┼────┐
                      srv  srv  srv  ← stateless, sessions in Redis/JWT
                            │
                shared session store / DB
Common Pitfalls
- Health check too deep — a briefly slow DB drops every server into unhealthy and wipes you out. Separate liveness (am I alive) from readiness (ready to take traffic).
- Relying on sticky, then a server dies — every pinned session evaporates. Go stateless if you can; otherwise put sessions in shared storage.
- least-connections when connection ≠ load — some requests hold a connection but sit idle (websockets), others are short but CPU-heavy. Verify connection count reflects real load.
- Removing a server without graceful shutdown — in-flight requests get cut. Add a "draining" phase on deploy (stop new requests, let existing ones finish).
- Thundering herd on recovery — a server comes back and all health checks mark it healthy at once → traffic floods it → it dies again. Ease it with slow start (gradual weight ramp).
- Confusion over TLS termination point — terminate at L7 and the LB→server hop may be plaintext. Decide by policy whether to re-encrypt internally.
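The slow-start remedy from the thundering-herd item can be sketched as a linear weight ramp, assuming the balancer already supports per-server weights. The 30-second ramp and the function name are hypothetical.

```python
def slow_start_weight(seconds_since_healthy: float,
                      full_weight: int,
                      ramp_seconds: float = 30.0) -> int:
    """Weight to give a server that was marked healthy some seconds ago."""
    if seconds_since_healthy >= ramp_seconds:
        return full_weight
    fraction = max(seconds_since_healthy, 0.0) / ramp_seconds
    # Never return 0, otherwise the server would get no traffic to warm up on.
    return max(1, int(full_weight * fraction))


for t in (0, 15, 30, 60):
    print(t, slow_start_weight(t, full_weight=100))
# 0 1
# 15 50
# 30 100
# 60 100
```

Feeding this weight into a weighted algorithm lets caches and connection pools warm up before the server carries its full share.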
Wrap-up
A load balancer isn't just "spread the requests" — it's a bundled decision over L4/L7, algorithm, health checks, session strategy, and its own redundancy. Hold onto one principle: the more stateless your servers, the simpler and stronger the LB. Then sticky sessions — and "that server died, the sessions are gone" — simply disappear.