N machines need to agree on the same value. Some die, some are slow, and the network drops packets. Deciding "what is true" in this setting is consensus. etcd / Consul / ZooKeeper / Kafka controller / Spanner — the foundation of every distributed system. This guide covers Paxos · Raft, the FLP impossibility, and why Raft won in practice.
The Problem Consensus Solves
5 servers: A, B, C, D, E
Client requests "set x = 42".
Naive: only one accepts — that server dies, data lost
All accept? — sync cost ↑, one slow node stalls everything
Majority (3/5)? → the core idea of consensus
Requirements:
- Safety: no two nodes agree on different values (no split-brain)
- Liveness: a decision is reached eventually (1 node fail OK)FLP Impossibility (1985)
Fischer-Lynch-Paterson theorem: in a fully asynchronous environment, even one failure makes consensus impossible to guarantee.
"Is this server dead or slow?" — in async, indistinguishable
→ Assume dead, proceed? → it was alive, both decide → split-brain
→ Assume alive, wait? → stuck forever (lose liveness)
Resolution: assume partial synchrony + timeouts
- Usually async, sometimes the sync assumption breaks
- Timeouts treat slow as dead — false positives OK
- Safety preserved, liveness relaxed to "eventually progresses"Paxos (1990, Lamport)
The first formal consensus algorithm. Two phases (Prepare → Accept), a 2-phase commit pattern.
Proposer (proposes a value) → Acceptors (majority must agree)
Phase 1 — Prepare:
Proposer sends prepare(n) with ballot number n to all acceptors
Acceptor: if n > all previously seen n', reply "promise(n)"
+ include previously accepted value v if any
Phase 2 — Accept:
Upon majority of promises → decide on a value
(the prior accepted v if any; otherwise own value)
Send accept(n, value) to all acceptors
Majority accepts → decided
Characteristics:
- Two phases × two RTTs — high latency
- Concurrent proposers = livelock (ballot escalation race)
- Conceptually simple, hard to understand (Lamport's own "part-time
parliament" paper was so abstract that follow-ups had to re-explain it)Multi-Paxos / Raft — Practical Evolution
Running Paxos for every decision is slow. One leader, briefly maintained, handles many decisions → Multi-Paxos. Raft (2014) restructures the same idea to be easier to understand.
Raft's Three Phases
1. Leader Election
- Every follower has an election timeout (150-300 ms random)
- If the leader's heartbeat doesn't arrive → become candidate
- Vote for self + request votes from others
- On majority votes → leader, new term begins
2. Log Replication
- Client request → leader appends to its own log
- AppendEntries RPC to followers (heartbeat included)
- On majority ack → commit + apply to state machine
- Inform followers "committed"
3. Safety (no split-brain)
- New leader must contain all committed entries of prior leaders
(vote-time check)
- Only one leader per termQuorum — The Magic of Majority
5-node cluster:
- write quorum: 3/5 (majority)
- read quorum: 3/5 (optional, usually read from leader only)
Why majority?
Two majority subsets must overlap in at least one server
→ the same server sees both decisions → consistency by construction
Recommended cluster size: odd
- 3 nodes: tolerate 1 failure
- 5 nodes: tolerate 2 failures
- 7 nodes: tolerate 3 failures (but commit latency ↑)
Even sizes work but "majority" is ambiguous (a 4/8 vs 4/8 split is possible)Split-Brain — The Most Dangerous Failure
network partition:
[A, B, C] ←→ [D, E] (cut in the middle)
If both partitions elect leaders → split-brain
- [A,B,C] leader decides x=1
- [D,E] leader decides x=2
- On partition heal → which is correct?
Prevention:
- Only majority (3/5 or 2/3) may be leader
- [D,E] has only 2 → no majority → no leader
- Only [A,B,C] progresses
Cost: the minority partition can't progress (availability loss, CAP's P)Real Systems
| System | Algorithm | Use |
|---|---|---|
| etcd | Raft | Kubernetes metadata store |
| Consul | Raft | Service discovery + config |
| ZooKeeper | Zab (Paxos variant) | Kafka controller, Hadoop coordination |
| Kafka 3.3+ | KRaft (Raft itself) | Eliminate ZooKeeper dependency |
| CockroachDB | Raft (per-range) | One Raft group per data range |
| Spanner | Paxos | Global database |
Why Raft Beat Paxos in Practice
- Understandability — Paxos leaves you wondering "did I really get it?". Raft splits into explicit phases (leader / log / safety) with a gentler curve. Raft's paper is literally titled "In Search of an Understandable Consensus Algorithm".
- Simpler implementation — Multi-Paxos implementation details (ballots, optimizations) vary by vendor. Raft has a clear spec.
- Open-source reference — etcd / hashicorp/raft etc. are production-grade and easy to borrow from.
Common Pitfalls
- 3-node cluster risk — survives 1 failure (quorum 2/3), but 2 simultaneous failures (e.g. same AZ outage) take everything down. Distribute across AZs.
- Leader latency caps throughput — every write goes through the leader for quorum. Read replicas scale reads; writes are bottlenecked by a single leader.
- Cluster membership changes are tricky — needs joint consensus or similar. Done wrong = split-brain.
- Log size growth — snapshots compact. But installing snapshots is also IO-heavy.
- Clock skew assumptions — Raft is fine with NTP-level (tens of ms). Sub-ms clocks like Spanner's TrueTime are a separate domain.
Wrap-up
Consensus is the atomic primitive of distributed systems — every consistency guarantee builds on it. Paxos came first; Raft became mainstream by being understandable.
Consensus cost determines the cost of every distributed write. That's why practical designs split: some operations strong, the rest eventual.