Skip to content
yutils

How Consensus Algorithms Actually Work

Paxos vs Raft, leader election, quorum (N/2+1), split-brain, the FLP impossibility result, why Raft beat Paxos in practice, and how etcd / Consul / ZooKeeper actually decide what's true.

~10 min read

N machines need to agree on the same value. Some die, some are slow, and the network drops packets. Deciding "what is true" in this setting is consensus. etcd / Consul / ZooKeeper / Kafka controller / Spanner — the foundation of every distributed system. This guide covers Paxos · Raft, the FLP impossibility, and why Raft won in practice.

The Problem Consensus Solves

5 servers: A, B, C, D, E
Client requests "set x = 42".

Naive: only one accepts — that server dies, data lost
All accept? — sync cost ↑, one slow node stalls everything
Majority (3/5)? → the core idea of consensus

Requirements:
- Safety: no two nodes agree on different values (no split-brain)
- Liveness: a decision is reached eventually (1 node fail OK)

FLP Impossibility (1985)

Fischer-Lynch-Paterson theorem: in a fully asynchronous environment, even one failure makes consensus impossible to guarantee.

"Is this server dead or slow?" — in async, indistinguishable
→ Assume dead, proceed? → it was alive, both decide → split-brain
→ Assume alive, wait? → stuck forever (lose liveness)

Resolution: assume partial synchrony + timeouts
- Usually async, sometimes the sync assumption breaks
- Timeouts treat slow as dead — false positives OK
- Safety preserved, liveness relaxed to "eventually progresses"

Paxos (1990, Lamport)

The first formal consensus algorithm. Two phases (Prepare → Accept), a 2-phase commit pattern.

Proposer (proposes a value) → Acceptors (majority must agree)

Phase 1 — Prepare:
  Proposer sends prepare(n) with ballot number n to all acceptors
  Acceptor: if n > all previously seen n', reply "promise(n)"
                             + include previously accepted value v if any

Phase 2 — Accept:
  Upon majority of promises → decide on a value
    (the prior accepted v if any; otherwise own value)
  Send accept(n, value) to all acceptors
  Majority accepts → decided

Characteristics:
- Two phases × two RTTs — high latency
- Concurrent proposers = livelock (ballot escalation race)
- Conceptually simple, hard to understand (Lamport's own "part-time
  parliament" paper was so abstract that follow-ups had to re-explain it)

Multi-Paxos / Raft — Practical Evolution

Running Paxos for every decision is slow. One leader, briefly maintained, handles many decisions → Multi-Paxos. Raft (2014) restructures the same idea to be easier to understand.

Raft's Three Phases

1. Leader Election
   - Every follower has an election timeout (150-300 ms random)
   - If the leader's heartbeat doesn't arrive → become candidate
   - Vote for self + request votes from others
   - On majority votes → leader, new term begins

2. Log Replication
   - Client request → leader appends to its own log
   - AppendEntries RPC to followers (heartbeat included)
   - On majority ack → commit + apply to state machine
   - Inform followers "committed"

3. Safety (no split-brain)
   - New leader must contain all committed entries of prior leaders
     (vote-time check)
   - Only one leader per term

Quorum — The Magic of Majority

5-node cluster:
- write quorum: 3/5 (majority)
- read quorum: 3/5 (optional, usually read from leader only)

Why majority?
  Two majority subsets must overlap in at least one server
  → the same server sees both decisions → consistency by construction

Recommended cluster size: odd
- 3 nodes: tolerate 1 failure
- 5 nodes: tolerate 2 failures
- 7 nodes: tolerate 3 failures (but commit latency ↑)

Even sizes work but "majority" is ambiguous (a 4/8 vs 4/8 split is possible)

Split-Brain — The Most Dangerous Failure

network partition:
[A, B, C]  ←→ [D, E]   (cut in the middle)

If both partitions elect leaders → split-brain
  - [A,B,C] leader decides x=1
  - [D,E] leader decides x=2
  - On partition heal → which is correct?

Prevention:
- Only majority (3/5 or 2/3) may be leader
- [D,E] has only 2 → no majority → no leader
- Only [A,B,C] progresses

Cost: the minority partition can't progress (availability loss, CAP's P)

Real Systems

SystemAlgorithmUse
etcdRaftKubernetes metadata store
ConsulRaftService discovery + config
ZooKeeperZab (Paxos variant)Kafka controller, Hadoop coordination
Kafka 3.3+KRaft (Raft itself)Eliminate ZooKeeper dependency
CockroachDBRaft (per-range)One Raft group per data range
SpannerPaxosGlobal database

Why Raft Beat Paxos in Practice

  • Understandability — Paxos leaves you wondering "did I really get it?". Raft splits into explicit phases (leader / log / safety) with a gentler curve. Raft's paper is literally titled "In Search of an Understandable Consensus Algorithm".
  • Simpler implementation — Multi-Paxos implementation details (ballots, optimizations) vary by vendor. Raft has a clear spec.
  • Open-source reference — etcd / hashicorp/raft etc. are production-grade and easy to borrow from.

Common Pitfalls

  • 3-node cluster risk — survives 1 failure (quorum 2/3), but 2 simultaneous failures (e.g. same AZ outage) take everything down. Distribute across AZs.
  • Leader latency caps throughput — every write goes through the leader for quorum. Read replicas scale reads; writes are bottlenecked by a single leader.
  • Cluster membership changes are tricky — needs joint consensus or similar. Done wrong = split-brain.
  • Log size growth — snapshots compact. But installing snapshots is also IO-heavy.
  • Clock skew assumptions — Raft is fine with NTP-level (tens of ms). Sub-ms clocks like Spanner's TrueTime are a separate domain.

Wrap-up

Consensus is the atomic primitive of distributed systems — every consistency guarantee builds on it. Paxos came first; Raft became mainstream by being understandable.

Consensus cost determines the cost of every distributed write. That's why practical designs split: some operations strong, the rest eventual.

Back to guides