Skip to content
yutils

How the CAP Theorem Actually Works

What Brewer actually proved (and what marketing pretends he proved), partition tolerance is non-negotiable, the CP vs AP false dichotomy, PACELC as the more honest framework, and which trade-off each real database (DynamoDB / Cassandra / Spanner / etcd) actually picks.

~9 min read

"Our DB is CP" / "We're AP" — common claims, usually inaccurate. What Brewer actually proved is narrower; what marketing claims is something else. PACELC is the more honest framework. This guide covers what CAP really means and how real databases trade off.

What Brewer Actually Proved

CAP theorem (conjectured 2000, proven 2002 by Gilbert & Lynch):

A distributed system cannot simultaneously satisfy all three at 100%:

  C — Consistency (linearizability — same value everywhere at the same instant)
  A — Availability (every request gets a response)
  P — Partition tolerance (works through network partitions)

→ More precisely: in the presence of partition (P), you must give up
  either C or A.

Key points:
- "100% C, 100% A, with P" is impossible
- When there's no partition, you can have both (CAP's trap)

P Is Not Negotiable

Common misreading: "pick CP, AP, or CA". In reality:

P (partition tolerance) is effectively mandatory.
- Networks eventually partition (cable cut, switch fail, AZ outage)
- The only systems without partitions are "single DC, single node"
- So giving up P = giving up distributed systems

→ The real choice, during a partition: give up C or give up A?

CP: refuse responses during partition (preserve consistency)
AP: return stale data during partition (preserve availability)

CP vs AP — A False Dichotomy

Almost no real system is "entirely CP" or "entirely AP" — the trade-off is per-operation, per-configuration:

MongoDB:
- Default: AP-leaning (stale reads OK briefly during primary failover)
- Majority read concern: CP-leaning
- Write concern w=1 vs w=majority: different trade-offs

Cassandra:
- ONE consistency level: AP
- QUORUM: in between
- ALL: CP

PostgreSQL:
- Single node: CAP doesn't really apply
- Async streaming replication: AP (stale replicas possible)
- synchronous_commit=on: CP-leaning

→ Labels like "X is CP" are simplifications of the default config.
   In practice, choose per-operation.

PACELC — The More Honest Framework (Abadi 2010)

PA-CELC:

If Partition (P) — choose Availability (A) or Consistency (C)
Else (E) — choose Latency (L) or Consistency (C)

→ Makes the latency vs consistency trade-off explicit even without
  partitions.

Examples:
- Spanner: PC + EC (strong consistency + accepts latency cost always)
- Cassandra: PA + EL (availability + fast latency, eventual OK)
- DynamoDB: PA + EL (default) or PC + EC (strong-consistency mode)
- MongoDB: PA + EC (availability; consistency in normal times)
- etcd: PC + EC (strong consistency at all times)

What Real DBs Pick

DBPartitionNormal (else)Notes
SpannerC (CP)CTrueTime + global Paxos for strong + near-zero lag
etcd / ZooKeeperC (CP)CMetadata stores — consistency is paramount
CockroachDBC (CP)CNewSQL — Spanner-like, MVCC
MongoDBA (AP) defaultCw=majority + read concern → CP-like
CassandraA (AP)LHigh availability + low latency, eventual by default
DynamoDBA (AP) defaultLStrong-consistency mode is opt-in (costs more)
RiakA (AP)LDirect descendant of the Dynamo paper

Actual Partition Frequency

From Kyle Kingsbury's (Aphyr) Jepsen tests:
  - Partitions happen weekly to a few times a week in cloud envs
  - Usually ms-scale; sometimes minutes
  - Partitions occur even within an AZ (switch failures)

So:
- "Partitions are rare, CP is fine" is risky
- AP choices with short partitions show little user impact
- CP choices = service unavailable during partition — collides with SLAs

Spanner's Trick — TrueTime

Google Spanner is famous for "almost both CAP's C and A".

The secret:
- TrueTime API — GPS + atomic clocks sync all data centers to ±ms
- Every transaction gets a commit timestamp + uncertainty interval
- "Wait until the commit is guaranteed visible"
  → "commit wait" — preserves strong consistency

Cost:
- GPS antennas + atomic clocks per data center
- Transaction latency grows by the commit wait (a few ms)
- Google-only infrastructure → hard to replicate elsewhere

→ CockroachDB, YugabyteDB, etc. mimic the approach (NTP + heuristics).
   True TrueTime accuracy remains Google's alone.

Common Pitfalls

  • "We're CP" / "We're AP" labels — per-operation and per-config differences. Blanket labels are dangerous.
  • Ignoring partitions — "within my data center there are no partitions" is false. Jepsen shows them frequently.
  • Assumed "eventual" time bound — no spec guarantee. Lag can stretch into minutes. App must define stale-data handling explicitly.
  • Ignoring strong-consistency cost — Spanner / etcd write latency is 5-10× of single-node. A costly default for read-heavy workloads.
  • Confusing CAP and ACID — ACID is a single-transaction guarantee; CAP is a distributed-systems trade-off. ACID's C (consistency) ≠ CAP's C (linearizability).

Wrap-up

The real CAP lesson isn't "choose C or A" but "there are trade-offs — choose consciously". PACELC is the better framework — also accounts for latency vs consistency in normal times.

Practical: pick per-operation knobs explicitly (read concern, write concern, consistency level). Don't trust marketing labels; inspect the real configuration.

Back to guides