A 100 TB database. Write load exceeds one node. Replication alone won't help — every replica holds the same data. Solution: sharding — split data across nodes. Hash vs range, consistent hashing, hot shards, cross-shard joins — this guide covers the mechanics and pitfalls.
Sharding vs Replication — Different Problems
Replication: same data copied to N nodes
- Availability ↑
- Read throughput ↑ (read replicas)
- Write throughput = single-node limit (all writes go to leader)
Sharding: different data on different nodes
- Write throughput ↑ (N nodes in parallel)
- Storage capacity ↑ (N × disk)
- Availability alone is NOT solved by sharding → combine with replication
Real world: shard + replica together (each shard has its own replica set)
e.g. MongoDB sharded cluster = 3 shards × 3 replicas = 9 nodesPartition Strategy 1 — Range-based
Range partition by user_id:
shard-A: user_id 0 ~ 999_999
shard-B: user_id 1M ~ 1.999M
shard-C: user_id 2M ~ ∞
Pros:
- Efficient range scans ("users 1000-2000" hits only shard-A)
- New data (auto-increment) lands on one shard → ordering is excellent
Cons:
- ⚠ HOT SHARD — if only recent users are active, shard-C carries the load
- Imbalance requires splits
Used by: HBase, BigTable, MongoDB (default), DynamoDB (sort key)Partition Strategy 2 — Hash-based
Decide shard by hash(user_id) % N:
hash(user_42) = 0x8a3f... mod 3 = 1 → shard-B
hash(user_43) = 0x2c19... mod 3 = 0 → shard-A
hash(user_44) = 0xf501... mod 3 = 2 → shard-C
Pros:
- Uniform distribution (a property of random hashes)
- Hot keys naturally spread (adjacent IDs land on different shards)
Cons:
- No range scans ("users 1000-2000" must hit every shard)
- ⚠ Resharding trap — N change (3 → 4) moves ~80% of all data
Used by: DynamoDB partition key, Cassandra, MongoDB hashed shard keyConsistent Hashing — The Resharding Fix (1995)
Map hashes onto a ring of 0..2^32:
0
┌───┴───┐
│ shard │
│ A │
⊕ (0~85) │ ← user_X hashes to 200 → next shard is B
│ │
│ shard │
│ B │
│(85~170)│
│ │
│ shard │
│ C │
│(170~255)│
│ │
└───────┘
A key's data lives on the next shard clockwise.
Add a new shard (D at position 128):
- Only part of B (128~170) migrates to D
- Shards A and C are untouched
→ N → N+1 moves roughly 1/N of the data (vs ~80% with mod).
Virtual nodes (each shard occupies multiple ring positions) smooth
distribution further — the standard trick in Amazon Dynamo / Cassandra.Hot Shards — The Most Common Operations Problem
Case 1 — time-based IDs
Shard by created_at month → only this month's data is hot
Case 2 — skewed user distribution
Hash by user_id → a "celebrity user" with 100× traffic
→ their shard alone is hot
Case 3 — auto-increment with range
Inserts always append to the last shard → write hot
Fixes:
- Redesign shard key (composite, salt)
- Special-case celebrities (separate shard or cache) — Twitter's fanout
- Time-based should be "hash + time prefix", not pure daily shards
- AWS DynamoDB's adaptive capacity — auto-adds capacity for hot partitionsRebalancing — Migrating Without Downtime
Steps to move data when adding a shard:
1. Register new shard (routing layer marks it "warming")
2. Identify data to move (just 1/N via consistent hashing)
3. Copy data (old shard → new shard, in the background)
4. Verify (checksum / row count)
5. Routing cut-over (atomic, at one moment)
6. Clean up moved data from the old shard
Pitfalls:
- In-flight writes during cut-over — dual-write or lock
- Stale routing caches → query goes to wrong shard
- Deleting old data early → can't roll back a failed cut-over
- Background copy hits production load (rate limit required)
DBMS handles automatically: MongoDB chunks, Vitess vreplication, CockroachDB rebalancingCross-Shard Joins — Here Be Dragons
-- Natural on a single DB
SELECT u.name, o.total
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.region = 'KR';
-- Sharded:
users spread across shard-A, B, C
orders spread across shard-X, Y, Z (possibly different shard key)
→ Where does the JOIN happen?
- Pick one shard, fetch from others → that node's memory blows up
- Partial results from all shards → coordinator merges → coordinator hot
- App splits into two queries → in-memory merge (complex, slow)
Patterns to avoid:
1. Denormalize — copy frequently-joined fields ahead of time (small data)
2. Co-locate — same shard key for users and orders (user_id) → same shard
3. Global table — replicate small reference tables (region, category) on every shard
4. Materialized view — pre-computed join stored separately
→ "Sharding makes JOIN expensive" is the biggest ongoing operational burden.Real Systems
| System | Strategy | Notes |
|---|---|---|
| DynamoDB | Hash (partition key) | Adaptive capacity mitigates hot partitions |
| Cassandra | Consistent hashing + virtual nodes | Model of linear scaling |
| MongoDB | Range or hashed | Auto split / rebalance by chunk |
| Vitess (YouTube) | Range (sharded MySQL) | Sharding layer atop traditional MySQL |
| CockroachDB | Range (per-key) | Auto-split + auto-rebalance |
| Spanner | Range (per-tablet) | Split based on load |
Common Pitfalls
- Sharding too early — single DB under 100 GB is usually fine. Operational complexity is the biggest cost of premature sharding.
- Wrong shard key — changing it means moving all data. Shard key decisions are nearly irreversible.
- Assuming cross-shard transactions — 2PC adds latency + partial-failure risk. Design transactions to live in a single shard.
- Auto-increment ID + sharding — all writes hit the last shard. Use UUID v4 or snowflake IDs.
- Monitoring difficulty — per-shard load metrics are mandatory. One hot shard affects overall latency.
Wrap-up
Sharding is the canonical horizontal-scaling tool but expensive operationally. Push single DB + replication as far as you can, then shard. Once you do, joins and transactions cost more forever.
Consistent hashing + virtual nodes + replica per shard is the modern distributed-DB default — DynamoDB / Cassandra / CockroachDB are all variants.