How Sharding Actually Works

A 100 TB database. Write load exceeds one node. Replication alone won't help — every replica holds the same data. Solution: sharding — split data across nodes. Hash vs range, consistent hashing, hot shards, cross-shard joins — this guide covers the mechanics and pitfalls.

Sharding vs Replication — Different Problems

Replication: same data copied to N nodes
- Availability ↑
- Read throughput ↑ (read replicas)
- Write throughput = single-node limit (all writes go to leader)

Sharding: different data on different nodes
- Write throughput ↑ (N nodes in parallel)
- Storage capacity ↑ (N × disk)
- Availability alone is NOT solved by sharding → combine with replication

Real world: shard + replica together (each shard has its own replica set)
   e.g. MongoDB sharded cluster = 3 shards × 3 replicas = 9 nodes

Partition Strategy 1 — Range-based

Range partition by user_id:

shard-A: user_id 0     ~ 999_999
shard-B: user_id 1M    ~ 1.999M
shard-C: user_id 2M    ~ ∞

Pros:
- Efficient range scans ("users 1000-2000" hits only shard-A)
- New data (auto-increment) lands on one shard → ordering is excellent

Cons:
- ⚠ HOT SHARD — if only recent users are active, shard-C carries the load
- Imbalance requires splits

Used by: HBase, BigTable, MongoDB (default), DynamoDB (sort key)

Partition Strategy 2 — Hash-based

Decide shard by hash(user_id) % N:

hash(user_42) = 0x8a3f... mod 3 = 1 → shard-B
hash(user_43) = 0x2c19... mod 3 = 0 → shard-A
hash(user_44) = 0xf501... mod 3 = 2 → shard-C

Pros:
- Uniform distribution (a property of random hashes)
- Hot keys naturally spread (adjacent IDs land on different shards)

Cons:
- No range scans ("users 1000-2000" must hit every shard)
- ⚠ Resharding trap — N change (3 → 4) moves ~80% of all data

Used by: DynamoDB partition key, Cassandra, MongoDB hashed shard key

Consistent Hashing — The Resharding Fix (1995)

Map hashes onto a ring of 0..2^32:

           0
       ┌───┴───┐
       │ shard │
       │   A   │
   ⊕  (0~85)   │  ← user_X hashes to 200 → next shard is B
       │       │
       │ shard │
       │   B   │
       │(85~170)│
       │       │
       │ shard │
       │   C   │
       │(170~255)│
       │       │
       └───────┘

A key's data lives on the next shard clockwise.

Add a new shard (D at position 128):
- Only part of B (128~170) migrates to D
- Shards A and C are untouched

→ N → N+1 moves roughly 1/N of the data (vs ~80% with mod).

Virtual nodes (each shard occupies multiple ring positions) smooth
distribution further — the standard trick in Amazon Dynamo / Cassandra.

Hot Shards — The Most Common Operations Problem

Case 1 — time-based IDs
  Shard by created_at month → only this month's data is hot

Case 2 — skewed user distribution
  Hash by user_id → a "celebrity user" with 100× traffic
                   → their shard alone is hot

Case 3 — auto-increment with range
  Inserts always append to the last shard → write hot

Fixes:
- Redesign shard key (composite, salt)
- Special-case celebrities (separate shard or cache) — Twitter's fanout
- Time-based should be "hash + time prefix", not pure daily shards
- AWS DynamoDB's adaptive capacity — auto-adds capacity for hot partitions

Rebalancing — Migrating Without Downtime

Steps to move data when adding a shard:
1. Register new shard (routing layer marks it "warming")
2. Identify data to move (just 1/N via consistent hashing)
3. Copy data (old shard → new shard, in the background)
4. Verify (checksum / row count)
5. Routing cut-over (atomic, at one moment)
6. Clean up moved data from the old shard

Pitfalls:
- In-flight writes during cut-over — dual-write or lock
- Stale routing caches → query goes to wrong shard
- Deleting old data early → can't roll back a failed cut-over
- Background copy hits production load (rate limit required)

DBMS handles automatically: MongoDB chunks, Vitess vreplication, CockroachDB rebalancing

Cross-Shard Joins — Here Be Dragons

-- Natural on a single DB
SELECT u.name, o.total
FROM users u JOIN orders o ON o.user_id = u.id
WHERE u.region = 'KR';

-- Sharded:
users spread across shard-A, B, C
orders spread across shard-X, Y, Z (possibly different shard key)

→ Where does the JOIN happen?
- Pick one shard, fetch from others → that node's memory blows up
- Partial results from all shards → coordinator merges → coordinator hot
- App splits into two queries → in-memory merge (complex, slow)

Patterns to avoid:
1. Denormalize — copy frequently-joined fields ahead of time (small data)
2. Co-locate — same shard key for users and orders (user_id) → same shard
3. Global table — replicate small reference tables (region, category) on every shard
4. Materialized view — pre-computed join stored separately

→ "Sharding makes JOIN expensive" is the biggest ongoing operational burden.

Real Systems

System	Strategy	Notes
DynamoDB	Hash (partition key)	Adaptive capacity mitigates hot partitions
Cassandra	Consistent hashing + virtual nodes	Model of linear scaling
MongoDB	Range or hashed	Auto split / rebalance by chunk
Vitess (YouTube)	Range (sharded MySQL)	Sharding layer atop traditional MySQL
CockroachDB	Range (per-key)	Auto-split + auto-rebalance
Spanner	Range (per-tablet)	Split based on load

Common Pitfalls

Sharding too early — single DB under 100 GB is usually fine. Operational complexity is the biggest cost of premature sharding.
Wrong shard key — changing it means moving all data. Shard key decisions are nearly irreversible.
Assuming cross-shard transactions — 2PC adds latency + partial-failure risk. Design transactions to live in a single shard.
Auto-increment ID + sharding — all writes hit the last shard. Use UUID v4 or snowflake IDs.
Monitoring difficulty — per-shard load metrics are mandatory. One hot shard affects overall latency.

Wrap-up

Sharding is the canonical horizontal-scaling tool but expensive operationally. Push single DB + replication as far as you can, then shard. Once you do, joins and transactions cost more forever.

Consistent hashing + virtual nodes + replica per shard is the modern distributed-DB default — DynamoDB / Cassandra / CockroachDB are all variants.