Data at Scale
Replication, sharding, the CAP theorem, and choosing consistency vs availability.
Once one database can’t keep up — too many reads, too many writes, or too much data — you spread data across machines. That buys capacity but forces hard trade-offs about correctness.
When you shard across nodes, how you map keys to nodes matters enormously. Add or remove a node below and compare a consistent hashing ring (only ~1/N of keys move) against naive modulo (almost everything remaps, wrecking cache locality).
Replication (more reads, more safety)
Keep copies of the data on several machines. A common setup is leader–follower: writes go to the leader, which streams changes to followers that serve reads. This scales reads and survives a machine failing — but followers can lag, so a read might return slightly stale data (replication lag).
Sharding (more writes, more data)
Sharding (partitioning) splits the data itself across machines by a shard key — e.g. users A–M on one shard, N–Z on another. Now writes and storage scale horizontally. The costs: queries spanning shards are expensive, a bad key creates hot spots, and re-sharding later is painful. Choose the key carefully.
The CAP theorem
When the network partitions (machines can’t talk), a distributed store can guarantee only two of: Consistency, Availability, Partition tolerance. Since partitions will happen, you’re really choosing, during a partition, between:
- CP — refuse some requests to avoid returning wrong data (banking).
- AP — stay available and reconcile later, accepting temporary disagreement (shopping carts, social feeds).
Consistency models
A spectrum, not a switch:
- Strong — every read sees the latest write (simplest to reason about, costliest).
- Eventual — replicas converge eventually; reads may be stale for a while (cheap, highly available).
Pick per use case: a bank balance wants strong; a like-count tolerates eventual.
Takeaways
- Replication scales reads and adds fault tolerance, at the cost of staleness.
- Sharding scales writes and storage, at the cost of cross-shard queries and key design.
- CAP forces a C-vs-A choice under partition; pick the consistency model per data.