cs.thefarshad
hard

Data at Scale

Replication, sharding, the CAP theorem, and choosing consistency vs availability.

Once one database can’t keep up — too many reads, too many writes, or too much data — you spread data across machines. That buys capacity but forces hard trade-offs about correctness.

When you shard across nodes, how you map keys to nodes matters enormously. Add or remove a node below and compare a consistent hashing ring (only ~1/N of keys move) against naive modulo (almost everything remaps, wrecking cache locality).

3 nodes · 24 keys
CBACCBBAAhash ringkey → next node ↻
If you add one more node now
0%of keys remap
0 of 24 keys move
Each key walks clockwise to the next node. Adding a node steals keys only from one neighbor, so ~1/N of keys move. Yellow-ringed dots are the ones that would remap.
ABC
Consistent hashing keeps remaps small and local when membership changes.

Replication (more reads, more safety)

Keep copies of the data on several machines. A common setup is leader–follower: writes go to the leader, which streams changes to followers that serve reads. This scales reads and survives a machine failing — but followers can lag, so a read might return slightly stale data (replication lag).

Sharding (more writes, more data)

Sharding (partitioning) splits the data itself across machines by a shard key — e.g. users A–M on one shard, N–Z on another. Now writes and storage scale horizontally. The costs: queries spanning shards are expensive, a bad key creates hot spots, and re-sharding later is painful. Choose the key carefully.

The CAP theorem

When the network partitions (machines can’t talk), a distributed store can guarantee only two of: Consistency, Availability, Partition tolerance. Since partitions will happen, you’re really choosing, during a partition, between:

  • CP — refuse some requests to avoid returning wrong data (banking).
  • AP — stay available and reconcile later, accepting temporary disagreement (shopping carts, social feeds).

Consistency models

A spectrum, not a switch:

  • Strong — every read sees the latest write (simplest to reason about, costliest).
  • Eventual — replicas converge eventually; reads may be stale for a while (cheap, highly available).

Pick per use case: a bank balance wants strong; a like-count tolerates eventual.

Takeaways

  • Replication scales reads and adds fault tolerance, at the cost of staleness.
  • Sharding scales writes and storage, at the cost of cross-shard queries and key design.
  • CAP forces a C-vs-A choice under partition; pick the consistency model per data.