Architectures

Evolution of Distributes DBs

Logical multi-processor database design:

  • Shared memory (easiest to program, but most expensive)
  • Shared disks (easier)
  • Shared nothing (hard: MapReduce, etc)

parallel-arhitectures.png

Shared Nothing Architecture

  • each processing unit (node, process, thread) is independent and self-sufficient
  • it has its own memory and storage
  • these is no single point of connection in the system
  • allows individual servers to fail (with proper replication)
  • data records are distributed by messaging


Assumptions

Assumptions that we make about distributes systems

Data size

  • the amounts of data is so big to fit on one node
  • and even on a single rack
  • $\to$ therefore we need to partition the data across many nodes

Reliability

  • the system must be highly available to serve all applications
  • nodes may occasionally crash
  • but data must be safe
  • $\to$ therefore we need to replicate each row to multiple nodes and remain available despite failures

Performance

  • for real-time use
  • 95/99 percentile is more important than the average latency (we care about longest latency measures)
  • want it run on cheap commodity hardware
  • $\to$ need to be able to maintain low latency even during recovery operations


Design Principles

Partitioning / Incremental Scalability

  • Scale out one node at a time with minimal impact (with techniques like Consistent Hashing)

Symmetry

  • no special role node (with extra responsibilities)
  • simplifies maintenance

Decentralization

  • extension of Symmetry:
  • favor peer-to-peer techniques over centralized control

Heterogeneity

  • work distribution must be proportional to the capabilities of individual servers
  • don't need to upgrade old servers when adding a newer one

Concurrency Control

Main Article: Concurrency Control

Sources


See also