RDBMSs / Row-Oriented databases

  • Typically have strict schema
  • Declarative query language SQL (excellent for ad-hoc queries, easy joins)
  • Good transactions support (ACID)
  • Algebraic Optimization
  • Caching / Materialized Views
  • Strong Consistency
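The ACID guarantees above can be sketched with SQLite (table and values are illustrative): both updates of a transfer commit together or not at all.

```python
import sqlite3

# Minimal sketch of an ACID transaction in a row-oriented RDBMS (SQLite).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES (1, 100), (2, 50)")

# Transfer 30 from account 1 to account 2: atomic, all-or-nothing.
try:
    with conn:  # opens a transaction; commits on success, rolls back on exception
        conn.execute("UPDATE accounts SET balance = balance - 30 WHERE id = 1")
        conn.execute("UPDATE accounts SET balance = balance + 30 WHERE id = 2")
except sqlite3.Error:
    pass  # on failure the rollback leaves both rows unchanged

balances = dict(conn.execute("SELECT id, balance FROM accounts"))
print(balances)  # {1: 70, 2: 80}
```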

Downsides

  • many services don't require complex ad-hoc querying
  • typically choose consistency over availability (see the CAP Theorem)
  • replication solutions are limited
    • rely on traditional commit protocols (like Two-Phase Commit) to preserve strong consistency
    • but data is not made available until the commit finishes (and the database is back to the consistent state)
    • not an option for systems where network failures are possible!
  • as the volume of data grows, queries become inefficient - not easily scalable
    • need to wait too long for all replicas to finish with commit
  • hard to load-balance
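The blocking behavior of Two-Phase Commit described above can be sketched as a toy coordinator (names are illustrative, not a real protocol library): no participant makes the write visible until every participant has voted yes.

```python
# Toy sketch of two-phase commit. Writes are staged in phase 1 and only
# become visible in phase 2, which is why data stays unavailable until
# the commit finishes on all replicas.

class Participant:
    def __init__(self):
        self.committed = {}   # data visible to readers
        self.staged = {}      # prepared but not yet visible

    def prepare(self, key, value):
        self.staged[key] = value
        return True  # vote "yes"; a real node could vote "no" or time out

    def commit(self, key):
        self.committed[key] = self.staged.pop(key)

    def abort(self, key):
        self.staged.pop(key, None)

def two_phase_commit(participants, key, value):
    # Phase 1: collect votes from all participants
    votes = [p.prepare(key, value) for p in participants]
    # Phase 2: commit only if unanimous; otherwise abort everywhere
    if all(votes):
        for p in participants:
            p.commit(key)
        return True
    for p in participants:
        p.abort(key)
    return False

replicas = [Participant(), Participant(), Participant()]
ok = two_phase_commit(replicas, "x", 42)
print(ok, [p.committed for p in replicas])  # True [{'x': 42}, {'x': 42}, {'x': 42}]
```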

NoSQL

Column-Oriented Databases

Are better for storing large amounts of data, especially when the number of columns is very large

  • Sets of columns are stored together, so a particular record is actually split across several blocks
  • Within each block data is stored in sorted order
  • Need to maintain "join index" - to pull together different blocks that are for the same record
  • Especially good for analytical queries (such as OLAP)
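The layout above can be sketched in a few lines (illustrative, not any particular database's format): each column lives in its own sorted block, a join index maps sorted positions back to record ids, and an analytical query touches only the block it needs.

```python
# Sketch of columnar storage with a join index.
records = [
    {"id": 3, "country": "US", "sales": 120},
    {"id": 1, "country": "DE", "sales": 300},
    {"id": 2, "country": "US", "sales": 80},
]

columns = {}      # column name -> values in sorted order (one "block" per column)
join_index = {}   # column name -> record id at each sorted position
for name in ("country", "sales"):
    pairs = sorted((r[name], r["id"]) for r in records)
    columns[name] = [v for v, _ in pairs]
    join_index[name] = [rid for _, rid in pairs]

# An analytical query (total sales) reads a single block
total = sum(columns["sales"])

# Reassembling one record requires the join index across all blocks
def fetch(rid):
    return {name: columns[name][join_index[name].index(rid)]
            for name in columns}

print(total, fetch(2))  # 500 {'country': 'US', 'sales': 80}
```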

Document-Oriented Databases

The main unit of data is a document - a (typically) self-contained record with all the information at hand

  • no (little) need for joins
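A self-contained document might look like this (field names are illustrative): everything needed to render an order sits in one record, so the "join" against customer and product data is already materialized.

```python
import json

# Sketch of a document-oriented record: one lookup returns the whole order.
order = {
    "_id": "order-1042",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "items": [
        {"sku": "A-1", "qty": 2, "price": 9.99},
        {"sku": "B-7", "qty": 1, "price": 24.50},
    ],
}

# No joins needed: the nested data is right there in the document
total = sum(i["qty"] * i["price"] for i in order["items"])
print(json.dumps(order["_id"]), round(total, 2))  # "order-1042" 44.48
```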

In-Memory DBs

  • Real-time transactions
  • Variety of indexing
  • Complex joins - still possible
  • Not for big data
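SQLite's in-memory mode gives a minimal sketch of these points (schema is illustrative): indexing and complex joins work as usual, but the data lives entirely in RAM and vanishes with the process.

```python
import sqlite3

# Sketch of an in-memory database: full SQL, indexes, and joins over RAM-resident data.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("CREATE TABLE events (user_id INTEGER, kind TEXT)")
db.execute("CREATE INDEX idx_events_user ON events(user_id)")  # secondary index
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ann"), (2, "bob")])
db.executemany("INSERT INTO events VALUES (?, ?)",
               [(1, "login"), (1, "click"), (2, "login")])

# Complex joins are still possible
rows = db.execute("""
    SELECT u.name, COUNT(*) FROM users u
    JOIN events e ON e.user_id = u.id
    GROUP BY u.name ORDER BY u.name
""").fetchall()
print(rows)  # [('ann', 2), ('bob', 1)]
```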

NoSQL features

  • No ACID transactions; usually a weaker consistency model instead (BASE)
  • Simpler API - usually no query language
  • restricted joins (for better efficiency)
  • Ability to horizontally scale "simple operations" throughput over many servers
    • simple = key lookups, reads/writes of one record
  • Ability to replicate and partition data over many servers
  • Efficient use of distributed indexes and RAM for data storage
  • The ability to dynamically add new attributes to data records
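The "simple operations" and dynamic attributes above can be sketched as a toy sharded key-value store (all names are illustrative): keys are hashed over several servers, and records are schemaless dicts that can grow new fields at any time.

```python
import hashlib

# Toy sketch of NoSQL simple operations: key lookups and single-record writes,
# sharded over several nodes by key hash.
N_SERVERS = 4
servers = [{} for _ in range(N_SERVERS)]  # each dict stands in for one node

def shard(key):
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return servers[h % N_SERVERS]

def put(key, record):
    shard(key)[key] = record       # write of one record

def get(key):
    return shard(key).get(key)     # key lookup

put("user:1", {"name": "Ann"})
# Dynamic attributes: a later write can add fields no schema declared up front
rec = get("user:1")
rec["last_login"] = "2013-05-01"
put("user:1", rec)
print(get("user:1"))  # {'name': 'Ann', 'last_login': '2013-05-01'}
```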


Major impact systems

Memcached

Showed that in-memory indexes can be highly scalable, and that it's possible to distribute and replicate objects over multiple nodes

  • main-memory caching service
    • no persistence, replication or fault-tolerance
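The cache-aside pattern memcached popularized can be sketched with a toy in-process stand-in for the client (names and TTL are illustrative): look in the cache first, fall through to the database on a miss, and populate the cache for next time.

```python
import time

# Toy cache-aside sketch: a dict plays the role of a memcached node.
cache = {}  # key -> (value, expiry); no persistence or replication

def cache_get(key):
    entry = cache.get(key)
    if entry and entry[1] > time.monotonic():
        return entry[0]
    return None  # miss or expired

def cache_set(key, value, ttl=60):
    cache[key] = (value, time.monotonic() + ttl)

def load_user(uid, db):
    hit = cache_get(f"user:{uid}")
    if hit is not None:
        return hit                     # served from memory
    value = db[uid]                    # fall through to the database on a miss
    cache_set(f"user:{uid}", value)    # populate the cache for next time
    return value

db = {1: "Ann"}
print(load_user(1, db), load_user(1, db))  # Ann Ann (second call from cache)
```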

Dynamo

Pioneered the idea of eventual consistency as a new way to achieve higher availability and scalability:

  • data fetches are not guaranteed to be up-to-date, but
  • updates are guaranteed to be propagated to all nodes (eventually)
  • DHT (Distributed Hash Table) with replication
  • Reconciliation at read time:
    • writes never fail
    • conflict resolution: last write wins or application specific
  • Configurable Consistency
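Read-time reconciliation with last-write-wins can be sketched as follows (a toy illustration with timestamps, not Dynamo's actual vector-clock scheme): writes succeed on whichever replicas are reachable, and reads pick the newest version and repair the stragglers.

```python
# Toy sketch: writes never fail, conflicts are reconciled at read time (LWW).
replicas = [{}, {}, {}]  # each dict: key -> (timestamp, value)

def write(key, value, ts, reachable):
    # accept the write on whichever replicas are reachable right now
    for r in reachable:
        r[key] = (ts, value)

def read(key):
    # reconcile at read time: the highest timestamp wins
    versions = [r[key] for r in replicas if key in r]
    winner = max(versions)
    for r in replicas:       # read repair: propagate the winner everywhere
        r[key] = winner
    return winner[1]

# A network partition lets two conflicting writes land on different replicas
write("cart", ["book"], ts=1, reachable=replicas[:2])
write("cart", ["book", "pen"], ts=2, reachable=replicas[2:])
print(read("cart"))  # ['book', 'pen'] - and all replicas now agree
```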


BigTable

Showed that persistent record storage could be scaled to thousands of nodes
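BigTable-style scaling rests on range partitioning: rows are kept sorted by key and split into "tablets" that different nodes can serve. A toy sketch of the split logic (sizes and names are illustrative):

```python
import bisect

# Toy sketch of range partitioning: sorted rows split into tablets.
TABLET_SIZE = 3
tablets = [[]]  # each tablet is a sorted list of (row_key, value) pairs

def find_tablet(key):
    # the tablet whose key range covers `key`: last tablet starting at or before it
    for t in reversed(tablets):
        if not t or t[0][0] <= key:
            return t
    return tablets[0]

def put(key, value):
    t = find_tablet(key)
    bisect.insort(t, (key, value))
    if len(t) > TABLET_SIZE:       # split an oversized tablet in half,
        mid = len(t) // 2          # so load can spread over more nodes
        i = tablets.index(t)
        tablets[i:i + 1] = [t[:mid], t[mid:]]

for k in ["d", "a", "c", "b", "e"]:
    put(k, k.upper())

print([[k for k, _ in t] for t in tablets])  # [['a', 'b'], ['c', 'd', 'e']]
```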

