RDBMSs / Row-Oriented databases
- Typically have strict schema
- Declarative query language SQL (excellent for ad-hoc queries, easy joins)
- Good transactions support (ACID)
- Algebraic Optimization
- Caching / Materialized Views
- Strong Consistency
Downsides
- many services don't require complex ad-hoc querying
- typically choose consistency over availability (see the CAP Theorem)
- replication solutions are limited
- use traditional replication algorithms to give strong consistency (like Two-Phase Commit)
- but data is not made available until the commit finishes (and the database is back to the consistent state)
- not an option for systems where network failures are possible!
- as the volume of data grows, queries become inefficient - not easily scalable
- need to wait too long for all replicas to finish with commit
- hard to load-balance
NoSQL
Are better for storing large amounts of data, especially when the number of columns is very large
- Sets of columns are stored together, so a particular record is actually split across several blocks
- Within each block data is stored in sorted order
- Need to maintain "join index" - to pull together different blocks that are for the same record
- Especially good for analytical queries (such as OLAP)
Main unit of data is a document - a self-contained (typically) record with all information at hand
- no (little) need for joins
In-Memory DBs
- Real-time transactions
- Variety of indexing
- Complex joins - still possible
- Not for big data
NoSQL features
- No ACID transactions, usually use weaker concurrency model (BASE)
- Simpler API - usually no query language
- restricted joins (for better efficiency)
- Ability to horizontally scale "simple operations" throughput over many servers
- simple = key lookups, read/write of one record
- Ability to replicate and partition data over many servers
- Efficient use of distributed indexes and RAM for data storage
- The ability to dynamically add new attributes to data records
Major impact systems
Memcached
Showed that in-memory indexes can be highly scalable and it's possible to distribute and replicate objects over multiple nodes
- main-memory caching service
- no persistence, replication or fault-tolerance
Dynamo
Pioneered the idea of eventual consistency as a new way to achieve higher availability and scalability :
- data fetches are not guaranteed to be up-to-date, but
- updates are guaranteed to be propagated to all nodes (eventually)
- DHT (Distributed Hash Table) with replication
- Reconciliation at read time:
- writes never fail
- conflict resolution: last write wins or application specific
- Configurable Consistency
BigTable
Showed that persistend record storage could be scaled to thousands of nodes
Sources