(article needs improving)
Are better for storing large amounts of data, especially when the number of columns is very large
- Sets of columns are stored together, so a particular record is actually split across several blocks
- Within each block data is stored in sorted order
- Need to maintain "join index" - to pull together different blocks that are for the same record
- These column-oriented databases are especially good for OLAP
- Tables are distributes across different servers
- A table is broken into many tablets, each containing multiple rows (regions in HBase)
- Each tablet is broken into column families, each containing set of columns
- Each column family spans multiple rows - and stored in chunks of the Distributed File System
- Each chunk is served by a tablet server (which also takes care of replicas)
In order to access any particular record/column, we need to know which tablet server has them
For that purpose there is a Metadata Table which knows on what tablet server a particular record is.
- Search starts from the Root Tablet
- We want to find columns for a particular row
- We look up the root tablet which gives us Child Tablets that contain record's data.
- Then we get actual chunks where data is stored
- The row is indexed by a key (transaction ID)
- Column families have multiple columns, and columns are stored within single chunks
- Each Column Family may be stored on different chunk servers
- The number of column families is fixed, but the number of columns isn't - you may create as many as you want.
- Additionally each column combination can be timestamped.
- For example, today the region is one, but tomorrow is another, but you change it not by updating the value, but * rather by inserting a new value with the current timestamp (so versioning is done automatically)
- Because these DBs rely on DFS - they can do large parallel reads and inserts efficiently.
- For aggregation queries it's very fast to get results this way
- there is only one key - so you cannot access a record by any value other than id, there is no index for this.
- If you really need that, you'll have to traverse all the data.
But it can be overcome by adding a special table with indexes (it's used by Google App Engine) -- Add details?