Hive
Hive is a Data Warehouse solution built on top of Hadoop
- Main feature - Hive QL: Declarative Query language for ad-hoc analytics
Hive Data Model
Basic structures:
- Tables
- like tables in RDBs
- each table has a corresponding HDFS directory
- Partitions
- each table has one or more partitions
- Buckets
- data in each partition is divided into buckets
Data type system
- primitives
- int, float, string, date, boolean
- collections
- arrays and maps
- nestable
- can define own data types
Achitecture
(figure source: [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.2637])
Main Components
- External Interfaces
- CLI
- JDBC + Thrift
- Web
- System Catalog
- called Metastore
- contains schemas
- keeps statistics - like in DBMS
- enables optimization techniques (but only naive rule-based optimizations)
- Driver that manages HiveQL queries, contains
- optimizer
- executor (which executes the plan in Topological Ordering)
- Hadoop as the execution engine
Query Execution
- a query plan consists of several MapReduce jobs
- results of each job is stored (materialized) on HDFS
- and the results are consumed by the next job in the graph
- so a job that depends on some other job must wait until it finishes
- it cannot start until all results are materialized to disk, i.e. no Pipelining| | |
Hive QL
Hive Query Language is a SQL-like declarative query language for ad-hoc queries
Main Features
- it compiles into a DAG of MapReduce jobs that are executed in Hadoop
- also can plug custom MapReduce scripts
Example
Suppose we have the following tables:
- status_update(user_id int, status string, ds string)
- ’'’ds’’’ is date
- profiles(userid int, school string, gender int)
To load data into a table we use
LOAD DATA LOCAL INPATH 'logs/status_updates'
INTO TABLE status_updates
PARTITION (ds='2009-03-20')
In this query we want to partition our table by date
Query 1
Compute daily statistics on how often a status is updated based on gender and school
FROM
(SELECT a.status, b.school, g.gender
FROM status_updates a JOIN profiles b
ON (a.userid = b.userid and a.ds = '2009-03-20') subq1
-- groups by gender
INSERT OVERWRITE TABLE gender_summary -- inserts the result into another table
PARTITION (ds='2009-03-20')
SELECT subq1.gender, count(1)
GROUP BY subq1.gender
-- groups by school
INSERT OVERWRITE TABLE school_summary
PARTITION (ds='2009-03-20')
SELECT subq.school, count(1)
GROUP BY subq1.school
note that we have 2 operations in one query
- they are performed in a single scan
Query 2
suppose we want to display top 10 memes per school
text only
REDUCE subq2.school, subq2.meme, subq2.cnt
-- using custom python script
USING 'top10.py' AS (school, meme, cnt)
FROM (
SELECT subq1.school, subq1.meme, count(1) as cnt
FROM
(MAP b.school, a.status
USING 'meme_extractor.py'
AS (school, meme)
FROM status_update a JOIN profiles b
ON (a.userid = b.userid)) subq1
GROUP BY subq1.school, subq1.meme
DISTRIBURE BY school, meme
SORT BY school, meme, cnt desc)
) subq2
See also
Sources
- Thusoo et all, Hive: A Warehousing Solution Over a Map-Reduce Framework (2009). [http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.151.2637]