Hive is a Data Warehouse solution built on top of Hadoop
Basic structures:
Data type system
(figure source: [1])
Main Components
Hive Query Language is a SQL-like declarative query language for ad-hoc queries
Main Features
Suppose we have the following tables:
To load data into a table we use
LOAD DATA LOCAL INPATH 'logs/status_updates' INTO TABLE status_updates PARTITION (ds='2009-03-20')
The table is partitioned by date: this statement loads the file into the partition of status_updates with ds='2009-03-20'
Compute daily statistics on how often a status is updated based on gender and school
FROM (
  SELECT a.status, b.school, b.gender
  FROM status_updates a JOIN profiles b
    ON (a.userid = b.userid AND a.ds = '2009-03-20')
) subq1
-- groups by gender and inserts the result into another table
INSERT OVERWRITE TABLE gender_summary PARTITION (ds='2009-03-20')
  SELECT subq1.gender, count(1) GROUP BY subq1.gender
-- groups by school
INSERT OVERWRITE TABLE school_summary PARTITION (ds='2009-03-20')
  SELECT subq1.school, count(1) GROUP BY subq1.school
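Note that a single query performs two operations: both INSERT clauses read from the same subquery subq1, so the join is computed once and its result feeds both summary tables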
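The idea of sharing one pass over the joined rows can be illustrated with a small in-memory Python analogue. This is only a sketch of the concept, not how Hive executes the query (Hive compiles it into Hadoop jobs); the function name summarize and the tuple layout (status, school, gender) are assumptions made for this example.

```python
from collections import Counter


def summarize(joined_rows):
    """Count rows per gender and per school in a single pass.

    joined_rows: iterable of (status, school, gender) tuples, standing in
    for the rows produced by the subquery (hypothetical layout).
    """
    gender_summary = Counter()
    school_summary = Counter()
    for _status, school, gender in joined_rows:
        # Each row is read once and feeds both aggregates,
        # mirroring the two INSERT clauses over one subquery.
        gender_summary[gender] += 1
        school_summary[school] += 1
    return gender_summary, school_summary
```

Running two separate queries would instead scan and join the input twice, once per summary table.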
Suppose we want to display the top 10 memes per school
-- the reduce step uses a custom Python script
REDUCE subq2.school, subq2.meme, subq2.cnt
USING 'top10.py' AS (school, meme, cnt)
FROM (
  SELECT subq1.school, subq1.meme, count(1) AS cnt
  FROM (
    -- the map step extracts memes from each status with another script
    MAP b.school, a.status
    USING 'meme_extractor.py' AS (school, meme)
    FROM status_updates a JOIN profiles b ON (a.userid = b.userid)
  ) subq1
  GROUP BY subq1.school, subq1.meme
  DISTRIBUTE BY school, meme
  SORT BY school, meme, cnt desc
) subq2
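The notes do not show the contents of top10.py. A minimal sketch of what such a script could look like follows, assuming Hive's default streaming format (one row per line on stdin, columns separated by tabs, results written to stdout in the same format). The helper names top_memes, parse and main are hypothetical. The sketch does not rely on the input order and assumes that all rows of a given school reach the same script instance.

```python
import heapq
import sys
from collections import defaultdict

TOP_N = 10  # number of memes to keep per school


def top_memes(rows, n=TOP_N):
    """Return up to n (school, meme, cnt) tuples per school, highest cnt first."""
    by_school = defaultdict(list)
    for school, meme, cnt in rows:
        by_school[school].append((int(cnt), meme))
    result = []
    for school in sorted(by_school):
        # Highest count first; ties are broken by meme name.
        best = heapq.nsmallest(n, by_school[school], key=lambda p: (-p[0], p[1]))
        result.extend((school, meme, cnt) for cnt, meme in best)
    return result


def parse(lines):
    """Split tab-separated input lines into column lists, skipping blank lines."""
    for line in lines:
        line = line.rstrip("\n")
        if line:
            yield line.split("\t")


def main(stream_in=sys.stdin, stream_out=sys.stdout):
    """Read (school, meme, cnt) rows and write the top memes per school."""
    for school, meme, cnt in top_memes(parse(stream_in)):
        stream_out.write(f"{school}\t{meme}\t{cnt}\n")
```

To use this as the actual script, the file would end with a call to main(). Buffering all rows of a school in memory keeps the sketch short; a streaming variant could exploit sorted input instead.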