Pig
Pig Latin is a SQL-like data-flow language that runs on top of Hadoop: a query is written as a sequence of transformation steps rather than as a single declarative statement.
Pig Latin
- the data model is supplied through UDFs (user-defined functions), e.g. functions that load and parse the input
- first, a query plan is generated from the script
- then the plan is compiled into a set of MapReduce (MR) jobs
- some optimizations are applied
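To make the compilation target concrete, here is a minimal Python sketch of what a single MR job for a grouped sum might look like. It only imitates the map, shuffle, and reduce phases in memory. The function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample records are hypothetical and are not part of Pig or Hadoop.

```python
from collections import defaultdict

# Hypothetical input records: (city, sale) pairs.
records = [("Berlin", 10), ("Paris", 5), ("Berlin", 7)]


def map_phase(rows):
    """Emit one (key, value) pair per input record."""
    for city, sale in rows:
        yield city, sale


def shuffle(pairs):
    """Collect all values per key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Aggregate the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


totals = reduce_phase(shuffle(map_phase(records)))
print(totals)  # {'Berlin': 17, 'Paris': 5}
```

A Pig Latin script with several grouping or joining steps would typically need more than one such job, which is why the plan is compiled into a set of jobs.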
Example
SQL:
SELECT SUM(s.Sale), c.City
FROM Sales s, Cities c
WHERE s.AddrId = c.AddrId
GROUP BY City;
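Pig Latin:
-- 1: collect the matching Sales and Cities tuples for each AddrId
tmp = COGROUP Sales BY AddrId,
              Cities BY AddrId;
-- 2: flatten both bags, which yields the join
--    (the alias is "joined" because JOIN is a reserved word)
joined = FOREACH tmp GENERATE
         FLATTEN(Sales), FLATTEN(Cities);
-- 3: group the joined tuples by city
grp = GROUP joined BY City;
-- 4: one output row per city: total sales and the city (the group key)
res = FOREACH grp GENERATE SUM(joined.Sale), group;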
In Pig, FOREACH $\approx$ Map: the GENERATE expression is applied to every tuple of the relation independently.
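The four numbered steps can be traced on a toy data set. The following Python sketch is only an in-memory imitation of COGROUP, FLATTEN, GROUP, and SUM, not how Pig executes them. The sample rows and all variable names are hypothetical.

```python
from collections import defaultdict

# Hypothetical sample relations, as lists of dicts.
sales = [
    {"AddrId": 1, "Sale": 10},
    {"AddrId": 2, "Sale": 5},
    {"AddrId": 1, "Sale": 7},
]
cities = [
    {"AddrId": 1, "City": "Berlin"},
    {"AddrId": 2, "City": "Paris"},
]

# 1. COGROUP: for each AddrId, build one bag of Sales and one bag of Cities.
cogrouped = defaultdict(lambda: {"Sales": [], "Cities": []})
for s in sales:
    cogrouped[s["AddrId"]]["Sales"].append(s)
for c in cities:
    cogrouped[c["AddrId"]]["Cities"].append(c)

# 2. FOREACH ... FLATTEN: cross product of the two bags per key, i.e. a join.
joined = [
    {"Sale": s["Sale"], "City": c["City"]}
    for bags in cogrouped.values()
    for s in bags["Sales"]
    for c in bags["Cities"]
]

# 3. GROUP BY City.
grouped = defaultdict(list)
for row in joined:
    grouped[row["City"]].append(row)

# 4. FOREACH ... SUM: one output value per group.
result = {city: sum(r["Sale"] for r in rows) for city, rows in grouped.items()}
print(result)  # {'Berlin': 17, 'Paris': 5}
```

Steps 2 and 4 each touch one element at a time (one cogroup entry, one group), which is the sense in which FOREACH resembles a map.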
Links
- http://www.slideshare.net/jayshao/introduction-to-apache-pig
- Official website: http://pig.apache.org/
- Process your data with Apache Pig: http://www.ibm.com/developerworks/linux/library/l-apachepigdataquery/ (in Russian)