ML Wiki
Machine Learning Wiki - A collection of ML concepts, algorithms, and resources.

Pig

Pig Latin is a SQL-like data-flow language that runs on top of Hadoop.

Pig Latin

  • custom data handling is supplied in the form of UDFs (user-defined functions)
  • first, Pig generates a query plan
  • then it compiles the plan into a set of MapReduce (MR) jobs
  • some optimizations are applied
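As a rough illustration of what "compiling into MR jobs" means, the sketch below runs a single group-and-sum step as one map, shuffle, and reduce pass in plain Python. It is not Pig's actual compiler: the function names (`map_phase`, `shuffle`, `reduce_phase`) and the sample rows are made up for this example.

```python
from collections import defaultdict

def map_phase(rows):
    # Map: emit a (key, value) pair for every input row.
    return [(city, sale) for city, sale in rows]

def shuffle(pairs):
    # Shuffle: collect all values that share the same key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate the values of each key.
    return {key: sum(values) for key, values in groups.items()}

# Hypothetical (City, Sale) rows, assumed for illustration only.
rows = [("Berlin", 10), ("Paris", 5), ("Berlin", 7)]
totals = reduce_phase(shuffle(map_phase(rows)))
print(totals)
```

A longer script would typically become a chain of such jobs, with the output of one job feeding the next.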

Example

SQL:

SELECT SUM(s.Sale), c.City 
FROM Sales s, Cities c
WHERE s.AddrId = c.AddrId
GROUP BY City;

Pig Latin:

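-- 1: co-group both relations on the join key
tmp = COGROUP Sales BY AddrId,
              Cities BY AddrId;
-- 2: flatten the two bags to obtain the joined tuples
joined = FOREACH tmp GENERATE
         FLATTEN(Sales), FLATTEN(Cities);
-- 3: group the joined tuples by city
grp = GROUP joined BY City;

-- 4: sum the sales within each city
res = FOREACH grp GENERATE SUM(joined.Sale), group AS City;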

In Pig, FOREACH $\approx$ Map: its expression is applied to each tuple independently.
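To make the semantics of the four steps concrete, here is a minimal Python sketch that mimics them on in-memory lists. The sample rows and the helper `cogroup` are assumptions made for this illustration; real Pig operates on distributed relations, not Python lists.

```python
from collections import defaultdict

# Hypothetical sample data: Sales as (AddrId, Sale), Cities as (AddrId, City).
sales = [(1, 10.0), (2, 5.0), (1, 7.0), (3, 2.0), (4, 1.0)]
cities = [(1, "Berlin"), (2, "Paris"), (3, "Berlin")]

def cogroup(left, right):
    # COGROUP: for every key, keep one bag of matching tuples per relation.
    bags = defaultdict(lambda: ([], []))
    for row in left:
        bags[row[0]][0].append(row)
    for row in right:
        bags[row[0]][1].append(row)
    return [(key, l_bag, r_bag) for key, (l_bag, r_bag) in bags.items()]

# 1: co-group both relations on AddrId
tmp = cogroup(sales, cities)

# 2: FOREACH + FLATTEN: cross product of the two bags for each key;
#    a key with an empty bag (AddrId 4 here) produces no output tuple
joined = [s + c for _, s_bag, c_bag in tmp for s in s_bag for c in c_bag]

# 3: GROUP BY City (City is the last field of a joined tuple)
grp = defaultdict(list)
for row in joined:
    grp[row[-1]].append(row)

# 4: FOREACH group: sum the Sale field (index 1) of its tuples
res = {city: sum(row[1] for row in rows) for city, rows in grp.items()}
print(res)
```

Steps 2 and 4 process each input tuple on its own, which is why FOREACH behaves like a map; steps 1 and 3 are the ones that need tuples with the same key brought together.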

See also

  • http://www.slideshare.net/jayshao/introduction-to-apache-pig
  • Official website: http://pig.apache.org/
  • Process your data with Apache Pig: http://www.ibm.com/developerworks/linux/library/l-apachepigdataquery/ (in Russian)

Sources