Pig

Pig Latin is a SQL-like data-flow language that runs on top of Hadoop. Unlike SQL, which is declarative, a Pig Latin script spells out the sequence of transformation steps.

Pig Latin

  • needs a data model, supplied in the form of UDFs (user-defined functions)
  • Pig first generates a query plan from the script
  • then compiles the plan into a set of MapReduce (MR) jobs
  • some optimizations are applied
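
The compilation step can be illustrated with a deliberately simplified Python sketch. It assumes a linear plan and the rule of thumb that every GROUP or COGROUP needs its own shuffle, so each one starts a new MR job; operators before the first grouping run in the map phase, operators after a grouping run in that job's reduce phase. The function `split_into_jobs` and the plan representation are hypothetical, chosen for illustration only, and do not reflect Pig's actual planner or its optimizations.

```python
GROUPING = {"GROUP", "COGROUP"}  # operators that require a shuffle


def split_into_jobs(plan):
    """Split a linear list of operator names into MapReduce jobs.

    Simplifying assumption: each (CO)GROUP becomes the shuffle of one job.
    """
    jobs = [{"map": [], "shuffle": None, "reduce": []}]
    for op in plan:
        job = jobs[-1]
        if op in GROUPING:
            if job["shuffle"] is not None:
                # The current job already has a shuffle: start a new job.
                job = {"map": [], "shuffle": None, "reduce": []}
                jobs.append(job)
            job["shuffle"] = op
        elif job["shuffle"] is None:
            job["map"].append(op)      # before the shuffle: map side
        else:
            job["reduce"].append(op)   # after the shuffle: reduce side
    return jobs


# Hypothetical plan: load, co-group, per-tuple step, group, per-tuple step.
plan = ["LOAD", "COGROUP", "FOREACH", "GROUP", "FOREACH"]
jobs = split_into_jobs(plan)
for i, job in enumerate(jobs, start=1):
    print(i, job)
```

Under these assumptions the plan yields two jobs: the first shuffles on the COGROUP, the second on the GROUP, and each FOREACH is folded into the reduce phase of the preceding grouping.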


Example

SQL:

SELECT SUM(s.Sale), c.City 
FROM Sales s, Cities c
WHERE s.AddrId = c.AddrId
GROUP BY City;


Pig Latin:

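-- 1: co-group both relations on the join key
tmp = COGROUP Sales BY AddrId,
              Cities BY AddrId;
-- 2: flattening both bags yields the (inner) join;
--    the alias is "joined" because JOIN is a reserved word
joined = FOREACH tmp GENERATE
         FLATTEN(Sales), FLATTEN(Cities);
-- 3: group the joined tuples by city
grp = GROUP joined BY City;

-- 4: one output tuple per city with its total sales
res = FOREACH grp GENERATE SUM(joined.Sale), group AS City;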
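
To make the four numbered steps above concrete, here is a minimal Python sketch that simulates them on in-memory lists. The toy tuples, the field order (AddrId first), and the helper names `cogroup`, `flatten_both` and `group_by` are all assumptions made for illustration; they are not part of Pig.

```python
from collections import defaultdict

# Hypothetical toy relations; the field layout is an assumption.
sales = [(1, 10.0), (1, 5.0), (2, 7.0), (3, 2.0)]    # (AddrId, Sale)
cities = [(1, "Berlin"), (2, "Paris"), (4, "Rome")]  # (AddrId, City)


def cogroup(left, right):
    """Step 1: one (key, left_bag, right_bag) tuple per key in either input."""
    bags = defaultdict(lambda: ([], []))
    for t in left:
        bags[t[0]][0].append(t)
    for t in right:
        bags[t[0]][1].append(t)
    return [(key, lbag, rbag) for key, (lbag, rbag) in bags.items()]


def flatten_both(cogrouped):
    """Step 2: cross product of the two bags; an empty bag yields nothing."""
    return [s + c for _, sbag, cbag in cogrouped for s in sbag for c in cbag]


def group_by(tuples, index):
    """Step 3: collect tuples into one bag per value of the given field."""
    groups = defaultdict(list)
    for t in tuples:
        groups[t[index]].append(t)
    return list(groups.items())


tmp = cogroup(sales, cities)
joined = flatten_both(tmp)   # tuples of (AddrId, Sale, AddrId, City)
grp = group_by(joined, 3)    # group on the City field
# Step 4: per-group aggregation, i.e. SUM(Sale) for every city.
res = [(sum(t[1] for t in bag), city) for city, bag in grp]
print(sorted(res))
```

AddrId 3 has no matching city and AddrId 4 has no sales, so both disappear in step 2. This is why flattening both bags of a co-group behaves like an inner join.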

In Pig, FOREACH $\approx$ Map: it applies the GENERATE expressions to each tuple independently.

