Pig

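Pig Latin is a SQL-like query language that runs on top of Hadoop. Unlike SQL, a script is written as a sequence of data-flow steps rather than as a single declarative statement.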

Pig Latin

needs the data model to be supplied in the form of UDFs (user-defined functions)
first it generates a query plan
then it compiles the plan into a set of MapReduce (MR) jobs
some optimizations are applied
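
To make the "compiles into a set of MR jobs" step more concrete, here is a toy Python sketch of the kind of MapReduce job a grouped sum could turn into. It only illustrates the map, shuffle, and reduce phases; it is not Pig's actual compiler output, and the function names and sample records are made up for the example.

```python
from collections import defaultdict

def map_phase(records):
    # Emit (key, value) pairs: the grouping key and the value to aggregate.
    for city, sale in records:
        yield city, sale

def shuffle(pairs):
    # Bring all values that share a key together, as happens between map and reduce.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Aggregate the values of each key.
    return {key: sum(values) for key, values in groups.items()}

def run_job(records):
    return reduce_phase(shuffle(map_phase(records)))

# Hypothetical sample records: (City, Sale)
sample = [("Berlin", 10), ("Paris", 5), ("Berlin", 7)]
print(run_job(sample))  # {'Berlin': 17, 'Paris': 5}
```

A longer script would typically need several such jobs chained together, which is where a query plan and its optimizations come in.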

Example

SQL:

SELECT SUM(s.Sale), c.City 
FROM Sales s, Cities c
WHERE s.AddrId = c.AddrId
GROUP BY City;

Pig Latin:

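-- 1: group both relations by the join key
tmp = COGROUP Sales BY AddrId,
              Cities BY AddrId;
-- 2: flatten the grouped bags to get the joined tuples
--    (the alias cannot be called "join", which is a reserved word)
joined = FOREACH tmp GENERATE
         FLATTEN(Sales), FLATTEN(Cities);
-- 3: group the joined tuples by city
grp = GROUP joined BY City;

-- 4: sum the sales of each city; "group" holds the City value
res = FOREACH grp GENERATE SUM(joined.Sale), group;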
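
The four steps above can be traced on toy data with a small Python simulation. This is only a sketch of the semantics of each step under simple assumptions (in-memory lists, tuples with positional fields); the sample rows and helper names are hypothetical and nothing here is Pig's implementation.

```python
from collections import defaultdict

# Hypothetical toy relations: Sales(AddrId, Sale) and Cities(AddrId, City)
sales = [(1, 10), (2, 5), (3, 7)]
cities = [(1, "Berlin"), (2, "Paris"), (3, "Berlin")]

def cogroup(left, right):
    # Step 1: for every AddrId, collect a bag of Sales tuples and a bag of Cities tuples.
    bags = defaultdict(lambda: ([], []))
    for t in left:
        bags[t[0]][0].append(t)
    for t in right:
        bags[t[0]][1].append(t)
    return bags

def flatten(bags):
    # Step 2: flatten both bags, i.e. take their cross product per key.
    # A key with an empty bag on either side yields no rows (inner-join behaviour).
    return [s + c for s_bag, c_bag in bags.values() for s in s_bag for c in c_bag]

def group_by_city(rows):
    # Step 3: rows are (AddrId, Sale, AddrId, City); group them by City.
    groups = defaultdict(list)
    for row in rows:
        groups[row[3]].append(row)
    return groups

def sum_sales(groups):
    # Step 4: sum the Sale field within each group.
    return {city: sum(row[1] for row in rows) for city, rows in groups.items()}

res = sum_sales(group_by_city(flatten(cogroup(sales, cities))))
print(res)  # {'Berlin': 17, 'Paris': 5}
```

Steps 1 and 2 together behave like the join in the SQL version, and steps 3 and 4 correspond to its GROUP BY and SUM.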

In Pig, FOREACH $\approx$ Map: the GENERATE expression is applied to each tuple independently.
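
The analogy between FOREACH and a map can be sketched in Python as follows. The relation, the helper name `foreach_generate`, and the doubling expression are all hypothetical and serve only to show the per-tuple behaviour.

```python
# Hypothetical relation of (City, Sale) tuples
rows = [("Berlin", 10), ("Paris", 5)]

def foreach_generate(relation, fn):
    # Apply fn to every tuple on its own, producing one output tuple per input tuple.
    return list(map(fn, relation))

# Analogous in spirit to: FOREACH rows GENERATE City, Sale * 2
doubled = foreach_generate(rows, lambda t: (t[0], t[1] * 2))
print(doubled)  # [('Berlin', 20), ('Paris', 10)]
```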

Sources

Introduction to Data Science (Coursera)
