# Comp150CPA: Clouds and Power-Aware Computing, Classroom Exercise 10: Advanced Pig, Spring 2011

### group member 5: ____________________________ login: ______________

In class we've studied the advanced Pig operators including COGROUP and CROSS. Let's play with these in more detail.

1. Suppose you have two relations:
```
friends: {name:chararray}
employees: {name:chararray}
```
Write a Pig script that computes the symmetric difference of these two relations, to get those people who are employees but not friends, or friends but not employees.

Answer:
```
grps = COGROUP friends BY name, employees BY name;
-- A COGROUP can't generate (name, {}, {}) so
-- we *know* that if one bag is empty, the other is not.
grp2 = FILTER grps BY IsEmpty(friends) OR IsEmpty(employees);
diff = FOREACH grp2 GENERATE group AS name;
```
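To see why the empty-bag test works, it can help to trace the COGROUP on a small input. The sketch below is illustrative only: the file names `friends.txt` and `employees.txt` and the three names in them are assumptions made for this example, not part of the exercise.

```pig
-- Hypothetical input files, one name per line (not part of the exercise):
--   friends.txt:   alice, bob
--   employees.txt: bob, carol
friends   = LOAD 'friends.txt'   AS (name:chararray);
employees = LOAD 'employees.txt' AS (name:chararray);
grps = COGROUP friends BY name, employees BY name;
-- grps holds one tuple per distinct name (order may vary):
--   (alice,{(alice)},{})
--   (bob,{(bob)},{(bob)})
--   (carol,{},{(carol)})
grp2 = FILTER grps BY IsEmpty(friends) OR IsEmpty(employees);
diff = FOREACH grp2 GENERATE group AS name;
-- diff: (alice) (carol)
DUMP diff;
```

Only `bob` has both bags non-empty, so he is the one name dropped by the filter.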
2. Suppose you have a parameters relation `params: {name:chararray, value:int}` that contains the tuples:
```
(atleast,3)
(atmost,6)
```
and want to select people in the relation `friends: {name:chararray, n:int}` who have between `atleast` and `atmost` friends. Write the Pig script to pick these people from the list.

Answer: This is a matter of repeating the parameter distribution pattern shown in class:
```
-- make an alias containing just the atleast value
atleast = FILTER params BY name=='atleast';
atleast2 = FOREACH atleast GENERATE value;
-- make an alias containing just the atmost value
atmost = FILTER params BY name=='atmost';
atmost2 = FOREACH atmost GENERATE value;
-- right-distribute atleast and atmost values
f2 = CROSS friends, atleast2, atmost2;
-- CROSS produces flat tuples, so each field is referenced as alias::field.
-- throw out data that is not within specifications
f3 = FILTER f2 BY friends::n>=atleast2::value AND friends::n<=atmost2::value;
-- remove now extraneous data
f4 = FOREACH f3 GENERATE friends::name, friends::n;
```
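A short trace may make the pattern easier to follow. The sketch below assumes hypothetical comma-separated files `params.txt` and `friends.txt` with made-up rows; only the two parameter tuples come from the exercise. Because `atleast2` and `atmost2` each hold a single row, the CROSS adds two columns to every `friends` tuple without multiplying the number of rows.

```pig
-- Hypothetical comma-separated input files (not part of the exercise):
--   params.txt:  atleast,3 / atmost,6
--   friends.txt: alice,2 / bob,4 / carol,7
params  = LOAD 'params.txt'  USING PigStorage(',') AS (name:chararray, value:int);
friends = LOAD 'friends.txt' USING PigStorage(',') AS (name:chararray, n:int);
atleast  = FILTER params BY name == 'atleast';
atleast2 = FOREACH atleast GENERATE value;
atmost   = FILTER params BY name == 'atmost';
atmost2  = FOREACH atmost GENERATE value;
f2 = CROSS friends, atleast2, atmost2;
-- f2: (alice,2,3,6) (bob,4,3,6) (carol,7,3,6)
f3 = FILTER f2 BY friends::n >= atleast2::value AND friends::n <= atmost2::value;
-- f3: (bob,4,3,6)
f4 = FOREACH f3 GENERATE friends::name, friends::n;
-- f4: (bob,4)
DUMP f4;
```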
3. Suppose your input data is a bag of pairs `contest: {winner:chararray, loser:chararray}`. Write a script that recovers the list of names that appear in the contest relation but never lost.

Answer: We want the set difference between winners and losers.
```
-- get a list of just the winners
n1 = FOREACH contest GENERATE winner;
-- get a list of just the losers
n2 = FOREACH contest GENERATE loser;
-- cogroup by name.
grp = COGROUP n1 BY winner, n2 BY loser;
-- the cogrouped bags are named after the aliases n1 and n2;
-- if the n2 (losers) bag is empty, the person won but never lost
grp2 = FILTER grp BY IsEmpty(n2);
-- project out cogroup data
alwayswon = FOREACH grp2 GENERATE group AS winner;
```
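As a sanity check, the set difference can be traced on a tiny input. This is a sketch under assumptions: the file `contest.txt` and the three contests in it are invented for illustration. Note that the bags produced by COGROUP take the names of the input aliases (`n1` and `n2`), which is what the filter tests.

```pig
-- Hypothetical comma-separated input (not part of the exercise):
--   contest.txt: ann,bob / ann,cat / bob,cat
contest = LOAD 'contest.txt' USING PigStorage(',') AS (winner:chararray, loser:chararray);
n1 = FOREACH contest GENERATE winner;
n2 = FOREACH contest GENERATE loser;
grp = COGROUP n1 BY winner, n2 BY loser;
-- grp (order may vary); the bags are named after the aliases n1 and n2:
--   (ann,{(ann),(ann)},{})
--   (bob,{(bob)},{(bob)})
--   (cat,{},{(cat),(cat)})
grp2 = FILTER grp BY IsEmpty(n2);
alwayswon = FOREACH grp2 GENERATE group AS winner;
-- alwayswon: (ann)
DUMP alwayswon;
```

`cat` never won, so her `n1` bag is empty, but her `n2` bag is not, and the filter correctly excludes her.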
4. (Advanced) Write a Pig script that has as input a tuple of three bags of names, and outputs all names that occur in exactly one of the three bags.

Answer: Let's assume that the schema is `stuff:{b1:{name:chararray}, b2:{name:chararray}, b3:{name:chararray}}`. Then we might write:
```
-- Pig subtlety: a three-way COGROUP needs three distinctly named
-- relations, so we flatten each bag into its own relation.
-- (Flattening all three bags in one FOREACH forms their cross-product,
-- which is wasteful and yields no rows at all if any one bag is empty.)
s1 = FOREACH stuff GENERATE FLATTEN(b1) AS f1;
s2 = FOREACH stuff GENERATE FLATTEN(b2) AS f2;
s3 = FOREACH stuff GENERATE FLATTEN(b3) AS f3;
-- group the names from all three relations in one pass.
grp = COGROUP s1 BY f1, s2 BY f2, s3 BY f3;
-- COGROUP won't ever generate (name, {}, {}, {}) and
-- the fact that two are empty means that the third one is not
grp2 = FILTER grp BY (IsEmpty(s1) AND IsEmpty(s2))
OR (IsEmpty(s2) AND IsEmpty(s3))
OR (IsEmpty(s1) AND IsEmpty(s3));
out = FOREACH grp2 GENERATE group as name;
```
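The "exactly two bags empty" test can be checked on a small input. The following is only a sketch: the file `stuff.txt`, its contents, and the tuple alias `t` in the LOAD schema are assumptions for illustration, and it relies on the default PigStorage loader parsing the textual bag syntax. This variant flattens each bag into its own relation before the three-way COGROUP.

```pig
-- Hypothetical tab-separated input holding one tuple of three bags:
--   stuff.txt: {(a),(b)}<TAB>{(b),(c)}<TAB>{(c),(d)}
stuff = LOAD 'stuff.txt' AS (b1:{t:(name:chararray)},
                             b2:{t:(name:chararray)},
                             b3:{t:(name:chararray)});
s1 = FOREACH stuff GENERATE FLATTEN(b1) AS f1;   -- (a) (b)
s2 = FOREACH stuff GENERATE FLATTEN(b2) AS f2;   -- (b) (c)
s3 = FOREACH stuff GENERATE FLATTEN(b3) AS f3;   -- (c) (d)
grp = COGROUP s1 BY f1, s2 BY f2, s3 BY f3;
-- grp (order may vary):
--   (a,{(a)},{},{})
--   (b,{(b)},{(b)},{})
--   (c,{},{(c)},{(c)})
--   (d,{},{},{(d)})
grp2 = FILTER grp BY (IsEmpty(s1) AND IsEmpty(s2))
                  OR (IsEmpty(s2) AND IsEmpty(s3))
                  OR (IsEmpty(s1) AND IsEmpty(s3));
out = FOREACH grp2 GENERATE group AS name;
-- out: (a) (d)
DUMP out;
```

Names `b` and `c` each appear in two bags, so only one of their cogrouped bags is empty and the filter rejects them.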