In class we've studied the advanced Pig operators including COGROUP and CROSS. Let's play with these in more detail.
friends: {name:chararray}
employees: {name:chararray}
Write a Pig script that computes the symmetric difference of these two relations, to get those people who are employees but not friends, or friends but not employees.
    grps = COGROUP friends BY name, employees BY name;
    -- A COGROUP can't generate (name, {}, {}) so
    -- we *know* that if one bag is empty, the other is not.
    grp2 = FILTER grps BY IsEmpty(friends) OR IsEmpty(employees);
    diff = FOREACH grp2 GENERATE group AS name;
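To check the reasoning in the comment, here is a minimal Python sketch that models a two-way COGROUP on single-field relations. It is a model of the semantics, not Pig itself; the helper names `cogroup` and `symmetric_difference` and the sample names are made up for illustration.

```python
from collections import defaultdict

def cogroup(left, right):
    """Model a two-way COGROUP on single-field relations.

    Returns {key: (left_bag, right_bag)}. A key exists only if at least
    one input contains it, so both bags are never empty at the same time.
    """
    groups = defaultdict(lambda: ([], []))
    for name in left:
        groups[name][0].append(name)
    for name in right:
        groups[name][1].append(name)
    return dict(groups)

def symmetric_difference(friends, employees):
    grps = cogroup(friends, employees)
    # Same test as the FILTER: keep groups where either bag is empty.
    return sorted(name for name, (f, e) in grps.items() if not f or not e)

friends = ["alice", "bob", "carol"]   # hypothetical sample data
employees = ["bob", "dave"]
print(symmetric_difference(friends, employees))  # ['alice', 'carol', 'dave']
```

Only bob appears in both inputs, so his group has two non-empty bags and is filtered out.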
(atleast,3)
(atmost,6)
and want to select people in the relation
friends: {name:chararray, n:int}
who have between atleast and atmost friends.
Write the Pig script to pick these people from the list.
    -- make an alias just containing the atleast value
    atleast = FILTER params BY name=='atleast';
    atleast2 = FOREACH atleast GENERATE value;
    -- make an alias just containing the atmost value
    atmost = FILTER params BY name=='atmost';
    atmost2 = FOREACH atmost GENERATE value;
    -- right-distribute atleast and atmost values
    f2 = CROSS friends, atleast2, atmost2;
    -- since CROSS produces flat output, it uses :: names for structure.
    -- throw out data that is not within specifications
    f3 = FILTER f2 BY friends::n>=atleast2::value AND friends::n<=atmost2::value;
    -- remove now extraneous data
    f4 = FOREACH f3 GENERATE friends::name, friends::n;
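The CROSS trick works because each parameter alias holds a single row, so the cross product attaches both bounds to every friends row without multiplying the row count. A minimal Python sketch of the same pipeline follows. It models the semantics only; the function name `select_in_range` and the sample rows are assumptions, and the (name, value) layout of params is inferred from the script.

```python
from itertools import product

def select_in_range(friends, params):
    # FILTER + FOREACH: one-row "relations" holding each bound.
    atleast2 = [value for (name, value) in params if name == "atleast"]
    atmost2 = [value for (name, value) in params if name == "atmost"]
    # CROSS: every friends row paired with every (atleast, atmost) combination.
    f2 = product(friends, atleast2, atmost2)
    # FILTER on the bounds, then project back to (name, n).
    return [(name, n) for ((name, n), lo, hi) in f2 if lo <= n <= hi]

params = [("atleast", 3), ("atmost", 6)]
friends = [("alice", 2), ("bob", 3), ("carol", 6), ("dave", 7)]  # hypothetical
print(select_in_range(friends, params))  # [('bob', 3), ('carol', 6)]
```

If params contained no atleast or atmost row, the cross product would be empty and nothing would be selected, which matches how CROSS behaves with an empty input.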
    -- get a list of just the winners
    n1 = FOREACH contest GENERATE winner;
    -- get a list of just the losers
    n2 = FOREACH contest GENERATE loser;
    -- cogroup by name.
    grp = COGROUP n1 BY winner, n2 BY loser;
    -- the bags in grp are named after the input aliases (n1, n2).
    -- if the n2 bag is empty, the person won and never lost
    grp2 = FILTER grp BY IsEmpty(n2);
    -- project out cogroup data
    alwayswon = FOREACH grp2 GENERATE group AS winner;
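Seen as sets, this query is the winners minus the losers. The following Python sketch shows that view. It assumes contest rows are (winner, loser) pairs, which is inferred from the script rather than stated, and the function name and sample matches are made up.

```python
def always_won(contest):
    winners = {winner for (winner, _) in contest}  # projection n1
    losers = {loser for (_, loser) in contest}     # projection n2
    # A name whose loser bag is empty is a winner that never lost.
    return sorted(winners - losers)

contest = [("ann", "bo"), ("ann", "cy"), ("bo", "cy")]  # hypothetical matches
print(always_won(contest))  # ['ann']
```

Here bo wins once but also loses to ann, so only ann remains.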
stuff:{b1:{name:chararray}, b2:{name:chararray}, b3:{name:chararray}}
Then we might write:
    -- Pig subtlety: we want to do a three-way COGROUP,
    -- which must have different names. So we repeat the construction.
    s1 = FOREACH stuff GENERATE FLATTEN(b1) AS f1, FLATTEN(b2) AS f2, FLATTEN(b3) AS f3;
    s2 = FOREACH stuff GENERATE FLATTEN(b1) AS f1, FLATTEN(b2) AS f2, FLATTEN(b3) AS f3;
    s3 = FOREACH stuff GENERATE FLATTEN(b1) AS f1, FLATTEN(b2) AS f2, FLATTEN(b3) AS f3;
    -- Now s1, s2 and s3 are bags of all reasonable combinations of names,
    -- because flatten forms an implicit cross-product.
    -- group the flatten triples in all three ways.
    grp = COGROUP s1 BY f1, s2 BY f2, s3 BY f3;
    -- COGROUP won't ever generate (name, {}, {}, {}) and
    -- the fact that two are empty means that the third one is not
    grp2 = FILTER grp BY (IsEmpty(s1) AND IsEmpty(s2)) OR (IsEmpty(s2) AND IsEmpty(s3)) OR (IsEmpty(s1) AND IsEmpty(s3));
    out = FOREACH grp2 GENERATE group AS name;
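Since at least one bag per group is non-empty, the filter keeps exactly those names that occur in one of the three positions and in neither of the other two. The Python sketch below models this, including the implicit cross product formed by flattening three bags in one FOREACH. It is a model only; the function name `exactly_one_bag` and the sample bags are assumptions.

```python
from itertools import product

def exactly_one_bag(stuff):
    """stuff: list of (b1, b2, b3) rows, each bag given as a list of names."""
    # Flattening three bags in one FOREACH yields their cross product per row.
    triples = [t for (b1, b2, b3) in stuff for t in product(b1, b2, b3)]
    in1 = {f1 for (f1, _, _) in triples}  # keys of s1 BY f1
    in2 = {f2 for (_, f2, _) in triples}  # keys of s2 BY f2
    in3 = {f3 for (_, _, f3) in triples}  # keys of s3 BY f3
    names = in1 | in2 | in3
    # "Two bags empty" is the same as "present in exactly one position".
    return sorted(n for n in names if (n in in1) + (n in in2) + (n in in3) == 1)

stuff = [(["a", "b"], ["b", "c"], ["c", "d"])]  # hypothetical row
print(exactly_one_bag(stuff))  # ['a', 'd']
```

One consequence worth noting: flattening an empty bag produces no rows, so a row of stuff in which any of the three bags is empty contributes nothing to s1, s2 or s3, and names from that row's other bags are lost.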