Comp150CPA: Clouds and Power-Aware Computing
Classroom Exercise 10
Advanced Pig
Spring 2011

In class we've studied the advanced Pig operators including COGROUP and CROSS. Let's play with these in more detail.

  1. Suppose you have two relations:
    friends: {name:chararray} 
    employees: {name:chararray}
    Write a Pig script that computes the symmetric difference of these two relations, to get those people who are employees but not friends, or friends but not employees.

    Answer: COGROUP the two relations by name and keep only the groups in which one side is empty.
    grps = COGROUP friends BY name, employees BY name; 
    -- A COGROUP can't generate (name, {}, {}) so 
    -- we *know* that if one bag is empty, the other is not. 
    grp2 = FILTER grps BY IsEmpty(friends) OR IsEmpty(employees); 
    diff = FOREACH grp2 GENERATE group AS name; 
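
To see why the IsEmpty filter yields the symmetric difference, the COGROUP step can be modeled outside Pig. The following is a minimal Python sketch, not Pig itself: the cogroup helper and the sample names are hypothetical, and details such as null keys are ignored.

```python
from collections import defaultdict

def cogroup(*relations):
    """Rough model of COGROUP on single-field relations:
    maps each key to one bag (list) per input relation."""
    groups = defaultdict(lambda: [[] for _ in relations])
    for i, rel in enumerate(relations):
        for name in rel:
            groups[name][i].append(name)
    return dict(groups)

# Hypothetical sample data.
friends = ["ann", "bob", "cy"]
employees = ["bob", "cy", "dee"]

grps = cogroup(friends, employees)
# A key only exists because some relation contributed it,
# so the two bags are never both empty.
# Models: FILTER grps BY IsEmpty(friends) OR IsEmpty(employees)
diff = sorted(name for name, (f, e) in grps.items() if not f or not e)
print(diff)  # ['ann', 'dee']
```

Names present in both relations, such as bob, get two non-empty bags and are filtered out.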
  2. Suppose you have a parameters relation params:{name:chararray, value:int} that contains one row named atleast and one row named atmost,
    and want to select the people in the relation friends: {name:chararray, n:int} whose friend count n is between atleast and atmost, inclusive. Write the Pig script to pick these people from the list.

    Answer: This is a matter of repeating the parameter distribution pattern shown in class:
    -- make an alias just containing the atleast value
    atleast = FILTER params BY name=='atleast'; 
    atleast2 = FOREACH atleast GENERATE value; 
    -- make an alias just containing the atmost value
    atmost = FILTER params BY name=='atmost'; 
    atmost2 = FOREACH atmost GENERATE value; 
    -- right-distribute atleast and atmost values 
    f2 = CROSS friends, atleast2, atmost2; 
    -- since CROSS produces flat output, it uses :: names for structure. 
    -- throw out data that is not within specifications
    f3 = FILTER f2 BY friends::n>=atleast2::value AND friends::n<=atmost2::value; 
    -- remove now extraneous data
    f4 = FOREACH f3 GENERATE friends::name, friends::n; 
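
The parameter distribution pattern can be traced on a small example. This Python sketch models CROSS with itertools.product; the parameter values and the friends rows are hypothetical, chosen only to make the filter visible.

```python
from itertools import product

# Hypothetical sample data; the exercise does not fix these values.
params = [("atleast", 2), ("atmost", 4)]
friends = [("ann", 1), ("bob", 2), ("cy", 4), ("dee", 7)]

# Models the FILTER + FOREACH pairs: one-row, one-field relations.
atleast2 = [(v,) for (k, v) in params if k == "atleast"]
atmost2 = [(v,) for (k, v) in params if k == "atmost"]

# Models CROSS: every friends row is paired with the single atleast
# and atmost rows, giving flat rows (name, n, atleast, atmost).
f2 = [f + lo + hi for f, lo, hi in product(friends, atleast2, atmost2)]

# Models the range FILTER and the final projection.
f3 = [row for row in f2 if row[2] <= row[1] <= row[3]]
f4 = [(row[0], row[1]) for row in f3]
print(f4)  # [('bob', 2), ('cy', 4)]
```

Because each parameter relation holds exactly one row, the cross-product has the same number of rows as friends; it only attaches the two bounds to every row.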
  3. Suppose your input data is a bag of pairs contest:{winner: chararray, loser:chararray}. Write a script that recovers the list of names that appear in the contest relation but never lost.

    Answer: We want the set difference between winners and losers.
    -- get a list of just the winners
    n1 = FOREACH contest GENERATE winner; 
    -- get a list of just the losers 
    n2 = FOREACH contest GENERATE loser; 
    -- cogroup by name. 
    grp = COGROUP n1 BY winner, n2 BY loser; 
    -- the bags in grp are named after the input aliases, n1 and n2;
    -- if the n2 (losers) bag is empty, the person won but never lost
    grp2 = FILTER grp BY IsEmpty(n2); 
    -- project out cogroup data 
    alwayswon = FOREACH grp2 GENERATE group AS winner; 
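
As a cross-check on the expected result, the same set difference can be computed directly. This is a minimal Python sketch with hypothetical contest rows, not a translation of the Pig script.

```python
# Hypothetical (winner, loser) pairs.
contest = [("ann", "bob"), ("bob", "cy"), ("dee", "eve")]

winners = {w for (w, _) in contest}
losers = {l for (_, l) in contest}

# Names that won at least once and never lost.
alwayswon = sorted(winners - losers)
print(alwayswon)  # ['ann', 'dee']
```

Here bob is excluded because he also appears as a loser, and cy and eve are excluded because they never won.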
  4. (Advanced) Write a Pig script that has as input a tuple of three bags of names, and outputs all names that occur in exactly one of the three bags.

    Answer: Let's assume that the schema is stuff:{b1:{name:chararray}, b2:{name:chararray}, b3:{name:chararray}}. Then we might write:
    -- Pig subtlety: a three-way COGROUP needs three differently named
    -- relations, so we flatten each bag into its own relation of names.
    -- Flattening all three bags in one GENERATE would instead form their
    -- cross-product, which is empty whenever any one bag is empty.
    s1 = FOREACH stuff GENERATE FLATTEN(b1) AS f1; 
    s2 = FOREACH stuff GENERATE FLATTEN(b2) AS f2; 
    s3 = FOREACH stuff GENERATE FLATTEN(b3) AS f3; 
    -- Now s1, s2 and s3 each hold the names from one of the bags.
    -- Group the three relations by name. 
    grp = COGROUP s1 BY f1, s2 BY f2, s3 BY f3; 
    -- COGROUP won't ever generate (name, {}, {}, {}) and
    -- the fact that two are empty means that the third one is not
    grp2 = FILTER grp BY (IsEmpty(s1) AND IsEmpty(s2)) 
                      OR (IsEmpty(s2) AND IsEmpty(s3)) 
                      OR (IsEmpty(s1) AND IsEmpty(s3)); 
    out = FOREACH grp2 GENERATE group as name;
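
The filter condition "two of the three bags are empty" is equivalent to "the name belongs to exactly one bag". A minimal Python sketch of that equivalence, using hypothetical bag contents and a membership count in place of COGROUP:

```python
from collections import Counter

# Hypothetical contents of the three bags.
b1 = ["ann", "bob"]
b2 = ["bob", "cy"]
b3 = ["cy", "dee", "bob"]

# For each name, count how many of the three bags contain it.
# set(bag) ensures duplicates inside one bag are counted once.
membership = Counter(name for bag in (b1, b2, b3) for name in set(bag))

# A count of 1 corresponds to exactly two empty bags in the COGROUP output.
exactly_one = sorted(name for name, k in membership.items() if k == 1)
print(exactly_one)  # ['ann', 'dee']
```

In this data bob occurs in all three bags and cy in two, so both are dropped.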