Comp150CPA: Clouds and Power-Aware Computing
Classroom Exercise 10
Advanced Pig
Spring 2011

group member 1: ____________________________ login: ______________

group member 2: ____________________________ login: ______________

group member 3: ____________________________ login: ______________

group member 4: ____________________________ login: ______________

group member 5: ____________________________ login: ______________

In class we've studied the advanced Pig operators including COGROUP and CROSS. Let's play with these in more detail.

  1. Suppose you have two relations:
     
    friends: {name:chararray} 
    employees: {name:chararray}
    
    Write a Pig script that computes the symmetric difference of these two relations, to get those people who are employees but not friends, or friends but not employees.

    Answer:
    grps = COGROUP friends BY name, employees BY name; 
    -- A COGROUP can't generate (name, {}, {}) so 
    -- we *know* that if one bag is empty, the other is not. 
    grp2 = FILTER grps BY IsEmpty(friends) OR IsEmpty(employees); 
    diff = FOREACH grp2 GENERATE group AS name; 
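    
    A worked trace may make the COGROUP step easier to follow. The sketch below is the same script made self-contained; the file names friends.txt and employees.txt and the sample names are assumptions for illustration, not part of the exercise.

    ```pig
    -- hypothetical input files, one name per line
    friends   = LOAD 'friends.txt'   AS (name:chararray);
    employees = LOAD 'employees.txt' AS (name:chararray);
    -- assume friends   = {(alice),(bob)}
    --        employees = {(bob),(carol)}
    grps = COGROUP friends BY name, employees BY name;
    -- grps then holds one tuple per distinct name:
    --   (alice, {(alice)}, {})
    --   (bob,   {(bob)},   {(bob)})
    --   (carol, {},        {(carol)})
    -- keep the names that are missing from exactly one side
    grp2 = FILTER grps BY IsEmpty(friends) OR IsEmpty(employees);
    diff = FOREACH grp2 GENERATE group AS name;
    -- diff contains (alice) and (carol), in some order
    DUMP diff;
    ```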
    
  2. Suppose you have a parameters relation params:{name:chararray, value:int} that contains the tuples:
    (atleast,3)
    (atmost,6)
    
    and want to select people in the relation friends: {name:chararray, n:int} who have between atleast and atmost friends, inclusive. Write the Pig script to pick these people from the list.

    Answer: This is a matter of repeating the parameter distribution pattern shown in class:
    -- make an alias just containing the atleast value
    atleast = FILTER params BY name=='atleast'; 
    atleast2 = FOREACH atleast GENERATE value; 
    -- make an alias just containing the atmost value
    atmost = FILTER params BY name=='atmost'; 
    atmost2 = FOREACH atmost GENERATE value; 
    -- right-distribute atleast and atmost values 
    f2 = CROSS friends, atleast2, atmost2; 
    -- CROSS produces flat tuples, so each field is referred to as relation::field. 
    -- throw out data that is not within specifications
    f3 = FILTER f2 BY friends::n>=atleast2::value AND friends::n<=atmost2::value; 
    -- remove now extraneous data
    f4 = FOREACH f3 GENERATE friends::name, friends::n; 
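    
    To see what the CROSS contributes, here is the same pattern as a self-contained sketch with a trace. The file names params.txt and friends.txt (tab-separated, the default for LOAD) and the sample friend counts are assumptions for illustration.

    ```pig
    -- hypothetical tab-separated input files
    params  = LOAD 'params.txt'  AS (name:chararray, value:int);
    friends = LOAD 'friends.txt' AS (name:chararray, n:int);
    -- assume params  = {(atleast,3),(atmost,6)}
    --        friends = {(alice,2),(bob,4),(carol,7)}
    atleast  = FILTER params BY name=='atleast';
    atleast2 = FOREACH atleast GENERATE value;   -- {(3)}
    atmost   = FILTER params BY name=='atmost';
    atmost2  = FOREACH atmost GENERATE value;    -- {(6)}
    -- each one-tuple relation is appended to every friends tuple:
    --   (alice,2,3,6)
    --   (bob,4,3,6)
    --   (carol,7,3,6)
    f2 = CROSS friends, atleast2, atmost2;
    f3 = FILTER f2 BY friends::n>=atleast2::value AND friends::n<=atmost2::value;
    -- only (bob,4,3,6) survives the filter
    f4 = FOREACH f3 GENERATE friends::name, friends::n;
    -- f4 contains (bob,4)
    DUMP f4;
    ```

    Because atleast2 and atmost2 each hold a single tuple, the CROSS does not multiply the number of rows; it only widens each one.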
    
  3. Suppose your input data is a bag of pairs contest:{winner: chararray, loser:chararray}. Write a script that recovers the list of names that appear in the contest relation but never lost.

    Answer: We want the set difference between winners and losers.
     
    -- get a list of just the winners
    n1 = FOREACH contest GENERATE winner; 
    -- get a list of just the losers 
    n2 = FOREACH contest GENERATE loser; 
    -- cogroup by name. 
    grp = COGROUP n1 BY winner, n2 BY loser; 
    -- the bags in grp are named after the cogrouped relations, n1 and n2;
    -- if the n2 (losers) bag is empty, the person won but never lost
    grp2 = FILTER grp BY IsEmpty(n2); 
    -- project out cogroup data 
    alwayswon = FOREACH grp2 GENERATE group AS winner; 
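    
    A small trace may help show which bag the filter has to test. This is a self-contained sketch of the same set difference; the file name contest.txt and the sample matches are assumptions for illustration.

    ```pig
    -- hypothetical tab-separated input file: winner, loser
    contest = LOAD 'contest.txt' AS (winner:chararray, loser:chararray);
    -- assume contest = {(alice,bob),(bob,carol),(alice,carol)}
    n1 = FOREACH contest GENERATE winner;   -- {(alice),(bob),(alice)}
    n2 = FOREACH contest GENERATE loser;    -- {(bob),(carol),(carol)}
    grp = COGROUP n1 BY winner, n2 BY loser;
    -- grp holds one tuple per distinct name, with bags named n1 and n2:
    --   (alice, {(alice),(alice)}, {})
    --   (bob,   {(bob)},           {(bob)})
    --   (carol, {},                {(carol),(carol)})
    -- an empty n2 bag means the name never appears as a loser
    grp2 = FILTER grp BY IsEmpty(n2);
    alwayswon = FOREACH grp2 GENERATE group AS winner;
    -- alwayswon contains (alice)
    DUMP alwayswon;
    ```

    Duplicates in n1 do no harm here, since COGROUP collects them into one bag per name.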
    
  4. (Advanced) Write a Pig script that has as input a tuple of three bags of names, and outputs all names that occur in exactly one of the three bags.

    Answer: Let's assume that the schema is stuff:{b1:{name:chararray}, b2:{name:chararray}, b3:{name:chararray}}. Then we might write:
     
    -- Pig subtlety: we want to do a three-way COGROUP of a relation with itself, 
    -- and the inputs must have different aliases. So we repeat the construction. 
    s1 = FOREACH stuff GENERATE FLATTEN(b1) AS f1, FLATTEN(b2) AS f2, FLATTEN(b3) AS f3; 
    s2 = FOREACH stuff GENERATE FLATTEN(b1) AS f1, FLATTEN(b2) AS f2, FLATTEN(b3) AS f3; 
    s3 = FOREACH stuff GENERATE FLATTEN(b1) AS f1, FLATTEN(b2) AS f2, FLATTEN(b3) AS f3; 
    -- Now s1, s2 and s3 are bags of all (f1, f2, f3) combinations of names,
    -- because flatten forms an implicit cross-product. 
    -- Caveat: this assumes no bag is empty; flattening an empty bag yields no rows. 
    -- group the flatten triples in all three ways. 
    grp = COGROUP s1 BY f1, s2 BY f2, s3 BY f3; 
    -- COGROUP won't ever generate (name, {}, {}, {}) and
    -- the fact that two are empty means that the third one is not
    grp2 = FILTER grp BY (IsEmpty(s1) AND IsEmpty(s2)) 
                      OR (IsEmpty(s2) AND IsEmpty(s3)) 
                      OR (IsEmpty(s1) AND IsEmpty(s3)); 
    out = FOREACH grp2 GENERATE group AS name;
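    
    One possible variation, which is not the construction above, is to flatten each bag into its own relation. This avoids the cross product and should still work when one of the bags is empty. The sketch below is self-contained under assumptions: the file name stuff.txt, the tuple name t in the schema, the alias once, and the sample names are all hypothetical.

    ```pig
    -- hypothetical input file: one line holding three tab-separated bags, e.g.
    --   {(alice),(bob)}    {(bob),(carol)}    {(carol),(dave)}
    stuff = LOAD 'stuff.txt' AS (b1:{t:(name:chararray)},
                                 b2:{t:(name:chararray)},
                                 b3:{t:(name:chararray)});
    -- one relation per bag, each with a single name column
    s1 = FOREACH stuff GENERATE FLATTEN(b1) AS f1;   -- {(alice),(bob)}
    s2 = FOREACH stuff GENERATE FLATTEN(b2) AS f2;   -- {(bob),(carol)}
    s3 = FOREACH stuff GENERATE FLATTEN(b3) AS f3;   -- {(carol),(dave)}
    grp = COGROUP s1 BY f1, s2 BY f2, s3 BY f3;
    -- grp holds one tuple per distinct name:
    --   (alice, {(alice)}, {},        {})
    --   (bob,   {(bob)},   {(bob)},   {})
    --   (carol, {},        {(carol)}, {(carol)})
    --   (dave,  {},        {},        {(dave)})
    -- two empty bags means the name came from exactly one input bag
    grp2 = FILTER grp BY (IsEmpty(s1) AND IsEmpty(s2))
                      OR (IsEmpty(s2) AND IsEmpty(s3))
                      OR (IsEmpty(s1) AND IsEmpty(s3));
    once = FOREACH grp2 GENERATE group AS name;
    -- once contains (alice) and (dave), in some order
    DUMP once;
    ```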