Comp150CPA Homework Exercise 4: Advanced Pig

Objective: to write a nontrivial Pig script employing advanced concepts.
Deliverables: source code of your script, compatible with Cloudera Training Environment 0.3.3.
Work mode: exercise/lecture reinforcement, group work and submissions encouraged.

Overview

In groups of 2-5 people, please create a Pig script, friends.pig, that reproduces LinkedIn's use of Hadoop to count how many friends two non-friends have in common. The input is a list of contact pairs, one pair per line, in a file friends.dat, e.g.,

Alva Mark
Siri Mark
Alva George
Siri George
Fred George

The friend relationship is symmetric, i.e., listing

Alva Mark

implicitly means that

 
Mark Alva

as well.
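To make the symmetry rule concrete, here is a minimal Python sketch (not Pig, and not part of the required deliverable) that expands each listed pair into both directions. It assumes names contain no spaces and are separated by whitespace; the function name symmetric_pairs is made up for illustration.

```python
def symmetric_pairs(lines):
    """Return a set holding every listed pair in both directions."""
    pairs = set()
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # assumption: blank or malformed lines are ignored
        a, b = fields
        pairs.add((a, b))
        pairs.add((b, a))  # the implied reverse pair
    return pairs


sample = ["Alva Mark", "Siri Mark"]
print(sorted(symmetric_pairs(sample)))
# [('Alva', 'Mark'), ('Mark', 'Alva'), ('Mark', 'Siri'), ('Siri', 'Mark')]
```

Using a set also means that an input file listing both Alva Mark and Mark Alva does not count the friendship twice.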

The output of your script should be a set of sentences describing how many friends two non-friends have in common, e.g.,

 
(Alva and Siri have 2 friend(s) in common.)
(Fred and Siri have 1 friend(s) in common.)

where every pair reported in the output consists of two people who are not already friends in the input. Within each sentence, the two names should appear in alphabetical order, and the sentences themselves should be ordered by number of friends in common, then by first name, then by second name.
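Since you have to build your own test data, a small reference checker can help you decide what the script ought to print. The following Python sketch is one possible reading of the specification, not the Pig solution. It assumes that counts are sorted largest first (which the sample output suggests but the text does not state) and that the input has already been parsed into name pairs; the function name common_friend_report is hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def common_friend_report(pairs):
    """Return output sentences for non-friends who share at least one friend."""
    friends = defaultdict(set)
    for a, b in pairs:
        if a != b:
            friends[a].add(b)
            friends[b].add(a)  # friendship is symmetric

    counts = defaultdict(int)
    # Any two distinct friends of the same person have that person in common.
    for person, circle in friends.items():
        for a, b in combinations(sorted(circle), 2):  # a < b alphabetically
            if b not in friends[a]:  # skip pairs who are already friends
                counts[(a, b)] += 1

    # Assumption: most common friends first, then first name, then second name.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0][0], kv[0][1]))
    return ["(%s and %s have %d friend(s) in common.)" % (a, b, n)
            for (a, b), n in ordered]


sample = [("Alva", "Mark"), ("Siri", "Mark"), ("Alva", "George"),
          ("Siri", "George"), ("Fred", "George")]
for sentence in common_friend_report(sample):
    print(sentence)
```

On the five sample pairs above, this sketch prints four sentences: Alva and Siri (2), George and Mark (2), Alva and Fred (1), and Fred and Siri (1), in that order.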

I will only give you the schema for input and the schema for output. You have to test with reasonable data.

Hints

This is a rather advanced use of Pig. The trick is to make it work properly, and then make it reasonably efficient. You do not need any mechanisms outside Pig to do this, though you might be tempted to utilize UDFs for some operations. My solution uses no UDFs, but utilizes:

Cross products (CROSS) to check pairs against one another.
Co-grouping (COGROUP ... BY) to compute set differences.
Lots of filtering (FILTER ... BY) to eliminate redundant data.
Concatenation (CONCAT) to produce the final output sentences.
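The idea of computing a set difference by co-grouping may be unfamiliar, so here is a rough Python emulation of it. It is only an illustration of the technique, not Pig code: co-group two relations on the same key, then keep the keys whose bag from the second relation is empty. The helper names cogroup and set_difference are invented for this sketch.

```python
from collections import defaultdict


def cogroup(left, right):
    """Group two lists of records by the record itself.

    Returns {key: (left_bag, right_bag)}, loosely mimicking a co-group.
    """
    groups = defaultdict(lambda: ([], []))
    for rec in left:
        groups[rec][0].append(rec)
    for rec in right:
        groups[rec][1].append(rec)
    return dict(groups)


def set_difference(left, right):
    """Keys present in left whose right-hand bag is empty."""
    grouped = cogroup(left, right)
    return sorted(key for key, (lbag, rbag) in grouped.items()
                  if lbag and not rbag)


candidates = [("Alva", "Siri"), ("Alva", "Mark"), ("Fred", "Siri")]
existing = [("Alva", "Mark"), ("Siri", "Mark")]
print(set_difference(candidates, existing))
# [('Alva', 'Siri'), ('Fred', 'Siri')]
```

Here the candidate pair (Alva, Mark) is dropped because it also appears among the existing friendships, which is the kind of elimination the assignment requires.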

Submitting completed assignments

This assignment is a bit more complex to submit because you need to submit machine-readable code, so it is submitted as a set of files. Click here to fill out a form that, when submitted, generates a printable page that you can save to a file, e.g., hw04.html. Next, provide this file along with everything else via provide:

 
provide comp150cpa hw04submit friends.pig hw04.html

Completed assignments will be printed, graded by hand, and scanned back into the system. The grade for hw04submit will be listed as the grade for hw04 in provide, so we can give everyone in the group credit.

Extra credit

For 1 point extra credit, write another script that computes what LinkedIn calls "3rd-order relationships", i.e., people you can meet because a friend can introduce you. List these in order of their total contacts, largest first. For example, one line of output might be:

 
(Ed,George,Joe,20)

to mean that Ed knows George who knows Joe who has 20 contacts. Obviously, existing friends should not appear in the third column.
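As with the main exercise, a small Python reference can help you check your test data. It is a sketch under assumptions, not the Pig script you are asked to write: "contacts" is taken to mean the number of distinct friends of the person in the third column, and ties are broken by name, which the assignment does not specify. The function name introductions is hypothetical.

```python
from collections import defaultdict


def introductions(pairs):
    """Return (person, via, target, target_contacts) rows, largest first."""
    friends = defaultdict(set)
    for a, b in pairs:
        if a != b:
            friends[a].add(b)
            friends[b].add(a)  # friendship is symmetric

    rows = []
    for person in friends:
        for via in friends[person]:
            for target in friends[via]:
                # Exclude the person themselves and anyone they already know.
                if target != person and target not in friends[person]:
                    rows.append((person, via, target, len(friends[target])))

    # Assumption: sort by the target's contact count, then by names for ties.
    rows.sort(key=lambda r: (-r[3], r[0], r[1], r[2]))
    return rows


sample = [("Ed", "George"), ("George", "Joe"), ("Joe", "Ann")]
for row in introductions(sample):
    print(row)
```

With this tiny sample, Joe has only 2 contacts, so the row for Ed reads ('Ed', 'George', 'Joe', 2) rather than the 20 shown in the example above.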

Call this script introduce.pig, and just submit it with the others on the command line.