In groups of 2-5 people, please create a Pig script friends.pig that implements LinkedIn's use of Hadoop to determine how many friends two non-friends share in common. The input is a list of contact pairs in a file friends.dat, e.g.,

    Alva Mark
    Siri Mark
    Alva George
    Siri George
    Fred George

The friend relationship is symmetric, i.e., listing Alva Mark implicitly means that Mark Alva is listed as well.
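As an illustration of the symmetry rule (in Python rather than Pig, and purely as a sketch), each listed pair can be expanded so that the contact appears in both directions. The function name load_symmetric and the in-memory list standing in for friends.dat are assumptions made for this example.

```python
def load_symmetric(lines):
    """Parse 'A B' contact pairs into a dict mapping each name to its set of friends."""
    friends = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = parts
        friends.setdefault(a, set()).add(b)
        friends.setdefault(b, set()).add(a)  # symmetric: 'A B' also implies 'B A'
    return friends


# Hypothetical stand-in for the contents of friends.dat
sample = ["Alva Mark", "Siri Mark", "Alva George", "Siri George", "Fred George"]
friends = load_symmetric(sample)
print(sorted(friends["George"]))
```

In a Pig script the same effect would need the reversed pairs to be added to the relation before any joins or groupings; how that is done is left to the script.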
The output of your script should be a set of sentences describing how many friends two non-friends have in common, e.g.,

    (Alva and Siri have 2 friend(s) in common.)
    (Fred and Siri have 1 friend(s) in common.)

where no pair reported in the output is already a pair of friends in the input. List the two names within each pair in alphabetical order, and sort the listing by number of friends, then first name, then second name.
I will only give you the schema for input and the schema for output. You have to test with reasonable data.
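One way to test with reasonable data is to compare the Pig output against a small reference computation. The Python sketch below is not the intended Pig solution. It assumes that pairs with no common friends are omitted, and that ordering by number of friends means largest count first, as the sample output suggests. The function name common_friend_report and its descending flag are hypothetical.

```python
from itertools import combinations


def common_friend_report(lines, descending=True):
    """Return the output sentences for a list of 'A B' contact pairs."""
    friends = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = parts
        friends.setdefault(a, set()).add(b)
        friends.setdefault(b, set()).add(a)  # friendship is symmetric

    rows = []
    # combinations over sorted names guarantees a < b alphabetically in each pair
    for a, b in combinations(sorted(friends), 2):
        if b in friends[a]:
            continue  # already friends, so not reported
        shared = len(friends[a] & friends[b])
        if shared > 0:  # assumption: pairs with nothing in common are omitted
            rows.append((shared, a, b))

    # sort by count (assumed largest first), then first name, then second name
    rows.sort(key=lambda r: (-r[0] if descending else r[0], r[1], r[2]))
    return ["(%s and %s have %d friend(s) in common.)" % (a, b, n) for n, a, b in rows]


# Hypothetical stand-in for the contents of friends.dat
sample = ["Alva Mark", "Siri Mark", "Alva George", "Siri George", "Fred George"]
report = common_friend_report(sample)
for sentence in report:
    print(sentence)
```

On the sample input, this sketch also reports Alva and Fred (1) and George and Mark (2). Both pairs satisfy the stated rule under a literal reading, even though the example output does not list them.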
Use CROSS to check pairs against one another.
Use COGROUP-BY to compute set differences.
Use FILTER-BY to eliminate redundant data.
Use CONCAT to produce the final output.
Next, provide hw04.html along with everything else via provide:

    provide comp150cpa hw04submit friends.pig hw04.html

Completed assignments will be printed, graded by hand, and scanned back into the system. The grade for hw04submit will be listed as the grade for hw04 in provide, so we can give everyone in the group credit.
(Ed,George,Joe,20) to mean that Ed knows George, who knows Joe, who has 20 contacts. Obviously, existing friends should not appear in the third column.
Call this script introduce.pig, and just submit it with the others on the command line.
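The tuple format can be illustrated with a small reference computation, again in Python and only as a sketch. It assumes the first column is a person, the second a contact of that person, the third a contact of the second whom the first does not already know, and the fourth that third person's number of contacts. The function name introductions and the alphabetical ordering are assumptions, since no ordering is stated.

```python
def introductions(lines):
    """Return (person, via, target, target_contact_count) tuples from 'A B' pairs."""
    friends = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        a, b = parts
        friends.setdefault(a, set()).add(b)
        friends.setdefault(b, set()).add(a)  # friendship is symmetric

    result = []
    for person in sorted(friends):
        for via in sorted(friends[person]):
            for target in sorted(friends[via]):
                # existing friends (and the person themselves) must not
                # appear in the third column
                if target == person or target in friends[person]:
                    continue
                result.append((person, via, target, len(friends[target])))
    return result


# Hypothetical stand-in for the contents of friends.dat
sample = ["Alva Mark", "Siri Mark", "Alva George", "Siri George", "Fred George"]
result = introductions(sample)
for row in result:
    print("(%s,%s,%s,%d)" % row)
```

A quick check on any test data is that no tuple has the same name in columns one and three, and that no tuple pairs two people who are already listed as contacts.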