Comp150CPA Homework Exercise 3: Becoming a Pig
- Objective: to write a few simple Pig scripts employing common
constructions.
- Deliverables: source code of your scripts, compatible
with Cloudera Training Environment 0.3.3.
- Work mode: exercise/lecture reinforcement, group work and
submissions encouraged.
Overview
In groups of 2-5 people, please create the following
Pig scripts. I will give you only the schema for the input data
and the schema for the output. You must test with reasonable data
of your own.
- Please write lookup.pig that simulates looking up the second
20 messages in my Gmail mailbox (i.e., messages numbered 21-40
for mailbox alva.couch), when ordered according to time of arrival.
The input schema is:
mail: {mailbox: chararray, time:long, messageid: long}
where
- mailbox is the name of the mailbox, e.g., 'alva.couch', and there
are several mailboxes.
- time is the seconds since the Epoch (Jan 1, 1970 00:00:00 GMT)
when the message was received.
- messageid is the message id of the message.
and the output schema should be the same, i.e.,
out: {mailbox: chararray, time:long, messageid: long}
but only for mailbox alva.couch, and only for the messages ranked
21 through 40 when ordered by time.
You may assume that the time is Linux system time, seconds since the Epoch.
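One way to test lookup.pig is to compute the expected rows for a small data set outside of Pig and compare. The sketch below is plain Python, not a Pig solution. It assumes that "ordered according to time of arrival" means oldest first, and the function name second_twenty, the second mailbox, and all sample values are made up for illustration.

```python
# Hypothetical reference check; rows are (mailbox, time, messageid).
def second_twenty(mail, mailbox="alva.couch", start=21, stop=40):
    """Return the messages ranked start..stop (1-based, inclusive) by time."""
    mine = [row for row in mail if row[0] == mailbox]
    mine.sort(key=lambda row: row[1])  # assumption: oldest message is number 1
    return mine[start - 1:stop]

# Made-up test data: 50 messages for alva.couch, 10 for another mailbox.
sample = [("alva.couch", 1000 + 10 * i, i) for i in range(1, 51)]
sample += [("someone.else", 1005 + 10 * i, 100 + i) for i in range(1, 11)]

expected = second_twenty(sample)
for row in expected:
    print(row)
```

If your reading of the ordering is newest first, reverse the sort and the same slice still applies.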
- Please write lazy.pig that determines which users are
forgetting to log out of their accounts.
The input schema is:
log: {time:long, user:chararray, action:chararray, status:chararray, address:chararray}
where
- time is the seconds since the Epoch (Jan 1, 1970 00:00:00 GMT)
as a long integer, when the action was performed.
- user is the user name used to log in, e.g., 'couch'.
- action is what the user did, e.g., 'login', 'logout', etc.
- status is 'success' or 'failure'.
- address is where the request originated.
The output schema should be:
out: {user: chararray, times:int}
where
- user is the username as before.
- times is the number of times the user did a 'login' from a particular
address, without a matching 'logout' afterward.
- the output is sorted in order of laziness: most failed logouts first.
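The definition of times above leaves some choices open, so it can help to pin down one interpretation in a small reference check before writing lazy.pig. The Python sketch below is not a Pig solution. It assumes that only rows with status 'success' count, that a logout cancels one earlier unmatched login from the same user and address, and that users with no unmatched logins are omitted. The function name, user names, and addresses are made up.

```python
from collections import defaultdict

# Hypothetical reference check; rows are (time, user, action, status, address).
def lazy_users(log):
    """Return [(user, times)] with the most unmatched logins first."""
    open_logins = defaultdict(int)  # (user, address) -> logins not yet matched
    for _time, user, action, status, address in sorted(log, key=lambda r: r[0]):
        if status != "success":
            continue  # assumption: failed attempts neither open nor close a session
        key = (user, address)
        if action == "login":
            open_logins[key] += 1
        elif action == "logout" and open_logins[key] > 0:
            open_logins[key] -= 1  # this logout matches one earlier login
    times = defaultdict(int)
    for (user, _address), count in open_logins.items():
        times[user] += count
    lazy = [(user, n) for user, n in times.items() if n > 0]
    return sorted(lazy, key=lambda row: (-row[1], row[0]))

# Made-up test data.
sample_log = [
    (100, "couch", "login", "success", "10.0.0.1"),
    (200, "couch", "logout", "success", "10.0.0.1"),
    (300, "couch", "login", "success", "10.0.0.2"),
    (400, "alice", "login", "success", "10.0.0.3"),
    (500, "alice", "login", "success", "10.0.0.3"),
    (600, "alice", "login", "failure", "10.0.0.3"),
    (700, "bob", "login", "success", "10.0.0.4"),
    (800, "bob", "logout", "success", "10.0.0.4"),
]

lazy_result = lazy_users(sample_log)
print(lazy_result)
```

Ties in times are broken alphabetically by user here only to make the output deterministic; the assignment does not specify a tie-break.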
- Please write a Pig script 'first.pig' that identifies who made the post that was the first in each positive buzz about 'Toyota'.
The input schema is:
posts: {time:long, person:chararray, mentions:chararray, opinion:chararray}
where
- time is the seconds since the Epoch (Jan 1, 1970 00:00:00 GMT)
as a long integer, when the post was made.
- person is the person who posted it.
- mentions is the product the person mentioned, e.g., a brand.
- opinion is one of 'positive', 'neutral', or 'negative'.
The output schema should be:
out: {time:long, person:chararray, posts:int}
where
- time is the seconds since the Epoch (Jan 1, 1970 00:00:00 GMT)
as a long integer, when the first post was made.
- person is the person who made the first post.
- posts is the total number of posts in the buzz.
As in the classroom example, feel free to choose 3600 seconds as the
sample length for the buzz.
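The classroom definition of a buzz is not repeated here, so the following Python sketch rests on an assumption: a buzz is one fixed 3600-second window (time divided by 3600, rounded down) containing at least one positive post about 'Toyota'. If the classroom example defines a buzz differently, for instance with a minimum number of posts, adjust accordingly. The sketch is a reference check for test data, not a Pig solution, and the function name and sample people are made up.

```python
# Hypothetical reference check; rows are (time, person, mentions, opinion).
def first_in_each_buzz(posts, product="Toyota", window=3600):
    """Return [(time, person, posts)] with one row per window, earliest window first."""
    buckets = {}  # window index -> matching (time, person) pairs in that window
    for time, person, mentions, opinion in posts:
        if mentions == product and opinion == "positive":
            buckets.setdefault(time // window, []).append((time, person))
    out = []
    for index in sorted(buckets):
        first_time, first_person = min(buckets[index])  # earliest post in the window
        out.append((first_time, first_person, len(buckets[index])))
    return out

# Made-up test data.
sample_posts = [
    (3700, "ann", "Toyota", "positive"),
    (3650, "raj", "Toyota", "positive"),
    (3900, "lee", "Toyota", "negative"),
    (4000, "ann", "Honda", "positive"),
    (7300, "lee", "Toyota", "positive"),
]

buzz_result = first_in_each_buzz(sample_posts)
print(buzz_result)
```

Here posts counts only the positive 'Toyota' posts in the window; whether neutral or negative posts belong to the buzz is another point to settle against the classroom example.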
Submitting completed assignments
This assignment is a bit more complex to submit because you need to
submit machine-readable code.
We will submit this assignment as a set of files. Click here
to fill out a form that, when submitted, generates a printable
form that can be saved to a file, e.g., hw03.html.
Next, provide this along with everything else via provide:
provide comp150cpa hw03submit lookup.pig lazy.pig first.pig hw03.html
Completed assignments will be printed, graded by hand, and
scanned back into the system.
The grade for hw03submit will be listed as the grade for hw03
in provide, so we can give everyone in the group credit.