Comp150CPA: Clouds and Power-Aware Computing
Classroom Exercise 9
Transformations and Schemas
Spring 2011
group member 1: ____________________________ login: ______________
group member 2: ____________________________ login: ______________
group member 3: ____________________________ login: ______________
group member 4: ____________________________ login: ______________
group member 5: ____________________________ login: ______________
In class we have studied how Pig statements transform data
schemas. Let's explore this in more detail.
Suppose that inside Pig, we have
grunt> DESCRIBE x;
x: {name: chararray, bored: int}
grunt> DESCRIBE y;
y: {name: chararray, joking: int}
- What schemas result from the following statements?
z = FOREACH x GENERATE name;
Answer: z: {name: chararray}
z = FILTER x BY name=='Alva';
Answer: z: {name: chararray, bored: int}
z = GROUP x BY name;
Answer: z: {group: chararray, x: {name: chararray, bored: int}}
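To see why the grouped schema nests x as a bag, consider some hypothetical contents for x (these tuples are invented for illustration):

```pig
-- x:             z = GROUP x BY name:
-- (Alva, 3)      (Alva, {(Alva, 3), (Alva, 1)})
-- (Alva, 1)      (Ben,  {(Ben, 2)})
-- (Ben,  2)
```

Each output tuple pairs a distinct name with a bag of every x tuple carrying that name, hence the nested schema.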
z = JOIN x BY bored, y BY joking;
Answer: z: {x::name: chararray, x::bored: int, y::name: chararray, y::joking: int}
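Because both x and y contribute a field called name, the joined fields are disambiguated with the :: prefix, and downstream references must use it. A sketch using the same relations:

```pig
z = JOIN x BY bored, y BY joking;
w = FOREACH z GENERATE x::name, y::name;
-- The disambiguated names survive the projection:
-- DESCRIBE w;  -->  w: {x::name: chararray, y::name: chararray}
```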
z = GROUP x BY name;
w = FOREACH z GENERATE group AS name, x.bored AS bored;
Answer: w: {name: chararray, bored: {bored: int}}
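Note that x.bored projects a bag out of each group, which is why bored stays bag-valued in w. If a flat int column is wanted instead, Pig's built-in FLATTEN operator un-nests the bag, emitting one output tuple per element. A sketch using the same relations:

```pig
z = GROUP x BY name;
w = FOREACH z GENERATE group AS name, FLATTEN(x.bored) AS bored;
-- DESCRIBE w;  -->  w: {name: chararray, bored: int}
```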
- (Advanced)
The Pig cookbook advises that one should use FOREACH-GENERATE to
omit unneeded columns as early as possible and as often as possible. Why?
Answer:
The reason is that as you project unneeded columns out of your active dataset, each
tuple shrinks: local calculations become cheaper, and the map and reduce phases
cost less in CPU time and memory.
The earlier you do this, the more computation time and memory you save.
This does not necessarily make the computation faster, though judicious
use has been reported to speed up computations by as much as a factor of 10.
What it does do, in all cases, is save money whenever you are paying for CPU
time, as is true if you are running your Pig jobs on Amazon EC2.
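As a concrete sketch of this advice (the relation logs and its columns are invented for illustration):

```pig
-- Suppose: logs: {user: chararray, url: chararray, bytes: long, agent: chararray}
-- Only user and bytes are needed downstream, so drop the other columns at once:
small  = FOREACH logs GENERATE user, bytes;
g      = GROUP small BY user;
totals = FOREACH g GENERATE group AS user, SUM(small.bytes) AS total;
-- Every map output tuple now carries two fields instead of four, so less data
-- is serialized, shuffled, and held in memory during the reduce phase.
```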