Comp150CPA: Clouds and Power-Aware Computing
Classroom Exercise 9
Transformations and Schemas
Spring 2011

In class we have studied how Pig statements transform data schemas. Let's explore this in more detail. Suppose that inside Pig's grunt shell we have the following two relations:

 
grunt> DESCRIBE x; 
x: {name: chararray, bored: int}
grunt> DESCRIBE y; 
y: {name: chararray, joking: int}

What schemas result from the following statements?

z = FOREACH x GENERATE name;

Answer:

z: {name: chararray}

z = FILTER x BY name=='Alva';

Answer:

z: {name: chararray, bored: int}

z = GROUP x BY name;

Answer:

z: {group: chararray, x: {name: chararray, bored: int}}

z = JOIN x BY bored, y BY joking;

Answer:

z: {x::name: chararray, x::bored: int, y::name: chararray, y::joking: int}
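The x:: and y:: prefixes exist because both inputs have a field called name. As a minimal sketch of how one might work with the joined relation afterwards, the statements below rename the fields back to plain names. The file names x.txt and y.txt and the alias v are assumptions made up for illustration; only the schemas come from the exercise.

```pig
-- Hypothetical tab-delimited input files; only the schemas are from the exercise.
x = LOAD 'x.txt' AS (name: chararray, bored: int);
y = LOAD 'y.txt' AS (name: chararray, joking: int);

z = JOIN x BY bored, y BY joking;

-- A bare 'name' would be ambiguous after the join, so it has to be
-- qualified as x::name or y::name. The AS clauses drop the prefixes.
v = FOREACH z GENERATE x::name AS name, x::bored AS bored, y::joking AS joking;
DESCRIBE v;
```

With every field given an explicit AS name, v should have the schema {name: chararray, bored: int, joking: int}.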

z = GROUP x BY name;
w = FOREACH z GENERATE group AS name, x.bored AS bored;

Answer:

w: {name: chararray, bored: {bored: int}}
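Note that bored in w is a bag, not an int: it holds every bored value that shared the same name. The sketch below contrasts that with FLATTEN, which un-nests the bag again. The file name x.txt and the alias u are assumptions for illustration.

```pig
-- Hypothetical tab-delimited input file; only the schema is from the exercise.
x = LOAD 'x.txt' AS (name: chararray, bored: int);

z = GROUP x BY name;

-- x.bored projects the bored column out of the grouped bag,
-- so w.bored is a bag of ints, one entry per original row.
w = FOREACH z GENERATE group AS name, x.bored AS bored;

-- FLATTEN un-nests the bag: one output row per (name, bored) pair.
u = FOREACH z GENERATE group AS name, FLATTEN(x.bored) AS bored;
DESCRIBE u;
```

After flattening, u should be back to the flat schema {name: chararray, bored: int}.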

(Advanced) The Pig cookbook advises that one should use FOREACH-GENERATE to omit unneeded columns as early as possible and as often as possible. Why?
Answer:

Projecting unneeded columns out of your active dataset makes every record smaller, so the local calculations become easier and the map and reduce phases become less costly in CPU time and memory. The earlier you do this, the more computation time and memory you save. This does not necessarily make the computation faster, though judicious use has been reported to speed up computations by as much as a factor of 10. What it does do, in all cases, is save money if you are paying for CPU time, as you are when you run your Pig jobs on Amazon EC2.
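As a rough sketch of what projecting early looks like in practice, the script below drops two columns immediately after loading, before the GROUP that triggers the reduce phase. The file name people.txt and the extra columns address and notes are hypothetical, invented only to give the projection something to remove.

```pig
-- Hypothetical input with two wide columns that the job never uses.
raw = LOAD 'people.txt' AS (name: chararray, bored: int,
                            address: chararray, notes: chararray);

-- Project early: keep only the columns needed downstream, so the
-- GROUP below moves smaller records between the map and reduce phases.
x = FOREACH raw GENERATE name, bored;

z = GROUP x BY name;

-- Aggregate the bag of bored values for each name.
w = FOREACH z GENERATE group AS name, AVG(x.bored) AS avg_bored;
DESCRIBE w;
```

Grouping raw directly would give the same averages, but every grouped record would also carry address and notes through the job.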