Comp150CPA: Clouds and Power-Aware Computing
Classroom Exercise 10b (not assigned)
Map and Reduce
Spring 2011

group member 1: ____________________________ login: ______________

group member 2: ____________________________ login: ______________

group member 3: ____________________________ login: ______________

group member 4: ____________________________ login: ______________

group member 5: ____________________________ login: ______________

In class we have studied how to convert familiar algorithms to the MapReduce framework. Let's explore that in more detail.

  1. Why can the reducer and combiner be the same subroutine in WordCount?
    Answer: It does not matter whether one sums up the ones before or after all of a key's data arrives at the reducer node: a partial sum contributes correctly to the total sum because addition is associative (and commutative). A short sketch follows after the note below.

    (An example of a situation in which you cannot use a combiner is if you wanted to grab "every other" key in sorted order. You cannot know what "every other" means in order until you have all of them.)
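
    As a concrete illustration, here is a minimal sketch of the WordCount reduce logic in plain Python (not actual Hadoop code). Because the operation is just a sum, the identical function serves as combiner and reducer:

    def reduce_wordcount(word, counts):
        # Works as both combiner and reducer: summing partial sums
        # gives the same total as summing the raw ones.
        return word, sum(counts)

    # Combiner on one mapper node: collapse three ones into a partial sum.
    partial = reduce_wordcount("the", [1, 1, 1])      # ("the", 3)

    # Reducer: combine partial sums arriving from two mapper nodes.
    print(reduce_wordcount("the", [partial[1], 2]))   # ("the", 5)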

  2. Suppose that rather than counting words, you want to compute the range of frequencies of words, i.e., the lowest and highest word counts over all documents in a set. Describe the mapper input and output necessary to solve this problem. Feel free to use new classes.
    Answer: There was some confusion about what I meant by this one. I actually meant for you to use a key that is a document name, and a value that consists of the lowest and highest word counts within that document. E.g., if the lowest-frequency word in foo.txt is "albatross" with a frequency of 2, and the highest-frequency word is "the" with 500 instances, then you would output "foo.txt" -> < 2, 500 >.

    First, we arrange to read each whole document, keyed by its name. We accomplish this by changing the input type: input to the mapper is a set of pairs < document_name, document_text > mapping a document's name to its full text. Output of the mapper for each document is a set of pairs < document_name, < frequency, frequency > >, where "frequency" is the frequency of one word. You output lots of these, one per (document, word) pair. These are guesses at the lowest and highest frequencies.

    Note: I get around this hack by using composite types in the next lecture.

    The reducer (and combiner) takes a key that is a document name and an iterator over pairs of low/high estimates < m, n >. The output of the reducer is document_name -> < min, max >, where "min" is the minimum of the m's and "max" is the maximum of the n's. Over all words in the document, these are the desired lowest and highest frequencies. This solves the problem.

    To obtain the candidate words as well, map the pair < document_name, document_text > to multiple pairs < document_name, < low_word, low, high_word, high > >, where "low_word" and "high_word" are the low- and high-frequency words, respectively, while "low" and "high" are their counts. Of course, the highest-frequency word might not be unique, so the reducer may have to make an arbitrary choice. A sketch of the simpler min/max version follows below.
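
    Here is a sketch of that min/max version in plain Python (not Hadoop; Counter and the function names are illustrative):

    from collections import Counter

    def map_range(document_name, document_text):
        # Emit one < document_name, < f, f > > guess per distinct word,
        # where f is that word's count within the document.
        for word, f in Counter(document_text.split()).items():
            yield document_name, (f, f)

    def reduce_range(document_name, pairs):
        # Fold the guesses: minimum of the m's, maximum of the n's.
        # min and max are associative, so this also works as a combiner.
        lows, highs = zip(*pairs)
        return document_name, (min(lows), max(highs))

    guesses = [v for _, v in map_range("foo.txt", "the cat and the hat and the bat")]
    print(reduce_range("foo.txt", guesses))   # ('foo.txt', (1, 3))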

  3. Suppose that after mapping, you have four mapper outputs with contents
    Mapper 0
    key         value
    foo         1
    albatross   10

    Mapper 1
    key         value
    albatross   7
    foo         3

    Mapper 2
    key         value
    bane        5
    joe         1

    Mapper 3
    key         value
    bar         5
    albatross   7
    Suppose that the partitioning hash function maps words beginning with "a"-"g" to 0, "h"-"n" to 1, "o"-"u" to 2, and "v"-"z" to 3. Simulate the result of each step of the butterfly sort and describe how many communication steps are necessary to completely sort the data above onto the appropriate workers.
    Answer: First we compute the desired worker addresses for each word, which are h(word):
    word        hash
    albatross   0
    bane        0
    bar         0
    foo         0
    joe         1
    So at the end, every key will end up on worker 0 or 1.
    1. In the first step (bit 0),
      1. albatross and foo on worker 0 = 000₂ belong there and stay put.
      2. albatross and foo on worker 1 = 001₂ are sent to worker 0 = 000₂ where they belong.
      3. bane and joe on worker 2 = 010₂ stay put, because worker 3 = 011₂ is farther from where they both want to be!
      4. bar and albatross on worker 3 = 011₂ are sent to worker 2 = 010₂.
    2. In the second step (bit 1),
      1. All on worker 0 = 000₂ stay put.
      2. All on worker 2 = 010₂ move to worker 0 = 000₂, including joe(!), because that puts him closer to his destination of worker 1 = 001₂.
    3. In the third step (bit 0),
      1. joe moves from worker 0 = 000₂ to worker 1 = 001₂ (done).
    So three communication steps are needed to route every key to its proper worker. (A small sanity-check sketch follows below.)
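
    Here is a small Python sketch (the helper h is hypothetical, not from the handout) that applies the stated partitioning hash to each key and confirms the destinations in the table above:

    def h(word):
        # "a"-"g" -> 0, "h"-"n" -> 1, "o"-"u" -> 2, "v"-"z" -> 3
        for worker, (lo, hi) in enumerate(
                [("a", "g"), ("h", "n"), ("o", "u"), ("v", "z")]):
            if lo <= word[0] <= hi:
                return worker
        raise ValueError(word)

    for word in ["albatross", "bane", "bar", "foo", "joe"]:
        print(word, "-> worker", h(word))
    # albatross, bane, bar, and foo go to worker 0; joe goes to worker 1.
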
  4. (Advanced) Suppose that we represent a table of distances between cities in MapReduce as city1 -> < distance, city2 > where city1 and city2 are cities and distance is the shortest distance between them. Give a MapReduce algorithm for determining the shortest distance between two cities X and Y, using intermediate cities when necessary. What does the Map step do, and what does the Reduce step do?
    Answer: This is (of course) insanely tricky.

    I will adopt a simple notation for key/value pairs. I will write < key, value > for a particular key/value pair. If a key has two parts A and B, we will write that as < < A, B >, value >. Same in reverse for multi-part values.

    First, suppose you just wanted to run a classic all-pairs shortest-path algorithm serially. You could do this by "collapsing the map" so that all data goes to one reducer. If the input to the mapper is of the form < city1, < distance, city2 > >, then the output is < null, < city1, distance, city2 > >, where "null" is the same key for all values. Using the same key for all triples lumps them onto the same worker and into the same iterator in the reducer. Thus the input to the reducer is all triples, and the output of the reducer is all shortest triples, computed by a serial all-pairs shortest-path algorithm such as Floyd-Warshall. But -- of course -- this is disgusting!
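
    For concreteness, here is a sketch of the collapsed version in plain Python (illustrative names; the serial algorithm shown is Floyd-Warshall):

    def map_collapse(city1, value):
        distance, city2 = value
        # The same key (None) for every triple lumps them onto one reducer.
        yield None, (city1, distance, city2)

    def reduce_collapse(_, triples):
        # The lone reducer sees every edge and solves the problem serially.
        edges = list(triples)
        cities = sorted({c for a, _, b in edges for c in (a, b)})
        INF = float("inf")
        d = {(a, b): INF for a in cities for b in cities}
        for c in cities:
            d[c, c] = 0
        for a, dist, b in edges:
            d[a, b] = d[b, a] = min(d[a, b], dist)
        for k in cities:                       # Floyd-Warshall relaxation
            for i in cities:
                for j in cities:
                    d[i, j] = min(d[i, j], d[i, k] + d[k, j])
        return {(a, b): d[a, b] for a in cities for b in cities if a != b}

    records = [("X", (1, "A")), ("A", (2, "Y"))]
    collapsed = [v for c, val in records for _, v in map_collapse(c, val)]
    print(reduce_collapse(None, collapsed)[("X", "Y")])   # 3, via A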

    To exploit parallelism, one has to be more tricky. The key is to perform part of the shortest-path computation at each step. The basic construction is relaxation: consider each path from A to B to C as a candidate way to get from A to C, just as Dijkstra's algorithm does. To do this in parallel, we can transform the data into a more convenient form and then use several map/reduce phases.

    Step 1: First we transform < city1, < distance, city2 > > into < < city1, city2 >, distance > and < < city2, city1 >, distance > with a mapper, treating both cities as a pair key. In the reducer, we select the minimum distance for each key < city1, city2 >. This ensures that redundant routes between the same pair of cities collapse to the shortest one, and makes the representation symmetric, which simplifies what follows. This is the input to the next mapper.
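
    A sketch of Step 1 in plain Python (illustrative names):

    def map_symmetrize(city1, value):
        distance, city2 = value
        yield (city1, city2), distance
        yield (city2, city1), distance   # make the representation symmetric

    def reduce_min_distance(pair, distances):
        # Redundant routes between the same pair collapse to the shortest.
        return pair, min(distances)

    print(reduce_min_distance(("X", "Y"), [9, 3, 7]))   # (('X', 'Y'), 3)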

    Step 2: The next mapping step transforms data back from < < city1, city2 >, distance > to the original form < city1, < distance, city2 > >. The reduce phase sees a key city1 and all adjacent cities city2, with their distances. The reduce phase then generates new candidates for distance as follows (sketched in Python, where emit stands for writing one key/value pair):

     
    values = list(iterator)   # buffer first: a reducer's iterator can
                              # only be traversed once in real Hadoop
    for distance2, city2 in values:
        emit((city1, city2), distance2)
        for distance3, city3 in values:
            if city2 != city3:
                emit((city2, city3), distance2 + distance3)
    
    Yes, this is a reducer whose output is larger than its input. That is fine! Never mind that we are "abusing language" a bit! The point is that if city2 is connected to city1, and city1 is connected to city3, then city2 is connected to city3 with the obvious combined distance. This is one relaxation step, exactly as in Dijkstra's algorithm!

    Step 3: Now we have an inflated set of distances, some of which are too big. At the next step, we again pass data through the mapper unchanged, but choose the minimum distance in the reducer, so that we now have a "best so far" set of distances < < city1, city2 >, distance >.
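
    A sketch of Step 3 under the same conventions (illustrative names):

    def map_unchanged(pair, distance):
        yield pair, distance             # pass data through untouched

    def reduce_best(pair, distances):
        yield pair, min(distances)       # keep the "best so far" distance

    print(next(reduce_best(("X", "Y"), [9, 3, 7])))   # (('X', 'Y'), 3)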

    We repeat steps 2 and 3 above until we generate no further candidates. We know that will happen within N repeats, where N is the number of cities; in fact, since each round can splice together two already-discovered paths, the hop count of the paths we know about doubles each round, so on the order of log N repeats suffice.
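
    Putting it all together, here is a self-contained Python sketch of the whole pipeline (all names illustrative; a tiny in-memory stand-in plays the role of one MapReduce phase):

    from collections import defaultdict

    def mapreduce(records, mapper, reducer):
        # In-memory stand-in for one MapReduce phase: group mapper output
        # by key, then hand each key's values to the reducer.
        groups = defaultdict(list)
        for key, value in records:
            for k, v in mapper(key, value):
                groups[k].append(v)
        return [out for k, vs in groups.items() for out in reducer(k, vs)]

    def map_sym(city1, value):               # Step 1 mapper
        d, city2 = value
        yield (city1, city2), d
        yield (city2, city1), d

    def reduce_min(pair, ds):                # Steps 1 and 3 reducer
        yield pair, min(ds)

    def map_adj(pair, d):                    # Step 2 mapper
        yield pair[0], (d, pair[1])

    def reduce_relax(city1, values):         # Step 2 reducer (relaxation)
        values = list(values)                # buffer: traversed twice
        for d2, city2 in values:
            yield (city1, city2), d2
            for d3, city3 in values:
                if city2 != city3:
                    yield (city2, city3), d2 + d3

    def shortest_distances(edges):
        # edges: records in the original < city1, < distance, city2 > > form.
        best = mapreduce(edges, map_sym, reduce_min)
        while True:
            candidates = mapreduce(best, map_adj, reduce_relax)
            new = mapreduce(candidates, lambda k, v: [(k, v)], reduce_min)
            if dict(new) == dict(best):      # no candidate improved anything
                return dict(new)
            best = new

    edges = [("X", (1, "A")), ("A", (2, "Y")), ("X", (9, "Y"))]
    print(shortest_distances(edges)[("X", "Y")])   # 3, via intermediate city A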

    This is just one way of computing shortest paths in parallel with MapReduce. It is not particularly fast, but it works on very large graphs.