In class we have studied how to convert familiar algorithms to the MapReduce framework. Let's explore that in more detail.
(An example of a situation in which you cannot use a combiner is if you wanted to grab "every other" key in sorted order. You cannot know what "every other" means in order until you have all of them.)
First, we can arrange to read the whole document, keyed by its name. We accomplish this by changing the input type. Input to the mapper is thus a document name mapped to the full text of the document as a set of pairs < document_name, document_text >. Output of the mapper for each document is a set of pairs < document_name, < frequency, frequency > >, where "frequency" is the frequency of one word. You output lots of these, one per (document, word) pair. These are guesses at the lowest and highest frequencies.
Note: I get around this hack by use of composite types in the next lecture.
The reducer (and combiner) takes a key that is a document name and an iterator over pairs of low/high estimates < m, n >. The output of the reducer is document_name -> < min, max >, where "min" is the minimum of the m's and "max" is the maximum of the n's. Over all words in the document, these are the desired low and high frequencies. This solves the problem.
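As a sketch in Python (treating the iterators as plain lists; the function names are mine, not any framework's API):

```python
from collections import Counter

def mapper(document_name, document_text):
    # Emit one < document_name, < frequency, frequency > > pair per
    # distinct word; each count is both a low and a high estimate.
    for word, freq in Counter(document_text.split()).items():
        yield document_name, (freq, freq)

def reducer(document_name, pairs):
    # Fold the low/high estimates into the true minimum and maximum.
    lows, highs = zip(*pairs)
    yield document_name, (min(lows), max(highs))

pairs = [v for _, v in mapper("doc1", "a b b c c c")]
result = dict(reducer("doc1", pairs))
# result["doc1"] == (1, 3)
```

Because min and max are associative, the same function serves as the combiner, which is why the text can say "reducer (and combiner)".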
To obtain candidate words, map the pair < document_name, document_text > to multiple pairs < document_name, < low_word, low, high_word, high >>, where "low_word" and "high_word" are the low and high words, respectively, while "low" and "high" are their counts. Obviously the highest frequency word might not be unique, so the reducer might have to make an arbitrary choice.
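A sketch of this variant (again Python, names mine); note where the arbitrary tie-break the text mentions happens:

```python
from collections import Counter

def mapper(document_name, document_text):
    # One candidate per word: each word is a guess at both the lowest-
    # and highest-frequency word in the document.
    for word, freq in Counter(document_text.split()).items():
        yield document_name, (word, freq, word, freq)

def reducer(document_name, candidates):
    candidates = list(candidates)  # we scan the iterator twice
    # min/max keep the first candidate they see on ties, so the
    # choice among equally frequent words is arbitrary, as noted.
    low_word, low, _, _ = min(candidates, key=lambda c: c[1])
    _, _, high_word, high = max(candidates, key=lambda c: c[3])
    yield document_name, (low_word, low, high_word, high)

out = list(reducer("doc1", [v for _, v in mapper("doc1", "a b b c c c")]))
# out == [("doc1", ("a", 1, "c", 3))]
```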
| word | hash |
|---|---|
| albatross | 0 |
| bane | 0 |
| bar | 0 |
| foo | 0 |
| joe | 1 |
I will adopt a simple notation for key/value pairs. I will write < key, value > for a particular key/value pair. If a key has two parts A and B, we will write that as < < A, B >, value >. Same in reverse for multi-part values.
First, suppose you just wanted to do the classic all-pairs shortest path algorithm as a serial thing. You could do this by "collapsing the map" so that all data goes to one reducer. If the input to the mapper was of the form < city1, < distance, city2 > >, then the output would be < null, < city1, distance, city2 > >, where "null" is the same key for all values. The effect of using the same key for all triples lumps them onto the same worker and into the same iterator in the reducer. Thus the input to the reducer is all triples, and the output of the reducer is all shortest triples, computed by running any serial all-pairs shortest-path algorithm. But -- of course -- this is disgusting!
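A sketch of the "collapsed" version in Python: the mapper keys everything on None, and the lone reducer can then run any serial all-pairs routine (I use Floyd-Warshall here as the serial algorithm; the function names are mine):

```python
def collapse_mapper(city1, value):
    # Same key (None) for every triple, so the whole graph lands
    # in a single reducer call.
    distance, city2 = value
    yield None, (city1, distance, city2)

def collapse_reducer(_, triples):
    # One worker now holds the entire graph: build a symmetric
    # distance table, then run serial Floyd-Warshall over it.
    dist, nodes = {}, set()
    for a, d, b in triples:
        nodes |= {a, b}
        dist[a, b] = min(d, dist.get((a, b), d))
        dist[b, a] = dist[a, b]
    for n in nodes:
        dist[n, n] = 0
    for k in nodes:
        for i in nodes:
            for j in nodes:
                if (i, k) in dist and (k, j) in dist:
                    alt = dist[i, k] + dist[k, j]
                    if alt < dist.get((i, j), alt + 1):
                        dist[i, j] = alt
    for pair, d in dist.items():
        yield pair, d

shortest = dict(collapse_reducer(None, [("A", 1, "B"), ("B", 2, "C")]))
# shortest[("A", "C")] == 3
```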
To exploit parallelism, one has to be more tricky. The key is to perform part of Dijkstra's algorithm for each step. The basic construction in Dijkstra is to consider each path from A to B to C as a candidate way to get from A to C. To do this in parallel, we can transform the data to a more convenient form and then use several map/reduce phases.
Step 1: First we transform < city1, < distance, city2 > > into < < city1, city2 >, distance > and < < city2, city1 >, distance > with a mapper, treating both cities as a pair key. In the reducer, we select the minimum distance for each key < city1, city2 >. This ensures that redundant routes between cities are not counted, and makes the representation symmetric, which simplifies what follows. This is going to be the input to the next mapper.
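Step 1's mapper and reducer might be sketched like this in Python (a dict stands in for the shuffle; names are mine):

```python
from collections import defaultdict

def step1_mapper(city1, value):
    # Re-key each record on the city pair, in both orders, so that
    # duplicate routes and reversed routes share a key.
    distance, city2 = value
    yield (city1, city2), distance
    yield (city2, city1), distance

def step1_reducer(pair, distances):
    # Keep only the shortest known distance for each ordered pair.
    yield pair, min(distances)

# Simulate the shuffle: group mapper output by key, then reduce.
groups = defaultdict(list)
for city, value in [("A", (5, "B")), ("B", (3, "A"))]:
    for key, d in step1_mapper(city, value):
        groups[key].append(d)
best = dict(out for key, ds in groups.items() for out in step1_reducer(key, ds))
# best == {("A", "B"): 3, ("B", "A"): 3}
```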
Step 2: The next mapping step transforms data back from << city1, city2 >, distance > to the original form < city1, < distance, city2 > >. The reduce phase sees a key city1 and all adjacent cities city2, with their distances. The reduce phase proceeds to generate new candidates for distance as follows:
for all < distance2, city2 > in the iterator:
    output < < city1, city2 >, distance2 >
    for all < distance3, city3 > in the iterator:
        if city2 != city3:
            output < < city2, city3 >, distance2 + distance3 >

Yes, this is a reducer whose output is larger than its input. That is fine! Never mind that we are "abusing language" a bit! The point of this is that if city2 is connected to city1, and city1 is connected to city3, then city2 is connected to city3 with the (obvious) distance. This is one step of Dijkstra!
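The same step as runnable Python, a sketch (the iterator is materialized into a list because we scan it twice):

```python
def step2_reducer(city1, neighbors):
    # neighbors is the iterator of < distance, city > pairs adjacent to city1.
    neighbors = list(neighbors)  # scanned twice below
    for distance2, city2 in neighbors:
        # Re-emit the known route from city1 to city2.
        yield (city1, city2), distance2
        for distance3, city3 in neighbors:
            if city2 != city3:
                # city2 -> city1 -> city3 is a new candidate route.
                yield (city2, city3), distance2 + distance3

out = list(step2_reducer("A", [(1, "B"), (2, "C")]))
# out contains (("B", "C"), 3): B and C are now connected through A.
```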
Step 3: Now we have an inflated set of distances, some of which are too big. At the next step, we again pass data through the mapper unchanged, but choose the minimum distance in the reducer, so that we now have a "best so far" set of distances < < city1, city2 >, distance >.
We repeat steps 2 and 3 above until we generate no further candidates. We know that will happen within N repeats, where N is the number of nodes.
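On a single machine, the whole iteration of steps 2 and 3 can be sketched like this (a dict stands in for the shuffle; the input is the symmetric output of step 1):

```python
from collections import defaultdict

def iterate_shortest_paths(dist):
    # dist maps (city1, city2) -> distance, symmetric (step 1's output).
    while True:
        candidates = defaultdict(list)
        adjacency = defaultdict(list)
        for (a, b), d in dist.items():       # step 2's mapper: back to
            adjacency[a].append((d, b))      # < city1, < distance, city2 > >
        for city1, neighbors in adjacency.items():
            for d2, city2 in neighbors:      # step 2's reducer
                candidates[city1, city2].append(d2)
                for d3, city3 in neighbors:
                    if city2 != city3:
                        candidates[city2, city3].append(d2 + d3)
        # Step 3: keep the minimum candidate distance per pair.
        new_dist = {pair: min(ds) for pair, ds in candidates.items()}
        if new_dist == dist:                 # no new candidates: done
            return dist
        dist = new_dist

roads = {("A", "B"): 1, ("B", "A"): 1, ("B", "C"): 1, ("C", "B"): 1,
         ("C", "D"): 1, ("D", "C"): 1}
paths = iterate_shortest_paths(roads)
# paths[("A", "D")] == 3
```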
This is just one way of accomplishing Dijkstra in parallel. It is not particularly fast, but it works on very large graphs.