Comp150CPA Midterm Exam Review Answers
Spring 2011

The following are sketches of the important ingredients of a correct answer. Many answers are acceptable.

  1. (25 points) Suppose you have a document space and a word of interest, and want to to output the name of the single document in which the word appears first. I.e., if there are multiple documents containing the word, you would output the name of the one in which the word appears at the earliest character. Give the pseudo-code for a Hadoop algorithm that does this.

    Answer: The key to the problem is what the mapper does and what the reducer does. The default input to the first mapper is a numeric offset and a document line. We need to transform that into pairs < document, position > where "document" is a document name and "position" is where the word occurs. Any pseudo-code that does this is acceptable.

    For the reducer, one outputs the minimum "position" for each document. as a pair < document, position >. Since "minimum" is associative, one can use this reducer as a combiner.

    But at this point, one isn't done. What if there are lots of these? One still needs to pick the one that is minimum. For this, we map < document, position > to < 1, < document, position > > and then, in a second reducer, pick the one for which the position is minimum.

    Any pseudo-code that does this is acceptable. It is much less efficient to reduce < 1, < document, position > > in the first step.

  2. (25 points) Sketch how you would accomplish the same thing in Google AppEngine, and describe the differences between doing it in Hadoop and AppEngine.

    Answer: The key is that in AppEngine, Map/Reduce operations are hidden inside data queries. To exploit this, one must input data in a form that implies appropriate Map/Reduce operations during queries. .

    First, one must input the "documents" from Hadoop into a datastore that has the same structure, i.e., the AppEngine objects should have subparts such as < document, lineNumber, word >. One queries these for the objects whose words match the word in question, which is in essence a (hidden) Map operation. One can then iterate through the results and pick the minimum. Since one does not have "explicit" access to "reduce", this is slower.

    However, there is another, much more clever option. One can set the ordering for a query and the number of results to be obtained. Using query.setRange(1,1) limits the query to return the very first result in order, while using query.setOrdering("lineNumber") orders the results by line number. Using the two together is, in essence, a reduce-to-minimum operation.

  3. (25 points) A house cleaning service takes a request from a person, negotiates the time of day to come, arrives and cleans the house, bills the person, and receives payment. Give a BPMN diagram for this business process.

    Answer: The key to this problem is that a "negotiation" is a pair of repeated loops, where the loops communicate with each other without deadlock. E.g., consider:

    In this diagram, which is a single negotiation, note that the decisions are done in parallel based upon communication, and that one does not branch before sharing state. Many people branched before sharing state, which causes the process to deadlock.

  4. (25 points) A small company has 10 employees making an average of $50,000 a year. The company has a net profit of $3,000,000 a year from web sales. Estimate the cost of an hour of downtime.

    Answer: Cost of downtime = revenue lost + work lost.

    This does not count the typical 2-4 weeks vacation, and is thus an underestimate. Also, many work weeks vary from the 40 hour/week standard.