Comp150CPA: Clouds and Power-Aware Computing
Midterm Examination.
April 6, 2011 -- Open Notes

Name: ______________________________________________

Login: ______________

Please answer the following questions on these sheets. Please put your name at the top of each sheet. You may use any printed material. Electronic devices are prohibited.

  1. (20 points) Can an object marked as @Persistent in AppEngine contain an instance of another @Persistent object as a member? Why or why not?
    Answer: No. An object can contain a reference to another persistent object (by key), but cannot contain the content of the object. To see why this is a good idea, remember that every object instance is accessible via a (unique) key. A persistent sub-object would also have a key, so that there would be two keys for one object.
  2. (30 points) Suppose you have an input that has entries like the following:
    time person activity 
    Please write a Pig script that computes the total time that George spent sleeping. This is the sum of the differences between each time George started sleeping and the next time George started doing something else. For example, for the data
    10001  George sleeping
    10006  George walking
    10100  George sleeping
    10200  George showering
    10500  George running
    your program should output the number 105, which is 10006 - 10001 + 10200 - 10100. You may assume that every time George starts sleeping, there is another subsequent event when George stops sleeping.
    Answer: First we input the data via
    log = LOAD 'file.dat' USING PigStorage() AS (stamp:int, name:chararray, activity:chararray);
    -- schema is log:{stamp, name, activity}
    We are only interested in George:
    log2 = FILTER log BY name=='George';
    log3 = FOREACH log2 GENERATE stamp, activity;
    -- schema is log3:{stamp, activity}
    And we are only interested in whether George is sleeping or not:
    tmp1 = FILTER log3 BY activity=='sleeping';
    asleep = FOREACH tmp1 GENERATE stamp;
    -- schema is asleep:{stamp}
    tmp2 = FILTER log3 BY activity!='sleeping';
    awake = FOREACH tmp2 GENERATE stamp;
    -- schema is awake:{stamp}
    Next, we pair up asleep and awake states, using a cross product:
    prod = CROSS asleep, awake;
    -- schema is prod:{asleep::stamp, awake::stamp}
    less = FILTER prod BY asleep::stamp < awake::stamp;
    grp = GROUP less BY asleep::stamp;
    -- schema is grp:{group, less:{asleep::stamp, awake::stamp}}
    after which we have grp as follows:
    10001 {(10001, 10006),
           (10001, 10200),
           (10001, 10500)}
    10100 {(10100, 10200),
           (10100, 10500)}
    and throw away all but the smallest awake::stamp in each group:
    least = FOREACH grp {
        foo1 = ORDER less BY awake::stamp;
        foo2 = LIMIT foo1 1;
        GENERATE FLATTEN(foo2);
    };
    -- schema is least:{asleep::stamp, awake::stamp}
    which gives us least as
    10001 10006 
    10100 10200 
    after which we subtract and sum the results:
    sub = FOREACH least GENERATE (awake::stamp - asleep::stamp) AS slept;
    -- schema is sub:{slept}
    every = GROUP sub ALL;
    -- schema is every:{group, sub:{slept}}
    total = FOREACH every GENERATE SUM(sub.slept);
    DUMP total;
    Of course, there are many other ways to do this.
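    As a cross-check of the expected output, the same dataflow can be sketched outside Pig. The Python below is a minimal sketch and not part of the original answer: the function name total_sleep and the in-memory tuple format are assumptions made for illustration. It mirrors the steps of the script: filter by person, split into asleep and awake stamps, pair each asleep stamp with the smallest later awake stamp, then sum the differences.

```python
def total_sleep(events, person="George"):
    """Sum the gaps between each 'sleeping' event and the next non-sleeping event.

    events: iterable of (stamp, name, activity) tuples (assumed format).
    Assumes, as the question does, that every sleep start is followed by a
    later non-sleeping event for the same person.
    """
    # FILTER by name, then keep (stamp, activity)
    mine = [(stamp, activity) for stamp, name, activity in events if name == person]
    # Split into the 'asleep' and 'awake' relations
    asleep = [stamp for stamp, activity in mine if activity == "sleeping"]
    awake = [stamp for stamp, activity in mine if activity != "sleeping"]

    total = 0
    for start in asleep:
        # CROSS + FILTER + ORDER/LIMIT 1: smallest awake stamp after this start
        stop = min(w for w in awake if w > start)
        total += stop - start
    return total


sample = [
    (10001, "George", "sleeping"),
    (10006, "George", "walking"),
    (10100, "George", "sleeping"),
    (10200, "George", "showering"),
    (10500, "George", "running"),
]

print(total_sleep(sample))  # 105
```

    Unlike the CROSS in the Pig script, this loop never materializes every asleep/awake pair, but it still scans the awake list once per sleep start.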
  3. (25 points) List the disadvantages of structuring a program via service-oriented architecture (SOA). Then explain why these disadvantages are less important than the business needs that are satisfied through SOA.
    Answer: The disadvantages of SOA include:
    But these are not as important as the business needs of:
  4. (25 points) You are designing a stock-market analysis tool that takes as input events of the form:
    time stock change
    You wish to analyze these in a variety of ways, looking for subtle correlations.
    1. (10 points) Should you use a datastore with strong or eventual consistency? Why?
      Answer: This is debatable. In practice, most such tools use eventual consistency, because records are always appended to the end of a transaction log, so the worst that eventual consistency can cause is a slightly out-of-date log. Strong consistency costs more and offers no particular advantage here.
    2. (15 points) Should you use a Map/Reduce or Distributed-Hash-Table infrastructure to store the data? Why?
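      Answer: Map/Reduce. Correlations and trends cannot easily be computed via DHTs, which are designed to look up individual records by key. Of the two options, only Map/Reduce is powerful enough to compute correlations at all, because it can scan and combine the whole data set.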
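      To illustrate why a correlation fits the Map/Reduce pattern, here is a minimal in-memory Python sketch. It is not a Hadoop job, and every name in it (map_event, shuffle, reduce_time, correlation) is hypothetical. It assumes events are (time, stock, change) tuples and that two stocks are compared only at times when both reported a change. The map step keys each event by time, the shuffle groups events by key, and the reduce step emits partial sums that combine into a Pearson correlation.

```python
from collections import defaultdict
from math import sqrt


def map_event(event):
    """Map: key each (time, stock, change) event by its time."""
    time, stock, change = event
    yield time, (stock, change)


def shuffle(pairs):
    """Shuffle: group mapped values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_time(values, a, b):
    """Reduce: for one time, emit partial sums if both stocks reported."""
    by_stock = dict(values)
    if a in by_stock and b in by_stock:
        x, y = by_stock[a], by_stock[b]
        return (1, x, y, x * x, y * y, x * y)
    return None


def correlation(events, a, b):
    """Combine the partial sums into a Pearson correlation coefficient.

    Assumes both stocks vary over at least two shared times; otherwise
    the denominator is zero.
    """
    pairs = (kv for event in events for kv in map_event(event))
    n = sx = sy = sxx = syy = sxy = 0
    for values in shuffle(pairs).values():
        part = reduce_time(values, a, b)
        if part is not None:
            n += part[0]
            sx += part[1]
            sy += part[2]
            sxx += part[3]
            syy += part[4]
            sxy += part[5]
    cov = n * sxy - sx * sy
    var = (n * sxx - sx * sx) * (n * syy - sy * sy)
    return cov / sqrt(var)


events = [
    (1, "A", 1.0), (1, "B", 2.0),
    (2, "A", 2.0), (2, "B", 4.0),
    (3, "A", 3.0), (3, "B", 6.0),
]
r = correlation(events, "A", "B")
```

      Every record contributes to the result, so the whole data set has to be scanned. A DHT offers only per-key lookups, which is why it is a poor fit for this kind of query.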