Classroom Exercise 14

The Business Case for Clouds

Spring 2011

In class we have explored the business case for clouds as a way of exploiting economies of scale and reassigning risk. Let's explore that in more detail.

- In lecture, I claimed that the number of machines needed in the
cloud scales (roughly) linearly with the number of clients. What
determines how fast the number of machines has to grow to meet expectations?
*Answer:* There are several determining factors, including:
- the total amount of disk required in the core.
- response-time requirements on the edge.

- I claimed that the number of staff necessary to run a cloud grows
sub-linearly with the number of clients, i.e., it exhibits less than
linear growth. Why does that happen?
*Answer:* It doesn't take 1000 people to manage 1000 machines; in fact, about 5 people can do the job! More people are needed only when hardware failures become frequent enough that the existing staff cannot keep up with them. Thus one might need five more people for the next 1000 servers.

- You have 1000 employees each making an average of $60,000 a year
doing data entry. What is the approximate cost to you of an hour of downtime?
*Answer:* Cost of downtime = hours of downtime × average value of work lost per hour. As a rough calculation, each employee works 40 hours a week, so the hourly wage is $60,000 / (52 weeks/year) / (40 hours/week), or about $29/hour. With all 1000 employees idle, one hour of downtime therefore costs about 1000 × $29 = $29,000.

- At the peak sales hour of the day, about 1000 customers come to
your website. Of these, about 20% buy something, and the average purchase
is $20. What is the cost of an hour of downtime at the peak hour of the day?
*Answer:* The expected purchases in that hour = $20 × 20% × 1000 = $4000, so an hour of downtime at the peak hour costs about $4000 in lost sales.

- (Advanced) Suppose that I have a response-time SLA that requires
response within one second, and I rewrite the SLA so that
it is probabilistic rather than absolute, so that it says "99% of the time, the
response time should be under 1 second", and "the remaining 1% of the
time, the response time should be under 10 seconds". Does this change
power requirements for satisfying the SLA? Why?
*Answer:* Rewriting the SLA has a profound effect. Remember that the time it takes to swing a server into service determines how fast we must respond to rising demand, which forces us to over-provision. If we are allowed to miss the one-second target 1% of the time, we can leave servers powered down longer, because we no longer have to meet demand at every instant. The servers therefore consume less power, and we do not have to over-provision as much to meet demand.
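
The arithmetic behind the two downtime answers earlier in this exercise can be checked with a short Python sketch. The function names are hypothetical, and the sketch assumes a working year of 52 weeks at 40 hours per week, that every employee is idle for the whole outage, and that every sale missed during the outage is lost rather than deferred.

```python
# Rough downtime-cost estimates for the two scenarios in this exercise.
# Assumptions (for illustration only): 52 paid weeks of 40 hours each,
# all staff idle during the outage, and no lost sale is recovered later.

HOURS_PER_YEAR = 52 * 40  # 2080 working hours per year


def hourly_wage(annual_salary: float) -> float:
    """Convert an annual salary into an approximate hourly wage."""
    return annual_salary / HOURS_PER_YEAR


def staff_downtime_cost(employees: int, annual_salary: float,
                        hours_down: float = 1.0) -> float:
    """Wages paid to idle employees while the service is down."""
    return employees * hourly_wage(annual_salary) * hours_down


def sales_downtime_cost(visitors_per_hour: float, conversion_rate: float,
                        average_purchase: float,
                        hours_down: float = 1.0) -> float:
    """Expected revenue lost while the website is down."""
    return visitors_per_hour * conversion_rate * average_purchase * hours_down


if __name__ == "__main__":
    print(f"Hourly wage:          ${hourly_wage(60_000):.2f}")
    print(f"Idle-staff cost/hour: ${staff_downtime_cost(1000, 60_000):,.0f}")
    print(f"Lost sales/hour:      ${sales_downtime_cost(1000, 0.20, 20.0):,.0f}")
```

Without rounding the wage first, the idle-staff figure comes out slightly below the $29,000 quoted in the answer, because the answer rounds the hourly wage up to $29 before multiplying by 1000.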