### Comp111 Final Exam Dec 16, 2013 - open books and notes

Any and all printed and written materials are allowed, including books, course notes, and handwritten notes.

No electronic devices of any kind are allowed, including cellphones, calculators, pagers, laptops, etc. Please turn these off before beginning the exam.

Please do not discuss this examination with anyone until the answer key is posted.

Please show all work on these pages. Feel free to use the back of each page as needed. Make sure that your work for each problem is clearly marked, especially on the backs of pages. These pages will be graded and returned electronically, like exercises, and the results will be made available online.

Please note that point values have no correspondence with difficulty. We suggest that you solve the problems that you best understand first.

Please also note that the space I gave you to answer is the space I would need, and not necessarily the space you will need. There is no deduction for more verbose answers (though succinct and comprehensive answers are very much appreciated by your instructor!).

1. Suppose that we know that there are an average of 10 "dirty" blocks in the disk page cache, and that applications are writing an average of 500 blocks per second into the disk page cache.
1. (10 points) What is the average time for a disk write (from the disk page cache to physical disk) in seconds/block? Please show your work.

(This is an alternative way to measure disk write speed, but requires root privilege.)

Answer: By Little's law, L = λ W. L = 10 blocks in system, λ = 500 blocks per second (because written blocks are automatically marked "dirty"), so W = L/λ = 10/500 = 1/50 seconds time in system for each block. This is the time from entering the queue to being written onto the disk.

Many people pointed out that the raw disk write speed without the queue is 1/500 seconds per block, by the principle of balance.

Scoring: 10 if perfect, 8 if mathematical error or reported raw time to write without time in system. 5 if mis-applied Little's law.

2. (10 points) Can we compute block read speed via a similar computation? Why or why not?

Answer: It is easier to measure times for reads directly. There is no concept of queueing for read; reads are aggressive and occur "immediately". The page cache makes measurement more difficult by short-circuiting reads that have already been done, in a manner that is difficult to predict.

In order to do it, one might measure:

• The average number of uncached read requests per second = λ
• The average number of pending read requests = L.
Then again, W = L/λ. But both L and λ are more difficult to measure and/or estimate than just measuring read speed.

Scoring: 10 if accounted for difficulty of measuring. 8 if argument is superficial (e.g., "we don't have a read queue") or minor mis-statements, or characterized problem correctly without realizing that measurement was difficult. 5 if there were major mis-statements, e.g., claiming that independence and/or memorylessness are required for Little's law -- they aren't, and this is what makes Little's law so powerful a tool.

2. (10 points) Suppose that power fails on a journaled filesystem halfway through a rename request for a file (e.g., via the "mv" command). What is the resulting state of the file after filesystem recovery? Why?

Answer: As names are a property of the directory, a simple rename of a file (e.g., "mv foo bar") requires a single read of a directory block followed by a single write. If this is interrupted by a power failure, the block is not written to the journal and the filesystem is not marked consistent as a result of the change. Thus the filesystem is restored to the previous consistency watermark and the rename is not preserved.

Note that there is no conceivable way the file can be lost, because the rename occurs "in place" in the directory; a trivial rename requires one write. Note also that even if the rename is not atomic as one write, the consistency watermark will not be written until all writes are finished, so the intermediate changes will all be lost when restoring to the last consistent watermark. So, even in complex cases, e.g. "mv /foo/bar /cat/dog", the whole rename will be lost.

Note that mv only creates a new inode if the move is between filesystems. In the same filesystem, only one or (at most) two directories are changed. mv does not touch the contents of the files.

Scoring: 10 if perfect, 9 if typo, 8 if proper idea with minor issues, 7 if comments explain cp rather than mv, 5 if plausible but incorrect, 3 if answer exhibits major misunderstandings.

3. (10 points) One subtle detail that we have not covered in class for watermarked journals is what happens when multiple processes try to accomplish journalled writes at the same time. Can we watermark consistency for each process's writes separately, or must we determine watermarks based upon the total sequence of writes from all processes? Why?

Answer: This is extremely subtle. While file writes to the same file from different processes are atomic, there is no observable effect of concurrent writes to different files. Thus the operating system does not prevent concurrency in this case, to improve performance. This means that writes to different files can interleave. When this is allowed, naively writing watermarks for each update separately creates watermarks when the filesystem is in an inconsistent state. Thus the driver must only create watermarks when all files being written are in a consistent state, so coordination is required. Many students also pointed out that the filesystem driver does not know the identities of processes, which is correct, making marking "by process" impractical.

Another way to do this would be to serialize all writes, regardless of the file and process, but performance would suffer severely and there is no "phenomological" reason to restrict writing this much.

Scoring: 10 if realized that consistency must be coordinated; 8 if did not realize that multiple-file writes can interleave or applied the concept of atomic actions literally without considering performance. 5 if there were major misunderstandings.

4. (10 points) In a Redundant Array of Inexpensive Disks (RAID) array, is the filesystem driver aware that it is running on a RAID device? Why or why not?

Answer: No. A RAID array is a raw block device. Filesystems are built on top of raw devices. No filesystem accounts for the implementation of the raw device underlying it.

Scoring: 10 if correct, 0 if incorrect.

5. (10 points) One of the major pain points in managing a hardware RAID5 array is that one's disk choices for replacing failed disks are very limited. Often a hardware raid controller will only accept only disks from a specific manufacturer, and all disks must be the same size. Why?

Answer: Recall that a RAID 5 array is formed by accessing N disks in parallel, with one disk serving as a parity disk. Each N-1 blocks are written onto different disks, and the Nth block makes it possible to recover any one of the other blocks should its disk fail.

Thus it is important to make the disks the same size, because otherwise, excess disk storage is wasted and inaccessible. E.g., if one forms a RAID 5 array from four 1TB disks and one 2TB disk, there is no way the extra TB can be accessed.

The fact that they are required to be from the same manufacturer is more subtle. Hardware controllers are more limited than software controllers; one cannot simply "augment" their programs for new kinds of disks. There are slight differences in disk controllers from different manufacturers that we often mediate in software; hardware controllers have no such luxury. Thus we must live with the limits of the controller.

Scoring: 10 if perfect, 8 if missed the limitations due to hardware control,5 if discussion was too general, 0 if incorrect.

6. (10 points) When moving a filesystem to a new larger disk, it is a common trick to copy the filesystem bit-by-bit and then expand it to fit. What steps are involved in most efficiently expanding an EXT3 filesystem to utilize more raw disk space?

Answer: Recall that an EXT3 filesystem consists of a large number of block groups, each with its own superblock, descriptors, inodes, and data blocks. The simplest way to expand a filesystem is to copy it onto the larger disk and then initialize the remainder of the disk to contain new block groups of the same size. Then one has to update the superblocks for the new locations of the block groups. This is much faster than rewriting the files to a new filesystem.

Contrary to what several students wrote, "increasing the size of a block" is neither efficient nor feasible. Block size is based upon what the hardware best supports.

Scoring: 10 if perfect, 8 if minor issues or ommissions, 6 if reasonable but inefficient, 4 if unreasonable, 2 if proposed solution is correct but largely irrelevant.

7. (10 points) In the review session, we mentioned briefly how different the concept of "filesystem integrity" is from "integrity of a user's home directory." What user desires for home directory integrity are not provided by filesystem integrity? Is it possible to construct a filesystem that supplies those user needs? Why or why not?

Answer: First, the integrity of a filesystem is the property that -- within limits -- what you write is what you read back. This is equivalent with its consistency as a filesystem. There are many times, however, that what you write is not what you read back, especially during power failures, disk failures, and even cable failures. The reason for this is that it makes the overall filesystem operation more efficient. Specifically, lazy writes both streamline the process of updating the disk and lead to certain kinds of integrity failures at the user level, even though system-level integrity is preserved. In particular, watermarks ensure system-level integrity (in the sense that the filesystem remains "consistent" but not user-level integrityi (in the sense that anything you write to the filesystem is persistent. As discussed during the review, there are other ways to ensure that the user's concept of integrity is preserved, including syncing the filesystem via unterruptible power before a power failure. However, this is not a filesystem mechanism.

Note that "integrity" -- in this case -- is not a "security" issue; it is the fact that changes you make as a user are persistent -- or not.

This is a case in which one cannot have one's cake and eat it too. The reason that integrity is lost is that the alternative is writing agressively, which leads to improved user-level integrity at the cost of poor overall performancei (and potential loss of system integrity as well).

Scoring: 10 if perfect, 8 if minor issues, 6 if major issues, 4 if any redeeming value.

8. In distributed and multi-core operating systems, the concept of "atomic operations" is largely being replaced by the concept of "serialized sets of operations". A "serialized set of operations" has the property that operations in the set are invoked one by one in an order determined at run time by race conditions, so that no two are ever invoked at the same time. (Warning: this is very different than a "serializable" class in java, etc , which is one whose instances can be written to and read back from files!)
1. (10 points) Is the set of atomic operations that we have discussed in class also a serialized set of operations according to the above definition? Why or why not?

Answer: Yes, trivially. This satisfies the definition of what it means to be atomic.

Note: some students pointed out that I didn't mention anything about being "independent of external influence" in the concept of serialized operations. I gave them credit for this reasoning. The spirit of my question, however, was that serialized sets are independent of external influences.

Note: the concept of an atomic operation is independent of the implementation of that operation. The fact that atomic operations that we have studied stop the scheduler is not an essential attribute of an atomic operation; they could accomplish atomic operations a different way (e.g., by compulsory locking) and still be atomic.

(However, the converse is false; a serialized set need not consist of atomic operations; they can be uninterruptable only in the context of that set and possible to interrupt by operations outside the set.)

Scoring: 10 if correct, 5 if either answer or reasoning is incorrect (but not both), 0 if both answer and reasoning are incorrect.

2. (10 points) What is the advantage of thinking about atomic operations in this way? What kinds of situations that might arise in multi-core programming are better expressed in terms of "serialized sets of operations" rather than via our concept of "atomic" operations? Why?

Answer: Our model of atomic operations -- as discussed in class -- is naive compared to what really happens in a complex system. One very seldom has to limit concurrency, and one usually does this for behavioral reasons: to make the outcomes of actions more predictable. For example, the reason that "writes to the same file are atomic" is that this limits the resulting states of the file to reflect limited number of write schedules. Likewise, there is no compelling reason for writes to two different files to be serialized.

The concept of a "set of serialized operations" generalizes the concept of atomic operations to allow multiple sets of operations with mutual exclusion behavior that are not globally atomic, but only atomic relative to the rest of the set. For example, there are good technical reasons to ensure that no two cores try to call pthread_mutex_lock at the exact same time. Thus the set {pthread_mutex_lock, pthread_mutex_unlock} is a serialized set according to the above definitions. There is -- however -- no particular reason to limit I/O not to be concurrent with locking.

In general, "serialized sets of operations" allow us to model concurrency and what should be kept from happening concurrently much more accurately than our very course "atomic operations" model.

In the class, very few students realized that the issue is performance; many students claimed that serialized operations accomplish what atomic operations already accomplish (and mentioned nothing more).

Scoring: 10 if perfect, 8 if minor omissions, 6 if major misunderstandings, 4 if answer does not pertain to the question but has some value.

9. (10 points EXTRA CREDIT) The difference between a virtualization hypervisor and a regular operating system is that it has only two functions: (a) scheduling operating systems to run and (b) managing I/O. We have discussed how hypervisors manage I/O but not about scheduling. What are the major differences between the scheduling subsystem for processes in a regular operating system and the scheduler in a virtualization hypervisor? What optimization objectives should the hypervisor scheduler try to achieve? What strategies for scheduling are relevant?

Answer: The difference between a hypervisor scheduler and a program scheduler is that the hypervisor schedules operating systems to run. This means sharing CPU between operating systems in a manner that optimizes the function of the "guest" operating systems.

This is a bit subtle. The only thing that really works is some form of fair scheduling for response time. The operating system is a complex entity that engages in many ongoing processes that must remain responsive, including -- most importantly -- updating the time of day.

In this case, "fair" means that the operating system gets the cycles it needs to continue to run, including updating clocks, etc. One can't just "starve" an OS when it is not being used; this will affect the clock, among other things. Also, many users expect the two operating systems to both remain responsive. Many users set up environments with windows to both operating systems at the same time.

If one operating system is not being used at all, one can "hibernate" it, which means creating an image of all of memory that can be restored. This is not the same thing as running it.

All of the above observations are evidence that some form of scheduling that optimizes for fairness (either round robin or fair scheduling) must be used. As well, scheduling epochs must he longer than for the real OS, to allow context switching. In practice, the scheduling problem is too simple to merit "completely fair scheduling" (because there are never more than 8 OS's and the OS's neither die nor spawn) and simple adjustments to round-robin suffice.

Several misconceptions in answers are worth mentioning. A hypervisor does not have knowledge of the processes in each operating system, nor can that knowledge be used to improve scheduling. In like manner, the hypervisor does not have knowledge of the way processes are using devices; it sees the global usage of devices by the OS and nothing more. Also, the typical VM scheduler uses no virtual memory at all; all memory for all OSs is pre-mapped, so context switches are fast.

Scoring: 10 if perfect, 5 if defined the problem but did not answer it definitively, 0 if no redeeming value or there were errors or mis-statements. On this problem, errors were treated more severely; errors and mis-statements led to zero "extra credit".