### Comp111 Final Exam Monday Dec 15, 2014 - open books and notes

Any and all printed and written materials are allowed, including books, course notes, and handwritten notes.

No electronic devices of any kind are allowed, including cellphones, calculators, pagers, laptops, etc. Please turn these off before beginning the exam.

Please show all work on these pages. Feel free to use the back of each page as needed. Make sure that your work for each problem is clearly marked, especially on the backs of pages. These pages will be graded and returned electronically, like exercises, and the results will be made available online.

Please note that point values have no correspondence with difficulty. We suggest that you solve the problems that you best understand first.

Please also note that the space I gave you to answer is the space I would need, and not necessarily the space you will need. There is no deduction for more verbose answers (though succinct and comprehensive answers are very much appreciated by your instructor!).

1. Suppose the 5-minute load average of a Linux system is 2.0.
1. (10 points) Explain exactly what this measurement means.
Answer: The load average is the average number of processes in the run queue during the specified interval. This corresponds exactly to the number of processes in the "runnable" or "running" state in the 5-state model, and includes processes that are currently running at the time. This is computed by checking the state of the run queue at regular intervals.

Thus, the precise meaning is that, on average during the last 5 minutes, 2 processes were present in the run queue.
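The sampling idea can be sketched as follows (a simplified Python model; the actual kernel uses fixed-point arithmetic, and the names `update_load`, `interval`, and `window` are illustrative):

```python
import math

def update_load(old_load, nrunning, interval=5.0, window=300.0):
    # One sampling step of an exponentially damped load average: sample
    # the run-queue length every `interval` seconds, letting old samples
    # decay over a `window`-second (here 5-minute) horizon.
    decay = math.exp(-interval / window)
    return old_load * decay + nrunning * (1.0 - decay)

# With a steady run queue of 2 processes, the average converges to 2.0.
load = 0.0
for _ in range(1000):
    load = update_load(load, 2)
```

With a fluctuating run queue, the same update smooths the samples into the single number reported for each window.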

Scoring: +10 if acceptable, +8 if minor mis-statements, +5 if major issues.

Note particularly that many people defined load in the "Little's Law" sense, but not in the Linux sense. That was worth 1/2 credit because it is 1/2 of the story.

2. (10 points) Supposing that this load average is made up entirely of 30 people editing, and that each one types an average of 1 character per second, what is the average editing response time between pressing a character and seeing the result?
Answer: This is a bit tricky. There are 30 people editing, which corresponds to 30 processes potentially in the run queue. We know that on average, there are 2 processes runnable and/or running and the rest are (presumably) waiting for input. So it is reasonable to characterize the system as a queueing system in which editing events flow in and updated screens are the output. The fact that the load average is 2 means that there are, on average, two events in the system at a time; this corresponds to L (load) in Little's Law. We also know λ, which is 30 characters (editing events) per second (because editing events are presumably independent). What we want is the time in system W. By Little's Law, L = λW, so W = L / λ = (2 events) / (30 events/second) = 1/15 second.
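The arithmetic can be checked directly (a minimal sketch; `time_in_system` is an illustrative name):

```python
def time_in_system(L, lam):
    # Little's Law: L = lam * W, so the average time in system is W = L / lam.
    return L / lam

# 2 events in the system on average; 30 editing events arrive per second.
W = time_in_system(2.0, 30.0)   # 1/15 second, about 67 milliseconds
```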

Scoring: +10 if acceptable, +8 if minor mis-statements, +5 if major issues.

If you made a mistake in part (a) that carried over into part (b), I did not count off twice. Many people answered part (a) incorrectly and got part (b) correct because their misinterpretation in (a) did not affect the result in (b). I gave these people credit even though their reasoning was incorrect.

2. I skipped some details of implementing transactional filesystems. Let's fill in some of the blanks.
1. (10 points) I claim that it is relatively simple to defragment a transactional filesystem, compared to defragmenting a physical filesystem. Why?
Answer: The point of defragmenting a filesystem is to reduce the "internal" fragmentation created by deleting files. In a normal filesystem, defragmenting files requires interpreting the filesystem. One must rewrite file blocks -- one block at a time -- into free spaces in order to free up larger sections of disk.

First note that fragmentation in a transactional filesystem is fundamentally different from that in a regular filesystem. When transaction blocks are no longer needed, there are "gaps in a queue" rather than gaps in a filesystem. If we compact the queue by scanning it and writing over unused blocks to compact the whole structure, this is sufficient to defragment the disk.

Part of the reason that this kind of defragmentation is sufficient is that the files are -- on average -- written into the transaction queue in the correct order and using statistically contiguous sets of blocks. In essence, in cases other than random access, the transaction queue does not ever get significantly fragmented from the process of writing, though it does get fragmented from deleting transactions. In the unusual case of random-access files, fragmentation is irrelevant because we will statistically never read the file in sequential order, anyway.

Note also that one does not need to interpret the semantics of blocks in order to defragment the transaction queue that -- in a fully transactional filesystem -- is the whole disk. So, one can simply start at the beginning and eliminate holes created by deleting files or replacing their contents by copying used blocks into the holes in order: what I called "compacting the disk" in class. Then, if one updates the transaction hash for the new locations of blocks, all will be well. The filesystem does not even have to be interpreted for this to work.
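The compaction step can be sketched as follows (a toy Python model: the queue is a list of opaque blocks and `None` marks a hole, so no interpretation of the filesystem is needed):

```python
def compact(queue):
    # Copy used blocks downward over the holes, in order, and return the
    # compacted queue plus a relocation map (old index -> new index) with
    # which the transaction hash would then be updated.  Block contents
    # are opaque: the filesystem is never interpreted.
    compacted, relocation = [], {}
    for old, block in enumerate(queue):
        if block is not None:
            relocation[old] = len(compacted)
            compacted.append(block)
    return compacted, relocation
```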

Scoring: +10 if acceptable, +8 if minor mis-statements, +5 if major issues.

2. (10 points) Is the page cache described for EXT2-4 necessary in a fully transactional filesystem? Why or why not?
Answer: It is more crucial than ever for reading, though it is not strictly needed for writing.

In the case of reading, the page cache is used to reduce the number of times that a specific disk block is read physically from disk. In a transactional filesystem, reading a block is "a bit more expensive" than in a non-transactional filesystem, because one has to go through one extra layer of indirection. Thus it is more important than before to minimize reads, and the page cache is absolutely essential.

Since the transactional filesystem utilizes aggressive writing at all times, the page cache is not absolutely required for writing, though on occasion a block that has just been written will be read immediately thereafter. So it is a good idea to cache writes as well, based upon the principle of locality.

Scoring: +10 if acceptable, +8 if minor mis-statements, +5 if major issues. If you answered "no" in general and didn't mention reading, you got +5.

3. (10 points) Is fairness important for the disk scheduler? Why or why not?
Answer: This is a bit of a "red herring" question. Remember that the disk scheduler writes blocks from the journal and/or memory into the physical disk blocks that are indicated. There is no concept of "ownership" for these blocks, and no way to even define fairness. The scheduler's clear task is to maximize throughput, which is -- as we have seen -- completely the opposite of creating a situation of fairness.

So fairness is irrelevant to the disk scheduler.

Many people thought the disk scheduler affects reads. It only affects writes. Reads are always done aggressively on demand. Only writes are "scheduled." The disk scheduler is also completely absent and irrelevant for a fully transactional filesystem because writes to the journal are also done aggressively.

Another common misconception was that disk scheduling affects process efficiency. It does not. Once a block is written to the page cache, the process has no more dealings with it and does not wait for the block to be scheduled to disk. Thus the situation is worse than irrelevance; there is not even a plausible definition of what it would mean to be "fair". This is another reason that the page cache cannot be eliminated for any kind of filesystem (see answer to 2b).

Scoring: +10 if acceptable, +8 if minor mis-statements, +5 if you thought scheduling affects reads and answered accordingly. +3 if reasoning was more flawed than that, e.g., claiming that disk scheduling affects process scheduling.

4. Consider the act of deleting a regular file from a filesystem.
1. (10 points) List all changes that need to be made to the EXT2 filesystem in order to delete a file.
Answer: The filesystem driver needs to:
1. Remove the file from its directory (one block) (+2).
2. Decrease the access count for the file in its inode (one block) (+1).
3. If the access count is 0, then also:
• mark the blocks of the file as free (one update per file block) (+2).
• mark the file inode as free (one block update) (+2)
4. Update the super block for the number of blocks and inodes now free (one block). (+1)
One should also not do anything else (+2). Total is +10. Note in particular that deleting a file does not write any new information into the file's blocks or inode. These are left unchanged and marked as reusable. This is why it is possible to mine deleted files from a disk.
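The steps above can be sketched over toy in-memory structures (illustrative Python; none of these names are real EXT2 code):

```python
def delete_file(fs, directory, name):
    # 1. Remove the file's entry from its directory.
    ino = fs['dirs'][directory].pop(name)
    inode = fs['inodes'][ino]
    # 2. Decrease the link (access) count in the inode.
    inode['links'] -= 1
    # 3. If the count is now 0, mark the file's blocks and inode free ...
    if inode['links'] == 0:
        for b in inode['blocks']:
            fs['free_blocks'].add(b)
        fs['free_inodes'].add(ino)
        # 4. ... and update the superblock's free counts.
        fs['super']['nfree_blocks'] += len(inode['blocks'])
        fs['super']['nfree_inodes'] += 1
    # Note: block and inode contents are never overwritten, which is
    # why deleted files can be mined from a disk.
```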

Scoring: as listed above.

2. (10 points) One of the great mysteries of Linux for users is that to delete a file, one does not need to be able to write to it. In fact, one can delete files over which one has no privileges at all. Use the structure of the filesystem to explain why this is true, and describe the privileges one needs in order to delete a file.
Answer: Note that in the prior problem, the key act in deleting a file is to remove the directory entry corresponding to the file. This requires permission to modify a directory. This is -- in fact -- the only permission one needs, because the file itself is not changed by that act. In turn, the act of deleting a file completely is a side-effect of removing it from all directories that contain it.

Notably, as before, one does not write over the file's blocks or inode in order to delete it. This is why write permission to the file is not required.

Also, deleting the file does not require deleting all symlinks to the file; just all hard links. Dangling symlinks that point to nothing are allowed.

Scoring: +10 if acceptable; +8 if minor mis-statements; +5 if major misunderstandings present; +3 if any value.

5. (10 points) I claim that the buddy system for memory management is actually slower than other mechanisms if all memory chunks to be allocated are exactly the same size. Why? What other mechanism could one use instead?
Answer: One key issue here is that in the buddy system we are using a "step" to choose the list in which the desired element resides. If all elements are the same size, we need exactly one list and do not need to choose that list at runtime.

Another key issue is that to achieve a state in which all blocks are the same size n, the buddy system has to divide a page O(13 - ceiling(log2(n))) times to obtain a block size into which the requested size n fits. One page is 2^13 = 8192 bytes. The smaller n is, the more time that is needed to subdivide the page.

Instead, we can divide the whole allocated page -- e.g., 8192 bytes -- by an appropriate power of two and split it all at once into chunks of the appropriate size in a single subdivision step. Everyone in the class did this in O(8192/n) time by adding the newly created chunks to a free list (at the head).

However, there is a much more clever way to accomplish this. The splitting time can be made O(1) and independent of the requested size n of the memory chunk, because new pages allocated by sbrk are automatically zeroed (to avoid information bleed between processes), and thus any linked-list pointers that one posits in the structure of said memory are NULL at the outset. Thus, the most efficient way to allocate this "page of chunks" is via an array pointer that points to the lowest unallocated chunk and returns that chunk for each malloc, which avoids pre-initializing the chunks. A single free list need not be populated when the page is subdivided; chunks are added to it only as they are freed. No one at all pointed out these issues.

Thus, we have only one free list, very fast malloc and subdivision, and constant-time page subdivision, so that both kinds of overhead are minimized.
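A sketch of such an allocator (in Python, with a list standing in for the zeroed page; all names are illustrative):

```python
class FixedAllocator:
    # One zeroed page of same-size chunks.  `next_unused` is the bump
    # pointer to the lowest never-allocated chunk; `free_link` stands in
    # for the NULL-initialized link pointers stored inside freed chunks.
    def __init__(self, page_size=8192, chunk=32):
        self.nchunks = page_size // chunk
        self.next_unused = 0       # no pre-initialization of chunks needed
        self.free_head = None      # free list starts empty
        self.free_link = [None] * self.nchunks

    def malloc(self):
        if self.free_head is not None:        # recycle a freed chunk first
            c = self.free_head
            self.free_head = self.free_link[c]
            return c
        if self.next_unused < self.nchunks:   # otherwise bump: O(1)
            c = self.next_unused
            self.next_unused += 1
            return c
        return None                           # page exhausted

    def free(self, c):
        self.free_link[c] = self.free_head    # push onto the free list
        self.free_head = c
```

Both malloc and free are constant-time, and the free list comes into existence only as chunks are freed.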

There are several other refinements that can speed things up further, such as using arrays of structs and corresponding bit arrays to manage fixed-size structs, as an alternative to using linked lists of structs of varying sizes, each with a descriptor. Malloc-style chunk descriptors can be eliminated. One can organize each allocated memory page -- or group of pages, at one's discretion -- as a used-bit array plus an array of structs. This is similar to the inode allocation algorithm on a disk.

The net effect of all of these changes is a memory allocator that eliminates almost all of the space and time overhead necessary to implement the buddy system.

High-level programming languages including Scheme, Lisp, and ML organize memory into several areas, some of which contain same-size components, and utilize one of these variants to manage each of those areas.

Several people misinterpreted the problem and showed how to save space rather than time. A number of people did not provide any mechanism for freeing storage that is no longer required.

Some people thought that the existing "buddy system" malloc is not O(1), which it is, though the reason is subtle. If one has a page of 8192 = 2^13 bytes and a grain limit of 32 = 2^5 bytes per allocated chunk, then a page needs to be subdivided at most 13 - 5 = 8 times to get to the smallest chunk size. 8 is a constant, and 8 times is thus O(1). Many people's schemes required an O(p) traversal where p = page size. This is still O(1) because p is a constant, but with a much higher constant factor in the definition of O().
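The subdivision count can be checked directly (a small sketch; `buddy_splits` and its parameters are illustrative names):

```python
import math

def buddy_splits(n, page_bits=13, grain_bits=5):
    # Number of times the buddy system halves a 2**page_bits-byte page
    # (8192 bytes) to reach the smallest power-of-two chunk holding n
    # bytes, subject to a 2**grain_bits-byte (32-byte) grain limit.
    chunk_bits = max(grain_bits, math.ceil(math.log2(n)))
    return page_bits - chunk_bits
```

The result is bounded by 13 - 5 = 8 no matter how small n is, which is why the scheme is O(1).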

A number of people also forgot that alignment to the size of the largest primitive type is necessary, and that sbrk only allocates page-sized chunks due to hardware limitations of the memory mapper. These limits make it impossible to divide up 8192 bytes into chunks of exactly n bytes where n is the requested length; padding must be done in order to meet alignment requirements, and calling sbrk once per request does exactly the same thing as allocating a whole page at a time, but is less time-efficient.

Scoring: +10 if acceptable; +8 if minor mis-statements; +5 if major misunderstandings present; +3 if any value.

6. (10 points) In assignment 3, many people observed a curious behavior. The process being controlled seemed to write many thousands of lines before the parent process could read 100 lines and stop it. Explain this behavior from your understanding of the I/O subsystem. What actually happened?
Answer: In assignment 3, we captured the output of the child in a pipe that obeys bounded-buffer semantics. Output written to a pipe through the stdio library is block buffered by default; writes are not passed on to the reader until a whole block of output is present. What actually happened is that the child's writes do not reach the parent until the child's buffer contains one full block. Thus the parent only kills the child when it actually perceives that 100 lines have been read, while in fact the child has perhaps written over 1000 lines by that time (into one block).
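The effect can be reproduced in a few lines (a POSIX-only Python sketch; the 100-line count and buffer sizes are illustrative):

```python
import os

r, w = os.pipe()
writer = os.fdopen(w, 'w')      # fully (block) buffered: a pipe is not a tty
os.set_blocking(r, False)       # so the read below returns instead of hanging

for i in range(100):
    writer.write('line %d\n' % i)   # 100 short lines: well under one block

try:
    unflushed = os.read(r, 65536)   # nothing has been flushed yet ...
except BlockingIOError:
    unflushed = b''                 # ... so the reader sees no data at all

writer.flush()                      # force the block out
flushed = os.read(r, 65536)         # now all 100 lines arrive at once
```

Until the flush, the reader cannot see any of the 100 lines, no matter how fast it reads.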

One major misunderstanding was that many people identified speed of read as the problem. This was not the real problem, because the data is not released from the child to the parent until one block of output is present. No amount of speed in reading can compensate for data that is not yet present for the parent to read.

Also, a number of people thought that "lazy flushing" and/or "the dirty bit" were relevant. These are only relevant to disk paging, and we were writing to and reading from a pipe instead of a disk.

Scoring: +10 if acceptable; +8 if identified buffering as the cause but did not specifically mention the issue of block buffering; +5 if major misunderstandings present; +3 if any value.

7. (10 points) In the final lecture, I reminded you that one thing to avoid in an operating system is "thrashing", in which the operating system takes more time than it provides for useful work. Describe a few kinds of thrashing and how to avoid these behaviors.
Answer: All forms of thrashing involve too much need for some kind of physical resource. The most common form of thrashing is "swap heaven": doing so much swapping of process memory in and out that no useful work gets done. The second most common is filesystem thrashing, in which so many processes are trying to read files from the same filesystem that the reads interfere with one another. Write thrashing does not occur due to the existence of the disk scheduler, which is designed to prevent write thrashing!

But there are less common forms of thrashing, including:

1. Cache thrashing: memory intensive processes overwhelm the processor cache and compete for cache pages.
2. Network thrashing: network intensive processes overwhelm the network card by writing too much to the network. Thus, the OS has to queue outputs and block the processes until the network becomes available.
3. Scheduler thrashing: the run queue fills up with so many processes that "Completely Fair Scheduling" takes significant time away from computing.
4. Virtualization thrashing: two independent operating systems writing to the same disk (perhaps in different partitions) interfere with one another due to lack of coordination.
5. Context switch thrashing: I/O changes state so often that processes are constantly leaving the run queue in I/O wait.
6. Spinlocks -- while not defined particularly as a form of thrashing because they occur inside user processes -- can also lead to a form of thrashing. When applications utilize spinlocks (which are essentially busy-wait loops polling for a state change), these locks can take up most of the computation cycles.
In all of these cases but the last, thrashing occurs in the operating system as a result of unforeseen process interactions that -- in isolation -- are not thrashing.

The disk scheduler avoids disk write thrashing by scheduling writes but not reads. Other forms of thrashing are best avoided through process admission control: not executing processes until sufficient resources are available.

Many things are not thrashing. Deadlock and livelock are not thrashing; in the former case no resources are being consumed, while in the latter process cycles -- rather than OS cycles -- are being consumed. Completely fair, O(1), and round-robin scheduling do not thrash in isolation; it takes specific process behavior, e.g., intensive I/O, to cause them to thrash.

Scoring: +10 if gave two examples of thrashing and how to stop them. +8 if gave at least one real example, and perhaps gave another non-example. +5 if there was confusion in the examples. +3 if the answer convinced me that you do not understand what thrashing actually is.

8. (10 points EXTRA CREDIT) One of the worst hardware ideas of all time was the so-called multi-ported disk drive that had one disk and two interfaces to different hosts running different operating systems. Supposing that the multi-ported disk contains one EXT4 filesystem and that the two hosts run independent copies of Linux, what are the limitations of this technique, and why did this product fail to sell?
Answer: The main limitation is that only one host can be writing to the filesystem at a time, because writes from both hosts simultaneously have the potential to create inconsistent states. The reason is that there are two page caches and these can easily become inconsistent; there is no mechanism for synchronizing them after they are out of sync. This is called the "cache coherence problem": the three-entity system tends to drift out of a consistent state and data loss results when one system caches a write that the other fails to recognize and thus overwrites, e.g., in updating a directory. Thus it is simply not possible to mount one partition read-write on two uncoordinated operating systems and simultaneously utilize page caching.

Thus to use the multi-ported disk, page caching has to be turned off and severe performance degradation (> 10x slower) occurs.

Even if we mount the filesystem on one host read-only, there is still an inability to track changes in real time, because these changes start out in the memory of the writing host. Even if the changes are written to the journal when made, this means that there will be the need to always check the journal on the read-only mount even if the block is in the page cache. Thus reads are going to be comparatively slow on the host without write access.

But the real reason that the product did not sell is that OS virtualization accomplishes the same thing without confusion, by utilizing a common driver over both operating systems.