Comp111: Operating Systems
Classroom Exercise 15
File I/O
Fall 2017

Consider the following simple programs:
// program 1
#include <stdio.h>
int main(void) {
    FILE *f = fopen("foo", "w");
    fprintf(f, "hi!\n");
    fclose(f);
    return 0;
}
// program 2
#include <stdio.h>
int main(void) {
    FILE *g = fopen("foo", "a");
    fprintf(g, "ho!\n");
    fprintf(g, "yo!\n");
    fclose(g);
    return 0;
}

  1. Suppose that there is a shared kernel file descriptor for both open calls, and that formatted I/O to disk files is block-buffered, as in Linux. If we repeatedly execute the two programs asynchronously, how many different files "foo" can be produced? What are they, and how are they created? Hint:
    1. whenever it is called, fopen("foo", "w") implicitly erases the file contents and sets the file pointer to the beginning of the file.
    2. whenever it is called, fopen("foo", "a") does not erase file contents and sets the file pointer to the end of the file ("append mode").

    Answer: The key is that any number of copies of Program 1 and Program 2 can be executing, and they can start (and finish) in any order. Whenever Program 1 runs, the file is erased at fopen, and Program 1's output is posted as a single block when it exits (exit implicitly flushes and closes the stream, committing the block buffer). Whenever Program 2 runs, the file gains Program 2's output as a single block when it exits; Program 2 only ever appends. Thus the file can contain at most one instance of Program 1's output, interspersed anywhere in a stream of Program 2's outputs, each posted atomically. The possible file contents are perhaps best represented as a regular expression: (ho!\nyo!\n)*(hi!\n)?(ho!\nyo!\n)*, i.e., at most one "hi!\n" somewhere in a run of "ho!\nyo!\n" blocks.
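    A minimal sketch of the buffering behavior (assuming POSIX stat and default stdio full buffering for regular files): nothing reaches the kernel until fclose, which is why each program's output is posted as one atomic block.

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void) {
        FILE *f = fopen("foo", "w");
        fprintf(f, "hi!\n");          /* stays in the stdio block buffer */

        struct stat st;
        stat("foo", &st);
        printf("size before fclose: %ld\n", (long)st.st_size);  /* 0 */

        fclose(f);                    /* flush posts the block, then close */
        stat("foo", &st);
        printf("size after fclose:  %ld\n", (long)st.st_size);  /* 4 */
        return 0;
    }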
  2. Answer question (1) again, but instead suppose that there is a different kernel descriptor for each open call (unlike Linux).
    Answer: Program 1 still truncates the file when it opens, and it posts its output at its own file pointer (the beginning of the file) when it closes. Program 2 still appends relative to the end of the file as of when it opened, and still commits its block when it closes. The main difference is that some posts can be lost to overwrites, since Program 1's block can land on top of previously appended data. However, the result is exactly the same as for the prior problem.
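    The distinction between the two premises can be seen directly with POSIX calls (a minimal sketch: dup() stands in for the shared-descriptor hypothesis of question 1, while two open() calls model this question's premise):

    #include <stdio.h>
    #include <unistd.h>
    #include <fcntl.h>

    int main(void) {
        int a = open("foo", O_WRONLY | O_CREAT | O_TRUNC, 0644);
        int b = open("foo", O_WRONLY);  /* separate kernel descriptor: own offset */
        int c = dup(a);                 /* shared kernel descriptor: shares a's offset */

        write(a, "1234", 4);
        printf("b's offset: %ld\n", (long)lseek(b, 0, SEEK_CUR));  /* 0: independent */
        printf("c's offset: %ld\n", (long)lseek(c, 0, SEEK_CUR));  /* 4: shared */

        close(a); close(b); close(c);
        return 0;
    }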
  3. What is the answer to (1) if the file is line-buffered rather than block-buffered (e.g., by calling setlinebuf(f) just after fopen in both programs)?
    Answer: Line buffering changes when atomic writes are posted to the operating system, and thus allows more sequences to appear. There can still be at most one "hi!\n", but "ho!\n" and "yo!\n" can now appear interleaved in any order and any number of times, depending upon process scheduling. The first program still erases the contents of the file when it opens, though. Thus the regular expression for the output is (ho!\n|yo!\n)*(hi!\n)?(ho!\n|yo!\n)*: at most one "hi!\n" embedded in an arbitrary sequence of "ho!\n"s and "yo!\n"s.
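    A minimal sketch of the line-buffered behavior (setlinebuf is a BSD extension available on Linux; setvbuf(g, NULL, _IOLBF, 0) is the portable equivalent): each newline-terminated fprintf reaches the kernel immediately, so lines from concurrently running processes can interleave.

    #include <stdio.h>
    #include <sys/stat.h>

    int main(void) {
        FILE *g = fopen("foo", "a");
        setlinebuf(g);                /* flush at every newline */

        fprintf(g, "ho!\n");          /* posted to the kernel right away */
        struct stat st;
        stat("foo", &st);
        printf("size after one line: %ld\n",
               (long)st.st_size);     /* already 4, assuming "foo" started empty */

        fprintf(g, "yo!\n");          /* posted separately: another process's
                                         line could land between the two */
        fclose(g);
        return 0;
    }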
  4. What is the answer to (1) if the file is line-buffered rather than block-buffered, and there is a different kernel descriptor for each open call?
    Answer: This gives the same result as (3), for the same reasons. The first program still re-initializes the file when it opens, after which appends proceed regardless. Because the atomically written lines all have the same length (4 bytes), a "ho!\n" can be overwritten by a "yo!\n" and vice versa, but there is no situation in which half of a line is overwritten.
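    A minimal sketch (POSIX assumed) of why equal-length writes can only clobber whole lines; the pwrite at offset 0 models Program 1 writing through its own descriptor over previously appended data:

    #include <unistd.h>
    #include <fcntl.h>

    int main(void) {
        int fd = open("foo", O_RDWR | O_CREAT | O_TRUNC, 0644);
        write(fd, "ho!\nyo!\n", 8);   /* two appended 4-byte lines */
        pwrite(fd, "hi!\n", 4, 0);    /* a 4-byte overwrite at offset 0 */
        /* file is now "hi!\nyo!\n": one whole line replaced, none split */
        close(fd);
        return 0;
    }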
  5. (Advanced) What are the disadvantages of having one shared file descriptor for each file that is opened? What kinds of file updates are not supported?
    Answer: One might want the other semantics, e.g., so that the last close replaces the whole file rather than posting additional data. This can be used to implement "mailboxes" that allow a different kind of communication than pipes: a writing process can put something into a mailbox and a reading process can later pick it up, independent of the state of the writing process's read/write pointer. Mailboxes, unlike pipes, obey "message passing" semantics rather than "stream" semantics. This is useful in situations where the act of sending a message is a transaction.
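    A minimal sketch of the mailbox idea under per-open descriptors (the names deposit and pickup are illustrative, not a standard API): the writer replaces the whole file at close, and the reader picks up the last complete message regardless of the writer's state.

    #include <stdio.h>

    /* writer: deposit one complete message, committed at fclose */
    void deposit(const char *msg) {
        FILE *box = fopen("mailbox", "w");   /* replaces prior contents */
        fputs(msg, box);
        fclose(box);                         /* message posted as a unit */
    }

    /* reader: pick up the current message, independent of the writer */
    void pickup(char *buf, size_t n) {
        FILE *box = fopen("mailbox", "r");
        if (box == NULL) { buf[0] = '\0'; return; }
        size_t got = fread(buf, 1, n - 1, box);
        buf[got] = '\0';
        fclose(box);
    }

    int main(void) {
        char buf[64];
        deposit("job 42 done\n");
        pickup(buf, sizeof buf);
        printf("%s", buf);
        return 0;
    }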
  6. (Advanced) It is typical to store a database as a binary file of structures, e.g., literal copies of
     
    struct table { 
        int id; 
        char name[20]; 
        double salary; 
    }; 
    
    are written to predetermined offsets in the file. What issues of portability arise if structures are written literally to disk in this manner? Can I transfer the file to another machine and be assured of reading back the same structures? Why or why not?
    Answer: There are several things that vary according to architecture, including
    1. The size of an integer.
    2. The byte order of an integer (little-endian or big-endian).
    3. The representation and size of double precision (which, on a supercomputer, can be 128 bits rather than 64).
    All of these mean that special care must be taken to write structures in a portable fashion. Some languages handle this through what is known as "serialization". Java, for example, includes an interface Serializable that represents an object in a portable form. Another approach is to use "network order" (big-endian) representations for int, long, and short, via functions such as htonl and htons. This normalizes the representation of integers in the file so that any architecture can read them back. However, network order defines no representation for double-precision floating-point numbers.

    Practical implementations of structure serialization typically agree upon a nominal architecture in which to represent data, and then craft libraries that read and write that nominal format on each architecture in use. This is how database systems assure file portability.
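    A minimal sketch of the network-order approach (assuming a 4-byte int, a 20-byte name field, and IEEE 754 64-bit doubles whose byte order matches the machine's integer byte order, which holds on mainstream architectures): each record is written in a fixed 32-byte big-endian layout that any such machine can read back.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <arpa/inet.h>   /* htonl */

    struct table {
        int id;
        char name[20];
        double salary;
    };

    /* hypothetical helper: write one record in a fixed 32-byte layout */
    int write_record(FILE *f, const struct table *t) {
        uint32_t id_n = htonl((uint32_t)t->id);       /* 4 bytes, big-endian */
        if (fwrite(&id_n, 4, 1, f) != 1) return -1;
        if (fwrite(t->name, 1, 20, f) != 20) return -1;

        /* network order defines nothing for double: emit the (assumed)
           IEEE 754 bits most-significant byte first by hand */
        uint64_t bits;
        memcpy(&bits, &t->salary, 8);
        unsigned char buf[8];
        for (int i = 0; i < 8; i++)
            buf[i] = (unsigned char)(bits >> (56 - 8 * i));
        if (fwrite(buf, 1, 8, f) != 8) return -1;
        return 0;
    }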