11 September 2017: Appel, A Runtime System

We spent most of class sketching out how another language could be implemented on top of Appel’s run-time system.

No performance implications were discussed.

13 September 2017: More of Appel, A Runtime System

Control of the data layout

We agreed that the run-time system is in charge of data layout. However, one group pointed out that the run-time system’s support for shared closures in records (Figure 6 on page 363) and interior pointers into strings (Figure 7 on page 364, but really Figure 9 on page 367) is really there because the compiler needs it to be there. The cleverness, if there is any, lies in supporting shared closures, plus calls to known functions, with just a minor tweak to the simple “records and strings” model.

An alternative that would take data layout out of the run-time system's hands is a fixed data format imposed from outside, to which both compiler and run-time system must conform. Examples: supporting foreign calls, or compiling to an ASIC, DSP, or other special-purpose hardware.


Things the compiler knows:

Things the runtime knows:

Things the compiler conceals:

Things the run-time conceals:

Polymorphic equality


Assessment of the design

The interface seems reasonably narrow. You’d have a good chance of reusing the run-time system with a different compiler. Maybe not such a good chance of reusing the compiler with a different run-time system (although we know it has in fact been ported).

Nate pointed out that if you were to use the runtime with a language like Java, this generational collector might not perform well: the cost of maintaining the remembered set (pointers from old objects to new) might be too great. But generations themselves (which were invented for Lisp and perfected for Smalltalk) would clearly be OK, so you might not have to change much.
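
The cost Nate was worried about is the write barrier that maintains the remembered set. A hedged sketch, with hypothetical names (the one-bit `remembered` flag and the fixed-size set are illustrative, not Appel's design):

```c
#include <stdbool.h>
#include <stddef.h>

// Every store of a pointer into a possibly-old object must check
// generations and maybe record the stored-into object in the
// remembered set.  This per-store cost is what hurts a
// mutation-heavy language like Java.
typedef struct obj {
    bool is_old;               // which generation the object lives in
    bool remembered;           // already in the remembered set?
    struct obj *field;         // the single pointer field, for brevity
} Obj;

enum { REMEMBERED_MAX = 1024 };
Obj *remembered_set[REMEMBERED_MAX];
size_t remembered_count;

// The write barrier: run on every pointer store, not just old-to-new ones.
void write_field(Obj *dst, Obj *src) {
    dst->field = src;
    if (dst->is_old && src != NULL && !src->is_old && !dst->remembered) {
        dst->remembered = true;                  // old object now points
        remembered_set[remembered_count++] = dst; // at a new one: record it
    }
}
```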

Known and unknown calls

How do “known” and “unknown” calls relate to Appel’s paper?


18 September 2017: Simple generational garbage collection and fast allocation

Plenty of languages initialize records at allocation time, but relatively few languages (ML, Erlang) mutate rarely. If we read Appel carelessly, these are the primary languages for which generational collection is interesting. But in practice, there are plenty of languages that mutate after allocation (Java, Smalltalk) in which generational collection is used. Perhaps the value of generational collection lies in the generational hypothesis and not just in rarely mutated records.

It’s worth learning where Haskell sits in the initialized/mutated matrix.

Regarding the specification of forward:

20 September 2017: SML/NJ wrapup

(Sorry, no notes.)

25 September 2017: Stack walking

Problem: given current (shadow) program counter and linkage registers (stack pointer, etc.), compute the same information for the next older frame.

AMD64 calling convention

Linkage registers

The standard C calling convention uses two linkage registers: a stack pointer and a frame pointer. Recovery uses a combination of memory and arithmetic.

Stack pointer: %rsp
Frame pointer: %rbp

Walking code (linkage-register edition)

sp<n+1> := fp<n> + 16     # current fp is old sp - 16
pc<n+1> := *(fp<n> + 8)   # old pc is saved on stack as return addr
fp<n+1> := *(fp<n>)       # old fp is also saved on stack
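
The walking code can be exercised on a simulated stack, which keeps it deterministic; the array and the frame contents below are hypothetical. The array plays the role of memory and an index plays the role of an address, with one slot standing in for 8 bytes, so `fp + 16` becomes `fp + 2` and `fp + 8` becomes `fp + 1`:

```c
#include <stdint.h>

// Frame layout at index fp: stack[fp] = caller's saved fp,
// stack[fp+1] = return address.  A saved fp of 0 marks the oldest
// frame.  Records each recovered pc; returns the number of frames.
int walk(const uint64_t *stack, uint64_t fp, uint64_t *pcs, int max) {
    int n = 0;
    while (fp != 0 && n < max) {
        pcs[n++] = stack[fp + 1];   // pc<n+1> := *(fp<n> + 8)
        fp = stack[fp];             // fp<n+1> := *(fp<n>)
        // sp<n+1> would be the old fp + 2 (i.e., fp + 16 in bytes)
    }
    return n;
}
```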

Callee-saves registers

Recovering callee-saves registers is hopeless without help from the compiler. In the standard, this help comes in the form of DWARF, a form of “debugging symbols.” A return address defines a call site, and by looking up the call site in a table, one can find the DWARF data, which says which callee-saves registers have been spilled and at what locations they are spilled.
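
The table lookup can be sketched as follows. The table format here is hypothetical (a real unwinder decodes DWARF CFI rather than reading a hand-built array); only the shape of the lookup is the point:

```c
#include <stdint.h>
#include <stddef.h>

// A sorted table maps ranges of return addresses to the frame offsets
// at which callee-saves registers were spilled; -1 means "not spilled".
struct unwind_entry {
    uintptr_t lo, hi;   // call sites in [lo, hi)
    int rbx_offset;     // spill slot of %rbx relative to the frame
    int r12_offset;     // spill slot of %r12 relative to the frame
};

static const struct unwind_entry table[] = {   // example data, invented
    { 0x1000, 0x1080,  8, -1 },
    { 0x1080, 0x1200, -1, 16 },
};

// Binary search for the entry covering a return address.
const struct unwind_entry *lookup(uintptr_t pc) {
    int lo = 0, hi = (int)(sizeof table / sizeof table[0]);
    while (lo < hi) {
        int mid = (lo + hi) / 2;
        if (pc < table[mid].lo)       hi = mid;
        else if (pc >= table[mid].hi) lo = mid + 1;
        else return &table[mid];
    }
    return NULL;   // pc is not a call site we know about
}
```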

Commentary on the calling convention

The calling convention resembles the i386 calling convention, which in turn is descended from microcontrollers without general-purpose registers. Built into the DNA of this processor family is the idea that during execution of a procedure, the stack pointer moves. If you have been involved with COMP 40 or have stared at suitable assembly code, you’ll see PUSH and POP instructions. The introduction of a stable, non-moving register—the frame pointer—makes it much easier to find the stack frame and to address local variables without worrying about where the stack pointer is.

In point of fact, this is a bad choice for a register-poor architecture like i386. Advanced languages avoid the C calling convention in favor of something that uses just one pointer (the stack pointer) to keep track of the current activation. The choice is excusable if we consider the cost of the compiler—these conventions date from the days when your compiler had to fit in 64KB of RAM. So the compiler could not be too smart.

The stack boundary

Every calling convention needs to specify where on the stack the rights of the compiler end—this is where the rights of the run-time system (and the kernel) begin. This space is often used for temporary, thread-local storage, e.g., when the processor takes an interrupt.

In a typical calling convention, the stack grows downward toward lower addresses, and every location below the stack pointer is volatile: from the compiler’s point of view, these locations are subject to change without notice, which may originate in the run-time system or in the OS. But in the AMD64 calling convention, the compiler’s rights extend 128 bytes below the stack pointer—the run-time system and OS must preserve the contents of this “red zone.” When code is executing between call sites, the compiler has efficient access to this area by using a signed, negative, 8-bit offset from the stack pointer. (But at a call site, every location from the stack pointer down is handed to the callee.)

MIPS calling convention

Linkage registers

The standard C calling convention uses two linkage registers: a stack pointer and a return-address register. The return-address register, r31, is nonvolatile: conventional code is required to return by branching indirect through r31. A caller’s stack pointer is recovered via arithmetic; the return address is treated in the same way as any other callee-saves register.

The frame size is a function of the program counter.

sp<n+1> := sp<n> + framesize(pc<n>)  
pc<n+1> := r31

Recovering the frame size

A procedure’s stack frame, if any, must be allocated in a single instruction (subtract from stack pointer) located in the first basic block of a function, which comprises the sequence of instructions between the procedure’s entry point and the first branch instruction. The size is computed by decoding the instruction or instructions.
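
The decoding step can be sketched for the common case of a single `addiu $sp, $sp, -framesize` instruction. The MIPS I encoding used here is real (opcode 9 = addiu, register 29 = $sp, 16-bit sign-extended immediate), but treating one instruction as the whole story is a simplification:

```c
#include <stdint.h>

// Recover the frame size from the stack-adjusting instruction word.
// Returns 0 if the instruction is not "addiu $sp, $sp, <negative imm>".
unsigned framesize(uint32_t insn) {
    uint32_t op = insn >> 26;                 // bits 31..26: opcode
    uint32_t rs = (insn >> 21) & 31;          // bits 25..21: source reg
    uint32_t rt = (insn >> 16) & 31;          // bits 20..16: dest reg
    int16_t  imm = (int16_t)(insn & 0xffff);  // sign-extended immediate
    if (op != 9 || rs != 29 || rt != 29 || imm >= 0)
        return 0;                             // not a frame allocation
    return (unsigned)(-imm);                  // stack grows downward
}
```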

Callee-saves registers

Callee-saves registers are also required to be saved in the first basic block. Consequence: if there is any path through the routine that has to spill a callee-saves register, that register has to be spilled on every path, precluding the idea of having a “fast path” without memory traffic. Programmers (and especially compiler writers) would be aware of such requirements, and so when trying to create a fast path would likely protect the slow path behind a call.

Recovering information from the program counter

The convention is designed such that all information is recoverable at run time by scanning the instruction stream:

Decoding the instruction stream is practical for two reasons:

By contrast, decoding AMD64 is a nightmare—the legacy instruction format is hell, and I believe that it is not possible to scan backwards even in principle, because the coding of preceding instructions could be ambiguous.

Commentary on the hardware

The MIPS chip was one of the very first commercial RISC chips. It was trying to take market share from established minicomputers like VAX. Hardware was at a premium, and instruction-level parallelism was achieved by pipelining. The “branch-delay slot” allowed for execution of the next instruction in the pipeline before taking the branch—there was no “interlock” to suppress that execution. (MIPS stands for “Microprocessor without interlocked pipeline stages”.)

The MIPS people went to great lengths to maximize the value of their upstart hardware. One of the many cute tricks was that page tables were managed almost entirely in software—the only virtual-memory hardware was the so-called “translation lookaside buffer (TLB),” which mapped only 64 virtual pages to physical addresses. A miss in the TLB resulted in a trap to the OS, which would look at the software page tables and update the TLB with a new association.

Commentary on the calling convention

The big shift here is that during execution of a procedure, the stack pointer doesn’t move. You can dispense with a frame pointer, but your compiler has to be smart enough to look at a whole procedure and allocate all its stack space (including space for outgoing arguments) in one go. A 1985 personal workstation might have as much as 1MB of RAM, so it was practical to demand a more ambitious compiler.

The register conventions are carefully crafted to allow some caller-saves, some callee-saves, a couple for the kernel, and one for the assembler! (The register for the assembler enabled MIPS to craft an assembler that “looked like” it could work with 32-bit immediate values, even though those values had to be split across two instructions. This trick helped with adoption because although the RISC architecture required new assembly-language programming techniques, assembly-language programmers didn’t have to learn them: they could use the old, familiar techniques, and the assembler would compensate.)
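
The assembler's trick can be sketched concretely: a 32-bit constant load that "looks like" one instruction is emitted as `lui` (load upper immediate) followed by `ori`. The encodings below are the real MIPS I ones; the register number and example constant are arbitrary:

```c
#include <stdint.h>

uint32_t lui(unsigned rt, uint16_t imm) {       // rt := imm << 16
    return (0x0Fu << 26) | (rt << 16) | imm;
}

uint32_t ori(unsigned rt, unsigned rs, uint16_t imm) {  // rt := rs | imm
    return (0x0Du << 26) | (rs << 21) | (rt << 16) | imm;
}

// What the assembler emits for the pseudo-instruction "li rt, k":
// two real instructions that together build the 32-bit constant.
void li(unsigned rt, uint32_t k, uint32_t insns[2]) {
    insns[0] = lui(rt, (uint16_t)(k >> 16));         // top 16 bits
    insns[1] = ori(rt, rt, (uint16_t)(k & 0xffff));  // bottom 16 bits
}
```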

2 October 2017


4 October 2017

Lua resources:

Intermediate states of a channel

A channel (called a “medium” by Cormack) has two states: full or empty. (It’s a Haskell MVar.) If we want to look at intermediate states, we can consider that,

talk   = P; Write; V
listen = P; Read;  V

We can describe the state in a more refined way by extending “full/empty” state with an additional component having three values:

The channel is initially empty/available, and one can easily draw the state diagram based on successful P and V operations—the diagram is a cycle and has exactly the six representable states.

It is also possible to draw a state diagram using the state of the underlying semaphores in the implementation, but this gets more complicated.
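
One plausible implementation of `talk` and `listen` uses two semaphores, one counting empty slots and one counting full slots—a hedged sketch, not necessarily Cormack's construction, using POSIX semaphores:

```c
#include <semaphore.h>

// A one-slot channel ("medium"): 'empty' starts at 1, 'full' at 0.
typedef struct {
    sem_t empty, full;
    int slot;                     // the single data slot
} Channel;

void channel_init(Channel *c) {
    sem_init(&c->empty, 0, 1);    // channel starts empty
    sem_init(&c->full, 0, 0);
}

void talk(Channel *c, int v) {    // talk = P(empty); Write; V(full)
    sem_wait(&c->empty);
    c->slot = v;
    sem_post(&c->full);
}

int listen(Channel *c) {          // listen = P(full); Read; V(empty)
    sem_wait(&c->full);
    int v = c->slot;
    sem_post(&c->empty);
    return v;
}
```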

Semaphores using compare-and-swap

void p(unsigned *sem) {
restart: ;
    unsigned n = *sem;
    unsigned final;
    if (isinteger(n) && n > 0) {
        final = n - 1;            // positive count: just decrement it
    } else {
        // thread-local variable 'self' points to my process descriptor;
        // push it onto the list of waiters stored in the semaphore word
        Cons cons = &self->cons_cell;
        cons->car = self;
        cons->cdr = n;
        final = (unsigned) cons;
    }
    atomic if (*sem == n) *sem = final;  // compare and swap
    else goto restart;
}
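
The "atomic if" above is what hardware compare-and-swap provides. A minimal C11 sketch of just the fast path—simplifying the tagged count-or-waiter-list word down to a plain count, so the blocking case is left out:

```c
#include <stdatomic.h>
#include <stdbool.h>

// atomic_compare_exchange_weak plays the role of the "atomic if":
// if *sem still equals n, store n-1 and succeed; otherwise n is
// reloaded with the current value and we retry (the "goto restart").
bool try_p(atomic_uint *sem) {
    unsigned n = atomic_load(sem);
    while (n > 0) {
        if (atomic_compare_exchange_weak(sem, &n, n - 1))
            return true;   // decremented: P succeeded without blocking
    }
    return false;          // count is zero: caller must enqueue itself
}
```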