We spent most of class sketching out how another language could be implemented on top of Appel’s run-time system.
Erlang seemed to be mostly a good fit. There were questions about atoms/symbols (unique pointer? Or unboxed index into table?) and about how you would represent a tuple (pair) containing two closures (single record or three records).
It was not entirely obvious how you would provide the run-time type information needed by primitive operations like `+`.
Java/C# seemed to be an OK fit. Some primitive types seemed straightforward, but most of the action is in user-defined objects. We didn’t quite complete a sketch of exactly how objects would be represented or how method dispatch would work.
There was some discussion of “nullable” primitives. Perhaps they are represented like a Haskell `Maybe` type: null is unboxed, and any non-null value is boxed?
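One way to picture the boxed/unboxed split just described is the following sketch. All the names here are invented for illustration: the null case is the unboxed value 0 (a null pointer), and a non-null value is a pointer to a one-word heap box (which a real runtime would give a header word).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical representation of a nullable machine integer:
   NULL means "no value"; anything else points to a one-word box. */
typedef long *nullable_int;

static nullable_int make_null(void) { return NULL; }

static nullable_int make_value(long v) {
    long *box = malloc(sizeof *box);  /* a real runtime would add a header word */
    *box = v;
    return box;
}

static int is_null(nullable_int x) { return x == NULL; }

static long get_value(nullable_int x) { return *x; }  /* caller must check is_null first */
```

The test `is_null` is a single pointer comparison, which is the payoff of leaving the null case unboxed.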
We’d hoped to analyze Python and JavaScript, but we didn’t get there.
Nate’s group turned up a really interesting question about mutable strings: suppose that your language has a mutable string abstraction, but that mutation is relatively rare, whereas parsing/matching on strings and extracting substrings is common. You’d like to have an implementation that allows multiple strings to share memory, copy on write. This is the sort of question that, by the end of the term, we ought to be able to answer.
No performance implications were discussed.
We agreed that the run-time system is in charge of data layout. However, one group pointed out that the run-time system’s support for shared closures in records (Figure 6 on page 363) and interior pointers into strings (Figure 7 on page 364, but really Figure 9 on page 367) is really there because the compiler needs it to be there. The cleverness, if there is any, lies in supporting shared closures, plus calls to known functions, with just a minor tweak to the simple “records and strings” model.
An alternative that would put the compiler in charge of data layout would be one where there was a fixed data format imposed from outside, that both compiler and runtime might have to conform to. Examples: supporting foreign calls, compiling to ASIC or DSP or other special-purpose hardware.
Things the compiler knows:
The compiler knows all the data layout imposed by the runtime, including the meaning of metadata (header words). For example, if it allocates a two-word object, it needs to be able to create a header word with a descriptor saying “I’m a record and I’m two words long.”
The runtime system can move objects.
The compiler knows how to allocate records (and strings?).
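The header-word idea above can be made concrete with one plausible (assumed, not Appel's actual) encoding: the low bits tag the object kind, and the remaining bits give the length in words.

```c
#include <assert.h>
#include <stdint.h>

/* An assumed bit layout for header words: 2 tag bits for the kind,
   the rest for the length in words. */
enum kind { KIND_RECORD = 0, KIND_ARRAY = 1, KIND_STRING = 2, KIND_BYTEARRAY = 3 };
#define KIND_BITS 2

static uintptr_t make_header(enum kind k, uintptr_t nwords) {
    return (nwords << KIND_BITS) | (uintptr_t)k;
}

static enum kind header_kind(uintptr_t h) {
    return (enum kind)(h & ((1u << KIND_BITS) - 1));
}

static uintptr_t header_words(uintptr_t h) { return h >> KIND_BITS; }
```

Under this encoding, “I’m a record and I’m two words long” is `make_header(KIND_RECORD, 2)`, and the collector can recover both facts by masking and shifting.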
Things the runtime knows:
Parts of the ML calling convention: enough so that the hand-written assembly primitives can be called using the same native calling convention (for unknown functions) as compiled functions. The runtime may have to know only how to implement procedures whose arguments are simple scalars.
Enough to be able to invoke an exception continuation (a bit like a call) when an interrupt occurs.
The runtime knows how to allocate arrays, bytearrays, and strings.
Note:
Object | Mutable | May contain pointers |
---|---|---|
Record | No | Yes |
Array | Yes | Yes |
String | No | No |
Bytearray | Yes | No |
Things the compiler conceals:
Everything about the source-language type system and data.
The optimization used for `list`, `option`, and other simple algebraic data types?
Things the run-time conceals:
TO BE FILLED IN
The interface seems reasonably narrow. You’d have a good chance of reusing the run-time system with a different compiler. Maybe not such a good chance of reusing the compiler with a different run-time system (although we know it has in fact been ported).
Nate pointed out that if you were to use the runtime with a language like Java, the generational collector might not perform well because the cost of maintaining the remembered set (pointers from old objects to new) might be too great. But generations (which were invented for Lisp and perfected for Smalltalk) would clearly be OK, so you might not have to change much.
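The remembered-set cost Nate is worried about comes from the write barrier, which a minimal sketch (all names invented) makes visible: every store of a pointer into an old-generation object records the mutated slot, so a minor collection can find old-to-young pointers without scanning the whole old generation.

```c
#include <assert.h>
#include <stddef.h>

#define REMSET_MAX 1024
static void **remset[REMSET_MAX];      /* slots in old objects that were mutated */
static size_t remset_count = 0;

static void *old_space[32];            /* stand-in for the old generation */

static int in_old_generation(void *p) {
    return p >= (void *)old_space && p < (void *)(old_space + 32);
}

/* Every pointer store in the mutator goes through this barrier;
   in a mutation-heavy language like Java, that is the cost in question. */
static void write_barrier(void **slot, void *new_value) {
    if (in_old_generation((void *)slot) && remset_count < REMSET_MAX)
        remset[remset_count++] = slot;
    *slot = new_value;
}
```

Stores to young (nursery or stack) slots pay only the test; stores into old objects pay the test plus the remembered-set insertion, which is why mutation frequency matters.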
How do “known” and “unknown” calls relate to Appel’s paper?
The business about PC-relative branches is a primitive form of a known call—the idea is to optimize a call to a known function by turning it into a relative branch, rather than the full “load address into a register and branch indirect through that register.”
It is also possible to optimize known calls by using a more ambitious calling convention; for example, in making a known call to a function whose argument is a pair, you might choose to pass the elements of the pair in two registers, rather than pass a pointer to a pair allocated on the heap. This optimization might even make it possible to avoid ever allocating the pair on the heap. I don’t know whether SML/NJ did such optimizations in 1990, but I bet it does now.
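The pair-flattening idea can be illustrated in C (this is an illustration of the idea, not SML/NJ's actual code): the general entry point takes a pointer to a heap pair, while a specialized entry point for known calls takes the pair's components directly, so at known call sites the pair is never allocated.

```c
#include <assert.h>

struct pair { long fst, snd; };

/* Specialized calling convention for known calls:
   the pair's components arrive in registers (here, as two parameters). */
static long add_known(long x, long y) {
    return x + y;
}

/* General convention for unknown calls: one pointer to a heap pair.
   The generic entry just unpacks and forwards to the specialized one. */
static long add_unknown(struct pair *p) {
    return add_known(p->fst, p->snd);
}
```

A compiler that sees the call target statically can call `add_known` with two scalars and skip the allocation entirely; only calls through an unknown function value need the boxed pair.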
TO BE FILLED IN
Plenty of languages initialize records at allocation time, but relatively few languages (ML, Erlang) mutate rarely. If we read Appel carelessly, these are the primary languages for which generational collection is interesting. But in practice, there are plenty of languages that mutate after allocation (Java, Smalltalk), in which generational collection is used. Perhaps the value of generational collection lies in the generational hypothesis and not just in rarely mutated records.
It’s worth learning where Haskell sits in the initialized/mutated matrix.
Regarding the specification of `forward`:
If we take it seriously, we need to talk about the pre-collection heap and the post-collection heap. The post-collection heap is isomorphic to a subgraph of the pre-collection heap. (And the isomorphism maps some objects to themselves—which ones?) If we write the isomorphism as ∼, we could say that when R′ = `forward`(R), then R ∼ R′.
The rest is (important) detail.
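To make the detail concrete, here is a sketch of `forward` for a copying collector. The representation is assumed, not Appel's: each object is a header word followed by fields, the header encodes the field count, and a forwarded object has its header overwritten by a forwarding pointer into to-space, distinguished by a tag bit.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef unsigned long word;

#define FORWARDED 1                /* tag bit marking an installed forwarding pointer */

static word to_space[1024];        /* toy to-space */
static size_t to_next = 0;

static word *forward(word *obj) {
    if (obj[0] & FORWARDED)        /* already copied: follow the forwarding pointer */
        return (word *)(obj[0] & ~(word)FORWARDED);
    size_t nwords = 1 + (obj[0] >> 2);         /* assumed layout: header >> 2 = field count */
    word *copy = &to_space[to_next];
    memcpy(copy, obj, nwords * sizeof(word));  /* copy header + fields to to-space */
    to_next += nwords;
    obj[0] = (word)copy | FORWARDED;           /* leave forwarding pointer behind */
    return copy;
}
```

The forwarding pointer is what makes the result a function: forwarding the same object twice yields the same copy, which is exactly what the isomorphism ∼ requires (shared objects stay shared after collection).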
(Sorry, no notes.)
Problem: given current (shadow) program counter and linkage registers (stack pointer, etc.), compute the same information for the next older frame.
The standard C calling convention uses two linkage registers: a stack pointer and a frame pointer. Recovery uses a combination of memory and arithmetic.
Stack pointer | Frame pointer |
---|---|
%rsp | %rbp |
Walking code (linkage-register edition)
sp<n+1> := fp<n> + 16 # current fp is old sp - 16
pc<n+1> := *(fp<n> + 8) # old pc is saved on stack as return addr
fp<n+1> := *(fp<n>) # old fp is also saved on stack
Recovering callee-saves registers is hopeless without help from the compiler. In the standard, this help comes in the form of DWARF, a form of “debugging symbols.” A return address defines a call site, and by looking up the call site in a table, one can find the DWARF data, which says which callee-saves registers have been spilled and at what locations they are spilled.
The calling convention resembles the i386 calling convention, which in turn is descended from microcontrollers without general-purpose registers. Built into the DNA of this processor family is the idea that during execution of a procedure, the stack pointer moves. If you have been involved with COMP 40 or have stared at suitable assembly code, you’ll see PUSH and POP instructions. The introduction of a stable, non-moving register—the frame pointer—makes it much easier to find the stack frame and to address local variables without worrying about where the stack pointer is.
In point of fact, this is a bad choice for a register-poor architecture like i386. Advanced languages avoid the C calling convention in favor of something that uses just one pointer (the stack pointer) to keep track of the current activation. The choice is excusable if we consider the cost of the compiler—these conventions date from the days when your compiler had to fit in 64KB of RAM. So the compiler could not be too smart.
Every calling convention needs to specify where on the stack the rights of the compiler end—this is where the rights of the run-time system (and the kernel) begin. This space is often used for temporary, thread-local storage, e.g., when the processor takes an interrupt.
In a typical calling convention, the stack grows downward toward lower addresses, and every location below the stack pointer is volatile: from the compiler’s point of view, these locations are subject to change without notice, which may originate in the run-time system or in the OS. But in the AMD64 calling convention, the compiler’s rights extend 128 bytes below the stack pointer—the run-time system and OS must preserve the contents of this “red zone.” When code is executing between call sites, the compiler has efficient access to this area by using a signed, negative, 8-bit offset from the stack pointer. (But at a call site, every location from the stack pointer down is handed to the callee.)
The MIPS C calling convention uses two linkage registers: a stack pointer and a return-address register. The return-address register, `r31`, is nonvolatile: conventional code is required to return by branching indirect through `r31`. A caller’s stack pointer is recovered via arithmetic; the return address is treated in the same way as any other callee-saves register.
The frame size is a function of the program counter.
sp<n+1> := sp<n> + framesize(pc<n>)
pc<n+1> := r31
A procedure’s stack frame, if any, must be allocated in a single instruction (subtract from stack pointer) located in the first basic block of a function, which comprises the sequence of instructions between the procedure’s entry point and the first branch instruction. The size is computed by decoding the instruction or instructions.
Callee-saves registers are also required to be saved in the first basic block. Consequence: if there is any path through the routine that has to spill a callee-saves register, that register has to be spilled on every path, precluding the idea of having a “fast path” without memory traffic. Programmers (and especially compiler writers) would be aware of such requirements, and so when trying to create a fast path they would likely protect the slow path behind a call.
The convention is designed such that all information is recoverable at run time by scanning the instruction stream:
Find the entry point by scanning backwards until you find a return instruction. The entry point follows (preceded by a branch-delay slot). The linker ensures that even the first procedure in the address space is preceded by a return instruction.
Scan forwards and decode each instruction, identifying each instruction that adjusts the stack pointer or spills a register to the stack. These instructions give the size of the stack frame and the locations of the callee-saves registers.
Decoding the instruction stream is practical for two reasons:
Every instruction is exactly 32 bits wide, so scanning backwards is trivial.
The instruction format itself is remarkably clean, so it is easy to identify the instructions that adjust the stack pointer or spill registers to the stack.
By contrast, decoding AMD64 is a nightmare—the legacy instruction format is hell, and I believe that it is not possible to scan backwards even in principle, because the coding of preceding instructions could be ambiguous.
The MIPS chip was one of the very first commercial RISC chips. It was trying to take market share from established minicomputers like VAX. Hardware was at a premium, and instruction-level parallelism was achieved by pipelining. The “branch-delay slot” allowed for execution of the next instruction in the pipeline before taking the branch—there was no “interlock” to suppress that execution. (MIPS stands for “Microprocessor without interlocked pipeline stages”.)
The MIPS people went to great lengths to maximize the value of their upstart hardware. One of the many cute tricks was that page tables were managed almost entirely in software—the only virtual-memory hardware was the so-called “translation lookaside buffer (TLB),” which mapped only 64 virtual pages to physical addresses. A miss in the TLB resulted in a trap to the OS, which would look at the software page tables and update the TLB with a new association.
The big shift here is that during execution of a procedure, the stack pointer doesn’t move. You can dispense with a frame pointer, but your compiler has to be smart enough to look at a whole procedure and allocate all its stack space (including space for outgoing arguments) in one go. A 1985 personal workstation might have as much as 1MB of RAM, so it was practical to demand a more ambitious compiler.
The register conventions are carefully crafted to allow some caller-saves, some callee-saves, a couple for the kernel, and one for the assembler! (The register for the assembler enabled MIPS to craft an assembler that “looked like” it could work with 32-bit immediate values, even though those values had to be split across two instructions. This trick helped with adoption because although the RISC architecture required new assembly-language programming techniques, assembly-language programmers didn’t have to learn them: they could use the old, familiar techniques, and the assembler would compensate.)
Results:
Lua resources:
Roberto Ierusalimschy, Programming in Lua, first edition (Lua 5.0), Lua.org, 2003.
Ierusalimschy, de Figueiredo, and Celes, Lua—an Extensible Extension Language, Software—Practice & Experience 26(6), June 1996. 18 pages.
De Moura, Rodriguez, and Ierusalimschy, Coroutines in Lua, Journal of Universal Computer Science, 10(7), 2004, pages 910-925.
De Moura and Ierusalimschy, Revisiting Coroutines, ACM Transactions on Programming Languages and Systems, February 2009. Includes formal semantics, comparisons with one-shot continuations, and more. 31 pages.
A channel (called a “medium” by Cormack) has two states: full or empty. (It’s a Haskell `MVar`.) If we want to look at intermediate states, we can consider that:
talk = P; Write; V
listen = P; Read; V
We can describe the state in a more refined way by extending “full/empty” state with an additional component having three values:
The channel is initially empty/available, and one can easily draw the state diagram based on successful P and V operations—the diagram is a cycle and has exactly the six representable states.
It is also possible to draw a state diagram using the state of the underlying semaphores in the implementation, but this gets more complicated.
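A single-threaded sketch of the one-slot channel, with invented names, shows the full/empty states as two counting semaphores. Here `P` simply asserts that a permit is available rather than blocking; a real implementation would suspend the caller, e.g. by queueing it the way the CAS-based `p` below does.

```c
#include <assert.h>

typedef struct { unsigned empty, full; int slot; } channel;

static void channel_init(channel *c) { c->empty = 1; c->full = 0; c->slot = 0; }

/* Sketch only: no blocking, so a caller must not P an unavailable semaphore. */
static void P(unsigned *sem) { assert(*sem > 0); (*sem)--; }
static void V(unsigned *sem) { (*sem)++; }

static void talk(channel *c, int v) {   /* talk = P; Write; V */
    P(&c->empty);
    c->slot = v;
    V(&c->full);
}

static int listen(channel *c) {         /* listen = P; Read; V */
    P(&c->full);
    int v = c->slot;
    V(&c->empty);
    return v;
}
```

Between the `P` and the `V` of `talk`, the channel is in one of the intermediate states: neither semaphore holds a permit, which is exactly the refinement of “full/empty” the notes describe.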
void p(unsigned *sem) {
restart:
    unsigned n = *sem;
    unsigned final;
    if (isinteger(n) && n > 0) {
        final = n - 1;
    } else {
        // thread-local variable 'self' points to my process descriptor
        Cons cons = &self->cons_cell;
        cons->car = self;
        cons->cdr = n;
        final = (unsigned) cons;
    }
    atomic if (*sem == n) *sem = final; // compare and swap
    else goto restart;
}
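The notes give only `p`; a matching `v` in the same pseudocode style might look like the following. This is a guess at the other half: `make_runnable` is an invented scheduler call, and a real `v` would also have to decide which waiter to wake if the wait list holds more than one process.

```
void v(unsigned *sem) {
restart:
    unsigned n = *sem;
    unsigned final;
    if (isinteger(n)) {
        final = n + 1;                  // no waiters: just bump the count
    } else {
        // the value is a cons cell holding a waiting process
        Cons cons = (Cons) n;
        make_runnable(cons->car);       // hypothetical scheduler call
        final = cons->cdr;              // expose the rest of the wait list
    }
    atomic if (*sem == n) *sem = final; // compare and swap
    else goto restart;
}
```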