Reading guide for GC chapter

Garbage collection is the topic of chapter 4 of my book Programming Languages: Build, Prove, and Compare. Since the book wasn’t available at the start of the last complete CS 105, forcing you to buy a copy seems a bit harsh. So I have posted the version that my publisher granted permission to use in Fall 2020; you’ll find it on Slack. This version has figures that are not as good as those in the published version, but otherwise it is nearly identical to the published chapter.

You’ll read sections 4.1, 4.2, and 4.5. The first two sections introduce the basics, and section 4.5 explains copying collection in detail.

For lab, the key parts are the 2nd paragraph on page 265 (on reachability), and the description of tricolor marking on pages 265 and 266.

Section 4.3 describes infrastructure meant for one of the definitional interpreters used in my book. I’ve replaced it with similar text that describes the SVM.

Section 4.6 (debugging) will be helpful later on, but read just the introduction to the section and the techniques in section 4.6.2. (I’ve replaced the code in section 4.6.1 with direct calls to Valgrind macros.)

As you read, be aware of an essential difference in how the heap is used in the book chapter versus the SVM:

In the chapter, every Value is potentially allocated on the heap.
In the SVM, every Value is allocated in a VM register. Only large “payloads” are allocated on the heap. The most common payloads are
- The car and cdr of a cons cell
- The characters of a string
- The instructions of a function
- The captured variables of a closure

Page-by-page reading guide
261	If you like, you can skip the first couple of paragraphs of motivation.
262	Start with “How can the system know…” just before the diagrams, and look at the diagrams. Be aware that the ideas of the diagrams are useful, but because of the differences in how values are stored, the details might be misleading.
263	Right after the last diagram, the paragraph “locations holding unreachable objects” has important vocabulary. The rest of the information on the page is contextual, and since we’re not going to talk much about performance, you can skip it.
264	Read the introduction to section 4.2. Skip section 4.2.1 (Performance).
265-266	Read section 4.2.2 (Reachability and roots). This is the key section. In our SVM, roots are found in the VM state. They are in activations (on the call stack) and in live registers. For tricolor marking, we’ll put gray objects on a stack.
266-267	Section 4.2.3 (Heap growth) can be skipped on first reading. You’ll come back to it later.
268	Instead of reading section 4.3.2, read the adapted version below.
276-281	Section 4.5 has the basic ideas of copying collection. And our allocator will use pointers `hp` and `heaplimit`. But our free space will be broken into pages. Pages support parallel, incremental, and generational garbage collection, and they make it much easier to manage the size of the heap. They add a tiny amount of allocation overhead per page, which on decent-size pages is negligible. Read the section through the end of section 4.5.2. Skip all the code bits. I’ve ported those codes; you’ll find `copy` and `forward` procedures in file `vmheap.c`.
282	Take a look at function `scanloc` on page 282, and compare it with my function `scan_value` defined in file `vmheap.c`. You’ll be writing two similar scanning functions: one for activation records, which is almost trivial, and one for your VM state, which is not trivial.
284-288	Section 4.6 (Debugging) is worth a look. I suspect I will have to port some of the functions there to work with payloads of varying sizes. But I will provide something resembling this interface.

The managed heap in the SVM (adapted from book section 4.3)

Every Value object has a tag and a payload. The tag is a small integer that identifies the Value as a String, Boolean ConsCell, Number, VMFunction, VMClosure, or whatever. The payload is a whatever other information is needed to know everything there is to know about the value. A payload that fits in a machine word, like a Boolean or a number, is stored directly in the Value. But a payload that might not fit in a machine word, like a string or a cons cell, is just a pointer—which points to memory that is allocated on the managed heap. The managed heap holds all memory allocated by VM instructions and by the SVM loader.

Where `Value` objects are stored

All Value objects are stored either in VM registers or in heap-allocated data structures (like cons cells, closures, and tables). Unlike the interpreter in the book chapter, our SVM does not use an evaluation stack.

Reachability of locations

The garbage collector has to be able to find any location allocated on the heap, which means

Any pointer inside a Value that points to a heap-allocated object (e.g., function, string, closure, cons cell)
Any other pointer to a heap-allocated object, like for example a function suspended on the call stack

Such a pointer might be reachable from a variety of other data. The set of types that from which a heap-allocated pointer might be reachable is defined by induction.

Depending on its tag, any Value might contain a pointer to a heap object.
An activation record might contain a pointer to a struct VMFunction object.¹
A block payload, including a cons cell, might contain a Value.
A closure payload contains a pointer to a heap-allocated function, and it might contain a Value.
A table payload might contain a Value.
A VM state is likely to contain Values, activation records, a table payload, and a function payload.

Interface to the managed heap

The interface to the SVM heap is sketched in file vmheap.h.

Notable aspects of our system

Values are in registers; only payloads are on the heap.

String payloads and function payloads don’t contain pointers. Closure payloads do contain pointers. Block payloads and table payloads may contain pointers.

In most systems, a heap-allocated object contains a tag that identifies the size of the object and tells if the object contains pointers and where. In our system, the tag (which is essential) isn’t stored on the heap; it’s stored in a VM register. That makes our system work a little differently.

A pointer to the middle of an instruction stream is a so-called “interior pointer.” Interior pointers are hard to manage during copying collection—in order to know where the interior pointer moves to, the collector has to figure out what function the instruction stream is in. Our system won’t support interior pointers; any pointer to the middle of an instruction stream will have to be materialized using the function itself and an integer program counter.

In the book, no auxiliary data structure is needed to hold the gray objects: to-space itself acts as a queue. (This trick is owed to a 1970 paper by C. J. Cheney.) We use an explicit stack for two reasons:

Our heap objects don’t have tags—each object’s tag is one half of a struct Value—so there’s no way to look at a to-space page and find the objects.
Using a queue winds up traversing the graph of reachable objects breadth first, which separates objects from their immediate successors. Such separation results in poor locality. By using a stack, we enforce a depth-first traversal, which puts objects closer to their parents and siblings. On 21st-century hardware, locality matters!

If the activation record contains a pointer to the middle of a function’s instruction stream, that’s a so-called “interior pointer”, and they’re hard to garbage collect. Change your code so it uses the function and an integer index.↩