Garbage-collection background

Reading guide for GC chapter

Garbage collection is the topic of chapter 4 of my book Programming Languages: Build, Prove, and Compare. Since the book wasn’t available at the start of the last complete CS 105, forcing you to buy a copy seems a bit harsh. So I have posted the version that my publisher granted permission to use in Fall 2020; you’ll find it on Slack. This version has figures that are not as good as those in the published version, but otherwise it is nearly identical to the published chapter.

You’ll read sections 4.1, 4.2, and 4.5. The first two sections introduce the basics, and section 4.5 explains copying collection in detail.

For lab, the key parts are the 2nd paragraph on page 265 (on reachability), and the description of tricolor marking on pages 265 and 266.

Section 4.3 describes infrastructure meant for one of the definitional interpreters used in my book. I’ve replaced it with similar text that describes the SVM.

Section 4.6 (debugging) will be helpful later on, but read just the introduction to the section and the techniques in section 4.6.2. (I’ve replaced the code in section 4.6.1 with direct calls to Valgrind macros.)

As you read, be aware of an essential difference in how the heap is used in the book chapter versus the SVM:

Page-by-page reading guide
261 If you like, you can skip the first couple of paragraphs of motivation.
262 Start with “How can the system know…” just before the diagrams, and look at the diagrams. Be aware that the ideas of the diagrams are useful, but because of the differences in how values are stored, the details might be misleading.
263 Right after the last diagram, the paragraph “locations holding unreachable objects” has important vocabulary. The rest of the information on the page is contextual, and since we’re not going to talk much about performance, you can skip it.
264 Read the introduction to section 4.2. Skip section 4.2.1 (Performance).

Read section 4.2.2 (Reachability and roots). This is the key section.

In our SVM, roots are found in the VM state. They are in activations (on the call stack) and in live registers.

For tricolor marking, we’ll put gray objects on a stack.

266-267 Section 4.2.3 (Heap growth) can be skipped on first reading. You’ll come back to it later.
268 Instead of reading section 4.3.2, read the adapted version below.

Section 4.5 has the basic ideas of copying collection. And our allocator will use pointers hp and heaplimit. But our free space will be broken into pages. Pages support parallel, incremental, and generational garbage collection, and they make it much easier to manage the size of the heap. They add a tiny amount of allocation overhead per page, which on decent-size pages is negligible.

Read the section through the end of section 4.5.2. Skip all the code bits. I’ve ported those codes; you’ll find copy and forward procedures in file vmheap.c.

282 Take a look at function scanloc on page 282, and compare it with my function scan_value defined in file vmheap.c. You’ll be writing two similar scanning functions: one for activation records, which is almost trivial, and one for your VM state, which is not trivial.
284-288 Section 4.6 (Debugging) is worth a look. I suspect I will have to port some of the functions there to work with payloads of varying sizes. But I will provide something resembling this interface.

The managed heap in the SVM (adapted from book section 4.3)

Every Value object has a tag and a payload. The tag is a small integer that identifies the Value as a String, Boolean ConsCell, Number, VMFunction, VMClosure, or whatever. The payload is a whatever other information is needed to know everything there is to know about the value. A payload that fits in a machine word, like a Boolean or a number, is stored directly in the Value. But a payload that might not fit in a machine word, like a string or a cons cell, is just a pointer—which points to memory that is allocated on the managed heap. The managed heap holds all memory allocated by VM instructions and by the SVM loader.

Where Value objects are stored

All Value objects are stored either in VM registers or in heap-allocated data structures (like cons cells, closures, and tables). Unlike the interpreter in the book chapter, our SVM does not use an evaluation stack.

Reachability of locations

The garbage collector has to be able to find any location allocated on the heap, which means

Such a pointer might be reachable from a variety of other data. The set of types that from which a heap-allocated pointer might be reachable is defined by induction.

Interface to the managed heap

The interface to the SVM heap is sketched in file vmheap.h.

Notable aspects of our system

Values are in registers; only payloads are on the heap.

String payloads and function payloads don’t contain pointers. Closure payloads do contain pointers. Block payloads and table payloads may contain pointers.

In most systems, a heap-allocated object contains a tag that identifies the size of the object and tells if the object contains pointers and where. In our system, the tag (which is essential) isn’t stored on the heap; it’s stored in a VM register. That makes our system work a little differently.

A pointer to the middle of an instruction stream is a so-called “interior pointer.” Interior pointers are hard to manage during copying collection—in order to know where the interior pointer moves to, the collector has to figure out what function the instruction stream is in. Our system won’t support interior pointers; any pointer to the middle of an instruction stream will have to be materialized using the function itself and an integer program counter.

In the book, no auxiliary data structure is needed to hold the gray objects: to-space itself acts as a queue. (This trick is owed to a 1970 paper by C. J. Cheney.) We use an explicit stack for two reasons:

  1. If the activation record contains a pointer to the middle of a function’s instruction stream, that’s a so-called “interior pointer”, and they’re hard to garbage collect. Change your code so it uses the function and an integer index.