Informal description
This semantics describes the state of a simple virtual machine with an instruction stream, registers, global variables, and a literal pool.1 Instruction stream, registers, and global variables are concepts you know; the literal pool is the VM version of “read-only data,” and like read-only data in a native-code system, it is initialized at load time from a specification in the assembly language or object code.
The semantics is sophisticated enough to describe control flow within a single function, but it has no support for calls or returns. For a machine of this simplicity, operational semantics might be overkill, but when we want to add calls and tail calls, it will pay dividends.
Preliminaries
Semantic domains
The VM computes with values V. Registers are numbered with natural numbers ℕ, as are global variables and slots in the literal pool. Instructions ℐ are left unspecified but will be notated using typewriter font, as in examples halt
, check
(\(X\),\(YZ\)), or add
(\(X\), \(Y\), \(Z\)).
\(V\) | VM values |
𝒩 | Names of vScheme variables (subset of \(V\)) |
ℕ | Natural numbers |
ℒ | locations (in the store) |
ℐ | instructions |
Metavariables
Elements of semantic domains are written using conventional metavariables, as are some important functions on the same domains. The conventional metavariables and their types are shown below. Function δ specifies the execution of an instruction that runs entirely in the VM’s “arithmetic and logic unit” (ALU)—that is, an instruction that involves only registers and literals, with no control flow.
Metavariable | Type | Description |
---|---|---|
\(n\) | ℕ | A natural number |
\(v\) | \(V\) | A value |
\(ℓ\) | ℒ | A location |
\(R\) | ℕ → ℒ | The machine registers |
\(G\) | ℕ → ℒ | The global variables |
\(σ\) | ℒ → \(V\) | The store |
\(L\) | ℕ → \(V\) | The literal pool |
\(i\) | ℐ | An instruction |
\(I\) | ℐ list | A sequence of instructions |
\(δ\) | ⋯ | Semantic function for an ALU instruction |
The mapping defined by \(R\) is injective: different numbers designate different locations. In math, whenever \(i \ne j\) then \(R[i] \ne R[j]\). Likewise for \(G\) and globals.
The literal pool \(L\) need not be injective; it is possible for two different indices \(i\) and \(j\) to refer to the same literal. In math, it is possible for \(L[i] = L[j]\) even when \(i\ne j\).
Notation for the instruction stream
An instruction stream is a sequence of instructions. In this semantics, the sequence operations cons
and append
are both notated using the small infix dot ·. That dot can also stand for snoc
. So if \(i\) is an instruction and \(I\), \(I_1\), and \(I_2\) are sequences of instructions, then each of the following phrases stands for a sequence of instructions:
\(i\cdot I\)
\(I_1 \cdot I_2\)
\(I \cdot i\)
The VM, like a real machine, has an instruction pointer that points to a single instruction in the stream.2 That pointer is represented as a fat dot •. For example, if the VM has already executed instruction sequence \(I_1\), is about to execute an add
instruction, and then will continue with instruction sequence \(I_2\), the control part of its state is notated as \(I_1 \bullet \mathtt{add}(X,Y,Z) \cdot I_2\).
State and transitions of the abstract machine
The state of an abstract machine has these elements:
\(I₁•i·I₂\) | An instruction stream, with the IP pointing to \(i\) |
\(R\) | A register file; every register stands for a location |
\(L\) | A literal pool; every index stands for a value |
\(G\) | A global-variable table mapping natural numbers to locations |
\(σ\) | A store specifying the content of every location (the value of every register and global variable) |
Like the classic MIPS hardware, the SVM abstract machine does not have ugly “flags” or “condition codes.” Booleans are first-class values, and the results of comparisons and other Boolean operations are stored in general-purpose registers.
The machine’s state is written \(\langle I_1 \bullet i \cdot I_2, R, L, G, \sigma\rangle\).
Most transitions are for ALU-style instructions (“arithmetic and logical unit”). These instructions approximate the classic RISC instructions of 1980s hardware. An ALU instruction has access to the registers and the literal pool, and evaluating the instruction updates the store (that is, the contents of registers). The effect of an ALU instruction is determined by the function \(\delta\), which updates the contents of registers according to the equation \(\sigma' = \delta(i,R,L,\sigma)\). For example, here’s what \(\delta\) does with the add
and loadliteral
instructions:
\(\delta(\mathtt{add}(X,Y,Z), R, L, \sigma) = \sigma\{R(X) \mapsto \sigma(R(Y)) + \sigma(R(Z))\}\)
\(\delta(\mathtt{loadliteral}(X,YZ), R, L, \sigma) = \sigma\{R(X) \mapsto L(YX)\}\)
The state transition is written something like this:
\[ \frac{i \in \mathrm{dom}\ \delta} {\langle I_1 \bullet i \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle I_1 \cdot i \bullet I_2, R, L, G, \delta(i,R,L,\sigma)\rangle} \]
The transition rule has these properties:
The rule applies only when \(i\) is an ALU instruction (decided by \(i \in \mathrm{dom}\ \delta\).
The store changes as updated by \(\delta\).
The control changes: the instruction pointer moves past instruction \(i\) to point to the start of the sequence \(I_2\).
This rule covers a huge fraction of the available instructions—basically all the ones that correspond to language primitives.
Another interesting set of transitions are the control transitions. I recommending sticking with the classical two control instructions: conditional and unconditional goto. To eliminate linking issues, we’ll use a PC-relative goto; instruction goto
\(k\) adds \(k\) to the program counter. It would be nice to stop there, but the goto hides a potential off-by-one error: do you add \(k\) before or after the more usual “add 1”? The uncertainty is resolved using a formal rule:
\[\frac{} {\langle I_1 \bullet \mathtt{goto}(k) \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle \mathit{shift}(I_1 \cdot \mathtt{goto}(k) \bullet I_2, k), R, L, G, \sigma\rangle} \]
As the rule suggests, you add \(k\) after the usual increment, so an infinite loop would be \(\mathtt{goto}(-1)\text.\) The shift
function is defined by induction on the magnitude of \(k\):
\(\mathit{shift}(I_2 \bullet i \cdot I_2, 0) = I_2 \bullet i \cdot I_2\)
\(\mathit{shift}(I_2 \bullet i \cdot I_2, k) = \mathit{shift}(I_2 \cdot i \bullet I_2, k -1)\), when \(k > 0\)
\(\mathit{shift}(I_2 \cdot i \bullet I_2, k) = \mathit{shift}(I_2 \bullet i \cdot I_2, k + 1)\), when \(k < 0\)
This definition might look fancy, but it just adds \(k\) to the program counter.
For conditional goto, I’ve borrowed an idea found in the 8-bit machines of the 1970s3 and used today in the Lua virtual machine. The conditional instruction, which I’ll just call if
, checks a condition and if the condition is false, the if
skips the next instruction.
\[\frac{\mathit{falsey}(R(X))} {\langle I_1 \bullet \mathit{if}(X) \cdot i \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle I_1 \cdot \mathit{if}(X) \cdot i \bullet I_2, R, L, G, \sigma\rangle} \]
\[\frac{\lnot\mathit{falsey}(R(X))} {\langle I_1 \bullet \mathit{if}(X) \cdot i \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle I_1 \cdot \mathit{if}(X) \bullet i \cdot I_2, R, L, G, \sigma\rangle} \]
Values nil
and false
are “falsey”; other values are “truthy”:
\[\mathit{falsey}(v) = (v = \mathtt{\#f} \lor v = \mathbf{nil})\]
Unspecified instructions
From the first week, the VM will implement instructions like check
, expect
, and print
, which affect the state of the world in ways that are not captured by the formal machine state. That means these instructions can’t be fully specified. But it’s not a problem. As always, the role of a formal specification is to nail down the parts of a system that are error-prone or hard to get right—like PC-relative goto
. But the real reason we have a formal semantics is that it will help us implement function calls and register windows in module 7. What we’re doing now is mostly for practice.
What the operational semantics buys us
The operational semantics gives clear guidance on how to represent the state of a virtual machine. The operational semantics can also hint to us about costs:
If every transition depends on the form of the next instruction, then getting that instruction and decoding it had better be cheap.
If most instructions apply the \(\delta\) function, and if most uses of \(\delta\) involve values in VM registers, then getting to VM registers probably should also be cheap.
Parts of the machine that are used rarely, like the literal pool and the global variables, maybe don’t have to be so cheap.
How to use the operational semantics
The main task for this module is to design a C struct
to hold the machine state. This task can be guided by all parts of the semantics, starting with the semantic domains and the metavariables:
A location is a location in memory, possibly inside some C data structure. Any location can be referred to by a pointer of type
Value *
.Every other semantic domain corresponds to a C type. For example,
Instruction
is defined in file<iformat.h>
to be a 32-bit word.Function domains have many plausible C representations. For example, a function might be represented as an array or a hash table. And the store σ, which has type ℒ → V, is represented by the contents of the C heap. That “function” is “called” by dereferencing a pointer.
A list domain might be represented by a C array, plus perhaps extra variables that tell us where the list begins and ends. A linked list is also possible, but linked lists are most useful when elements are likely to be added or removed at run time. I plan for functions not to edit their own machine code at run time, so for the instruction stream, the extra flexibility afforded by a linked list isn’t needed.
Metavariables have the types given, but the C representations of functions and lists have to be chosen wisely.
The store \(\sigma\) tends to throw people off. It plays the same role as the store in the semantics for \(\mu\)Scheme, for the same reason: it’s a mathematical model for mutable locations. The semantics corresponds directly to code:
A location ℓ in ℒ corresponds to a memory region that holds one
Value
.To refer to a location, we need a C pointer
p
of typeValue *
. The location is then referred to as*p
. A location might also be referred to using array notation, as inR[i]
.
Mutable locations support just two operations:
The first operation a mutable location supports is reading the contents. In the semantics, the operation is written σ(ℓ), and in code, it is written
*p
. Notice there’s no data structure. The C heap is the data structure!The other operation a mutable location supports is mutating (writing) the contents. In the semantics, the operation is written σ{ℓ↦v}, which notionally creates a new store. But in an implementation, there is only ever one store, and it is updated in place. So in code, a mutation is written
*p = v
. That code updates the C heap, which is the only structure that represents the store.Notice that in C code, reading and writing use the same notation: both are written
*p
orR[i]
. The difference between reading and writing depends on context. For example, a location that appears on the left-hand side of an=
operator is written; a location that appears on the right-hand side of an=
operator is read.
There is only a single C heap, and it’s implicit, so the store σ has no explicit representation in the
struct VMState
.
Using this guidance, here’s your checklist:
Choose a C representation for each component of the abstract machine’s state.
When planning
vmrun
, consider which parts of a machine’s state might be cached in local C variables ofvmrun
.Implement some instructions according to the model. If an instruction isn’t modeled, just call functions like
print
andcheck
and trust that they will do the right thing.