Informal and operational semantics of the VM, part I

Informal description

This semantics describes the state of a simple virtual machine with an instruction stream, registers, global variables, and a literal pool.1 Instruction stream, registers, and global variables are concepts you know; the literal pool is the VM version of “read-only data,” and like read-only data in a native-code system, it is initialized at load time from a specification in the assembly language or object code.

The semantics is sophisticated enough to describe control flow within a single function, but it has no support for calls or returns. For a machine of this simplicity, operational semantics might be overkill, but when we want to add calls and tail calls, it will pay dividends.


Semantic domains

The VM computes with values V. Registers are numbered with natural numbers ℕ, as are global variables and slots in the literal pool. Instructions ℐ are left unspecified but will be notated using typewriter font, as in examples halt, check(\(X\),\(YZ\)), or add(\(X\), \(Y\), \(Z\)).

Semantic domains
\(V\) VM values
𝒩 Names of vScheme variables (subset of \(V\))
Natural numbers
locations (in the store)


Elements of semantic domains are written using conventional metavariables, as are some important functions on the same domains. The conventional metavariables and their types are shown below. Function δ specifies the execution of an instruction that runs entirely in the VM’s “arithmetic and logic unit” (ALU)—that is, an instruction that involves only registers and literals, with no control flow.

Metavariable Type Description
\(n\) A natural number
\(v\) \(V\) A value
\(ℓ\) A location
\(R\) ℕ → ℒ The machine registers
\(G\) ℕ → ℒ The global variables
\(σ\) ℒ → \(V\) The store
\(L\) ℕ → \(V\) The literal pool
\(i\) An instruction
\(I\) ℐ list A sequence of instructions
\(δ\) Semantic function for an ALU instruction

The mapping defined by \(R\) is injective: different numbers designate different locations. In math, whenever \(i \ne j\) then \(R[i] \ne R[j]\). Likewise for \(G\) and globals.

The literal pool \(L\) need not be injective; it is possible for two different indices \(i\) and \(j\) to refer to the same literal. In math, it is possible for \(L[i] = L[j]\) even when \(i\ne j\).

Notation for the instruction stream

An instruction stream is a sequence of instructions. In this semantics, the sequence operations cons and append are both notated using the small infix dot ·. That dot can also stand for snoc. So if \(i\) is an instruction and \(I\), \(I_1\), and \(I_2\) are sequences of instructions, then each of the following phrases stands for a sequence of instructions:

\(i\cdot I\)
\(I_1 \cdot I_2\)
\(I \cdot i\)

The VM, like a real machine, has an instruction pointer that points to a single instruction in the stream.2 That pointer is represented as a fat dot •. For example, if the VM has already executed instruction sequence \(I_1\), is about to execute an add instruction, and then will continue with instruction sequence \(I_2\), the control part of its state is notated as \(I_1 \bullet \mathtt{add}(X,Y,Z) \cdot I_2\).

State and transitions of the abstract machine

The state of an abstract machine has these elements:

\(I₁•i·I₂\) An instruction stream, with the IP pointing to \(i\)
\(R\) A register file; every register stands for a location
\(L\) A literal pool; every index stands for a value
\(G\) A global-variable table mapping natural numbers to locations
\(σ\) A store specifying the content of every location (the value of every register and global variable)

Like the classic MIPS hardware, the SVM abstract machine does not have ugly “flags” or “condition codes.” Booleans are first-class values, and the results of comparisons and other Boolean operations are stored in general-purpose registers.

The machine’s state is written \(\langle I_1 \bullet i \cdot I_2, R, L, G, \sigma\rangle\).

Most transitions are for ALU-style instructions (“arithmetic and logical unit”). These instructions approximate the classic RISC instructions of 1980s hardware. An ALU instruction has access to the registers and the literal pool, and evaluating the instruction updates the store (that is, the contents of registers). The effect of an ALU instruction is determined by the function \(\delta\), which updates the contents of registers according to the equation \(\sigma' = \delta(i,R,L,\sigma)\). For example, here’s what \(\delta\) does with the add and loadliteral instructions:

  • \(\delta(\mathtt{add}(X,Y,Z), R, L, \sigma) = \sigma\{R(X) \mapsto \sigma(R(Y)) + \sigma(R(Z))\}\)

  • \(\delta(\mathtt{loadliteral}(X,YZ), R, L, \sigma) = \sigma\{R(X) \mapsto L(YX)\}\)

The state transition is written something like this:

\[ \frac{i \in \mathrm{dom}\ \delta} {\langle I_1 \bullet i \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle I_1 \cdot i \bullet I_2, R, L, G, \delta(i,R,L,\sigma)\rangle} \]

The transition rule has these properties:

  1. The rule applies only when \(i\) is an ALU instruction (decided by \(i \in \mathrm{dom}\ \delta\).

  2. The store changes as updated by \(\delta\).

  3. The control changes: the instruction pointer moves past instruction \(i\) to point to the start of the sequence \(I_2\).

This rule covers a huge fraction of the available instructions—basically all the ones that correspond to language primitives.

Another interesting set of transitions are the control transitions. I recommending sticking with the classical two control instructions: conditional and unconditional goto. To eliminate linking issues, we’ll use a PC-relative goto; instruction goto \(k\) adds \(k\) to the program counter. It would be nice to stop there, but the goto hides a potential off-by-one error: do you add \(k\) before or after the more usual “add 1”? The uncertainty is resolved using a formal rule:

\[\frac{} {\langle I_1 \bullet \mathtt{goto}(k) \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle \mathit{shift}(I_1 \cdot \mathtt{goto}(k) \bullet I_2, k), R, L, G, \sigma\rangle} \]

As the rule suggests, you add \(k\) after the usual increment, so an infinite loop would be \(\mathtt{goto}(-1)\text.\) The shift function is defined by induction on the magnitude of \(k\):

\(\mathit{shift}(I_2 \bullet i \cdot I_2, 0) = I_2 \bullet i \cdot I_2\)
\(\mathit{shift}(I_2 \bullet i \cdot I_2, k) = \mathit{shift}(I_2 \cdot i \bullet I_2, k -1)\), when \(k > 0\)
\(\mathit{shift}(I_2 \cdot i \bullet I_2, k) = \mathit{shift}(I_2 \bullet i \cdot I_2, k + 1)\), when \(k < 0\)

This definition might look fancy, but it just adds \(k\) to the program counter.

For conditional goto, I’ve borrowed an idea found in the 8-bit machines of the 1970s3 and used today in the Lua virtual machine. The conditional instruction, which I’ll just call if, checks a condition and if the condition is false, the if skips the next instruction.

\[\frac{\mathit{falsey}(R(X))} {\langle I_1 \bullet \mathit{if}(X) \cdot i \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle I_1 \cdot \mathit{if}(X) \cdot i \bullet I_2, R, L, G, \sigma\rangle} \]

\[\frac{\lnot\mathit{falsey}(R(X))} {\langle I_1 \bullet \mathit{if}(X) \cdot i \cdot I_2, R, L, G, \sigma\rangle \rightarrow \langle I_1 \cdot \mathit{if}(X) \bullet i \cdot I_2, R, L, G, \sigma\rangle} \]

Values nil and false are “falsey”; other values are “truthy”:

\[\mathit{falsey}(v) = (v = \mathtt{\#f} \lor v = \mathbf{nil})\]

Unspecified instructions

From the first week, the VM will implement instructions like check, expect, and print, which affect the state of the world in ways that are not captured by the formal machine state. That means these instructions can’t be fully specified. But it’s not a problem. As always, the role of a formal specification is to nail down the parts of a system that are error-prone or hard to get right—like PC-relative goto. But the real reason we have a formal semantics is that it will help us implement function calls and register windows in module 7. What we’re doing now is mostly for practice.

What the operational semantics buys us

The operational semantics gives clear guidance on how to represent the state of a virtual machine. The operational semantics can also hint to us about costs:

How to use the operational semantics

The main task for this module is to design a C struct to hold the machine state. This task can be guided by all parts of the semantics, starting with the semantic domains and the metavariables:

Using this guidance, here’s your checklist:

  1. Choose a C representation for each component of the abstract machine’s state.

  2. When planning vmrun, consider which parts of a machine’s state might be cached in local C variables of vmrun.

  3. Implement some instructions according to the model. If an instruction isn’t modeled, just call functions like print and check and trust that they will do the right thing.

  1. At run time and in the VM instruction set, each global variable is referred to by number. But in the virtual object code, global variables are referred to be name, not by number. In module 2, you’ll build the VM loader, which among other jobs converts each name to a number at load time.

  2. Because my early education involved IBM hardware, not Intel hardware, you will probably hear me call it the “program counter,” not the “instruction pointer.”

  3. Look up “decrement and skip if zero.”