Introduction

Constructed data is a class of values that includes all the values of ML’s algebraic data types, as well as tuples and records.¹ Lists are a special case of constructed data; cons abbreviates “construct.” Like any other class of values or types, constructed data is supported by introduction forms and elimination forms:

The introduction form for constructed data is the value constructor. A value constructor may be a value of constructed data all by itself, as in ML’s NONE or []. Or a value constructor may need to be applied to an argument or arguments, as in ML’s SOME or ::. The result of the application is constructed data.

In many languages, value constructors are identifiable lexically; in Haskell, for example, a value constructor’s name begins with a capital letter or a colon. The name of a variable must begin with a lowercase letter or with some non-colon symbol.
The elimination form for constructed data is the case expression (pattern matching). Other elimination forms include ML’s val bindings and clausal (fun) definitions, but all of these are syntactic sugar for case expressions. (The desugaring is described in chapter 8 of my book.)

Constructed data is found in both typed and untyped languages:

In ML, Haskell, OCaml, Miranda, Agda, and similar functional languages, tuple types are built in, and additional types of constructed data are enabled by means of datatype definitions.
In Erlang and Prolog, a value constructor’s name begins with a lowercase letter. Any such name can be applied to a list of arguments to form constructed data; the constructor doesn’t have to have been declared.

Constructed data in source code: The eScheme language

To enable us to implement constructed data and pattern matching, I have defined a variation on vScheme that has Erlang-style pattern matching. The variation is called eScheme. Its concrete syntax is taken from $\mu$ML, which is described in chapter 8 of my book Programming Languages: Build, Prove, and Compare.

The introduction form is the value constructor, which is recognized by its name. Any name that begins with # or with a capital letter or with make- is recognized as a value constructor. The name cons and the literal '() are also recognized as value constructors. Any other name is recognized as a variable.

A value constructor is a form of expression, and it may appear anywhere an expression can appear: in both function position and argument position.
The elimination form is a case expression:

(case $e$ ([$p_1$ $e_1$] $\cdots$ [$p_n$ $e_n$])),

where each $p_i$ is a pattern, and each [$p_i$ $e_i$] is called a choice. The value of expression $e$, called the scrutinee, is matched against each pattern in order. According to the operational semantics (book section 8.8.1), if $p_i$ is the first pattern that matches the value of $e$, then the result of the case expression is the value of $e_i$. Crucially, pattern $p_i$ may bind variables that are in scope for $e_i$.
The pattern is a new syntactic category with these forms:
- A variable
- A value constructor
- A value constructor applied to one or more patterns
- A literal integer(!)
- A “wildcard,” written as a lone underscore (_)
The matching is as in ML; a detailed semantics is presented in chapter 8.

I have extended our parser to understand the new syntactic forms and to parse μML code.

Matching in eScheme differs from matching in Standard ML in one important way: in eScheme, every value constructor can be used at every arity. For example, all of the following expression evaluate to well-formed constructed data:

(SOME 3)            ;; the number 3, wrapped
SOME                ;; a bare constructor (not a function as in μML) 
(SOME 'Ramsey 106)  ;; sensible (not ill-typed as in μML )

Constructed data at run time

When a value constructor is applied to $n$ arguments ($n>0$), it is represented at run time as a Block value with $n+1$ slots. The first slot contains the name of the constructor, and the remaining slots contain the arguments, in order.

When a value constructor is used by itself, or is applied to no arguments, it is represented at run time by its name. No Block is allocated.

Pattern matching operates by checking not only the identity of the value constructor but also the number of arguments to which it is applied. In CS 106, this combination of value constructor and arity will be called a labeled constructor. (The label is the arity.) A labeled constructor is written $C/n$, where $C$ is the name of the value constructor and $n$ is the number of arguments to which $C$ is applied. As examples,

(SOME 3)            ;; the labeled constructor is SOME/1
SOME                ;; the labeled constructor is SOME/0
(SOME 'Ramsey 106)  ;; the labeled constructor is SOME/2

VM Instructions for working with constructed data

Constructed data also require support for introduction and elimination at run time.

Instructions for introducing constructed data

In the SVM, constructed data is introduced by getting a value intro a register. If the value is a bare value constructor itself, the UFT can generate a loadliteral instruction. If the value is meant to be a block that holds a value constructor and its arguments, I recommend instructions that are nearly identical to the instructions for working with closures:

mkblock takes a destination register, a source register, and an 8-bit size. It allocates a fresh block of the given size, puts the value of the source register in slot 0, and stores the block in the destination register.
getblockslot takes a destination register, a source register, and an 8-bit unsigned index. It expects a block in the source register. From that block it extracts the value in the indexed slot, and it places that value in the destination register.
setblockslot takes a block register, a value register, and an 8-bit unsigned index. The block register must hold a block, and the instruction stores the value in the indexed slot of the given block. The instruction is executed for side effect; there is no destination register.

Instructions for eliminating constructed data

The instruction for eliminating constructed data is a computed goto. The instruction examines the scrutinee’s value constructor, and based on what it sees, it branches to a new address. In a native-code compiler for a typed functional language like ML or Haskell, each value constructor $C$ is represented by a small integer, and the compiler can generate an indirect branch that uses a jump table:

The compiler places a jump table (array of addresses) into read-only data.
Constructor $C$’s small integer is used as an index into the jump table, each entry of which contains the address of the code that handles the case for $C$.
The compiler generates code that loads the jump-table entry into a machine register.
An indirect branch (“jump register”) transfers control to the address in the register.

It’s the same code that a C or C++ compiler generates for a switch statement.

When the language does not have a static type system, computed goto is not an option. Instead, the compiler generates a different species of jump table: a list of key-value pairs. Each key is a labeled constructor~$C/n$, and each value is the address of the code that is meant to handle the case of that labeled constructor.

Our .vo file format doesn’t include read-only or literal arrays, and I’d rather not change the .vo format or the loader. So the jump table will be coded as a sequence of VM instructions:

A labeled constructor (key) is represented by a new if-vcon-match instruction. Opcode if-vcon-match expects an 8-bit arity in the $X$ field and a literal index in the $\mathit{YZ}$ field. It designates the labeled constructor literals[$\mathit{YZ}$]/$X$.
A target address (value) is represented by a goto instruction.

The computed goto is performed by new instruction goto-vcon.

Opcode goto-vcon expects the value of the scrutinee in the $X$ register. In the $Y$ field, it expects an unsigned 8-bit integer giving the number of entries in the jump table. The jump table must follow the goto-vcon instruction immediately.

The goto-vcon instruction computes the labeled constructor associated with the scrutinee, then searches the jump table sequentially for the first matching key.² If it finds a matching key, it jumps to the address coded by the corresponding value. Otherwise, execution continues with the instruction immediately after the jump table.

This representation has the advantage that the jump-table entries could, in principle, be interpreted by vmrun as instructions. This approach is useful for debugging, but in your final SVM I expect you instead to do all the computation in the goto-vcon instruction, so the entire operation takes only one trip through the main loop in vmrun.

A labeled constructor is associated with a scrutinee according to these rules:

If the scrutinee is a block b, then the constructor name $C$ is b[0] and the arity $n$ is one less then the number of slots in block b.
If the scrutinee is a cons cell, then the constructor name $C$ is "cons" and the arity $n$ is 2.
If the scrutinee is any other value $v$, then the constructor name $C$ is $v$ and the arity $n$ is zero.

Constructed data has new support in each of the compiler’s internal languages. Some related definitions are found in file case.sml, in structures Pattern, Case, and Constructed. These definitions show that internally, there is no such thing as “a value constructor by itself.” Such a constructor is always represented as its application to an empty list of arguments.

The internal languages are extended as follows:

The source AST acquires forms CASE and VCON.
Unambiguous vScheme keeps CASE, but VCON disambiguates to a CONSTRUCTED form, which applies a value constructor to zero or more arguments.
First-Order Scheme and its extension Closed Scheme retain the CASE and CONSTRUCTED forms from Unambiguous Scheme.
K-normal form acquires two forms that more closely resemble the VM instructions:
- The BLOCK form carries a list of registers; when it evaluated, it allocates a fresh block with the same number of slots as registers. Each slot in the block is then initialized with the value of the corresponding register.
  
  During code generation, the BLOCK form is translated into a mkblock instruction, followed by a (possibly empty) sequence of setblockslot instructions.
- The SWITCH_VCON form takes a scrutinee, a list of pairs each holding a labeled constructor and an expression, and finally a default expression to be executed if no labeled constructor matches. During code generation, each SWITCH_VCON is translated into a goto-vcon instruction followed by a jump table.

During K-normalization, each CASE is translated to a tree of SWITCH_VCON forms. The translation is performed by a match compiler, which you will write.

Embedding into vScheme

Embedding eScheme into vScheme is gnarly.

Anywhere that eScheme would use a block of $n$ slots, the embedding uses a list of $n$ elements. To make embedding easy, I’ve added a block primitive function to vScheme. It takes any number of arguments and returns those arguments in a vScheme list.
For readability, the CASE forms in first-order Scheme embed as case. But the embedded code won’t run.
The BLOCK form in KNF embeds as a call to primitive block.
The SWITCH_VCON form in KNF embeds as a Scheme cond form. Each condition calls a function matches-vcon-arity?, which I have added to the vScheme interpreter as a predefined function. This embedding should run.

Legacy values vs constructed data

In μML, lists and Booleans are not special. They are not supported by any primitives; they are instead predefined:

;; define type constructors and their associated value constructors
(data (* => *) list
  ['()  : (forall ['a] (list 'a))]
  [cons : (forall ['a] ('a (list 'a) -> (list 'a)))])

(data * bool
  [#t : bool]
  [#f : bool])

And their representations are not special; in the μML interpreter they are all just constructed data (CONVAL):

datatype exp = ...
and value
  = CONVAL of vcon * value list
  | SYM    of name
  | NUM    of int
  | CLOSURE   of lambda * value env ref
  | PRIMITIVE of primop
 withtype lambda = name list * exp
      and primop = value list -> value

In eScheme, we want lists and Booleans to continue to use the special-purpose representations in the SVM. To achieve this end, we special case the code generation and K-normalization of these forms:

The application of an ordinary value constructor translates into a K-normal form expression that allocates and initializes a block (record). But the application of cons to two arguments translates into a call to SVM primitive cons.
When not in function position, an occurrence of an ordinary value constructor translates into a symbol with the same name as the value constructor. But an occurrence of #t or #f translates into a Boolean literal, and an occurrence of ’() translates into the empty-list literal.
The translation of a case expression has to account for these changes. For example, to see if a scrutinee is an application of cons to two arguments, the translation won’t look for a Block with the string cons in the first slot—instead it will look for a ConsCell. And similarly for Booleans and the empty list.

Operational semantics

To give an operational semantics of the SVM’s new computed goto, I extend the VM state with a new component of the form $v/k$, where $v$ is a value associated with a value constructor, and $k$ is the number of arguments to which the constructor is applied.³ The component $v/k$ is updated by the GotoVcon instruction, inspected by the IfVconMatch instruction, and ignored by all other instructions.

Function $\mathit{con}$ extracts a value constructor and arity from constructed data. If it is given a value that is not constructed data, it treats the value itself as the constructor, with arity 0:

\[\frac{\mbox{$b$ is a block with $n$ slots}} {\mathit{con}(b) = b[0]/n} \]

\[\frac{\mbox{$c$ is a cons cell}} {\mathit{con}(c) = \mathtt{cons}/2} \]

\[\frac{\mbox{$v$ is neither a cons cell nor a block}} {\mathit{con}(v) = v/0} \]

In the operational semantics, GotoVcon instruction simply uses $\mathit{con}$ to set $v/k$. (In the implementation, the GotoVcon instruction does this and also does all the work of the succeeding instructions.) \[\frac{\mathit{con}(\sigma(R(r))) = v'/k'}{ \begin{array}{l} \langle I_1 \bullet \mathtt{GotoVcon}(r, n) \cdot I_2, R, L, G, S, \sigma, v/k \rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{GotoVcon}(r, n) \bullet I_2, R, L, G, S, \sigma, v'/k'\rangle \end{array} } \] The parameter $n$ is ignored by the operational semantics, but the implementation requires that the GotoVcon instruction be followed by $n$ pairs of IfVconMatch/Goto instructions.

The IfVconMatch instruction works very much like the If instruction, except that it checks the current $v/k$, not a value in a register. And because its work is done by GotoVcon, it doesn’t need to be implemented. If $v/k$ matches the constructor and arity specified by the IfVconMatch instruction, evaluation continues with the succeeding instruction $i$, which is always a Goto: \[\frac{v = L(j) \land k = n} {\langle I_1 \bullet \mathtt{IfVconMatch}(n, j) \cdot i \cdot I_2, R, L, G, \sigma, v/k\rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{IfVconMatch}(n, j) \bullet i \cdot I_2, R, L, G, \sigma, v/k\rangle} \]

If $v/k$ does not match, evaluation skips $i$ and continues with the instruction after $i$: \[\frac{v \ne L(j) \lor k \ne n} {\langle I_1 \bullet \mathtt{IfVconMatch}(n, j) \cdot i \cdot I_2, R, L, G, \sigma, v/k\rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{IfVconMatch}(n, j) \cdot i \bullet I_2, R, L, G, \sigma, v/k\rangle} \]

In my book, I treat tuples and records as a special case of constructed data, and we’ll do the same here.↩
In principle, the UFT could sort the keys and arrange for binary search here, but in the common case the resulting SVM could would most likely be slower. Binary search isn’t worth the effort.↩
The notation comes from Prolog and Erlang.↩