Introduction
Constructed data is a class of values that includes all the values of ML’s algebraic data types, as well as tuples and records.1 Lists are a special case of constructed data; cons
abbreviates “construct.” Like any other class of values or types, constructed data is supported by introduction forms and elimination forms:
The introduction form for constructed data is the value constructor. A value constructor may be a value of constructed data all by itself, as in ML’s
NONE
or[]
. Or a value constructor may need to be applied to an argument or arguments, as in ML’sSOME
or::
. The result of the application is constructed data.In many languages, value constructors are identifiable lexically; in Haskell, for example, a value constructor’s name begins with a capital letter or a colon. The name of a variable must begin with a lowercase letter or with some non-colon symbol.
The elimination form for constructed data is the
case
expression (pattern matching). Other elimination forms include ML’sval
bindings and clausal (fun
) definitions, but all of these are syntactic sugar forcase
expressions. (The desugaring is described in chapter 8 of my book.)
Constructed data is found in both typed and untyped languages:
In ML, Haskell, OCaml, Miranda, Agda, and similar functional languages, tuple types are built in, and additional types of constructed data are enabled by means of datatype definitions.
In Erlang and Prolog, a value constructor’s name begins with a lowercase letter. Any such name can be applied to a list of arguments to form constructed data; the constructor doesn’t have to have been declared.
Constructed data in source code: The eScheme language
To enable us to implement constructed data and pattern matching, I have defined a variation on vScheme that has Erlang-style pattern matching. The variation is called eScheme. Its concrete syntax is taken from \(\mu\)ML, which is described in chapter 8 of my book Programming Languages: Build, Prove, and Compare.
The introduction form is the value constructor, which is recognized by its name. Any name that begins with
#
or with a capital letter or withmake-
is recognized as a value constructor. The namecons
and the literal'()
are also recognized as value constructors. Any other name is recognized as a variable.A value constructor is a form of expression, and it may appear anywhere an expression can appear: in both function position and argument position.
The elimination form is a
case
expression:(case \(e\) ([\(p_1\) \(e_1\)] \(\cdots\) [\(p_n\) \(e_n\)])),
where each \(p_i\) is a pattern, and each [\(p_i\) \(e_i\)] is called a choice. The value of expression \(e\), called the scrutinee, is matched against each pattern in order. According to the operational semantics (book section 8.8.1), if \(p_i\) is the first pattern that matches the value of \(e\), then the result of the
case
expression is the value of \(e_i\). Crucially, pattern \(p_i\) may bind variables that are in scope for \(e_i\).The pattern is a new syntactic category with these forms:
- A variable
- A value constructor
- A value constructor applied to one or more patterns
- A literal integer(!)
- A “wildcard,” written as a lone underscore (
_
)
The matching is as in ML; a detailed semantics is presented in chapter 8.
I have extended our parser to understand the new syntactic forms and to parse μML code.
Matching in eScheme differs from matching in Standard ML in one important way: in eScheme, every value constructor can be used at every arity. For example, all of the following expression evaluate to well-formed constructed data:
(SOME 3) ;; the number 3, wrapped
SOME ;; a bare constructor (not a function as in μML)
(SOME 'Ramsey 106) ;; sensible (not ill-typed as in μML )
Constructed data at run time
When a value constructor is applied to \(n\) arguments (\(n>0\)), it is represented at run time as a Block
value with \(n+1\) slots. The first slot contains the name of the constructor, and the remaining slots contain the arguments, in order.
When a value constructor is used by itself, or is applied to no arguments, it is represented at run time by its name. No Block
is allocated.
Pattern matching operates by checking not only the identity of the value constructor but also the number of arguments to which it is applied. In CS 106, this combination of value constructor and arity will be called a labeled constructor. (The label is the arity.) A labeled constructor is written \(C/n\), where \(C\) is the name of the value constructor and \(n\) is the number of arguments to which \(C\) is applied. As examples,
(SOME 3) ;; the labeled constructor is SOME/1
SOME ;; the labeled constructor is SOME/0
(SOME 'Ramsey 106) ;; the labeled constructor is SOME/2
VM Instructions for working with constructed data
Constructed data also require support for introduction and elimination at run time.
Instructions for introducing constructed data
In the SVM, constructed data is introduced by getting a value intro a register. If the value is a bare value constructor itself, the UFT can generate a loadliteral
instruction. If the value is meant to be a block that holds a value constructor and its arguments, I recommend instructions that are nearly identical to the instructions for working with closures:
mkblock
takes a destination register, a source register, and an 8-bit size. It allocates a fresh block of the given size, puts the value of the source register in slot 0, and stores the block in the destination register.getblockslot
takes a destination register, a source register, and an 8-bit unsigned index. It expects a block in the source register. From that block it extracts the value in the indexed slot, and it places that value in the destination register.setblockslot
takes a block register, a value register, and an 8-bit unsigned index. The block register must hold a block, and the instruction stores the value in the indexed slot of the given block. The instruction is executed for side effect; there is no destination register.
Instructions for eliminating constructed data
The instruction for eliminating constructed data is a computed goto
. The instruction examines the scrutinee’s value constructor, and based on what it sees, it branches to a new address. In a native-code compiler for a typed functional language like ML or Haskell, each value constructor \(C\) is represented by a small integer, and the compiler can generate an indirect branch that uses a jump table:
The compiler places a jump table (array of addresses) into read-only data.
Constructor \(C\)’s small integer is used as an index into the jump table, each entry of which contains the address of the code that handles the case for \(C\).
The compiler generates code that loads the jump-table entry into a machine register.
An indirect branch (“jump register”) transfers control to the address in the register.
It’s the same code that a C or C++ compiler generates for a switch
statement.
When the language does not have a static type system, computed goto
is not an option. Instead, the compiler generates a different species of jump table: a list of key-value pairs. Each key is a labeled constructor~\(C/n\), and each value is the address of the code that is meant to handle the case of that labeled constructor.
Our .vo
file format doesn’t include read-only or literal arrays, and I’d rather not change the .vo
format or the loader. So the jump table will be coded as a sequence of VM instructions:
A labeled constructor (key) is represented by a new
if-vcon-match
instruction. Opcodeif-vcon-match
expects an 8-bit arity in the \(X\) field and a literal index in the \(\mathit{YZ}\) field. It designates the labeled constructor literals[\(\mathit{YZ}\)]/\(X\).A target address (value) is represented by a
goto
instruction.
The computed goto
is performed by new instruction goto-vcon
.
Opcode
goto-vcon
expects the value of the scrutinee in the \(X\) register. In the \(Y\) field, it expects an unsigned 8-bit integer giving the number of entries in the jump table. The jump table must follow thegoto-vcon
instruction immediately.The
goto-vcon
instruction computes the labeled constructor associated with the scrutinee, then searches the jump table sequentially for the first matching key.2 If it finds a matching key, it jumps to the address coded by the corresponding value. Otherwise, execution continues with the instruction immediately after the jump table.
This representation has the advantage that the jump-table entries could, in principle, be interpreted by vmrun
as instructions. This approach is useful for debugging, but in your final SVM I expect you instead to do all the computation in the goto-vcon
instruction, so the entire operation takes only one trip through the main loop in vmrun
.
A labeled constructor is associated with a scrutinee according to these rules:
If the scrutinee is a block
b
, then the constructor name \(C\) isb[0]
and the arity \(n\) is one less then the number of slots in blockb
.If the scrutinee is a cons cell, then the constructor name \(C\) is
"cons"
and the arity \(n\) is 2.If the scrutinee is any other value \(v\), then the constructor name \(C\) is \(v\) and the arity \(n\) is zero.
Related forms in the compiler
Constructed data has new support in each of the compiler’s internal languages. Some related definitions are found in file case.sml
, in structures , , and . These definitions show that internally, there is no such thing as “a value constructor by itself.” Such a constructor is always represented as its application to an empty list of arguments.
The internal languages are extended as follows:
The source AST acquires forms
and .Unambiguous vScheme keeps
, but disambiguates to a form, which applies a value constructor to zero or more arguments.First-Order Scheme and its extension Closed Scheme retain the
and forms from Unambiguous Scheme.K-normal form acquires two forms that more closely resemble the VM instructions:
The
form carries a list of registers; when it evaluated, it allocates a fresh block with the same number of slots as registers. Each slot in the block is then initialized with the value of the corresponding register.During code generation, the
form is translated into amkblock
instruction, followed by a (possibly empty) sequence ofsetblockslot
instructions.The
form takes a scrutinee, a list of pairs each holding a labeled constructor and an expression, and finally a default expression to be executed if no labeled constructor matches. During code generation, each is translated into agoto-vcon
instruction followed by a jump table.
During K-normalization, each match compiler, which you will write.
is translated to a tree of forms. The translation is performed by aEmbedding into vScheme
Embedding eScheme into vScheme is gnarly.
Anywhere that eScheme would use a block of \(n\) slots, the embedding uses a list of \(n\) elements. To make embedding easy, I’ve added a
block
primitive function to vScheme. It takes any number of arguments and returns those arguments in a vScheme list.For readability, the case. But the embedded code won’t run.
forms in first-order Scheme embed asThe
form in KNF embeds as a call to primitiveblock
.The cond form. Each condition calls a function matches-vcon-arity?, which I have added to the vScheme interpreter as a predefined function. This embedding should run.
form in KNF embeds as a Scheme
Legacy values vs constructed data
In μML, lists and Booleans are not special. They are not supported by any primitives; they are instead predefined:
;; define type constructors and their associated value constructors
(data (* => *) list
['() : (forall ['a] (list 'a))]
[cons : (forall ['a] ('a (list 'a) -> (list 'a)))])
(data * bool
[#t : bool]
[#f : bool])
And their representations are not special; in the μML interpreter they are all just constructed data (CONVAL
):
datatype exp = ...
and value
= CONVAL of vcon * value list
| SYM of name
| NUM of int
| CLOSURE of lambda * value env ref
| PRIMITIVE of primop
withtype lambda = name list * exp
and primop = value list -> value
In eScheme, we want lists and Booleans to continue to use the special-purpose representations in the SVM. To achieve this end, we special case the code generation and K-normalization of these forms:
The application of an ordinary value constructor translates into a K-normal form expression that allocates and initializes a block (record). But the application of cons to two arguments translates into a call to SVM primitive
cons
.When not in function position, an occurrence of an ordinary value constructor translates into a symbol with the same name as the value constructor. But an occurrence of #t or #f translates into a Boolean literal, and an occurrence of ’() translates into the empty-list literal.
The translation of a case expression has to account for these changes. For example, to see if a scrutinee is an application of cons to two arguments, the translation won’t look for a
Block
with the stringcons
in the first slot—instead it will look for aConsCell
. And similarly for Booleans and the empty list.
Operational semantics
To give an operational semantics of the SVM’s new computed goto, I extend the VM state with a new component of the form \(v/k\), where \(v\) is a value associated with a value constructor, and \(k\) is the number of arguments to which the constructor is applied.3 The component \(v/k\) is updated by the GotoVcon
instruction, inspected by the IfVconMatch
instruction, and ignored by all other instructions.
Function \(\mathit{con}\) extracts a value constructor and arity from constructed data. If it is given a value that is not constructed data, it treats the value itself as the constructor, with arity 0:
\[\frac{\mbox{$b$ is a block with $n$ slots}} {\mathit{con}(b) = b[0]/n} \]
\[\frac{\mbox{$c$ is a cons cell}} {\mathit{con}(c) = \mathtt{cons}/2} \]
\[\frac{\mbox{$v$ is neither a cons cell nor a block}} {\mathit{con}(v) = v/0} \]
In the operational semantics, GotoVcon
instruction simply uses \(\mathit{con}\) to set \(v/k\). (In the implementation, the GotoVcon
instruction does this and also does all the work of the succeeding instructions.) \[\frac{\mathit{con}(\sigma(R(r))) = v'/k'}{
\begin{array}{l}
\langle I_1 \bullet \mathtt{GotoVcon}(r, n) \cdot I_2, R, L, G, S,
\sigma, v/k \rangle \rightarrow\\
\qquad \qquad \qquad \qquad
\langle I_1 \cdot \mathtt{GotoVcon}(r, n) \bullet I_2,
R, L, G, S, \sigma, v'/k'\rangle
\end{array}
}
\] The parameter \(n\) is ignored by the operational semantics, but the implementation requires that the GotoVcon
instruction be followed by \(n\) pairs of IfVconMatch
/Goto
instructions.
The IfVconMatch
instruction works very much like the If
instruction, except that it checks the current \(v/k\), not a value in a register. And because its work is done by GotoVcon
, it doesn’t need to be implemented. If \(v/k\) matches the constructor and arity specified by the IfVconMatch
instruction, evaluation continues with the succeeding instruction \(i\), which is always a Goto
: \[\frac{v = L(j) \land k = n}
{\langle I_1 \bullet \mathtt{IfVconMatch}(n, j) \cdot i \cdot I_2, R, L, G, \sigma, v/k\rangle
\rightarrow\\
\qquad \qquad \qquad \qquad
\langle I_1 \cdot \mathtt{IfVconMatch}(n, j) \bullet i \cdot I_2, R,
L, G, \sigma, v/k\rangle}
\]
If \(v/k\) does not match, evaluation skips \(i\) and continues with the instruction after \(i\): \[\frac{v \ne L(j) \lor k \ne n} {\langle I_1 \bullet \mathtt{IfVconMatch}(n, j) \cdot i \cdot I_2, R, L, G, \sigma, v/k\rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{IfVconMatch}(n, j) \cdot i \bullet I_2, R, L, G, \sigma, v/k\rangle} \]