Constructed Data

Introduction

Constructed data is a class of values that includes all the values of ML’s algebraic data types, as well as tuples and records.1 Lists are a special case of constructed data; cons abbreviates “construct.” Like any other class of values or types, constructed data is supported by introduction forms and elimination forms:

Constructed data is found in both typed and untyped languages:

Constructed data in source code: The eScheme language

To enable us to implement constructed data and pattern matching, I have defined a variation on vScheme that has Erlang-style pattern matching. The variation is called eScheme. Its concrete syntax is taken from \(\mu\)ML, which is described in chapter 8 of my book Programming Languages: Build, Prove, and Compare.

I have extended our parser to understand the new syntactic forms and to parse μML code.

Matching in eScheme differs from matching in Standard ML in one important way: in eScheme, every value constructor can be used at every arity. For example, all of the following expression evaluate to well-formed constructed data:

(SOME 3)            ;; the number 3, wrapped
SOME                ;; a bare constructor (not a function as in μML) 
(SOME 'Ramsey 106)  ;; sensible (not ill-typed as in μML )

Constructed data at run time

When a value constructor is applied to \(n\) arguments (\(n>0\)), it is represented at run time as a Block value with \(n+1\) slots. The first slot contains the name of the constructor, and the remaining slots contain the arguments, in order.

When a value constructor is used by itself, or is applied to no arguments, it is represented at run time by its name. No Block is allocated.

Pattern matching operates by checking not only the identity of the value constructor but also the number of arguments to which it is applied. In CS 106, this combination of value constructor and arity will be called a labeled constructor. (The label is the arity.) A labeled constructor is written \(C/n\), where \(C\) is the name of the value constructor and \(n\) is the number of arguments to which \(C\) is applied. As examples,

(SOME 3)            ;; the labeled constructor is SOME/1
SOME                ;; the labeled constructor is SOME/0
(SOME 'Ramsey 106)  ;; the labeled constructor is SOME/2

VM Instructions for working with constructed data

Constructed data also require support for introduction and elimination at run time.

Instructions for introducing constructed data

In the SVM, constructed data is introduced by getting a value intro a register. If the value is a bare value constructor itself, the UFT can generate a loadliteral instruction. If the value is meant to be a block that holds a value constructor and its arguments, I recommend instructions that are nearly identical to the instructions for working with closures:

Instructions for eliminating constructed data

The instruction for eliminating constructed data is a computed goto. The instruction examines the scrutinee’s value constructor, and based on what it sees, it branches to a new address. In a native-code compiler for a typed functional language like ML or Haskell, each value constructor \(C\) is represented by a small integer, and the compiler can generate an indirect branch that uses a jump table:

It’s the same code that a C or C++ compiler generates for a switch statement.

When the language does not have a static type system, computed goto is not an option. Instead, the compiler generates a different species of jump table: a list of key-value pairs. Each key is a labeled constructor~\(C/n\), and each value is the address of the code that is meant to handle the case of that labeled constructor.

Our .vo file format doesn’t include read-only or literal arrays, and I’d rather not change the .vo format or the loader. So the jump table will be coded as a sequence of VM instructions:

The computed goto is performed by new instruction goto-vcon.

This representation has the advantage that the jump-table entries could, in principle, be interpreted by vmrun as instructions. This approach is useful for debugging, but in your final SVM I expect you instead to do all the computation in the goto-vcon instruction, so the entire operation takes only one trip through the main loop in vmrun.

A labeled constructor is associated with a scrutinee according to these rules:

Related forms in the compiler

Constructed data has new support in each of the compiler’s internal languages. Some related definitions are found in file case.sml, in structures Pattern, Case, and Constructed. These definitions show that internally, there is no such thing as “a value constructor by itself.” Such a constructor is always represented as its application to an empty list of arguments.

The internal languages are extended as follows:

During K-normalization, each CASE is translated to a tree of SWITCH_VCON forms. The translation is performed by a match compiler, which you will write.

Embedding into vScheme

Embedding eScheme into vScheme is gnarly.

Legacy values vs constructed data

In μML, lists and Booleans are not special. They are not supported by any primitives; they are instead predefined:

;; define type constructors and their associated value constructors
(data (* => *) list
  ['()  : (forall ['a] (list 'a))]
  [cons : (forall ['a] ('a (list 'a) -> (list 'a)))])

(data * bool
  [#t : bool]
  [#f : bool])

And their representations are not special; in the μML interpreter they are all just constructed data (CONVAL):

datatype exp = ...
and value
  = CONVAL of vcon * value list
  | SYM    of name
  | NUM    of int
  | CLOSURE   of lambda * value env ref
  | PRIMITIVE of primop
 withtype lambda = name list * exp
      and primop = value list -> value

In eScheme, we want lists and Booleans to continue to use the special-purpose representations in the SVM. To achieve this end, we special case the code generation and K-normalization of these forms:

Operational semantics

To give an operational semantics of the SVM’s new computed goto, I extend the VM state with a new component of the form \(v/k\), where \(v\) is a value associated with a value constructor, and \(k\) is the number of arguments to which the constructor is applied.3 The component \(v/k\) is updated by the GotoVcon instruction, inspected by the IfVconMatch instruction, and ignored by all other instructions.

Function \(\mathit{con}\) extracts a value constructor and arity from constructed data. If it is given a value that is not constructed data, it treats the value itself as the constructor, with arity 0:

\[\frac{\mbox{$b$ is a block with $n$ slots}} {\mathit{con}(b) = b[0]/n} \]

\[\frac{\mbox{$c$ is a cons cell}} {\mathit{con}(c) = \mathtt{cons}/2} \]

\[\frac{\mbox{$v$ is neither a cons cell nor a block}} {\mathit{con}(v) = v/0} \]

In the operational semantics, GotoVcon instruction simply uses \(\mathit{con}\) to set \(v/k\). (In the implementation, the GotoVcon instruction does this and also does all the work of the succeeding instructions.) \[\frac{\mathit{con}(\sigma(R(r))) = v'/k'}{ \begin{array}{l} \langle I_1 \bullet \mathtt{GotoVcon}(r, n) \cdot I_2, R, L, G, S, \sigma, v/k \rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{GotoVcon}(r, n) \bullet I_2, R, L, G, S, \sigma, v'/k'\rangle \end{array} } \] The parameter \(n\) is ignored by the operational semantics, but the implementation requires that the GotoVcon instruction be followed by \(n\) pairs of IfVconMatch/Goto instructions.

The IfVconMatch instruction works very much like the If instruction, except that it checks the current \(v/k\), not a value in a register. And because its work is done by GotoVcon, it doesn’t need to be implemented. If \(v/k\) matches the constructor and arity specified by the IfVconMatch instruction, evaluation continues with the succeeding instruction \(i\), which is always a Goto: \[\frac{v = L(j) \land k = n} {\langle I_1 \bullet \mathtt{IfVconMatch}(n, j) \cdot i \cdot I_2, R, L, G, \sigma, v/k\rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{IfVconMatch}(n, j) \bullet i \cdot I_2, R, L, G, \sigma, v/k\rangle} \]

If \(v/k\) does not match, evaluation skips \(i\) and continues with the instruction after \(i\): \[\frac{v \ne L(j) \lor k \ne n} {\langle I_1 \bullet \mathtt{IfVconMatch}(n, j) \cdot i \cdot I_2, R, L, G, \sigma, v/k\rangle \rightarrow\\ \qquad \qquad \qquad \qquad \langle I_1 \cdot \mathtt{IfVconMatch}(n, j) \cdot i \bullet I_2, R, L, G, \sigma, v/k\rangle} \]


  1. In my book, I treat tuples and records as a special case of constructed data, and we’ll do the same here.

  2. In principle, the UFT could sort the keys and arrange for binary search here, but in the common case the resulting SVM could would most likely be slower. Binary search isn’t worth the effort.

  3. The notation comes from Prolog and Erlang.