Grammar for the specification language

<*>= [D->]
<terminal symbols>
# EBNF grammar for toolkit specification language

Step one is to initialize the globals

<toplevel>= (<-U) [D->]
global symtab           # table of top-level symbols
global globals          # top-level environment
global equivclasses     # list of equivalence classes of fields
global constructors     # table mapping constructor name to Stype
global conslist         # list of all constructors ever defined
global instructionctype # the type of instruction constructors
global bit_zero_is_lsb  # non-null if bits are numbered with 0 == least significant
global vanishing_latent_patlabel  # should vanish
global fieldname_literals, operands_and_ids, warned_literals # find conflicts
Defines bit_zero_is_lsb, conslist, constructors, equivclasses, fieldname_literals, globals, instructionctype, operands_and_ids, symtab, vanishing_latent_patlabel, warned_literals (links are to index).

<toplevel>+= (<-U) [<-D->]
procedure init_parser()
    every symtab | constructors := table()
    globals := [symtab]
    every unchecked_fields | guaranteed_fields := set()
    every fieldname_literals | operands_and_ids | warned_literals := set()
    every conslist | equivclasses := []
    instructionctype := constype("(anonymous constructor type)", set())
    bit_zero_is_lsb := 1
    vanishing_latent_patlabel := latent_patlabel()
Defines init_parser (links are to index).

Because of the strange way we allow users to specify field-name literals in field bindings, we want to make sure there's never a clash between the names used in fieldinfo statements and the names used in operands and equations.

<note new field name ii1>= (U->)
if member(operands_and_ids, ii1) then {
  <warn about clash in ii1>
} else
  insert(fieldname_literals, ii1)
<note new operand or id ii1>= (U-> U->)
if member(fieldname_literals, ii1) then {
  <warn about clash in ii1>
} else
  insert(operands_and_ids, ii1)
<warn about clash in ii1>= (<-U <-U)
every (if not member(warned_literals, ii1) then warning else verbose)(
  ii1 || " is used as a field-name literal and an operand or id..." |
  "  the literal takes priority in field bindings"
insert(warned_literals, ii1)

conslist contains a list of all constructors ever defined, even those that were later removed from the constructors table. Clients that use conslist are therefore obligated to check that constructors[] === x before using a constructor x found in conslist. Similar checking is required before using the members of a constructor type. The easy way to do this is to use the kept_constructors generator, which provides the retained constructors of a given type, or the retained untyped constructors if no type is given.

<toplevel>+= (<-U) [<-D->]
procedure kept_constructors(constype)
  local thelist 
  thelist := (\constype).members | conslist
  suspend 1(c := !thelist, constructors[] === c)
Defines kept_constructors (links are to index).

Two parsers are generated from this grammar: that for a specification, and that for a code file containing pattern-matching statements. No client ever uses Parsers directly; they use P_Spec or P_CodeFile, but the union makes it possible for both parsers to handle end-of-file properly. The silly tokens avoid parsing conflicts. (Incidentally, the lexer for a CodeFile starts out in a different state, but that shouldn't matter to the parser generator.)

<grammar>= (<-U) [D->]
Parsers   : "bogus spec marker" Spec | "bogus code marker" CodeFile;
Defines Parsers (links are to index).

The parts of a specification may appear in any order.

<grammar>+= (<-U) [<-D->]
Spec      : { Fieldspec | Patterns | Constructors | Placeholder 
            | FetchSpec | PCSpec   | RelocSpec    | AsmSpec 
            | BitSpec   
            } ;
Defines Spec (links are to index).

<refman: spec>=
The start symbol for the grammar describing a specification is
Definitions must appear before uses, but otherwise
the parts of a specification may appear in any order.
The grammar for a complete specification is therefore
<grammar>+= (<-U) [<-D->]
PCSpec : "pc_unit_bits" INT /* if ii2 > 0 then pc_unit_bits := ii2 
                               else error("pc_unit_bits must be positive") */;
<refman: spec: misc>= [D->]
\nextsection{Addressing units for the program counter}
The specification\indexlit{pc_unit_bits}
\lit{pc\_unit\_bits} \nt{width}
says that the program counter measures units of \nt{width} bits.
For example, if \verb+pc_unit_bits+ is 8 (the default), a 16-bit token
in a sequence causes the program counter to be advanced by~2.
The mechanism of the advance is the second argument to \verb+emit+
(when encoding) or the offset passed to the template for address
arithmetic (when decoding).
We also use these units when converting labels to integers. (We use the
difference, in increments of \verb+pc_unit_bits+, between the label
and the program counter).
The size of any token that appears in a sequence must be exactly
divisible by \verb+pc_unit_bits+.

Fields and field information

Fields declared in the same fields declaration are members of the same equivalence class. Fields in the same equivalence class share the same shape, i.e., the width of the unit in which they are defined. Constraints on fields with different classes may not be combined with &, even if the classes have the same shape. Equivalence classes are useful for distinguishing CISC instruction fragments that must be combined to form a complete instruction.

Internally, the toolkit uses little-endian bit numbering---that is, bit 0 is always the least significant bit.

<grammar>+= (<-U) [<-D->]
Fieldspec : "fields" "of" Ident "(" INT ")" 
            { Ident INT [":" INT /*ii2*/]  /* newfield(ii1, ii2, (\ii3 | ii2)+1) */ }
            /* if ii5 % 8 ~= 0 then
                  error("element size ", ii5, " is not a multiple of 8 bits.")
               <if needed, flip bit numbering of fields ii7 using size ii5>
               (/symtab[ii3] := equivclass(ii3, ii7, ii5)) | 
                         deferror(type(symtab[ii3]) || " ", image(ii3))
               put(equivclasses, symtab[ii3])
               every (!ii7).class := symtab[ii3] */
Defines Fieldspec (links are to index).

We now permit one-integer specifications of one-bit fields, by analogy with one-integer specifications of one-bit slices.

<toplevel>+= (<-U) [<-D->]
procedure newfield(name, lo, hi)
  return (/symtab[name] := field(name, lo, hi)) | deferror("Field", image(name))
Defines newfield (links are to index).

To number bits, we need both to change numbering and exchange. I had tried to do both in two assignments, but that's a botch.

<if needed, flip bit numbering of fields ii7 using size ii5>= (U->)
if /bit_zero_is_lsb then {
  if !ii7 then bit_numbering_used := 1
  every f := !ii7 do {
    f.hi := ii5 - f.hi
    f.lo := ii5 - f.lo
    f.lo :=: f.hi
<refman: spec: fields>= [D->]
The specification\indexlit{fields}
\lit{fields of} \term{token-name} \lit( \term{width} \lit) \nt{field-specs}
\sequence{\term{field-name} \term{low-bit}:\term{high-bit}}
defines a token named \term{token-name} that is \term{width} bits
wide, as well as all the fields that may be used to refer to parts of
that token.
Bits are numbered from~0.
By default, bit~0 is the least significant bit, but the significance of bit~0 
can be changed with the \lit{bit 0} directive (Section~\ref{section:bit-0}).

Bit numbering

<grammar>+= (<-U) [<-D->]
BitSpec      : "bit" Zero "is" Significance Significant
                 /* xxx := if ii4 == "least" then 1 else &null 
                    if xxx ~=== bit_zero_is_lsb then 
                      <possible warnings around changes in bit numberings>
                    bit_zero_is_lsb := xxx
                    bit_numbering_set := 1
Zero         : INT /* ii1 = 0 | error("expected `0'") */;
Significance : Ident 
                /* ii1 == ("most"|"least") | error("expected `most' or `least'") */;
Significant  : Ident /* ii1 == "significant" | error("expected `significant'") */;
<*>+= [<-D->]
global bit_numbering_set, bit_numbering_used
Defines bit_numbering_set, bit_numbering_used (links are to index).

<possible warnings around changes in bit numberings>= (<-U)
if \bit_numbering_set then
  warning("Multiple inconsistent bit numberings --- hope you really mean it!")
if \bit_numbering_used then
  warning("You change the bit numbering, but you've already used the old numbering" ||
          "... seems like a crazy idea, but I'll do it")
<refman: spec: misc>+= [<-D->]
\nextsection{Bit numbering}
By default, bits used to specify fields and bit-slices are numbered with bit~0
as the least significant bit.
This choice can be awkward for machines that number the bits starting with
bit~0 as the most significant bit.
The \lit{bit 0} directive sets the bit numbering:
\lit{bit 0 is} \alternate{\lit{most} | \lit{least}} \lit{significant}
This declaration changes the bit numbering used to interpret 
any \lit{fields} declarations and bit slices that follow it.
We recommend against changing bit numberings in mid-specification.

Only the default bit numbering has been tested.


Placeholders are emitted for instructions that refer to unresolved relocatable addresses. They are associated with the equivalence class of the fields used in the instruction. A placeholder is defined by a pattern with the same length (but not necessarily the same shape) as the equivalence class for which it is a placeholder. By construction, a placeholder can't refer to constructors that might use it, because no such constructor can have been defined yet. In other words, the definitions of placeholders must precede the definition of an instruction that outputs a pattern with fields in the associated equivalence classes.
<grammar>+= (<-U) [<-D->]
Placeholder : "placeholder" "for" Ident "is" Pattern 
              /* class := lookuptype(ii3, "equivclass")
                 (/class.holder := pnf(ii5, globals)) |
                 error("Placeholder for ", ii3, " is already defined") 
                 <make sure placeholder of class is the right length>
Defines Placeholder (links are to index).

<make sure placeholder of class is the right length>= (<-U)
if pattern_length(class.holder) ~= class.size then
  error("Length of placeholder `", patimage(class.holder), 
    "' \nfor ",, " does not match class size ", class.size)
<refman: spec: misc>+= [<-D->]
The specification\indexlit{placeholder}
\lit{placeholder for} \term{token-name} \lit{is} \nt{pattern}
Tells the toolkit's generator to use \nt{pattern} when it needs to
emit a placeholder for the named token.
Placeholders are emitted whenever a relocatable address is
needed but not yet known.

fieldinfo associates name, value pairs with one or more fields. The values of checked(unchecked) fields are (are not) checked at run time. A sparse binding of field names is used to specify that only a small set of values of the field are meaningful. The bindings associate a string name with a field value. The names construct binds string names to each possible value of a field. In both cases we return a table mapping names to values.

<grammar>+= (<-U) [<-D->]
Fieldspec         : "fieldinfo" { Fieldinfo } ;
Fieldinfo         : IdentBinding "is" "[" { Fielditem } "]" 
                    /*  every fieldinfo(lookuptype(!ii1, "field"), !ii4) */;
IdentBinding      : Ident /* [ii1] */ | "[" { Ident } "]";
Fielditem         : Ident  | SparseFieldNames | FieldNameList ;
SparseFieldNames  : "sparse" "[" FieldNameBindings "]" /* sparse_name_table(ii3) */ ;
FieldNameBindings : FieldNameBinding {"," FieldNameBinding}  /* push(ii2, ii1) */;
FieldNameBinding  : FieldName "=" Integer /* enumbinding(ii1, ii3) */ ;
FieldNameList     : "names" "[" {FieldName} "]" /* full_name_table(ii3) */ ;
FieldName         : (String | Ident) /* {<note new field name ii1>}; ii1 */;
Defines Bindlist, Fieldinfo, Fielditem, FieldName, FieldNameBindings, FieldNameList, IdentBinding (links are to index).

We reserve the following similar syntax for places where we aren't referring to field names:

<grammar>+= (<-U) [<-D->]
SparseNames : "sparse" "[" Bindings "]" /* sparse_name_table(ii3) */ ;
Bindings    : Binding {"," Binding}  /* push(ii2, ii1) */;
Binding     : (String|Ident) "=" Integer /* enumbinding(ii1, ii3) */ ;
DenseNames  : "names" "[" {String|Ident} "]" /* full_name_table(ii3) */ ;
NameTable   : SparseNames | DenseNames;

<refman: spec: fields>+= [<-D]

A \lit{fieldinfo}\indexlit{fieldinfo} specification gives auxiliary
information about a field or list of fields.
(It is useful to describe several fields simultaneously
when, for example, they all refer to the
same set of registers.)
Such auxiliary information might include special names used to refer
to values of the fields (as when they describe data formats or
registers) or information about what kind of checking should be done
when an application supplies a putative value for a field.

The syntax of \lit{fieldinfo} specifications is:
\lit{fieldinfo} \sequence{\nt{fieldinfo-spec}}
\begin{production} {fieldinfo-spec} 
  \nt{field-specifier} \lit{is} \lit[ \sequence{\nt{field-item}} \lit]
\begin{production} {field-specifier}
\term{field-name} \vbar\ \lit[ \sequence{\term{field-name}} \lit]
\begin{production} {field-item}
|       \lit{unchecked}
|       \lit{guaranteed}
|         \lit{sparse [} \nt{binding} \sequence{\lit, \nt{binding}}
|         \lit{names [} \sequence{\term{Ident} | \term{String}} \lit]
\begin{production} {binding}
 \alternate{\term{Ident} | \term{String}} \lit= \term{integer}
The terms \lit{checked}, \lit{unchecked}, and \lit{guaranteed} denote
three possible levels of checking in generated encoding procedures:
If $f$ is a checked field, the encoding procedure checks to make sure
its value falls within range.  If not, it calls an error handler.\indexlit{checked}
If $f$ is an unchecked field, the encoding procedure does not check its
value, but it does mask out high bits, ensuring that other parts of
the token can't be corrupted by a bad value of~$f$.\indexlit{unchecked}
If $f$ is a guaranteed field, the encoding procedure simply uses its
value without any checking or masking at all.%
\footnote{Signed field parameters must always be masked, but most
field parameters are unsigned.}
This option is useful mainly for situations in which $f$ denotes a
register, and the register allocator promises never to use a bogus

\lit{sparse}\indexlit{sparse} and \lit{names}\indexlit{names} are both
ways of specifying names of fields.
\lit{names} is used when all the names are known; 
\lit{sparse} is used otherwise.
Identifiers can be used to specify names on most targets, but names containing
special characters can only be specified using string literals.
Note that the toolkit accepts the C literal syntax for characters in
place of integer literals; this property may be useful when writing
\lit{sparse} bindings.
The use of commas to separate lists of bindings is not
consistent with our other representations of lists, but they look
funny without the commas.

I lump the word size of the target machine into the field specs, not because it makes all that much sense, but because I didn't know where else to put it.

<grammar>+= (<-U) [<-D->]
Fieldspec : "wordsize" INT /* wordsize := ii2 */;
<toplevel>+= (<-U) [<-D->]
global wordsize         # word size of the machine the application runs on
Defines wordsize (links are to index).

<initialization>= (<-U) [D->]
wordsize := 32
<refman: spec: misc>+= [<-D]
\nextsection{Word size}
There are a few situations in which we use the word size of the
machine for which we are generating code---mostly to see whether
certain values will fit.
It defaults to 32~bits, but it can be specified by
\lit{wordsize}\indexlit{wordsize} \term{width}
where \term{width} is the width in bits.

Pattern definitions

Patterns parse into abstract syntax; patbinding makes them concrete and binds them to identifiers.
<toplevel>+= (<-U) [<-D->]
global patlhs           # hold lhs for later action
Defines patlhs (links are to index).

<grammar>+= (<-U) [<-D->]
Patterns  : "patterns" { PatBinding };
PatBinding: (Ident /* patlhs := ii1 */) "is" BindingRHS
          | "[" {Ident} "]" "is" Pattern /* patbinding(ii2, ii5) */
BindingRHS: Pattern /* patbinding(patlhs, ii1) */
          | "any" "of" "[" {Ident} "]" "," "which" "is" Pattern
              /* patbinding(copy(ii4), ii9)
                 l := []
                 every put(l, Pident("_" ~== !ii4))
                 patbinding(patlhs, Por(l)) */
<refman: spec: pattern bindings>=
Patterns can be named using pattern-binding
declarations:\index{pattern bindings}
\begin{production}{spec}\lit{patterns} \sequence{\nt{pattern-binding}}\end{production}
  \term{pattern-name} \lit{is} \nt{pattern}
| \lit[ \sequence{\term{pattern-name}} \lit] \lit{is} \nt{pattern}
| \term{pattern-name} \lit{is any of [} \sequence{\term{pattern-name}}
                                \rlap{\lit{], which is} \nt{pattern}}
The first form names a single pattern.

The second,  most common form binds a list of names on the left to a
list of patterns on the right.
The number of names must match the number of patterns; the special
name ``\verb+_+'' can be used to match uninteresting patterns on the
This form requires the use of generating expressions (or an explicit list of
patterns) to produce a list of patterns.
It is typically used to create opcode tables.
In this situation, the name ``\verb+_+'' simply indicates an unused opcode.

The third form extends the second to handle a common idiom.
The sequence of pattern names and the pattern on the far right act
just as they do in the second form, but in addition, the single name
on the left is bound to the disjunction of all the patterns on the
right that are bound to names other than ``\verb+_+''.
Examples can be found in the SPARC and 486 specifications given by

The name to which a pattern is bound remains associated with that
pattern, and that name may be used in the formation of opcode names.
When a disjunction is bound, the individual disjuncts retain their
names, if any.
This behavior can be exploited when writing 
constructor specifications, in which 
disjuncts are enumerated when patterns appear in opcodes.

Parsing labels relies on a grotesque hack to keep things LL(1); we allow an arbitrary sequent to precede colon, then patch things up semantically afterwards.

<grammar>+= (<-U) [<-D->]
Pattern   : Disjunct      { "|" Disjunct }      /* Por(push(ii2, ii1))   */ ;
Disjunct  : Sequent       { ";" Sequent 
                          | ":" Sequent /* colon_mark(ii2) */
                          }      /* Pseq(colons_to_labels(push(ii2, ii1)))  */ ;
Sequent   : Conjunct      { "&" Conjunct }      /* Pand(push(ii2, ii1))  */ ;

Conjunct  : ["..."] DotsR    /* if \ii1 then Pseq([dots_pattern(),ii2]) else ii2 */;
DotsR     : Atomic ["..."]   /* if \ii2 then Pseq([ii1,dots_pattern()]) else ii1 */ ;

Atomic    : (Ident /* atomicid := ii1 */)
            (                                    /* Pident(atomicid) */
            | Relop (Generator | Expr)           /* Pcon(atomicid, ii1, ii2) */
            | {"^" Opname} "(" ConstructorArgs ")" 
                  /* Papp(push(ii1, \symtab[atomicid] | atomicid), ii3) */
            ) /* ii2 */
          | String {"^" Opname} "(" ConstructorArgs ")"
                  /* Papp(push(ii2, ii1), ii4) */
          | "(" Pattern ")"
          | "[" {Ident /* Pident(ii1) */} "]"   /* Plist(ii2) */
ConstructorArgs : AppExpr {"," AppExpr} /* push(ii2, ii1) */
                | /* [] */

We used to ``explode'' compound constructor names at this syntactic step, but this turned out to be not what we wanted on the RHS of constructor definitions. For example, in

pattern load is ld | lb
constructor load^"2" a,b,c,d is load(a,b); load(c,d)
we cannot evaluate Papp(name, args) for the loads on the right-hand side until we have an environment telling use whether load stands for ld or lb in this particular context. (The original semantics made it stand for ld|lb, which we decided was rarely what anyone wanted.) Here's the colon fix:
<toplevel>+= (<-U) [<-D->]
record colon_mark(x)

procedure colons_to_labels(l)
  every i := 1 to *l & type(l[i]) == "colon_mark" do {
    l[i] := l[i].x  # strip out tag
    (l[i-1] := undo_identifier_syntax(l[i-1])) |
      error("Colon must be preceded by an identifier")
  return l

procedure undo_identifier_syntax(seq)
  if type(seq) == "Pand" & *seq.patterns = 1 & con := seq.patterns[1] &
     type(con) == "Pident"
    return Plabel(
Defines colon_mark, colons_to_labels, undo_identifier_syntax (links are to index).

atomicid is an auxiliary variable.

<toplevel>+= (<-U) [<-D->]
global atomicid
Defines atomicid (links are to index).

The following records represent the nodes in a pattern's abstract syntax tree.

<toplevel>+= (<-U) [<-D->]
record Por (patterns)                   # disjunction
record Pseq(patterns)                   # sequence
record Pand(patterns)                   # conjunction
record Pcon(name, relop, value)         # constraint on field
record Pident(name)                     # identifier standing for a pattern
record Plabel(name)                     # pattern label
record Papp(cons, args)                 # constructor applied to arguments
                                        #   (args are AppExprs)
record Plist(patterns)                  # list of patterns in square brackets
Defines Pand, Papp, Pcon, Pident, Plabel, Plist, Por, Pseq (links are to index).

<refman: spec: patterns>=
Patterns can be defined both in terms of themselves or in terms of simpler elements.
We discuss the base cases of the recursion first.

\section{Simple patterns}
Simple patterns can be described by any of the
following syntaxes:\index{patterns!simple} 
        \term{name} of \rlap{pattern, field, or constructor type}
| \nt{opcode} \lit( \nt{arguments} \lit)&
        Constructor application
| \nt{field-binding}&
        Binds field to variable
| \nt{constraint}&      
        Constrains field
<refman: other atomic patterns>
| \lit[ \sequence{name} \lit]&  
        List of patterns
\term{field-name} \lit= \nt{expr}
\term{field-name} \nt{relational-operator} 
                \rlap{\alternate{\term{integer} | \nt{generating-expression}}}
\lit{<} \vbar{} \lit{<=} \vbar{} \lit{=} \vbar{} \lit{!=} \vbar{}
\lit{>} \vbar{} \lit{>=} 
Unless a list of pattern names or a generating expression is used, the pattern
denotes a single pattern; otherwise it denotes a list.
Lists are ordinarily used to describe opcode tables.
The \nt{arguments} in a constructor application may include expressions, 
literal strings (which must be field names), or other constructor applications.
Section \ref{section:expr} describes the syntax of expressions (\nt{expr}).
Expressions in a \nt{field-binding} may contain free variables.
Such variables may be operands,  or they may 
be computed by the solver.

When an identifier is used as a pattern, its meaning depends on
Pattern names defined with the \lit{patterns} pattern-binding statement can
always be used in later patterns.
On the right-hand side of a constructor
definition, the names of operands that represent fields or constructor
types may also be used as patterns.
The name of a field operand~\lit{f} is short for
\lit{f =} {\em the value of the operand \lit{f}}.
Similarly, the name of a constructor-type operand~\lit T is short for
its value, which must be a pattern because the result of applying a
constructor is always a pattern.

<refman: constructor applications>

<refman: spec: generating expressions>

\section{Combining patterns}

The recursive part of the definition of patterns makes it possible to
combine patterns using the following
operators:\index{patterns!complex}\index{pattern operators}
\nt{pattern} \lit{...}&                Loosens rules restricting conjunction.
| \lit{...} \nt{pattern}&              Loosens rules restricting conjunction.
| \nt{pattern} \lit{\&} \nt{pattern}&  Conjunction.
| \nt{pattern} \lit; \nt{pattern}&     Sequence.
| \nt{pattern} \litbar\ \nt{pattern}&     Disjunction.
Operators listed first have higher precedence.

\section{Pattern labels}

One can refer to a  location within a pattern by using a \emph{pattern label}.
The syntax is
\term{label-name} \nt{pattern}
The appearance of \term{label-name} is a binding instance; its scope
is the \indexedlit{patterns} declaration or the constructor branch in
which it appears.
Within that scope, it refers to the location in the instruction stream
immediately preceding the pattern it labels.
That location can be converted to an integer in the same way as a
relocatable address.
Labelling has the same precedence as the sequencing operator, so, for
example, to refer to the location at the end of an instruction, one
can use the idiom
\nt{pattern}\lit{; L: epsilon}

<refman: description of pattern operators>

The name of an applied constructor depends on the environment in which it is evaluated. We use explode_names to generate all possible constructor names denoted by a constructor application, then generate the applications.

<toplevel>+= (<-U) [<-D->]
procedure explode_apps(opcode, args, rho)
  l := []
  every c := explode_names(opcode, rho) do {
    put(l, Papp(cons_named(c), args))
  return if *l = 1 then l[1] else Por(l)
Defines explode_apps (links are to index).

<refman: constructor applications>= (<-U)
The \nt{opcode} used in a constructor application may denote a single
constructor or a group of constructors with identical operands.
It works the same way in a constructor specification, as described in
Strictly speaking, applying a constructor to create a pattern is not
``simple,'' because patterns themselves are used to define
It is actually part of a mutual recursion between the definitions of
patterns and constructors.

<grammar>+= (<-U) [<-D->]
Relop     : "=" | "!=" | "<=" | ">=" | "<" | ">" ;
Generator : "{" Integer "to" Integer [ "columns" INT /* ii2 */ ] "}"  
                                /* Gfor(ii2, ii4+1, \ii5 | 1) */
          | "[" { Integer } "]"     /* Glist(ii2) */
<refman: spec: generating expressions>= (<-U)
A {\em generating expression}\index{generating expression} can be used
instead of an integer in field constraints.
The resulting constraint denotes a list of patterns, one for each
value of the generating expression.
There are three forms of generating expression:
\begin{production} {generating-expression}
  \lit\lbr{} \term{lo} \lit{to} \term{hi} \lit\rbr
| \lit\lbr{} \term{lo} \lit{to} \term{hi} \lit{columns} \term{n} \lit\rbr
| \lit[ \sequence{\term{integer}} \lit]
The first form generates the integers from \term{lo} to \term{hi}
inclusive, in order.
The second form generates the same integers, but in an order suited to
an \term{n}-column layout, e.g. $\term{lo}, \term{lo}+\term{n},
\term{lo}+2\term{n}, \ldots$, wrapping around as needed.
The third form simply lists the integers generated; it is suitable for
sparse tables.

There are a couple of special patterns. These alone produce patterns directly, not an AST. (pnf, for example, isn't prepared to cope with a pattern with an empty sequence).

<grammar>+= (<-U) [<-D->]
Atomic : "epsilon"   /* epsilon() */
       | "some" Ident /* wildcard(lookuptype(ii2, "equivclass")) */
<refman: other atomic patterns>= (<-U)
| \lit{epsilon}\indexlit{epsilon}&The empty sequence.
| \lit{some}\indexlit{some} \term{token-name}&
        Matches a single token \rlap{of the class named.}

Pattern syntax

A generator is represented by a Glist. The list of values is either given explicitly or generated on the spot by Gfor. Note that hi is one more than the largest value generated.
<toplevel>+= (<-U) [<-D->]
record Glist(values)                    
procedure Gfor(lo, hi, cols)
    local l, r
    l := []
    r := (hi - lo) / cols
    every put(l, lo + (0 to r - 1) + r * (0 to cols - 1))
    return Glist(l)
Defines Gfor, Glist (links are to index).

Binding a pattern requires putting it in normal form. If there is a generator in the pattern, the normal form is a list. In that case, id must also be a list.

<toplevel>+= (<-U) [<-D->]
procedure patbinding(id, ast)
    local p
    <verbosely announce binding>
    p := pnf(ast, globals)
    <insist p have no conditionals nor field bindings>
    case type(id) || "," || type(p) of {
        "string,pattern" : patbind(id, p, globals)
        "list,list"      : if *id = *p then
                             while patbind(get(id), get(p), globals)
                             error("identifier list/generator length mismatch: ", 
                                   *id, " vs ", *p)
        "string,list"    : error("generator must be bound to an identifier list")
        "list,pattern"   : error("identifier list must be bound to a generator")
        default          : impossible("pattern binding")
Defines patbinding (links are to index).

<insist p have no conditionals nor field bindings>= (<-U)
case type(p) of {
  "pattern" : insist_global_pattern(p)
  "list"    : every insist_global_pattern(!p)
<verbosely announce binding>= (<-U)
case type(id) of {
  "string" : verbose("Pattern ", id)
  "list"   : verbose(*id, " patterns")
<refman: constructors>=
The constructor specifications are the most complicated in
the toolkit; they can use compound opcodes, equations,
assembly-language syntax, and conditional assembly.
Constructor specifications are the only ones in which newlines are
We make newlines significant so that we can use them in the very
common case in which the output pattern of the constructor is implicit.

<refman: constructor-spec syntax>

<refman: opcodes and operands>

<refman: branches of constructors>


A constructor consists of an Opcode, one or more Operands, an optional constructor type ConstType, and one or more Branches. The default type of a constructor is instruction. One-branch syntax is radically different from multibranch syntax and I haven't the time to explain the details now. Each branch consists of a set of Equations followed by a required output Pattern. Branches are evaluated in order: the constructor's value is the output pattern of the first branch whose constraints are satisfied.
<grammar>+= (<-U) [<-D->]
Constructors : "constructors" (/*see_newline()*/) { Constructor /* see_newline() */} ;
Constructor  : Opcode Operands [ ":" ConstType ] NLBranches
                 /* note_constructor(ii1, ii2, \ii3 | instructionctype, ii4) */ 
Operands : SeeWhite {Operand} StopWhite /* process_operands(ii2) */;

To process operands, we combine adjacent literals and remove trailing white space.

<*>+= [<-D->]
procedure process_operands(ops)
  if type(ops[-1]) == "literal" & ops[-1].s ? (white(), pos(0)) then
    pull(ops) # discard trailing white space
  l := []
  every x := !ops do
    if type(x) == "literal" & type(l[-1]) == "literal" then
      l[-1].s ||:= x.s
      put(l, x)
  return l
Defines process_operands (links are to index).

<refman: constructor-spec syntax>= (<-U)
The syntax of constructor specifications is\indexlit{constructors}
\lit{constructors} \sequence{\nt{constructor}}
  \nt{opcode} \sequence{\nt{operand}} \optional{ \lit: \term{type-name} } $\star$
Because both type and branches are optional, we use the newline to
separate the last operand of the current constructor from the opcode
of the following constructor.
The newline is significant only in the position marked~$\star$, and
only the first newline is significant.  It is OK to include the
newline even when explicit branches are present.

<refman: constructor types>
<grammar>+= (<-U) [<-D->]
NLBranches   : Branches
             | NEWLINE (Branches | /* [ [ [], &null ] ] */)
Branches     : SingleBranch /* [ii1] */
             | WhenBranch {WhenBranch | OtherwiseBranch}
                /* push(ii2, ii1); put(ii2, \ii3); ii2 */
SingleBranch : "{" Equations "}" [ "is" Pattern ] /* [ ii2, \ii4 | &null] */ 
             | "is" Pattern    /* [ [] , ii2 ] */ 
WhenBranch      : "when" "{" Equations "}" "is" Pattern /* [ii3, ii6] */;
OtherwiseBranch : "otherwise" "is" Pattern /* [[], ii3] */;
<refman: branches of constructors>= (<-U)
The branches given with a constructor determine its output pattern.
Most commonly, each constructor produces a unique sequence of tokens
when applied, so its right-hand side is specified by one pattern
containing a single disjunct:
  \optional{\lit\lbr \nt{equations} \lit\rbr} 
  \optional{\lit{is} \nt{pattern}}
The equations (Section~\ref{sec:equations}) are solved to produce a
set of {\em conditions}, which are predicates sufficient to guarantee
that the
equations have a solution, and to produce {\em bindings} for
the variables solved for.
When encoding,
the field- and integer-valued operands to the constructor provide the
inputs to the equation solver; when decoding, it is the variables
bound to fields in \nt{pattern} which perform that function.
Equations can be omitted if not needed.

The output pattern, which describes the binary representation 
of the constructor,
can also be omitted, in which case
all the field names, pattern names, and
constructor-type names in the opcode and operands are conjoined to
produce an {\em implicit} output pattern.
Omitting both equations and output pattern 
yields a very concise constructor specification; 
we use this idiom heavily in our MIPS and SPARC descriptions.

The toolkit checks a constructor's output pattern for 
It generates an error if a field in a token is overconstrained
or if fields overlap.
For an untyped constructor, it
generates a warning if there are
unspecified bits in the constructor's output pattern,
and it initializes unspecified bits to zero.

Sometimes we want to choose an output pattern based on some properties
of the operands (conditional assembly).
In that case, we can give multiple branches, each with its own set of
  \sequence{\lit{when} \lit\lbr{} \nt{equations} \lit\rbr{} \lit{is} \nt{pattern} |
            \lit{otherwise is} \nt{pattern}}
The sequence of \lit{when} branches may not be empty, 
and output patterns cannot be omitted when multiple branches are used.

When using this constructor in encoding, the toolkit chooses the first
branch whose conditions are known to be satisfied.
This conservative approach means that the toolkit can always produce
correct code in a single pass, but that, for example, branches to
unknown labels are always encoded in the most general way.
It would be pleasant if we got around to incorporating standard
multi-pass algorithms for conditional assembly, but don't look for it
<grammar>+= (<-U) [<-D->]
Opcode    : Opname { "^" Opname }  /* push(ii2, ii1) */ ;
Opname    : Ident /* \symtab[ii1] | ii1 */ | String ;
<refman: opcodes and operands>= (<-U) [D->]
The opcode and operand notations used in constructor specifications
are designed to enable concise specifications that resemble assembly
In fact, the toolkit uses these notations to infer an
assembly-language syntax for each instruction, and the toolkit can
generate encoding procedures which emit that assembly language.
As noted above, the constructor specification begins with opcode and
The list of operands is terminated by the first newline, or by any of
the tokens that introduce later constructs.%
\footnote{Computers are better than I at computing follow sets, but
certainly the colon that introduces a constructor type, the brace that
introduces equations, and the keyword \lit{is} all terminate lists of

Opcodes are formed from three kinds of units: literal strings, pattern
names, and field names.
Literal strings simply contribute to the opcode name, but patterns and
fields introduce an implicit iteration: patterns iterate over
disjuncts, and fields iterate over named values.
Each iteration results in the specification of a separate constructor.
For a pattern, the name of the disjunct is substituted for
the pattern name in the opcode, and the disjunct is used to stand for
the pattern on the right-hand side.
\bogon{At the moment, that value isn't used properly in equations.
This is an outstanding bug; the value should be equivalent to an
integer literal in the equations.}
For a field, the name of the field's value%
\footnote{As given by  \lit{names} or \lit{sparse} in a \lit{fieldinfo}
is substituted for the 
the field name on the opcode, and the pattern binding the field to
that value is used to stand for
the field on the right-hand side.
Any pattern so introduced on the right-hand side is conjoined into the
{\em implicit output pattern}, which is the one used in the
constructor specification when branches are omitted.
For examples, see \authorcite{ramsey:jersey}.
The basic idea is for any pattern or field on the left to be re-used
on the right --- typically it will be conjoined in, in which case it
is used implicitly.

The elements of an opcode
are joined with a hat (\lit{\char`\^}).
The syntax for an opcode is
\nt{opname} \sequence{\lit{\char`\^} \nt{opname}}
\term{string} \vbar{}  \term{pattern-name} \vbar{} \term{field-name}
Most opcodes use just a single name.
It's permissible to use an unbound identifier as an \nt{opname},
which has the same effect as a literal string.%
\footnote{This is hokey language
design, but it makes the specifications look nice.
It's unlikely to lead to errors, because a misspelling, which would
result in an unbound identifier, would also result in that
identifier's being unbound on the right-hand side, which would be
flagged as an error.
(We don't expect users to write specifications in which pattern or
field names are used on the left without also being used on the right,
and we issue a warning message if they do.)}

<grammar>+= (<-U) [<-D->]
Operand      : Ident ["!"] /* {<note new operand or id ii1>};
                              name_to_input(ii1, ii2) */
             | (Literal | GlobOperator | White) /* literal(ii1) */
Literal      : String | Integer /* string(ii1) */ | Relop | "=>" 
             | "[" | "]" | "(" | ")" | "+" | "-" | "/" 
             | "&" | "@" | "#" | "%" | ";" | "|" 
GlobOperator : "*" | "$" | ","
<refman: opcodes and operands>+= (<-U) [<-D]
Our operand syntax is unusual in that we permit arbitrary ``noise
words'' to enable the specification writer to mimic assembly-language
We call such noise words \nt{literal}s.
We use literals to help create encoding procedures that emit
assembly language.
For binary encoding and decoding, literals are ignored completely.
We permit any integer or quoted string, plus a variety of special
characters that are common in assembly languages but don't conflict
with our usage in specifications:
\term{string} \vbar{} \term{integer}
| any of these characters: \verb^<>=[]()+-/&@#%;*$,^\litbar

<refman: constructor operands by name>

A ConstType is defined implicitly by the first declaration of a constructor of that type.

<grammar>+= (<-U) [<-D->]
ConstType : Ident /* (/symtab[ii1] := constype(ii1, set())) | 
                     lookuptype(ii1, "constype")  */;
Defines ConstType (links are to index).

<refman: constructor types>= (<-U)
Constructors specified by the toolkit are {\em typed} or {\em untyped\/},
depending on whether a \term{type-name} appears.
All constructors are applied to operands to produce patterns, 
but the encoding procedures that the toolkit generates for a
constructor have different effects depending on typing.
Untyped constructors emit tokens into the current instruction stream,
but typed constructors produce opaque values encapsulating patterns.
The values are opaque in the sense that their only use is to be passed
as operands to other constructors.%
\footnote{We recognize that this restriction is a defect in the
toolkit, and that application writers might like to match on or
manipulate the opaque values produced by typed constructors.
We welcome suggestions about how such a facility should be designed,
implemented, or made available to applications.}
A new constructor type is defined simply by using a new \term{type-name}
in a specification.
All constructors of a type must be defined before the type is used,
for example, as an operand to another constructor.
This restriction prevents circular definitions.%
\footnote{It also prevents recursive constructor types.
We avoid recursive constructors for two reasons.
For efficiency, the toolkit's generator ``inlines'' constructor
operands by enumerating all possible patterns, a technique that works
only when the enumeration is finite.
When decoding, the toolkit can use a simple recognition algorithm
instead of having to create a parser, which might be ambiguous.}

The literal type serves to hold strings that are to be emitted literally.

<toplevel>+= (<-U) [<-D->]
record literal(s)       # holds string or list to be emitted literally
Defines literal (links are to index).

Every operand that is not of type literal is of type input:

<toplevel>+= (<-U) [<-D->]
record input(name, meaning)                     # input name and meaning
Defines input (links are to index).

There are five legitimate types for meaning:

constype --- Constructor of the given type.
field --- The given field.
null --- Integer.
string --- Relocatable address (the value of the string is always "reloc").
integer --- Extended field; the meaning is its width in bits. Finding the field requires looking up the input's name in the global symbol table.
<refman: constructor operands by name>= (<-U)
The real operands of a constructor are all given by name.
Constructor operands may have several different types, but we don't
want to clutter up our notation with type information, so we use
special names shamefully:
\term{field-name} \optional{\lit!}&
        Field, possibly signed
| \term{constructor-type-name}&
        Result of applying typed \rlap{constructor}
| \term{relocatable-name}&
        Relocatable address
| \term{unbound-name}&
        Integer operand
| \nt{literal}&
        Ignored for encoding and decoding
Constructor inputs may have underscores and
digits appended; they can be used to distinguish multiple inputs of
the same type.
Field and constructor inputs are available to be used as output
patterns; fields stand for the pattern binding that field to the value
of the operands, and constructor inputs stand for themselves---the
result of applying a constructor is always a pattern. 
Relocatable addresses and integers can be used in equations, in field
constraints, or both.
A \term{relocatable-name} is an identifier declared with the 
\lit{relocatable} directive described in Section~\ref{section:relocatable}.
The toolkit always treats field values as unsigned.
When a field is actually signed, specification writers must use an
exclamation point as an explicit sign-extension operator.
Indulging in a minor abuse of notation, we interpret
 the exclamation point after a field name in an operand list to mean
 that the parameter to an encoding procedure is a signed
integer, which should be narrowed to produce the value of the field.

An Operand may be the name of a symbol or the name of a constructor type trailed by zero or more digits (e.g., Address, Address1, etc). Note that variables used as relocatable addresses live in the symbol table. Finally, if not recognized, it is treated as a variable. Only a field may be sign-extended with a bang.

<*>+= [<-D->]
procedure name_to_input(name, bang)
  i := input(name, 
    if name ? type(ct := symtab[tab(reversetrailing('0123456789_'))]) == "constype" then
    else case type(x := symtab[name]) of {
      "field"       : if /bang then x else fwidth(x)
      "null"        : { <warn if no bang>; x }
      "relocatable" : "reloc"
      "constype"    : impossible("missed first-round search for constype")
      default       : typeerror(x, "free variable, field, or constructor input", name,
  /bang | type(i.meaning) == ("integer"|"null"|"string") |
    typeerror(i.meaning, "field or integer (to be sign-extended)", name, globals)
  return i
Defines name_to_input (links are to index).

<warn if no bang>= (<-U)
\bang | 
warning("Use ", name, "! for signed inputs --- future versions will require it")

Here's all the relocation goo.

<*>+= [<-D->]
record relocatable()    # type of a name that means relocatable address
global the_relocatable  # we only need one...
Defines relocatable, the_relocatable (links are to index).

<initialization>+= (<-U) [<-D->]
the_relocatable := relocatable()
<grammar>+= (<-U) [<-D->]
RelocSpec : "relocatable" Ident {Ident} /* every make_relocatable(ii2 | !ii3) */;
<*>+= [<-D->]
procedure make_relocatable(name)
  return (/symtab[name] := the_relocatable) | deferror("Relocatable name", image(name))
Defines make_relocatable (links are to index).

<refman: misc: relocatable>=
To write specifications that use relocatable addresses, you must
declare which operand names refer to relocatable addresses.
\indexedlit{relocatable} \sequence{\term{identifier}}
to mark one or more identifiers as relocatable addresses.
In subsequent constructor definitions, such identifiers denote
relocatable addresses when used as operands.

A constructor type is marked when it is first used as an input.

<*>+= [<-D->]
procedure mark_ct_as_used(type)
  if /type.used := lineno then
    {<sort members of type>}
  return type
Defines mark_ct_as_used (links are to index).

<sort members of type>= (<-U)
t := table()
every m := !type.members do
  t[] := m
t := sort(t)
type.members := []
every put(type.members, (!t)[2])

reversetrailing(c) generates, from right to left, the positions at which the rest of the subject is composed entirely of characters from c.

<toplevel>+= (<-U) [<-D->]
procedure reversetrailing(c)
  local i
  suspend *&subject + 1
  i := *&subject 
  while any(c, &subject, 0 < i) do { suspend i; i -:= 1 }
Defines reversetrailing (links are to index).

Deleting constructors

For tests, or for other purposes, we may want to delete constructors from a specification. There are two ways: say what we want to keep, or say what we want to throw away.
<grammar>+= (<-U) [<-D->]
Constructors : "discard" {Opcode} /* every discard_cons_named(explode_names(!ii2)) */
             | "keep"    {Opcode} 
                  /* s := set()
                     every insert(s, is_constructor(explode_names(!ii2), warning))
                     every k := key(constructors) do
                       if not member(s, constructors[k]) then
                         delete(constructors, k)
<refman: misc: discard and keep>=
You may want to delete constructors from
a specification, reducing the number of encoding procedures generated.
You may say what to keep or what to throw away.
\indexedlit{keep} \sequence{\nt{opcode}}
| \indexedlit{discard} \sequence{\nt{opcode}}
Constructors (or groups thereof) are named by their opcode specifications.


Extensions and ranges are guaranteed unique within a set of equations, and they are reset before each set of equations. The grammar is bit awkward because we're trying to keep it LL(1).
<grammar>+= (<-U) [<-D->]
Equations : [Equation { "," Equation } /* push(ii2, ii1) */] /* \ii1 | [] */ ;
Equation  : Expr Relop Expr /* eqn(ii1, ii2, ii3) */ ;
<refman: equations>=
Equations are written by relating two expressions using the relational
Multiple equations are separated by commas.
\nt{equation} \sequence{\lit, \nt{equation}}
\begin{production} {equation}
\nt{expr} \nt{relational-operator} \nt{expr}
Only the \lit= operator truly contributes information to a set of
equations to be solved.
All inequalities in a list of equations are simply added to the list
of {\em conditions} that must be satisfied if the equations are to
have a solution.

During encoding, it is a relocatable address, which can be treated as
an integer by the usual mechanism.
During decoding, it has to be converted to an integer by a special
template (see Section~\ref{sec:address-to-pc}).

Applications are forbidden in expressions except in special contexts.

<grammar>+= (<-U) [<-D->]
Expr      : AppExpr /* if has_app_or_literal(ii1) then 
                         error("Application or literal string not legal")
                       else ii1 */
<toplevel>+= (<-U) [<-D->]
procedure has_app_or_literal_f(e)
  return type(e) == ("Eapp"|"literal")

procedure has_app_or_literal(e)
  suspend expwalk(e, has_app_or_literal_f)
Defines has_app_or_literal, has_app_or_literal_f (links are to index).

<grammar>+= (<-U) [<-D->]
AppExpr   : Term { AOp Term }   
            /* every t := !ii2 do ii1 := binop(ii1, t[1], t[2]); ii1 */ ; 
Term      : Factor { Mop Factor }     
            /* every f := !ii2 do ii1 := binop(ii1, f[1], f[2]); ii1 */ ; 
AOp       : "+" | "-" ; 
Mop       : "*" | "/" ; 
Factor    : Integer
          | String /* literal(ii1) */   # legal only in constructor apps
          | "_" /* fresh_variable("_") */
          | IDENT ( [ Bitrange ] [ "!" ]          /* SyntaxRange(ii1, ii2) */
                  | "(" Args ")" 
             /* <note new operand or id ii1>
                if type(ii2) == "SyntaxRange" then mkfactor(ii1, ii2.bits, ii2.bang) 
                else Eapp(ii1, ii2) */
          | "(" AppExpr ")"        
          | "-" Factor /* binop(0, ii1, ii2) */
Bitrange  : "@" "[" INT [":" INT /*ii2*/] "]" /* [ii3, (\ii4|ii3)+1] */ ;
Args      : AppExpr {"," AppExpr} /* push(ii2, ii1) */
          | /* [] */

Can't use a square-bracket syntax for bit ranges because it creates parsing conflict with list of names on left-hand side of pattern binding, list of patterns in square brackets, and optional name in arm of case statement.

<toplevel>+= (<-U) [<-D->]
record SyntaxRange(bits, bang)
Defines SyntaxRange (links are to index).

<refman: expressions>=
The expression syntax uses the standard binary operators, unary minus,
and special
postfix syntax for bit slicing and sign extension:
| \term{identifier} \optional{\nt{bit-slice}} \optional{\lit!}
| \nt{expr} \nt{binary-operator} \nt{expr}
| \lit( \nt{expr} \lit)
\lit+ \vbar{} \lit- \vbar{} \lit* \vbar{} \lit/
        \lit@\lit[ \term{lo-bit} \optional{\lit: \term{hi-bit}} \lit]\\
The special identifier \lit{\_} is a ``wildcard.''
It stands for a ``don't care'' value, and every instance stands for 
a different variable.

The binary operators have the standard precedence.
Using explicit division is almost certain to cause the solver to
Multiplying two unknowns will also break the solver---you should only
multiply by integer literals.

A bit slice denotes the range of bits from \term{lo-bit} to
\term{hi-bit}, inclusive.%
\footnote{The current syntax for bit slices differs from that used in earlier versions
of the toolkit.  The old syntax led to parsing conflicts when restrictions on the use
of expressions were relaxed.}
By default, bit~0 is the least significant bit, but the significance of bit~0 
can be changed with the \lit{bit 0} directive (Section~\ref{section:bit-0}).
If \term{hi-bit} is omitted, it defaults to \term{lo-bit}, that is,
the slice is one bit wide.%
\footnote{It is a defect in the toolkit that there is no way to notate
a slice from bit $k$ up to the most significant bit, whatever that may
be (e.g., 31 or 64).  
If you want this defect remedied, suggest a suitable notation.}
Any quantity may be sliced.
Only slices and fields may be sign-extended, because they are the only
quantities whose widths are known.

The toolkit's equation solver is simple-minded, and it gives
terrible error messages.\index{equation solver!apology
for}\index{error messages!quality of}
<*>+= [<-D->]
procedure mkfactor(ident, range, ext)
  e := ident
  if \range then {
    e := mkslice(e, range[1], range[2])
    w := e.n
  if \ext then {
    /w := if type(f := symtab[ident]) == "field" then fwidth(f)
          else error("Can't sign-extend ", ident, " (not a field)")
    e := Ewiden(e, w)
  return e
Defines mkfactor (links are to index).

Code containing matching statements

Using the global value succptr is OK for now, but it will break horribly if we ever permit nested matching statements.
<toplevel>+= (<-U) [<-D]
global matching_stmts, codeheader
Defines codeheader, matching_stmts (links are to index).

<grammar>+= (<-U) [<-D->]
CodeFile      : (/*codeheader := arm(filename, lineno)*/)
                ({CODELINE} /* codeheader.code := ii1 */)
                { Casestmt {CODELINE} /* ii1.trailer.code := ii2 ; ii1 */ } 
                /* codeheader.original := codeheader; matching_stmts := ii3 */;
Casestmt      : CASELINE { Casearm } [ElseArm] "endmatch" 
               /* x := matching_stmt(ii2, ii1, succptr) ; put(x.arms, \ii3)
                  x.trailer := arm(filename, lineno); x */
Casearm       : ("|" /*arm(filename, lineno) */)
                Pattern OptEquations OptName "=>" {CODELINE}
                /* ii1.pattern := ii2; ii1.eqns := ii3; := ii4; ii1.code := ii6
                   ii1.original := ii1 */;   # value is ii1
ElseArm       : ("else"           /* arm(filename, lineno, epsilon()) */) {CODELINE} 
                /* ii1.code := ii2; ii1.original := ii1 */ ;
OptName       : [ "[" Ident "]" ];
OptEquations  : [ "{" Equations "}" ];
<refman: matching grammar>=
The toolkit's matching statements are inspired by pattern matching in
Standard~ML \cite{milner:definition}.
Matching statements are embedded directly in C or Modula-3 code.
<refman: how matching statements begin>
The body of a matching statement contains a series of arms,
and the matching statement ends with \lit{endmatch} on a line by

The first line of an arm begins with a \litbar{} and has the following
\litbar\ \nt{pattern} \optional{\lit\lbr\ \nt{equations} \lit\rbr} 
                   \optional{\lit[ \term{name} \lit]} \lit{=>} \term{code}
Free variables of \nt{pattern} or \nt{equations} are available to be
used in the \term{code} on the right-hand side.
Free variables from \nt{pattern} can come from field constraints or
from constructor applications; in either case, all such variables
represent binding instances.
If a free variables is an integer-typed argument to a constructor or
it  appears in  field bindings, it is an integer.
If a free variable is a constructor-typed argument to a constructor,
then on the right-hand side of the \lit{=>} it stands for the location
in the instruction stream where that argument matches.
It's as if the name of the argument were written as a label preceding
each constructor of the appropriate type.
The location is converted to an integer by the usual mechanism
(arithmetic from the matched address in units of

The special variable \verb+_+ can be used in constructor applications
to stand for operands whose values are uninteresting.
The optional \term{name}, if present, is an identifier that is bound
to the name of the \nt{pattern}.
This name may be a name introduced by a \verb+patterns+ declaration,
or it may be the name of a constructor that was applied to produce the
The \term{code} may occupy multiple lines; the toolkit takes all the
text between the \lit{=>} and the next initial \litbar, or the end of
the matching statement.

The last arm of a matching statement may be written
\lit{else} \term{code}
which is syntactic sugar for
\lit{| epsilon =>} \term{code}

<refman: summary of matching grammar>

The decoder generated by the toolkit has the effect of checking
each arm of the matching statement in turn.
As soon as it finds an arm whose pattern matches and whose equations
have a solution,%
\footnote{Note that patterns obtained by constructor application may
carry along some equations implicitly.} it binds all the free
identifiers on the left-hand side of the arm, then executes the code
on the right-hand side.
The decoders the toolkit generates are actually much more efficient,
testing inputs against all arms at once, but the sequential model accurately
describes the semantics.

Specifications for fetching words

I introduce the nonterminal Address in order to avoid making address a reserved word.
<grammar>+= (<-U) [<-D->]
FetchSpec : "fetch" ( INT   "using" String /* newfetch(ii1, ii3, 'a') */
                    | "any" "using" String /* newfetch(ii1, ii3, 'aw') */
          | "address" (Add "using" String          /* newfetch(ii1, ii3, 'ao') */
                      | "type" "is" String         /* newfetch(ii1, ii3, '') */
                      | "to" IntIdent "using" String     /* newfetch(ii2, ii4, 'a') */

Add      : Ident /* (ii1 == "add") | error("expected `add', `type', or `to'") */;
IntIdent : Ident /* (ii1 == "integer") | error("expected", image("integer")) */;
<*>+= [<-D->]
procedure newfetch(k, template, expected)
  <issue warnings for chars in expected not in template>
  return fetchtab[k] := template
Defines newfetch (links are to index).

<refman: fetching templates>=
Decoders generated by the toolkit's translator instantiate code
templates to get access to the tokens of the instruction stream being
The toolkit treats addresses and instruction streams
as abstract data types; these templates give the operations.
  \indexedlit{fetch} \alternate{\term{width} | \lit{any}} \lit{using}
| \indexedlit{address type} \lit{is} \term{template}
| \indexedlit{address add} \lit{using} \term{template}
| \indexedlit{address to integer} \lit{using} \term{template}
Each \term{template} is a string.
\lit{fetch} templates are used to fetch words from an instruction
Some applications may be able to give a single \lit{fetch} templates
that works with any width; others may have to give a different
\lit{fetch} template for each width used in the specification.
\lit{address type} identifies the type the application uses to
represent a location in an instruction stream;
the \lit{address add} template is used for address arithmetic, which
moves through the stream.
\lit{address to integer} is used to convert an address to an integer; it is
used to assign values  to labels.\index{labels!template for conversion}
<refman: meaning of fetchtab strings>
<refman: fetch template>
<refman: address add template>
<refman: address to integer template>
The \lit{address type} template doesn't use any escapes.

This warning may help people who are hung up on left-to-right.

<issue warnings for chars in expected not in template>= (<-U)
c := ''
template ? while tab(upto('%')) & ="%" do c := c ++ move(1)
every x := !(expected -- c) do
  warning("expected %", x, " in template")

Assembly syntax

<refman: assembly syntax>= [D->]
You can write perfectly interesting encoding and decoding applications 
without specifying an assembly-language syntax, so you may want to skip this section
on first reading and go on to \secn{following-asm-syntax}

A constructor's 
left-hand side, i.e., the constructor's name, its operands, and
some syntactic sugar, is often sufficient to specify the constructor's 
syntax in the assembly language recognized by a vendor's assembler. 
This is the case for most instructions on the MIPS and SPARC when
assembled by Unix assemblers.
Other architectures, like the Pentium, 
are targets of multiple assemblers, each of
which may recognize a distinct assembly syntax.
Multiple syntaxes prevent the specification writer
from using one constructor to specify all variant syntaxes.
Moreover, some assemblers overload instruction names,
% Some assemblers overload instruction names, 
using context to determine which instruction is meant.  
For example, the Pentium
\verb|add| can mean any of five different instructions, depending on
the sizes and locations of the operands.  The toolkit cannot do
this kind of overloading; it must use different names for
different instructions (constructors).  The reason is that the
toolkit must generate a different encoding procedure for each
instruction, and in most programming languages, different
procedures must have different names.  
This is untrue.  Name resolution in C++, Self, Cecil, etc.,
is extremely powerful and completely confounding.

Even in languages that do
support overloading, we might not be able to use the same name,
because the name-resolution mechanisms used in programming
languages are not as powerful as what an assembler uses (LR

We solve this problem by requiring each constructor to have a
different name.  Typically, the specification writer distinguishes
variants by adding suffixes to the base name of the constructor.
For example, the Pentium \verb|add| instructions include constructors
called \verb|addb|, \verb|addib|, \verb|addiowb|, \verb|addmrb|, and
\verb|addrmb|.  To get from these names back to assembly language,
we have to define an appropriate mapping.  
These mappings, and others for operand syntax, are
defined separately from the main specification, because different
vendors use different syntaxes for their assembly languages.

%To support various syntaxes and , 
%the toolkit provides assembly specifications separate from
%the constructor specifications.
An assembly specification includes three parts:
a mapping of constructor names to assembly opcodes,
the assembly format for each constructor operand,
and the assembly syntax for each constructor.
We use the Pentium specification of 
the assembly language supported by the
GNU assembler to illustrate assembly specifications.
First, we describe assembly name mappings, then operand format and
constructor assembly syntax.
| \nt{assembly-operand-syntax}
| \nt{assembly-syntax}
<grammar>+= (<-U) [<-D->]
AsmSpec : "assembly"
   ( "operand" {IdentBinding "is" OperandSyntaxSpec
                                 /* every asmoperand(!ii1, ii3[1], ii3[2]) */}
   | "component"  {Globbing "is" GlobTarget  /* asmopcode (ii1, ii3) */}
   | "opcode"  {Globbing "is" GlobTarget  /* asmopcode (ii1, ii3, 1) */}
   | "syntax"  (/*see_newline()*/) { AsmSyntax /* see_newline() */}

<refman: assembly syntax>+= [<-D->]
\nextsection{Assembly names for opcodes}
A constructor's name may contain multiple parts derived from pattern
names and constant strings.
Each part may or may not contribute to the assembly name.
For example, the constructor name \verb|add^"mrb"| contains two parts, 
the first derived from the pattern \verb|add| and the second from the
string \verb|"mrb"|.  The assembly name for this constructor is
\verb|addb|, so the pattern \verb|add| contributes its name and the suffix 
\verb|"mrb"| is mapped to the string \verb|"b"|.
This example illustrates how the constructor \verb|addmrb| has a more specific name to
disambiguate it from other overloaded variants of \verb|addb|.

{\hfuzz=12pt % make room for wide production
\verb|assembly opcode| introduces mappings from complete
constructor names (strings) to their assembly names (strings).
\verb|assembly component| introduces mappings from parts 
of constructor names to their assembly names.
\indexedlit{assembly} \indexedlit{opcode} \nt{glob-pattern} \lit{is} \nt{glob-target}
| \indexedlit{assembly} \indexedlit{component} \nt{glob-pattern} \lit{is} \nt{glob-target}
Complete mapping simply enables a specification writer to rename entire constructors. 
We also provide component-wise mapping because it improves
factoring of assembly names among constructors that
share common suffixes and prefixes.  For example, the suffix
\verb|B.Eb.Ib| always maps to \verb|b| in every constructor name
where it appears.  That mapping is specified by
  assembly component B.Eb.Ib is b

Here's how we disambiguate the two mapping mechanisms.
If a complete name mapping exists for a constructor name, it is
applied, and componentwise mappings are ignored.
If no complete mapping exists, 
mappings are applied individually to {\em each} part of a
constructor's name and the resulting strings are concatenated into the
complete assembly name.

<refman: assembly syntax>+= [<-D->]
\nextsection{Globbing expressions}
There is redundancy in the mapping of constructor names, so we
use a regular expression syntax to group related names.
The regular expression syntax is the same as the syntax for C-shell
``globbing'' expressions~\cite{joy:c-shell}.
For example, the  mappings applied to the parts of the constructor
name \verb|add^"mrb"| are:
assembly component
    add     is  add
    *b      is  b
The first rule maps \verb|add| to itself 
and the second maps any string that matches \verb|*b|
to \verb|b|.
The least general rule that matches a string is applied.
\remark{What do we mean by ``least general?''}

It is often useful to define a default mapping, 
i.e., for the pattern ``\verb|*|''.
On the supported targets, 
most constructor names map directly to assembly names, 
so the default maps a name to itself, i.e., 
\verb|assembly component {*} is $1|.

In globbing expressions, ``*'' matches any string.
Any other 
character, including the dot, matches itself.
The concatenation operator is implicit, so adjacent characters are concatenated.
Alternatives are comma-separated lists of strings delimited by ``\lit\lbr''
and ``\lit\rbr''.

The string on the right-hand side may contain elements of the form
{\tt\$}$n$, where $n$ is the $n$-th braced expression on the
left-hand side. 
{\tt\$\$} stands for a single dollar sign.
  \sequence{\lit{*} | \term{string} | \nt{glob-alternatives}}
  \lit{\lbr} \nt{glob-pattern} \sequence{\lit{,} \nt{glob-pattern}} \lit{\rbr}
  \sequence{\term{string} | \lit{\$}\term{integer} | \lit\$\lit\$}
For example, the rules below specify that the 
suffixes \verb|ow| and \verb|aw| map to the assembly name \verb|w|, 
and that \verb|od| and \verb|ad| map to \verb|l|.
assembly component
    {o,a}w   is w
    {o,a}d   is l

<refman: assembly syntax>+= [<-D->]
\nextsection{Assembly formats for operands}
\verb|assembly operand| introduces mappings from operands to formatted strings
that specify how to print the operands in assembly code.
Operands may be fields, integer inputs, relocatable addresses, or
constant strings. 
We use \verb|printf|-style syntax for formatted strings.

\indexedlit{assembly operand} \term{Ident} \lit{is} \nt{operandsyntax}
  \term{format-string} \lit{using} \nt{operandname}
| \nt{operandname}
  \lit{sparse [} \nt{binding} \sequence{\lit, \nt{binding}} \lit]
| \lit{names [} \sequence{\term{Ident} | \term{String}} \lit]
| \lit{field} \term{Ident}
<grammar>+= (<-U) [<-D->]
OperandSyntaxSpec : String ["using" OperandNameSpec ] /* [ii1, ii2] */
                  | OperandNameSpec /* ["%s", ii1] */
OperandNameSpec   : NameTable
                  | "field" Ident /* lookuptype(ii2, "field") */

<refman: assembly syntax>+= [<-D->]
The first \verb|assembly operand| rule below specifies that immediate operands
are prefixed by ``\$'' and are printed as integers.
The second rule specifies that the listed field operands are prefixed by
``\%'' and printed as strings, using the names provided in their
\verb|fieldinfo| declarations.
%The third rules specifies that the operands are printed as strings,
%using the assembly names of field \verb|base|.
assembly operand
    [count i8 i16 i32]           is "$%d"
    [r32 sr16 r16 r8 base index] is "%%%s"

Some inputs are not declared as fields but should be printed in the
same format as fields.   For example, the input \verb|reg| should be
printed using the names associated with the field \verb|base|.  
The following rule uses the optional \verb|using |{\em field}
clause to specify that \verb|reg|, \verb|reg8|, etc. should
be printed using the fieldinfo associated with \verb|base|.
assembly operand
    [reg reg8 sreg cr dr]       is "%%%s" using field base

<refman: assembly syntax>+= [<-D->]
\nextsection{Assembly syntax for constructors}
The default assembly syntax for a constructor appears in the
constructor's specification;
an alternate syntax may be specified with \verb|assembly syntax|.
Providing assembly syntax in a constructor specification can help a
specification writer or user read and identify a constructor, and it
is concise when only one assembly syntax is required.
An alternate syntax may be needed, however, if more than one assembly language
is used on the target.

\verb|assembly syntax| uses the same syntax as the \verb|constructors| directive:
a constructor name followed by a list of operands.
  \indexedlit{assembly syntax} \term{opcode} \term{operands}

<grammar>+= (<-U) [<-D->]
AsmSyntax: Opcode Operands NEWLINE 
               /* every set_asmsyntax(is_constructor(explode_names(ii1), warning), ii2) */;

<refman: assembly syntax>+= [<-D->]
The assembly-syntax specification must use 
the same set of operands that the constructor
uses, but the operands may appear in any order and with any
syntactic sugar.
For example, 
the GNU assembly language reverses the order of some 
constructor's operands, such as \verb|arithI^"d"| below. 
assembly syntax
  arithI^"d"     i32!, Eaddr
  MOV.Eb.Iv^od   i32!, Eaddr
Some constructors have implicit operands, such as the floating-point stack 
constructors.  Assemblers require those operands to distinguish
between overloaded floating point instructions.
assembly syntax
  fstack^Sstack  "%st", "%st"(idx)

The GNU syntax for effective addresses
bears no resemblance to the DOS syntax, but it can be specified in only a
few lines.
assembly syntax
  Indir       (reg)
  Disp32      d(reg)
  Index       (base,index,ss)
  Index32     d(base,index,ss)
  ShortIndex  d(,index,ss)

<refman: assembly syntax>+= [<-D]

Globbing expressions

A globbing expression is a string, a list (concatenation), a wildcard, or a list of alternatives in braces.
<*>+= [<-D->]
record glob_any(alternatives, number)   # braced list of alternate globbing patterns
record glob_wildcard()
global the_glob_wildcard                # only need one value
record glob_dollar(number)              # $n on right-hand side
Defines glob_any, glob_dollar, glob_wildcard, the_glob_wildcard (links are to index).

<initialization>+= (<-U) [<-D]
the_glob_wildcard := glob_wildcard()
<grammar>+= (<-U) [<-D->]
Globbing         : SeeWhite GlobPattern StopWhite White /* ii2 */;
GlobPattern      : {GlobAlternatives | Literal | Ident} 
                          /* number_braces(cat_adjacent_strings(ii1)) */;
GlobAlternatives : "{" GlobPattern {"," GlobPattern } "}" /* glob_any(push(ii3, ii2))*/
                 | "*"                                    /* the_glob_wildcard */
<grammar>+= (<-U) [<-D->]
GlobTarget       : SeeWhite GlobTargets StopWhite White /* ii2 */;
GlobTargets      : {GlobTargetSpecial | GlobTargetLiteral} 
                                                /* cat_adjacent_strings(ii1) */;
GlobTargetSpecial: "$" (Integer /* glob_dollar(ii1) */ | "$")
GlobTargetLiteral: Literal | Ident | "*" | "{" | "}" | "," ;

<*>+= [<-D]
procedure cat_adjacent_strings(l)
  m := []
  every x := !l do 
    if type(x) == "string" & type(m[-1]) == "string" then
      m[-1] ||:= x
      put(m, x)
  return m
Defines cat_adjacent_strings (links are to index).

Lexical elements

<terminal symbols>= (<-U)
%term IDENT
%term INT

<grammar>+= (<-U) [<-D]
Ident     : IDENT | "_" ;
String    : '"' ;
Integer   : INT | "'" /* ord(ii1) */ ;
White     : WHITESPACE;
SeeWhite  : /* see_whitespace() */;
StopWhite : /* ignore_whitespace() */;
Defines Ident, String (links are to index).