The assembly-language syntax you design this week will ultimately be the syntax that is emitted by your Universal Forward Translator. When you are debugging your translator, you will be reading a lot of assembly code. Take the time to design a concrete syntax that you will find easy to read. I urge you to design a syntax that looks a lot like C++, Python, Go, JavaScript, Pascal, or any other procedural language with infix operators. Such a syntax will make it obvious to anyone what the assembly code is doing, and if you want to implement peephole optimization in module 12, it will give you a head start.

Design considerations

Here are some specifics for you to keep in mind:

Familiarity. Making an assembly language easy to read and write might include using familiar notations for familiar operations like arithmetic. If you’re used to writing x = y + z in source code, you may as well write $r1 = $r2 + $r3 in assembly code.
Tokens you can work with. In file asmlex.sml, I provide a lexical analyzer that understands these tokens:
- Brackets, comma, colon, = sign, == sign, and Pascal-style assignment :=. Each of these is recognized as a token even if not separated from other tokens by whitespace.
- Comments beginning with ; and running to end of line.
- Register numbers beginning with r or $r.
- Integer literals.
- String literals delimited by double quotes. Only \", \n, and \\ are recognized as escapes.
- Names, which are basically everything else.
You are welcome to extend, revise, or replace my lexical analyzer. But I recommend you use it primarily as an example of combinator parsing.
Ease of parsing. The grammar for your concrete syntax should be easy to translate into combinator parsers. Your grammar need not be LL(1); a combinator parser is not limited to a single symbol of lookahead. While an LL(1) grammar translates to a faster parser, fast parsing is not a goal. Any sane grammar that you write is likely to be easy to parse.
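To make the token list above concrete, here is a minimal sketch of a tokenizer for those token classes. It is written in Python purely for illustration and is not a translation of asmlex.sml. The names `tokenize`, `classify`, and `unescape`, the `(kind, value)` token shape, and the choice of `( ) [ ] { }` as the "brackets" are all assumptions of this sketch.

```python
import re

# One alternative per token class. Order matters: "==" and ":=" are tried
# before "=" and ":" so the longer operator wins.
TOKEN_RE = re.compile(r"""
    (?P<skip>   \s+ | ;[^\n]* )                # whitespace and ;-comments
  | (?P<string> "(?:\\["n\\]|[^"\\\n])*" )     # only three escapes allowed
  | (?P<punct>  == | := | [()\[\]{},:=] )      # assumed bracket set
  | (?P<word>   [^\s()\[\]{},:=;"]+ )          # register, integer, or name
""", re.VERBOSE)

REG_RE = re.compile(r"\$?r(\d+)")   # r3 or $r3
INT_RE = re.compile(r"-?\d+")


def unescape(body):
    """Replace the three recognized escapes inside a string literal."""
    return re.sub(r'\\(["n\\])',
                  lambda m: "\n" if m.group(1) == "n" else m.group(1),
                  body)


def classify(word):
    """Decide whether a run of non-delimiter characters is a register,
    an integer literal, or a name (everything else)."""
    m = REG_RE.fullmatch(word)
    if m:
        return ("reg", int(m.group(1)))
    if INT_RE.fullmatch(word):
        return ("int", int(word))
    return ("name", word)


def tokenize(text):
    """Turn source text into a list of (kind, value) pairs."""
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"cannot lex at offset {pos}: {text[pos:pos + 10]!r}")
        pos = m.end()
        kind, lexeme = m.lastgroup, m.group()
        if kind == "skip":
            continue
        if kind == "string":
            tokens.append(("string", unescape(lexeme[1:-1])))
        elif kind == "punct":
            tokens.append(("punct", lexeme))
        else:
            tokens.append(classify(lexeme))
    return tokens
```

Because punctuation is excluded from the `word` class, input such as `x==y` splits into three tokens without any whitespace, which matches the behavior described for the punctuation tokens. In this sketch an operator like `+` falls into the "everything else" category and comes out as a name.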

In module 4, when you’re testing parsers, you may want to fiddle with your grammar to improve its error messages. For module 3, use the clearest, simplest grammar that could possibly work.
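The ease-of-parsing point can be illustrated with a small sketch of combinator parsing, written in Python for illustration only. Everything here is an assumption of the sketch: the combinator names (`token`, `seq`, `alt`, `fmap`), the `(kind, value)` token shape, the three instruction forms, and the tuple-based AST. The point is that the first two alternatives share the prefix `reg := reg`, so the grammar is not LL(1), yet ordered choice with backtracking handles it with no extra work.

```python
# A parser is a function (tokens, index) -> (value, new_index), or None on failure.

def token(kind, value=None):
    """Accept one token of the given kind (and exact value, if one is given)."""
    def parse(toks, i):
        if i < len(toks) and toks[i][0] == kind and value in (None, toks[i][1]):
            return toks[i][1], i + 1
        return None
    return parse


def seq(*parsers):
    """Run parsers one after another; succeed with the list of their values."""
    def parse(toks, i):
        values = []
        for p in parsers:
            result = p(toks, i)
            if result is None:
                return None
            value, i = result
            values.append(value)
        return values, i
    return parse


def alt(*parsers):
    """Ordered choice: every alternative starts from the same position,
    so a failed alternative backtracks for free."""
    def parse(toks, i):
        for p in parsers:
            result = p(toks, i)
            if result is not None:
                return result
        return None
    return parse


def fmap(f, parser):
    """Apply f to the value produced by a successful parse."""
    def parse(toks, i):
        result = parser(toks, i)
        if result is None:
            return None
        value, j = result
        return f(value), j
    return parse


# Hypothetical grammar:
#   instr ::= reg := reg binop reg  |  reg := reg  |  reg := int
reg = token("reg")
gets = token("punct", ":=")
binop = alt(*(token("name", op) for op in ["+", "-", "*", "/"]))

arith = fmap(lambda v: ("binop", v[0], v[3], v[2], v[4]),
             seq(reg, gets, reg, binop, reg))
copy = fmap(lambda v: ("copy", v[0], v[2]), seq(reg, gets, reg))
loadlit = fmap(lambda v: ("loadlit", v[0], v[2]), seq(reg, gets, token("int")))
instr = alt(arith, copy, loadlit)   # longest form first


def parse_instr(toks):
    """Parse exactly one instruction; return None unless all tokens are consumed."""
    result = instr(toks, 0)
    if result is None or result[1] != len(toks):
        return None
    return result[0]
```

For example, `parse_instr([("reg", 1), ("punct", ":="), ("reg", 2)])` first tries `arith`, which fails at the missing operator, and then falls back to `copy`. Placing the longest alternative first is a design choice of this sketch; a different combinator library might resolve the overlap another way.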

Design for use in the system

A single VM instruction has multiple concrete representations: assembly language, virtual object code, and the SVM’s unparsing template in instructions.c. That instruction also has internal representations inside the UFT. Here are some of the functions involved:

(a) Parse into an assembly-language AST (you write next week)
(b) Unparse back to assembly language (you write this week)
(c) Assemble into an object-code AST (you write next week)
(d) Unparse into object code on disk (I’ve written)
(e) Parse and load into SVM as binary code in memory (you wrote in module 2)
(f) Unparse binary code using the printasm function in file disasm.c, using the unparsing templates in instructions.c (you wrote in module 2)

Here’s how the compositions are used:

Functions (a) and (b) should compose to be the identity function on well-formed assembly language; that composition is implemented by uft vs-vs.
Functions (a), (c), and (d) compose to make an assembler; that composition is implemented by uft vs-vo.
Functions (a), (c), (d), (e), and (f) should compose to the identity function or something close to it; that composition is implemented by
```
uft vs-vo | env SVMDEBUG=unparse svm
```

The better these compositions preserve the original concrete syntax, the easier your system will be to debug.
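The round-trip idea behind the parse-then-unparse composition can be sketched in a few lines. This is a Python illustration under stated assumptions, not the UFT's code: the two instruction forms, the tuple AST, and the functions `parse`, `unparse`, and `round_trip` are all hypothetical. The sketch accepts both `r1` and `$r1` but always emits `$r1`, so the composition is the identity on canonical text and only "something close to it" on everything else.

```python
import re

# Hypothetical AST: ("binop", dest, op, src1, src2) or ("loadlit", dest, n)
BINOP_RE = re.compile(r"\$?r(\d+) := \$?r(\d+) ([-+*/]) \$?r(\d+)")
LIT_RE = re.compile(r"\$?r(\d+) := (-?\d+)")


def parse(line):
    """Parse one line of the toy assembly syntax into an AST tuple."""
    text = line.split(";", 1)[0]       # drop a trailing ;-comment (no strings here)
    text = " ".join(text.split())      # normalize runs of whitespace
    m = BINOP_RE.fullmatch(text)
    if m:
        dest, x, op, y = m.groups()
        return ("binop", int(dest), op, int(x), int(y))
    m = LIT_RE.fullmatch(text)
    if m:
        return ("loadlit", int(m.group(1)), int(m.group(2)))
    raise ValueError(f"cannot parse {line!r}")


def unparse(instr):
    """Render an AST tuple in one canonical concrete syntax."""
    if instr[0] == "binop":
        _, dest, op, x, y = instr
        return f"$r{dest} := $r{x} {op} $r{y}"
    if instr[0] == "loadlit":
        _, dest, n = instr
        return f"$r{dest} := {n}"
    raise ValueError(f"unknown instruction {instr!r}")


def round_trip(line):
    """Parse followed by unparse, the composition that should be
    the identity on well-formed, canonically formatted input."""
    return unparse(parse(line))
```

A check such as `round_trip(s) == s` on canonical input, together with `parse(unparse(ast)) == ast` on every AST you can construct, catches most disagreements between a parser and an unparser. The comment and the spacing that `parse` discards show what is lost when the composition is only close to the identity.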

Traditional assembly-language syntax is obsolete

Traditional assembly languages are very old, and some of them were constrained by available hardware and software resources. In such cases it was helpful to have very simple syntax. But today’s assembler no longer has to be written in its own assembly language and then assembled by hand. So design a syntax you’ll enjoy reading, and have fun with your design!

If you want an example, check out the syntax I designed for the Universal Machine in COMP 40. Bear in mind that our target is simpler, so our assembly language can be simpler:

The SVM has no segmented memory and in fact no addressable memory of any kind (unless you count global variables).
The SVM has no stack.
The SVM has no “sections.”
The SVM has no initialized data.
The SVM operates on floating-point numbers, not machine integers, so there is no reason to call the comparisons “signed.”