Code generation, often called “instruction selection,” is the heart of any compiler. In the UFT, assembly code is generated from K-normal form, which is designed for easy code generation. The key step is the translation from an applicative language, in which names are let
-bound, to an imperative language, in which names are assigned to. That translation is described below.
This handout includes information that you will use now, and also information you will use in three weeks:
The handout includes examples you can learn from, plus the ideas of the translation (centered around
let
bindings and “destinies”), plus equations that specify exactly how the translation works. This information explains how the translation works—in order to write this week’s code, you must understand it.The handout also includes discussion of live and dead variables. This information explains why the translation works.
In module 9, when we return to translation, this information will help you implement the K-normalization translation correctly.
The handout assumes that you have internalized the syntax of K-normal form. If that assumption is faulty, consider refreshing your memory.
Code-generation example
The ideas behind code generation are best illustrated with an example. A short example that has primitives, a call, and a conditional is append
. Its source code looks like this:
(define append (xs ys)
(if (null? xs)
ys
(cons (car xs) (append (cdr xs) ys))))
The code in the body can be hand-translated into K-normal form:
(define append (xs ys)
(let* ([p (null? xs)])
(if p
ys
(let* ([z (car xs)]
[zs (cdr xs)]
[f append] ; 'getglobal' primitive
[ws (f zs ys)]
[t (cons z ws)])
t))))
Apart from the if
, the function body is essentially one for one with assembly code; each binding corresponds to a VM instruction. The correspondence is easier to see if the K-normal form is written using the ML-like notation for K-normal form:1
let p = null?(xs) in ; p := null? xs
if p then ; if p goto ...
ys ; return ys
else
let z = car(xs) in ; z := car xs
let zs = cdr(xs) in ; zs := cdr xs
let f = getglobal("append") in ; f := getglobal("append")
let ws = f(zs, ys) in ; ws := call f (zs, ..., ys)
let t = cons(z, ws) in ; t := z cons ws
t ; return t
Every line on the right is effectively a single assembly-language instruction. Translating each binding to an instruction is not too difficult—it’s easier to write the code than to understand exactly why it works. Since understanding, not just working code, is a goal of the course, let’s study why it is OK to replace each let
binding with an assignment.
From applicative to imperative: A matter of life and death
Translating K-normal form into assembly code is mostly a matter of translating each let
binding into an assignment. But a let
binding is not an assignment—that’s the essence of the difference between an applicative language and an imperative one. The difference manifests when a let
-bound name is reused:
let x = 3 in ; x := 3
let y = (let x = 2 in ; x := 2
x + x) in ; y := x + x
let z = x + y in ; z := x + y ; WRONG x
print z ; print z
In the applicative K-normal form, the second binding to x
is limited in scope. In the imperative assembly language, the second assignment to x
overwrites the first. K-normal form has to be translated to assembly language without overwriting key values that reside in machine registers.
One tactic is to rename let
-bound variables so that no two let
bindings ever use the same name. But since every let
-bound variable occupies a machine register, that tactic is viable only if some later pass, after translation, figures out how different let
-bound variables can occupy the same VM register. Such a pass is an example of register allocation. It’s an honorable technique, and it’s a centerpiece of any serious course in compiler construction, but for a course that aspires to simplicity, it’s too much.
The problem with the example above is that there are two different let
bindings of x
, and it’s not safe for both bindings to use the same VM register. Instead of solving this problem, we will rely on an upstream translation pass (module 9) to give us K-normal form that uses let
bindings only in safe ways. To explain safety, I’m going to teach you two key concepts of register allocation: “life” and “death.”
In an imperative setting, a register is live if its value might affect the results of a future computation. If the register’s value definitely cannot affect the results of a future computation (perhaps because it does not appear anywhere, or because the register is always overwritten), the register is dead.2 The same register may be live or dead at different points in the program; for example, in the imperative code above, x
is dead after the assignment x := 3
, live after the assignment x := 2
, and still live up to but not including the command print z
; x
dies right after the assignment to z
.
Life and death can explain the issue with the code example: after the assignment x := 3
, x
is intended to be live—but x
is killed by the next assignment (x := 2
). In general, a register is killed by any code that might overwrite it. The usual suspects are assignments and function calls.
Killing explains the difference between applicative binding and imperative assignment:
let y = e in e' |
(y := e; e') |
Does not kill y |
Kills y |
The assignment is equivalent to code that evaluates the let
expression and simultaneously kills y
. Simultaneous parallel composition is written using a vertical bar:
(y := e; e') ≡ let y = e in e' | kill y
We can therefore translate let
-binding into assignment as long as it is safe to kill y
—that is, as long as no other let
-binding of y
appears in the context of this let
expression. In our system, safety is going to be guaranteed by the translation that emits K-normal form (the K-normalizer you will build in module 9).3
Liveness and free variables
In an imperative language, liveness is computed by a program analysis that follows every branch and conditional and estimates all the things that might happen if every evaluation terminates and every branch is (eventually) both taken and not taken. This computation is an example of a dataflow analysis. Dataflow analysis is a great topic, but it’s too advanced for a typical compiler course. (In fact, I have taught an entire graduate seminar just on dataflow analysis and optimization.)
In an applicative language, by contrast, knowing what variables will be used is simplicity itself: the live variables correspond to the free variables of an expression. If you have not already encountered free variables in CS 105, you will find free variables as part of implementing lambda
in module 10. It’s an easy computation.
Translations of let
bindings
Given a let
binding let y = e in body
, once we know it is safe to kill y
, we can translate the let
binding into an assignment. The right-hand side e
can take any form, but the forms divide into three categories:
A form that has a direct translation is converted into a single instruction. Such forms include literals, variables, and primitives, among others.
A control-flow form is translated using
GOTO_LABEL
andIF GOTO_LABEL
. Such forms includeif
andwhile
.A floatable form is “floated” out of the
let
using an algebraic law. The result is a smallerlet
which is again translated. Such forms include assignment(x := e)
, sequencing(e₁; e₂)
, andlet
itself.
The direct translations from K-normal form to assembly code are the easiest:
Form of let expression |
Translation of binding | Instruction |
---|---|---|
let y = v in body | y := v | load literal |
let y = x in body | y := x | register/register move |
let y = @(x₁, …, xₙ) in body | y := @(x₁, …, xₙ) | REGS instruction |
let y = @(x₁, …, xₙ, v) in body | y := @(x₁, …, xₙ) | REGSLIT instruction |
let y = funcode (…) => e in body | y := funcode (…) => e | LOADFUNC instruction |
These translations will be implemented by a function called toReg
, which you will implement. Function toReg
takes the y and e from an expression of the form let y = e in body, and it returns a sequence of assembly-language instructions. These instructions compute the value of e and place that value in register y.
Cultural enrichment: Three-address code
Many, many machine primitives, like +
, >
, and cons
, take two arguments. So the most common instruction generated has the form y := @(x₁, x₂)
. In the compiler biz, this form is called “three-address code,” and it is the most common target language for compilers. Single-argument operations like !
and cdr
can be viewed as special cases of three-address code, as can load-literal and register-register move instructions. The ubiquity of three-address code explains why our SVM, like the MIPS, SPARC, and ARM, uses an instruction format that can name three registers. The lack of such formats in the x86 family is a nuisance.
The if
and while
forms are translated into control flow, which is described below.
The floatable forms are translated using algebraic laws:
When the form bound to a variable is itself a
let
, the nestedlet
can be “floated” out, providedx
doesn’t clash withy
and isn’t free in the body.let y = (let x = e in e’) in body
≡ { provided y ≠ x and x ∉ fv(body) }
let x = e in (let y = e’ in body)Here
fv(body)
stands for the free variables of the body.Because
x
is limited in scope, ifx
does not satisfy the side conditions, it is always possible to renamex
so that it does. Just pick a name that does not clash withy
and is not free inbody
. (If you have implemented capture-avoiding substitution in the lambda calculus, the renaming is the same renaming that you used to avoid variable capture.)In our own compiler, we are going to arrange things upstream so that nested
let
expressions can be floated without renamingx
.When the form bound to variable
y
is abegin
form, the first expression is floated outside thelet
:let y = (e₁; e₂) in body
≡
(e₁; let y = e₂ in body)When the form bound to a variable is an assignment, the assignment is floated outside the
let
, and the binding is changed to use the variable just assigned to. Ify
andx
happen to be identical or ifx
happens to be free in the body, that is OK.let y = (x := e) in body
≡
(x := e; let y = x in body)
These examples show the main idea behind the translation: convert everything to a sequence of let
bindings, then translate each binding to an instruction. But an actual translation function has a little more to it; in particular, the examples don’t tell us how to translate e₁
in the expression (e₁; e₂)
. To know what to do, we have to understand the meaning of the context in which each expression appears, which I refer to as the expression’s “destiny.”
Translation guided by destiny
How an expression is translated depends on how the expression’s value is going to be used, which in turn depends on the syntactic context in which the expression appears. The context determines how an expression will be used, which I refer to as its “destiny.” In our system there are three possible destinies:
The classic destiny is that the value of the expression is destined to be placed in a register. An example appearance that has this destiny is the
e
in the formlet x = e in e'
; its value is destined to be placed in registerx
. This destiny is useful for applications of primitive operators like+
andcdr
, among others.Expressions with this destiny will be translated by function
.Another possible destiny is the expression is evaluated only for its side effect. Some example appearances that have this destiny include the body of a
while
loop or the expressione₁
in the form(e₁; e₂)
. This destiny is useful for applications of primitive operators likeprint
andsetglobal
, among others.Expressions with this destiny will be translated by function
.The third possible destiny is that the value of the expression is returned from a function. An example appearance that has this destiny is the body of a function—or any expression that occurs in tail position in a function.4 This destiny is useful for any expression that produces a value.
Expressions with this destiny will be translated by function
.Where this destiny really makes a difference is when the expression is a function call: a function call whose value is returned needs to be optimized into a tail call. Both ordinary calls and optimized tail calls will be implemented in module 7.
Destiny is determined by context
The destiny of an expression is independent of the expression’s form; an expression’s destiny is completely determined by its context—that is, its position in the syntactic form in which it appears. Context determines destiny using these rules:
In the expression let x = e in e’, expression e is destined to be put in a register (register
x
in fact).In the
begin
expression (e₁; e₂), expression e₁ is destined to be evaluated for side effect.In the
set
expression x := e, expression e is destined to be put in registerx
.In the expression while x := e do e’, expression e is destined to be put in register
x
.In the expression while x := e do e’, expression e’ is destined to be evaluated for side effect.
In the expression funcode (x₁, … xₙ) => e, expression e is destined to be returned from a function.
Every top-level expression (that is, one that serves as the translation of a definition form) is destined to be evaluated for side effect.
Otherwise, the destiny of an expression is determined by the destiny of its parent expression. For example, in
if x then e₁ else e₂
, bothe₁
ande₂
have the same destiny as theif
expression, and inlet x = e in e'
,e'
has the same destiny as thelet
expression.
The rules of tail position
Tail calls and tail position are described in lecture notes from CS 105. Roughly speaking, an expression is in tail position if its value is returned from a function. Precisely speaking, tail position is defined inductively:
As the base case, in funcode \(xs\) => \(e\), \(e\) is in tail position.
If
if x then e₁ else e₂
is in tail position, then bothe₁
ande₂
are in tail position.If
let x = e in e'
is in tail position, thene'
is in tail position.If
(e₁; e₂)
is in tail position, thene₂
is in tail position.
Equations of translation for two of the three destinies
Each destiny defines a translation, which is implemented by a function dedicated to that destiny. The functions are toReg
, forEffect
, and toReturn
. In this handout, those names are abbreviated to 𝓡, 𝓔, and 𝒯. Their types are as follows:
𝓡 : register -> exp -> instruction list
𝓔 : exp -> instruction list
𝒯 : exp -> instruction list
The lists returned by the functions are written using the same notation as in the operational semantics from modules 1 and 2: ε
stands for the empty list; ·
stands for any and all of cons
, append
, and snoc
as needed; and a single expression stands for the singleton list containing only that expression.
In module 6, you’ll implement two of these three functions: 𝓡 and 𝓔. Their equations mostly speak for themselves, but some things worth noticing include the following:
Function 𝓡 (
toReg
) translates the bindingx = e
in alet
expression. That’s almost its entire job.Depending on the syntactic form in which it appears, a subexpression could be translated by 𝓡, 𝓔, or 𝒯.
The mutual recursion between 𝓡 and 𝓔 is most evident in the equations for the forms
let x = e in e'
and(e₁; e₂)
.The
funcode
form is just another value.The
while
loop has no real 𝓡 translation; it’s just run for effect, then the destination register is set to falsehood.The 𝓔 translation throws away any form whose evaluation is guaranteed not to have any side effects.
The 𝒯 translation (
) is used only for the body of a function; you’ll implement it in module 8.
The toReg
translations are as follows:
K-normal form | Assembly code |
---|---|
Forms with direct translations | |
𝓡r⟦v⟧ | = r := v |
𝓡r⟦x⟧ | = r := x |
𝓡r⟦@(x₁, …, xₙ)⟧ | = r := @(x₁, …, xₙ), provided @ sets a register |
𝓡r⟦@(x₁, …, xₙ, v)⟧ | = r := @(x₁, …, xₙ, v), provided @ sets a register |
𝓡r⟦x(x₁, …, xₙ)⟧ | = r := call x(x₁, …, xₙ), |
provided x, x₁, …, xₙ are consecutive | |
𝓡r⟦funcode (x₁, … xₙ) => e⟧ | = r := funcode (x₁, … xₙ) => 𝒯⟦e⟧ |
Control-flow forms | |
𝓡r⟦if x then e₁ else e₂⟧ | = if x goto L·𝓡r⟦e₂⟧·goto L’·L:·𝓡r⟦e₁⟧·L’:, |
where L and L’ are fresh labels | |
𝓡r⟦while x := e do e’⟧ | = 𝓔⟦while x := e do e’⟧·r := #f |
Floatable forms | (Note recursive calls to 𝓡) |
𝓡r⟦let x = e in e’⟧ | = 𝓡x⟦e⟧·𝓡r⟦e’⟧ |
𝓡r⟦(e₁; e₂)⟧ | = 𝓔⟦e₁⟧·𝓡r⟦e₂⟧ |
𝓡r⟦(x := e)⟧ | = 𝓡x⟦e⟧·r := x |
And the forEffect
translations are as follows:
K-normal form | Assembly code |
---|---|
𝓔⟦v⟧ | = ε |
𝓔⟦x⟧ | = ε |
𝓔⟦@(x₁, …, xₙ)⟧ | = @(x₁, …, xₙ), if @ has a side effect |
𝓔⟦@(x₁, …, xₙ)⟧ | = ε, otherwise |
𝓔⟦@(x₁, …, xₙ, v)⟧ | = @(x₁, …, xₙ, v), if @ has a side effect |
𝓔⟦@(x₁, …, xₙ, v)⟧ | = ε, otherwise |
𝓔⟦x(x₁, …, xₙ)⟧ | = 𝓡r⟦x(x₁, …, xₙ)⟧, |
where r is a register that is killed by the call | |
𝓔⟦funcode (x₁, … xₙ) => e⟧ | = ε |
𝓔⟦if x then e₁ else e₂⟧ | = if x goto L·𝓔⟦e₂⟧·goto L’·L:·𝓔⟦e₁⟧·L’:, |
where L and L’ are fresh labels | |
𝓔⟦while x := e do e’⟧ | = goto L·L’:·𝓔⟦e’⟧·L:·𝓡x⟦e⟧·if x goto L’, |
where L and L’ are fresh labels | |
𝓔⟦let x = e in e’⟧ | = 𝓡x⟦e⟧·𝓔⟦e’⟧ |
𝓔⟦(e₁; e₂)⟧ | = 𝓔⟦e₁⟧·𝓔⟦e₂⟧ |
𝓔⟦(x := e)⟧ | = 𝓡x⟦e⟧ |
The translations of if
and while
are discussed below.
Control flow
Our code generator doesn’t just convert let
bindings to assignments; it also eliminates structured control-flow constructs by lowering them to combinations of goto
s, conditional goto
s, and label definitions. The same techniques apply to other kinds of structured control including for
, do-while
, repeat-until
, and so on. Control operators like break
and continue
can be implemented with the aid of additional parameters to 𝓔; these parameters identify the header and exit of the innermost loop. Control constructs like switch
and case
can be implemented using conditional goto
, but to implement switch
in constant time requires hardware support—typically an instruction that can goto
an address computed at run time.
The translation of the while
loop is especially worth studying. The loop begins with a jump into the middle of a code block, and the “header” test e
is actually at the end. This translation is preferred for two reasons:
Each trip through the loop requires just one branch instead of two.
Branch-prediction hardware predicts each backward branch as taken, and the more iterations a loop executes, the more effective the branch prediction. With this layout, the prediction of the outcome of the conditional test is most effective exactly on those loops that take the most time. Nifty.
This approach to generating code for loops is based on a short note first published by Forest Baskett in 1978.
Translating expressions in tail position
The third destiny, which is implemented by toReturn
(𝒯), is the destiny of expressions in tail position. I’m providing no equations for 𝒯; you won’t need the details until module 8, and I want you to work them out for yourself.
Like the other translations, the 𝒯 translation is defined inductively. There are two base cases:
If a function call x(x₁, …, xₙ) appears in tail position, it is translated into a tailcall instruction.
If a name x appears in tail position, it is translated into an A.return instruction.
These base cases represent optimizations, and they are the reason for having a third destiny with its own translation. The other cases fall into two groups:
When an if form, a let form, or the (e₁; e₂) form occurs in tail position, then at least one of its subexpressions also occurs in tail position. The function should translate each such subexpression by making a recursive call to itself.
When another form of e—most commonly v or @(x₁, …, xₙ)—occurs in tail position, the function must get the value of that form into a return register
r
:𝒯⟦e⟧ = 𝓡r⟦e⟧·A.return r
Since the assignment to
r
is the very last thing that happens before the A.return instruction, it is safe to use register 0 asr
.