Code generation from K-normal form

Code generation, often called “instruction selection,” is the heart of any compiler. In the UFT, assembly code is generated from K-normal form, which is designed for easy code generation. The key step is the translation from an applicative language, in which names are let-bound, to an imperative language, in which names are assigned to. That translation is described below.

This handout includes information that you will use now, and also information you will use in three weeks:

The handout assumes that you have internalized the syntax of K-normal form. If that assumption is faulty, consider refreshing your memory.

Code-generation example

The ideas behind code generation are best illustrated with an example. A short example that has primitives, a call, and a conditional is append. Its source code looks like this:

(define append (xs ys)
    (if (null? xs)
        (cons (car xs) (append (cdr xs) ys))))

The code in the body can be hand-translated into K-normal form:

(define append (xs ys)
  (let* ([p (null? xs)])
    (if p
        (let* ([z  (car xs)]
               [zs (cdr xs)]
               [f append]        ; 'getglobal' primitive
               [ws (f zs ys)]
               [t (cons z ws)])

Apart from the if, the function body is essentially one for one with assembly code; each binding corresponds to a VM instruction. The correspondence is easier to see if the K-normal form is written using the ML-like notation for K-normal form:1

let p = null?(xs) in                  ; p := null? xs
if p then                             ; if p goto ...
  ys                                  ; return ys
  let z  = car(xs) in                 ; z  := car xs
  let zs = cdr(xs) in                 ; zs := cdr xs
  let f  = getglobal("append") in     ; f  := getglobal("append")
  let ws = f(zs, ys) in               ; ws := call f (zs, ..., ys)
  let t  = cons(z, ws) in             ; t  := z cons ws
  t                                   ; return t              

Every line on the right is effectively a single assembly-language instruction. Translating each binding to an instruction is not too difficult—it’s easier to write the code than to understand exactly why it works. Since understanding, not just working code, is a goal of the course, let’s study why it is OK to replace each let binding with an assignment.

From applicative to imperative: A matter of life and death

Translating K-normal form into assembly code is mostly a matter of translating each let binding into an assignment. But a let binding is not an assignment—that’s the essence of the difference between an applicative language and an imperative one. The difference manifests when a let-bound name is reused:

let x = 3 in                        ; x := 3
let y = (let x = 2 in               ; x := 2
         x + x) in                  ; y := x + x
let z = x + y in                    ; z := x + y  ; WRONG x
print z                             ; print z

In the applicative K-normal form, the second binding to x is limited in scope. In the imperative assembly language, the second assignment to x overwrites the first. K-normal form has to be translated to assembly language without overwriting key values that reside in machine registers.

One tactic is to rename let-bound variables so that no two let bindings ever use the same name. But since every let-bound variable occupies a machine register, that tactic is viable only if some later pass, after translation, figures out how different let-bound variables can occupy the same VM register. Such a pass is an example of register allocation. It’s an honorable technique, and it’s a centerpiece of any serious course in compiler construction, but for a course that aspires to simplicity, it’s too much.

The problem with the example above is that there are two different let bindings of x, and it’s not safe for both bindings to use the same VM register. Instead of solving this problem, we will rely on an upstream translation pass (module 9) to give us K-normal form that uses let bindings only in safe ways. To explain safety, I’m going to teach you two key concepts of register allocation: “life” and “death.”

In an imperative setting, a register is live if its value might affect the results of a future computation. If the register’s value definitely cannot affect the results of a future computation (perhaps because it does not appear anywhere, or because the register is always overwritten), the register is dead.2 The same register may be live or dead at different points in the program; for example, in the imperative code above, x is dead after the assignment x := 3, live after the assignment x := 2, and still live up to but not including the command print z; x dies right after the assignment to z.

Life and death can explain the issue with the code example: after the assignment x := 3, x is intended to be live—but x is killed by the next assignment (x := 2). In general, a register is killed by any code that might overwrite it. The usual suspects are assignments and function calls.

Killing explains the difference between applicative binding and imperative assignment:

let y = e in e' (y := e; e')
Does not kill y Kills y

The assignment is equivalent to code that evaluates the let expression and simultaneously kills y. Simultaneous parallel composition is written using a vertical bar:

(y := e; e')  ≡  let y = e in e' | kill y

We can therefore translate let-binding into assignment as long as it is safe to kill y—that is, as long as no other let-binding of y appears in the context of this let expression. In our system, safety is going to be guaranteed by the translation that emits K-normal form (the K-normalizer you will build in module 9).3

Liveness and free variables

In an imperative language, liveness is computed by a program analysis that follows every branch and conditional and estimates all the things that might happen if every evaluation terminates and every branch is (eventually) both taken and not taken. This computation is an example of a dataflow analysis. Dataflow analysis is a great topic, but it’s too advanced for a typical compiler course. (In fact, I have taught an entire graduate seminar just on dataflow analysis and optimization.)

In an applicative language, by contrast, knowing what variables will be used is simplicity itself: the live variables correspond to the free variables of an expression. If you have not already encountered free variables in CS 105, you will find free variables as part of implementing lambda in module 10. It’s an easy computation.

Translations of let bindings

Given a let binding let y = e in body, once we know it is safe to kill y, we can translate the let binding into an assignment. The right-hand side e can take any form, but the forms divide into three categories:

The direct translations from K-normal form to assembly code are the easiest:

Direct translations (one binding, one instruction)
Form of let expression Translation of binding Instruction
let y = v in body y := v load literal
let y = x in body y := x register/register move
let y = @(x₁, …, xₙ) in body y := @(x₁, …, xₙ) REGS instruction
let y = @(x₁, …, xₙ, v) in body y := @(x₁, …, xₙ) REGSLIT instruction
let y = funcode (…) => e in body y := funcode (…) => e LOADFUNC instruction

These translations will be implemented by a function called toReg, which you will implement. Function toReg takes the y and e from an expression of the form let y = e in body, and it returns a sequence of assembly-language instructions. These instructions compute the value of e and place that value in register y.

Cultural enrichment: Three-address code

Many, many machine primitives, like +, >, and cons, take two arguments. So the most common instruction generated has the form y := @(x₁, x₂). In the compiler biz, this form is called “three-address code,” and it is the most common target language for compilers. Single-argument operations like ! and cdr can be viewed as special cases of three-address code, as can load-literal and register-register move instructions. The ubiquity of three-address code explains why our SVM, like the MIPS, SPARC, and ARM, uses an instruction format that can name three registers. The lack of such formats in the x86 family is a nuisance.

The if and while forms are translated into control flow, which is described below.

The floatable forms are translated using algebraic laws:

These examples show the main idea behind the translation: convert everything to a sequence of let bindings, then translate each binding to an instruction. But an actual translation function has a little more to it; in particular, the examples don’t tell us how to translate e₁ in the expression (e₁; e₂). To know what to do, we have to understand the meaning of the context in which each expression appears, which I refer to as the expression’s “destiny.”

Translation guided by destiny

How an expression is translated depends on how the expression’s value is going to be used, which in turn depends on the syntactic context in which the expression appears. The context determines how an expression will be used, which I refer to as its “destiny.” In our system there are three possible destinies:

Destiny is determined by context

The destiny of an expression is independent of the expression’s form; an expression’s destiny is completely determined by its context—that is, its position in the syntactic form in which it appears. Context determines destiny using these rules:

The rules of tail position

Tail calls and tail position are described in lecture notes from CS 105. Roughly speaking, an expression is in tail position if its value is returned from a function. Precisely speaking, tail position is defined inductively:

Equations of translation for two of the three destinies

Each destiny defines a translation, which is implemented by a function dedicated to that destiny. The functions are toReg, forEffect, and toReturn. In this handout, those names are abbreviated to 𝓡, 𝓔, and 𝒯. Their types are as follows:

𝓡 : register -> exp -> instruction list
𝓔 :             exp -> instruction list
𝒯 :             exp -> instruction list

The lists returned by the functions are written using the same notation as in the operational semantics from modules 1 and 2: ε stands for the empty list; · stands for any and all of cons, append, and snoc as needed; and a single expression stands for the singleton list containing only that expression.

In module 6, you’ll implement two of these three functions: 𝓡 and 𝓔. Their equations mostly speak for themselves, but some things worth noticing include the following:

The toReg translations are as follows:

K-normal form Assembly code
Forms with direct translations
𝓡r⟦v = r := v
𝓡r⟦x = r := x
𝓡r⟦@(x₁, …, xₙ) = r := @(x₁, …, xₙ), provided @ sets a register
𝓡r⟦@(x₁, …, xₙ, v) = r := @(x₁, …, xₙ, v), provided @ sets a register
𝓡r⟦x(x₁, …, xₙ) = r := call x(x₁, …, xₙ),
    provided x, x₁, …, xₙ are consecutive
𝓡r⟦funcode (x₁, … xₙ) => e = r := funcode (x₁, … xₙ) => 𝒯⟦e
Control-flow forms
𝓡r⟦if x then e₁ else e₂ = if x goto L·𝓡r⟦e₂⟧·goto L’·L:·𝓡r⟦e₁⟧·L’:,
    where L and L’ are fresh labels
𝓡r⟦while x := e do e’ = 𝓔⟦while x := e do e’⟧·r := #f
Floatable forms     (Note recursive calls to 𝓡)
𝓡r⟦let x = e in e’ = 𝓡x⟦e⟧·𝓡r⟦e’
𝓡r⟦(e₁; e₂) = 𝓔⟦e₁⟧·𝓡r⟦e₂
𝓡r⟦(x := e) = 𝓡x⟦e⟧·r := x

And the forEffect translations are as follows:

K-normal form Assembly code
𝓔⟦v = ε
𝓔⟦x = ε
𝓔⟦@(x₁, …, xₙ) = @(x₁, …, xₙ), if @ has a side effect
𝓔⟦@(x₁, …, xₙ) = ε, otherwise
𝓔⟦@(x₁, …, xₙ, v) = @(x₁, …, xₙ, v), if @ has a side effect
𝓔⟦@(x₁, …, xₙ, v) = ε, otherwise
𝓔⟦x(x₁, …, xₙ) = 𝓡r⟦x(x₁, …, xₙ)⟧,
    where r is a register that is killed by the call
𝓔⟦funcode (x₁, … xₙ) => e = ε
𝓔⟦if x then e₁ else e₂ = if x goto L·𝓔⟦e₂⟧·goto L’·L:·𝓔⟦e₁⟧·L’:,
   where L and L’ are fresh labels
𝓔⟦while x := e do e’ = goto L·L’:·𝓔⟦e’⟧·L:·𝓡x⟦e⟧·if x goto L’,
   where L and L’ are fresh labels
𝓔⟦let x = e in e’ = 𝓡x⟦e⟧·𝓔⟦e’
𝓔⟦(e₁; e₂) = 𝓔⟦e₁⟧·𝓔⟦e₂
𝓔⟦(x := e) = 𝓡x⟦e

The translations of if and while are discussed below.

Control flow

Our code generator doesn’t just convert let bindings to assignments; it also eliminates structured control-flow constructs by lowering them to combinations of gotos, conditional gotos, and label definitions. The same techniques apply to other kinds of structured control including for, do-while, repeat-until, and so on. Control operators like break and continue can be implemented with the aid of additional parameters to 𝓔; these parameters identify the header and exit of the innermost loop. Control constructs like switch and case can be implemented using conditional goto, but to implement switch in constant time requires hardware support—typically an instruction that can goto an address computed at run time.

The translation of the while loop is especially worth studying. The loop begins with a jump into the middle of a code block, and the “header” test e is actually at the end. This translation is preferred for two reasons:

This approach to generating code for loops is based on a short note first published by Forest Baskett in 1978.

Translating expressions in tail position

The third destiny, which is implemented by toReturn (𝒯), is the destiny of expressions in tail position. I’m providing no equations for 𝒯; you won’t need the details until module 8, and I want you to work them out for yourself.

Like the other translations, the 𝒯 translation is defined inductively. There are two base cases:

These base cases represent optimizations, and they are the reason for having a third destiny with its own translation. The other cases fall into two groups:

  1. The actual translation of the if will swap the then and else blocks.

  2. The question of whether a value does affect the results of a future computation is undecidable—any algorithm that can answer this question can solve the halting problem.

  3. The word “dead” has two more uses, which are applied to code rather than to a register or a variable. If a value is assigned to a register, and if the register is dead immediately after the assignment—as in the assignment of x := 3 immediately before x := 2—then the assignment is also said to be dead. A dead assignment can be removed from a program without changing the program’s semantics. Dead-assignment elimination is a major component of code improvement (“optimization”). And many other code-improving transformations have the sole purpose of making assignments dead so they can be removed. The third use of “dead” is technically suspect, although common: code that is unreachable is often called “dead code.”

  4. You may have studied tail position and tail calls in CS 105.