K-Normalization and Register Allocation

K-Normalization is the middle of the three really fun transformations in our compiler. (The first was code generation; the last will be closure conversion.) It gets us from pure, first-order Scheme to K‑normal form.

Introduction

To introduce the transformation, let’s look at an example expression:

x + 1

This expression is not in K‑normal form because the operands of the + operation aren’t both in registers. After K‑normalization, this expression becomes

let t = 1 in x + t

where t is a temporary register.

Here’s a slightly more general expression:

x + e

It’s tempting to recursively K‑normalize e to e’, pull a temporary register t, and emit

let t = e' in x + t

But this approach has a problem: it K‑normalizes x + y to

let t = y in x + t,

which introduces an unnecessary move instruction (and also burns a register to no purpose).

To fix this issue, we’ll define a helper function that builds a let expression only when needed. We don’t care what t is, so we’ll pass the two expressions e’ and x + [●], where [●] stands for a “hole” that the helper function will fill with t. In our example,

let t = e' in x + t

is produced by metalanguage code that looks something like this:

smartLet e’ ⟦x + [●]⟧ (* artist’s conception *)

A value of type “expression with hole” is represented in the metalanguage by a function of type reg -> reg exp, and the hole is filled by applying the function to whatever register goes in the hole. The type of the helper doesn’t look much like the type of a let form, so instead of smartLet, We’ll call the helper function bindAnyReg: the helper function, not the caller, picks t. My first draft looks something like this:

bindAnyReg e’ (fn t => ⟦x + $(t)⟧) (* early rough mix *)

An example

A Scheme primitive call (@ e₁ e₂) will bind both argument expressions to registers:¹

normalize (@ e₁ e₂)
  = { ask bindAnyReg to place each argument in a register }
bindAnyReg (normalize e₁) (fn t =>
  bindAnyReg (normalize e₂) (fn u =>
    @(t, u)))

To illustrate this transformation, I K-normalize the expression (+ 2 3) (an expression that we might prefer to evaluate at compile time):

normalize (+ 2 3)
  = { place each argument in a register }
bindAnyReg (normalize 2) (fn t =>
  bindAnyReg (normalize 3) (fn u =>
    +(t, u)))

The literal 2 is K-normalized as 2:

bindAnyReg (normalize 2) (fn t =>
  bindAnyReg (normalize 3) (fn u =>
    +(t, u)))
  = { normalize the source literal 2 making a target literal 2 }
bindAnyReg 2 (fn t =>
  bindAnyReg (normalize 3) (fn u =>
    +(t, u)))

Now bindAnyReg allocates available register t to hold the value of target expression 2. It then returns a let binding whose body is obtained by applying bindAnyReg’s continuation to t:

bindAnyReg 2 (fn t =>
  bindAnyReg (normalize 3) (fn u =>
    +(t, u)))
  = { let-bind 2 to available register t }
let t = 2 in $((fn t =>
  bindAnyReg (normalize 3) (fn u =>
    +(t, u))) t)
  = { function application (β-reduction) }
let t = 2 in
  $(bindAnyReg (normalize 3) (fn u =>
    +(t, u)))

Now the same thing happens with literal 3: it gets K-normalized as 3, and bindAnyReg allocates another available register. Since t is now in use, bindAnyReg allocates u. And it returns a let binding whose body is obtained by applying bindAnyReg’s continuation to u:

let t = 2 in
  $(bindAnyReg (normalize 3) (fn u =>
    +(t, u)))
  = { K-normalize 3 as 3 }
let t = 2 in
  $(bindAnyReg 3 (fn u =>
    +(t, u)))
  = { let-bind 3 to available register u }
let t = 2 in
  let u = 3 in ((fn u =>
    +(2, u)) u)
  = { function application (β-reduction) }
let t = 2 in
  let u = 3 in
    +(t, u)

The final target expression let t = 2 in let u = 3 in +(t, u) is in K-normal form.

Another example: avoiding unnecessary moves

Suppose I K-normalize the expression (+ x 1). I don’t want to allocate a register to hold x, because as long as x is a local variable, it is already in a register. In this special case, bindAnyReg can simply apply its continuation to x, without emitting a let binding:

normalize (+ x 1)
  = { place each argument in a register }
bindAnyReg (normalize x) (fn t =>
  bindAnyReg (normalize 1) (fn u =>
    +(t, u)))
  = { normalize x as x }
bindAnyReg x (fn t =>
  bindAnyReg (normalize 1) (fn u =>
    +(t, u)))
  = { x is already in a register; pass it to the continuation }
(fn t =>
  bindAnyReg (normalize 1) (fn u =>
    +(t, u))) x)
  = { function application (β-reduction) }
bindAnyReg (normalize 1) (fn u =>
  +(x, u))
  = { normalize 1 as 1 }
bindAnyReg 1 (fn u =>
  +(x, u))
  = { bind 1 to register u; pass u to continuation }
let u = 1 in
  $(((fn u =>
    +(x, u))) u)
  = { function application (β-reduction) }
let u = 1 in
  +(x, u)

Expression (+ x 1) is K-normalized as let u = 1 in +(x, u) without burning an extra register.

Background: Managing registers

Saying “bindAnyReg can allocate a new register” is all very well, but the allocation of registers is the crux of the transformation. An essential property of K‑normal form is that if each name corresponds to a machine register, the code can be translated directly into assembly language without needing any additional registers. K‑normalization is mostly a register-allocation problem. And register allocation is one place where the simple virtual-machine approach diverges from a classic compiler:

A classic compiler manages registers in two stages. The first stage, of which K‑normalization is an example, allocates fresh “temporary registers” like crazy. The second stage, register allocation, assigns each temporary register to a machine register.

For this strategy to work well, the register allocator has to recognize “move instructions” (actually copy instructions) and to assign a machine register to each temporary register in such a way that as many moves as possible become redundant and can be eliminated. This part of the algorithm is called register coalescing.

The two classic register-allocation algorithms are graph-coloring register allocation (Chaitin) and linear-scan register allocation (Poletto and Sarkar). Both algorithms can coalesce. And although both algorithms are complicated and hard to get right, if you want to build a native-code compiler, one or the other is essential.²
Classic register allocation is hard because machine registers are a scarce resource. Because registers are scarce, the compiler has to figure out when a value assigned to a register is no longer needed—so it can reuse the register. Reusing registers optimally is “computationally intractable”—both algorithms above rely on heuristics.³

But on a virtual machine, registers aren’t scarce. We have hundreds of them. So we can say, “to hell with all that.” In our compiler, as long as a variable is in scope, we’ll simply assume that the value of that variable might be used, and we’ll keep the variable in a register. Only when the variable goes out of scope will we free up that register for other uses.

One other thing: what happens if you run out of machine registers?

A classic compiler can free up more registers by temporarily moving some values from registers onto the stack. This transformation is called spilling. Spilled registers have to be reloaded before they can be used in a computation, and there are heuristics to figure out which registers to spill and where.
Our compiler could do almost the same thing: we could spill registers into a record allocated on the heap. This transformation could be implemented at the source-code level quite easily; it’s not very different from the closure conversion we’ll implement in module 10. But we won’t bother—if we run out of registers, our compiler will just fail.

Registers and procedure calls

Lots of times the compiler just needs “any register,” but sometimes it needs a particular register. For example, if a compiler needs to divide integers on the x86, there is only one register that supports integer division, and the divisor has to go in that register.

Intel hardware is just annoying, but there is another example that applies to pretty much any hardware: procedure calls. When registers are used to pass information from one procedure to another, the choice of which values go where is dictated by a procedure calling convention. That convention usually calls for particular registers. Our convention is particularly simple: when a function is called, the function value itself arrives in register 0, and the actual parameters are in registers 1, 2, 3, and so on.

That’s from the callee’s perspective; from the caller’s perspective, thanks to register windows, we can pick any contiguous sequence of registers—provided we don’t mind losing the values in all the higher-numbered registers. To implement a procedure call from the caller’s side, we’ll identify a register $k$ such that all registers with numbers ${}\ge k$ are available. To call a procedure with $n$ parameters, we’ll use registers $k$ through $k+n$ (inclusive) to hold the procedure and its actual parameters. And the callee will have the right to destroy every single one of them—as well as all higher-numbered registers from $k+n+1$ onward.

Whether it’s to implement calling conventions or just to deal with finicky hardware, putting a value into a specific, conventional register is sometimes called targeting. If there’s a coalescing register allocator in the picture, targeting is no problem: the compiler just allocates a fresh temporary for every actual parameter, then emits an instruction that moves the temporary into the conventional register. When possible, the coalescing register allocator eliminates the move.

Our approach to register allocation

In a virtual machine, it’s still possible to “allocate temporaries like crazy.” But this strategy requires a separate pass to assign a machine register to each temporary. Instead, I’m advocating a one-pass approach that places both local variables and intermediate values into machine registers directly, during K‑normalization. This approach has the potential to waste lots of machine registers, but it is simple, fast, and relatively easy to implement. And by design, we have registers to burn.

Here are the elements of the approach:

The K‑normalizer keeps track of a set of available registers, which I’ll write $A$.
The K‑normalizer also keeps track of which register each variable is in: an environment ρ.
Since we won’t have a coalescing pass to eliminate move instructions, the K‑normalizer uses a helper function like the bindAnyReg shown in the example above. That function takes the available set $A$ as an additional parameter.
The only targeting required is in the implementation of procedure calls. The K‑normalizer identifies a suitable register $k$ and targets the function and each actual parameter into the correct register.

Although (in my opinion) bindAnyReg is best written in continuation-passing style, most of a K‑normalizer can be written nicely as a direct-style recursive function. That function’s main jobs are to manage the set of available registers and to keep track of the correct environments. Detailed advice can be found in the step-by-step instructions for the module.

Two strategies for implementation, or, why compiled implementations are fast

Now it’s worth taking a step back and thinking about how we are implementing the semantics and what we’re getting out of the virtual machine. For review, the simplest implementation of a language and its semantics is usually a definitional interpreter. This is the sort of interpreter you may have encountered in a course like CS 105, where a semantics is implemented directly as a more-or-less functional program. A definitional interpreter has these elements:

An expression $e$, the abstract syntax, is a run-time data structure in the interpreter.
A value $v$ is also a run-time data structure in the interpreter.
An environment ρ tells in what location every variable is stored. It too is a run-time data structure in the interpreter.
The store σ, which stands for the contents of all the locations, stands for locations in the environment data structures inside the interpreter.
These pieces fit together into an evaluation judgment

⟨e,ρ,σ⟩ ⇓ ⟨v,σ’⟩
A definitional interpreter does not distinguish “compile time” and “run time.” The closest it comes to “compile time” is parsing the concrete syntax to produce abstract syntax

When we transition to a virtual machine, our notions of values and stores change very little. But our notions of syntax and environment change radically:

An expression $e$ is translated into a lower-level language (VM instructions). And although our high-level language is expression-oriented, the low-level language is statement-oriented (or you might prefer to say “command-oriented”).
The environment ρ is split into two parts. The global part is implemented as a load-time data structure in the VM interpreter—the data structure used by function global_slot. In the operational semantics, the global part of the environment is $G$. Whether the data structure is an association list (as in the interpreter) or something more efficient, it always has the same role: at load time, map each defined name to a mutable location.

The local part, by contrast, is purely a compile-time entity. It exists only during compilation, and at run time, it is gone.

How does the environment work? How does the translation work? To answer these questions, we have to think more about σ. Ignoring heap objects, in the definitional interpreter, σ just refers to slots in the environment. In the virtual machine, σ refers to machine registers. And in both the definitional interpreter and in the virtual machine, ρ does not change during program execution. That is, once a variable’s location is chosen, that variable doesn’t move.

With this property in mind, let’s revisit the operational-semantics rule for evaluating a variable:

\[\frac{x \in \mathrm{dom}\; \rho \qquad \ell = \rho(x)} {\langle \mathtt{VAR}(x),ρ,σ\rangle \Downarrow ⟨σ(ℓ),σ⟩}\]

There are two lookups here: first look $x$ up in ρ to get location $ℓ$, then fetch the contents of $ℓ$. In the definitional interpreter, both lookups are done at run time. In the virtual machine, the compiler does the first lookup at compile time, gets location $ℓ$, and drops $ℓ$ into the assembly code: $ℓ$ is represented by a register number. The only thing that happens at run time is fetching a value out of location $ℓ$. This operation is not only constant time; it’s fast. (The whole purpose of registers is to be fast.)

This development answers the question, “what are we getting out of the virtual machine?” First and foremost, we are getting speed, and we are getting it by replacing an expensive data-structure lookup with a fast load from a register. This transformation provides the single biggest performance improvement available with a virtual machine, and it’s enabled by K‑normalization.

The code shown in the examples doesn’t show all the details. For example, it doesn’t show how to keep track of the available registers. Use the examples as an illustration of the key ideas, not as a template for your own code.↩
Or you could just pretend to build a compiler and leave the real work to LLVM.↩
If you’ve studied computational complexity, optimal register use is NP-hard.↩