Lecture notes for COMP 40

Table of Contents

7 Sep 2011: Intro to Comp 40

p9075088 p9075089 p9075090 p9075091 p9075092 p9075093 p9075095

Introduce course staff

Books and handouts

Hanson is essential; Kernighan and Ritchie is highly recommended

For the other book, it depends how much you like books---suggestions on web. If books will save you time, spend the money.

Today's handouts:

Course title: Machine-level programming

But also

Engineering problem-solving: tracking Irene

What to expect from COMP 40: A course built around problem-solving

The course is all about projects:

Truth in advertising: not everyone is going to think every project is great. But in focus groups last spring, everybody had a couple of favorites, and they're not always the same ones. All the projects are at least pretty good, and each year I work to make them better.

Outcomes from COMP 40

Satisfied customer:

I just got out of back to back interviews at Google Cambridge today. I just wanted to thank you for stressing design so much in Comp 40. I know that I achieved optimal solutions for the two interviews. Both of my interviewers were very pleased with how I wanted to write out the program's process in English and go deeply into designing before even thinking about coding. They really appreciated the fact that I was always keeping how every module of code fit into the bigger picture in terms of complexity, placement, and design. Even though I was learning on the spot, having had similar learning experiences in 40 was a huge, huge benefit. I thought I'd just let you know how useful COMP 40 was to my experience.

What does Professor Ramsey want?


Discipline and technique are essential!

Group exercise: technique (20 minutes)

Fundamental technique: Abstraction

Who has an idea what abstraction is?


So what's so special about the machine level of abstraction?

A problem: AMD64 is too complex to be mastered in one semester. So, you'll learn to analyze AMD64 code, but you won't write much. To understand assembly code and machine instructions in depth, later in the term we'll work with a virtual machine. Most students find this pretty interesting.

12 Sep 2011: Technique, Programming with interfaces and implementations

This week in COMP 40:


Lab comments

Project comments

40 asks you to do more:

Response to Wednesday's exercise on technique

Some of your technical ideas:

Skills and techniques from other disciplines:

I hope a handout is coming

Techniques from the professor:

Why? One of the three topics of this course is programming, and you have now reached the stage where in order to improve, you need to observe yourself and how you do things. It's not just about the finished project. How you work really affects how long it takes to finish.

Handling the paper storm

You have to read the course policy pages; they will tell you important stuff like we don't take submissions that have tab characters in them or that are wider than 80 columns.

And more important stuff like:

For the rest, read when you are stuck or are moving more slowly than you would like on problems. For example,

For tonight's design document, we just want an explanation of what data structures you will use to solve the problem of the fingerprint groups, and how the contents of those data structures relate to the input. (The loop invariant.) A couple of well-written paragraphs will do the job. Certainly no more than a page.

Note: we expect your code to perform reasonably on inputs of several hundred thousand lines

What is a loop invariant?

What problem do invariants solve?

The problem we're trying to solve is that your brain is too small!

11 and 15: summarize code by giving a narrative sequence of events:

Problems with narrative sequences:

Conclusion: narrative thinking holds you back

Solution: replace narrative thinking with a new habit of thought

Think about data and its invariants

Two consequences:

Invariant in action: sorting an array in place

Sort array of integers in place:
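As a concrete sketch (insertion sort; the in-class example may have differed, and `sort_in_place` is a made-up name), with the invariant written as a comment at the point where it holds:

```c
#include <assert.h>

/* Sort a[0..n-1] in place.  The point is not the algorithm but the
   invariant: it describes the DATA at every iteration, replacing a
   narrative "first this happens, then that happens". */
void sort_in_place(int a[], int n) {
    for (int i = 1; i < n; i++) {
        /* invariant: a[0..i-1] is sorted and holds the same
           multiset of elements it held originally */
        int x = a[i];
        int j = i;
        while (j > 0 && a[j-1] > x) {
            a[j] = a[j-1];   /* shift larger elements right */
            j--;
        }
        a[j] = x;            /* now a[0..i] is sorted */
    }
}
```
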

14 Sep 2011: Hanson's Table interface

p9145097 p9145098 p9145099 p9145100 p9145101 p9145102 p9145103 p9145104 p9145105

Homework advice:

If seeking help with code that compiles but does not run:


Detailed analysis of the Table interface, with special attention to void type and C-style polymorphism.

Start with parametric polymorphism:

And now, 16 variations on a theme (computer demo)

High points:

19 Sep 2011: Iterators, abstraction and invariants: Implementing arrays

p9195106 p9195107 p9195108 p9195109 p9195110 p9195111 p9195112 p9195113

Learning objectives:

The parable of the terriers, the squirrels, and the lion

Iterators, also called map

Abstraction and invariants: where are we going?

Learning to program with abstraction and invariants so that we can tackle interesting problems instead of boring problems.

Most likely source of bugs in your code:

The design checklist will help you debug faster

Today's class: Invariants, contracts, and implementation

Wednesday: compiling and linking, starting to look at machine structure

Unboxed, polymorphic arrays

The UArray interface with operations classified:

typedef struct T *T;

extern T     UArray_new (int length, int size);      // creator
extern void  UArray_free(T *array);                  // destroyer

extern int   UArray_length(T array);                 // observer
extern int   UArray_size  (T array);                 // observer

extern void *UArray_at(T array, int i);             // observer

extern void UArray_resize(T array, int length);      // mutator
extern T    UArray_copy  (T array, int length);      // producer

Using values stored in Hanson's arrays

Suppose array a is a Hanson UArray_T containing values of type struct pixel. Here is how I recommend you get an expression containing the ith element:

struct pixel *p = UArray_at(a, i);   // capture pointer into the array
                                     // (valid until resized or freed)
assert(sizeof(*p) == UArray_size(a));
... *p ...        // use expression of struct type

This approach is more verbose than the uses of Array_get in Hanson's book, but it is also robust against changes in type. Why? The Ramsey approach has a single point of truth, that is, the type is mentioned in only one place. In the Hanson approach, the type is mentioned in multiple places, and it is easier to write inconsistent code.

Storing values into unboxed arrays

Suppose function f() returns a value of type struct pixel you want to store in an unboxed array.

struct pixel *p = UArray_at(a, i);   // capture pointer into the array
                                     // (valid until resized or freed)
assert(UArray_size(a) == sizeof(*p)); // detects some errors
*p = f();

The representation revealed

(picture this)

struct UArray_T {
    int length;
    int size;
    char *elems;
};

What's the invariant?

Question: can you change array without changing what's happening in the world of ideas?

Question: how long is it safe to retain a pointer returned by UArray_at?

Question: What are the contracts?
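One way to ground these questions: a sketch of how UArray_at might be implemented over the representation above (`UArray_at_sketch` is a made-up name, not Hanson's actual code). The bounds assertion is the contract; the returned pointer is valid only as long as `elems` is, which answers the retention question:

```c
#include <assert.h>
#include <stddef.h>

struct UArray_T {
    int length;   /* number of elements, length >= 0 */
    int size;     /* bytes per element, size > 0 */
    char *elems;  /* invariant: points to length * size bytes */
};

void *UArray_at_sketch(struct UArray_T *array, int i) {
    assert(array != NULL);
    assert(i >= 0 && i < array->length);  /* checked runtime error */
    return array->elems + (size_t)i * array->size;  /* pointer into elems */
}
```
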

21 Sep 2011: Fire, Brimstone, Compiling, linking, and libraries

p9215115 p9215116 p9215117 p9215118 p9215119 p9215120 p9215121

Learning objectives:

Homework alerts

Special visitor at end of class

Fire and Brimstone

(You need this more than compiling)

Comments on names (and expectations for next homework)

Vague, pointless verbs:

Informative verbs:

A typical function that calculates something?

A typical function that acts on the state of the program, or the state of the world:

If your function does both, its contract is probably too complicated. If you can't fix it, document it very carefully. Show us you have thought about it.

Comments on fingerprint groups

Will have homework back later this week.

Fingerprint groups exposed some problems:

Primary offenders:

My diagnosis:

Most of these problems can be solved by attention to detail:

Two reasons why this stuff matters:

  1. It matters for your grades: if we can't tell easily that your answer is "almost right", it will be graded as wrong

  2. The real reason: to build big stuff, you have to get people out of the loop. Your code has to be exactly right. Microsoft and Google and Apple all understand this. In fact I think you'll find out that places like Microsoft know exactly what COMP 40 means.

From source code to a running process image: compiling, linking, and libraries

Your best source to read about this material is my handout on How to Write a Compile Script.


  1. Can we create a program just from UArray2? What would that mean?

  2. Where does every C (and C++) program start executing?

Things to remember from the compilation lecture:

Things to try to see what goes wrong:

Signs that your compile script is broken:

26 Sep 2011: From Newton to Einstein---locality and the costs of storage

p9265130 p9265131 p9265132 p9265133 p9265134 p9265135

Learning objectives

What to expect from the latest project

Another step on the software track, but the same kind of thing you've been doing. Plus two bonus activities: experimental computer science and performance improvement. Therefore, more time.

  1. Another abstract data type: blocked, 2D, unboxed array

  2. Another application: image rotation

  3. Object-oriented programming in C through mediators

  4. Understanding the hardware

  5. Measuring and understanding performance

New software:

A point about reading PNM files:



Newtonian (classical) picture of storage:

Classical picture works in everyday cases, but breaks down at the extremes.

To get from Newton to Einstein, we need to know more about storage

A register is a one-word container

It is impossible to create a pointer to a register

New terms of art (for language designers and compiler writers)

Very occasionally you'll hear a location called a "mutable cell".

This distinction is important because we need to understand the difference between a location (which a program can assign to) and a value (which cannot be assigned to).

Address space is a sequence of 2^64 one-byte containers (but only 2^48 are usable on today's machines)

Address-space layout (software convention)

| OS reserved |
|   ...       |
|   Dynamic   |
|    Stack    |
|   ...       |
|  (unmapped) |
|   ...       |
|  heap       |
|    data     |
|    text     |
| unmapped    |

Wait a minute! There's not enough memory!

Fast is small; big is slow

Registers < 300B

L1 cache from 8KB to 128KB

L2 cache 512KB to 8MB

L3 cache 2-6MB

Note on the numbers:

28 Sep 2011: The how and why of caching

p9285139 p9285140 p9285141 p9285142 p9285143 p9285144 p9285145 p9285146 p9285147 p9285148 p9285149

Review questions

How caches work

Sets of fast containers (not as fast as registers; faster than memory)

Two new terms of art (Bryant and O'Hallaron, first edition, page 506):

Today's machines have a line size of 64B (8 words).

Example: load a word

Phenom 9850 has very large 128KB L1 cache

2048 cache lines

To deliver two words per cycle, cannot possibly search all 2048 lines. Therefore lines are divided into sets.

Search through sets is done by very simple hardware

Phenom 9850 uses 2 lines per set, so 1024 sets

  1. Split address into pieces (see Bryant and O'Hallaron, first edition, page 488)

     64           48               16              6      0
       | canonical  |     tag       |  set index   |offset|

    A pedagogical lie:

     64                            16              6      0
       |           tag              |  set index   |offset|

    (Changes only how big things are, not how things work.)

  2. A set holds two lines:

  3. Cache-search algorithm:

    1. Take 10-bit set index from address, choose a 2-line set

    2. In parallel, compare tags and check valid bits

    3. If tags match and line is valid, use offset to index into block

    4. Otherwise, we have a cache miss
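Step 1 of the search can be sketched in C for the geometry above (64-byte lines, 1024 sets, as in the Phenom example; `split_address` is a made-up name):

```c
#include <assert.h>
#include <stdint.h>

/* Split a 64-bit address for a cache with 64-byte lines (6 offset
   bits) and 1024 sets (10 set-index bits); the tag is the rest. */
struct split { uint64_t tag; unsigned set_index; unsigned offset; };

struct split split_address(uint64_t addr) {
    struct split s;
    s.offset    = addr & 0x3F;          /* low 6 bits */
    s.set_index = (addr >> 6) & 0x3FF;  /* next 10 bits */
    s.tag       = addr >> 16;           /* remaining high bits */
    return s;
}
```
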

What happens on a cache miss

Go to next level down and get a block

Find a place to put the block in the cache:

  1. Determine which set by looking at bits of address

  2. Within set, is any line invalid?


Question: is there a scenario in which the program uses just 3 bytes in memory, but they cannot all fit in L1 cache?

Why caching works

Programs show a property called locality

Idea: keep recently used bytes/words and nearby words in the cache

Three reasons a block might not be in the cache

Einstein #1: more addresses than will fit in a set

Einstein #2: more addresses than will fit in the cache

Non-Einsteinian #3:

Extra hardware to avoid misses

Block diagram:

Predictors can avoid "compulsory" misses!

Cache review, Split caches, prediction hardware

3 Oct 2011: Cache review and practice

pa035150 pa035151 pa035152

(Nobody took photos, but there was an in-class exercise)

Cache review

Address split

     64        48                 16              6      0
       |canonical|     tag          |  set index   |offset|

Set of two cache lines (2-way associativity):

         +---+  +---+  +------------------------+    +---------------------+
         | D |  | V |  |  tag   (32 bits)       |    |  Block  (64 bytes)  |
         +---+  +---+  +------------------------+    +---------------------+
    LRU ==                                             
         +---+  +---+  +------------------------+    +---------------------+
         | D |  | V |  |  tag   (32 bits)       |    |  Block  (64 bytes)  |
         +---+  +---+  +------------------------+    +---------------------+

Question: CPU wants a byte from a random address: what block is loaded into what cache line?



Example: Conflict on a set of size 2 with three addresses

Class exercise: modeling the cache

Assume cache lines are loaded on demand only

Examine graphs:

  1. For stride of size k, how many misses per load?

  2. Which hardware conforms to the model?

  3. What is going on with the Core2 Quad Q6700?

Other questions

The alignment conspiracy (AMD64 version)

Type         size  alignment
bool           1       1
char           1       1
short          2       2
int            4       4
long           8       8
pointer        8       8
float          4       4
double         8       8
long double   16      16    (80 bits)

call stack    --      16
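One way to see the conspiracy at work is to let the compiler lay out a struct and inspect the offsets (`struct padded` is a made-up example; the numbers assume the AMD64 alignments tabulated above):

```c
#include <assert.h>
#include <stddef.h>

struct padded {
    char  c;   /* offset 0 */
    /* 3 bytes of padding so 'i' lands on a 4-byte boundary */
    int   i;   /* offset 4 */
    void *p;   /* offset 8: pointers need 8-byte alignment */
};
/* total size 16: a multiple of the largest member alignment */
```
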

5 October 2011: Machine Arithmetic

pa055153 pa055154 pa055155 pa055156 pa055157 pa055158 pa055159 pa055160 pa055161 pa055162 pa055163

NOTE: In this lecture, I went wildly off script. The notes are not going to be very close to lecture. With luck I will have time to adjust the notes in retrospect. ---NR

Learning objectives

Current project wrapup

Points of emphasis on the homework:

Finite-precision arithmetic

Arithmetic more fun to teach than to learn:

Preface: Where have we been and where are we going

Where we've been:

Where we're going:

Arithmetic with limited precision

Topic                                      Base 10      Base 2
Positional notation                           1            7
Natural numbers (unsigned ints)               2            8
Signed-magnitude integers                     3         not used
Two's-complement integers                 impossible      10
Fixed-point numbers (scaled integers)         4           ---
Floating-point numbers                        5            9
Software emulation (rationals)               ---          ---
Software emulation (arbitrary precision)      6           ---
Bit vectors and logical operations         senseless      11

Key concepts


Natural numbers using positional notation with finite precision

n digits base b

d_{n-1} * b^{n-1} + ... + d_1 * b^1 + d_0 * b^0

Can represent any number k in the range 0 <= k < b^n

Examples: 112, 894

Addition: 112 + 894 = 1006

Question: how many additional digits needed to hold sum in general?

Question: how do you prove a theorem?

Multiplication: 112 * 894 = 100128

Question: how many digits are required to hold the product of an n-digit number and an m-digit number?

Question: do any of these theorems depend on the choice of base?

Signed-magnitude representation

Add a plus or minus sign to a number in positional notation

No examples; you've all done this to death.

Fixed-point representation of real numbers

Finite precision: 3 digits

Want to represent real numbers.

Idea: scaled integers

If the denominator is a power of 10, we can use the world-famous decimal point. (It works just as well with any other base.) Let's use the denominator 10:

Addition works as addition of integers:

112 + 89 = 201

11.2 + 8.9 = 20.1

11.2 + 8.94 = 20.14


8.94 + 8.94 = 17.88  (to three digits, 17.9 or 179/10)

8.9 + 8.9 = 17.8     (loss of accuracy in calculation)

Multiplication requires dividing by the scale factor

11.2 * 8.94 = 100.128 = 100.0   (to three digits; loss of accuracy)

11.2 * 8.9 = 99.68

112 * 89 = 9968    (overflow!)

112 / 10 * 89 = 11 * 89 = 979
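These examples translate directly into C with scale factor 10 (`fx_add` and `fx_mul` are made-up names for a sketch): addition needs no adjustment, multiplication must divide by the scale, and the truncation in fx_mul is exactly the loss of accuracy shown above.

```c
#include <assert.h>

/* Fixed-point numbers with scale 10: the integer 112 represents 11.2. */
int fx_add(int a, int b) { return a + b; }       /* scales already match */
int fx_mul(int a, int b) { return a * b / 10; }  /* divide out one scale */
```
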

Normalized numbers

Examples base 10 -- 3 digit significand

x = 10.1    normalized 1.01e1
y = 9.93    normalized 9.93e0
z = .707    normalized 7.07e-1

Typical coding exponent 10^n where n = exp - 5 and exp is one digit

12 Oct 2011: Binary arithmetic and floating-point numbers

pa125172 pa125173 pa125174 pa125175 pa125176 pa125177 pa125178 pa125179 pa125180 pa125181 pa125182 pa125183


Dismal laws of computing:

  1. Digits (bits) are a limited resource

  2. The computer works with a positional notation and really only "understands" integers.

  3. Real numbers can be approximated by scaling

Floating-point review: normalized and denormalized floating-point numbers

Anatomy of a floating-point number:

A number with n+1 digits of significand:

                frac                      where 0 <= frac < b^n
x = +/- (d +  ---------------) * b^e      where e typically is exp - bias
                b^n                       where n is # of digits - 1

Normalized if d > 0

Simple floating point, base 10, 3-digit significand, 1-digit exponent

3.14 * 10^0

314 * 10^-2

.000314 * 10^4

Most important concept: normalized numbers

(d + NN/100) * 10^[-4..4],   0 < d < b

Q: Can every number be normalized?

Denormalized numbers:

(d + NN/100) * 10^-5        0 <= d < b

Q: Are there nonzero numbers that can't be normalized?

Q: Which ones can be represented as denormalized numbers?

Zero is a denormalized number

What a normalized number looks like

Sometimes write 'mantissa' or 'significand'

              frac
m = (d +  -----------)
              b^n

IEEE double precision (binary format)

Question: what goes wrong with 2^60 + 1?

All the world's a bit vector

What's this?

1111 0100


Bit stands for binary digit.

It's positional arithmetic, base 2.

Two's-complement arithmetic

Most significant bit is interpreted as sign bit


1111 0100

-128 + 64 + 32 + 16    +  4

    -64   + 32 + 16    +  4

         -32   + 16    +  4
              -16      +  4

What therefore is the 6-bit

 11 0100 ?

What about 16-bit

1111 1111 1111 0100 ?

And 5-bit

1 0100

Truth: given a signed integer in two's-complement representation, you can extend the sign bit as far as you like.
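In C, sign extension is exactly what a widening conversion of a signed type performs; a sketch using the example byte (`sign_extend8` is a made-up name):

```c
#include <assert.h>
#include <stdint.h>

/* Widening a signed value replicates the sign bit: the conversion
   itself performs the sign extension described above. */
int16_t sign_extend8(int8_t x) {
    return x;  /* 1111 0100 becomes 1111 1111 1111 0100 */
}
```
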


Bitwise operations

Treat word as a bit vector, do Boolean operations in parallel:

&  bitwise and
|  bitwise or
~  bitwise complement

Complement changes 0 to 1 and 1 to 0. So if we write

k = -d_{n-1}*2^{n-1} + d_{n-2}*2^{n-2} + ... + d1 * 2^1 + d0 * 2^0

Then what are the digits of ~k

     (1-d_{n-1}) ... (1-d0)

So what is the integer value of ~k + 1?

     -(1-d{n-1})*2^{n-1} + (1-d_{n-2})*2^{n-2} + ... + (1-d1)*2 + (1-d0+1)*2^0

     -(1-d{n-1})*2^{n-1} + (1-d_{n-2})*2^{n-2} + ... + (1-d1)*2 + (2-d0)*2^0

     -(1-d{n-1})*2^{n-1} + (1-d_{n-2})*2^{n-2} + ... + (1-d1+1)*2 + (-d0)*2^0

     -(1-d{n-1})*2^{n-1} + (1-d_{n-2})*2^{n-2} + ... + (2-d1)*2 + (-d0)*2^0

                                    (2-d_i)*2^i -> 1 * 2^{i+1} - d_i*2^i

And eventually

     (1-(1-d_{n-1}))*2^{n-1} + (-d_{n-2})*2^{n-2} + ... + (-d1)*2 + (-d0)*2^0

       d_{n-1}*2^{n-1} + (-d_{n-2})*2^{n-2} + ... + (-d1)*2 + (-d0)*2^0
      -(-d_{n-1}*2^{n-1} + d_{n-2}*2^{n-2} + ... + d1*2 + d0*2^0)

      - k
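The derivation shows that ~k + 1 == -k in two's complement; this is easy to check in C (`negate_via_complement` is a made-up name; unsigned arithmetic keeps the wraparound well defined):

```c
#include <assert.h>
#include <stdint.h>

/* Two's-complement negation: complement all the bits, then add one. */
uint32_t negate_via_complement(uint32_t k) {
    return ~k + 1;
}
```
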

Multiple interpretations of bit vectors

Most important are the signed and unsigned integers.

What are 'Big-Endian' and 'Little-Endian'

Serial formats:

Big-Endian is most significant byte first

Little-Endian is least significant byte first

We now reveal technology for extracting bytes (and implementing Bitpack)


Shift right and left, >> and <<

Suppose the two most significant bits are identical?

What happens on a right shift if the least significant bits are not zero?

Combining bitwise operations with shifts

Boolean algebra plus shifter.

Recall our example byte:

1111 0100

Suppose we want to replace 101 with 010?

Or more generally, given


and given abc, how do we produce


Next problem: given two bytes




how do you compute

mnofghst ?

How about

unsigned field = Bitpack_getu(abcdefgh, 3, 0);
return Bitpack_newu(mnopqrst, 3, 2, field);
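The extraction step can be sketched as a shift and a mask (`getu_sketch` is a made-up name; a real Bitpack must also handle width == 64, where the shift below would be undefined):

```c
#include <assert.h>
#include <stdint.h>

/* Extract 'width' bits of 'word' starting at bit position 'lsb'. */
uint64_t getu_sketch(uint64_t word, unsigned width, unsigned lsb) {
    assert(width < 64 && width + lsb <= 64);
    uint64_t mask = ((uint64_t)1 << width) - 1;  /* width low-order 1s */
    return (word >> lsb) & mask;
}
```
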

17 Oct 2011: Miscellany

pa175192 pa175193 pa175194

In memoriam Dennis Ritchie

Learning objectives

Office hours cancelled; Tuesday instead.

Results from your homework:

Right and wrong ways to represent image transformations

What to expect from the midterm:

and a little bit of pointers.

Debugging and assembly code

Big transition: from data to code

Project: Defusing a binary bomb (by reverse engineering)

24 October 2011: C and assembly

pa245205 pa245206 pa245207 pa245208 pa245209 pa245210 pa245211 pa245212 pa245213 pa245214


Plan for the week




Introduction to assembly code

Calling convention: parameters and results

Parameter registers:

Result registers

How to analyze:

Computation by two-address code

Instructions may be

AMD64 alert: a memory reference can do a little arithmetic in the "address unit"!

Selected machine instructions to be aware of:

C code:

extern int universal_denominator;

int sqdiff(int num1, int num2) {
  int diff = num1 - num2;
  long int sq   = (long int) diff * (long int) diff;
  return (int) (sq / universal_denominator);
}


.globl sqdiff
        .type   sqdiff, @function
        subl    %esi, %edi   -- difference into %edi
        movslq  universal_denominator(%rip),%rax -- preload denominator
        movslq  %edi,%rdx    -- copy multiplier
        imulq   %rdx, %rdx   -- multiply into %rdx
        movq    %rax, %rcx   -- denominator into divisor register
        movq    %rdx, %rax   -- product into dividend register
        sarq    $63, %rdx    -- copy sign bit into most significant word
        idivq   %rcx         -- divide D/A pair by %rcx
        ret                  -- result is already in %rax

The same function using Register Transfer Language (curly braces contain assertions that hold between assembly-language statements):

  { precondition : num1 == rdi, num2 == rsi }
edi := edi - esi;                         // subl      %esi, %edi
  { diff == edi }
rax := sx64($m[universal_denominator]);   // movslq univ_den(%rip),%rax
  { diff == edi, rax == sx(universal_denominator) }
rdx := sx64(edi);                         //   movslq  %edi,%rdx
  { diff == edi, rax == sx(univ_denom), rdx == (long int) diff }
rdx := rdx * rdx;                         //   imulq   %rdx, %rdx
  { diff == edi, rax == sx(univ_denom), rdx == sq }
rcx := rax;                               //   movq    %rax, %rcx
  { rax == sx(univ_denom), rdx == sq, rcx == sx(univ_denom), ... }
rax := rdx;                               //   movq    %rdx, %rax
  { rax == sq, rdx == sq, rcx == sx(univ_denom), ... }
rdx := rdx >> 63;                         //   sarq    $63, %rdx
  { rax == sq, rdx == signbit(sq), rcx == sx(univ_denom), ...
       (therefore rdx:rax == sx128(sq)) }
rax := rdx:rax / rcx  |  rdx := rdx:rax mod rcx;
                                          //   idivq   %rcx
  { rax == sq / universal_denominator, rdx == sq % universal_denominator, ... }
return                                    //   ret

26 October 2011: Lowering, control flow

pa265215 pa265216 pa265217 pa265218 pa265219 pa265220 pa265221 pa265222 pa265223 pa265224 pa265225 pa265226 pa265227 pa265228

Class exercise: Lowering to normal form

Control flow

Condition codes

All decision making funnelled through condition codes, which Intel calls flags.

Translating control flow

Four basic tools:

Translating control flow

Basic conditional:


Short-circuit Booleans:

if (C1 && C2) goto L;


 if (!C1) goto Lnot;
 if (C2) goto L;

Short-circuit Booleans:

if (C1 || C2) goto L;

becomes what?

How about

while (C) S;

Try Forrest Baskett!

       goto Ltest;
Lhead: S;
Ltest: if (C) goto Lhead;

And also

for (exp1; C; exp2) S;

do S while(C);


switch (exp) {
  case 0: SS0;
  case 1: SS1;
  case 2: SS2;
  case 3: SS3;
  default: SSd;
}

Uses a jump table (Pidgin C):

static code *jump_table[4] = { L0, L1, L2, L3 }; // addresses

 t = exp
 if ((unsigned)t < 4) {
    t = jump_table[t];
    goto t;
 } else
    goto Ld;
L0: SS0;
L1: SS1;
L2: SS2;
L3: SS3;
Ld: SSd;
Lend: ...    

And break becomes goto Lend

Calls and returns (to return to later)

In the beginning was FORTRAN. No recursion, no stack!

No excuse for a recursive factorial function, except it makes a good example:

int fact(int n) {
  if (n == 0)
    return 1;
  else
    return n * fact(n-1);
}

Q: When fact(4) is calling fact(3), where is n-1?

Q: OK, then where is n?

Q: Where was n on procedure entry?

Q: We have a problem. What's the general shape of the solution?

Attack from a fully general point of view (timeless truth):

Here's the life cycle of a procedure:

What's wrong with this picture? Is it adequate to describe factorial?

Suppose we let the program run, and we suddenly freeze it at an arbitrary point:

Refine our idea of state:

If f() calls g() calls h(), how are their lifetimes related?

Is there any way that an activation of g() can outlive an activation of h()?

Therefore, what is the right data structure for activations?

Not covered in lecture: Mapping C data to the machine

Translation examples:

Key locations for values, including effective addresses (overview page 1)

  1. Registers (three kinds, two common)
  2. Special conventions around calls (see handout)

  3. Effective addresses:

Mapping C data onto machine data

Q: What is the C type system?

Mappings based on two principles: size and alignment

Example: what is the difference between

struct rgb { unsigned red; unsigned green; unsigned blue; };


unsigned pixels[3];

(In DDD, to inspect contents of arrays (and sometimes structs), use the Memory... tool on the Data menu.)

Example: how is a linked list of unsigneds represented?

struct node {
  unsigned val;
  struct node *next;
};

Q: What C could lead to the addressing mode 4(%ebx,%ecx,1)?

31 October 2011: A Universal Machine

pa315229 pa315230 pa315231 pa315232 pa315233 pa315234 pa315235 pa315236 pa315237

A Turing-complete machine in 14 instructions

Why the Universal machine?

Architecture of the Universal Machine

A 32-bit machine:

A segmented memory architecture:

Word-oriented, RISC instruction set architecture

Computational capabilities deliberately limited:

Instruction set

Emulating the Universal Machine

Keep representation of machine state:

Interface: given an initial 'segment 0' (binary program), run until reaching a halt instruction, then return.

Design of UM software

Asking for a more modular design:

2 November 2011: UM Design and Documentation

pb025238 pb025239 pb025240 pb025241 pb025242 pb025243 pb025244 pb025245 pb025246

Representation is still the essence of programming.

Let's look at the data in the problem:

What invariants are true even in the world of ideas (about the abstractions)?

Which modules should know about which representations?

Which representations should be public?

Which representations should be private?

UM segmented memory architecture

World of ideas:

World of implementation

7 November 2011: Documentation (as expected for UM)

pb075250 pb075251 pb075252 pb075253 pb075254 pb075255 pb075256 pb075257

Here is what I expect for the Universal Machine:

What I'm looking for in your code and documentation



9 November 2011: Refactoring and bit operations

pb095261 pb095262 pb095263 pb095264 pb095265 pb095266 pb095267 pb095268 pb095269 pb095270 pb095271

Results of laboratory exercise

Programmers, break your chains! Kill the mouse.

Comments on image compression

Notes on functional correctness:

Notes on structure and organization:


Refactoring: making things better

Transformation of code to improve readability and internal structure.

A huge thing in industry.

Simpler is better.

Some ideas:

Suggested procedure for refactoring:

  1. Work with your programming partner

  2. When session is over, print out code

  3. Each of you can review code on printout

  4. At next meeting, discuss proposed revisions and simplifications

  5. Implement the best ones

Example: Bitpack_newu in 21 lines:

uint64_t Bitpack_newu(uint64_t word, unsigned width, 
                      unsigned lsb, uint64_t value) {
  if (width + lsb > 64) {
    /* checked runtime error (body elided) */
  }
  if (!Bitpack_fitsu(value, width)) {
    /* checked runtime error (body elided) */
  }
  uint64_t one = 1;

  uint64_t filter = shift_left64(one, width) - 1;
  filter = shift_left64(filter, lsb);

  uint64_t result = word & (~filter);
  uint64_t insert = shift_left64(value, lsb);
  result = result | insert;

  return result;
}

Simplified version in 11 lines:

static uint64_t one = 1; // can be shared among multiple procedures

uint64_t Bitpack_newu(uint64_t word, unsigned width, 
                      unsigned lsb, uint64_t value) {
  if (width + lsb > 64 || !Bitpack_fitsu(value, width))
    ; /* checked runtime error (body elided) */

  uint64_t mask =  shift_left64(one, width) - 1;
           mask = ~shift_left64(mask, lsb);
  return (word & mask) | shift_left64(value, lsb);
}

Factors of two matter!!

Review of bit operations

Fits in signed bits, fits in unsigned bits: arithmetic and visual

New term: mask

14 Nov 2011: Code tuning

pb145272 pb145273 pb145274 pb145275 pb145276 pb145277 pb145278 pb145279 pb145280 pb145281

New Learning outcomes:

Performance targets:

How fast is fast enough?

First and second rules of performance optimization (Michael Jackson)

  1. Don't do it

  2. Don't do it yet

Amdahl's Law

The "bank robbery" model of code tuning:

Q: Mr Sutton, why do you rob banks?

A: That's where the money is.

A law of mathematics that says if input takes only 5% of a program's execution time, then even if you make input infinitely fast, you speed up the whole system by at most 5%. The full version of Amdahl's Law tells you exactly how to calculate the speedup of the whole from the speedup of a part.

Amdahl's law tells you that there are certain parts of a program which it is a waste of time to try to make faster.
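The full version can be stated in one line: if a fraction f of the running time is sped up by a factor s, the overall speedup is 1 / ((1 - f) + f/s). A sketch (`amdahl` is a made-up name):

```c
#include <assert.h>

/* Amdahl's Law: overall speedup when fraction f of execution time
   (0 <= f <= 1) is accelerated by factor s. */
double amdahl(double f, double s) {
    return 1.0 / ((1.0 - f) + f / s);
}
```

As f stays fixed and s grows without bound, the result approaches 1/(1-f): making the 5% part infinitely fast gives at most about a 5% improvement.
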

The 80/20 rule (not a law of nature):

The 80/20 rule is not true all the time, but when it is true, it's worth taking advantage of.

Tools and techniques for finding hot spots

kcachegrind demo

Functions/procedures as the units of profiling

valgrind and cachegrind report on functions, so that's our first unit of thought. There are three ways to make a function take less time (== less area on the visualization):

Performance killers

  1. The biggest killer is malloc() without free(). This may occur in two ways:

  2. The second biggest performance killer, of a lower order, is malloc() with free(). It is true across languages that allocating a heap object is among the most expensive things you can do. Because of technology trends, this is likely to remain true for some time.

  3. The third performance killer is loads and stores---memory traffic. Example: here's a relatively efficient representation of two-dimensional arrays:

     struct Array2_T {
       int width, height;
       char *elements;
     };

    (Draw the representation)

    To get to an array element i, you have to dereference two pointers and do address arithmetic twice (one address is static and is coded into the instruction; the other is dynamic):

     a->elements[i]  // same as *((*a).elements+i)

    Now here's a trick:

     struct Array2_T {
       int width, height;
       char elements[];
     };

    (Draw the trick): the C code is the same, but what happens at the machine level is quite different: it requires only one pointer dereference and one arithmetic (again done in the address unit):

     a->elements[i]  // something like *(a + offset(elements) + i)

    To allocate the first kind of array:

     struct Array2_T *p = malloc(sizeof(*p));
     p->elements = malloc(n * sizeof(*p->elements));

    To allocate the second kind of array:

     struct Array2_T *p = malloc(sizeof(*p) + n * sizeof(*p->elements));

    The advantages:

    The disadvantage:

16 November 2011: Code tuning, continued

pb165282 pb165283 pb165284 pb165285 pb165286 pb165287 pb165288


Fundamental principles:

Performance killers:

Four more performance killers

More killers:

  1. The fourth performance killer is some function calls.

  2. The fifth performance killer is unnecessary recomputation in loops.

  3. The sixth performance killer is control flow, especially conditional branches. Anytime you can use pure arithmetic or persuade the compiler to generate a cmov instruction, you win.

  4. Arithmetic is free, except possibly for integer divide. (Maybe integer multiply costs a little, too.)

Performance: refactoring loops

What to do with the following:

struct T {
  int width, height;
  int size;
  UArray_T elements; /* UArray_T of 'height * width' elements,
                        each of size 'size', in row-major order */
   // Element (i, j) in the world of ideas maps to
   //   elements[i + width * j], where the square brackets
   //   stand for access to a Hanson UArray_T
};

void UArray2_map_row_major(struct T *a2,
    void apply(int i, int j, struct T *a2, void *elem, void *cl), void *cl) {
  for (int k = 0; k < a2->width * a2->height; k++) {
    int col = k % a2->width;
    int row = k / a2->width;
    apply(col, row, a2, UArray_at(a2->elements, k), cl);
  }
}

21 Nov 2011: Code improvement wrapup

pb215295 pb215296 pb215297 pb215298

Where have we been & where are we going?

Performance: function inlining and specialization

How does function-call overhead compare for Array_get vs Bitpack_getu?

Key knowledge: the cost of assert()

Class exercise!

What's to be gained here by inlining Array_get(segment, r4)?

void *Array_get(T array, int i) {
    assert(i >= 0 && i < array->length);
    return array->array + i*array->size;
}

And what's to be gained here by inlining Bitpack_getu(instr, 3, 6)?

static inline uint64_t shl(uint64_t word, unsigned bits) {
  assert(bits <= 64);
  if (bits == 64)
    return 0;
  else
    return word << bits;
}

static inline uint64_t shr(uint64_t word, unsigned bits) { // shift R logical
  assert(bits <= 64);
  if (bits == 64)
    return 0;
  else
    return word >> bits;
}

uint64_t Bitpack_getu(uint64_t word, unsigned width, unsigned lsb) {
  unsigned hi = lsb + width; // one beyond the most significant bit
  assert(hi <= 64);
  return shr(shl(word, 64 - hi), 64 - width); // different type of right shift
}

Using the world of ideas for advanced code tuning

What operations need to be fast? (Profile and count!)

What operations can be omitted?

Can you specialize the abstraction or change a data structure or algorithm?

28 November 2011: Universal Machine Macro Assembler

pb285299 pb285300 pb285301 pb285302 pb285303 pb285304 pb285305 pb285306

Getting from source to machine code: deepen and consolidate your knowledge

Assembling via named sections

The assembler accumulates relocatable object code in a sequence of named sections.

Typical sections for real hardware:

Example translation of C into sections

Simple program:

int main() {
  puts("Hello, world\n");
  return 0;
}
L17:
.byte 'H'
.byte 'e'
.byte 'l'
.byte 'l'
.byte 'o'
.byte ','
.byte ' '
.byte 'w'
.byte 'o'
.byte 'r'
.byte 'l'
.byte 'd'
.byte '\n'
.byte 0

%rdi := L17    // mov L17, %rdi
call puts
%rax := 0

Label definitions that are remembered

Relocation entries

What happens in init, text, rodata?

Not shown: who calls main???

  set up call stack
  <get argument string from operating system>
  build argv and compute argc
  rc = main(argc, argv);

Sections on the Universal Machine

Typical sections for Universal Machine:

Your assembler is agnostic about the identity of sections (more modular reasoning)

In the UM Macro assembler, the first section is always init

We will not translate start code, but we know it is in the init section.

Sections, the state of the assembler, and its invariants

State includes the input read so far

Fundamental data structures:

Invariant (sections in the assembler):

  1. The list of sections L relates to the input read in the following way:

    Q: What needs to be done to establish and maintain this invariant?

    Relevant parts of the interface:

    T Umsections_new(const char *section, 
                     int (*error)(void *errstate, const char *message),
                     void *errstate);
       /* Create a new assembler which emits into the given section.
          The error function, which is called in case of errors,
          is guaranteed not to return. */
    void Umsections_section(T asm, const char *section);
      /* make the named section current, creating if needed */
    void Umsections_emit_word(T asm, Umsections_word data);
      /* Emit a word into the current section */

Invariant (the current section):

  1. The current section is the section named in the most recently read .section directive, or if no .section directive has been read, it is the initial section named in the call to Umsections_new().

    Q: What needs to be done to establish and maintain this invariant?

Definition: a word is emitted into a section if that word is emitted while the section is current.

Invariant (contents of a section):

  1. The contents of a section consists of all the words ever emitted into that section, in order.

    Q: What needs to be done to establish and maintain this invariant?

Enforcing the phase distinction: the linker

The link phase retires all the outstanding relocation entries. If a relocation entry cannot be retired (because the definition of a label is missing), the linker calls the error function:

When all entries have been discharged, a binary can be written:

void Umsections_write(T asm, FILE *output); 
  /* Write the contents of each section stored in asm,
     in the order in which they appear in asm's sequence.
     Write the words to file 'output' in UM format. */

30 November 2011: Course evaluations, Macro instructions, Calls

pb305308 pb305309 pb305310 pb305311 pb305312 pb305313 pb305314 pb305315

Course evaluations

Points of emphasis:

Macro assembler Review

In metaphor:

Goal: umasm is easier to talk to, because it understands more. It is easier to write code for the Macro Assembler than to create UM code directly.

Technically, what appears to be one instruction expands to a sequence of instructions:

More macro instructions

Macro assembly code examples

No procedures, just code:

.zero r0
.temps r7
L: r1 := input()
 r2 := ~r1 // macro instruction
 r3 := exit
 if (r2 != 0) r3 := write
 goto r3
write: output r1
 goto L
exit: halt

The goto is a macro instruction that Santa expands as follows:

r0 := 0                    // Load Value instruction
r1 := L                    // load relocatable address
goto *r1 in program m[r0]  // Load Program instruction

Macro instructions and your homework

Six macro instructions:

typedef enum Ummacros_Op { MOV = LV+1, COM, NEG, SUB, AND, OR } Ummacros_Op;
  /* move, one's complement (~), two's-complement negation (-),
     subtract, bitwise &, bitwise | */

The corresponding assembly code:

r1 := r2        // MOV (move)
r1 := ~r2       // COM (one's complement)
r1 := -r2       // NEG (negate)
r1 := r2 - r3   // SUB (subtract)
r1 := r1 - r2   // SUB (subtract)
r1 := r2 & r3   // AND
r1 := r2 | r3   // OR

Instruction emission just as in the lab on UM unit testing: emit instructions into a sequence:

void Ummacros_op(Umsections_T asm, Ummacros_Op operator, int temporary,
                 Ummacros_Reg A, Ummacros_Reg B, Ummacros_Reg C);
void Ummacros_load_literal(Umsections_T asm, int temporary,
                           Ummacros_Reg A, uint32_t k);

Macro instructions and algebraic laws

5 December 2011: Procedures in macro assembler

pc055319 pc055320 pc055321 pc055322 pc055323 pc055324 pc055325 pc055326 pc055327 pc055328 pc055329 pc055330 pc055331 pc055332 pc055333 pc055334 pc055335 pc055336 pc055337

Return of profile assignment

Learning outcomes

SPEED DEMON achievement

Today's class

You have everything you need, but for some of it, maybe you could use a refresher. Tally votes on

Control-flow review



What do we need for a procedure call?

Salient fact: not enough registers

Calling and register-use conventions

What do we need to know about calling conventions?

Register conventions:

Compiler trick: if we need registers, save and restore r3, r4 on stack

Implementing a call instruction

What does a call instruction look like?

Must transfer control while capturing return address in register r1.
What ideas do you have?

The idea "call main if r6, r7 are temporaries and r0 is 0" translates to

 r1 := this_ra             // Load Relocatable (LV)
 r7 := main                // Load Relocatable (LV)
 goto *r7 in program m[r0] // Load Program
this_ra:                   // labels next instruction

My front end accepts this code:

goto main linking r1  // goto main, setting r1 to the return address

Q: Why is it good to put the return address in a register and not on the stack?
A: "Optimized leaf routines" can run without a stack---sometimes without any data traffic at all.

Startup code

Here's what startup code looks like (equivalent of crt0.o):

.section text
.zero r0
.temps r6, r7

 goto main linking r1

Using the call stack

We still need a stack:

How to allocate and initialize the call stack (equivalent of crtN.o):

.zero r0
.section stk
.space 100000
.section init
r2 := endstack

Parameter passing: on the stack

m[r0][r2+0] // first parameter
m[r0][r2+1] // second parameter

Q: Why is the call stack in segment 0?
A: So we don't have to burn two registers to point to it

Macro instructions for the stack:

.zero r0
.temps r6, r7

push r3 on stack r2 // what does it expand to?

r6 := -1 // macro instruction; implementation elided
r2 := r2 + r6
m[r0][r2] := r3

And also pop:

pop r3 off stack r2 // what does it expand to?

r3 := m[r0][r2]
r6 := 1
r2 := r2 + r6

These instructions can be used to save and restore registers.

Recursive functions

If the return address is in a register, how are we going to implement a recursive function?

Answer: save it on the stack.
Here is a skeleton for any procedure:

 push r1 on stack r2 // save return address
    // where is the first parameter now?
    // it's at m[r0][r2+1]

 ... // do many things
 r1 := the return value

 pop r5 off stack r2 // OK to overwrite r5; it's volatile
 goto r5             // passes for a return


A recursive factorial function

We have

fact(n) = if n == 0 then 1 else n * fact(n-1)


Because the parameter n is live across the recursive call, we'll keep it in a nonvolatile register. This means we must save and restore the register. (I've chosen r3.)


fact:
 push r1 on stack r2 // save return address
 push r3 on stack r2 // save nonvolatile register
 r3 := m[r0][r2+2] // first parameter 'n'

 if (r3 == 0) goto base_case

// recursive call
 r5 := r3 - 1             // OK to step on volatile register
 push r5 on stack r2      // now 'n-1' is a parameter
 goto fact linking r1     // make the recursive call
 pop stack r2             // pop parameter
 r1 := r3 * r1            // final answer
 goto finish_fact

base_case:
 r1 := 1

finish_fact:
 pop r3 off stack r2   // restore saved register
 pop r5 off stack r2   // put return address in r5
 goto r5               // return

A main routine

Try this

void main(void) {
  for (int i = 10; i != 0; i--) {
    print_d(i);             // print i in decimal (see exercise below)
    fputs("! == ", stdout); // output without a trailing newline
    print_d(fact(i));
    putchar('\n');
  }
}

Output is

10! == 3628800
9! == 362880
8! == 40320
7! == 5040
6! == 720
5! == 120
4! == 24
3! == 6
2! == 2
1! == 1

Assembly code is

main:
 push r1 on stack r2
 push r3 on stack r2
 r3 := 10
 goto main_test

main_loop:
 push r3 on stack r2
 goto print_d linking r1
 pop stack r2 // pop parameter

 output "! == "

 push r3 on stack r2
 goto fact linking r1
 push r1 on stack r2
 goto print_d linking r1
 r2 := r2 + 2 // pop 2 parameters
 output '\n'

 r3 := r3 - 1
main_test:
 if (r3 != 0) goto main_loop

 pop r3 off stack r2
 pop r1 off stack r2
 goto r1

Printing numbers in decimal notation

Left as an exercise

Implementing the RPN calculator


Register conventions:

Macro instructions that help with calls

goto f linking r1

push r1 on stack r2
push r3 on stack r2
pop r3 off stack r2
pop r5 off stack r2
goto r5

12 Dec 2011: Exams, past and future

Midterms: criteria were 1 (depth) + 2 (competence) = 25 pts + 2*10 pts = 45 pts

Midterm grade ranges:

65+        Excellent
45-64      Very Good
37-44      Good
25-36      Fair
under 25   Poor

Work needed on

Expect to see all this on the final

I apologize for question 8; sometimes a new question just doesn't work.

Dramatic improvement over midterm performance will be noted

Contract vs invariant: length of heap

Back to class home page