COMP40 Assignment: Code Improvement Through Profiling


Assignment due Tuesday, November 22 at 11:59 PM. There is no design document for this assignment.

Purpose and overview

The purpose of this assignment is to learn to use profiling tools and to apply your knowledge of machine structure and assembly-language programming. You will improve the performance of two programs.

What we expect of you

Use code-tuning techniques to improve the performance of two programs:

  • ppmtrans, your image-rotation program from the array-rotation assignment
  • um, your Universal Machine emulator

If you wish, you may make special arrangements with Professor Ramsey to replace one of these two programs with Song Search from COMP 15.

The key parts of the assignment are to identify bottlenecks using valgrind and to improve the code by increments. You will therefore want to do most of your profiling on small inputs.
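
For instance, a minimal profiling run on the midmark (a sketch, assuming your UM binary is named um and midmark.um is in the current directory) might look like this:

  # Sketch: profile the UM on a small input.
  # callgrind counts every instruction, so runs are slow; small inputs keep them short.
  valgrind --tool=callgrind ./um midmark.um > /dev/null
  # Summarize per-function instruction counts from the newest profile.
  callgrind_annotate $(ls -t callgrind.out.* | head -1)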

Your grade will be based on three outcomes:

Your starting point

Please begin with your code in the state it was after the array-rotation and Universal Machine assignments. If your code did not work, you may fix it, or you may start with Professor Ramsey's array-rotation code and another student's Universal Machine. If you have not yet completed the UM, you may not look at another student's UM until you have submitted.

Please take baseline measurements of your code as submitted. (If you have already changed your Universal Machine, don't worry; your CS department account should have received an email containing your UM as submitted.)
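
A baseline measurement can be as simple as the following sketch, which assumes the -rotate interface from the array-rotation assignment; the image name is a placeholder:

  # Sketch: baseline wall-clock time for the submitted ppmtrans (image name is a placeholder).
  /usr/bin/time -f "baseline rotate 180: %e seconds" ./ppmtrans -rotate 180 large.ppm > /dev/null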

Tracking changes as you make them

During this assignment, you may run into a dead end that requires you to go back to a previous version of your code. We recommend, but do not require, that you use git to keep track of each stage of improvement in your code.

Your first source for all things git should be a fellow student who has learned git in one of Ming Chow's courses, or failing that, Ben Lynn's online tutorial Git Magic. One task for which you might need extra help is going back to an older snapshot and creating a branch starting from that snapshot.

Laboratory notes

Begin by choosing a data set. For image rotation, choose one large image (at least 25 megapixels) and three small images (about 100 thousand pixels each). [If you do not have a large image, you can make a small image larger by using pnmscale with a scale factor greater than one.] For the Universal Machine, you will use the small midmark benchmark, the large sandmark benchmark, and a partial solution to the adventure game.
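
For example (file names are placeholders), you can decode a JPEG with djpeg and then blow it up with pnmscale; scaling by a factor of k multiplies the pixel count by k².

  # Sketch: enlarge a small image (file names are placeholders).
  djpeg small.jpg > small.ppm       # decode JPEG to PPM
  pnmscale 5 small.ppm > large.ppm  # 5x in each dimension, 25x the pixels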

For each program, at each stage, for each input, please record in your laboratory notes the running time, the instruction count reported by callgrind (for the small inputs), the ratios of those numbers relative to your starting point and to the previous stage, and a brief description of the change you made.

You can see some sample reports (using made-up data) in the table below.

When you change the code, it is critical that each set of changes be small and isolated. Otherwise you will not know which changes are responsible for which improvements.

  1. Your starting point should be your code as submitted, compiled and linked with your original compile scripts.
  2. Your first change should be to compile with -O1 and to link with -lcii40-O1, which must come before other libraries.
  3. Your second change should be to compile with -O2 and to link with -lcii40-O2.
  4. After that you can start profiling with callgrind and kcachegrind and improving your code based on the results.
Keep in mind that -O2 is not always better than -O1.
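
As a sketch of the first two stages, the compile and link commands might change as shown below; everything except the -O flags and the -lcii40-O1/-lcii40-O2 libraries is a placeholder for whatever your original compile script already does.

  # Stage 1 sketch: compile with -O1 and link against the optimized CII library,
  # which must come before the other libraries. The source files and remaining
  # libraries shown here are placeholders.
  gcc -O1 -o ppmtrans ppmtrans.c a2plain.c a2blocked.c uarray2b.c \
      -lcii40-O1 -lnetpbm -lm
  # Stage 2 is identical except for -O2 and -lcii40-O2.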


  Benchmark  Time    Instructions  Rel to start  Rel to prev  Improvement
  big        30s                   1.000         1.000        No improvement (starting point)
  small      1s      75.02×10^6    1.000         1.000
  big        28s                   0.933         0.933        Compiled with optimization turned on and linked against -lcii-O1
  small      900ms   69.21×10^6    0.920         0.920
  big        28s                   0.933         1.000        Compiled with optimization turned on and linked against -lcii-O2
  small      900ms   69.18×10^6    0.920         1.000
  big        25s                   0.833         0.893        Removed Array_width() call from for loop and placed result in local variable instead
  small      833ms   62.01×10^6    0.833         0.926
  big        22s                   0.733         0.880        Removed array->blocks expression from loop and placed result in local variable
  small      800ms   56.16×10^6    0.800         0.960
  big        60s                   2.000         2.727        Used explicit for loop instead of blocked-array mapping function. Time improved for small image but got worse for big image; undid change.
  small      650ms   49.20×10^6    0.650         0.813
  big        18s                   0.600         0.300        Changed representation of blocks so that access to elements within the blocked mapping function uses unsafe pointer arithmetic without bounds checking
  small      600ms   44.89×10^6    0.600         0.923

Sample report for blocked image rotation (made-up data)


Analysis of assembly code

Once you have improved the code as much as you can, use valgrind and kcachegrind to find the single routine that takes the most time. (You can find it by clicking on the Self tab in kcachegrind.) For this final phase you may want to use the --dump-instr=yes option so you can see the assembly code in kcachegrind. The advantage of kcachegrind over objdump -d is that it will tell you how many times each instruction was executed.
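
Concretely, such a run might look like the following sketch; the -rotate interface is the one from the array-rotation assignment, and the image name is a placeholder:

  # Sketch: collect per-instruction counts so kcachegrind can annotate the assembly code.
  valgrind --tool=callgrind --dump-instr=yes ./ppmtrans -rotate 180 small.ppm > /dev/null
  kcachegrind $(ls -t callgrind.out.* | head -1)   # open the newest profile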

Once you've found the routine in which the most time is spent, examine the assembly code and either identify specific ways (changes to the assembly code itself) in which it could be improved or argue that there is no obvious way to improve the assembly code.

Do this exercise for both the ppmtrans and um binaries.

Things to look for include:

Be alert for a horrible idiom in the Intel assembly language: the instruction
  mov %esi, %esi
looks redundant, but it's not. This idiom stands for an instruction that zeroes out the most significant 32 bits of the 64-bit register %rsi. Whoever allowed this syntax into the assembler should be shot.


For this assignment, there is no need to modify assembly code.

Performance of the final stages

All profiling and measurements must be done on the Intel Q6700 machines in Halligan 116 or 118. Almost all the machines in 116 and 118 are suitable; to be sure you have the right kind of machine, run the command

  grep 'model name' /proc/cpuinfo
and make sure the output is
  model name      : Intel(R) Core(TM)2 Quad CPU    Q6700  @ 2.66GHz
  model name      : Intel(R) Core(TM)2 Quad CPU    Q6700  @ 2.66GHz
  model name      : Intel(R) Core(TM)2 Quad CPU    Q6700  @ 2.66GHz
  model name      : Intel(R) Core(TM)2 Quad CPU    Q6700  @ 2.66GHz
Any other output means you have the wrong kind of machine and your time measurements will not be consistent. In particular, the machines in 120 are not suitable for this assignment.

The importance of using the lab machines cannot be overstated: the machine you're using matters.

Measure your code with both gcc -O1 and gcc -O2. Neither one is faster for all problems; report the better of the two results. In your final compile script, use whichever gives the best results.

You will be evaluated both on your improvement relative to the code you start with and on the absolute performance of your final results. This means it is easier for you to get top marks if you start with your own code rather than NR's, since his has less room to be improved. Your laboratory notes will record all your improvements and the performance of your final stages.

What to submit

Please use the submit40-profile script to submit the following items.

  1. A compile file, which, when run with sh, compiles all your source code and produces both the ppmtrans and um binaries.
  2. A run file, which, when run with sh, runs all your test cases. For accurate performance measurements, large inputs should be copied to /data or /tmp, so they reside on a local disk. (See the section "Secrets of the shell-programming masters" below.) Here's a sample for ppmtrans; you will want to change it to suit your own image files:

      #!/bin/sh
      . /usr/sup/use/use.sh
      use comp40
      img=`tempfile --suffix=.ppm`

      djpeg big.jpg > $img
      time -f "large rotate 90: %e seconds"  ppmtrans -rotate 90  $img > /dev/null
      time -f "large rotate 180: %e seconds" ppmtrans -rotate 180 $img > /dev/null

      for i in small1.jpg small2.jpg small3.jpg
      do
        djpeg $i > $img
        time -f "rot $i 90 deg: %e seconds"  ppmtrans -rotate 90  $img > /dev/null
        time -f "rot $i 180 deg: %e seconds" ppmtrans -rotate 180 $img > /dev/null
      done

    (Because we are using a sane, sensible shell, we can use time instead of /usr/bin/time.)

    And here is an example for the UM:

      #!/bin/sh
      . /usr/sup/use/use.sh
      use comp40
    
      for i in midmark.um sandmark.umz
      do
        time -f "um $i: %e seconds" um $i > /dev/null
      done
    

  3. A README file which
    • Identifies you and your programming partner by name
    • Acknowledges help you may have received from or collaborative work you may have undertaken with others
    • Explains what routine in the final ppmtrans takes up the most time, and says whether the assembly code could be improved
    • Explains what routine in the final um takes up the most time, and says whether the assembly code could be improved
    • Says approximately how many hours you have spent analyzing the problems posed in the assignment
    • Says approximately how many hours you have spent solving the problems after your analysis
  4. A labnotes.pdf file that gives your laboratory notes in nice, readable format
  5. All images and benchmarks you used as test data, preferably in a compressed format like jpg or png

Methods of improving performance

In performance, really big wins usually come from better algorithms which provide superior asymptotic complexity. But for our programs, there is not much algorithmic action; everything should be pretty much linear in either the number of pixels (images) or the number of UM instructions executed (Universal Machine). You can often improve things by changing your data structures.

Here are some trite but true observations:

  • Memory references are expensive.
  • Procedure calls are expensive.

What if your program is nothing but memory references and procedure calls?! How can you make progress?

There are some external sources you might find useful.


  n
  take bolt
  take spring
  inc spring
  take button
  take processor
  take pill
  inc pill
  take radio
  take cache
  comb processor cache
  take blue transistor
  comb radio transistor
  take antenna
  inc antenna
  take screw
  take motherboard
  comb motherboard screw
  take A-1920-IXB
  comb A-1920-IXB bolt
  comb A-1920-IXB processor
  comb A-1920-IXB radio
  take transistor
  comb A-1920-IXB transistor
  comb motherboard A-1920-IXB
  take keypad
  comb keypad motherboard
  comb keypad button
  s

Partial solution to the adventure game



Partial solution to the adventure game

The figure above gives a partial solution to the adventure game. This solution can be made into a benchmark that is intermediate in difficulty between the midmark and the sandmark.
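
One way to do that (a sketch only; advent.umz and advent-solution.txt are assumed names for the adventure binary and for a file holding the commands in the figure) is to pipe the commands into your UM and discard the output:

  # Sketch: turn the partial adventure solution into a timed benchmark.
  # 'advent.umz' and 'advent-solution.txt' are assumed file names.
  /usr/bin/time -f "adventure: %e seconds" um advent.umz < advent-solution.txt > /dev/null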

Secrets of the shell-programming masters

Here are some untested ideas for automatically copying files to /data or /tmp lazily, on demand. They should work with #!/bin/bash or #!/bin/ksh. The ideas are:

First, shell function datafile names a file in a subdirectory of /data that belongs just to you, like /data/nr:
 
  function datafile {      # pathname of a personal file in /data
    echo "/data/$USER/$1"
  }
Second, shell function in_data takes an arbitrary file and copies it to your /data directory. Copying is done only if a file of the same name is not already present:
 
  function in_data {   # possibly copy a file to /data; return its path there
    # set -x # uncomment me to see commands execute
    typeset data="$(datafile "$(basename $1)")"
    if [[ ! -r "$data" ]]; then # file is not already in /data
      mkdir -p "$(dirname "$data")"  # make the directory
      cp -v "$1" "$data"             # copy the file
    fi
    echo "$data"                     # print the new location
  }
You can now use this function routinely to make sure that every "big input" is in /data:
 
  # set -x # uncomment me to see commands execute
  for i in biginput1 biginput2 biginput3
  do
    /usr/bin/time -f ... my_program $(in_data $i)
  done