COMP40 Assignment: Code Improvement Through Profiling

Fall 2008

COMP40 Assignment: Code Improvement Through Profiling

Assignment due Thursday, December 3 at 11:59 PM. There is no design document for this assignment.

Table of Contents

Purpose and overview

The purpose of this assignment is to learn to use profiling tools to apply your knowledge of machine structure and assembly-language programming. You will improve the performance of two programs.

What we expect of you

Use code-tuning techniques to improve the performance of two programs:

The key parts of the assignment are to identify bottlenecks using \upshapevalgrind and to improve the code by increments. You will therefore want to do most of your profiling on small inputs.

Your grade will be based on three outcomes:

Your starting point

Please begin with your code in the state it was after the array-rotation and Universal Machine assignments. If your code did not work, you may fix it, or you may start with my array-rotation code and another student's Universal Machine.

Please take baseline measurements of your code as submitted. (If you have already changed your Universal Machine, don't worry; your CS department account should have received an email containing your UM as submitted.)

Laboratory notes

Begin by choosing a data set. For image rotation, choose one large image (at least 25 megapixels) and three small images (about 100 thousand pixels each). [If you do not have a large image, you can make a small image larger by using \texttt{pnmscale} with a scale factor greater than one.] For the Universal Machine, you will use the small midmark benchmark, the large sandmark benchmark, and a partial solution to the adventure game.

You have three operations: rotate 90, rotate 180, and run um.

At each stage, for each input, please

When you change the code, it is critical that each set of changes be small and isolated. Otherwise you will not know what changes are responsible for what improvements.
  1. Your starting point should be your code as submitted, compiled and linked with your original compile scripts.
  2. Your first change should be to compile with -O1 and to link with -lcii-O1, which must come before other libraries.
  3. Your second change should be to compile with -O2 and to link with -lcii-O2.
  4. After that you can start improving the code.
Keep in mind that -O1 is not always better than -O2.


Here is sketch with some made-up examples:
BenchmarkTimeInstructionsRelativeImprovement
big30s--- 1.000 No improvement (starting point)
small1s75×10^6 1.000
big28s--- 0.933 Compiled with optimization turned on and linked against -lcii-O1
small900ms69×10^6 0.920
big28s--- 0.933 Compiled with optimization turned on and linked against -lcii-O2
small900ms69×10^6 0.920
big25s--- 0.833Removed Array_width() call from for loop and placed result in local variable instead
small833ms62×10^6 0.833
big22s--- 0.733 Removed array->blocks expression from loop and placed result in local variable
small800ms56×10^6 0.800
big60s--- 2.000 Used explicit for loop instead of blocked-array mapping function. Time improved for small image but got worse for big image---undid change.
small650ms49×10^6 0.650
big18s--- 0.600 Changed representation of blocks so that access to elements within the blocked mapping function uses unsafe pointer arithmetic without bounds checking
small600ms45×10^6 0.600

Analysis of assembly code

Once you have improved the code as much as you can, use valgrind and kcachegrind to find the single routine that takes the most time. (You can find it by clicking on the Self tab in kcachegrind.) For this final phase you may want to use the --dump-instr=yes option so you can see the assembly code in kcachegrind. The advantage of kcachegrind over objump -d is that it will tell you how many times each instruction was executed.

Once you've found the routine in which the most time is spent, examine the assembly code and either identify ways in which it could be improved or argue that there is not an obvious way to improve it.

Do this exercise for both ppmtrans and um binaries.

For this assignment, there is no need to modify assembly code.

Performance of the final stages

All measurements will be taken on the machines in Halligan 118. This means you should do your own profiling and measurements on those machines, by remote login if necessary. The importance of using the lab machines cannot be overstated. I recently made what I thought was a minor change to my Universal Machine, intended to improve modularity. On my computer at home, the new code was 25% slower. On the linux.cs.tufts.edu server, the new code was about the same speed. On the machines in Halligan 118, the new code was about twice as fast. The machine you're using matters.

Measure your code with both gcc -O1 and gcc -O2. Neither one is faster for all problems; report the better of the two results. In your final compile script, use whichever gives the best results.

You will be evaluated both on your improvement relative to the code you start with and on the absolute performance of your final results. This means it is easier for you to get top marks if you start with your own code rather than mine, since mine has less room to be improved. Your laboratory notes will record all your improvements and the performance of your final stages.

What to submit

Please use the submit40-profile script to submit the following items.

  1. A compile file, which when run with sh, compiles all your source code and produces both ppmtrans and um binaries. Please discard any copies you may have of pnmrdr.c or pnmrdr.o. The early versions of this code were extremely inefficient. I have replaced them with more efficient versions which you can get by linking with -L/comp/40/lib64 -lpnm -lnetpbm.
  2. A run file, which when run with sh runs all your test cases. For accurate performance measurements, large inputs should be copied to /data or /tmp, so they reside on a local disk. Here's a sample; you will want to change it to suit your own image files:
    #!/bin/sh
    . /usr/sup/use/use.sh
    use comp40
    img=`tempfile --suffix=.ppm`
    
    djpeg big.jpg > $img
    /usr/bin/time -f "large rotate  90: %e seconds" ./ppmtrans -rotate  90 $img > /dev/null
    /usr/bin/time -f "large rotate 180: %e seconds" ./ppmtrans -rotate 180 $img > /dev/null
    
    for i in small1.jpg small2.jpg small3.jpg
    do
      djpeg $i > $img
      /usr/bin/time -f "rot $i  90 deg: %e seconds" ./ppmtrans -rotate  90 $img > /dev/null
      /usr/bin/time -f "rot $i 180 deg: %e seconds" ./ppmtrans -rotate 180 $img > /dev/null
    done
    
    for i in midmark.um sandmark.umz
    do
      /usr/bin/time -f "um $i: %e seconds" um $i > /dev/null
    done
    
    rm -f $img
    
  3. A README file which
  4. A labnotes.pdf file that gives your laboratory notes in nice, readable format
  5. All images and benchmarks you used as test data, preferably in a compressed format like jpg or png

Methods of improving performance

In performance, really big wins usually come from better algorithms which provide superior asymptotic complexity. But for our programs, there is not much algorithmic action; everything should be pretty much linear in either the number of pixels (images) or the number of UM instructions executed (Universal Machine). You can, however, often improve things by changing your data structures.

Here are some trite but true observations:

What if your program is nothing but memory references and procedure calls?! How can you make progress?

There are some external sources you might find useful.

This partial solution to the adventure game can be made into a benchmark that is intermediate in difficulty between the midmark and the sandmark:]

n
take bolt
take spring
inc spring
take button
take processor
take pill
inc pill
take radio
take cache
comb processor cache
take blue transistor
comb radio transistor
take antenna
inc antenna
take screw
take motherboard
comb motherboard screw
take A-1920-IXB
comb A-1920-IXB bolt
comb A-1920-IXB processor
comb A-1920-IXB radio
take transistor
comb A-1920-IXB transistor
comb motherboard A-1920-IXB
take keypad
comb keypad motherboard
comb keypad button
s