Fall 2008
The purpose of this assignment is to learn to use profiling tools to apply your knowledge of machine structure and assembly-language programming. You will improve the performance of two programs.
Use code-tuning techniques to improve the performance of two programs:
Your grade will be based on three outcomes:
Please begin with your code in the state it was after the array-rotation and Universal Machine assignments. If your code did not work, you may fix it, or you may start with my array-rotation code and another student's Universal Machine.
Please take baseline measurements of your code as submitted. (If you have already changed your Universal Machine, don't worry; your CS department account should have received an email containing your UM as submitted.)
Begin by choosing a data set. For image rotation, choose one large image (at least 25 megapixels) and three small images (about 100 thousand pixels each). [If you do not have a large image, you can make a small image larger by using \texttt{pnmscale} with a scale factor greater than one.] For the Universal Machine, you will use the small midmark benchmark, the large sandmark benchmark, and a partial solution to the adventure game.
You have three operations: rotate 90, rotate 180, and run um.
At each stage, for each input, please
| Benchmark | Time | Instructions | Relative | Improvement |
| big | 30s | --- | 1.000 | No improvement (starting point) |
| small | 1s | 75×10^6 | 1.000 | |
| big | 28s | --- | 0.933 | Compiled with optimization turned on and linked against -lcii-O1 |
| small | 900ms | 69×10^6 | 0.920 | |
| big | 28s | --- | 0.933 | Compiled with optimization turned on and linked against -lcii-O2 |
| small | 900ms | 69×10^6 | 0.920 | |
| big | 25s | --- | 0.833 | Removed Array_width() call from for loop and placed result in local variable instead |
| small | 833ms | 62×10^6 | 0.833 | |
| big | 22s | --- | 0.733 | Removed array->blocks expression from loop and placed result in local variable |
| small | 800ms | 56×10^6 | 0.800 | |
| big | 60s | --- | 2.000 | Used explicit for loop instead of blocked-array mapping function. Time improved for small image but got worse for big image---undid change. |
| small | 650ms | 49×10^6 | 0.650 | |
| big | 18s | --- | 0.600 | Changed representation of blocks so that access to elements within the blocked mapping function uses unsafe pointer arithmetic without bounds checking |
| small | 600ms | 45×10^6 | 0.600 |
Once you have improved the code as much as you can, use valgrind and kcachegrind to find the single routine that takes the most time. (You can find it by clicking on the Self tab in kcachegrind.) For this final phase you may want to use the --dump-instr=yes option so you can see the assembly code in kcachegrind. The advantage of kcachegrind over objump -d is that it will tell you how many times each instruction was executed.
Once you've found the routine in which the most time is spent, examine the assembly code and either identify ways in which it could be improved or argue that there is not an obvious way to improve it.
Do this exercise for both ppmtrans and um binaries.
For this assignment, there is no need to modify assembly code.
All measurements will be taken on the machines in Halligan 118. This means you should do your own profiling and measurements on those machines, by remote login if necessary. The importance of using the lab machines cannot be overstated. I recently made what I thought was a minor change to my Universal Machine, intended to improve modularity. On my computer at home, the new code was 25% slower. On the linux.cs.tufts.edu server, the new code was about the same speed. On the machines in Halligan 118, the new code was about twice as fast. The machine you're using matters.
Measure your code with both gcc -O1 and gcc -O2. Neither one is faster for all problems; report the better of the two results. In your final compile script, use whichever gives the best results.
You will be evaluated both on your improvement relative to the code you start with and on the absolute performance of your final results. This means it is easier for you to get top marks if you start with your own code rather than mine, since mine has less room to be improved. Your laboratory notes will record all your improvements and the performance of your final stages.
Please use the submit40-profile script to submit the following items.
#!/bin/sh . /usr/sup/use/use.sh use comp40 img=`tempfile --suffix=.ppm` djpeg big.jpg > $img /usr/bin/time -f "large rotate 90: %e seconds" ./ppmtrans -rotate 90 $img > /dev/null /usr/bin/time -f "large rotate 180: %e seconds" ./ppmtrans -rotate 180 $img > /dev/null for i in small1.jpg small2.jpg small3.jpg do djpeg $i > $img /usr/bin/time -f "rot $i 90 deg: %e seconds" ./ppmtrans -rotate 90 $img > /dev/null /usr/bin/time -f "rot $i 180 deg: %e seconds" ./ppmtrans -rotate 180 $img > /dev/null done for i in midmark.um sandmark.umz do /usr/bin/time -f "um $i: %e seconds" um $i > /dev/null done rm -f $img
In performance, really big wins usually come from better algorithms which provide superior asymptotic complexity. But for our programs, there is not much algorithmic action; everything should be pretty much linear in either the number of pixels (images) or the number of UM instructions executed (Universal Machine). You can, however, often improve things by changing your data structures.
Here are some trite but true observations:
Nothing is more frustrating than to spend a lot of time improving code that is rarely executed.
There are some external sources you might find useful.
Steve's spelling could use some work, don't you think?
This partial solution to the adventure game can be made into a benchmark that is intermediate in difficulty between the midmark and the sandmark:]
n take bolt take spring inc spring take button take processor take pill inc pill take radio take cache comb processor cache take blue transistor comb radio transistor take antenna inc antenna take screw take motherboard comb motherboard screw take A-1920-IXB comb A-1920-IXB bolt comb A-1920-IXB processor comb A-1920-IXB radio take transistor comb A-1920-IXB transistor comb motherboard A-1920-IXB take keypad comb keypad motherboard comb keypad button s