The purpose of this assignment is to learn to use profiling tools, applying your knowledge of machine structure and assembly-language programming. You will use code-tuning techniques to improve the performance of two programs:
The key parts of the assignment are to identify bottlenecks using valgrind and to improve the code in small increments. Because valgrind slows execution substantially, you will want to do most of your profiling on small inputs.
Your grade will be based on three outcomes:
Please begin with your code in the state it was in after the array-rotation and Universal Machine assignments. If your code did not work, you may fix it, or you may start with Professor Ramsey's array-rotation code and another student's Universal Machine. If you have not yet completed the UM, you may not look at another student's UM until you have submitted your own.
Please take baseline measurements of your code as submitted. (If you have already changed your Universal Machine, don't worry; your CS department account should have received an email containing your UM as submitted.)
During this assignment, you may run into a dead end that requires you to go back to a previous version of your code. We recommend, but do not require, that you use git to keep track of each stage of improvement in your code.
Run \texttt{git init} in the directory that contains your source code. After each set of changes, take a snapshot with

    git commit *.c *.h

and when a snapshot marks a notable stage, give it a name, for example

    git commit *.c *.h
    git tag replaced-segment-map-with-turtles

Try \texttt{git gui} for seeing what has changed since the last snapshot. It may help you with your notes.
Begin by choosing a data set. For image rotation, choose one large image (at least 25 megapixels) and three small images (about 100 thousand pixels each). [If you do not have a large image, you can make a small image larger by using \texttt{pnmscale} with a scale factor greater than one.] For the Universal Machine, you will use the small midmark benchmark, the large sandmark benchmark, and a partial solution to the adventure game.
For each program, at each stage, for each input, please
For ppmtrans, each input is an image, and ``executing the program'' means running both 90-degree and 180-degree blocked rotations on that image.
For um, each input is a Universal Machine binary, and ``executing the program'' means running um on that binary, supplying a suitable test input on standard input.
When you change the code, it is critical that each set of changes be small and isolated. Otherwise you will not know what changes are responsible for what improvements.
Benchmark | Time  | Instructions | Rel to start | Rel to prev | Improvement |
big       | 30s   | —            | 1.000        | 1.000       | No improvement (starting point) |
small     | 1s    | 75.02×10^6   | 1.000        | 1.000       | |
big       | 28s   | —            | 0.933        | 0.933       | Compiled with -O1 and linked against -lcii |
small     | 900ms | 69.21×10^6   | 0.920        | 0.920       | |
big       | 28s   | —            | 0.933        | 1.000       | Compiled with -O2 and linked against -lcii |
small     | 900ms | 69.18×10^6   | 0.920        | 1.000       | |
big       | 25s   | —            | 0.833        | 0.893       | Removed Array_width() call from for loop and placed result in a local variable instead |
small     | 833ms | 62.01×10^6   | 0.833        | 0.926       | |
big       | 22s   | —            | 0.733        | 0.880       | Removed array->blocks expression from loop and placed result in a local variable |
small     | 800ms | 56.16×10^6   | 0.800        | 0.960       | |
big       | 60s   | —            | 2.000        | 2.727       | Used explicit for loop instead of blocked-array mapping function; time improved for the small image but got worse for the big image, so the change was undone |
small     | 650ms | 49.20×10^6   | 0.650        | 0.813       | |
big       | 18s   | —            | 0.600        | 0.300       | Changed representation of blocks so that element access within the blocked mapping function uses unsafe pointer arithmetic without bounds checking |
small     | 600ms | 44.89×10^6   | 0.600        | 0.923       | |
Sample report for blocked image rotation (made-up data) [*]
Once you have improved the code as much as you can, use valgrind and kcachegrind to find the single routine that takes the most time. (You can find it by clicking on the Self tab in kcachegrind.) For this final phase you may want to use the --dump-instr=yes option so you can see the assembly code in kcachegrind. The advantage of kcachegrind over objdump -d is that it will tell you how many times each instruction was executed.
Once you've found the routine in which the most time is spent, examine the assembly code and either identify specific ways (changes to the assembly code itself) in which it could be improved or argue that there is no obvious way to improve the assembly code.
Do this exercise for both ppmtrans and um binaries.
Things to look for include:
The instruction mov %esi, %esi looks redundant, but it's not. This idiom stands for an instruction that zeroes out the most significant 32 bits of the 64-bit register %rsi. Whoever allowed this syntax into the assembler should be shot.
Do this exercise for both ppmtrans and um.
For this assignment, there is no need to modify assembly code.
All profiling and measurements must be done on the Intel Q6700 machines in Halligan 116 or 118. Almost all the machines in 116 and 118 are suitable; to be sure you have the right kind of machine, run the command
    grep 'model name' /proc/cpuinfo

and make sure the output is

    model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
    model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
    model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz
    model name : Intel(R) Core(TM)2 Quad CPU Q6700 @ 2.66GHz

Any other output means you have the wrong kind of machine and your time measurements will not be consistent. In particular, the machines in 120 are not suitable for this assignment.
The importance of using the lab machines cannot be overstated.
Measure your code with both gcc -O1 and gcc -O2. Neither one is faster for all problems; report the better of the two results. In your final compile script, use whichever gives the best results.
You will be evaluated both on your improvement relative to the code you start with and on the absolute performance of your final results. This means it is easier for you to get top marks if you start with your own code rather than NR's, since his has less room to be improved. Your laboratory notes will record all your improvements and the performance of your final stages.
Please use the submit40-profile script to submit the following items.
    #!/bin/sh
    djpeg big.jpg > img
    # ppmtrans options below are placeholders; use the flags your ppmtrans accepts
    time -f "large rotate 90: %e seconds"  ppmtrans -rotate 90  img > /dev/null
    time -f "large rotate 180: %e seconds" ppmtrans -rotate 180 img > /dev/null
    for i in small1.jpg small2.jpg small3.jpg
    do
      djpeg $i > img
      time -f "rot $i 90 deg: %e seconds"  ppmtrans -rotate 90  img > /dev/null
      time -f "rot $i 180 deg: %e seconds" ppmtrans -rotate 180 img > /dev/null
    done

(Because we are using a sane, sensible shell, we can use time instead of \path/usr/bin/time.)
And here is an example for the UM:
    #!/bin/sh
    . /usr/sup/use/use.sh
    use comp40
    for i in midmark.um sandmark.umz
    do
      time -f "um $i: %e seconds" um $i > /dev/null
    done
In performance, really big wins usually come from better algorithms which provide superior asymptotic complexity. But for our programs, there is not much algorithmic action; everything should be pretty much linear in either the number of pixels (images) or the number of UM instructions executed (Universal Machine). You can often improve things by changing your data structures.
Here are some trite but true observations:
Nothing is more frustrating than to spend a lot of time improving code that is rarely executed.
There are some external sources you might find useful.
Steve's spelling could use some work, don't you think?
Partial solution to the adventure game
Figure [<-] gives a partial solution to the adventure game. This solution can be made into a benchmark that is intermediate in difficulty between the midmark and the sandmark.
[*] Here are some untested ideas for automatically copying files to \path/data or \path/tmp lazily, on demand. They should work with #!/bin/bash or #!/bin/ksh. The ideas are:
First, shell function datafile gives the pathname of a personal file in \path/data:

    function datafile {   # pathname of a personal file in /data
      echo "/data/$USER/$1"
    }

Second, shell function in_data takes an arbitrary file and copies it to your \path/data directory. Copying is done only if a file of the same name is not already present:
    function in_data {   # possibly copy a file to /data; return its path there
      # set -x           # uncomment me to see commands execute
      typeset data="$(datafile "$(basename $1)")"
      if [[ ! -r "$data" ]]; then        # file is not already in /data
        mkdir -p "$(dirname "$data")"    # make the directory
        cp -v "$1" "$data"               # copy the file
      fi
      echo "$data"                       # print the new location
    }

You can now use this function routinely to make sure that every ``big input'' is in \path/data:

    # set -x   # uncomment me to see commands execute
    for i in biginput1 biginput2 biginput3
    do
      /usr/bin/time -f ... my_program $(in_data $i)
    done