The purpose of this assignment is to take the first step toward one of the central goals of this course: to acquire a working knowledge of the AMD64 instruction-set architecture. [Now that Intel also supports AMD64, it is more officially known as \texttt{x86\_64}, but certain old dogs have trouble learning new tricks, not to mention that credit should be given where credit is due. Down with revisionist history!] To that end you will read, analyze, and understand a half-dozen or so C procedures given access only to an executable binary. An important secondary objective is for you to learn to debug programs using the Data Display Debugger (DDD).
Fundamental questions about assembly code include
The homework requires analysis of assembly-language programs. Important questions include
You'll mix three kind of activities:
A ``binary bomb'' is a program that consists of a sequence of phases, each of which can be defused by the proper input. In a Comp 40 bomb, each phase expects one line on standard input; if the phase receives the input it expects, it is defused, and you continue to the next phase. If the phase receives any other input, the bomb explodes: it prints
BOOM!!! The bomb has blown up.Worse, it sends a message to the class ``bomb server,'' and you you lose 1/40 of the available credit for that phase. Be careful!
The bomb is divided into six phases. If you defuse a phase successfully, you will receive either full credit (if your bomb has never blown up) or partial credit (if your bomb blew up at any point). If you fail to defuse a phase, you receive no credit for defusing. The phases get progressively harder to defuse, but the increasing difficulty should be offset by the expertise you gain as you move from phase to phase. The final phase will challenge even the best students—don't start at the last minute.
We've doctored the bomb to help you avoid repetetive work and slips of the finger. First, the bomb ignores blank input lines. Second, if you run your bomb with a command-line argument, for example,
./bomb inputs1-3.txtthen the bomb reads input lines from inputs1-3.txt, which might contain the inputs expected by phases 1 to 3. Afterward, when the bomb reaches end of file on inputs1-3.txt, it switches to reading standard input. Use this feature to your advantage by putting correct phrases into a file!
On this assignment you are encouraged to work in pairs. Each pair should use one and only one bomb. Bombs are available from http://bomb.cs.tufts.edu:54321 (think countdown). This URL works only from inside the firewall; if you are not in the lab, ssh to linux and download your bomb with elinks or w3m. Each bomb is an executable binary which has been compiled from C; it is packaged in a tar file which can be unpacked with, e.g.,
tar xvf bomb4.tarSave the bombk.tar file to the directory in which you plan to do your work. Unpack the tar file by
Run objdump -d bomb to get a printout of the assembly code of your bomb. You'll be especially interested in functions main (to which you have the source) and functions phase_1 through phase_6. You will also be interested in other functions, but anything before main in the printout is probably not interesting.
You must defuse the bomb on linux.cs.tufts.edu or one of the machines in Halligan 116, 118, or 120. If you wish to work remotely, you will need either an X windows client or a VNC client. If you are not already familiar with X clients and ssh forwarding, I recommend you use a VNC client, available from http://www.realvnc.com. Information about how to set up VNC is available at http://www.cs.tufts.edu/comp/40/howto-vnc/, with emphasis on using the Windows PuTTY client to get through the firewall. If you are using Linux you can SSH tunnel by using the -via option on the VNC viewer.
Connect to a class server, launch the Data Display Debugger (DDD) on your bomb, and start stepping through the code.
((int *)($rbp + $rbx * 4)[0]watches the value in the address known to the assembler as 0x0(%rbp,%rbx,4). [For reasons known only to the people at the Free Software Foundation, the assembly-language tools use the \texttt{\char`\%}~sign to refer to a register name, but the debuggers use a \texttt{\char`\$}~sign.]
You will need access to volumes of information, including
Once you are comfortably seated on this indigestible brick of information, launch DDD and start single-stepping. Key skills are
Results from defusing the bomb are automatically sent to your instructor, so there is no need to hand them in. You should validate your results by checking the page at
http://www.cs.tufts.edu/comp/40/bombstats.htmlThis web page is updated frequently and shows everyone's progress.
In addition to defusing the bomb, we expect you to write down the results of your analysis:
We therefore expect you to submit the following:
The Data Display Debugger is actually a user interface that operates on top of gdb. Here are some things to know:
ddd bomb
ddd -rhost linux.cs.tufts.edu /h/nr/cs/40/arith-student/40image
((int *)$rax)[0] ((int *)$rax)[1] ((int *)$rax)[2] ((int *)$rax)[3]Type an expression into the argument field and click the display flashlight, then edit the expression to show the next element. You can easily display all four at once.
DDD is actually ``just'' a user interface. To control the underlying program, DDD, relies on the the GNU debugger, gdb. gdb is a command-line debugger that supports a handful of languages and a great many target machines. With luck, you'll avoid most of gdb, but you'll want to keep in mind that the gdb window at the bottom of the DDD interface allows you to interact directly with gdb. All of DDD's abilities to single step, show data, set breakpoints, and so on, are also present in gdb.
Full documentation for gdb can be found at http://sourceware.org/gdb/documentation. It is probably more useful to browse the textbook site at http://csapp.cs.cmu.edu/public/students.html, which has a number of gdb resources. In particular, the ``Quick GDB x86-64 reference'' contains a very short summary of commands, most of which are in common with DDD. This is the place to learn what ``Stepi'' and ``Nexti'' actually mean.
The command objdump -d is invaluable to print assembly language for all of the code in the bomb. You can also just look at individual functions. Having a printed copy of the assembly code, which you can write on, is an invaluable tool to keep track of your growing understanding of the bomb.
Running nm -p bomb will show all the names defined and used in the bomb. These names include all of the bomb's functions and global variables as well as all the functions the bomb calls. You may learn something by looking at the function names!
You'll also see many undefined names which should look vaguely familiar; they refer to names defined in the GNU C Library. This library is loaded dynamically, after the bomb starts. (Dynamic loading is also responsible for most of the junk between _init and _start. These are functions that are called from the bomb, which in turn make indirect calls through something called the Global Offset Table. At this point, sensible people run screaming from the room.)
For more information, some people like objdump -t, but I prefer the simplicity of nm. Check out the man page!
It won't get you far, but
strings bombwill display the printable strings in your bomb. Sometimes strings squeezes a surprising amount of information out of a reticent program. If nothing else, strings will teach you not to store your passwords in the clear!
You may find the following documentation useful:
The most important thing to know is that the GNU people put the destination on the right (AT&T syntax) where Intel and AMD put it on the left. If you are moved to mayhem, Richard Stallman is but a short train ride away...
Another useful thing to know is the meaning of the address syntax; here are the addressing modes you're most likely to encounter:
%rax == $r[0] # similarly for other registers %eax == least significant 32 bits of %rax (presence may signal a 32-bit operation) disp(base, index, scale) == $m[base + index * scale + disp] disp(base) == $m[base + disp] label(,index,scale) == $m[label + index * scale] symbol(%rip) == relative reference to symbol in memory
The primary technique involved is to maintain at each program point an account of the contents of machine registers and the procedure's stack frame. [You are not yet expected to know what a stack frame is; it is covered in your book and will be covered in class.] Registers and stack slots that you don't care about can be omitted from description or can be written using the ``don't care'' value _|_ (pronounced ``undefined''). The technique is best applied both backwards and forwards:
add $0x1,%rbx # %rbx := %rbx + 1then the desired state preceding the instruction must be to have `%rbx + 1 > 5, or equivalently, `%rbx > 4, or equivalently. The condition characterizing the desired state preceding the instruction is called the weakest precondition, and I calculated it by substituting the right-hand side of the assignment for the register on the left.
40122d: je 401234 <phase_1+0x17> 40122f: callq 40155b <explode_bomb> 401234: add $0x8,%rspYou might naturally wish to avoid ever having control arrive at address 0x40122f, which calls the explode_bomb function. It is therefore natural to ask what has to be done to make sure the preceding condition jump succeeds in jumping around the call to the add instruction. Looking backward, we see the preceding instruction is
40122b: test %eax,%eaxLooking up this instruction in the architecture manual, we see that the instruction computes the ``bitwise and'' of %eax with itself (that is, the ``bitwise and'' of the least significant 32 bits of %rax), uses the result to set the sign, parity, and zero bit in the condition codes, and then discards the result. So the je instruction is mnemonic for ``jump on equal,'' which is the same as ``jump if zero'', [You can see where this analysis is going to get tedious. And that you're going to need to start early.] which means the jump will succeed if and only if %eax is zero.
Now, as is utterly typical, the control-flow program has become a dataflow problem, and we must look backward to see what sets %eax. We haven't far to look, as the preceding instruction is
401226: callq 40125f <strings_not_equal>and we know from the calling convention that a 32-bit integer value is returned in %eax. Thus it is safe to conclude that in order to prevent the bomb from exploding, the strings_not_equal function must return zero. Suggestive, is it not?
The ``binary bomb'' idea and the associated code were developed by David O'Hallaron, who has a twisted sense of humor. [Terry Pratchett helped with the footnotes.] Many of the informational resources were suggested by Soha Hassoun. Backwards dataflow analysis was invented by Sherlock Holmes, who was invented by Arthur Conan Doyle.