Research Summary --- Retargeting

Retargeting programming environment tools, such as compilers and editors, is now common, and is assisted by retargeting tools and methodologies. For example, compiler writers use machine-independent intermediate representations and code-generator generators. My dissertation described the design and implementation of a retargetable debugger. More recently, I have developed techniques for generating parts of applications that manipulate machine code, including assemblers, disassemblers, code generators, linkers, and profilers.

Debugging

ldb, my dissertation project, is a prototype retargetable debugger. It can be used with C programs compiled with lcc, a retargetable compiler for ANSI C, and it can debug VAX, MC68000, SPARC, and MIPS R3000 programs. ldb makes three contributions: it uses debugging symbols that contain procedures as well as data, it is simplified considerably by modest compiler support, and its design minimizes and isolates machine dependencies.

ldb represents symbol tables as PostScript programs, which its embedded PostScript interpreter evaluates as necessary. This approach provides a machine-independent mechanism for representing symbol tables that contain both code and data, shields the debugger from irrelevant information, and supports machine-independent expression evaluation.
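The idea of a symbol-table entry that carries code as well as data can be illustrated with a minimal sketch. It is not ldb's PostScript and does not follow PostScript semantics; the postfix evaluator, the operator names `add` and `fetch`, the `SYMBOLS` layout, and the frame-pointer offsets are all assumptions made for illustration.

```python
# Illustrative only: a tiny postfix evaluator standing in for a PostScript
# interpreter. Operators, symbol layout, and offsets are hypothetical.

def run(program, env):
    """Evaluate a postfix program against a target environment."""
    stack = []
    for tok in program:
        if isinstance(tok, int):
            stack.append(tok)                      # literals push themselves
        elif tok == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif tok == "fetch":
            address = stack.pop()
            stack.append(env["memory"][address])   # read target memory
        else:
            stack.append(env[tok])                 # names such as "fp"
    return stack.pop()

# Each entry holds data (the type) and a procedure (how to find the variable).
SYMBOLS = {
    "i": {"type": "int", "where": ["fp", -8, "add"]},
    "n": {"type": "int", "where": ["fp", -12, "add"]},
}

def address_of(name, env):
    """Run the symbol's location procedure only when it is needed."""
    return run(SYMBOLS[name]["where"], env)

def value_of(name, env):
    """Extend the location procedure with a fetch to read the value."""
    return run(SYMBOLS[name]["where"] + ["fetch"], env)

# Hypothetical stopped target: frame pointer plus a sparse memory image.
env = {"fp": 1000, "memory": {992: 42, 988: 7}}
print(address_of("i", env), value_of("i", env))
```

Because the location is a procedure rather than a fixed record, the evaluator itself needs no knowledge of how any particular target lays out its stack frames.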

ldb is an experiment in coupling compiler and debugger. In most systems, compiler and debugger are connected only by machine-dependent symbol table data. In some experimental systems, the compiler and debugger execute in the same address space, calling each other and sharing data structures. ldb and lcc execute separately, but ldb depends on and reuses existing compiler functionality as much as possible. For example, ldb uses a variant of lcc as an ``expression server,'' which implements assignment and expression evaluation by translating C to PostScript. Making modest demands on the compiler simplifies the debugger substantially.

ldb's design embodies engineering choices that minimize and isolate machine-dependent code. For example, it controls target processes with a small ``debug nub'' that is loaded with the target program. It attaches to targets dynamically, exchanging messages with this nub using a machine-independent protocol. ldb's breakpoint implementation is largely machine-independent; the only machine-dependent code implements control-flow analysis, which takes about 50 lines per target. ldb's machine-dependent code depends only on the architecture the target program runs on, not on the architecture ldb runs on. As a result, cross-architecture debugging is identical to single-architecture debugging, and ldb can change architectures dynamically.
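One way a debugger and a nub might exchange messages independently of either machine's byte order is to fix the wire format. The sketch below is not ldb's protocol; the opcodes `FETCH` and `STORE`, the header layout, and the function names are hypothetical and chosen only to show the technique.

```python
import struct

# Hypothetical wire format: 1-byte opcode, 4-byte address, 2-byte length.
# ">" fixes big-endian order with no padding, so the bytes are the same
# whatever the host or target architecture.
HEADER = struct.Struct(">BIH")
FETCH, STORE = 1, 2

def encode_fetch(address, length):
    """Ask the nub for `length` bytes of target memory at `address`."""
    return HEADER.pack(FETCH, address, length)

def encode_store(address, data):
    """Ask the nub to write `data` into target memory at `address`."""
    return HEADER.pack(STORE, address, len(data)) + data

def decode_request(message):
    """Nub side: split a request into (opcode, address, length, data)."""
    opcode, address, length = HEADER.unpack_from(message)
    start = HEADER.size
    data = message[start:start + length] if opcode == STORE else b""
    return opcode, address, length, data

print(decode_request(encode_store(0x2000, b"\xde\xad")))
```

Under this assumption, the debugger-side code is the same for every target; only the interpretation of the fetched bytes depends on the target architecture.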

Machine code

The New Jersey Machine-Code Toolkit helps programmers write applications that process machine code, such as code generators, linkers, profilers, and debuggers. It turns symbolic manipulations of instructions into bit manipulations, guided by a specification that maps between symbolic and binary representations of instructions. Without the toolkit, application writers must either work with text and use native assemblers and disassemblers, or else implement encoding and decoding by hand, using different ad hoc techniques for different architectures. The toolkit automates encoding and decoding, providing a single technique that works on multiple architectures.

The toolkit's specification language is simple, and it is designed to resemble instruction descriptions found in architecture manuals. To guarantee consistency, it uses a single, bidirectional construct to describe both encoding and decoding. The toolkit checks specifications for unused constructs, underspecified instructions, and inconsistencies. An instruction set can be specified with modest effort; our MIPS, SPARC, and Intel 486 specifications are 127, 193, and 460 lines, respectively.
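The benefit of describing encoding and decoding with one construct can be shown in a small sketch. It does not use the toolkit's specification language; the tables `FIELDS` and `PATTERNS` and the functions below are hypothetical, and they cover only four MIPS R-type instructions (opcode 0, with the standard function codes for add, sub, and, or).

```python
# Illustrative only: one table drives both directions, so the encoder and
# decoder cannot disagree. Field positions follow the MIPS R-type format.
FIELDS = [  # (name, lowest bit, width)
    ("op", 26, 6), ("rs", 21, 5), ("rt", 16, 5),
    ("rd", 11, 5), ("shamt", 6, 5), ("funct", 0, 6),
]

# Mnemonic -> field values that are fixed for that instruction.
PATTERNS = {
    "add": {"op": 0, "shamt": 0, "funct": 0x20},
    "sub": {"op": 0, "shamt": 0, "funct": 0x22},
    "and": {"op": 0, "shamt": 0, "funct": 0x24},
    "or":  {"op": 0, "shamt": 0, "funct": 0x25},
}

def pack(values):
    """Combine named field values into a 32-bit word, checking widths."""
    word = 0
    for name, low, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word |= value << low
    return word

def unpack(word):
    """Split a 32-bit word into its named fields."""
    return {name: (word >> low) & ((1 << width) - 1)
            for name, low, width in FIELDS}

def encode(mnemonic, rd, rs, rt):
    """Symbolic instruction -> binary word."""
    return pack(dict(PATTERNS[mnemonic], rd=rd, rs=rs, rt=rt))

def decode(word):
    """Binary word -> symbolic instruction, using the same patterns."""
    fields = unpack(word)
    for mnemonic, fixed in PATTERNS.items():
        if all(fields[name] == value for name, value in fixed.items()):
            return mnemonic, fields["rd"], fields["rs"], fields["rt"]
    raise ValueError(f"no pattern matches {word:#010x}")

word = encode("add", 8, 9, 10)   # add $t0, $t1, $t2
print(hex(word), decode(word))
```

The shifts and masks appear once, in `pack` and `unpack`, rather than in every place an application builds or inspects an instruction.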

The toolkit has been used to reduce retargeting effort in two applications. ldb uses the toolkit for its MIPS disassembler, for which it needs fewer than 100 lines of machine-dependent code. The toolkit supports relocation as well as encoding; a retargetable linker that uses the toolkit does relocation in only 20 lines of code, all of which are machine-independent. A previous version required 450 lines of encoding and relocation code for the MIPS alone.

The toolkit provides other practical benefits. By hiding shift and mask operations, by replacing case statements with matching statements, and by checking specifications for consistency, the toolkit reduces the possibility of error. The toolkit can speed up applications that would otherwise have to generate assembly language instead of binary code. For example, the linker mentioned above emits executable files 1.7 to 2 times faster by using the toolkit to write machine code instead of writing assembly language and using native assemblers.