Early Experience with ASDL in lcc

David R. Hanson
Microsoft Research
drh@microsoft.com

September 1998

Technical Report MSR-TR-98-50

Microsoft Research
Microsoft Corporation
One Microsoft Way
Redmond, WA 98052
http://www.research.microsoft.com

Abstract

The Abstract Syntax Description Language (ASDL) is a language for specifying the tree data structures often found in compiler intermediate representations. The ASDL generator reads an ASDL specification and generates code to construct, read, and write instances of the trees specified. Using ASDL permits a compiler to be decomposed into semi-independent components that communicate by reading and writing trees. Each component can be written in a different language, because the ASDL generator can emit code in several languages, and the files written by ASDL-generated code are machine- and language-independent. ASDL is part of the National Compiler Infrastructure project, which seeks to reduce dramatically the overhead of computer systems research by making it much easier to build high-quality compilers. This paper describes dividing lcc, a widely used retargetable C compiler, into two components that communicate via trees defined in ASDL. As the first use of ASDL in a `real' compiler, this experience reveals much about the effort required to retrofit an existing compiler to use ASDL, the overheads involved, and the strengths and weaknesses of ASDL itself and, secondarily, of lcc.

Introduction

High-quality compilers for a range of modern languages are essential for conducting experimental research in computer architecture, programming languages, and programming environments. For example, compilers are required to run benchmarks for evaluating new ideas in architecture and code optimization. And compilers for new languages need optimizers, code generators, and runtime systems for existing platforms. Building compilers is often a bottleneck in these kinds of research projects because compiler construction is a labor-intensive activity. Often, nearly complete compilers must be constructed even if the essential components are relatively small parts of the whole. Evaluating a new architecture, for instance, requires a code generator for that architecture and perhaps an architecture-dependent optimizer. Writing toy compilers for toy languages is insufficient: The research community demands measurements using established benchmarks, like the SPEC benchmarks [8], which are written in real programming languages.

The National Compiler Infrastructure (NCI) project seeks to reduce dramatically the effort needed to perform realistic experiments by making it much easier to build high-quality compilers. The goal is to make it possible to build complete compilation systems from pieces, replacing or modifying only those components that are relevant to the client researchers. For example, researchers studying global optimization algorithms for C++ would replace or add only their optimizers and would use existing C++ front ends and code generators. The NCI can pay both economic and intellectual dividends. It should reduce significantly the costs of doing computer systems research, in terms of time, barrier to entry, and direct monetary outlay. It should also encourage more researchers to attack computer systems problems and thus increase the rate at which new research results appear.
The NCI includes the Stanford Intermediate Format [11] and the emerging Zephyr program-generation tools, which include the Abstract Syntax Description Language (ASDL) [10]. ASDL describes the abstract syntax of compiler intermediate representations and other tree-like data structures. The ASDL generator, asdlGen, converts ASDL specifications into appropriate data-structure definitions, constructors, and functions to read and write these data structures to files in a variety of programming languages.

This paper describes how ASDL is used with lcc [1], a well-documented, small, production-quality compiler for ISO Standard C [5]. This experience is valuable for two reasons. First, lcc is perhaps the simplest C compiler available and thus provides a `basis' test case for ASDL and other NCI tools. If ASDL can't handle lcc's intermediate representation, it's unlikely to work in more ambitious compilers or in compilers for higher-level languages. Second, lcc wasn't designed to be decomposed into reusable program components, so doing so suggests how difficult it is to retrofit ASDL into existing compilers.

ASDL

ASDL is a small, domain-specific language for describing tree data structures [10]. ASDL specifications are concise and independent of any particular programming language. The ASDL generator, asdlGen, accepts an ASDL specification and emits code that defines a concrete representation for the data structures described in the specification, along with code that constructs, reads, and writes instances of those data structures. Currently, asdlGen can emit data-structure implementations in C, C++, Java, ML, and Haskell. ASDL specifications tend to be much smaller than the corresponding, language-specific data structure and function definitions.

Compiler writers can use ASDL to partition a compiler into several independent programs as depicted in Fig. 1. A front end reads source code and builds an intermediate representation using the data structure constructors generated by asdlGen. It writes these data structures to a file---a `pickle'---using the I/O functions generated by asdlGen. Subsequent phases read and write pickles as necessary, perhaps modifying them in the process. For example, optimizers would read a pickle, improve the code therein, and write a new pickle.

Figure 1: Sample compiler organization using ASDL. A front end written in C reads source code and writes an IR pickle; an optimizer written in ML reads and writes pickles; a back end written in Java emits object code.

The binary pickle format is independent of both language and host platform. Thus, as suggested in Fig. 1, compiler phases can be written in whatever ASDL-supported language best suits the task at hand. If ASDL becomes widely used, researchers can tap into a complete compiler by adding new phases or replacing just the phases of interest.

ASDL is not a universal intermediate representation [9]; rather, it supports any IR that can be described by trees. Likewise, ASDL is not a universal distribution format [6], because it does not mandate specific formats, capabilities, or platforms. ASDL and asdlGen are to compilers what interface definition languages (IDLs) [7] and stub generators are to distributed systems. Typical IDLs describe the interfaces between program components running in different address spaces, and stub generators generate implementations of these functions that use remote procedure calls to communicate between clients and servers.
ASDL describes the data-structure interfaces between compiler phases, and asdlGen generates functions to communicate between these phases.

ASDL is such a simple language that examples suffice to explain nearly all of its features. The following ASDL specification describes an IR for a language of arithmetic expressions, assignment statements, and print statements.

    module IR {
        stm   = SEQ(stm, stm)
              | ASGN(identifier, exp)
              | PRINT(exp*)
        exp   = OP(binop, exp, exp)
              | ID(identifier)
              | ICON(int)
              | RCON(real)
        real  = (int, int)
        binop = ADD | SUB | MUL | DIV
    }

This specification defines four types: stm, exp, real, and binop. The first three productions define the sum type stm, which has three constructors. A stm is a SEQ tree with two stm children, an ASGN tree with two children of types identifier and exp, or a PRINT tree with one child of type list of exp. identifier is a built-in type, and the `*' following a type specifies a list of that type. Similarly, exp is a sum type with four constructors that describe trees for binary operators, identifiers, and integer and real constants. int is another built-in type, but there is no built-in type for reals. So, the real type is a product type whose instances represent real numbers as two integers. Finally, binop is a simple sum type that defines constructors for each of the possible binary operators. All four types are wrapped in a module named IR; this name is used to provide a disambiguating prefix for the names in the generated implementations.

It is easy to confuse ASDL specifications with grammars for programming languages. This ASDL specification describes the abstract syntax of the intermediate representation for programs written in some unspecified concrete syntax.

Given an ASDL specification, asdlGen emits an interface and an implementation in the programming language specified. The interface defines the language-specific representation for the types and declares functions for constructing instances of those types and for reading and writing them. The functions themselves appear in the implementation. For languages that do not separate interfaces and implementations, like Java, asdlGen emits a single implementation.

In C, for example, given the ASDL specification above in the file IR.asdl, asdlGen writes the interface to IR.h and the implementation to IR.c. Figure 2 shows the snippets from IR.h that define the representation for stm and the associated constructors, readers, and writers. ASDL uses compact and efficient representations whenever possible. A sum type is represented by a union. There is one field for each constructor and a corresponding function for building instances of that constructor, as shown for stm in Fig. 2. ASDL represents simple sum types with integers or their language-specific equivalent. In C, for example, binop is represented by just an enumeration type.

    ...
    struct IR_stm_s {
        enum {IR_SEQ_enum, IR_ASGN_enum, IR_PRINT_enum} kind;
        union {
            struct IR_SEQ_s   { IR_stm_ty stm1; IR_stm_ty stm2; } IR_SEQ;
            struct IR_ASGN_s  { identifier_ty identifier1; IR_exp_ty exp1; } IR_ASGN;
            struct IR_PRINT_s { list_ty exp_list1; } IR_PRINT;
        } v;
    };
    ...
    IR_stm_ty IR_SEQ(IR_stm_ty stm1, IR_stm_ty stm2);
    IR_stm_ty IR_ASGN(identifier_ty identifier1, IR_exp_ty exp1);
    IR_stm_ty IR_PRINT(list_ty exp_list1);
    ...
    extern IR_stm_ty IR_read_stm(instream_ty s_);
    extern void IR_write_stm(IR_stm_ty x_, outstream_ty s_);

Figure 2: Generated C interface for the example ASDL specification.
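To make the generated interface concrete, here is a minimal sketch of how a front end might build and pickle the statement x = 1 using the constructors of Fig. 2. It assumes that the exp constructor ICON is generated with the same naming convention shown for stm (IR_ICON), and it uses placeholder helpers, make_identifier and open_outstream, standing in for whatever routines the ASDL C library actually provides for atoms and output streams.

    #include "IR.h"   /* generated by asdlGen from IR.asdl */

    /* Placeholder declarations: the ASDL C library supplies its own
       routines for creating identifiers (atoms) and output streams;
       these two names are hypothetical. */
    extern identifier_ty make_identifier(const char *name);
    extern outstream_ty open_outstream(const char *filename);

    void pickle_example(void)
    {
        /* Build the tree for the assignment statement  x = 1. */
        IR_exp_ty one  = IR_ICON(1);                       /* integer constant */
        IR_stm_ty asgn = IR_ASGN(make_identifier("x"), one);

        /* Write the tree to a pickle; a later phase reads it back
           with IR_read_stm. */
        IR_write_stm(asgn, open_outstream("example.pickle"));
    }

A later phase written in, say, ML or Java would read the same pickle with the reader generated for its own language.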
The implementation, IR.c, contains the definitions for the functions declared in IR.h. ASDL comes with libraries of basic types and functions for each programming language it supports. These libraries provide support for lists and for the built-in types, such as identifier, in languages that do not support them directly. In C, lists are represented by an implementation of variable-length sequences [4, Ch. 11], and identifiers are represented by a C implementation of atoms [4, Ch. 3].

As Fig. 2 reveals, asdlGen generates field names and parameter names as necessary to complete the data structure and function definitions. Compiler writers can specify these names in the ASDL specification. For example, if stm is defined as

    stm = SEQ(stm first, stm rest)
        | ASGN(identifier id, exp e)
        | PRINT(exp* elist)

first, rest, id, e, and elist will be used for the corresponding field and parameter names in Fig. 2. Sum types can also have attributes, which are fields that are common to all constructors. For example,

    stm = SEQ(stm first, stm rest)
        | ASGN(identifier id, exp e)
        | PRINT(exp* elist)
          attributes(int lineno)

attaches a line number attribute to each constructor. Attributes are usually factored into a common prefix for the type, e.g., the C type for stm from Fig. 2 becomes

    struct IR_stm_s {
        int_ty lineno;
        enum {IR_SEQ_enum, IR_ASGN_enum, IR_PRINT_enum} kind;
        union {
            struct IR_SEQ_s   { IR_stm_ty first; IR_stm_ty rest; } IR_SEQ;
            struct IR_ASGN_s  { identifier_ty id; IR_exp_ty e; } IR_ASGN;
            struct IR_PRINT_s { list_ty elist; } IR_PRINT;
        } v;
    };

The lcc Code-Generation Interface

lcc is a retargetable compiler for ISO Standard C. It is distributed with back ends for the SPARC, MIPS, X86, and ALPHA for a variety of platforms. Others have written back ends for additional platforms, and lcc is used by other compiler researchers; for example, a modified, older release of lcc is used as the C compiler in the SUIF project. Communication between lcc's target-independent front end and its target-dependent back ends is specified by a small code-generation interface. This interface consists of a few shared data structures, a 33-operator tree IR that represents executable code, and 18 functions that manipulate trees and the shared data structures. The shared data structures include tree nodes, symbol-table entries, and types.

The 33 tree IR operators are listed in Fig. 3. Each of these generic operators can be specialized by appending an operand type suffix and a size in bytes. The 6 type suffixes are:

    F  float
    I  integer
    U  unsigned
    P  pointer
    B  `block' (aggregate)
    V  void

There can be up to 9 sizes. For example, ADDF4 denotes a 4-byte floating addition, and CVII2 denotes a conversion from an integer to a 2-byte integer. While it looks like 33 × 6 × 9 = 1782 specific operators, not all combinations are meaningful, and the number of sizes on most targets is limited. On 32-bit targets, there are 130 type- and size-specific operators. Conversions on 32-bit targets, for instance, convert only between 4-byte integers and 4- or 8-byte floats, or widen or narrow between 3 sizes of integers. Some operators have only one or a few valid suffixes; for instance, the address operators ADDRL, ADDRF, and ADDRG can have only the `P' type suffix and whatever size is the size of a pointer on the target. Back end authors need to accommodate only those type- and size-specific operators that are meaningful on their targets.
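lcc's own headers and macros are not reproduced here, so the following is only an illustrative sketch of the idea that a specific operator can be treated as a (generic operator, type suffix, size) triple; the names mkop, genop, optype, and opsize are hypothetical, not lcc's actual representation.

    /* Illustrative encoding of type- and size-specific operators as
       (generic operator, type suffix, size); a sketch only. */
    enum { ADD = 1, SUB, MUL, DIV /* ... the other generic operators of Fig. 3 */ };
    enum { F = 1, I, U, P, B, V };                /* operand type suffixes */

    #define mkop(g, t, s) (((g) << 8) | ((t) << 4) | (s))
    #define genop(op)     ((op) >> 8)             /* e.g., ADD */
    #define optype(op)    (((op) >> 4) & 0xF)     /* e.g., F   */
    #define opsize(op)    ((op) & 0xF)            /* e.g., 4   */

    /* Under this sketch, ADDF4, a 4-byte floating add, would be
       mkop(ADD, F, 4); a back end could switch on genop(op) and
       consult optype(op) and opsize(op) to pick an instruction. */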
    CNST  ARG   ASGN  INDIR CVF   CVI   CVP   CVU
    NEG   CALL  RET   ADDRG ADDRF ADDRL ADD   SUB
    LSH   MOD   RSH   BAND  BCOM  BOR   BXOR  DIV
    MUL   EQ    GE    GT    LE    LT    NE    JUMP
    LABEL

Figure 3: lcc tree IR generic operators.

    void progbeg(int, char *[])                        initialize the back end
    void progend(void)                                 finalize the back end
    void defsymbol(Symbol)                             initialize a symbol-table entry
    void export(Symbol)                                export a symbol
    void import(Symbol)                                import a symbol
    void global(Symbol)                                define a global
    void local(Symbol)                                 define a local
    void address(Symbol, Symbol, long)                 define an address relative to a symbol
    void blockbeg(Env *)                               open a block-level scope
    void blockend(Env *)                               close a block-level scope
    void function(Symbol, Symbol [], Symbol [], int)   define a function body
    void gen(Node)                                     generate code
    void emit(Node)                                    emit code
    void defconst(int, int, Value)                     initialize an arithmetic constant
    void defaddress(Symbol)                            initialize an address constant
    void defstring(int, char *)                        initialize a string constant
    void space(int)                                    define an uninitialized block
    void segment(int)                                  switch logical segments

Figure 4: lcc code-generation functions.

Incidentally, the lcc 3.x interface [1] supported only three sizes of integers, two sizes of floats, and insisted that pointers fit in unsigned integers. These assumptions simplified the compiler and were suitable for 32-bit architectures, but not for 64-bit architectures. The main difference between the 3.x interface and the 4.x interface described here is the operator size suffixes.

Figure 4 summarizes the purpose of the 18 code-generation functions. On most targets, implementations of many of these functions are very short, perhaps only a few calls to printf, because they simply emit assembly language. Most of the work goes into gen, emit, and function, which collaborate to generate and emit code for a function. While not required by the interface, all of lcc's distributed back ends use a variant of the IBURG code-generator generator [2] to specify instruction selection. The resulting code generators emit optimal local code. Instruction selection specifications and target-dependent functions run about 700 lines per target. There are about another 900 lines of code that are shared between all targets and include functions for register allocation, etc.

lcc's packaging is somewhat novel: Pointers to the code-generation functions and some target-specific parameters are packaged in the following `interface record':

    struct interface {
        Metrics charmetric, shortmetric, intmetric, longmetric, ...;
        unsigned little_endian:1, mulops_calls:1, wants_callb:1, ...;
        void (*address)(Symbol, Symbol, long);
        void (*blockbeg)(Env *);
        void (*blockend)(Env *);
        ...
    };

The Metrics values give the sizes and alignments of the basic data types, and the 1-bit flags identify other target-dependent features, like endianness. There is one interface record for each distinct target, but different records can share functions. lcc is a small compiler, so all of the back ends are combined into a single executable program, which makes lcc a cross compiler. As depicted in Fig. 5, a command-line option selects the desired target, e.g.,

    lcc -Wf-target=mips/irix -S wf1.c

causes lcc to compile wf1.c and leave the generated MIPS assembly code in wf1.s.

Figure 5: Specifying a target, e.g., mips/irix, selects an interface record; IR points to one of the records symbolic, alpha/osf, mips/irix, sparc/solaris, x86/win32, x86/linux, ..., null.

Figure 6: lcc's monolithic design: one front end, numerous back ends (symbolic, alpha/osf, mips/irix, sparc/solaris, x86/win32, x86/linux, ..., null) translating C source to assembly language.
The -Wf-target option points IR to the appropriate interface record, and the front end makes indirect calls to the code-generation functions, e.g.,

    (*IR->defsymbol)(p);

The default target is the host, so this option is required only for cross compilation.

Dividing lcc

lcc is---by design---a monolithic compiler: The front end and the back ends are combined into a single address space, so the front and back ends communicate by function calls that exchange pointers to shared data structures, as Fig. 6 illustrates. Also, back ends can make upcalls to functions provided by the front end. There are about a half dozen such functions, e.g., data-structure constructors, a memory allocator, type predicates, and so on.

lcc is small, at least in comparison with other compilers, because it omits some components, most notably a global optimizer. One way to add more functionality is to split lcc into a separate front end and one or more separate back ends so that an optimizer can be run between these programs. This design would also make it easier to use lcc in research projects. Splitting lcc into two separate programs requires either massive revisions or some way to read and write the data and actions represented by the existing code-generation interface. ASDL facilitates the second alternative: It helps divide lcc into separate programs with no change to the code-generation interface. So, the existing back ends can be used unmodified.

Figure 7 depicts this revised design. The front end, rcc, emits a pickle that encodes all the data structures and the function calls made when compiling a C source file. The new program, dubbed `pass2', reads a pickle, recreates the internal data structures, and makes the function calls encoded in the pickle. The generated assembly language is often byte-for-byte identical to the code emitted by the monolithic compiler. Differences occur only when the back end calls the label generator, which causes the revised design to number labels differently.

Figure 7: lcc's revised design: separate front end and back ends. asdlGen reads the ASDL grammar and generates constructors, readers, and writers that are linked into both rcc and pass2; rcc's ASDL `back end' writes an lcc IR pickle, which pass2 reads and hands to the real back ends (symbolic, alpha/osf, mips/irix, sparc/solaris, x86/win32, x86/linux, ..., null) to produce assembly language.

asdlGen reads the lcc-specific ASDL grammar described in the next section and emits C code for the data-structure constructors, readers, and writers, which is included in both rcc and pass2. Otherwise, the revised front end, rcc, is nearly identical to the original front end. The ASDL emission is accomplished by an ASDL back end, which `spoofs' the back end by overwriting the target-specific code-generation function pointers with pointers to its own functions. The back ends are actually linked into both rcc and pass2, because the interface records carry important machine parameterizations required by both programs. This packaging is not essential; rcc could link in the interface records without the functions.

The revised compiler is used much like the original, except that ASDL output must be specified to rcc, and pass2 must be run to emit the generated code, e.g.,

    lcc -Wf-target=mips/irix -Wf-asdl -S wf1.c
    mv wf1.s wf1.pickle
    ...
    pass2 wf1.pickle >wf1.s

The first command runs rcc and leaves the pickle in wf1.s, which the second command renames. The third command generates the MIPS assembly code in wf1.s, perhaps after it has been optimized or otherwise processed by an intervening step.
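Returning to the ASDL `back end' described above: its code is not listed in this paper, but the spoofing idea can be sketched as follows, assuming a trimmed-down interface record and hypothetical asdl_* function names. Each replacement function would append the corresponding interface record of Fig. 12 to the pickle instead of emitting assembly.

    /* Sketch of spoofing the code-generation interface; the struct is
       trimmed to the members shown earlier, and the asdl_* names are
       hypothetical, not lcc's actual identifiers. */
    typedef struct symbol *Symbol;   /* stands in for lcc's Symbol */
    typedef struct env Env;          /* stands in for lcc's Env    */

    struct interface {
        /* metrics and target flags omitted */
        void (*address)(Symbol, Symbol, long);
        void (*blockbeg)(Env *);
        void (*blockend)(Env *);
        /* ... the remaining code-generation functions of Fig. 4 ... */
    };
    extern struct interface *IR;     /* selected by -Wf-target */

    static void asdl_address(Symbol q, Symbol p, long n) { /* append Address(...) to the pickle */ }
    static void asdl_blockbeg(Env *e)                     { /* append Blockbeg to the pickle */ }
    static void asdl_blockend(Env *e)                     { /* append Blockend to the pickle */ }

    /* Invoked when -Wf-asdl is given: overwrite the selected target's
       entry points so the front end's indirect calls build the pickle
       rather than assembly code. */
    void asdl_spoof(void) {
        IR->address  = asdl_address;
        IR->blockbeg = asdl_blockbeg;
        IR->blockend = asdl_blockend;
        /* ... and so on for gen, emit, function, defconst, ... */
    }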
The ASDL Grammar

The ASDL grammar for lcc is small---only about 70 lines. It specifies data structures that are target-dependent, which means, for example, that it is impossible to specify, say, the mips/irix target to rcc and the x86/win32 target to pass2. Indeed, pass2 does not accept a target option, because the target specification is embedded in the pickle. Figures 8, 9, 10, and 12 show the complete lcc ASDL grammar.

    module rcc {
        program = (int nuids, int nlabels, item* items, interface* interfaces,
                   int argc, string* argv)

        item    = Symbol(symbol symbol)
                | Type(type type)
                  attributes(int uid)

        symbol  = (identifier id, int type, int scope, int sclass,
                   int ref, int flags)

        ... see Fig. 9 (types), Fig. 12 (interface calls), and Fig. 10 (nodes) ...
    }

Figure 8: The ASDL grammar for lcc's code-generation interface.

The grammar specifies more than just what is in the code-generation interface, because pass2 must recreate the compilation environment built by rcc, the front end. There are two important ramifications of this requirement. First, the pickles include complete type information, for example, everything about structures and unions, etc. The code-generation interface includes only the 6 basic types. This information is defined by the type sum type. Second, lcc's code-generation data structures and related internal structures are graphs, not trees. Thus, items with multiple references are identified by integers, and the references to them are replaced by these integers. Fields named uid in the grammar identify these integers. Dealing with graphs is ASDL's major shortcoming.

The product type program is the first ASDL type in Fig. 8, and an instance of this type represents a C compilation unit. That is, rcc `compiles' a C source file into a program and writes it to a pickle, which pass2 reads and traverses to generate code. The program type carries counts of the number of unique integers, uids for short, and the number of generated labels, a sequence of item types, a sequence of interface types, the command-line argument count, and the arguments themselves. The type string is an ASDL built-in type. The sum type item carries a uid (as an attribute) and either the associated symbol or type. The item sequence in a program associates uids with symbols and types, as described below.

Symbol-table entries are an example of a multiply referenced data structure. They are represented by the product type symbol (see Fig. 8), which is a straightforward rendition of lcc's internal symbol-table entry. It carries the symbol's name, type, scope, storage class, how often it's referenced, and some flags. For example, the C declaration

    struct elem {
        int count;
        struct elem *left, *right;
        char *word;
    } *root;

declares root to be a pointer to a struct elem. The corresponding symbol is

    (id = root, type = 10, scope = LOCAL, sclass = AUTO, ref = 120000, flags = addressed)

where, for clarity, symbolic values appear for the scope, sclass, and flags fields. The id field is an instance of the built-in ASDL type identifier; identifiers are atoms. The type field---`10' in this example---is a uid that identifies a type value defined somewhere else in the item sequence. Here and below, uids are shown in a slanted typewriter font.

Types

Types are represented by instances of the sum type type defined in Fig. 9 and are essentially abstract syntax trees of the C type constructors.

    field = (identifier id, int type, int offset, int bitsize, int lsb)
    enum  = (identifier id, int value)
    type  = INT
          | UNSIGNED
          | FLOAT
          | VOID
          | POINTER(int type)
          | ENUM(identifier tag, enum* ids)
          | STRUCT(identifier tag, field* fields)
          | UNION(identifier tag, field* fields)
          | ARRAY(int type)
          | FUNCTION(int type, int* formals)
          | CONST(int type)
          | VOLATILE(int type)
            attributes(int size, int align)

Figure 9: ASDL grammar for C types.

For example, INT is a basic type; POINTER represents a pointer type, and its integer field is the uid of the referent type; and STRUCT represents a structure type with a tag and an ordered set of fields. The fields are represented by a sequence of field product types, one for each field, giving the field name, its type, offset, and location information for bit fields. Other types are similarly represented. Every type has attributes that give its size and alignment constraint in bytes.

A snippet of the item sequence for the type representing the C type struct elem defined previously helps clarify the definition of uids and their use:

                                      id      type  offset  bitsize  lsb
    11: STRUCT(size = 16, align = 4, fields = [
                (count,   12,   0,   0,  0),
                (left,    10,   4,   0,  0),
                (right,   10,   8,   0,  0),
                (word,    13,  12,   0,  0) ])
    12: INT(size = 4, align = 4)
    10: POINTER(size = 4, align = 4, type = 11)
    13: POINTER(size = 4, align = 4, type = 8)
     8: INT(size = 1, align = 1)

Again, the italicized numbers are uids. The uids on the left are the uid attributes in the item type, and each of these defines a uid and its associated type. The occurrences of uids in a type field are references to types. Type 11 is the type value for `struct elem'; its fields give the size of instances of this struct (16 bytes), their alignment (on 4-byte boundaries), and their fields. Each of the field values in the sequence includes a uid for the type of that field. Type 10 is the C type `struct elem *.' Notice the two kinds of INTs: Type 12 is a 4-byte integer, which is the C type `int,' and type 8 is a 1-byte integer, which is the C type `signed char.' Thus, type 13 is the C type `char *.'

IR Trees

IR trees are represented by a nearly isomorphic set of trees defined by the ASDL sum type node, defined in Fig. 10.

    node = CNST(int value)
         | CNSTF(real value)
         | ARG(node left, int len, int align)
         | ASGN(node left, node right, int len, int align)
         | CVT(int op, node left, int fromsize)
         | CALL(node left, int type)
         | CALLB(node left, node right, int type)
         | RET
         | ADDRG(int uid)
         | ADDRL(int uid)
         | ADDRF(int uid)
         | Unary(int op, node left)
         | Binary(int op, node left, node right)
         | Compare(int op, node left, node right, int label)
         | LABEL(int label)
         | BRANCH(int label)
         | CSE(int uid, node node)
           attributes(int suffix, int size)

Figure 10: ASDL grammar for IR trees.

Some generic operators are represented by corresponding constructors, e.g., CNST and ADDRL. Others are represented by constructors for a class of generic operators in which the specific operator is provided as a parameter: CVT nodes represent the conversion operators (CVF, CVI, CVP, and CVU), Unary and Binary nodes represent the unary (INDIR, RET, JUMP, NEG, BCOM) and binary operators (ADD, SUB, DIV, MUL, MOD, BOR, BAND, BXOR, RSH, LSH), and Compare nodes represent the comparisons (EQ, NE, GT, GE, LE, LT). LABEL nodes are label definitions, and BRANCH nodes are unconditional jumps. Compare, LABEL, and BRANCH nodes use label numbers instead of symbol-table entries for labels; pass2 recreates the symbol-table entries as it reconstructs the IR.
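pass2's source is not reproduced in this paper, but resolving uid references when reconstructing the IR is mechanical. The sketch below shows one plausible scheme, assuming a table indexed by uid (the program type carries nuids) that maps each uid to the recreated symbol or type; the table and helper names are hypothetical, and struct symbol and struct type stand in for lcc's Symbol and Type structures.

    /* Hypothetical uid table for a pickle reader such as pass2. */
    #include <assert.h>
    #include <stdlib.h>

    typedef struct { struct symbol *sym; struct type *ty; } uid_entry;
    static uid_entry *uidtab;                  /* indexed by uid */

    void init_uids(int nuids) {                /* nuids comes from program */
        uidtab = calloc(nuids + 1, sizeof *uidtab);
    }

    /* Called while walking the item* sequence of a program. */
    void install_symbol(int uid, struct symbol *sym) { uidtab[uid].sym = sym; }
    void install_type(int uid, struct type *ty)      { uidtab[uid].ty  = ty;  }

    /* Called when a node or type refers to a uid, e.g., ADDRL(uid) or
       POINTER(type); assumes definitions are installed before they are
       referenced, or that a second pass patches forward references. */
    struct symbol *lookup_symbol(int uid) { assert(uidtab[uid].sym); return uidtab[uid].sym; }
    struct type   *lookup_type(int uid)   { assert(uidtab[uid].ty);  return uidtab[uid].ty;  }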
CSE nodes identify common subexpressions and associate a symbol-table entry for a temporary (uid) with a node that computes the subexpression (node). Every node includes suffix and size attributes, which correspond to the type and size suffixes in the type- and size-specific IR operators. It is important to realize that nodes are not lcc IR trees---they represent IR trees. In pass2, nodes provide the data necessary to recreate the lcc IR trees, which are passed to the back ends. This `duplication of effort' is an onerous side effect of retrofitting an existing compiler with ASDL and is discussed in more detail below.

lcc compiles the C code

    char *s;
    int c;
    *s++ = c;

into the equivalent of

    t1 = s; s = t1 + 1; *t1 = c;

where t1 is a compiler-generated temporary. Figure 11 shows the ASDL nodes for these three statements on a 32-bit target. The notation ASGN P 4 gives the constructor, the suffix attribute as one of the types listed above, and the size attribute. Notice the constructor for the indirection in the leftmost tree; it's a Unary node with three values: operator INDIR, suffix P, and size 4. The node for Binary is similar. Leaves, like ADDRL, include the uid of the appropriate symbol. For clarity, Fig. 11 shows the names, too, e.g., t1, but the names are not in the node.

Nodes (and all ASDL-defined data structures) are written to pickles in a compact, prefix, binary representation in which integers can take as little as one byte. For example, the leftmost tree in Fig. 11 takes 15 bytes:

    ASGN P 4 ADDRL P 4 42 Unary INDIR P 4 ADDRL P 4 37

Figure 11: ASDL representation for t1=s; s=t1+1; *t1=c. (Three node trees rooted at ASGN P 4, ASGN P 4, and ASGN I 1; leaves such as ADDRL P 4 carry uids, e.g., uid=42 for t1, uid=37 for s, and uid=36 for c.)

Interface Calls

The ASDL types described above encode the data structures in lcc's code-generation interface. The ASDL sum type interface, defined in Fig. 12, encodes the calls made from the front end to the back end. Compare this type definition with the interface calls listed in Fig. 4. The only significant change is that symbol-table pointers have been replaced by the corresponding uids or sequences of uids. progbeg and progend have been omitted because pass2 can simply make these calls; pass2 also supplies the actual arguments for blockbeg and blockend, so these arguments are not included in interface. Calls to defconst with real values are represented by a separate constructor, Defconstf, which confines the use of real values. ASDL has no built-in support for reals, so they are represented with integers for their most significant and least significant bits. Address and Local constructors associate uids with `relative' symbols and with locals and parameters. A relative symbol is one that is defined by a constant offset from another symbol, e.g., a[i+10] would elicit a definition of a symbol to represent &a[10]. Instances of Forest carry the nodes that represent the executable code in each function. These appear in the interface list in the codelist field of Functions. There is one Function for each C function in the input.

    real = (int msb, int lsb)

    interface = Export(int p)
              | Import(int p)
              | Global(int p, int seg)
              | Local(int uid, symbol p)
              | Address(int uid, symbol q, int p, int n)
              | Segment(int seg)
              | Defaddress(int p)
              | Deflabel(int label)
              | Defconst(int suffix, int size, int value)
              | Defconstf(int size, real value)
              | Defstring(string s)
              | Space(int n)
              | Function(int f, int* caller, int* callee,
                         int ncalls, interface* codelist)
              | Blockbeg
              | Blockend
              | Forest(node* nodes)

Figure 12: ASDL grammar for code-generation interface calls.

Here's an example: The small function

    err(s) char *s; {
        printf("? %s\n", s);
        exit(1);
    }

yields the following sequence of interfaces:

    Local(uid = 27, symbol = (id = s, ...))
    Local(uid = 28, symbol = (id = s, ...))
    Function(f = 22 err, caller = [ 27 ], callee = [ 28 ], ncalls = 2,
        codelist = [ Blockbeg,
                     Forest(nodes = ...),
                     Forest(nodes = ...),
                     Blockend,
                     Forest(nodes = ...) ])
    ...
    Global(p = 24, seg = LIT)
    Defstring("? %s\n")

The Locals above define two views of the formal parameter s, one as seen by callers of err and one as seen by the callee itself. These have the same name---s---but are different symbols, as indicated by their different uids. Often, these symbols are identical, but there are important cases when they are different, as detailed below. The first two occurrences of Forest carry the nodes for the two function calls. The third Forest holds a single LABEL node that marks the location of the function epilogue. Global and Defstring collaborate to initialize a compiler-generated static variable for the format string shown.

Measurements

While retrofitting lcc to use ASDL changed lcc's structure dramatically, this process did not add much code. The ASDL grammar described in the previous section is about 70 lines, the ASDL back end is 409 lines of C, and pass2 is 681 lines of C. The 70-line ASDL grammar generates about 2183 lines of C declarations and function definitions. This code is approximately what would be required if the ASDL-generated constructors, readers, and writers were written by hand. The savings would increase if the ASDL grammar were used to generate code in other languages. For example, if a Java optimizer were written, it would use the 3332 lines of Java generated from the same ASDL grammar.

On Windows NT, the size of the monolithic compiler executable is 380 KB (produced by the Microsoft Visual C/C++ 5.0 compiler with -O1 optimization). The revised rcc with the ASDL back end and the generated constructors, readers, and writers is 437 KB, and pass2 is 395 KB. rcc includes the back ends for all of lcc's targets, because that packaging is the simplest one. The code for these back ends and the symbol-table emission code could be omitted, saving about 199 KB.

Table 1 summarizes the compilation times and the file sizes for the monolithic and divided variants of lcc when compiling its own non-trivial modules. The times are given in centiseconds and are for the compilation phase only; that is, the timings do not include preprocessor and assembler times. All timings were taken on a lightly loaded 200MHz Gateway PC with 128 MB of RAM and SCSI disks running Windows NT 4.0. The compiler variants were compiled by the Microsoft Visual C/C++ 5.0 compiler with -O1 optimization. The first column gives the time in centiseconds for compiling the module named in the rightmost column with the monolithic version of lcc. The second and third columns give the compilation times for rcc and pass2.
Thus, for example, the fourth row shows that the monolithic compiler compiled dag.c to assembly language in 33 csecs., and rcc and pass2 accomplished the same task in 61 + 31 = 92 csecs. The fourth column gives the size of each module's pickle in kilobytes. By way of comparison, the fifth column gives the size of the corresponding unoptimized object file produced by the Microsoft Visual C/C++ compiler. Pickles contain complete symbol-table information, so perhaps a more meaningful comparison is with the sizes of object files with embedded symbol tables, which the sixth column shows.

Table 1: Compilation times and output file sizes.

    lcc -S        lcc -S -asdl  pass2         Pickle       Object file  Object file
    100ths sec.   100ths sec.   100ths sec.   size in KB   size in KB   w/symbols in KB   File
       8             20             6           22.2          1.7           3.4          alloc.c
     115            211            98          242.1         91.5         124.4          alpha.c
      15             33            13           37.7         11.8          24.1          bytecode.c
      33             61            31           81.7         26.0          45.4          dag.c
      45             67            35           90.5         33.9          48.5          dagcheck.c
      42             77            44          100.8         36.9          61.4          decl.c
      35             65            34           82.6         20.6          36.7          enode.c
      26             25             8           27.4          5.3          12.2          error.c
      48             65            34           81.2         25.3          44.2          expr.c
      44             67            33           77.0         23.7          47.0          gen.c
      24             40            20           40.5          8.4          20.0          init.c
      16             26             8           26.5          4.9          11.2          input.c
      28             43            21           58.9         20.2          30.9          lex.c
      22             32            12           35.7          8.8          19.8          main.c
     119            161            87          206.2         75.6         106.4          mips.c
      35             32            10           33.1          7.5          14.7          output.c
      18             28            11           34.7          7.1          18.5          prof.c
      19             33            13           34.4          5.3          11.6          profio.c
      55             69            30           82.6         26.8          69.1          rcc.c
      45             65            34           92.2         22.4          37.3          simp.c
     147            221           157          264.9         94.1         129.9          sparc.c
      44             35            13           40.1         11.0          22.9          stab.c
      40             52            25           63.5         20.1          39.6          stmt.c
      16             28             8           25.5          2.7           4.8          string.c
      25             31            11           35.6         11.8          26.2          sym.c
      45             51            18           55.1         22.7          43.0          symbolic.c
      17             32            12           32.4          8.0          18.5          trace.c
      44             32            12           35.2          7.6          15.1          tree.c
      40             59            31           79.7         28.9          49.5          types.c
     185            249           177          298.0         97.3         133.9          x86.c
     182            342           227          365.5        117.1         157.9          x86linux.c
    1577           2352          1150         2783.5        885.0        1428.1          Total

Input/output time dominates the compilation times. The revised rcc is about 1.5--2 times slower than the monolithic compiler; building the ASDL data structures and emitting them accounts for most of this time. As detailed in the next section, rcc essentially duplicates its data structures as it builds the ASDL representation, which costs both time and space. pass2 is faster than both the monolithic compiler and rcc because it doesn't have to read and analyze the C source code; it simply inhales the pickle, rebuilds the compiler's data structures, and calls the back end functions. While rcc plus pass2 adds a factor of 2--3 to the compilation time, lcc is fast enough that this overhead is acceptable, especially in an experimental setting; for example, the monolithic compiler compiles itself in 15 secs., and rcc plus pass2 takes 35 secs.

Pickle sizes run about 3 times the sizes of object files and about 2 times the sizes of object files with embedded symbol tables. Compression could reduce pickle sizes to that of object files; for example, compressing all of the pickles listed in Table 1 yields an 867 KB zip file. Each pickle includes the symbol-table entries from the common header files included by each module. Pickle sizes could also be reduced by emitting these symbol-table data into a separate pickle and omitting them from the per-module pickles.

Evaluation

Retrofitting lcc to use ASDL highlighted some strengths and weaknesses in both ASDL and lcc. One of the somewhat unexpected strengths of ASDL is that it helps find bugs. Writing the ASDL back end revealed two related long-standing bugs in lcc. The first is illustrated by the following code.

    f(void) {
        extern int x;
        ...
    }
    int x;

The two declarations for x refer to the same identifier. The error is that lcc created two symbol-table entries for x: One was created at the extern declaration for x and used when compiling the body of f, and the other one was created at the top-level declaration for x and used thereafter, including announcing the definition of x via the code-generation function global (see Fig. 4). It was intended that there be only one symbol-table entry for x. These symbol-table entries had identical contents, and all of the existing back ends examined only the contents. The ASDL back end, however, used the pointer to the symbol-table entry as a handle to the corresponding ASDL symbol type (see Fig. 8) and thus erroneously created two symbols. References to x from within f referred to the wrong symbol, and thus the generated code was incorrect---it was as if the code had been written as

    f(void) {
        static int x;
        ...
    }
    int x;

The second bug adds another twist to the first bug and is illustrated by the following code.

    static int x;
    f(void) {
        extern int x;
        ...
    }

Again, lcc erroneously created two symbol-table entries when one was expected. It also changed the storage class of the x declared within f to static when it was announced to the back end, then changed it back to extern. As a result, the x appeared to be static when declared and extern when used. On targets that handle statics differently than externs, pass2 emitted incorrect code.

ASDL exposed some awkward binding times in the lcc code-generation interface, which required revising the implementation. lcc compiles the function

    f(x, y) char x; int y; { ... }

as

    f(? int x', ? int y') { ? char x = x'; ? int y = y'; ... }

lcc generates two symbol-table entries for each parameter: one for the parameter as passed by the caller---x' and y' in the code above---and one for the parameter as seen by the callee---x and y above. It generates assignments of the caller parameter to the corresponding callee parameter if their types differ or if their storage classes differ. In the example above, in which the occurrences of ? denote storage classes, the char parameter x is promoted and passed as an int, so the types of x and x' differ. Back ends can change the storage class of caller and callee parameters to reflect target-dependent calling conventions. On the MIPS, for example, y' is passed in a register, so an assignment to y is generated if y lands in memory.

The binding-time problem is that rcc doesn't have the information necessary to determine whether or not to generate these assignments. Storage class information is known only to the back end and thus isn't known until pass2 runs. The solution was to move the code that generates these assignments into pass2. A similar problem arises in common subexpression elimination (CSE), but requires extra work by both rcc and pass2. lcc does CSE on extended basic blocks, but it needs to know something about register assignments before it hoists an rvalue into a temporary. For example, in the expression

    a = b*c + b*d

the rvalue of b is a common subexpression. lcc copies b to a temporary, but only if the temporary is in a register and b isn't. Again, rcc does not have the data necessary to make that decision, because the back end has the final say on storage classes. So, rcc makes a conservative assumption and generates temporaries for all multiply referenced rvalues, and pass2 eliminates those that don't pay.
One of the flaws in ASDL is that it can lead to a duplication of data structures, which perhaps should be expected when modifying an existing compiler to use ASDL. lcc builds numerous data structures to represent the C source program, e.g., symbol-table entries, tree nodes, strings, etc. Most of the code in the ASDL back end is devoted to building copies of these data structures---that is, building a different, but logically equivalent, representation for nearly everything. All this duplication could be avoided if ASDL were used at the outset to define all the important data structures, but this approach would have required a much more drastic revision of lcc.

Perhaps the biggest nuisance in using ASDL is dealing with non-tree data structures, e.g., by using uids for symbols and types. These are common, and there should be a better way to handle them, or at least some more built-in support for defining and referencing them.

The lcc ASDL grammar is `ambiguous'; that is, it permits construction of type instances that do not represent valid lcc code-generation interface structures. For example, the grammar in Fig. 12 permits a Function whose codelist field includes another Function. This sequence of calls never occurs in lcc. Similar comments apply to the CVT, Binary, Unary, and Compare constructors: any operator could be given as the op field. Ambiguity shortens grammars, much the same way as an ambiguous YACC grammar is smaller than a non-ambiguous one. While few bugs could be attributed directly to using an ambiguous grammar, the savings probably isn't worth it. A non-ambiguous ASDL grammar, which might be no more than 50% longer, would catch more errors at compile time, and it would document the semantics more accurately.

Conclusions

Revising lcc to use ASDL was, overall, straightforward, and the resulting components---rcc and pass2---provide an improved platform for compiler-related research using lcc. It is now possible to insert additional passes into the compilation pipeline without modifying or even understanding the front and back ends. The obvious first candidate is a global optimization pass. Adding an optimizer will surely identify weaknesses in the current ASDL grammar. For example, it is likely that additional data structures, such as flow graphs, will be needed. The optimizer could build a flow graph itself, but it may prove useful to add these kinds of generally useful structures to the pickles.

Fortunately, ASDL can accommodate additions gracefully. A pickle consists of one or more instances of ASDL types. Currently, lcc pickles hold just an instance of program defined in Fig. 8. Other passes can append instances of additional types to pickles and use these instances without affecting pass2, because pass2 reads only the instance of program.

ASDL is equivalent to Document Type Definitions (DTDs) in XML [3], so it is natural to wonder if the increasing investment in XML tools can be leveraged to provide better compiler infrastructure tools. As a first step, the ASDL pickle readers and writers have been modified to emit pickles in XML instead of in the original binary format. These pickles are necessarily huge, because they are written in readable ASCII, but they compress to approximately the sizes suggested in Table 1; for example, the XML pickles for all of the modules listed in Table 1 compress into a 1394 KB zip file. They can, however, be examined and processed by generic XML browsers and editors, which obviates the need for ASDL-specific tools.
Work is also underway on minimal support for non-tree data structures. XML supports ID and IDREF `attributes'; these provide a way to name an instance of a type and to refer to it from instances of other types, which is essentially identical to the use of uids in the lcc ASDL grammar. Similar features may be added to ASDL.

Finally, ASDL is an ideal way to specify abstract data types and application programming interfaces (APIs), independent of whether or not they are going to be pickled. ASDL grammars are compact, language-independent, and hide implementation details. Debugging an ASDL grammar is usually much easier than debugging the corresponding handwritten code. With sufficient care, ASDL grammars might help simplify both the implementations of APIs and their tedious language-specific descriptions.

Acknowledgements

Daniel C. Wang wrote asdlGen and responded promptly when rcc and pass2 exposed bugs. He also wrote the XML pickler mentioned in the last section.

References

[1] C. W. Fraser and D. R. Hanson. A Retargetable C Compiler: Design and Implementation. Addison-Wesley, Menlo Park, CA, 1995.

[2] C. W. Fraser, D. R. Hanson, and T. A. Proebsting. Engineering a simple, efficient code-generator generator. ACM Letters on Programming Languages and Systems, 1(3):213--226, Sept. 1992.

[3] C. F. Goldfarb and P. Prescod. The XML Handbook. Prentice Hall, Englewood Cliffs, NJ, 1998.

[4] D. R. Hanson. C Interfaces and Implementations: Techniques for Creating Reusable Software. Addison-Wesley, Reading, MA, 1997.

[5] S. P. Harbison and G. L. Steele, Jr. C: A Reference Manual. Prentice Hall, Englewood Cliffs, NJ, fourth edition, 1995.

[6] The Open Group, Cambridge, MA. Architecture Neutral Distribution Format (XANDF) Specification, Jan. 1996.

[7] R. T. Snodgrass. The Interface Description Language: Definition and Use. Computer Science Press, Rockville, MD, 1989.

[8] Standard Performance Evaluation Corp., Manassas, VA. SPEC, 1996.

[9] T. B. Steel, Jr. A first version of UNCOL. In Proceedings Western Joint Computer Conference, pages 371--378, May 1961.

[10] D. C. Wang, A. W. Appel, J. L. Korn, and C. S. Serra. The Zephyr abstract syntax description language. In Proceedings of the Conference on Domain-Specific Languages, pages 213--227, Santa Barbara, Oct. 1997.

[11] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12):31--37, Dec. 1994.