Automatic Detection and Diagnosis of Faults
in Generated Code for Procedure Calls
Mark W. Bailey, Member, IEEE, and Jack W. Davidson, Member, IEEE Computer Society
Abstract---In this paper, we present a compiler testing technique that closes the gap between existing compiler implementations and
correct compilers. Using formal specifications of procedure-calling conventions, we have built a target-sensitive test suite generator
that builds test cases for a specific aspect of compiler code generators: the procedure-calling sequence generator. By exercising
compilers with these specification-derived target-specific test suites, our automated testing tool has exposed bugs in every compiler
tested on the MIPS and one compiler on the SPARC. These compilers include some that have been in heavy use for many years.
Once a fault has been detected, the system can often suggest the nature of the problem. The testing system is an invaluable tool for
detecting, isolating, and correcting faults in today's compilers.
Index Terms---Target-sensitive test suite generation, automatic fault isolation, procedure-calling convention, code generation,
compiler testing and debugging.
1 INTRODUCTION
BUILDING compilers that generate correct code is difficult.
To achieve this goal, compiler writers rely on automated
compiler building tools and thorough testing.
Automated tools, such as parser generators, take a
specification of a task and generate implementations that
are more robust than hand­coded implementations. Con­
versely, testing tries to make hand­coded implementations
more robust by detecting errors. One aspect of a compiler
that has traditionally been hand­coded is the portion that
generates calling sequences---implementations of procedure
calls. We have developed a language, called CCL, for
specifying procedure­calling conventions. CCL specifica­
tions are used to automatically generate calling sequences
for the vpcc/vpo retargetable optimizing compiler [1]. While
experimenting with CCL, we realized that the descriptions
could be used to make other compilers more robust without
requiring that the compiler implementation use CCL. In this
paper, we describe how CCL's underlying finite state
machine model can be used to generate tests for hand­
coded calling sequence generators in other compilers. This
technique has exposed a number of calling convention
errors in production­quality compilers that have been used
heavily for years. Although the convention examples used
here were originally specified using CCL, we omit a
description of CCL both for brevity and since the generated
finite state machine tables used here serve as equivalent
specifications of convention behavior. Only the knowledge
of how these automata model convention behavior is
necessary to understand our testing technique or to reap
its benefits. However, readers interested in the CCL
language and the automatic translation of CCL specifica­
tions to finite state automata can find these details in a
previous paper [1].
In this paper, we describe several contributions. First, we
present a method for automatically testing implementations
of procedure­calling conventions. Using this technique, we
have found bugs in mature C compilers. This approach,
which uses a formal model of procedure­calling conven­
tions, methodically generates tests that offer complete
coverage of the specified convention. Second, we introduce
an algorithm for intelligently selecting important tests from
the complete coverage suite. These tests include boundary
cases that are more likely to reveal bugs than exhaustive or
randomly generated tests. Third, because the tests focus
only on the calling convention, they isolate errors more
effectively than tests from a general test suite. Fourth, we
describe a method for automatically diagnosing the nature
of some types of faults. Finally, we describe a method for
quickly determining the conformance of multiple compilers
at once.
2 PROCEDURE CALLING CONVENTIONS
An important feature of high­level programming languages
that compilers must implement is the procedure call. The
interface between procedures facilitates separate compila­
tion of program modules and interoperability of program­
ming languages. This is accomplished by defining a
procedure­calling convention that dictates the way that
program values are communicated and how machine
resources are shared between a procedure making a call
(the caller) and the procedure being called (the callee). The
calling convention is machine­dependent because the rules
for passing values from one procedure to another depend
on machine­specific features such as memory alignment
restrictions and register usage conventions. The code that
implements the calling convention, known as the calling
sequence [2], must be generated by the code generator. This
aspect of the code generator, which we name the calling
sequence generator, is a source of great difficulty for the
compiler writer because it not only suffers from being
hand-coded, it also changes each time the compiler is retargeted.

IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 29, NO. 11, NOVEMBER 2003
. M. Bailey is with the Department of Computer Science, Hamilton College,
198 College Hill Road, Clinton, NY 13323.
E-mail: mbailey@hamilton.edu.
. J. Davidson is with the Department of Computer Science, University of
Virginia, 151 Engineer's Way, Charlottesville, VA 22911.
E-mail: jwd@virginia.edu.
Manuscript received 3 July 2002; revised 24 July 2003; accepted 6 Aug. 2003.
Recommended for acceptance by T. Reps.
For information on obtaining reprints of this article, please send e-mail to:
tse@computer.org, and reference IEEECS Log Number 116879.
0098-5589/03/$17.00 © 2003 IEEE Published by the IEEE Computer Society

2.1 A Simple Calling Convention
To aid in our discussion of calling conventions, we use a
simplified example calling convention. Fig. 1 contains the
calling convention rules for a hypothetical machine. Con­
sider the following ANSI C prototype for a function warp:
int warp(char p1, int p2, int p3, double p4);
For the purpose of transmitting procedure arguments for
our simple convention, we only consider the signature of the
procedure. We define a procedure's signature to be the
procedure's name, the order and types of its arguments,
and its return type. This is analogous to ANSI C's abstract
declarator [3], which, for the previous function prototype, is:
int warp(char, int, int, double);
which defines a function that takes four arguments (a char,
two int's, and a double), and returns an int.
With warp's signature, we can apply the calling
convention in Fig. 1 to determine how to call warp.
Arguments to warp would be placed in the following
locations:
. p1 in register R1,
. p2 in register R2,
. p3 in register R3, and
. p4 on the stack at offsets 0­7.
Notice that, although register R4 is available, p4 is placed
on the stack since it cannot be placed completely in
argument­transmitting registers (rule 4). Such restrictions
are common in actual calling conventions.
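
The placement of warp's arguments can be sketched in a few lines of Python. Because the rules of Fig. 1 are not reproduced in the text, the sketch infers them from the worked examples: registers R1 through R4 are assigned in order, a char or an int needs one register, a double needs two, and an argument that does not fit entirely in the remaining registers goes on the stack at its natural alignment. The names `TYPES` and `place_arguments`, and the choice to retire all registers once an argument has gone to the stack, are illustrative assumptions and not part of the convention as published.

```python
REGISTERS = ["R1", "R2", "R3", "R4"]

# Assumed per-type demands: (size in bytes, registers needed, stack alignment).
TYPES = {"char": (1, 1, 1), "int": (4, 1, 4), "double": (8, 2, 8)}


def place_arguments(arg_types):
    """Return one placement per argument for the assumed simple convention."""
    next_reg = 0   # index of the first free register
    sp = 0         # first free stack offset
    placements = []
    for arg_type in arg_types:
        size, nregs, align = TYPES[arg_type]
        if next_reg + nregs <= len(REGISTERS):
            # The whole argument fits in the remaining registers.
            placements.append(tuple(REGISTERS[next_reg:next_reg + nregs]))
            next_reg += nregs
        else:
            # Assumption: no register is used after the first stack argument.
            next_reg = len(REGISTERS)
            sp = (sp + align - 1) // align * align   # round up to alignment
            placements.append(("stack", sp, sp + size - 1))
            sp += size
    return placements


warp = place_arguments(["char", "int", "int", "double"])
```

Under these assumptions, `warp` holds R1, R2, R3, and stack bytes 0 through 7, in agreement with the list above.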
2.2 Convention to Implementation
Once the calling convention has been established, a
compiler can be targeted to generate the calling sequence
code that implements the procedure calls for the source
language. Traditionally, this code has been handcrafted. In
contrast, we use a calling convention specification and an
interpreter. The interpreter can generate tables that can be
used in the calling­convention­specific portion of vpcc/vpo
[1] or in a test suite generator. The test suite generator uses
information from the table to tailor the test suite to the
specific calling convention. The test suite can either be used
to confirm that the vpcc/vpo implementation properly uses
the convention tables or that another, independent compi­
ler conforms to the convention described in the CCL
specification. In the next section, we describe the formalism
that we use to capture convention details.
3 THE FORMAL MODEL
We use finite state automata to model a calling convention's
placement of arguments (and return values) in a machine's
memory locations. The use of FSAs for modeling parts of a
compiler and as an implementation tool has a long and
successful history. For example, FSAs have often been used
to implement lexical analyzers [4]. More recently, Proebsting
and Fraser [5] and Müller [6] have used finite state
automata to model and detect structural hazards in
pipelines for instruction scheduling.
The FSA model characterizes only the placement of
argument descriptors regardless of type or passing mechan­
ism. For pass­by­value, the placement of the descriptor is
the same as the placement of the actual argument value. For
pass­by­reference and more complicated passing mechan­
isms, the model describes the placement of the reference,
not the actual argument value. The model does not describe
how a descriptor is to be interpreted once it has been
transmitted from caller to callee or vice versa. If, however,
the caller and callee do not agree on how the descriptor is to
be interpreted, this fault will be detected since the argument
will not be properly transmitted to the callee.
3.1 P­FSA Representation
An example FSA that we use to model calling convention
placement is shown in Fig. 2. This FSA models the
placement of procedure arguments for the simple calling
convention described in Fig. 1. A placement FSA (P­FSA)
takes a procedure's signature as input and produces
locations for the procedure's arguments as output. The
automaton works by moving from state to state as the
location of each value is determined. When a transition is
used to move from one state to the next, information about
the current parameter is read from the input and the
resulting location is written to the output.
The states of the machine represent the state of allocation
for the machine's memory resources. For example, the state
$q_2$ (labeled 1100 000) represents the fact that registers R1 and
R2 have been allocated (the first two bits: 11), that registers
R3 and R4 have not been allocated (the second two bits: 00),
and that the stack pointer is currently eight-byte aligned
(the remaining three bits: 000). A transition between states
represents the placement of a single argument. Since
arguments of different types and sizes impose different
demands on the machine's resources, we may find more
than one transition leaving a particular state. In our
example, $q_8$ has three transitions even though two of them
(int and double) have the same target state ($q_4$). This
duplication is required since the output from mapping an
int is different from the output from mapping a double.
Fig. 1. Rules for a simple calling convention.
In most language implementations, the runtime stack is
modeled as an infinite or boundless resource. Modeling the
allocation of an infinite resource using an FSA poses a
problem, however. As mentioned previously, the state
represents which resources have been allocated. For finite
resources, this is easily accomplished by maintaining a bit
vector. When a resource no longer may be used, the
associated bit is set. For an infinite resource, this scheme
cannot work if we hope to use an FSA since this would
require a bit vector of infinite length. To simplify the
problem, we impose a restriction on infinite resources: their
allocation must be contiguous (see footnote 1). Thus, for an infinite
resource $I = \{i_1, i_2, \ldots\}$, we can store the allocation state
by maintaining an index $p$ whose value corresponds to the
index of the first available resource in $I$. Because the
allocation of $I$ must be contiguous, $p$ partitions the
resources: a resource $i_j$ is unavailable if $j < p$ and
available if $j \ge p$. For instance, if the stack is the infinite
resource, $p$ can be considered the stack pointer.
Nevertheless, we still have a problem. Although, for a
particular machine, the value of p must be finite, the
resulting FSA could have as many as $2^{32}$ stack allocation
states for a 32-bit machine. However, we can significantly
reduce this number by observing that the decision of where
to place a parameter in memory is not based on p, but rather
on alignment restrictions. For our example, we care only if
the next available memory location is 1, 4, or 8­byte aligned.
Consequently, we can capture the allocation state of the
machine with three bits that distinguish the memory
allocation states. We call these the distinguishing bits for
infinite resource allocation.
Handling pass­by­value structures creates an analogous
problem. Structures of different sizes allocate different
amounts of space. Hence, each structure of a different size
impacts the state of resource allocation differently. This
implies that each P­FSA state requires an infinite number of
exiting transitions; one for every different structure size.
Fortunately, since only the ``alignment state'' of the stack
pointer is of interest, we need only include transitions for
structures that leave the P­FSA in a different state. So, for a
convention that requires structures to be passed in 8­byte
aligned memory locations, all structures of size n, where
$n \bmod 8 = 1$, share the same transition out of a given state
because they leave the alignment, p, in the same state.
Therefore, the number of transitions leaving a state is
limited by the alignment restrictions of the machine.
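
A small sketch makes the bound concrete. It assumes that the alignment state is the stack offset modulo 8 and that a structure is placed at the next 8-byte aligned offset; the function name `next_alignment` and its parameters are hypothetical.

```python
def next_alignment(d, size, align=8, modulus=8):
    """Alignment state after placing `size` bytes at the next `align`-aligned
    offset, starting from alignment state `d` (stack offset mod `modulus`)."""
    start = (d + align - 1) // align * align   # round up to the alignment
    return (start + size) % modulus


# Structures of 1, 9, and 17 bytes all satisfy n mod 8 = 1, so from any
# starting state they lead to the same successor state.
successors = {d: {next_alignment(d, n) for n in (1, 9, 17)} for d in range(8)}
```

Each entry of `successors` is a single-element set, and over all structure sizes there are at most eight distinct successor states, one per residue modulo 8.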
Placement functions are described in terms of finite
resources, infinite resources, and selection criteria. A set of
finite resources $R = \{r_1, r_2, \ldots, r_n\}$ is used to represent
machine registers, while an infinite resource $I = \{i_1, i_2, \ldots\}$
(see footnote 2) is used to represent the stack. The selection criteria
$C = \{c_1, c_2, \ldots, c_m\}$ correspond to characteristics about
arguments (such as their type and size) that the calling
convention uses to select the appropriate location for a
value. We encode the signature of a procedure with a tuple
$w \in (C^*, C^*)$. The first element of the tuple contains zero or
more (see footnote 3) return value criteria, while the second element
contains zero or more parameter criteria. Each state $q$ in
the automaton is labeled according to the allocation state
that it represents. The label includes a bit vector $v$ of size $n$
that encodes the allocation of each of the finite resources in
$R$. Additionally, to express the state of allocation for the
stack, we include $d$, the distinguishing bits that indicate the
state of stack alignment. So, a state label is a string $vd$ that
indicates the resource allocation state. In our example
convention, $n = 4$ and the length of the string $d$ ($|d|$) is 3. So,
each state is labeled by a string from the language
$\{0,1\}^4 \{0,1\}^3$. The output of the machine is a string $s \in P^*$,
where
$$P = R \cup \{0,1\}^{|d|}$$
which contains the placement information.
Fig. 2. P-FSA for transmission of parameters for a simple calling convention.
1. P-FSAs could also handle a convention whose stack allocation was not
contiguous by treating the stack as a finite resource. Although this approach
might require a very large bit-vector, any implementation of such a
convention would have to use a bit-vector (or equivalent) to properly
implement the convention. We know of no such compiler or convention.
2. This can easily be extended to model more than one infinite resource.
3. Supports languages that allow multiple return values.
Since the P-FSA produces output on transitions, we have
a Mealy machine [7]. We define a P-FSA, $M$, as a six-tuple
(see footnote 4) $M = (Q, \Sigma, \Delta, \delta, \lambda, q_0)$, where:
. $Q$ is the set of states with labels $\{0,1\}^n \{0,1\}^{|d|}$
representing the allocation state of machine
resources.
. The input alphabet $\Sigma = C$ is the set of selection
criteria.
. The output alphabet $\Delta = P$ is the set of memory
location strings.
. The transition function $\delta : Q \times \Sigma \to Q$.
. The output function $\lambda : Q \times \Sigma \to \Delta^+$.
. $q_0$ is the state labeled by $0^n w$, where $|w| = |d|$ and $w$ is
the initial state of $d$.
We also define $\hat{\delta} : Q \times \Sigma^* \to Q$ and
$\hat{\lambda} : Q \times \Sigma^* \to \Delta^*$,
which are just string versions (see footnote 5) of $\delta$ and $\lambda$, respectively. So,
for our example, we have
$$M = (Q, \{\text{char}, \text{int}, \text{double}\}, \{R1, R2, R3, R4\} \cup \{0,1\}^3, \delta, \lambda, q_0),$$
where $Q$ and $\delta$ are shown in Fig. 2 and $\lambda$ is defined in Table 1.
Note that we have modified the traditional definition of $\lambda$ to
allow multiple symbols to be output on a single transition.
This reflects the fact that arguments can be located in more
than one resource. For example, in state $q_5$ on an int, Table 1
indicates that $M$ produces the string of four symbols 100 101
110 111 that designates four bytes that are four-byte aligned,
but are not eight-byte aligned.
The signature:
int f(double, double, char, int);
will take the P­FSA in Fig. 2 along the path
$q_0 \to q_2 \to q_4 \to q_5 \to q_4$
producing the string (R1 R2) (R3 R4) (000) (100 101 110 111)
along the way. The parentheses in the output string are
required to determine where the placement of one argu­
ment ends and the next argument's placement begins. From
the string, we can derive the placement of f's arguments.
The first double is placed in registers R1 and R2, the
second in registers R3 and R4, the char at the stack location
with offset zero, and the int at the stack location with
offset four.
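
The walk above can be replayed with a small Mealy-machine sketch. Fig. 2 and Table 1 are not reproduced here, so the transition and output functions are derived from assumed rules (char and int take one register, double takes two, and an argument that does not fit in the remaining registers goes on the stack at its natural alignment). The state numbering in `state_name` is likewise inferred from the labels and paths mentioned in the text. All function names are hypothetical.

```python
REGS = ["R1", "R2", "R3", "R4"]

# Assumed per-type demands: (size in bytes, registers needed, stack alignment).
KINDS = {"char": (1, 1, 1), "int": (4, 1, 4), "double": (8, 2, 8)}

START = "0000" + "000"   # label vd of q0: no register used, d = 000


def step(label, kind):
    """Combined delta and lambda: return (next state label, output symbols)."""
    v, d = label[:4], int(label[4:], 2)
    size, nregs, align = KINDS[kind]
    used = v.count("1")
    if used + nregs <= len(REGS):
        v_next = "1" * (used + nregs) + "0" * (len(REGS) - used - nregs)
        return v_next + label[4:], tuple(REGS[used:used + nregs])
    # Assumption: once an argument goes to the stack, all registers are retired.
    start = (d + align - 1) // align * align
    symbols = tuple(format((start + i) % 8, "03b") for i in range(size))
    return "1111" + format((start + size) % 8, "03b"), symbols


def state_name(label):
    """Assumed numbering: q0..q4 count used registers, q5..q11 add d."""
    return "q%d" % (label[:4].count("1") + int(label[4:], 2))


def run(signature):
    """String versions of delta and lambda: return (state path, output groups)."""
    label, path, output = START, [state_name(START)], []
    for kind in signature:
        label, symbols = step(label, kind)
        path.append(state_name(label))
        output.append(symbols)
    return path, output


path, output = run(["double", "double", "char", "int"])
```

With these assumptions, `path` is q0, q2, q4, q5, q4 and `output` groups the symbols as (R1 R2) (R3 R4) (000) (100 101 110 111), matching the string derived above. Each tuple in `output` plays the role of one parenthesized group.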
3.2 Completeness and Consistency in P­FSAs
In our experience, we have encountered many recurring
difficulties in the calling convention code generators of
optimizing compilers for RISC machines (see footnote 6). There are three
sources for these problems: the convention specification, the
convention implementation, and the implementation pro­
cess. We address each of these in the following paragraphs.
Many problems arise from the method of convention
specification. Often, no specification exists at all. Instead,
the native compiler uses a convention that must be
extracted by reverse engineering. In the cases where a
specification exists, it typically takes the form of written
prose or a few general rules, e.g., our example description
in Fig. 1. Such methods of specification have obvious
deficiencies. Furthermore, even if we have an accurate
method for specifying a convention, it still may be possible
to describe conventions that are internally inconsistent or
incomplete. For example, the convention may require that
more than one procedure argument be placed in a
particular resource. Another possibility is that the specifica­
tion may omit rules for a particular data type or combina­
tion of data types.
Those problems that do not arise from the specification
result from incorrect implementation of the convention.
Many of the same problems in the specification process also
plague the implementation. Many conventions have nu­
merous rules and exceptions that must be reflected in the
implementation. Another difficulty is the implementation
may require the use of the convention in several different
locations. Maintaining a correspondence between the
various implementations can itself be a great source of
errors. Finally, this problem is exacerbated by the fact that
the implementation frequently undergoes incremental
development. Rather than taking on the chore of imple­
menting the entire convention at once, a single aspect of the
convention, such as providing support for a single data
type, is tackled. After successfully implementing this
subset, the next increment is undertaken. During this
process, some aspect of the first stage may break due to the
interactions between the two pieces.
4. We use the notation of Hopcroft and Ullman for finite state automata
and regular expressions [8]. We use letters early in the alphabet (a, b, c) to
denote single symbols. Letters late in the alphabet (w, x, y, z) will denote
strings of symbols.
5. Defined by Hopcroft and Ullman [8].
6. Unlike many CISC machines, RISC machines typically increase the
complexity of calling conventions by requiring that procedures pass
parameters in registers whenever possible.
TABLE 1
Definition of $\lambda$ for Example P-FSA
The result of these observations is that there are several
properties that we would like to ensure about a specifica­
tion and implementation. The preceding discussion moti­
vates the following categories of questions:
. Completeness:
­ Does the specified convention handle any
number of arguments?
­ Does the convention handle any combination of
argument types?
. Consistency:
­ Does the convention map more than one
argument to a single machine resource?
­ Do the caller and callee's implementations agree
on the convention?
Many questions like these can be answered using P­FSAs.
The following sections show how we can prove certain
properties about P­FSAs that ensure desirable responses to
the preceding questions.
3.2.1 Completeness
The completeness properties address how well the conven­
tion covers the possible input cases. A convention must
handle any procedure signature. If we could guarantee that
the convention was complete, or covered the input set, then
we could answer the completeness questions posed in the
previous section. We can determine if a convention is
complete by looking at the resulting P­FSA. For example,
will the convention work for any combination of argument
types? The answer lies in the P-FSA transitions. For the
convention to be complete, each state $q \in Q$ must have
$\delta(q, c)$ defined for all $c \in C$. Should there exist some state
$q \in Q$ and criterion $c \in C$ such that $\delta(q, c)$ is undefined, then,
having arrived in state $q$ on input $w$, the machine would fail
to accept any input string whose prefix was $wc$. Thus, there
would be some signature whose placement could not be
determined by the P-FSA. Since all correct P-FSAs must
accept all strings in $C^*$, we can easily detect any P-FSA that
implements an incomplete convention by looking for states
with missing transitions.
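
As a minimal sketch of this check, assume the transition function is stored as a dictionary keyed by (state, criterion) pairs. The table below is a hypothetical two-state fragment that deliberately omits one entry; it is not taken from Fig. 2.

```python
def missing_transitions(states, criteria, delta):
    """Return every (state, criterion) pair for which delta is undefined."""
    return [(q, c) for q in states for c in criteria if (q, c) not in delta]


# Hypothetical table that forgets how to place a double in state "q1".
CRITERIA = ["char", "int", "double"]
DELTA = {
    ("q0", "char"): "q1", ("q0", "int"): "q1", ("q0", "double"): "q1",
    ("q1", "char"): "q1", ("q1", "int"): "q1",
}

gaps = missing_transitions(["q0", "q1"], CRITERIA, DELTA)
```

Here `gaps` contains the single pair ("q1", "double"), so any signature that reaches q1 and then supplies a double has no defined placement.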
3.2.2 Consistency
The consistency properties address whether the convention
is internally and externally consistent. A convention is
internally consistent if there is no machine resource that can
be assigned to more than one argument. A convention is
externally consistent if the caller and callee agree on the
locations of transmitted values. In our model, we detect
internal inconsistency and prevent external inconsistency.
To detect internal inconsistencies, we again turn to the
P­FSA. If the convention only used finite resources,
detecting a cycle in the P­FSA would be sufficient to detect
the error. However, when infinite resources are introduced,
so are cycles. We cannot have an internal inconsistency for
an infinite resource since p is defined to be monotonically
increasing. We detect finite resource inconsistencies in the
following manner: An inconsistency can occur when there
is a transition from some state $q_j$ to $q_k$, where bit $i$ in the
finite bit vector is 1 in $q_j$, but 0 in $q_k$. At this point, $M$ has lost
the information that resource $r_i$ was already allocated. We
can detect this change by comparing all pairs of bit vectors
$v_1$, $v_2$ such that $v_1$ labels $q_j$, $v_2$ labels $q_k$, and $\delta(q_j, c) = q_k$ for
some $c \in C$. To do the comparison, we compute
$$v_3 = (v_1 \oplus v_2) \wedge v_1.$$
$v_1 \oplus v_2$ selects all bits that differ between $v_1$ and $v_2$. We
logically AND this with $v_1$ to determine if any set bits
change value. Thus, if $v_3$ has any bit set, we have an
inconsistency.
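
The comparison maps directly onto integer bit operations. The sketch below assumes each state's finite bit vector is stored as an integer and the transition function as a dictionary. The states and the faulty target named "bad" are hypothetical.

```python
def lost_bits(v1, v2):
    """v3 = (v1 XOR v2) AND v1: the bits that are set in v1 but clear in v2."""
    return (v1 ^ v2) & v1


def inconsistent_transitions(delta, vectors):
    """Return the transitions (qj, c, qk) on which an allocated bit is lost."""
    return [(qj, c, qk) for (qj, c), qk in delta.items()
            if lost_bits(vectors[qj], vectors[qk])]


# Hypothetical fragment: the last transition forgets that R1 was allocated.
VECTORS = {"q2": 0b1100, "q3": 0b1110, "bad": 0b0110}
DELTA = {("q2", "int"): "q3", ("q3", "int"): "bad"}

faults = inconsistent_transitions(DELTA, VECTORS)
```

The transition from q2 to q3 only sets a bit, so its $v_3$ is zero. The transition from q3 to "bad" clears the leading bit, so it is reported.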
Our convention specification language prevents external
inconsistencies in the calling convention. A convention
specification only defines the argument transmission loca­
tions once. Although both the caller and the callee must
make use of this information, the specification does not
duplicate the information. Since we only have a single
definition of argument locations, we only construct a single
P­FSA to model the placement mapping. This single P­FSA
is used in both the caller and callee. Thus, we prevent
external inconsistencies by requiring the caller and callee
use the same implementation for the placement mapping.
4 CONSTRUCTION OF DIAGNOSTIC PROGRAMS
Using P­FSAs as an implementation foundation for a
compiler enables all of the static analyses described in the
previous section. However, when a compiler does not use a
P-FSA in its implementation, we can still leverage the
P-FSA formalism to increase the implementation's
robustness through systematic testing.
4.1 Test Vector Selection
To test a compiler's implementation of a calling convention,
we must select a set of programs to compile and run. To
exercise the calling convention, each test program must
contain a caller and a callee procedure. For the purpose of
testing the proper transmission of program values between
procedures, the signature of the callee uniquely identifies a
test case. Thus, two different programs whose callees'
signatures match perform the same test. Therefore, the
problem of generating test cases reduces to the problem of
selecting signatures to test.
Selecting which procedure signatures to test is a difficult
problem. Because the set of signatures, $S = \{(C^*, C^*)\}$, is
infinite, one cannot test all signatures. However, since we
can model the function that computes the placement of
arguments as an FSA, there must be a finite number of
states in an implementation to be tested. This is the case for
any implementation, including those that do not explicitly
use FSAs to model the placement function.
The problem of confirming that an implementation
properly places procedure arguments is equivalent to
experimentally determining if the implementation behaves
as described by the P­FSA state table. This problem is
known as the checking experiment problem from finite­
automata theory [9], [10]. There are numerous approaches
to this problem, most of which are based on transition
testing. Transition testing forces the implementation to
undergo all the transitions that are specified in the
specification FSA.
An obvious first approach to generating test vectors
using the P-FSA specification is to generate all vectors
whose paths through the FSA are acyclic and those whose
path ends in a cycle (see footnote 7). This solution ensures that each state $q$
is visited and each transition $\delta(q, a)$ is traversed. However,
the number of such paths for an FSA is $O(|\Sigma|^{|Q|})$. Table 2,
which contains profiles for five P-FSAs generated
automatically from CCL specifications, demonstrates that the
acyclic method is not feasible for complex conventions such
as the MIPS and SPARC. A simpler approach is to
guarantee that each transition is exercised at least once.
This limits the number of test vectors to no more than
$|Q| \times |\Sigma|$. However, this method results in poor coverage that
does not inspire confidence in the test suite. For example,
for the P-FSA in Fig. 2, the three signatures:
void f(double, double);
void f(int, int, int, int);
void f(int, double);
cover all int and double transitions leaving states $q_0$ through $q_2$.
This leaves the signature:
void f(double, int);
untested. Clearly, such a test should be included in the
suite. To further illustrate the problem, consider the FSA
specification shown in Fig. 3a. An erroneous implementa­
tion, shown in Fig. 3b, contains an extra state q 1 that is
reached on initial input ``b.'' The two strings, ``aaa'' and
``bbb'' completely cover the specification FSA transitions.
Unfortunately, these test vectors will not detect that the
implementation has an additional (fault) state. Thus, it is
not sufficient to include only test vectors that cover the
transition set.
An alternative, which falls between the simple transition
approach and the acyclic path approach, we call the
transition­pairing approach. In transition pairing, we exam­
ine each state in the specification FSA. For each state, we
include a test vector that covers each pair of entering and
exiting transitions. This eliminates the faulty state detection
problem illustrated in Fig. 3. To illustrate how, consider the
test vectors this process generates: While examining state $q_1$,
transition-pairing will add the substrings ``aa,'' ``ab,'' ``ba,''
and ``bb'' to the set of substrings used to generate test
vectors. Since the context in which these substrings are to be
used is $q_0$, they contribute prefixes to the test vector set.
Upon exercising $q_1$ using the prefix ``ba,'' the
implementation FSA will generate incorrect output: 10 instead of 11.
This difference can be identified and the faulty state
detected.
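
The effect can be reproduced with a pair of hypothetical Mealy machines. They are not a transcription of Fig. 3, which is not reproduced here; they are chosen only so that ``aaa'' and ``bbb'' cover every specification transition while the prefix ``ba'' yields 11 in the specification and 10 in the faulty implementation. The state names and the helper `run` are assumptions.

```python
# Each entry maps (state, input) to (next state, output symbol).
SPEC = {
    ("s0", "a"): ("s1", "1"), ("s0", "b"): ("s1", "1"),
    ("s1", "a"): ("s1", "1"), ("s1", "b"): ("s1", "0"),
}

# Faulty implementation: an extra state "x" is entered on an initial "b".
IMPL = dict(SPEC)
IMPL[("s0", "b")] = ("x", "1")
IMPL[("x", "a")] = ("s1", "0")   # wrong output, the specification gives "1"
IMPL[("x", "b")] = ("x", "0")


def run(machine, inputs, start="s0"):
    """Feed `inputs` to a Mealy machine and return its output string."""
    state, output = start, []
    for symbol in inputs:
        state, out = machine[(state, symbol)]
        output.append(out)
    return "".join(output)


# Vectors on which the specification and the implementation disagree.
cover_detects = [v for v in ("aaa", "bbb") if run(SPEC, v) != run(IMPL, v)]
pair_detects = [v for v in ("aa", "ab", "ba", "bb") if run(SPEC, v) != run(IMPL, v)]
```

In this sketch the transition-cover vectors expose nothing, while the transition-pair prefixes single out ``ba''.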
In addition to such fault detection, transition­pairing
provides tests that have a similar characteristic to the acyclic
method: Transitions are tested in all the contexts that they
can be applied. Although there are many combinations that
are not tested, they are similar to ones included in the set.
For example, in the simple FSA pictured in Fig. 2, we could
have a set of test vectors that includes the vector double
double double to exercise the state $q_4$ with the transition
pair $((q_2, \text{double}), (q_4, \text{double}))$. Such a set would not need
to include int int double double to cover the same
transition pair.
This method of test vector generation provides
complete coverage of the transitions in the specification FSA.
Further, the tests reflect the context sensitivity that
transitions have. This allows for some erroneous state and
transition detection, while significantly reducing the
number of test vectors. The resulting test suites are significantly
smaller than those of the acyclic method, while still providing a
significant degree of confidence (Table 3).
7. We define a path that ends in a cycle to be a cyclic path $wa$ where the
path $w$ is acyclic.
TABLE 2
P-FSA Profiles for Several Calling Conventions
Fig. 3. Example FSA where a fault will not be detected. (a) Specification
FSA. (b) Implementation FSA.
TABLE 3
Sizes of Test Suites for Various Selection Methods
An algorithm for generating transition-pair paths is
shown in Fig. 4. The algorithm performs a depth-first search
of the FSA state graph. Each time a transition $(q, a)$ is
encountered, it is marked. This mark indicates that all paths
that go beyond $(q, a)$ have been visited. When the algorithm
reaches a state $q_n$ on transition $(q_m, a)$, each transition $(q_n, b)$
where $b \in \Sigma$ is visited whether or not it is marked. This
causes all pairs of transitions $((q_m, a), (q_n, b))$ to be included.
These pairs represent all combinations of one entering
transition with all exiting transitions. Because the algorithm
is depth-first, each entering transition is guaranteed to be
visited. Thus, all combinations of entering and exiting
transitions are included.
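
The following is a minimal sketch of such a depth-first search, written from the prose description and not from Fig. 4 itself. The names `transition_pair_vectors` and `covered_pairs`, the dictionary encoding of the transition function, and the three-state example FSA are all assumptions made for illustration.

```python
def transition_pair_vectors(delta, start, alphabet):
    """Depth-first generation of test vectors covering every transition pair."""
    marked, vectors = set(), []

    def visit(state, path):
        # Every exiting transition is taken, whether or not it is marked.
        for symbol in alphabet:
            extended = path + [symbol]
            if (state, symbol) in marked:
                vectors.append(extended)   # pair covered, do not go beyond
            else:
                marked.add((state, symbol))
                visit(delta[(state, symbol)], extended)

    visit(start, [])
    return vectors


def covered_pairs(delta, start, vectors):
    """Collect the (entering, exiting) transition pairs the vectors exercise."""
    pairs = set()
    for vector in vectors:
        state, previous = start, None
        for symbol in vector:
            current = (state, symbol)
            if previous is not None:
                pairs.add((previous, current))
            state, previous = delta[current], current
    return pairs


# Hypothetical complete FSA over {a, b} with states 0, 1, and 2.
DELTA = {(0, "a"): 1, (0, "b"): 2, (1, "a"): 1,
         (1, "b"): 2, (2, "a"): 2, (2, "b"): 2}

vectors = transition_pair_vectors(DELTA, 0, "ab")
all_pairs = {((m, a), (n, b)) for (m, a), n in DELTA.items() for b in "ab"}
```

Marking a transition before descending guarantees termination in the presence of cycles, and each vector ends as soon as it re-enters a marked transition. For this example every entering and exiting combination in `all_pairs` is exercised by some vector.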
Work related to the automatic generation of test suites
has received much attention recently in the area of
conformance testing of network protocols [11]. The purpose
of these suites is to determine if the implementation of a
communication protocol adheres to the protocol's specifica­
tion. Often, the protocol specification is provided as a finite­
state machine. This has resulted in many methods of test
selection, including the Transition tour, Partial W­method
[12], Distinguishing Sequence Method [10], and Unique­
Input­Output method [13]. These methods are derivatives
of the checking experiment problem where an implementa­
tion is checked against a specification FSM [14]. Such
techniques have also been used in the automatic verification
of digital circuits [9], [15].
What distinguishes these methods from ours are the
underlying assumptions concerning the characteristics of
the implementation FSAs. Unlike theirs, our FSAs can have
a large number of states and transitions. This significantly
changes the nature of the solution to the problem.
Furthermore, much of the problem that network confor­
mance researchers are faced with is identifying which state
the implementation FSA is in. A significant portion of their
work focuses on generating test vectors that discover the
state of the machine. Fortunately, we can always put our
implementation FSA in the start state. Also, in their work, a
bound on the number of states in the implementation FSAs
is assumed. Because we have no practical bound on the
number of states in the implementation, their work is not
applicable.
4.2 Test Case Generation
After selecting the appropriate test vectors or procedure
signatures, the corresponding test cases must be realized. In
our approach, we generate a separate test program for each
test vector so that we can easily match any reported errors
to the specific test vector.
A procedure call is composed of two pieces: the
procedure call within the caller (the call site) and the body
of the callee. Because they are implemented differently,
these two pieces of code are typically generated in separate
locations in a compiler. This natural separation is reflected
in the way that we construct our test cases. Each test case
comprises two files: one contains the caller, the other
contains the callee. The two files are compiled and linked
together. The programs are self­checking so that, if a
procedure call fails, this event is reported by the test itself.
Fig. 5 shows the compiler conformance test process. One
file is compiled by the compiler­under­test (CUT), while the
other is compiled by the reference compiler. The reference
compiler operationally defines the procedure­calling con­
vention (its implementation is defined to be correct). The
resulting object files are linked together and run. Results of
the test are checked by the conformance verifier and given
to the test conductor. The test conductor tallies the results of
all tests for a test suite and generates a conformance report.
Although this process uses two compilers, the same process
may still be used if a reference compiler is not available.
However, this will weaken the conformance verifier's
ability to automatically diagnose errors, as discussed in the
next section.
Fig. 4. Test vector generation algorithm.
Figs. 6 and 7 show an example test case for the C
signature void (int, double, struct(2)) (see footnote 8). The caller
loads each argument with randomly selected bytes. However,
the values of these bytes have an important property:
Each contiguous set of two bytes is unique. Thus, for a
string $B$ of $m$ bytes, for all indexes $0 < i \le m$, there exists no
index $0 < j \le m$ with $j \ne i$ such that $B[j + k] = B[i + k]$ for
all $0 \le k < 2$. We can easily guarantee this property for all
strings $B$ whose length is no greater than 65,536 ($2^{16}$) bytes.
Since the likelihood of using an argument list of size greater
than 64 Kbytes is small, this is sufficient to guarantee that
every pair of adjacent bytes passed between procedures is unique. This
makes it easier to identify if an argument has been shifted
or misplaced. The callee receives the values and checks
them against the expected values. If the values do not
match, an error condition is signaled.
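One way to produce such byte strings is sketched below. This is an illustration under stated assumptions, not the generator used in the testing system: the function names, the greedy random strategy, and the 4,096-byte limit are all choices made for the sketch. Strings approaching the 65,536-byte bound would need a more careful construction.

```python
import random


def unique_pair_bytes(m, seed=None):
    """Return m random bytes in which no contiguous two-byte window repeats.

    Greedy sketch: each new byte is drawn from those that do not recreate
    an already used (previous, next) pair. The 4096-byte cap is an
    assumption of this sketch, chosen so dead ends are extremely unlikely.
    """
    if m > 4096:
        raise ValueError("sketch only supports strings of at most 4096 bytes")
    rng = random.Random(seed)
    while True:  # restart in the rare case that the walk reaches a dead end
        out = bytearray([rng.randrange(256)])
        seen = set()
        while len(out) < m:
            candidates = [b for b in range(256) if (out[-1], b) not in seen]
            if not candidates:
                break
            b = rng.choice(candidates)
            seen.add((out[-1], b))
            out.append(b)
        if len(out) >= m:
            return bytes(out[:m])


def has_unique_pairs(data):
    """Check the property: every contiguous two-byte window occurs once."""
    pairs = [data[i:i + 2] for i in range(len(data) - 1)]
    return len(pairs) == len(set(pairs))
```

A callee that finds a known two-byte window at an unexpected offset can report how far the argument was shifted, which is the point of making every window distinct.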
As one might expect, the generation of good test cases
from selected signatures is language dependent. One
convention used in the C programming language is varargs.
varargs is a standard for writing procedures that accept
variable length argument lists. The proper implementation
of varargs in a C compiler is difficult. For each test case that
we generate, we also generate a varargs version to verify
that this standard convention is implemented correctly.
4.3 Automatic Diagnosis of Errors
Generation of good tests is only a part of the testing process.
If a test fails, the problem must be diagnosed and a solution
developed. In this section, we discuss how the second step,
diagnosis, can be partially automated.
As discussed earlier, the conformance verifier links a
caller and callee together and runs the resulting program.
When both a reference compiler and CUT are used, this
results in four distinct caller­callee pairs whose results we
call an outcome. We can glean more information than a
single test can supply by considering the composite result
that an outcome provides. When a test fails, the results of
two other tests can help isolate the fault. For example, in the
outcome shown in Fig. 8, the CUT/reference test has failed.
Since the CUT/CUT test failed, but the reference/CUT test
passed, this indicates the fault is in the CUT caller.
This method of isolating errors by swapping different
components makes it possible to automatically diagnose
common errors. Since there are only 16 outcome configura­
tions, each outcome can be hand­analyzed once and the
results tabulated, as shown in Table 4. Several diagnoses
deserve mention. First, although the reference compiler is
considered the authority, there are six cases where the
reference can be determined to be faulty. Second, the four
outcome configurations in which exactly one test fails are
treated as impossible, since we assume that each component
uses a single convention (see footnote 9).
Finally, for two of the cases, we not only can isolate the
location of the fault, but we can identify the nature of the
error. This occurs in outcomes where two conflicting
conventions have been discovered.
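The count of 16 configurations, and the reason the four single-failure configurations are excluded, can be checked by brute force. The sketch below rests on a simplifying model that is an assumption of this example: each of the four components (reference caller, reference callee, CUT caller, CUT callee) implements exactly one convention, and a test passes exactly when its caller and callee implement the same one. The names are hypothetical, and the sketch does not reproduce the diagnoses of Table 4.

```python
from itertools import product

# The four caller/callee pairings that make up one outcome.
TESTS = [("ref", "ref"), ("ref", "cut"), ("cut", "ref"), ("cut", "cut")]

COMPONENTS = [("caller", "ref"), ("caller", "cut"),
              ("callee", "ref"), ("callee", "cut")]


def realizable_outcomes():
    """Outcomes that can arise when every component uses one convention."""
    outcomes = set()
    # Four labels are enough to give every component a distinct convention.
    for labels in product(range(4), repeat=4):
        conv = dict(zip(COMPONENTS, labels))
        outcome = tuple(conv[("caller", a)] == conv[("callee", b)]
                        for a, b in TESTS)  # True means the test passes
        outcomes.add(outcome)
    return outcomes


ALL_OUTCOMES = set(product([True, False], repeat=4))
IMPOSSIBLE = ALL_OUTCOMES - realizable_outcomes()
```

Under this model, three passing tests force the fourth caller and callee to agree as well, so the only configurations that cannot be realized are the four with exactly one failure.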
The combination of test vector selection and automatic
diagnosis proves to be a powerful debugging tool. As tests
are generated, run, and analyzed, patterns of errors tend to
emerge. We have found that the patterns themselves
suggest the nature of the problem. For example, finding
that an error occurred for every signature that included a
struct of size greater than seven bytes might suggest an
alignment problem. More sophisticated patterns can exist
and with knowledge of the calling convention can sig­
nificantly help the developer correct faults.
4.4 Test Results
We used our technique for selecting test vectors to test
several compilers on several target machines. Several errors
were found in C compilers on the MIPS. In this section, we
present these results.
IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, VOL. 29, NO. 11, NOVEMBER 2003
Fig. 5. The compiler conformance test process.
8. We denote a structure whose size is n bytes as struct(n).
9. Appel observes that such outcomes actually are possible [16]. In his
counterexample, the CUT caller implements a different convention than the
reference compiler, but the CUT callee implements both conventions. In this
scenario, the fault is detected in the CUT/reference test, but not in either the
CUT/CUT or the reference/reference tests. Although such a case is
possible, the probability of a callee implementing two different conventions
that do not conflict (a conflict being, for instance, the use of the same
register for two different purposes) is small. The benefits, in terms of
diagnostic ability, of considering such a
case as invalid far outweigh any accuracy gained by labeling it a valid
outcome. Finally, if such a case were to occur, it would still be detected; it
just could not be automatically diagnosed.
We selected several C compilers that generate code for
the MIPS architecture (a DECStation Model 5000/125).
These included the native compiler supplied by DEC, two
versions of Fraser and Hanson's lcc compiler [17], [18],
several versions of GNU's gcc [19], and a previous version
of our own C compiler, vpcc/vpo, that used a hand­coded
calling sequence generator [20]. Although we feel that this
technique is extremely valuable throughout the compiler
development cycle, we believe that it would be fairest to
evaluate its effectiveness in finding errors in young
implementations of compilers. Where possible, we have
used early versions of these compilers. These versions,
called legacy compilers, represent younger implementations
that more accurately exhibit bugs found in initial releases of
compilers. However, each of these compilers is a produc­
tion­quality compiler that has been widely used for years.
Finding any bugs in their implementations is still a
significant challenge.
In testing the compilers, we checked for two types of
conformance: internal and external. Compiler A internally
conforms if code that it generates for a caller can properly
call code for a callee that it generated. We denote this using
$A \mapsto A$. Compiler A externally conforms if its caller code can
call another compiler B's callee code and vice versa ($A \mapsto B$
and $B \mapsto A$). Thus, the callees and callers are compiled using
each of the compilers under test. This results in n object
versions for n compilers. Each caller version is then linked
with the callee that was generated by the same compiler.
This results in the n tests necessary to verify internal
conformance for this test case. To establish external
conformance, we could naïvely link each caller to each
callee, which would yield $2n^2$ tests. However, we can do
better. Recognizing that procedure call ($\mapsto$) is symmetric,
we can easily reduce this to $n^2$ (since if $A \mapsto B$, then $B \mapsto A$).
Furthermore, procedure call is also transitive, so if $A \mapsto B$
and $B \mapsto C$, then $A \mapsto C$. This reduces the number to $2n - n$ as
pictured in Fig. 9. Each compiler's caller is linked to the
reference compiler's callee. This facilitates the isolation of
which compiler does not conform when an error is detected.
Fig. 6. Code generated for caller.
Fig. 7. Code generated for callee.
The results of running both internal and external tests on
the compiler set for the MIPS are shown in Table 5. We
found both internal and external conformance errors in all
of the tested compilers. Table 5 reports internal and external
errors separately. Within each class, the number of actual
tests that failed and the number of faults that caused failure
are indicated. 10 The numbers reported in the fault columns
indicate the approximate number of actual coding errors
resulting in test failures. These numbers are only approx­
imate. We tried, as best we could, to glean this information
from the results of tests. More accurate numbers can only be
obtained by examining the compiler's source.
4.4.1 Standard Procedure Calls
Internal conformance errors were found in two versions of
gcc. gcc 1.38 failed 24 tests that focus on passing structures
in registers. Structures between nine and 12 bytes in size
(three words) are not properly passed starting in the second
argument register. gcc 2.4.5 fails a single test. The fault
occurs with procedures with the signature:
void (struct(1), struct(1), struct(1));
gcc 2.4.5 fails to even compile a procedure with this
signature (see footnote 11). The fact that gcc 2.1 does not have this error
indicates that the error was introduced after version 2.1. This
supports our conjecture that such a method of automatic
testing is extremely useful throughout the development and
maintenance life­cycle of a compiler.
External conformance errors were more prevalent.
gcc 1.38 does not properly pass 1­byte structures in
registers. gcc 1.38 and 2.4.5 cannot pass a structure in the
third argument register when that structure is followed by
another structure. vpcc/vpo has two faults: 1) Structures are
not passed properly in registers and 2) 1 to 4­byte structures
are not passed in memory correctly if they are immediately
followed by another structure.
4.4.2 Variadic Procedure Calls
Procedures that take variable­length argument lists (var­
iadic functions) are written using one of the two standard
header files: varargs.h (for traditional C) and stdarg.h
(for ANSI C). The following paragraphs detail the results of
calling callees that are implemented using varargs/stdarg.
When running test cases that contained variadic func­
tions whose first argument was a double, we found that
none of the compilers, including the reference compiler,
properly implemented the calling convention. Version 2
releases of gcc managed to avoid this problem at the
expense of interoperability; their generated callees do not
conform to the established calling convention.
We also tested several compilers targeted to the SPARC
architecture. On the SPARC, the test suite generator
produces 12,034 tests. Using our automated testing infra­
structure, we tested three mature compilers and one
research compiler. The mature compilers were cc---the
native C compiler supplied by Sun Microsystems (Sun
WorkShop Compiler C SPARC Version 5.000), gcc (version
2.95.2), and lcc/vpo---a compiler built using the Zephyr
compiler infrastructure (lcc version 4.1 and vpo version 2.0).
The research compiler was built using the Scale compiler
infrastructure [21]. We tested a compiler included with
version 1.7, which is the third release of the compiler. We
shall refer to this compiler as scale. The results of running
both internal and external tests on the compiler set for the
SPARC are given in Table 6.
The scale compiler failed a large number of tests.
Inspection of the test report showed that the internal
failures were because scale did not handle variadic func­
tions. There appeared to be two different faults: 1) scale
threw an exception when trying to produce code for the
callee and 2) the compiler was able to generate both a caller
and a callee, but there appeared to be a mismatch between
the convention implemented by the caller and the callee.
For the additional 442 failed external compliance tests, scale
generated caller code that was incompatible with cc callee
code. In this case, scale generated incorrect caller code for
functions with signatures of the following form:
void (int, int, int, int, int, int,
struct(1, 2, 3, 5, 6, 7));
The test cases represented by these signatures test a
boundary condition: the first six arguments are passed in
registers (%o0-%o5) and the structure is placed on the stack.
Notice that both 4-byte and 8-byte structures are successfully
passed. This suggests the type of fault that has
occurred: Fetching the last bytes of structures whose sizes are not
multiples of four is implemented incorrectly. The test
failures also demonstrate a type of fault that is not likely
to be covered by a hand-generated test suite.
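The inference from the failing signatures to the suspected fault can be made explicit with a small check. The sketch below takes the failing structure sizes from the signatures above and compares them with the hypothesis that every size in the range 1 to 8 that is not a multiple of four fails. The function and variable names are hypothetical.

```python
def non_multiples_of_four(sizes):
    """Return the sizes that are not multiples of four, in order."""
    return [n for n in sizes if n % 4 != 0]


# Structure sizes (in bytes) from the failing signatures reported above.
FAILING_SIZES = [1, 2, 3, 5, 6, 7]

# Sizes exercised at this boundary: 1 through 8 bytes.
TESTED_SIZES = range(1, 9)

# The hypothesis explains the failures if it predicts exactly the failing set.
HYPOTHESIS_HOLDS = non_multiples_of_four(TESTED_SIZES) == FAILING_SIZES
```

The same kind of comparison can be applied to any candidate explanation: state it as a predicate on the signature and check that it selects exactly the failing tests.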
These results show that the state of the art in
compiler testing is inadequate. All of the compilers we
tested had undergone rigorous testing. However, hand
development of test suites is an arduous and, itself, error-prone
task. Furthermore, because these tests are target
specific, they must be revisited with each retargeting of the
compiler. In contrast, by using automatic test generators
that are target sensitive, compilers can quickly be validated
before each release. Although the ratio of failed tests to
faults (1,000:1) may seem high, these tests are generated,
run, and analyzed automatically. It is not necessary to
examine each of the failed tests. Instead, a single failed test
can be examined and the underlying fault corrected. The suite can then be run
again to determine if faults still exist (or if the fix has
introduced new faults).
Fig. 8. An example outcome.
Fig. 9. Determining conformance of n compilers.
TABLE 4
All Outcome Configurations
TABLE 5
Results of Running the MIPS Test Suite on Several Compilers
TABLE 6
Results of Running the SPARC Test Suite on Several Compilers
10. These numbers include tests of both standard procedure calls and
variadic procedure calls.
11. The error returned by gcc 2.4.5 was: gcc: Internal compiler
error: program cc1 got fatal signal 4.
5 CONCLUSIONS
Building compilers that generate correct code continues to
be a difficult problem. Current implementations of calling
sequence generators often contain errors. This comes from
the lack of a formal model and implementation mechanism
that can guarantee completeness and consistency proper­
ties. We have presented such a formal model, called P­FSAs,
for procedure­calling conventions that can ensure these
properties. A P­FSA that models a convention can be
automatically constructed from the convention's specifica­
tion. During construction, the convention can be analyzed
to determine if it is complete and consistent. The resulting
P­FSA can then be directly used as an implementation of the
convention in an application.
Although it is possible to automatically generate the
calling sequence generator using P­FSAs, some work is
required to retrofit an existing compilation system to use
them. Fortunately, it is possible to reap the benefits of
P-FSAs without any modification of the compiler. Using
automated compiler tools and testing, one can significantly
increase the robustness of any compiler. We have combined
these two techniques in a new way that further closes the
gap between actual compiler implementations and the ever­
sought­after correct compiler. By using a formal model of
procedure­calling conventions, we have designed and
implemented a technique that automatically identifies
boundary test cases for calling sequence generators and
diagnoses the nature of the fault. We then applied this
technique to measure the conformance of a number of
production­quality compilers for the MIPS and SPARC.
This system identified a total of at least 23 faults in the
tested compilers. These errors were significant enough to
cause over 6,000 different test cases to fail. Clearly, this
technique is effective at exposing and isolating faults in
calling sequence generators of mature compilers. Undoubt­
edly, it would be even more effective during the initial
development of a compilation system.
REFERENCES
[1] M.W. Bailey and J.W. Davidson, ``A Formal Model and Specifica­
tion Language for Procedure Calling Conventions,'' Proc. ACM
SIGPLAN­SIGACT Symp. Principles of Programming Languages,
pp. 298­310, Jan. 1995.
[2] S.C. Johnson and D.M. Ritchie, ``The C Language Calling
Sequence,'' Bell Labs, Year?
[3] B.W. Kernighan and D.M. Ritchie, The C Programming Language,
second ed. Prentice Hall, 1988.
[4] W.L. Johnson, J.H. Porter, S.I. Ackley, and D.T. Ross, ``Automatic
Generation of Efficient Lexical Processors Using Finite State
Techniques,'' Comm. ACM, vol. 11, no. 12, pp. 805­813, 1968.
[5] T.A. Proebsting and C.W. Fraser, ``Detecting Pipeline Structural
Hazards Quickly,'' Proc. ACM SIGPLAN­SIGACT Symp. Principles
of Programming Languages, pp. 280­286, 1994.
[6] T. Müller, ``Employing Finite Automata for Resource Scheduling,''
Proc. 26th Ann. Int'l Symp. Microarchitecture, pp. 12-20, 1993.
[7] G.H. Mealy, ``A Method for Synthesizing Sequential Circuits,'' Bell
System Technical J., vol. 35, no. 5, pp. 1045­1079, 1955.
[8] J.E. Hopcroft and J.D. Ullman, Introduction to Automata Theory,
Languages, and Computation. Addison­Wesley, 1979.
[9] F.C. Hennie, ``Fault Detecting Experiments for Sequential Cir­
cuits,'' Proc. Fifth Ann. Symp. Switching Theory and Logical Design,
pp. 95­110, Nov. 1964.
[10] Z. Kohavi, Switching and Finite Automata Theory, second ed.
McGraw­Hill, 1978.
[11] D.P. Sidhu and T.­K. Leung, ``Formal Methods for Protocol
Testing: A Detailed Study,'' IEEE Trans. Software Eng., vol. 15, no.
4, pp. 413­426, Apr. 1989.
[12] S. Fujiwara, G.v. Bochmann, F. Khendek, M. Amalou, and A.
Ghedamsi, ``Test Selection Based on Finite State Models,'' IEEE
Trans. Software Eng., vol. 17, no. 6, pp. 591-603, June 1991.
[13] A.V. Aho, A.T. Dahbura, D. Lee, and M.U. Uyar, ``An Optimiza­
tion Technique for Protocol Conformance Test Generation Based
on UIO Sequences and Rural Chinese Postman Tours,'' IEEE
Trans. Comm., vol. 39, pp. 1604­1615, Nov. 1991.
[14] M. Yannakakis and D. Lee, ``Testing Finite State Machines: Fault
Detection,'' J. Computer and System Sciences, vol. 50, pp. 209-227, 1995.
[15] R.C. Ho, C.H. Yang, M.A. Horowitz, and D.L. Dill, ``Architecture
Validation for Processors,'' Proc. Int'l Symp. Computer Architecture,
pp. 404­413, 1995.
[16] A.W. Appel, personal communication, May 1996.
[17] C.W. Fraser and D.R. Hanson, ``A Code Generation Interface for
ANSI C,'' Software---Practice and Experience, vol. 21, no. 9, pp. 963­
988, 1991.
[18] C. Fraser and D. Hanson, A Retargetable C Compiler: Design and
Implementation. Benjamin Cummings, 1995.
[19] R.M. Stallman, Using and Porting GNU CC (Version 2.0). Free
Software Foundation, Inc., Feb. 1992.
[20] M.E. Benitez and J.W. Davidson, ``A Portable Global Optimizer
and Linker,'' Proc. SIGPLAN Conf. Programming Language Design
and Implementation, pp. 329­338, July 1988.
[21] G.E. Weaver, B.D. Cahoon, J.E.B. Moss, K.S. McKinley, E.J. Wright,
and J.H. Burrill, ``The Common Language Encoding Form (CLEF)
Design Document,'' Technical Report 97­58, Dept. of Computer
Science, Univ. of Massachusetts, Amherst, Aug. 1997.
Mark W. Bailey received the BS degree in
computer and information science from the
University of Massachusetts in 1988. He received
the MCS and PhD degrees in computer science
from the University of Virginia in 1990 and 2000,
respectively. He has been a member of the
faculty of Hamilton College since 1997, where
he is an assistant professor of computer
science. His research interests include compi­
lers, optimization, embedded systems, computer
architecture, and computer security. He is author of numerous work­
shop, conference, and journal articles. He has been a member of
organizing committees of international conferences in his field. He is a
member of the IEEE, IEEE Computer Society, ACM SIGPLAN, and
SIGCSE.
Jack W. Davidson received the BAS and MS
degrees in computer science from Southern
Methodist University in 1975 and 1977, respec­
tively. He received the PhD degree in computer
science from the University of Arizona in 1981.
He has been a member of the faculty of the
University of Virginia since 1982, where he is a
professor of computer science. His main re­
search interests include compilers, code gen­
eration, optimization, embedded systems, and
computer architecture. He is the author of more than 100 research articles in
refereed conferences and journals, as well as coauthor of two
widely used introductory programming textbooks. He has been a
program/general chair or member of steering/program/organizing
committee of many international conferences in his field and he was
an associate editor of the ACM Transactions on Programming Languages
and Systems from 1994 to 2000. He is a member of the IEEE Computer
Society, ACM SIGPLAN, SIGARCH, and SIGCSE.