Random Testing of C Calling Conventions Christian Lindig Saarland University Department of Computer Science Saarbr ˜ ucken, Germany lindig@cs.uni­sb.de ABSTRACT In a C compiler, function calls are di#cult to implement correctly because they must respect a platform­specific call­ ing convention. But they are governed by a simple invariant: parameters passed to a function must be received unaltered. A violation of this invariant signals an inconsistency in a compiler. We automatically test the consistency of C com­ pilers using randomly generated programs. An inconsistency manifests itself as an assertion failure when compiling and running the generated code. The generation of programs is type­directed and can be controlled by the user with com­ posable random generators in about 100 lines of Lua. Lua is a scripting language built into our testing tool that drives program generation. Random testing is fully automatic, re­ quires no specification, yet is comparable in e#ectiveness with specification­based testing from prior work. Using this method, we uncovered 13 new bugs in mature open­source and commercial C compilers. Categories and Subject Descriptors D.2.5 [Software Engineering]: Testing and Debugging--- Testing tools General Terms Reliability, Measurement, Experimentation, Languages Keywords Random Testing, Calling Convention, C, Compiler, Compo­ sition, Consistency 1. INTRODUCTION C compilers have been around virtually forever and build­ ing them is so well understood that it is taught in com­ piler classes. The truth is function calls in C are specifically di#cult to implement correctly. The reason is that gener­ ated machine code must adhere to a strict regime of calling Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. AADEBUG'05, September 19--21, 2005, Monterey, California, USA. Copyright 2005 ACM 1­59593­050­7/05/0009 ...$5.00. conventions. These are issued by hardware vendors to en­ sure interoperability between compilers. For each supported platform a compiler must implement such a calling conven­ tion, of which typical programs only exercise a small part. This makes the implementation of function calls a source for latent compiler bugs even in mature compilers. For example, the code in Figure 1 on the following page uncovers a bug 1 in the GNU C compiler 3.3 that is part of Apple's MacOS X 10.3 operating system for the PowerPC. Function main passes the values of four global variables to variadic function callee, which checks that it indeed re­ ceives the right values. This fails for the fourth argument i: $ gcc ­O2 ­o bug bug.c $ ./bug bug:23: failed assertion `y.f == i.f' Abort trap We found several compiler bugs of this kind using randomly generated programs in the style of Figure 1. These pro­ grams are designed to test a simple invariant: a value passed to or from a function must be received unaltered. We call this property consistency of functional calls, which must be guaranteed by a compiler. Testing compilers for consistency is fully automatic since the generated programs encode the consistency tests them­ selves. Running such a program fails an embedded assertion in case of an inconsistency. Therefore, testing consistency requires no specification or test oracle and dramatically sim­ plifies testing of C compilers compared to prior work (Bailey and Davidson, 1996). Our generation of programs is type driven: a test case is constructed around the declaration of a function (Sec­ tion 2). But to be e#ective, the statistical distribution of test cases must be controlled. Claessen and Hughes (2000) proposed composable random generators to test functional Haskell programs; we applied their idea and designed com­ posable test­case generators for Lua (Ierusalimschy et al., 1996), a scripting language that drives our testing tool. We claim the following contributions: . Composable random generators provide a concise way to control the distribution of test cases. The genera­ tor for ANSI C is specified in about 100 lines of Lua (Section 3). . Random testing of calling conventions is e#ective. We found 13 new bugs in production­quality compilers on 1 Reported as GCC bug #18742. 1 #include 2 #include 3 union A {float a; double b;} 4 c = { 52.54 }; 5 struct B {double d; int e;} 6 h = { 78.01, 834 }; 7 union C {short int f; char g;} 8 i = { 68 }; 9 struct D {char j; double k;} 10 n = { 'c', 31.01 }; 11 struct E {long long l; double m;} 12 o = { 167L, 17.2 }; 13 14 union A 15 callee(struct D a, struct E b, ...) 16 { 17 va_list ap; 18 struct B x; 19 union C y; 20 va_start (ap, b); 21 x = va_arg (ap, struct B); /* 3rd */ 22 y = va_arg (ap, union C); /* 4th */ 23 assert (y.f == i.f); /* fails */ 24 va_end (ap); 25 return c; 26 } 27 int main( int argc, char **arg ) { 28 union A r; 29 r = callee (n, o, h, i); 30 return 0; 31 } Figure 1: GCC 3.3 on MacOS X 10.3 passes union C i incorrectly to variadic function callee; the asser­ tion in line 23 fails. The code was generated by our testing tool Quest and is slightly simplified for pre­ sentation. In a variadic function, extra arguments must be accessed using macro va arg, which receives the argument's type and returns its value. Unix systems (Section 4, Table 3). Finding such bugs takes typically a few minutes. . Manually coded C programs exercise only a small part of a calling convention. This explains why users (and developers) haven't tripped over the bugs we found (Section 6). We investigated random testing of calling convention with Quest, a new tool that generates programs in the style of Figure 1. We discuss prior and related work in Sections 7 and 8, and provide our conclusions in Section 9. 2. TEST GENERATION SCHEME The generation of test cases is type­driven: we generate function declarations randomly and generate a test case for each declaration. For example, the declaration char f(int, short*) is enough to generate a function f(int, short*) and a func­ tion void g(void) that calls it, both shown in Figure 2. Function g passes values to f, which checks that it receives int x = 6362; /* all random */ short *y = (short*) 6328282U; char z = 'q'; char f(int a, short* b) { assert(a == x); assert(b == y); return z; } void g(void) { char c; c = f(x,y); assert(c == z); } Figure 2: Code generation scheme for a function char f(int, short*): function g calls f and passes values of global variables, which are checked by f; likewise for the return value of f. the right values, and returns a value, which in turn is checked by g. Given a declaration, we generate for each parameter and return type a global variable that we initialize with a ran­ dom value. Function g passes these values to f, which uses assertions to check that each value it receives is indeed the one of the corresponding global variable. Likewise, to test the return value, f returns the value of a global variable, which is checked by g. Function main (not shown) finally calls g. Two compound values, like two structures, cannot be com­ pared directly. We unfold them recursively and generate a sequence of assertions that compare values component by component. This code generation scheme extends to all simple and compound C types that can be passed to a function or be returned by it: floating­point types, pointers, arrays, struc­ tures, and unions. A pointer is treated as an unsigned inte­ ger; we only compare a pointer's value but do not allocate a value to point at. The value of a compound global vari­ able like an array, structure, or union can be defined with a C initializer; this also works for a hierarchical type, like an array of structures. The current code generation scheme was designed for sim­ plicity: the expressions defining a value that is passed to and from functions exists in the source code only once. Such a value is stored as a global variable that therefore can be ini­ tialized with an initializer expression. However, a slightly more complicated code generation scheme could trigger a compiler to emit di#erent code and is therefore planned for future work: . Simple values could be passed as literals, without stor­ ing them in a global variable first. . Variables could be declared as const. . A variable for a value returned from a function could be declared and initialized locally. . To force a compiler storing a value in memory rather than in a register, a function could take a variable's address. Assertions, which are used to compare values, might be han­ dled specially by compilers and might exclude optimizations. While Quest generates them by default, a user is free to define and call his or her own macros for comparing values instead. The values we use to initialize global variables are se­ lected randomly, without special attention to extremal val­ ues. This is su#cient since we don't apply any operations to them but test them for equality only. Our Quest tool can generate variadic functions, whose best­known instance is printf(char *fmt,...): such a func­ tion takes regular named arguments like fmt, plus unnamed extra arguments. A variadic function must access the ex­ tra arguments one by one using the va arg macro from the #stdarg.h# header. This is demonstrated in the code in Figure 1. Finally, Quest can emit the called function f (the callee) and the calling function g into two di#erent files. This way they can be compiled by two di#erent compilers to test their consistency. For code generation this means that global variables and functions defined in one file must be declared extern in the other. 3. RANDOM DECLARATIONS In principle it would be enough to generate function decla­ rations (and from them programs) that use all legal ANSI­C features. But from a practical point of view it is desirable to control the statistical distribution of declarations: some compilers, like GCC, accept C programs beyond the ANSI standard, others support only a subset. Section 6 below also presents reasons why we believe that the distribution of test cases has a strong impact on their ability to uncover bugs. We therefore want distributions to be user­programmable and provide composable random generators as a solution (Claessen and Hughes, 2000). Readers not interested in the details of test­case generation and customization may safely skip this section. The composition of test generators is expressed in Lua, a scripting language embedded in Quest that drives the generation of test cases. Lua is a Pascal­like scripting lan­ guage that is designed to be embedded into applications to make them scriptable by the user (Ierusalimschy et al., 1996; Ramsey, 2004). Lua's most prominent data type is table, an associative array that is used in Lua to model lists, arrays, and modules. Quest, which is implemented in the ML di­ alect Objective Caml (Leroy et al., 2004), contains a Lua interpreter that provides random generators as Lua func­ tions. By writing small Lua functions, a user can define generators, which in turn define test cases. The sole role of Lua is to let the user control test case generation. The design of the test case generators itself is independent from Lua---we could have used any other script­ ing language, or none at all, if we had decided to provided only hard­coded generators. A function declaration, which is the base for a test case, is characterized by three components: (1) a list of argument types, (2) a list of types for values passed as extra arguments to a variadic function, and (3) a return type. We generate test cases by using three generators, one for each component of a declaration. A generator in general is a composition of simpler gener­ ators. The most basic generators provide a source for ran­ domness; they are complemented by a set of generators for C types. These follow the abstract syntax for C types cap­ tured by the grammar c in Figure 3. Together they may width ::= char | short | int | long | long long fwidth ::= float | double | long double sign ::= signed | unsigned member ::= (name, c) c ::= void | int(sign, width) | float(fwidth) | array(c, length) | pointer(c) | struct(name, member, . . . ) | union(name, member, . . . ) Figure 3: Abstract syntax for C types. Generator Generator Type C Type char c gen char long c gen long int unsigned c gen # c gen unsigned type float c gen float pointer c gen # c gen pointer type array c gen × num gen # c gen array, n mem's structure c gen × num gen # c gen struct, n mem's union c gen × num gen # c gen union, n mem's Table 1: Some generators for C types. They are available as Lua functions to compose high­level gen­ erators. be combined into generators for complex types, like a list of structures. 3.1 Type Generators To characterize generators as they are available to the user in Lua, we give ML­style types to them: a generator that produces a C type has type c gen, a generator that produces values of type # has type # gen. The most basic generators produce the same simple C type in every run. These are listed at the top of Table 1: generator char produces type char, generator long type long, and so on. A generator can take another generator as argument, for instance unsigned, which makes it a function of type c gen # c gen: unsigned(char) produces the type unsigned char from the generator for the char type. Likewise, a generator producing pointer types is obtained by applying pointer to any type generator. The flexibility gained from composing complex generators from simple generators becomes more evident with arrays: the array generator takes two generators: one for types and one for numbers. The type generator determines the ele­ ment type of the array, the number generator the length of the array. Every time the array generator is run, it runs the two generators as well. These could produce, for ex­ ample, unsigned char and 3 in the first run, and double* and 2 in the second. This leads to two di#erent array types: unsigned char[3] and double*[2]. Generator Generator Type Value generated number num # num gen pick number from 0, . . . , n - 1 choose num × num # num gen pick number from m, . . . , n list num gen × # gen # (# list) gen list with n members elements # list # # gen pick value from list oneof (# gen) list # # gen pick generator from list unit # # # gen constant generator iszero bool gen indicate end of recursion smaller (# gen) list # # gen like oneof, limits recursion bind # gen × (# # # gen) # # gen monadic bind operator Table 2: Primitive random generators; they are available as Lua functions. A generator for a structure (or union) type also takes two generators as arguments, one for the type and one for the number of members. Here however, the number generator is run first and returns how often the type generator is to be run to produce a list of member types. The type generator may return a totally di#erent type in each run, which leads to a structure (or union) type with diverse member types. 3.2 Basic Generators To build a generator whose output varies in every run we need some source of randomness. It is provided by the primitive generators shown in Table 2. Generator number provides a random number, choose a number from an interval. Given a generator for a value, list creates a generator that produces a list of values whose length depends on a number generator. Given a list of val­ ues, elements produces a generator that picks one value at random in every run. This idea is raised to the next level with oneof: it takes a list of generators and picks a generator at every run. 3.3 The ANSI C Generator The default generator for ANSI­compliant declarations in Quest is composed in about 100 lines of Lua code, from which we show a subset in Figure 4. The test­case generator is defined by ANSI.test, which returns a table---denoted by curly braces---that binds three generators for arguments, extra arguments (for variadic func­ tions), and the result type. Several simple generators for sizes and lengths are bound to names at the top of Figure 4. The generator ANSI.arg for argument types is recursive. It considers two cases: in the base case it returns an integer or float type, otherwise it selects with smaller from a list a generator that itself uses ANSI.arg. The two cases are necessary to ensure that recursion terminates. 3.4 Taming Recursion The grammar for C types in Figure 3 is recursive, which leads to recursive generators. Without care, recursive gen­ erators could fail to terminate when run. To avoid this we limit the depth of recursion and therefore the size of a type. The function that runs a generator takes two arguments: the generator, and the maximum recursion depth (which is 2 by default). The parameter for the depth is only passed be­ hind the scene between generators and does not clutter their interface; three functions cooperate to ensure termination: . Function smaller ((c gen) list # c gen) takes a list of generators and selects one in every run. In addition, ANSI.members = choose(1,3) ­­ for structs ANSI.argc = choose(1,10) ­­ argv length ANSI.vargc = choose(0,3) ­­ var args ANSI.array_size = choose(1,3) function ANSI.arg_ (issimple) if issimple then return oneof { any_int, any_float } else return smaller ­­ like oneoef (c.f. 3.4) { any_int ­­ signed/unsigned , any_float ­­ all sizes , pointer(ANSI.arg) , array(ANSI.arg,ANSI.array_size) , struct(ANSI.arg,ANSI.members) , union(ANSI.arg,ANSI.members) } end end ANSI.arg = bind(iszero,ANSI.arg_) function ANSI.test () return { args = list(ANSI.argc,ANSI.arg) , varargs = list(ANSI.vargc,ANSI.varg) , result = ANSI.result } end Figure 4: The test generator for ANSI C in Lua. The generators ANSI.result for return types and ANSI.varg for extra arguments are omitted. Curly braces denote tables. it decrements the depth before passing it to subordi­ nated generators. Hence, only recursion going through smaller counts towards maximum the depth. . Generator iszero (bool gen) generates true if the ac­ tual depth is zero. . Function bind (# gen × (# # # gen) # # gen) defines a generator and takes two arguments: a generator (like iszero) and a function. It passes the generated value to the function which returns a generator depending on the value. In our case, the generator returned by ANSI.arg depends on the depth of the recursion. The generator abstraction is implemented as a monad. This makes it easy to pass values like the depth behind the scene while maintaining full composability (Wadler, 1997). The monad also hides sources for member names as they are required for the implementation of struct and union. 4. EVALUATION To evaluate the e#ectiveness of random testing we tried to find inconsistencies in function calls translated by Unix C compilers. We focused on production­quality commercial and open­source compilers but also included two compilers still under development. In particular: . GCC, the GNU C compiler, including the experimental development version 4.0.0 (FSF, 2003). . LCC, a retargetable ANSI C compiler (Fraser and Han­ son, 1995). . TCC, a small and fast ANSI C compiler for the Linux/ x86 platform (Bellard, 2004). This compiler is under early development. . PathCC, a commercial compiler from PathScale Inc. with a special focus on performance and the genera­ tion of 64­bit code on the x86 architecture (PathScale, 2004). . MipsPro, the ANSI C compiler for SGI workstations running the IRIX operating system (SGI, 1999). . PGCC, a commercial compiler with a special focus on performance on the x86 architecture (Portland Group, 2004). . ICC, the commercial Intel C compiler for Linux on the x86 platform (Intel, 2004). We tested these compilers on a number of platforms where the selection of platforms was influenced foremost by our access to them. As a consequence, we have tested more compilers on Linux/x86 than on any other platform. Compilers GCC, ICC, and TCC support the C language beyond the ANSI standard. They support empty structures and unions, as well as arrays of length zero. When test­ ing GCC, ICC, and TCC we generated programs that used these features. All other compilers were tested with ANSI­ compliant code. To find inconsistencies, we executed a loop for 15 min­ utes that generated code using Quest, compiled it using the compiler under test, and ran it. Each generated pro­ gram contained 20 test cases, that is, pairs of a calling and a called function. The loop was left immediately when ei­ ther the compiler failed, or an inconsistency in the compiled code was found. As far as compilers supported them, we tried di#erent optimization options (none, ­O, ­O2, ­O3) but did not try options unrelated to optimization. Our setup was guided by experience from preliminary ex­ periments: most bugs show up quickly and running Quest over a long period rarely produced new bugs. Files with 20 test cases each correspond to files with about 1800 lines of code; files of this size still compile quickly even with ag­ gressive optimizations. The essence of our test procedure fits in one long line of shell commands: while true; do quest > test.c # generate test cc ­o test test.c || break # compile ./test || break # run test echo ­n . # progress indicator done The infinite loop is left when the compiler fails, or, more likely, the program test raises an assertion failure. This would leave test.c as a test case to report to the developers. Table 3 shows our results. We found bugs 2 in all com­ pilers except PGCC, and TCC. The 13 bugs that we found constitute two classes: 4 bugs that crashed a compiler and 9 bugs that showed up as inconsistencies. The three com­ piler crashes of GCC involved language extensions, in par­ ticular the usage of empty structures. It is not clear what caused PathCC to crash. We believe that the 3 crashes of GCC on Linux result from the same bug; we uncovered it with di#erent test cases in di#erent compiler versions. We could identify the bug by its distinctive error message from the register allocator. The same probably applies for one inconsistency found in GCC on MacOS X. The 9 bugs that showed up as inconsistencies involve ad­ vanced function declarations: 4 bugs showed up in variadic functions, 2 bugs involved a struct­ or union­typed param­ eter, and the bug where PathCC fails to pass a float the test function had the following declaration: union A f(double a, union B b, struct C c, float d, struct E e) Here we suspect that one of the compound parameters caused the problem. 5. CONSISTENCY ACROSS COMPILERS Two functions, one calling another, may be compiled by di#erent compilers. When both compilers adhere to a plat­ form­specific calling convention values should pass unaltered from one function to another. Quest can generate appro­ priate test cases and we used this for a small experiment. We refrained from a more systematic test because it would be a#ected by the inconsistencies we had found already. It would be di#cult to attribute an inconsistency to a partic­ ular compiler knowing that it had shown internal inconsis­ tencies before. We conducted two experiments with GCC 4.0.0, TCC 0.9. 22, and LCC 4.2 on Linux/x86. We found that GCC and TCC agreed perfectly---we did not find any inconsistencies. On the other hand, GCC 4.0.0 and LCC 4.2 showed more inconsistencies than we expected from the one bug we found in LCC. The inconsistencies between LCC and GCC could be ex­ plained with the lack of standards for calling conventions. Architecture manuals like Intel (2003) typically specify a calling convention for simple values but omit any discussion how to pass structures or unions. Here, compiler writers are left on their own or are forced to reverse­engineer existing compilers. 2 The programs that uncovered these bugs are available at http://www.cs.uni­sb.de/~lindig/quest/bugs/ Compiler and Options Platform Symptoms and Comments MipsPro 7.3.1.3m, ­O3 Irix 6.5/MIPS struct not passed correctly GCC 2.95.3, ­O2 SunOS 5.8/Sparc double as var arg not passed correctly GCC 2.95.4, ­O Linux/x86 compiler crashes, reported as bug #16819, fixed in GCC 3.4 GCC 3.2.2 ­O3 Linux/x86 same as bug#16819? GCC 3.3, none Irix 6.5/MIPS union as var arg not passed correctly, reported as bug #19268, fixed in GCC 3.4 GCC 3.3, none MacOS X 10.3 involves GCC extension, reported as bug #18742, see Figure 1 GCC 3.3.3, ­O3 Linux/x86 same as bug #16819? GCC 4.0.0, none MacOS X 10.3 same as bug #18742? GCC 4.0.0 Linux/x86 no inconsistency found LCC 4.2, none Linux/x86 double var arg not passed correctly PathCC 1.4, ­O2 ­m32 Linux/x86 float not passed correctly, bug reported, fixed in Release 2.0 PathCC 1.4, ­O2 ­m32 Linux/x86 union with struct not passed correctly, fixed in Release 2.0 PathCC 2.0, ­Ofast Linux/x86 compiler crashes with floating­point exception in ``LNO'' phase, reported as bug #5273, fixed in upcoming Release 2.1 ICC 8.1, none Linux/x86 var arg not passed correctly, reported as bug #292019 TCC 0.9.22 Linux/x86 no inconsistency found PGCC 5.2 Linux/x86 no inconsistency found Table 3: Bugs found with Quest in compilers on Unix systems. 0% 50% 100% Argument Type SPEC pointer int enum floats etc. GCC pointer int integer & char etc. Quest pointer int enum float double union struct etc. Return Type SPEC void int pointer etc. GCC void int etc. Quest void int pointer floats struct union etc. Argument Number SPEC 0 1 2 3 4 5 etc. GCC 0 1 2 3 45 6 Quest 0 1 2 3 4 5 6 7 8 9 10 Figure 5: Distribution of declarations in C programs: types in argument positions, types in return position, and length of argument lists. Shown are the distributions for 12 programs from the SPEC CPU 2000 bench­ mark, for the GCC 4.0.0 test suite gcc.c­torture, and for code generated by Quest. Declarations generated by Quest are more varied than those found in the SPEC and GCC suites. 6. NON­RANDOM C CODE Surprised by the many inconsistencies we found we looked for some explanation. For this we analyzed the function definitions of real­world programs and programs in the GCC test suite; we compared them to 200 programs generated by Quest. . As a representative collection of real­world programs we looked at the SPEC CPU 2000 benchmark suite. The SPEC benchmark suite is a standardized set of programs to evaluate the performance of a computer's processor, memory architecture, and compilers (SPEC, 1999). The suite is intended to cover a range of typical compute­intensive applications and we looked specifi­ cally at 12 programs 3 written in C. . The GCC 4.0.0 source code contains various test suites that are used during development of the compiler. We analyzed the gcc.c­torture test suite that contains 3 The SPEC CPU 2000 suite contains 3 more programs writ­ ten in C that we could not analyze using CIL: perlbmk, vortex, and equake. 1,638 (short) C files; about 5% of them we could not analyze due to syntactical problems. As a fairly coarse measure, we analyzed the programs statically using CIL, a framework for analyzing C programs (Necula et al., 2002). On Apple MacOS 10.3 we measured in particular for function definitions: . the number of arguments; . the number of variadic function definitions; . the distribution of argument types; . the distribution of return types. The SPEC benchmark, the GCC test suite, and the Quest­ generated code contain 5,566, 5,055, and 4,200 functions, re­ spectively. From these, the following fractions were variadic: 14 (or 0.3%) in SPEC, 137 (2.7%) in GCC, and a substantial 985 (23.5%) in Quest­generated code. Figure 5 summarizes the distribution of types and argu­ ment numbers for our subjects. In the SPEC benchmark suite over 80% of argument types are either int or a pointer, and over 95% of functions either return no value, an int, or a pointer. Types are similarly distributed in the GCC test suite; pointer and int dominate the argument types, and void and int the return types. Essentially no function in the SPEC or GCC suite returns or receives a structure or union. Even simple types like char or float are almost ab­ sent among return types. This is very di#erent for Quest­ generated programs: structures, unions, and floating­point types constitute a sizeable part of the distribution. The length of argument lists of programs in the SPEC and GCC test suites are heavily skewed towards functions with less than 3 arguments. There are, however, a curious 13.1% percent of functions with 6 arguments in the GCC test suite. The maximum argument length in the SPEC suite was 17, and 32 in the GCC suite. Again, the programs generated by Quest show a much more even distribution with up to 10 arguments. (There are 47.6% of functions with declaration void f(): those call the function under test and are somewhat misleading.) Currently Quest does not generate functions with more than 10 parameters. We chose this limit because we know of no calling convention that puts more than 8 parameters in registers. We thus are confident that 10 parameter suf­ fice to exercise all registers used for parameter passing, as well as some stack positions. In any case, this could be eas­ ily changed by defining a di#erent generator ANSI.argc in Figure 4. Our results are based on static analysis and don't re­ flect how often a certain function is actually used in a pro­ gram run. Theoretically, a statically rare declaration could dominate the dynamic execution. However, early exper­ iments suggest that the statically dominant types in the SPEC benchmark are even more dominant at run time. In the SPEC and the GCC test suite simple types and short argument lists are predominant. We suspect that they never execute a large part of a calling convention implementation in a compiler. Both users (good) and developers (bad) are thus unlikely to find latent bugs in the implementation of function calls. We believe that the decidedly wider spec­ trum of declarations generated by Quest is responsible for uncovering bugs in otherwise mature compilers. 7. PRIOR WORK Bailey and Davidson (1996) had previously tested the con­ formance of C function call implementations with calling conventions. Conformance with a calling convention im­ plies consistency but requires a specification to test against. Bailey and Davidson's test methodology builds upon their formalization of calling conventions (Bailey and Davidson, 1995). A calling convention is intended to ensure interoperabil­ ity between compilers and libraries from di#erent vendors. It details which registers and stack locations to use when passing parameters between functions. Calling conventions are defined in architecture manuals issued by hardware ven­ dors, which makes them platform­dependent. They are typ­ ically informal and sometimes confusing to the degree that compilers failed even on their examples, as noted by Bailey and Davidson (1996). The especially arcane C calling con­ ventions are the main di#culty in implementing C function calls correctly. Bailey and Davidson formalized calling conventions. Their model is based on automata, so­called P­FSA; an automa­ d 1 1,2 1,2, 3 1,2, 3,4 1,2, 3,4 1,2, 3,4 1,2, 3,4 1,2, 3,4 1,2, 3,4 1,2, 3,4 1,2, 3,4 c,i c,i i c,i,d d d c c c c c d,i c d,i d,i d c i c,d d d,i 0 1 2 3 4 5 6 7 0 0 0 0 i: int c: char d: double c 1,2 0 alignment allocated registers Figure 6: Automaton for a simple calling convention (Bailey and Davidson, 1995). Numbers inside nodes denote allocated registers, numbers next to them alignment. A parameter list (double, char) allocates registers 1 to 3 and no stack positions. ton for a simplistic calling convention is shown in Figure 6. They derive test cases from such an automaton to test the conformance of a compiler with the modelled calling con­ vention. An automaton models the allocation of registers and stack locations for parameter passing. A state represents a set of allocated resources, a transition is labeled with a parame­ ter type for which a resource is allocated in the next state. Since the number of pfunction arameters is unbounded, an automaton is infinite in principle. To make it finite, Bailey and Davidson track only register resources precisely. Stack locations, on the other hand, are grouped into equivalence classes based on their alignment: for example, all 8 byte­ aligned stack locations are represented by the same node. Nodes for stack locations are often reachable from each other and lead to cyclic automata. Every path in an automaton represents a declaration, each of which is a candidate for a test case. In the presence of cycles there are infinite many; looking at the finitely many acyclic paths, Bailey and Davidson found more than 10 8 of them on some RISC architectures---still too many to test. They devised a smart heuristic to extract a manageable number of test cases. Bailey and Davidson tested the conformance of compilers on the MIPS platform, whose calling convention is notori­ ous for being confusing and di#cul to implement. They found 9 bugs related to consistency and 13 bugs related to parameter passing between functions compiled by di#erent compilers. 8. RELATED WORK The test methodology of Bailey and Davidson (1996) val­ idates the conformance of a function call implementation based on test cases derived from a model of a calling conven­ tion. The validation is almost as good as a verification since the method guarantees a high degree of coverage. As its main drawback it depends on a target­specific model or for­ mal specification, which is not readily available and the rea­ son why Bailey and Davidson's methodology wasn't widely adopted. Random testing is target­independent and requires no spec­ ification. Of course, we cannot argue for exhaustiveness: by analogy, only random paths in Bailey­Davidson automaton would be used as test cases. We cannot verify the confor­ mance with a calling convention directly. But testing ex­ ternal consistency against a conforming reference compiler could detect violations of a calling convention. Overall, the main attraction of random testing is simplicity. To guarantee exhaustiveness, one could use model check­ ing (Clarke et al., 1999) instead of random generation of test cases: a model checker can enumerate all test cases. Since there are infinite many function declarations, these must be partitioned into equivalence classes that are tested. Random testing in general is surrounded by a discussion of its e#ectiveness in comparison to other methods (Duran and Ntafos, 1984). It attracted a lot of theoretical attention with the goal of measuring its e#ectiveness, or to derive upper bounds for remaining bugs after testing (Bernot et al., 1997; Chen and Yu, 1996; Tsoukalas et al., 1993). QuickCheck by Claessen and Hughes (2000) introduced the idea of composable generators as first­class values. They use composable generators to test Haskell programs. We adopted this idea to create a domain­specific extension of Lua (Ierusalimschy et al., 1996) for the generation of C dec­ larations. Celentano et al. (1980) present a grammar­driven program generator that covers the entire input language for a com­ piler. As such, it has to cope with many more context­ sensitive aspects than we do for C types. As the main dif­ ference, Quest­generated tests are self­evaluating because they test the consistency of a compiler. The correct translation of function calls is only a small part of a totally correct compiler. At least for optimizing compilers, verified correctness is still a research problem. A promising trend is proof­carrying code (Necula, 1997): in­ stead of verifying the entire compiler, verify the results of an individual translation. For this a compiler augments the code it emits with annotations that can be verified before ex­ ecuting the code. The successful verification of annotations implies certain code characteristics. Because of its smaller code base, a verifier is easier to implement correctly than an optimizing compiler. 9. CONCLUSIONS Random testing the consistency of function calls has re­ vealed 13 new bugs in mature C compilers (see Table 3). This leads us to conclusions about both this specific prop­ erty of C compilers, and our test method. Despite their long history C compilers still contain bugs in the implementation of functions calls. These are provoked by arcane C calling conventions that compilers must imple­ ment to ensure interoperability with existing code. By their nature calling conventions are platform specific and thus a compiler cannot implement a general solution---the problem is therefore unlikely to vanish. This suggests an opportunity for a framework to implement calling conventions more eas­ ily by providing common abstractions, for example for the stack frame layout (Lindig and Ramsey, 2004). Our analysis of typical C code shows that it is unlikely to trigger a remaining inconsistency in a compiler. All bugs that we found were in advanced aspects of calling conven­ tions, like structures passed by value to a variadic function. These are almost absent in real­world code, but probably also in compiler test suites. We therefore suggest that com­ piler developers adopt our tool as a complement to existing regression tests. Random testing of consistency is as e#ective as specifica­ tion­based testing, yet easier; indeed it is so easy that anyone could use it in a few lines of shell code to automatically find bugs in a compiler. The simplicity stems from a number of sources: First, testing for consistency requires no specifica­ tion. Second, composable random generators are a power­ ful device to compose highly structured test data while still controlling their statistical distribution. And third, tests are self­evaluating: running a test is equivalent to running code that the program under test generated. This suggests that random testing of consistency is especially well suited for compilers and interpreters and is the main result of this paper. An open problem is how to avoid finding the same bug again. Distinct failing test cases might be caused by the same bug, which is undecidable in general. It would be helpful to fingerprint the execution of test cases such that similar executions lead to similar fingerprints which could identify bugs. An idea for future work is to tie Quest, a compiler, and the execution of generated code into a feedback loop. This would allow for the automatic minimization of a failed test using the Delta Debugging algorithm (Zeller, 2001): after a failing test was found, the test generator would try to find a minimal failing test case. Smaller test cases make it easier for developers to locate a bug. While random testing the consistency of function calls could be done for any language, it is unlikely to be as useful as for C and C++ compilers: most languages have calling conventions that are much simpler on the machine level than the C convention. Languages that do allow to call C often use only a small part of the C convention and thus are more likely to be implemented correctly. Random testing consistency is promising whenever the consistency of something conceptually simple is guaranteed by a complex implementation. For compilers, the evalua­ tion of expressions with side e#ects could be such a case. Another one is the consistency of file systems (Yang et al., 2004) or the consistency of a heap after garbage collection. Quest is open source; the code and Quest­generated test cases are are available from http://www.st.cs.uni­sb.de/ ~lindig/src/quest/. Acknowledgements. Christopher Krauß conducted the stat­ ic analysis of SPEC benchmarks. Discussions with Stephan Neuhaus, Tom Zimmerman, and Andreas Rossberg helped to improve this paper, as well as remarks from anonymous reviewers. Flash Sheridan helped polishing the final version of this paper. References Mark W. Bailey and Jack W. Davidson. A formal model and specification language for procedure calling conventions. In Conference Record of the 22nd Annual ACM Sym­ posium on Principles of Programming Languages, pages 298--310, January 1995. Mark W. Bailey and Jack W. Davidson. Target­sensitive construction of diagnostic programs for procedure calling sequence generators. Proc. of the ACM SIGPLAN '96 Conference on Programming Language Design and Im­ plementation, in SIGPLAN Notices, 31(5):249--257, May 1996. Fabrice Bellard. TCC -- Tiny C Compiler's Unix Manual Page, November 2004. URL http://www.tinycc.org/. Release 0.9.22. Manual part as part of the source code. Gillis Bernot, Laurent Bouaziz, and Gall Gall. A theory of probabilistic functional testing. In Proc. of the 1997 International Conference on Software Engineering, pages 216--226, 1997. Augusto Celentano, Stefano Crespi­Reghizzi, Pierluigi Della Vigna, Carlo Ghezzi, G. Granata, and F. Savoretti. Com­ piler testing using a sentence generator. Software--- Practice and Experience, 10(11):897--918, November 1980. Tsong Yueh Chen and Yuen­Tak Yu. On the expected num­ ber of failures detected by subdomain testing and random testing. IEEE Transactions on Software Engineering, 22 (2):109--119, February 1996. Koen Claessen and John Hughes. QuickCheck: a lightweight tool for random testing of Haskell programs. In Proc. of the ACM Sigplan International Conference on Functional Programming (ICFP­00), volume 35.9 of ACM Sigplan Notices, pages 268--279, September 18--21 2000. Edmund Clarke, Orna Grumberg, and Doron A. Peled. Model Checking. MIT Press, Cambridge, Massachusetts ­ London, England, 1999. Joe W. Duran and Simeon C. Ntafos. An evaluation of ran­ dom testing. IEEE Transactions on Software Engineering, 10(4):438--444, July 1984. Chris W. Fraser and David R. Hanson. A Retar­ getable C Compiler: Design and Implementation. Ben­ jamin/Cummings Pub. Co., Redwood City, CA, USA, 1995. FSF. GCC Internals Manual. Free Software Foundation, 59 Temple Place ­ Suite 330, Boston, MA, 2003. URL http://gcc.gnu.org/. Roberto Ierusalimschy, Luiz Henrique de Figueiredo, and Waldemar Celes Filho. Lua --- an extensible extension language. Software --- Practice and Experience, 26(6): 635--652, 1996. URL http://www.lua.org/. Intel. IA­32 Intel Architecture Software Developers's Man­ ual, Vol. 1. Intel Corporation., P.O. Box 7641, Mt. Prospect, IL 60056, 2003. Intel. Intel C++ Compiler for Linux Systems User's Guide. Intel Corporation, 2200 Mission College Blvd., Santa Clara, CA 95052, USA, June 2004. Release 8.1, Docu­ ment Number 253254­031. Xavier Leroy, Damien Doligez, Jacques Garrigue, Didier R’emy, and J’er“ome Vouillon. The Objective Caml System 3.08: Documentation and User's Manual. INRIA, 2004. Available from http://caml.inria.fr. Christian Lindig and Norman Ramsey. Declarative com­ position of stack frames. In Evelyn Duesterwald, editor, Proc. of the 14th International Conference on Compiler Construction, number 2985 in Lecture Notes in Computer Science, pages 298--312. Springer, 2004. George C. Necula. Proof­carrying code. In Conference Record of POPL'97: The 24th ACM SIGPLAN­SIGACT Symposium on Principles of Programming Languages, pages 106--119, January 15--17, 1997. George C. Necula, Scott McPeak, Shree P. Rahul, and West­ ley Weimer. CIL: Intermediate language and tools for analysis and transformation of C programs. In Proc. of Conference on Compiler Construction (CC'02), volume 2304 of Lecture Notes in Computer Science, pages 213-- 228, 2002. PathScale. PathScale EKO Compiler Suite Release Notes. PathScale, Inc, 477 N. Mathilda Avenue, Sunnyvale, CA 94085, USA, 2004. URL http://www.pathscale.com/. Release 1.4. Portland Group. PGI User's Guide. The Portland Group Compiler Technology, 9150 SW Pioneer Ct, Suite H, Wilsonville, OR 97070, USA, June 2004. URL http: //www.pgroup.com/. Release 5.2. Norman Ramsey. Embedding an interpreted language us­ ing higher­order functions and types. Journal of Func­ tional Programming, 2004. To appear. Preliminary ver­ sion appeared on pages 6--14 of Proc. of the ACM Work­ shop on Interpreters, Virtual Machines, and Emulators, June 2003. MIPSpro ANSI C Compiler Release Notes. SGI, 1500 Crit­ tenden Lane, Mountain View, CA 94043, USA, 1999. URL http://www.sgi.com/. Release 7.3. SPEC. SPEC CPU 2000 benchmark suite. Standard Per­ formance Evaluation Corporation, December 1999. URL http://www.spec.org/. Version 1.0. Markos Z. Tsoukalas, Joe W. Duran, and Ntafos Ntafos. On some reliability estimation problems in random and parti­ tion testing. IEEE Transactions on Software Engineering, 19(7):687--697, July 1993. Philip Wadler. How to declare an imperative. ACM Com­ puting Surveys, 29(3):240--263, September 1997. Junfeng Yang, Paul Twohey, Dawson Engler, and Madanlal Musuvathi. Using model checking to find serious file sys­ tem errors. In Proceedings of 6th Symposium on Operat­ ing Systems Design and Implementation, pages 273--287. USENIX, December 2004. Andreas Zeller. Automated debugging: Are we close? Com­ puter, 34(11):26--31, November 2001.