What happens when you build a program?

Compiling C is not a single monolithic transformation from source code to executable; there are several stages along the way. The compiler driver (cc, gcc, or clang) accepts arguments for each stage and lets you stop and examine the result at any stage boundary.
  1. The C pre-processor removes comments (replacing each comment with a space) and is responsible for handling pre-processor directives (lines whose first non-blank character is #; very old compilers required the # to be in column 1). Lines with #include are replaced by the contents of the referenced file (with different search rules for names in quotes versus those in angle brackets). Names introduced with #define are systematically replaced with their definitions throughout the program, expanding as necessary in the case of macro definitions. #if and its relatives are processed. You can invoke the C pre-processor independently using the command cpp, or you can examine the result by using gcc -E.
  2. The actual compiler translates pre-processed source into assembly language. You may examine the assembly language output with gcc -S. Assembly language file names normally end with .s in Unix-like systems. (The compiler can actually skip generating assembly code, but assembly language is important to understand, and we will deal with it a lot later.)
  3. The assembler converts the assembly language source to an object (.o) file. An object file is not an executable: it may require definitions from other files, including libraries. The assembler can be run separately with the as command. You can stop the compilation process here using gcc -c and get an unlinked, relocatable object file.
  4. The linker resolves all the references in a set of object (.o) files and libraries (archive .a files and/or shared object or dynamic-link library files, .so or .dll) and produces an executable image. The linker, which used to be called the linking loader, can be run separately using the ld command. A useful debugging hint: when an error message is preceded by ld:, what follows is a link-time error (probably a missing object file or library).
Example

You need to understand the phase distinction.

The original program file, the output of the pre-processor, and the output of the compiler are all text files. The object file and executable are binary files.

When you try to run a program, the operating system creates a new process (with its attendant resources), loads the executable image into memory, and then runs the process.

There are some important gcc command line arguments you should get in the habit of using. -Wall tells gcc to print the most commonly useful warnings (despite the name, not literally all of them), and -Wextra adds more. These warnings will often help you spot a surprising number of errors that won't stop the program from compiling but will make it run incorrectly. The other argument you should always specify is -g, which tells gcc to emit debugging information that the gdb debugger can use to help you debug your program.

-Werror turns warnings into errors, i.e., the compiler will not produce object code if there are warnings. -Wfatal-errors causes the compiler to abort after the first error. I often turn this one off.

To get gcc to help you write ANSI standard C (important if you would like your code to run under various operating systems and with different compilers), you should also specify

-ansi -pedantic
for the C90 standard or
-std=c99 -pedantic
for the C99 standard (what we assume in this course).

A Common Confusion

Although, as stated above, the result of each phase of compilation can be viewed using appropriate compiler arguments, it is unusual to stop compilation after the pre-processing phase or after the assembly code has been generated. However, it is common (in fact, it is the usual routine when building practical systems) to stop after the compiler produces object (.o) files and to run the compiler command again in a separate linking step.

Beginners get confused about this because, unfortunately, we use the same shell command (gcc, for example) both for producing object files and for linking them together. Despite the same command name, the activities are different, and what is required is different in the two cases.

Suppose a program in a file called control-panel.c needs to use a linked list package and a specialized graphics package whose source is in linked-list.c and window-toolkit.c, respectively. We want to build a control-panel program, but how do we do this?

We will proceed in two phases:

  1. Convert all the source files to object files, and
  2. Link all the object files together into an executable.

In order to do the first job, we will perform the first 3 phases of compilation on each .c file individually, and we don't need the other .c files for this. The compiler only needs to know the types of any variables or functions that will be used in a particular file. The externally defined variables and functions will have their definitions elsewhere. For example, the code in control-panel.c will refer to list and graphics functions like append() and resize_window(), but these definitions will be in the other .c files. The types will be written in corresponding .h header files that are #included by control-panel.c. I.e., control-panel.c will contain lines like this:

#include "linked-list.h"
#include "window-toolkit.h"
It is important to understand that this only provides the compiler with type information so it knows how big data values returned from external functions are, how many arguments functions take, etc. This is enough to produce the object code for control-panel.c. To get the object file for control-panel.c we need to tell the compiler not to produce an executable, but to stop after program building phase 3 by using the -c compiler switch:
gcc -Wall -g -o control-panel.o -c control-panel.c
If you omit the -c, then gcc will assume you want an executable program, but when it gets to phase 4 (linking) it will find it doesn't have the actual definition of, say, append. You'll get an error about an undefined reference to append, and you'll be told that ld failed, i.e., the program could not be linked.

We will repeat this procedure for all the source files in the system we are building, and then we will have a bunch of .o files that refer to values and functions that they don't yet have access to.

The final build phase happens after all the object files are made. The executable program will need the actual definitions of externally defined items in order to run, so the object files must be linked together. That is, we need to perform phase 4 of the program building process. This time, we already have all the object files, but we need to resolve references among them. We don't need the header files any more, nor do we need the C source files.

gcc -o control-panel control-panel.o linked-list.o window-toolkit.o
      
This build process, which can get very involved, can be automated. The conventional Unix tool for managing this problem is the make program. One can also write special purpose compilation/linking scripts to do this. Many large systems are built using IDEs that contain tools for managing builds.

Fun Example

[I constructed the following example based on a bug identified by Gemma Stern on the very day this was being discussed in class. Kudos to her sharp eyes and good timing!]

The following client of an abstraction is getting a syntax error:

Can you spot the error? Even if you do, you might want to see what error messages you get when you compile the file:
% gcc -Wall -g -ansi -pedantic -Wall -g   cpp_to_rescue.c   -o cpp_to_rescue
In file included from cpp_to_rescue.c:3:
pointless_abstraction.h:5: warning: useless storage class specifier in empty declaration
pointless_abstraction.h:5: warning: data definition has no type or storage class
pointless_abstraction.h:5: warning: type defaults to 'int' in declaration of 'Pointless_T'
pointless_abstraction.h:5: warning: ISO C does not allow extra ';' outside of a function
pointless_abstraction.h:7: error: expected ')' before ';' token
% 

Now have a look at cpp_to_rescue.i.

First, just admire what cpp has wrought (this output came by running gcc -E). All the references to darwin are there because I ran gcc on a Mac.

Now scroll down to the lines just above main() — it's at the bottom. Now do you see why the error messages say what they do? (OK, you may not have known that typedef is a “storage class specifier,” but with that piece of knowledge, things should be falling into place.)

Can you see what the empty declaration is? Can you see a data definition with no type or storage class? Assuming something is an int?

Advice on program building

Questions:

  1. Can we create a program just from UArray2? What would that mean?
  2. Where does every C program (and C++) start executing?

Things to remember from the program building lecture:

Things that can go wrong:

Signs that your build process (Makefile or compile script) is broken: