COMP 40 — Fall 2017

A Whirlwind Tour of C

Warning: This is not a comprehensive document on the C programming language, rather a first look for Tufts students who are familiar with C++.

Why are we programming in C? It's not just to boost your resumes! C continues to be the defacto systems programming language (though C++ is making inroads). It has a syntax many find convenient, yet it provides fairly low-level control over what happens in the machine. It also exposes many details of the computer, forcing you to learn more about how the computer works, which is a central mission of this course.

C was created 45 years ago as a way to write operating system code in something that, unlike assembly language, could be ported easily to new machines, but that allowed data manipulations at nearly the same low, assembly language level. In this, it has been very successful. For those of you who know C++, C will look very familiar. The basic syntax and control constructs are very much the same. This is not an accident: C++ is an extension to C. Before we get too involved, however, why don't we look at a typical program.

Anatomy of a C Program

Several things to notice:

Let's consider another example (taken from this online C tutorial).

Note the use of #define to create compile-time constants. It is conventional to use lower case for variables and UPPER CASE for compile-time constants. Tokens beginning with # are sometimes called pre-processor directives: they are handled by a separate program called the C pre-processor, cpp, that actually runs before the compiler sees your program. The pre-processor step is one of the four main steps of compilation of a C program, which we will learn about more in-depth later in the semester. For now, see below for some quick examples of what compiling a C program looks like.

Declarations

The previous examples show a variety of data and function declarations. A declaration in C defines a name, and its properties. Some declarations only introduce type names, but most introduce a name for an actual object. The properties of object names include their type (e.g., int and Function returning int), the status of their storage requirements, and their visibility.

The type of an object implies a size, or the amount of storage the object requires (and you can ask for the size of any type using the sizeof operator). The other properties are determined by a rather complicated set of rules. Data declarations can include initializations, but there are some restrictions.

Declaring a name with extern says that the compiler should not allocate any space for it because the name is defined (and its space is allocated) externally to this program file, i.e., it will come from somewhere else during program linking.

All names in C must be declared before they are used. A declaration that occurs before the definition of a function or variable (i.e., a prototype which does not allocate space or specify a function body) is called a forward declaration. This is useful for organizing input files with utility functions down at the bottom and for mutually recursive definitions.

Top-level definitions described as static are only visible in the current file: they cannot be linked to by other code modules. (They are statically allocated whether so labelled or not.) Top-level initializations must use constant (compile-time computable) data.

Normal non-top-level declarations allocate space on the stack, and are called automatic. This space is deallocated on exit from the function or program block in which the declaration occurs. Initializations (if any) are performed upon every entry to the block. Automatic variables are visible only within the block in which they are declared.

Compiling a C Program

While the act of compiling a C program has some similarities to compiling C++, there are a couple important differences.

Apart from these differences, many of the compiler flags that you might have seen in Comp 15 are the same (like -Wall, -Wextra, etc.)

Here are some quick examples of compiling some C projects. These should look similar in some ways to what it would look like for a C++ project.

To compile a single, standalone source code file called hello.c:

gcc -Wall -g -o hello_world hello.c
This will produce an executable called hello_world.

Now imagine we had two files, A.c and B.c that work together. First we compile them to object files:


gcc -Wall -g -c -o A.o A.c;
gcc -Wall -g -c -o B.o B.c;
And then we link the object files together:
gcc -Wall -g -o my_program A.o B.o
which produces an executable my_program.

C Base Types

C only supports four built-in data types:

Integers come in short, long, and unsigned flavors. Integer constants look normal, most of the time. 123L forces the integer to be long. Integer constants beginning with a 0 are interpreted as being in octal, which is base 8. Integer constants that begin with a 0x or 0X are interpreted as hexadecimal (base 16) numbers. An unsigned integer is one that is meant to take on positive variables, i.e. for a size. We'll learn much more about the difference between signed and unsigned integers later in the course.

Floating point numbers (float or double) are probably the same as whatever you are used to. We will discuss floating point numbers at the low level later in the course.

Characters are assumed to be unsigned, one-byte, integers. Character constants come between single quotes, e.g., 'a', and include a variety of escape sequences for non-printing characters like '\n' for newline. Character constants are automatically promoted to the integer value of the corresponding character in the ASCII sequence. This gives meaning to character comparisons such as ((c > 'a') && (c < 'z')). It is also relied on for various input and output operations as we'll see later. It also means that you can do arithmetic on characters: ('z' - 'a') == 25 if we're using ASCII.

You will notice some omissions: There are no strings or boolean values. There is special support for string constants, but don't expect them to work the way they do in C++! We will discuss "C strings" in greater detail later.

In C, 0 is false; all other numeric values are considered to be true. Boolean operators, like && return 0 for false and 1 for true. And if is treated as if it compared its test value against zero:

if (test) ...
is the same as
if ((test) != 0) ...

It's common to use this property of integers, integer expressions, and pointers (though the case with pointers is a bit complicated under the covers):

while (things_to_do) {
        ...
        things_to_do--;
}
Modern usage discourages this shorthand for clearer formulations:
while (things_to_do > 0) {
        ...
        things_to_do--;
}

There is no string type in C, though there are string constants. As we shall see below, a string is the same as an array of characters (bytes) terminated by the null character '\0' (in ASCII called NUL) whose value is guaranteed to be zero. The '\0' acts as a sentinel character, that is, starting at the first character in a string, adjacent bytes will be treated as part of the same string until a '\0' is in the array. For example, the string "foo" represents a sequence of four characters: 'f', 'o', 'o', '\0'. If you write a string constant, the terminating '\0' is inserted for you by the compiler. But when you manipulate strings, you need to be aware of its existence.

void

The type name void has three distinct uses.

In C, void as the return type of a function means that the function is only to be called for side effect and must not be used in a context where a return value is expected.

If a function prototype contains the argument list(void), then the function takes no arguments and the compiler will not let you supply any. This is different from an empty argument list in a function prototype which is taken to mean `I don't want to say what the arguments are, allow any number and trust me.'

void as the target of a pointer type has another meaning, which we'll discuss below.

Derived/Aggregate Data Types

C provides 4 derived/aggregate data types:

Arrays

An array is a sequence of values of a specified type:
#define MAX_PARTS 256
/* ... */
int part_numbers[MAX_PARTS];
declares a sequence of 256 integers indexed from 0 through 255.

C guarantees that array elements are stored in successive memory locations (which is important when we talk about pointers and arrays). Also, C has no built-in support for array bounds checking.

There are no multi-dimensional arrays, as such, in C: one simply uses an array with an array at each element:

double soil_samples[100][500]
is an array of 100 elements. Each of those elements is itself an array of 500 double precision floating point elements.

The following code fragment prints out the values in our soil_samples matrix with one row (of 500 elements) per line:

for (i = 0; i < 100; i++) {
        for (j = 0; j < 500; j++)
                printf("%f\t", soil_samples[i][j]);
        printf("\n");
}

Using array notation, you can declare a variable that represents a string of some maximum length like this: char pathname[MAX_PATH_LEN];

In C, the size of an array must be specified when its space is allocated, and its size cannot change. This sort of declaration for a string is therefore common when allocating a static area for a string. It is also common when creating a buffer into which characters will be placed for parsing/processing.

You can leave out an array size if you have an explicit initialization:

char hello[] = "hello";
static int backward_digits[] 
        = {9, 8, 7, 6, 5, 4, 3, 2, 1, 0}

Arrays cannot be passed to or returned from functions, though pointers to them can (more on pointers below). That is, if backward_digits is an int array as initialized above, and you pass it to a function reverse_array, reverse_array receives a pointer to the beginning of the array. As reverse_array receives a pointer to the original array, and not a copy, it can modify the array it is passed.

/* Modifies arr with length arr_len to have its elements stored in reverse order */
void reverse_array(int arr[], int arr_len) {
    int i;
    int tmp;
    for (i = 0; i < arr_len / 2; i++) {
        tmp = arr[i];
        arr[i] = arr[arr_len - i - 1];
        arr[arr_len - i - 1] = tmp;
    }
}

int main() {
    int backwards_digits[] = {9,8,7,6,5,4,3,2,1,0};
    int i;
    reverse_array(backwards_digits, 10);
    for (i = 0; i < 10; i++) {
        printf("%d\n", backwards_digits[i]);
    }
    return 0;
}

Pointers

Pointers are memory addresses. Here is the declaration of a variable whose type is pointer to an integer:
int *int_ptr;
You get the value pointed to by such a variable by dereferencing the pointer. So int_ptr is a pointer to an integer (address of an integer), and *int_ptr is the integer it points to. We can assign to pointer variables, and until you get very used to this, I recommend that you draw pictures to represent your pointer structures. For example, int_ptr = p; does not change the memory the location int_ptr was pointing to: it makes int_ptr point to some new place in memory. *int_ptr = 3; changes the contents of the memory location pointed to by int_ptr.

You can get the address of a variable by using the & operator.

int size;
int *p = &size;

Look at pointer_example.c.

Declarations of pointers are a source of confusion to students at first. The logic is that type (and optional storage class) appears first followed by a list of variable expressions. The variable expressions use the declared variable in an expression that shows how you use it to get to the declared type. Thus, declaration of int_ptr above should be understood to mean “dereferencing int_ptr (*int_ptr) will yield an int.”

There is a distinguished pointer, NULL or 0 that is guaranteed not to point to any object. So it is common to use this pointer to terminate lists or, when used as a return value of a function, to indicate failure. It is not guaranteed that the value of the null pointer is zero, only that writing a constant 0 will be compiled to the null pointer (in fact, NULL is really #defined to be 0).

NULL is a perfectly legal pointer, but it is an error to try to dereference it.

One use of arrays and pointers is important to most C programs: in the argument list. C assumes that every main() is actually a function of two arguments:

int main(int argc, char *argv[])
where argc is the number of arguments (the argument count) specified on the command line that invoked the program, and argv is an array (the argument vector) of strings, one for each input argument. argv[0] is the name of the program being invoked, and argv[1] up to argv[argc-1] are the rest. C also guarantees that argv[argc] is the NULL pointer:

The following code uses two different mechanisms (both common in practice) to print out the command line arguments passed to the program:

The type pointer to void, written (void *), has special meaning. It represents a pointer to an unknown type, and C guarantees that any pointer may be converted to a (void *) and back without any loss of information. This pointer type is typically used as a kind of abstraction device: functions can take and return pointers to an abstract data value without revealing to a client what the value actually looks like. We will discuss and use void pointers more later on in the course.

Pointers, Arrays, Strings, and Address Arithmetic

C guarantees and exposes certain low-level details about how memory and data structures are laid out. An unsubscripted use of an array name is interpreted as the address of the first array element (except that sizeof() will yield the size of the array and not the size of the first element).

Thus, passing an integer array a to a function will convert a from type array of int to type pointer to int. The formal parameter can thus be declared either as int a[] or int *a. However, array types are different from pointer types.

Array elements are in successive memory locations, so if you know the address of one element, the address of the next one is the sum of the current address and the size of an array element. Further, arithmetic on pointers, including increment and decrement, adjusts the pointer by the size of the data pointed to. (You can find out the size of a datatype by using the sizeof() operator.) The address of an element of an integer array arr with index i is therefore equivalent to arr + i which has the value arr + (i * sizeof(int)).

In the case of strings, which are sequences of characters, it is conventional to use the declaration

char *s;
to refer to a string, unless the size of the string or the buffer containing the string is an issue.

Note: Declaring an array allocates space for the array; declaring a pointer to a type only allocates space for the pointer.

int i;
char *s = "Hello!";

for (i = 0; s[i] != '\0'; i++)
        s[i] = tolower(s[i]); 
converts the string in s to lower case. (Actually, there is a subtle bug — we'll come back to that.) But so does the function str_lower() in the following program:

This program illustrates quite a few interesting things. There are a variety of ways to declare and initialize arrays of characters (strings). Interestingly, they are not all exactly the same: str_lower(oh) produces a segmentation fault on my computer, but only under certain circumstances. Why? (String literals are, in many implementations, stored in read-only storage, and they cannot therefore be modified. The C standard says that the result of modifying a string literal is undefined, which means that implementations can whatever they like. String literals are supposed to be immutable. In Comp 40 parlance, its an “unchecked runtime error” to modify one.)

This program also illustrates how a string constant can be used like an array and how array subscripting is commutative! That latter property is a consequence of the fact that C defines array subscripting in terms of address arithmetic. Given the following code:

int index = 5;
char s[] = "Halligan";
The following C expressions are all equivalent:
'g';
s[index]; // this is the preferred notation- don't use the following
*(s + (index * sizeof(char)));
*((index * sizeof(char)) + s);
index[s];

There are also two occurrences of the ternary operator for conditional expressions, which has nothing to do with arrays or pointers or strings, but is cool nonetheless.

Don't use atoi().

Structures

Structures in C are data aggregates, like arrays. However, there are two key differences: components have names rather than integer indices; and they are heterogeneous, i.e., they can contain values of different types.
struct employee {
        char  name[MAX_NAME_LEN];
        int   number;
        float salary;
};
is a structure of 3 fields: a string for the employee's name, an integer for the employee number, and a floating point number for their salary. One extracts or sets a field in a structure with dot notation:
struct employee an_employee; an_empoyee.number = 1234;
Declaring structures can seem a bit strange. The name of the structure is optional, and these names inhabit a separate namespace from other type names. If you leave out the name, then you have an anoynmous structure definition. We can use the employee structure above directly in a declaration (and we could leave out the name):
struct {
        char  name[MAX_NAME_LEN];
        int   number;
        float salary;
} an_employee;
or, given the named structure declaration above, we could declare an array of employees:
struct employee sales_personnel[300];
You might like to use the name employee as a type, but you can't. That is why you will often see declarations like the struct employee followed immediately by a type declaration
typedef struct employee employee_t ;
which allows employee_t to be used like any type name (it's a synonym for struct employee):
float give_raise(employee_t emp, float pct_incr);
You might think that, by analogy with arrays, a name of a structure (e.g., emp inside the give_raise() function above) would be equivalent to a pointer to the first location of the structure. That would be wrong. A structure has the same status as an integer or a pointer in that a name (or expression) with a structure type stands for the entire structure value. That means that when you pass a structure to a function as an argument, the bytes are copied into new storage (on the stack) for a new structure that will be manipulated by the function. An assignment of a structure to a structure variable similarly copies the structure's contents into the structure denoted by the variable on the left of the assignment.

Refer to the code in struct_test.c. This program demonstrates that a structure that contains a fixed-sized array is not considered the same as one that contains a pointer to the array's type and that structure values are copied on assignment or when passed as arguments to a function.

The fact that the array and pointer cases are different makes sense when you consider the storage allocation issues. Declaring a structure component that is an array allocates the space for the elements in the structure. This is possible, because you told the compiler how many slots there are and how big each slot is. A pointer declaration only tells the compiler there is a pointer, so only space for the pointer is allocated. An implication of this is that a pointer in a structure implies that memory will be allocated somewhere else for that component.

Programming with Structures
The Big Ideas of abstraction, modularity, and divide/conquer/glue are always applicable; in C you must work extremely hard to use them well. One can think of a structure as a way to glue data values together, but this is too simple. They are that, but you should think of them as a basis for a data abstraction. Any structure that is central to your program (used by more that one module) and is not used merely as a way to pass arguments and return results should have a set of functions for manipulating the structure defined in a .c file that can be separately compiled. Then you'll want a .h file containing the interface specification. In general, the actual structure type should be hidden, and we'll see some examples of how to do this in the class.

Enumerations

Enumerations are useful for declaring a set of named constants. They are often used in place of #defined constants. For example,
enum code_quality {good, bad, ugly} my_program;
declares the variable my_program to have the type enum code_quality. I can then test and assign any one of the enumerated names to the variable:
my_program = ugly;
C exposes the fact that the listed identifiers are actually integers (whose size is implementation-dependent), and they may be freely used in integer contexts (testing them for equality is a natural place). The above declaration therefore defines 5 names: the type enum code_quality; the integer variables good, bad, and ugly; and the variable my_program of type enum code_quality.

By default, the first enumerated name has the value 0, and each successive name is one more than the previous one. This can be changed through explicit assignment:

enum code_quality {good = 1, bad, ugly = 0} my_program;
The above code creates an enumeration data type in which good has the value 1, bad has the value 2 (if there is no assigned value, it is one more than the previous value), and ugly has the value 0.

A very common definition among C programmers is:

typedef enum {FALSE = 0, TRUE = 1} bool;
which allows us to declare a variable or return type to be bool.

No name may be defined in common between two enumerations.

Function Call and Return

Syntactically, a function call is an expression or statement in which an expression is followed by zero or more comma-separted expressions in parentheses. The easy case is when we refer to a function by name: factorial(n). Here, factorial is an identifier that, presumably, is the name given to a defined function. The value n is the argument or actual parameter.

function call
exprfun(expr1, ...)
  1. exprfun is evaluated. It must produce a function pointer (the address of the first instruction in the compiled function definition).
  2. The argument expressions expri are evaluated left-to-right.
  3. Create a new stack frame or activation record on the stack with enough space for the return address (the address of the instruction to be executed after the function completes), the argument values, the return value, and anything else the compiler needs to put there.
  4. Fill in the stack frame by copying the values of the arguments, etc.
  5. The function code is run with the new stack frame by updating the stack pointer and jumping to the first instruction of the function.
  6. If and when the function the returns, its return value is placed on the stack where the caller can extract it and then the code jumps to the return address saved during the function call.
  7. The return value (if one) is used as the value of the function and the stack frame is destroyed.

Because the values of the arugments are copied onto the stack, C is a call-by-value language. This means that variables in the function are stored in different locations than variables in the caller: no function can directly refer to the local/stack-allocated variables in another function.

In order to alter the values of the caller's variables, we need call-by-reference parameters, which we hand code in C using pointers. That is, we manually pass in pointers to the caller's variables to functions (often using the & operator). Call-by-reference parameters may also be used for both input and output, and so the same technique is used in C to for multiple value return. In the above example, one can view the reference parameters as being used for both input and output. add_and_multiply.c (shown below) illustrates the use of multiple reference parameters for output. This overcomes the restriction that in C, a function must return either no values (it's return type is void) or one value.

Assigning and casting pointers

The explanation above of pointers leaves out some important techniques that you will use often, and especially when allocating memory. Specifically, you will often need to convince C that the type of a pointer has changed. One common example mentioned briefly above is converting a pointer to or from void *, but other conversions are possible too.

There are two main ways of converting pointers: assigning to a different pointer type, and casting. The sections below describe both of these techniques.

Pointer assignment and conversion

The following code is not legal in C:

   int   *ip;
   float *fp;

   /* ...assume ip1 has been initialized ... */

   /* WON'T COMILE! The compiler assumes that if ip points
      to memory containing an integer then treating 
      that same memory as a float is a mistake */

   fp = ip1;       
  

However, and perhaps surprisingly, the following code is legal C:

   void *vp;
   int *ip1;
   int *ip2;
   float *fp;

   /* ...assume ip1 has been initialized ... */
   /* the following causes ip2 to point to the same
      integer as ip1 */
   vp = ip1; /* can assign void * from any ptr type */
   ip2 = vp; /* can assign any ptr type from void */

    /* The following also will compile and run, but
       is very unlikely to be correct! Originally
       ip1 points to some integer. Once the following
       code runs, fp points to the same memory, but
       assumes it's a float. That's almost always
       a mistake, but the compiler won't catch it! */
   fp = vp;              
 

The above shows that one can assign any pointer type to a void pointer, and vice versa. Furthermore, as shown on the last line, this allows us to make a mistake equivalent to the one that was caught in the earlier example. We've effectively assigned an int pointer to a float pointer. Very rarely there are good reasons for doing such things; usually it's an error, and the compiler won't catch it!

Passing arguments to functions follows the same rules. So:

   void takes_intp(int *ip_parm);
   void takes_voidp(void *vp_parm);
                  
   void *vp;
   int *ip1;
   int *ip2;
   float *fp;

   takes_intp(ip1);  /* works (of course!) */
   takes_intp(vp);   /* works */
   takes_intp(fp );  /* doesn't compile */

   takes_voidp(ip1);  /* works */
   takes_voidp(vp);   /* works (of course!) */
   takes_voidp(fp );  /* works */

Casting pointers

In some of the examples above, the type of a pointer was succesfully changed by an assignment or when passing to a function. Another way to change the type of a pointer is with a cast. The syntax of a cast in C involves putting the desired type in parenthesis ahead of the expression to be cast:

   (type)expression

The cast does not permanently affect the type or contents of the pointer; casting can be thought of as a unary operator (similer to - or sizeof). The cast operator takes in a pointer and returns a pointer to the same spot in memory, but with the desired type. For example:

   (int *)vp;

Here's how that might be used in a C program:

   void *vp;    
   int i = 3;
              
   vp = &i;     /* address as void * */

   printf("The integer is %d\n", *((int *)vp));

This prints:

   The integer is 3

The cast is the int * in the argument to printf. Again, the vp variable is not changed by the cast in the printf; it is just treated at that spot as having the type int * instead of void *. Thus, when it's dereferenced this way: *((int *)vp, the result is an int that can be printed.

By the way, there's no obvious reason to use void * in the code above. This simple example would work better with int * and no cast needed:

   int *ip;    
   int i = 3;
              
   vp = &i;     /* address as int * */

   printf("The integer is %d\n", *ip);

...but the first example let's us see casting in action.

Aside: Numeric casts

In this part of the tutorial we are focused mainly on pointers but it's worth noting that casts can also be used with numeric types:

This program prints:

Floating division: 1.500000   Integer division: 1
            

Memory allocation

In C++ new objects are created with the new operator. That operator allocates memory, but it does much more: it calls the constructor method on every member of the class being created, and on the class itself. As discussed above, C does not have classes at all. It has struct, but those don't have constructors or in fact any methods.

The model in C is very different: you don't construct objects, instead you allocate memory. You have to write code that runs after the allocation to do any necessary initialization, and you must manually handle any typing that is involve. This program allocates 1,000,000 bytes of memory, has a comment in place of code that might use the memory, and then frees the memory:

If you don't free the memory, then your program leaks memory. That is, it leaves memory in use. Although the operating system will tend to clean up after programs that leak, you should get in the habit of freeing everything you allocate! In COMP 40, your grade will suffer significantly if we find the your code leaks. The valgrind tool is an excellent way to find leaks. Another hint: when you are developing code, as soon as you code a malloc, code the corresponding free!

malloc returns a void *. Using the techniques should above you can easily assign that to a typed pointer (may be a char *, for example, if you are planning to store characters), or you can cast the pointer.

Often you will want to allocate memory for a structure. Here's an example of how to do it. Indeed, this code makes two allocations: both of them do the same thing, but they illustrate two different ways of specifying the size to be allocated.

Here's an example of output from that program:

   Successfully allocated 28 bytes for student 1 at address 0x166f010
   Successfully allocated 28 bytes for student 2 at address 0x166f040
Authors: Mark A. Sheldon, Charlie Meyer, Graham Goudeau
Noah Mendelsohn Modified: 9 Sept 2017