Performance-tuning tips

Memory

Avoid memory leaks (malloc w/o free). You can monitor memory usage using htop; press F6 and sort by VIRT (virtual memory).

Avoid malloc when possible. Especially, instead of allocating structures on the heap, look for opportunities to use local variables of struct type to and return values of struct type.

Avoid memory traffic.

- Instead of reading or writing through pointers, use local
  variables, parameters, and results.  (May require interface
  changes or returning values of `struct` type.)

- Reduce memory traffic by taking advantage of pointer
  arithmetic.  (Example?)

Use fewer levels of indirection. Here's a representation of a 2-D array that uses two levels of indirection:

struct Array2_T {
  int width, height;
  char *elements;
};

To get to an array element 'i', you have to dereference two pointers and do address arithmetic twice (one address is static and is coded into the instruction; the other is dynamic):

a->elements[i]  // same as *((*a).elements+i)

Instead, you can use a trick built into C99: the incomplete array field:

struct Array2_T {
  int width, height;
  char elements[];
};

This representation, although it uses exactly the same notation, requires only one pointer dereference and one arithmetic operation:

 a->elements[i]  // something like *(a + offset(elements) + i)

To allocate the first kind of array:

struct Array_T *p = malloc(sizeof(*p));
p->elements = malloc(n * sizeof(*p->elements));

To allocate the second kind of array:

struct Array_T *p = malloc(sizeof(*p) + n * sizeof(*p->elements));

The advantages:

One heap object, not two
Created by one malloc(), not two
Element access with one load, not two

The disadvantage:

The array is impossible to resize without destroying its object identity.

Inlining and specialization

If you have very small functions, where the work of the function is only a few instructions, avoid the function-call overhead by declaring the function inline. (This trick works only on function calls that don't cross module boundaries. It's not worth violating modularity to save just a few instructions.)

If one are more parameters are compile-time constants, it may be worth inlining a function in order to specialize it (meaning that much of the function's work can be done once, at compile time).

Know your function-call overheads:

- The overhead of calling a large function is much smaller than
  the cost of executing the function body: if the function can't
  be specialized, there's no point in inlining it, and inlining
  will increase code size and make the instruction cache less
  effective. 

- The overhead of calling a very small function is comparable to
  the cost of executing the function body: such a function might
  be worth inlining.

- If a function can be specialized, it may be that the work of the
  function body can be reduced by a significant factor:
  specialization may make a function 5 or 10 times faster.
  Inlining is needed to enable specialization, but to know whether
  specialization looks worthwhile, you have to
  analyze the situation carefully.  If you want to specialize,
  mark the function as `inline` and  **let the compiler specialize it**.

Loops

Raise operations out of loops

- Loop-invariant computations (arithmetic, memory references, afunction
  calls, and more) can be lifted out of the loop.

- Often you know that a computation is loop-invariant, but the
  compiler doesn't know.  In these cases, **assign the results of
  loop-invariant computations to local variables**.    For
  example, if you write

     for (int i = 0; i < a->width * a->height; i++) {
        ... function call that gets passed 'a' ...
     }

  the compiler doesn't know that `a->width` and `a->height` don't
  change.
  Rewrite the loop as

      int w = a->width;
      int h = a->height;
      for (int i = 0; i < w * h; i++) { ... }

- If the compiler can identify a loop-invariant computation, **let
  the compiler deal with it.**  For example, in

      int w = a->width;
      int h = a->height;
      for (int i = 0; i < w * h; i++) { ... }

  there is nothing to be gained by fiddling with the loop
  condition: the compiler can see *all* accesses to local
  variables `w` and `h`, and it's *easy* for the compiler to hoist
  the multiplication out of the loop.  **Let the compiler do its job.**

- Calculations that are *partially* or *almost* loop-invariant can
  often be made loop invariant by refactoring the code or
  introducing a nested loop structure.