Transparent on-the-fly data compression COS595

Matthias Blume
Dept. of Computer Science, Princeton University,
Princeton, NJ 08544

[Table of contents]

Introduction

The C library on UNIX provides functions for calling the operating system. Replacing those functions can provide a different program functionality without the need to make changes to the program text.

Replacing the functions open, creat, close, read and write along with a couple of other routines will change the file system interface. Since on today's computer systems most information are stored with a high redundancy, it seems to be useful to apply data compression algorithms to them. If the compression and uncompression is done by the file system interface itself, then it will become transparent to the programs using it.

Although a truly general solution can only be done in the file system itself, it is nevertheless possible to approximate this to a high degree within the C library. I will show a sample implementation, which does this.

System call substitutes

Since I am going to replace the functions open, read, write etc. by my own versions, I will not be able to use them to do actual input and output. Therefore it is necessary to write new versions of the original C library function. The only things to change are the names of those routines.

The following program text gives the implementation for MIPS-based machines. By using the m4 macro processor, which is available on most UNIX systems, I take advantage of the fact, that the sequences of instructions needed for any of the functions follow a common pattern.

<mips-asm.m>=
# include <sys/syscall.h>
# include <regdef.h>

        .text

        .globl _cerror

define(sc,`
        .globl _sys_$1
        .ent _sys_$1
_sys_$1:
        li      v0, SYS_$1
        syscall
        beq     a3, zero, 9f
        j       _cerror
9:$2
        j       ra
        .end')

sc(open)
sc(creat)
sc(close)
sc(read)
sc(write)
sc(dup)
sc(lseek)
sc(pipe,`       sw      v0, 0 (a0)
        sw      v1, 4 (a0)')

The Second argument of the sc macro is necessary to implement the pipe system call, because it has to store two file descriptors into locations given as the argument of the call.

Overall structure

My new versions of the file system interface functions are implemented in the file compress.c. The structure of the file can be described as follows:

<compress.c>=
<include files>
<constant definitions>
<type definitions>
<external prototypes>
<static prototypes>
<static definitions>
<initialization>
<write compressed>
<read compressed>
<other io-substitutes>
<replaced system calls>

The necessary system header files are:

<include files>= (<-U)
# include <stdlib.h>
# include <stdio.h>
# include <string.h>
# include <assert.h>
# include <fcntl.h>
# include <errno.h>

Replacing system calls

In this implementation I will give substitutes for the following functions:

The real file system interface will be driven by the external functions:

<external prototypes>= (<-U)
extern _sys_open (const char *, int);
extern _sys_creat (const char *, int);
extern int _sys_close (int);
extern int _sys_read (int, char *, int);
extern int _sys_write (int, const char *, int);
extern int _sys_dup (int fd);
extern long _sys_lseek (int fd, long offset, int whence);
extern int _sys_pipe (int [2]);

and the replaced system call functions are:

<replaced system calls>= (<-U)
<replaced creat>
<replaced open>
<replaced close>
<replaced read>
<replaced write>
<replaced dup>
<replaced tell>
<replaced lseek>
<replaced pipe>

Data compression and uncompression will be done by the Lempel-Ziv algorithm. It is necessary to maintain several independent compression- or uncompression-``engines'', because there can be many files open. There is no fixed relationship between the number of characters read from or written to the real file system and the number of characters seen by the program. This means, both compression and uncompression must be able to stop ``in the middle of the operation''. All the relevant variables, which constitute the ``state'' of the ``engine'' have to be saved in a data structure, which in turn has to be associated with the file descriptor.

The existence of the dup system call introduces some further requirements on the implementation. Basically this means, that the same ``engine'' can be associated with more than one file descriptor.

Currently I restrict myself to at most MAXFILES open files. This can easily be changed by using getdtablesize to find out the maximum number of open files allowed by the operating system.

<constant definitions>= (<-U) [D->]
# define MAXFILES 256

The states of ``engines'' are stored in structures of type struct cfd.

<type definitions>= (<-U) [D->]
typedef struct cfd *cfd;

struct cfd {
  struct methods *methods;
  int nbits;
  int shared;
  <other cfd members>
};
Defines cfd (links are to index).

One crucial idea to deal with the complexity of the problem is to adopt some ``object-oriented'' techniques. I use the member methods to point to a collection of function pointers. Depending on whether a file is read or written, compressed or plain, I need different algorithms to access the file. Using the table of methods allows to do this without complicated if-then-else chains all over the place.

<type definitions>+= (<-U) [<-D->]
struct methods {
  int (* read) (int, unsigned char *, int);
  int (* write) (int, unsigned char *, int);
  int (* close) (int);
  long (* seek) (int, long, int);
};

There will be a global file table filetab, indexed by file descriptors, which contains pointers to structures of type struct cfd. The contents of this table at a given position depends on the mode of operation used with the file descriptor under question. Possible modes are:

Descriptors for plain files and unknown file descriptors are passed directly to the real system calls. Therefore, these two modes are represented the same: by a NULL pointer in the corresponding position of the file table.

We need two different method tables for reading or writing in compressed mode. A third table is necessary to deal with the remaining modes of operation. A compressed file is recognized by the first three characters in the file, which are known as the compress_prefix.

<static definitions>= (<-U)
<file table>
<method tables>
<compress prefix>

<compress prefix>= (<-U)
static unsigned char compress_prefix [] = { 0x1f, 0x9d, 0x10 };
Defines compress_prefix (links are to index).

<file table>= (<-U)
static cfd filetab [MAXFILES] = { NULL, };
Defines cfd (links are to index).

The three collections of methods are:

<method tables>= (<-U)
<write-compressed table>
<read-compressed table>
<initial read table>

For writing:

<write-compressed table>= (<-U)
static struct methods cw_m = {
  refuse_io,
  write_compressed,
  close_compressed_write,
  refuse_seek,
};
Defines cw_m (links are to index).

For reading:

<read-compressed table>= (<-U)
static struct methods cr_m = {
  read_compressed,
  refuse_io,
  close_compressed_read,
  refuse_seek,
};
Defines cr_m (links are to index).

For deciding, whether to read a compressed or a plain file:

<initial read table>= (<-U)
static struct methods ir_m = {
  read_prefix,
  refuse_io,
  close_prefix_read,
  prefix_seek,
};
Defines ir_m (links are to index).

To define the above tables, I need the following prototype definitions:

<static prototypes>= (<-U)
static int read_compressed (int fd, unsigned char *buf, int n);
static int read_prefix (int fd, unsigned char *buf, int n);
static int write_compressed (int fd, unsigned char *buf, int n);
static int refuse_io (int fd, unsigned char *buf, int n);

static int close_compressed_read (int fd);
static int close_prefix_read (int fd);
static int close_compressed_write (int fd);

static long refuse_seek (int fd, long offset, int whence);
static long prefix_seek (int fd, long offset, int whence);

In order not to confuse other programs started by a combination of fork and exec, I always use the real system call along with the replaced one. Therefore, the indices into filetab are real file descriptors provided by the operating system.

creat always sets the mode of operation to ``write with compression'' by using the methods cw_m.

As a general rule, I set the member nbits to zero when opening a file. This signals, that the data structures have not been fully initialized yet.

The member shared counts the number of filetab entries, which point to the same data structure. The routines for closing a file will use this to determine, whether the last connection to the file will be closed. Only in this case I can perform cleanup operations like freeing the data structures associated with the file.

<replaced creat>= (<-U)
int creat (const char *path, int mode)
{
  int fd;

  init ();

  fd = _sys_creat (path, mode);

  if (fd < 0 || fd > MAXFILES ||
      (filetab [fd] = malloc (sizeof (struct cfd))) == NULL)
    return fd;

  filetab [fd]->nbits = 0;
  filetab [fd]->methods = &cw_m;

  filetab [fd]->shared = 1;

  return fd;
}
Defines creat (links are to index).

Opening a file for writing using open assumes, that the file already exists. Therefore it is not useful to write with compression, because the compressed data will interfere with what has already been in the file. As a consequence I leave the entry in filetab unchanged.

It turns out, that new versions of the C library don't use creat, but call open with some additional flags and parameters as specified by POSIX. This means, that I have to simulate the desired behavior by calling creat from within open if necessary.

Opening a file for reading is the most complex case. At some time I need to read the first few characters of the file to decide, whether the file is compressed or not. The most natural place to do this seems to be the open routine. Unfortunately, this would violate the semantics of open. (Imagine opening a terminal file for reading!) The decision has to be delayed until the first call to read will be executed. open sets the mode of operation to ``unclear, whether to use decompression'' by using the method table ir_m.

<replaced open>= (<-U)
int open (const char *path, int how, int mode)
{
  int fd;

  if (how == (O_WRONLY | O_CREAT | O_TRUNC))
    return creat (path, mode);

  init ();

  fd = _sys_open (path, how);
  if (fd < 0 || fd > MAXFILES || how != 0)
    return fd;

  if ((filetab [fd] = malloc (sizeof (struct cfd))) == NULL)
    return fd;

  filetab [fd]->nbits = 0;
  filetab [fd]->methods = &ir_m;
  filetab [fd]->shared = 1;

  return fd;
}
Defines open (links are to index).

Most of the remaining substitutes for system call functions follow a common pattern:

<replaced close>= (<-U)
int close (int fd)
{
  return (fd < 0  || fd > MAXFILES || filetab [fd] == NULL)
    ? _sys_close (fd)
    : (* filetab [fd]->methods->close) (fd);
}
Defines close (links are to index).

<replaced read>= (<-U)
int read (int fd, char *buf, int n)
{
  return (fd < 0  || fd > MAXFILES || filetab [fd] == NULL)
    ? _sys_read (fd, buf, n)
    : (* filetab [fd]->methods->read) (fd, (unsigned char *) buf, n);
}
Defines read (links are to index).

<replaced write>= (<-U)
int write (int fd, const char *buf, int n)
{
  return (fd < 0  || fd > MAXFILES || filetab [fd] == NULL)
    ? _sys_write (fd, buf, n)
    : (* filetab [fd]->methods->write) (fd, (unsigned char *) buf, n);
}
Defines write (links are to index).

dup simply copies, what is in filetab at a given place to another place. The cfd-member shared must be incremented.

<replaced dup>= (<-U)
int dup (int fd)
{
  int res = _sys_dup (fd);

  if (fd < 0  || fd > MAXFILES || filetab [fd] == NULL)
    return res;
  assert (res < MAXFILES);
  assert (filetab [res] == NULL);
  filetab [res] = filetab [fd];
  filetab [res]->shared++;
  return res;
}
Defines dup (links are to index).

lseek, again, follows the general pattern. There is one minor variation: if the arguments indicate, that no actual repositioning is asked for, tell gets called.

<replaced lseek>= (<-U)
long lseek (int fd, long offset, int whence)
{
  if (offset == 0 && whence == 1)
    return tell (fd);
  return (fd < 0 || fd > MAXFILES || filetab [fd] == NULL)
    ? _sys_lseek (fd, offset, whence)
    : (* filetab [fd]->methods->seek) (fd, offset, whence);
}
Defines lseek (links are to index).

tell is not a real system call. I simulate it using lseek if necessary. For files in compressed mode I keep track of the file position myself. Note, that filepos is not initialized until the first read from or write to the file has been executed.

<other cfd members>= (<-U) [D->]
long filepos;
Defines filepos (links are to index).

<replaced tell>= (<-U)
long tell (int fd)
{
  return (fd < 0 || fd > MAXFILES || filetab [fd] == NULL)
    ? _sys_lseek (fd, 0L, 1)
    : (filetab [fd]->nbits < MINBITS)
      ? 0
      : filetab [fd]->filepos;
}
Defines tell (links are to index).

Since pipe creates two file descriptors, I have to deal with two entries in filetab. Writing to the pipe will be performed in compressed mode. This seems to imply, that reading has to use compressed mode as well, but this is not the case. The most common case of using pipes is in the context of fork and exec. It is very likely, that the pipe will be written by another program, and I have to check, whether this program uses compression or not.

<replaced pipe>= (<-U)
int pipe (int fd [2])
{
  cfd p0, p1;

  init ();

  if (_sys_pipe (fd) < 0)
    return -1;

  if (fd [0] > MAXFILES || fd [1] > MAXFILES)
    return 0;
  p0 = malloc (sizeof (struct cfd));
  if (p0 == NULL)
    return 0;
  p1 = malloc (sizeof (struct cfd));
  if (p1 == NULL) {
    free (p0);
    return 0;
  }

  p0->nbits = p1->nbits = 0;
  p0->methods = &ir_m;
  p1->methods = &cw_m;
  p0->shared = p1->shared = 1;

  filetab [fd [0]] = p0;
  filetab [fd [1]] = p1;

  return 0;
}
Defines pipe (links are to index).

Initialization

The library function atexit provides a way to register a function, which will be called, when the program exits (i.e. when it calls exit). I use this to register a function, which closes all the open files found in filetab.

As you might have noticed, init will be called from any of the functions, which create non-NULL entries in filetab.

<initialization>= (<-U) [D->]
static void cleanup (void)
{
  int i;

  for (i = 0; i < MAXFILES; i++)
    if (filetab [i] != NULL)
      (* filetab [i]->methods->close) (i);
}
Defines cleanup (links are to index).

The registration of cleanup will be done exactly once.

<initialization>+= (<-U) [<-D]
static void init (void)
{
  static int initialized = 0;

  if (initialized == 0) {
    atexit (cleanup);
    initialized = 1;
  }
}
Defines init (links are to index).

Compression

Compression employs the adaptive Lempel-Ziv algorithm. Tables are constructed as data are written. Every sequence of characters ever seen by the algorithm (which uses a greedy heuristic to construct these sequences) is associated with a unique code. The cfd-member nextcode always holds the next available code to be associated with the next sequence. Because the algorithm writes data, which are not always aligned to byte-boundaries, I have to use a buffer, the size of which is a multiple of the current code size nbits and a multiple of eight. Since the maximum code size is sixteen, a buffer of at most 16 bytes is required.

<constant definitions>+= (<-U) [<-D->]
# define TABSIZE 8192
# define MAXBITS 16
# define MINBITS 9
# define FIRSTCODE 256

I use struct cfd for both compress and uncompress. Some of the members in struct cfd are only used for either compression or uncompression, and not for both. In order to save some space, these members are placed into a union.

<other cfd members>+= (<-U) [<-D]
unsigned long nextcode;
unsigned char buf [MAXBITS];
int bitpos;
union {
  struct {
    <compress-only members>
  } c;
  struct {
    <uncompress-only members>
  } d;
} u;
Defines bitpos, buf, nextcode, u (links are to index).

lastcode holds the code of the character sequence seen so far. codes is a hashtable, which is used to describe the mapping of strings to codes.

<compress-only members>= (<-U)
struct centry **codes;
unsigned long lastcode;
Defines codes, lastcode (links are to index).

The hashtable used by this algorithm has fixed size and uses chaining to deal with collisions. The data structure for the chaining is described by:

<type definitions>+= (<-U) [<-D]
struct centry {
  unsigned short w;
  unsigned char c;
  unsigned short code;
  struct centry *next;
};

Here, w is the code for the string without the last character, c is the last character, code is the code for this sequence and next holds the next entry of the chain.

<write compressed>= (<-U)
<hashtable management>
<writing bits>
<writing character arrays>
<finish writing>

This is the hashtable management:

<hashtable management>= (<-U)
<hash function>
<hashtable lookup>

<hash function>= (<-U)
# define hash(x,y) (((x)<<8|(y))%TABSIZE)

There is not very much to say about hashtable lookup. The only important thing to note is, that I use a ``move-to-front'' heuristic to speed things up.

<hashtable lookup>= (<-U)
static struct centry *lookup (cfd fd, unsigned char c)
{
  unsigned lc = fd->u.c.lastcode;
  struct centry **start =
    &fd->u.c.codes [hash(lc, c)];
  struct centry **cur = start;
  struct centry *tmp;

  while (*cur != NULL && ((*cur)->w != lc || (*cur)->c != c))
    cur = & (*cur)->next;
  if (*cur == NULL)
    return NULL;
  else {
    tmp = *cur;
    *cur = tmp->next;
    tmp->next = *start;
    *start = tmp;
    return tmp;
  }
}
Defines lookup (links are to index).

In order to write a number of bits it is necessary to use the buf member of the cfd structure, because I cannot write fractions of a byte. It is just a matter of shifting and masking bits correctly...

I give the description of invmask here, although it is used only later for reading bits.

<writing bits>= (<-U) [D->]
# define mask(x,n) ((x) & (~(~0 << (n))))
# define invmask(x,n) ((x) & (~0 << (n)))

<writing bits>+= (<-U) [<-D]
static int output (cfd fd, int ifd)
{
  unsigned char *byte = fd->buf + fd->bitpos / 8;
  int bit = fd->bitpos % 8;
  unsigned code = fd->u.c.lastcode;

  *byte = mask (*byte, bit) | code << bit;
  byte++;
  code >>= 8 - bit;
  if (fd->nbits + bit > 16) {
    *byte++ = code;
    code >>= 8;
  }
  *byte = code;
  fd->bitpos += fd->nbits;
  if (fd->bitpos == 8 * fd->nbits) {
    if (_sys_write (ifd, (char *) fd->buf, fd->nbits) < 0)
      return -1;
    fd->bitpos = 0;
  }
  return fd->bitpos;
}
Defines output (links are to index).

To be able to write arbitrary arrays of characters I need to suspend compression not after a certain amount of characters written, but after any amount of characters seen. This means, that character sequences, which are collapsed into one code, may extend across multiple calls to write.

write checks first, whether this is the very first call to write for this file and initializes the data structures. Remember, that creat and pipe set nbits to zero to indicate this situation.

After using all possible codes no further entries to the hashtable are made---write has to live with what is in the table.

<writing character arrays>= (<-U)
static int write_compressed (int ifd, unsigned char *buf, int n)
{
  cfd fd = filetab [ifd];
  int i, h;
  struct centry *tmp;
  unsigned char c;

  if (n == 0)
    return 0;

  <write initialization>

  while (i-- > 0) {
    c = *buf++;

    if ((tmp = lookup (fd, c)) == NULL) {

      <output code for prefix>
      <add code to table if necessary>

      fd->nextcode++;
      fd->u.c.lastcode = c;

    } else
      fd->u.c.lastcode = tmp->code;
  }

  fd->filepos += n;

  return n;
}
Defines write_compressed (links are to index).

A value of zero in nbits indicates, that the data structures have to be initialized.

<write initialization>= (U->)
if (fd->nbits == 0) {
  if (_sys_write (ifd, (char *) compress_prefix,
                  sizeof (compress_prefix)) < 0)
    return -1;
  fd->nbits = MINBITS;
  fd->nextcode = FIRSTCODE;
  fd->bitpos = 0;
  fd->u.c.codes = malloc (TABSIZE * sizeof (struct centry *));
  if (fd->u.c.codes == NULL) {
    errno = ENOMEM;
    return -1;
  }
  for (i = 0; i < TABSIZE; i++)
    fd->u.c.codes [i] = NULL;
  fd->filepos = 0;
  fd->u.c.lastcode = *buf++;
  i = n - 1;
} else
  i = n;

<output code for prefix>= (U->)
if (output (fd, ifd) < 0)
  return -1;

As long as not all the possible codes have been used, codes for new sequences have to be introduces.

<add code to table if necessary>= (U->)
if (fd->nextcode < (1L << MAXBITS)) {
  if ((tmp = malloc (sizeof (struct centry))) == NULL) {
    errno = ENOMEM;
    return -1;
  }
  tmp->w = fd->u.c.lastcode;
  tmp->c = c;
  tmp->code = fd->nextcode;

  if (fd->nextcode == (1L << fd->nbits)) {
    if (fd->bitpos > 0) {
      if (_sys_write (ifd, (char *) fd->buf, fd->nbits) < 0)
        return -1;
      fd->bitpos = 0;
    }
    fd->nbits++;
  }

  h = hash (fd->u.c.lastcode, c);
  tmp->next = fd->u.c.codes [h];
  fd->u.c.codes [h] = tmp;
}

An important thing to note, is that I cannot simply close a file using the operating system call. It may be the case (in fact, it is always the case) that there is still some accumulated code in lastcode that wants to be written out. I have to make sure, that I only really close the file, if the last reference to this file is going to be abandoned.

Unlike during a switch from nbits to nbits+1, where I always flush the entire buffer buf (up to nbits bytes), I write only those parts of buf which really contain written bits when closing the file. This provides the necessary information about the end of the file to the uncompression algorithm.

<finish writing>= (<-U)
static int close_compressed_write (int ifd)
{
  cfd fd = filetab [ifd];
  int res, i;
  struct centry *run, *next;

  if (--fd->shared == 0 && fd->nbits > 0) {
    for (i = 0; i < TABSIZE; i++)
      if ((run = fd->u.c.codes [i]) != NULL)
        do {
          next = run->next;
          free (run);
          run = next;
        } while (run != NULL);
    free (fd->u.c.codes);
    if (output (fd, ifd) < 0)
      return -1;
    if (fd->bitpos > 0 &&
        _sys_write (ifd,
                    (char *) fd->buf,
                    (fd->bitpos + 7) / 8)
        < 0)
      return -1;
  }
  res = _sys_close (ifd);
  if (fd->shared == 0)
    free (fd);
  filetab [ifd] = NULL;
  return res;
}
Defines close_compressed_write (links are to index).

Uncompression

The algorithm to uncompress compressed files looks a little bit more complicated. First, I need a stack (described by the members stack, stacktop and stacksize) to reverse the sequence of characters, which I obtain from a code. The stack is realized as a rubber-band array, which is automatically expanded when necessary. Furthermore, buflen keeps track of the number of characters, which have actually been read from the file system---fewer characters than nbits indicate the end of the file. oldcode holds the last code that has been read and finchar is the final character produced from the last code. This is necessary to deal correctly with the ``AwAwA''-phenomenon, where a code can be read, which is not in the table yet.

The ``hashtable'' htab is a rubber-band array which contains for each code the associated prefix (i.e. the code for the string without the last character) together with that last character. Since entries are made in a sequential order, it is not necessary to use a hash function. The entry for a code k is at position k-256, because the first codes 0-255 stand for themselves and don't need to be stored into the table.

Here are the missing members of struct cfd:

<uncompress-only members>= (<-U)
long tabsize;
struct { unsigned w; unsigned char c; } *htab;
unsigned stacksize;
unsigned stacktop;
unsigned char *stack;
int buflen;
unsigned short oldcode;
unsigned char finchar;
Defines buflen, finchar, oldcode, stack, stacksize, stacktop, tabsize (links are to index).

Uncompression is split into the following tasks:

<read compressed>= (<-U)
<stack management>
<reading bits>
<reading the first few bytes>
<reading compressed files>
<finish reading>

The stack management maintains a rubber band array. The array can only expand, therefore ``popping'' items from the stack can be ``in-lined''. push is more complicated and gets its own function:

<constant definitions>+= (<-U) [<-D]
# define STACKGROWTH 64

<stack management>= (<-U)
static int push (cfd fd, unsigned char c)
{
  if (fd->u.d.stacktop >= fd->u.d.stacksize) {
    fd->u.d.stacksize += STACKGROWTH;
    if ((fd->u.d.stack =
         realloc (fd->u.d.stack, fd->u.d.stacksize))
         == NULL) {
      errno = ENOMEM;
      return -1;
    }
  }
  fd->u.d.stack [fd->u.d.stacktop++] = c;
  return 0;
}
Defines push (links are to index).

Reading the bits into the buffer is a little bit more trickier than writing. Consider a pipe: If the pipe contains fewer characters than required, then only those bytes are delivered. The system call blocks for empty pipes only. Therefore refill_buffer repeats the call to _sys_read until either the buffer is completely filled or _sys_read signals end-of-file or an error condition.

<reading bits>= (<-U) [D->]
static int refill_buffer (cfd fd, int ifd)
{
  int n, r;

  n = 0;
  r = 0;
  while (n < fd->nbits &&
         (r = _sys_read (ifd,
                         (char *) (fd->buf + n),
                         fd->nbits - n))
         > 0)
    n += r;
  if (r < 0)
    return -1;
  fd->u.d.buflen = 8 * n;
  fd->bitpos = 0;
  return 0;
} 
Defines refill_buffer (links are to index).

The input-function is very much like output, except that it has to return an end-of-file condition when the end of the file has been reached. I use -1 to signal the end of the file and -2 to signal an error. buflen is used to describe, what is really in the buffer (the number of bits).

<reading bits>+= (<-U) [<-D]
static int input (cfd fd, int ifd)
{
  unsigned char *byte;
  int bit;
  unsigned code;

  if (fd->bitpos + fd->nbits > fd->u.d.buflen) {
    if (refill_buffer (fd, ifd))
      return -2;
    if (fd->u.d.buflen == 0)
      return -1;
  }
  byte = fd->buf + fd->bitpos / 8;
  bit = fd->bitpos % 8;
  code = invmask (*byte, bit) >> bit;
  byte++;
  bit = 8 - bit;
  if (fd->nbits - bit >= 8) {
    code |= *byte++ << bit;
    bit += 8;
  }
  code |= mask (*byte, fd->nbits - bit) << bit;
  fd->bitpos += fd->nbits;

  return code;
}
Defines input (links are to index).

There is still the problem, that read has to check first, whether the contents of the file is compressed or not. If read detects, that the file is not compressed, than it has to arrange for all file descriptors associated with the file, that it is read using the plain operating system call. This is done using the function mark_uncompressed.

<reading the first few bytes>= (<-U) [D->]
static void mark_uncompressed (cfd fd)
{
  int i;

  for (i = 0; i < MAXFILES; i++)
    if (filetab [i] == fd)
      filetab [i] = NULL;
  free (fd);
}
Defines mark_uncompressed (links are to index).

The first attempt to read a file will be used to check, whether the file is compressed. To do this, I try to read the first three characters in the file and compare them with compress_prefix. If I don't get three characters or if those characters do not coincide with compress_prefix, then the file is considered to be not compressed. Otherwise I simply change the method table to be cr_m and call read_compressed.

In the case that the file is not compressed, the characters read are part of the data and have to be placed into the buffer, which is the argument to read. After doing so, the file has to be marked being uncompressed. If read has asked for more than three characters, then _sys_read will try to get those.

Difficulties arise, if read had asked for fewer characters than received by the first _sys_read. These characters have to be kept for further calls to read. (To reset the file pointer to the beginning might be impossible, because the file can be a terminal device or a pipe.) I set nbits to 1 to indicate, that the file has already proven to be in uncompressed state. In this case read_prefix fetches the characters from the buffer instead of reading them from the file system. When all the characters are ``eaten up'' I mark the file as being uncompressed.

<reading the first few bytes>+= (<-U) [<-D]
static int read_prefix (int ifd, unsigned char *buf, int n)
{
  cfd fd = filetab [ifd];
  int l;

  if (n == 0)
    return 0;

  if (fd->nbits == 0) {
    if ((l = _sys_read (ifd,
                        (char *) fd->buf,
                        sizeof compress_prefix))
        < 0)
      return -1;
    if (l == sizeof (compress_prefix) &&
        memcmp (fd->buf,
                compress_prefix,
                sizeof compress_prefix)
        == 0) {
      fd->methods = &cr_m;
      return read_compressed (ifd, buf, n);
    }
    fd->nbits = 1;
    fd->filepos = 0;
    fd->bitpos = l;
  }

  if (n < fd->bitpos - fd->filepos) {
    memcpy (buf, fd->buf + fd->filepos, n);
    fd->filepos += n;
    return n;
  }

  memcpy (buf, fd->buf + fd->filepos, fd->bitpos - fd->filepos);
  if (n > fd->bitpos - fd->filepos) {
    l = _sys_read (ifd, (char *) (buf + fd->bitpos - fd->filepos),
                   n - (fd->bitpos - fd->filepos));
    if (l < 0)
      l = 0;
    n = l + fd->bitpos - fd->filepos;
  }

  mark_uncompressed (fd);

  return n;
}
Defines read_prefix (links are to index).

This is the code for reading compressed files. Note, that the function initialized all the relevant data structures if nbits equals zero. Later it reads codes, constructs the corresponding character strings using the stack and places those characters into the buffer. Usually there will remain some characters, which have to be kept until the next call to read. The table of codes is again constructed as the algorithm goes. The uncompress algorithm lags always one step behind, so it may happen, that a code is not yet in the table. In this case, the sequence of characters can be reconstructed using finchar and oldcode.

<reading compressed files>= (<-U)
static int read_compressed (int ifd, unsigned char *buf, int n)
{
  cfd fd = filetab [ifd];
  int i = n;
  unsigned incode, c;
  int cin;

  <read initialization>

  for (;;) {
    <empty the stack>
    <return or read next code>

    c = incode = cin;

    <special AwAwA case handling>
    <analyse code and put bytes onto stack>
    <enter code to table htab>

    fd->u.d.oldcode = incode;
  }
}
Defines read_compressed (links are to index).

The first time read gets called for a compressed file, the following code will be executed:

<read initialization>= (<-U)
if (fd->nbits == 0) {
  if (n == 0)
    return 0;

  fd->nbits = MINBITS;
  fd->nextcode = FIRSTCODE;
  fd->bitpos = 8 * MINBITS;
  fd->filepos = 0;
  fd->u.d.tabsize = TABSIZE;
  if ((fd->u.d.htab =
       malloc (TABSIZE * sizeof (*fd->u.d.htab)))
      == NULL) {
    errno = ENOMEM;
    return -1;
  }
  fd->u.d.stacksize = STACKGROWTH;
  fd->u.d.stacktop = 0;
  if ((fd->u.d.stack = malloc (STACKGROWTH)) == NULL) {
    errno = ENOMEM;
    return -1;
  }
  fd->u.d.buflen = 8 * MINBITS;
  cin = input (fd, ifd);
  if (cin < 0)
    return cin == -1 ? 0 : -1;
  *buf++ = cin;
  --i;
  fd->u.d.oldcode = cin;
  fd->u.d.finchar = cin;
}

First, read has to empty the stack:

<empty the stack>= (<-U)
while (fd->u.d.stacktop > 0 && i > 0) {
  *buf++ = fd->u.d.stack [--fd->u.d.stacktop];
  --i;
}

Then it can try to get another code, if necessary:

<return or read next code>= (<-U)
if (i == 0 || (cin = input (fd, ifd)) == -1) {
  fd->filepos += n - i;
  return n - i;
}
    
if (cin < -1)
  return -1;

It may happen, that a code is not in the table yet. oldcode and finchar contain enough information to deduce, what has to be in the table:

<special AwAwA case handling>= (<-U)
if (c >= fd->nextcode) {
  if (c > fd->nextcode) {
    errno = EIO;
    return -1;
  }
  if (push (fd, fd->u.d.finchar))
    return -1;
  c = fd->u.d.oldcode;
}

A code is analyzed from right to left. Therefore, I need to use the stack to reverse the order of the characters:

<analyse code and put bytes onto stack>= (<-U)
while (c >= FIRSTCODE) {
  if (push (fd, fd->u.d.htab [c - FIRSTCODE].c))
    return -1;
  c = fd->u.d.htab [c - FIRSTCODE].w;
}
fd->u.d.finchar = c;
if (push (fd, c))
  return -1;

Unless all possible codes are already used, I have to insert the a new code into the table.

<enter code to table htab>= (<-U)
if (fd->nextcode < (1L << MAXBITS)) {
  if (fd->nextcode - FIRSTCODE >= fd->u.d.tabsize) {
    fd->u.d.tabsize += TABSIZE;
    if ((fd->u.d.htab =
         realloc (fd->u.d.htab,
                  fd->u.d.tabsize * sizeof (*fd->u.d.htab)))
        == NULL) {
      errno = ENOMEM;
      return -1;
    }
  }
  fd->u.d.htab [fd->nextcode - FIRSTCODE].c = c;
  fd->u.d.htab [fd->nextcode - FIRSTCODE].w = fd->u.d.oldcode;

  if (fd->nextcode == (1L << fd->nbits) - 1 &&
      fd->nbits < MAXBITS) {
    fd->nbits++;
    if (refill_buffer (fd, ifd))
      return -1;
  }

  fd->nextcode++;
}

Closing a file, which has been read from is not as difficult as closing a file which has been written to, because no buffers have to be flushed. Nevertheless, it has to take care of freeing the data structures. I have two different routines for closing, one for cr_m, the other for ir_m.

<finish reading>= (<-U) [D->]
static int close_compressed_read (int ifd)
{
  cfd fd = filetab [ifd];

  if (--fd->shared == 0) {
    if (fd->nbits > 0) {
      free (fd->u.d.htab);
      free (fd->u.d.stack);
    }
    free (fd);
  }
  filetab [ifd] = NULL;
  return _sys_close (ifd);
}
Defines close_compressed_read (links are to index).

<finish reading>+= (<-U) [<-D]
static int close_prefix_read (int ifd)
{
  cfd fd = filetab [ifd];

  if (--fd->shared == 0)
    free (fd);
  filetab [ifd] = NULL;
  return _sys_close (ifd);
}
Defines close_prefix_read (links are to index).

Miscellaneous IO-substitutes

Reading from a file, that has been opened for writing or vice versa is not allowed.

<other io-substitutes>= (<-U) [D->]
static int refuse_io (int fd, unsigned char *buf, int n)
{
  errno = EINVAL;
  return -1;
}
Defines refuse_io (links are to index).

As far as lseek is concerned, a compressed file (regardless whether read or written) behaves like a pipe, i.e. lseek returns -1.

<other io-substitutes>+= (<-U) [<-D->]
static long refuse_seek (int fd, long offset, int whence)
{
  errno = ESPIPE;
  return -1;
}
Defines refuse_seek (links are to index).

An attempt to lseek a file, that is opened for reading and has not (yet) proven to contain compressed data, will force to treat the file as a plain file.

<other io-substitutes>+= (<-U) [<-D]
static long prefix_seek (int ifd, long offset, int whence)
{
  cfd fd = filetab [ifd];

  mark_uncompressed (fd);
  return _sys_lseek (ifd, offset, whence);
}
Defines prefix_seek (links are to index).

Examples

The remainder of the text gives a collection of sample code, which provides some evidence, that the implementation is correct.

I start with a simple copy program. The program takes one or two command line arguments the first of which is the input file, while the second names the output file and defaults to the standard output.

The most interesting property of this program is, that is can be used both as a substitute for compress and as a replacement for uncompress. Further, it will also work on plain files. Consider:

<t.c>=
# include <stdio.h>
# include <errno.h>
# include <string.h>

int main (int argc, char **argv)
{
  FILE *in, *out;
  int c;

  if (argc != 3 && argc != 2) {
    fprintf (stderr, "Usage: %s infile [ outfile ]\n", argv [0]);
    exit (1);
  }

  if ((in = fopen (argv [1], "r")) == NULL) {
    fprintf (stderr, "Cannot open file %s for reading: %s\n",
             argv [1], strerror (errno));
    exit (1);
  }
  if (argc == 2)
    out = stdout;
  else if ((out = fopen (argv [2], "w")) == NULL) {
    fprintf (stderr, "Cannot open file %s for writing: %s\n",
             argv [2], strerror (errno));
    exit (1);
  }

  while ((c = getc (in)) != EOF)
    putc (c, out);

  fclose (in);
  fclose (out);

  return 0;
}
Defines main (links are to index).

The next example does the same thing while not using the standard library. Instead, it calls open, read etc. directly.

<v.c>=
# include <stdio.h>
# include <errno.h>
# include <string.h>

int main (int argc, char **argv)
{
  int ifd, ofd;
  int n;
  char buf [4096];

  if (argc != 3 && argc != 2) {
    fprintf (stderr, "Usage: %s infile [ outfile ]\n", argv [0]);
    exit (1);
  }

  if ((ifd = open (argv [1], 0)) < 0) {
    fprintf (stderr, "Cannot open file %s for reading: %s\n",
             argv [1], strerror (errno));
    exit (1);
  }
  if (argc == 2)
    ofd = 1;
  else if ((ofd = creat (argv [2], 0666)) < 0) {
    fprintf (stderr, "Cannot open file %s for writing: %s\n",
             argv [2], strerror (errno));
    exit (1);
  }

  while ((n = read (ifd, buf, 512)) > 0)
    write (ofd, buf, n);

  close (ifd);
  close (ofd);

  return 0;
}
Defines main (links are to index).

The next example uses dup and performs interleaved writes to both file descriptors. Of course, this isn't necessary, but it shows, that dup works as expected.

<u.c>=
# include <stdio.h>
# include <errno.h>
# include <string.h>

int main (int argc, char **argv)
{
  int ifd, oofd, dofd;
  int n;
  char buf [512];

  if (argc != 3 && argc != 2) {
    fprintf (stderr, "Usage: %s infile [ outfile ]\n", argv [0]);
    exit (1);
  }

  if ((ifd = open (argv [1], 0)) < 0) {
    fprintf (stderr, "Cannot open file %s for reading: %s\n",
             argv [1], strerror (errno));
    exit (1);
  }
  if (argc == 2)
    oofd = 1;
  else if ((oofd = creat (argv [2], 0666)) < 0) {
    fprintf (stderr, "Cannot open file %s for writing: %s\n",
             argv [2], strerror (errno));
    exit (1);
  }

  dofd = dup (oofd);

  while ((n = read (ifd, buf, 512)) > 0)
    write (oofd, buf, n / 2),
    write (dofd, buf + n / 2, n - n / 2);

  close (ifd);
  close (oofd);
  close (dofd);

  return 0;
}
Defines main (links are to index).

The following program is the most complex example. It shows the use of pipes in the framework of fork and exec. By two times executing fork I create a sequential arrangement of three processes, which are connected by two pipes. Pipe p1 connects the child with the parent, while p2 provides a channel from the grandchild to the child.

The child starts another program by calling either execl or execvp, depending on what command line arguments are given. The idea is to pipe compressed data to a filter and read it back. Simple ``filters'' like cat or tee don't change the data. Therefore, the parent should see, what the grandchild wrote in this case. It is worth trying to put uncompress -C into the place of the filter. (The behavior should not change, because read automatically detects, that the data in p1 are not in compressed format.)

Note, that all calls to fork and exec have to be executed before any actual input or output takes place.

<w.c>=
# include <stdio.h>
# include <assert.h>
# include <errno.h>
# include <string.h>

# define check(x) \
        ((x)<0?(fprintf(stderr,"%s(%d): (%s) < 0 (%s)\n", \
                        __FILE__, __LINE__, #x, strerror(errno)), \
                exit(1)):0)

int main (int argc, char **argv)
{
  char buf [512];
  int n;
  int p1 [2], p2 [2];
  int f;

  check (pipe (p1));

  check (f = fork ());
  if (f > 0) {
    check (close (p1 [1]));
    while (check (n = read (p1 [0], buf, 512)), n > 0)
      check (write (1, buf, n));
    putc ('\n', stderr);
    check (close (p1 [0]));
    check (wait (NULL));
    exit (0);
  } else {
    check (close (1));
    check (dup (p1 [1]));
    check (close (p1 [0]));
    check (close (p1 [1]));

    check (pipe (p2));

    check (f = fork ());

    if (f > 0) {
      check (close (p2 [1]));
      check (close (0));
      check (dup (p2 [0]));
      check (close (p2 [0]));

      if (argc == 1)
        check (execl ("/bin/cat", "cat", NULL));
      else
        check (execvp (argv [1], argv + 1));
    } else {
      check (close (p2 [0]));
      while (check (n = read (0, buf, 512)), n > 0)
        check (write (p2 [1], buf, n));
      check (close (p2 [1]));
      exit (0);
    }
  }
}
Defines main (links are to index).

A small program shows the use of pipes through the popen-interface:

<x.c>=
# include <stdio.h>

# include <assert.h>

int main (int argc, char **argv)
{
  FILE *fp;
  int c;

  assert (argc == 2);
  fp = popen (argv [1], "r");
  assert (fp != NULL);
  while ((c = getc (fp)) != EOF)
    putchar (c);
  pclose (fp);
  return 0;
}
Defines main (links are to index).

A final test case makes sure, that read_prefix works correctly, even in the case, that the number of characters asked for is less than the length of the compress_prefix. The program always reads only one character at a time. Try to use it with a plain file and remember, how read is implemented.

<y.c>=
# include <assert.h>

int main (int argc, char **argv)
{
  char c;
  int fd;

  assert (argc == 2);
  fd = open (argv [1], 0);
  assert (fd >= 0);
  while (read (fd, &c, 1) == 1)
    write (1, &c, 1);
  close (fd);
  return 0;
}
Defines main (links are to index).

Conclusions

The examples in the previous section show, that the new file system interface really hides the details of data compression, thereby providing this service in a fashion transparent to the programs that use it. However, everything works only as long as a single program has complete control over an open file. In a multi-tasking environment like UNIX, this is generally not true. This puts some severe constraints onto the usage of the new interface. The implementation presented in this paper assumes, that files opened for writing by open are not compressed. What, if the data in that file actually are compressed?

Recently, there are some efforts made to integrate data compression with file system implementations themselves. Only here one has the opportunity to know about every access to a file, and things can be synchronized properly.

Nevertheless, the given program text can be useful in many circumstances. I, for instance, tried to integrate it with VSCM, which is my implementation of Scheme. Memory dumps written by VSCM are less than half as big as before, and everything still works fine. Probably, the fact that files containing symbolic expressions are written in compressed mode as well is a little bit surprising and annoying. (This leads to the demand for a switch to turn off compression, but this would expose details of the implementation, which is what I wanted to avoid in the first place.)

Indexes

Code Chunks

Identifiers