Do foreign keys have "back-pointers"?

Let's discuss the question in the context of this schema:
create table P( p_id int not null primary key, x int ); create table F( f_id int not null primary key, p_id int default null references P, y int );

Table F has a foreign key to table P, via the column F.p_id.

I take the term "back-pointer" to mean a reference or pointer in the other direction, which is certainly not explicit in the schema. I.e., in the declarations above, there is nothing inside P that refers to F.

Why are we talking about back-pointers at all?

Why might you want one?

A back-pointer is definitely not needed to be able to "go the other way" in a query. To repeat a comment from the Jan. 30 lecture, foreign keys have a direction, but joins do not. So the concept of "going the other way" doesn't really make sense. You might need to join F and P for various kinds of queries, and then you have access, in the joined table, to all the columns of both tables. The fact that the join goes from F to P instead of P to F is not that important, and only affects the SQL syntax for expressing the join. (In fact, the relational algebra expression isn't affected, assuming the columns have been named so that the join columns involve only the FK and PK columns.)

In other words, a back-pointer is logically unnecessary. At the logical level -- at the level of table definitions -- a back-pointer is useless.

Actually, a back-pointer is worse than useless. There are two cases to consider: 1) The FK represents a 1:1 relationship; and 2) the FK represents a 1:many relationship.

Suppose the FK represents a 1:1 relationship

Let's add a back-pointer and see what happens. Here is the modified schema:
create table P2( p_id int not null primary key, f_id int default null references F2, x int ); create table F2( f_id int not null primary key, p_id int default null references P2, y int );

(Ignore the fact that the highlighted code is illegal, as it refers to table F2 which has not yet been created. That is easily fixable, but using syntax not yet covered in class.)

We could maintain P2.f_id and F2.p_id so that they represent inverses of each other. This doesn't actually benefit you at all. You can join P2 and F2 using F2.p_id and P2.p_id, as before. (In relational algebra, you would need to use rename to avoid having both columns in each table considered as join columns.)

Or, you could join using P2.f_id and F2.f_id. If the back-pointer (P2.f_id) is being maintained correctly, then the join is the same either way.

Which brings us to the next point: Your applications have to be careful to maintain P2.f_id and F2.p_id to truly be inverses of one another. The update costs something, and applications have to be careful to update the two columns consistently.

To summarize, this back-pointer is a lot of trouble for no benefit at all.

Suppose the FK represents a 1:many relationship

Now suppose that we have 1:many relationship, in which P is the parent, and F is the child, (referring to the original schema). Here is some possible data:


p_id x
1 100


f_id p_id y
11 1 800
12 1 900

If we add the back-pointer as before (using the second version of the schema, with tables P2 and F2), what does P2 look like? Table P has one row, with p_id = 1. It is associated with two rows in F, also with p_id = 1, but with different values of f_id. So would P2 look like this?


p_id f_id x
1 11 100
1 12 100

That can't be right, because P2.p_id is a primary key, and we now have two rows with the same p_id value.

We are heading down the wrong path, so I'm not going to pursue this idea further. The back-pointer is not logically necessary, (i.e., if it appears in the schema), and in the case of a 1:many relationship, simply unworkable.

So why are we still talking about back-pointers?

The preceding discussion points out that back-pointers are either unnecessary or unworkable at the logical layer. However the idea of back-pointers makes a lot of sense at the physical layer.

By physical layer, I mean the set of data structures that implement tables in such a way that commonly used queries can be executed as quickly as possible.

The most important data structures used for improving database performance are indexes. An index is a mapping from some column values to a row address. The mapping is typically implemented as a search tree (similar to a binary tree), or as a hash table.

The row address is something that identifies a row, and enables retrieval of the row quickly. Consider the data of the F table from before, with rows (11, 1, 800) and (12, 1 900). When stored on disk, each row has a physical address. That could be a disk page address combined with an offset within the page. If the table is small enough, and kept in main memory, the physical address would simply be a virtual memory address. Suppose F is in main memory and looks like this:

F (in main memory)

row address f_id p_id y
0x1000 11 1 800
0x100c 12 1 900

Then an index on F.y would be:

F.y index (hash table, keyed by y)

y row address
800 0x1000
900 0x100c

The key is a y value, and a lookup yields a row address. We use the row address to locate the full F row.

Primary keys, foreign keys, and indexes

When you declare a primary key, you automatically get an index keyed by the columns of the primary key. This is true in every kind of relational database system. The reason for this is that a PK guarantees uniqueness. Every time a new row is inserted to the table, we need to check if the PK in that row is unique. Without an index, each check would require a scan of the table. If a table has n rows, then a scan takes O(n) time. And the time for all inserts would be O(n2). Not good. But with an index, (let's assume a hash index), the time to insert n rows is O(n), because each uniqueness check is just a hash lookup, which costs O(1). (Yes, I know the worst case of a hash lookup is O(n), but in practice, averaging over many keys, O(1) is a safe assumption.)

What about foreign keys? Does creating a foreign key result in the creation of an index? No, and I'm not sure why. It should at least be default behavior, because so many operations involving the FK are unacceptably slow without the index.

Consider this query: What F.y values associated with P.x = 1? which can be answered by the this relational algebra expression:
project( join( select(P, x = 1), F), [y])

Evaluation proceeds as follows:

  1. Find rows in P with x = 1. (This will be faster if we have an index on P.x.)

  2. From each qualifying P row, take the p_id value.

  3. Use the p_id value to find rows of F with matching p_id values.

  4. From each qualifying F row, take the y value.

Focus on step 3. If F has an index on p_id, then that step is going to run quickly, otherwise a scan is required. And remember, we're already in a loop from step 1. So it is really important for the speed of this query to have an index on F.p_id.

Finally, here is the point: Think about the index on F.p_id. It maps p_id values to F row addresses. Hey! That sounds like a back-pointer!


So do foreign keys have back-pointers? Yes, if you want them to, (and you usually do). They exist at the physical level of the database, in the form of an index on the foreign key columns. Here is the original schema, with an additional statement to create the FK index:
create table P( p_id int not null primary key, x int ); create table F( f_id int not null primary key, p_id int default null references P, y int ); create index fp_index on F(p_id); -- fp_index is the name of the index

Note that create table and create index are both part of the DDL (Data Definition Language) part of SQL. However, they are quite different conceptually. create table statements contribute to the logical layer of your database. create index statements pertain to the physical layer.