Relational Algebra Implementation

Your goals in this assignment are to:

  1. Implement the operators of the Relational Algebra, and
  2. Use your implementation to evaluate a number of queries.

You are not starting from scratch. This archive contains code that you will extend. The code includes classes implementing a rudimentary database system that runs in main memory. The only column type supported is string, (i.e. std::string). Database, Table and Row classes are provided, but you need to fill in most of the implementations.

The archive also includes two sets of tests, as described below. You can compile and run the code in the archive, but all the tests will fail. After you implement the relational algebra operators, and a set of queries using your relational algebra implementation, the tests will pass.

You are expected to modify this code significantly. A few general comments on how to proceed:

What's in the archive

Mostly empty implementations of Database, Table and Row

The Table class is partially implemented. It has data members for storing metadata: table name and a list (vector) of column names. You need to extend the representation to provide a place to store Rows. You will need to modify Table::add to actually store the Rows. Note that the current implementation of Table::add says IMPLEMENT_ME(). You will see these calls wherever you need to provide code.

You may find that additional member functions are required. For example, it is possible that you may find Table::remove to be necessary. Go ahead and add any data members and member functions that your implementation requires.

Similarly, Row is partially implemented. You will again need to provide a representation, and operate on it.

The Database class is complete. Database is responsible for tracking and cleaning up Tables, so be careful not to modify this code at all.

An empty implementation of Relational Algebra

RelationalAlgebra.h contains declarations of the functions comprising Relational Algebra:
Table *onion(Table *r, Table *s); Table *intersect(Table *r, Table *s); Table *diff(Table *r, Table *s); Table *product(Table *r, Table *s); Table *rename(Table *r, const NameMap &name_map); Table *select(Table *r, RowPredicate predicate); Table *project(Table *r, const ColumnNames &columns); Table *join(Table *r, Table *s);

(Note that union is a keyword in C++, so the union operator is represented by the onion function.)

Comments in RelationalAlgebra.h explain the behavior of each operator, along with a discussion of possible exceptions. You will need to implement these functions, in RelationalAlgebra.cpp. Your implementation will rely on the Database, Table and Row types, and as you proceed, you may find it useful to extend the interfaces of these types, (i.e., add member functions). Don't modify the provided interfaces, as the tests will then probably not compile.

Tests

There are two source files with tests, test_ra.cpp and test_queries.cpp. main.cpp has the main() function which simply invokes the tests.

test_ra.cpp

test_ra.cpp contains 42 tests that test your relational algebra implementation, both "positive" cases (e.g. compute an intersection and see that the results are as expected), and "negative" ones, (e.g. try inserting a row with three columns into a table created with two columns).

Here is a typical positive test:
static void union_compatible_both_non_empty() { Table* r = Database::new_table("r", ColumnNames({"a", "b"})); r->add({"1", "44"}); r->add({"1", "55"}); r->add({"2", "66"}); Table* s = Database::new_table("s", ColumnNames({"a", "b"})); s->add({"1", "44"}); s->add({"3", "77"}); Table* control = Database::new_table("control", ColumnNames({"a", "b"})); control->add({"1", "44"}); control->add({"1", "55"}); control->add({"2", "66"}); control->add({"3", "77"}); assert(eq(control, onion(r, s))); }

This creates two tables, r and s, both with columns a and b. Rows are added to the tables. Note that the {"1", "44"} row is added to each. Next, a control table is created. This table contains the expected result. In this table, note that {"1", "44"} appears once, because the output of any relational algebra operation is a table, and tables do not have duplicates. A buggy implementation might have two occurrences, violating uniqueness, and this would cause the test to fail.

The last statement in the code compares the union of r and s with the control, and asserts that output from the union and the control should have the same rows.

Here is a typical negative test:
static void add_row_with_too_many_columns() { Table* t = Database::new_table("t", ColumnNames({"a", "b"})); try { t->add({"1", "2", "3"}); fail(); } catch (RowException& e) { // expected } }

The test starts by creating a new table named t, with columns named a and b. Next, inside a try block, we add a row with three values. This is incorrect, because the table was created with two columns. Correct behavior is to throw a RowException. If that happens the test succeeds. If that does not happen, and we return from t->add without throwing an exception, fail() is called, which causes the test to fail.

test_queries.cpp

test_queries.cpp contains empty implementations of five queries, test_q0 through test_q4. test_q0 is an example, discussed below. You need to fill in implementations for test_q1 through test_q4.

These tests depend on the following database:
user(user_id, username, birth_date) routing(from_user_id, to_user_id, message_id) message(message_id, send_date, text)

These tables are populated by the files in the db directory: user.csv, routing.csv, message.csv. Your task is to fill in the query implementations, and the results are tested for correctness. Predicates that you are likely to need are already present in test_queries.cpp. You are free to use them or discard them and write your own.

Your code should look something like test_q0(), which is provided as an example. static void test_q0() { Table *q0 = diff( project(user, ColumnNames({{"username"}})), project( join( rename( select(routing, q0_predicate), NameMap({{"from_user_id", "user_id"}})), user), ColumnNames({{"username"}}))); // print_table("q0", q0); Table *control0 = Database::new_table("control0", ColumnNames({{"username"}})); control0->add({{"Intaglio"}}); control0->add({{"Unguiferous"}}); assert(eq(control0, q0)); }

The query implemented here is: What are the names of users who have not sent a message to themselves?

The implementation works as follows, proceeding from the most nested part outward.

Utilities

These files contain various useful but uninteresting things that you shouldn't have to touch.

Makefile

A Makefile is provided. If you add .cpp or .h source, be sure to modify the Makefile appropriately.

Building the code

Unpack the archive, cd into the ra directory, and then just run make: jao@mintyzack ~/teaching/tufts/115/hw1/ra $ make g++ -std=c++11 -Wall -Wno-unused-function -c ColumnNames.cpp -o ColumnNames.o g++ -std=c++11 -Wall -Wno-unused-function -c RelationalAlgebra.cpp -o RelationalAlgebra.o g++ -std=c++11 -Wall -Wno-unused-function -c Table.cpp -o Table.o g++ -std=c++11 -Wall -Wno-unused-function -c test_queries.cpp -o test_queries.o g++ -std=c++11 -Wall -Wno-unused-function -c Database.cpp -o Database.o g++ -std=c++11 -Wall -Wno-unused-function -c RowCompare.cpp -o RowCompare.o g++ -std=c++11 -Wall -Wno-unused-function -c test_ra.cpp -o test_ra.o g++ -std=c++11 -Wall -Wno-unused-function -c main.cpp -o main.o g++ -std=c++11 -Wall -Wno-unused-function -c Row.cpp -o Row.o g++ -std=c++11 -Wall -Wno-unused-function -c testing.cpp -o testing.o g++ -std=c++11 -Wall -Wno-unused-function -c util.cpp -o util.o g++ -std=c++11 -Wall -Wno-unused-function ColumnNames.o RelationalAlgebra.o Table.o test_queries.o Database.o RowCompare.o test_ra.o main.o Row.o testing.o util.o -o ra

Running the code

If you run the code without modification, you should see this output:
jao@mintyzack ~/teaching/tufts/115/hw1/ra $ ./ra db :-( 1/42 -- create_table_no_columns: FAILED -- Not yet implemented! :-( 2/42 -- create_table_duplicate_columns: FAILED -- Not yet implemented! ... :-( 41/42 -- join_one_empty: FAILED -- Not yet implemented! :-( 42/42 -- join_both_non_empty: FAILED -- Not yet implemented! SUMMARY: 42 tests 0 passed 42 failed Test initialization failed! :-( 1/ 5 -- test_q0: FAILED -- Not yet implemented! :-( 2/ 5 -- test_q1: FAILED -- Not yet implemented! :-( 3/ 5 -- test_q2: FAILED -- Not yet implemented! :-( 4/ 5 -- test_q3: FAILED -- Not yet implemented! :-( 5/ 5 -- test_q4: FAILED -- Not yet implemented! SUMMARY: 5 tests 0 passed 5 failed

When you successfully implement Row, Table, and RelationalAlgebra, and then the query implementations in test_queries.cpp, you should see this:

jao@mintyzack ~/teaching/tufts/115/hw1/ra $ ./ra db :-) 1/42 -- create_table_no_columns: ok :-) 2/42 -- create_table_duplicate_columns: ok ... :-) 41/42 -- join_one_empty: ok :-) 42/42 -- join_both_non_empty: ok SUMMARY: 42 tests 42 passed 0 failed :-) 1/ 5 -- test_q0: ok :-) 2/ 5 -- test_q1: ok :-) 3/ 5 -- test_q2: ok :-) 4/ 5 -- test_q3: ok :-) 5/ 5 -- test_q4: ok SUMMARY: 5 tests 5 passed 0 failed

Valgrind

Running valgrind should be part of your development workflow. Run it frequently to make sure that you don't introduce any memory problems as your develop your code. If you start with a clean run of valgrind, then you know that problems turned up by it later must be due to code changes since the previous, clean run.

E.g., here is a valgrind run from my implementation of this assignment: jao@mintyzack ~/teaching/tufts/115/hw1/ra $ valgrind ./ra db ... ==47587== ==47587== HEAP SUMMARY: ==47587== in use at exit: 72,704 bytes in 1 blocks ==47587== total heap usage: 292 allocs, 291 frees, 91,528 bytes allocated ==47587== ==47587== LEAK SUMMARY: ==47587== definitely lost: 0 bytes in 0 blocks ==47587== indirectly lost: 0 bytes in 0 blocks ==47587== possibly lost: 0 bytes in 0 blocks ==47587== still reachable: 72,704 bytes in 1 blocks ==47587== suppressed: 0 bytes in 0 blocks ==47587== Rerun with --leak-check=full to see details of leaked memory ==47587== ==47587== For counts of detected and suppressed errors, rerun with: -v ==47587== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

You want to see no corruption (of course), and 0 bytes lost in the leak summary. The still reachable part is fine; that appears to be related to memory used by valgrind itself.