Project 4: Symbolic Execution

COMP 150-AVS, Fall 2018

Test case due: Mon, Nov 5 @ 11:59pm

Main project due: Thu, Nov 15 @ 11:59pm

In this project, you will implement a symbolic executor for RubyVM bytecode. You will use your SAT sovler implementation from Project 1.

To simplify the project, we've made some changes to the RubyVM bytecode format. Specifically, we've moved several instructions to simplify the project, and we've added two new instructions to support symbolic execution. You can find the changes in instr.ml, and we'll talk about them more below. One major change is that we've eliminated the heap.

Once you compile the code skeleton, you'll be left with an executable main.byte that takes the input RubeVM file as an argument, plus two other arguments: the scheduler that decides the order of path exploration, and the maximum number of steps to take during symbolic execution. For example:

The result of symbolic execution is a series of path conditions, one for each path explored in the program. The path conditions either describe a full path, from the start of the main function until it returns, or a path that starts at main and ends in an assertion violation.

Since symbolic execution typically could run for longer than we have time for, we've added a step count, which is the maximum number of instructions to be executed summed up across all paths.

Unlike the other projects, this one doesn't break down into sub-parts very well. Here is what you'll need to do.

Boolean and BVec

This project will build on project 1, so you'll need to put your solutions to boolean.ml and bvec.ml into this project.

Symbolic Execution State

Like in Project 2, as we execute programs we'll keep track of the stack. However, the stack might now include symbolic expressions that range over integers, i.e., symbolic expressions that are bitvectors. Thus, we've extended the type value in instr.ml so that it now includes `L_Sym of Bvec.bvec. This means that the register file regs could map registers to bitvectors.

We've introduced a new instruction, int_sym r, s, that creates a fresh, new integer bitvector named after the string s and stores the bitvector in register r. For purposes of this project, all bitvectors will have length 3. The names of bitvector variables come from the name given in int_sym appended with, (1) an underscore, then (2) an integer that starts with 0 and increments for each time that name is used, then (3) another underscore, then (4) the bit position. For example, the comments show the bitvectors stored in the registers for the following sequence of instructions:

  int_sym r0, foo  # r0 contains `L_Sym [EVar "foo_0_0"; EVar "foo_0_1"; EVar "foo_0_2"]
  int_sym r1, bar  # r1 contains `L_Sym [EVar "bar_0_0"; EVar "bar_0_1"; EVar "bar_0_2"]
  int_sym r2, foo  # r2 contains `L_Sym [EVar "foo_1_0"; EVar "foo_1_1"; EVar "foo_1_2"]
Then "computing" with bitvectors will use the functions you implemented in Project 1:
  add r3, r0, r1  # r3 contains `L_Sym [something, something, something]
where the somethings are the boolean formulae representing the corresponding bits computed by Bvec.add. Note that to keep everything to 3 bits, you will need to truncate the result of addition. (This means you can also effectively perform subtraction by overflowing 3 bits during addition!)

Note that the only way of introducing symbolic variables into the program is through the int_sym instruction. Important: For "performance," regular integers and bitvector integers will co-exist in this project. For example:

  int_sym r0, foo  # r0 contains `L_Sym [EVar "foo_0_0"; EVar "foo_0_1"; EVar "foo_0_2"]
  const r1, 3     # r1 contains `L_Int 3
  const r2, "bar"  # r2 contains `L_Str "bar"

This means that some instructions will be executed just like in Project 2, i.e., using concrete values, and other instructions will be executed using Boolean and Bvec. In some cases, there will be an instruction where one operand is symmbolic and one is concrete, e.g.,

  int_sym r0, foo
  const r1, 3
  add r2, r0, r1
  i_is_int r3, r2  # stores 1 in r3
In this case, your symbolic executor will need to encode 3 as a bitvector (using Bvec.bvec_of_int and truncating to 3 bits as needed) and then use Bvec.add to perform the addition.

In some cases, your symbolic executor will encounter an operation on symbolic expressions for which we have no support in Project 1, such as

  int_sym r0, foo
  int_sym r1, bar
  sub r2, r0, r1  # error!
In this case, your symbolic executor should raise an exception (any exception) and abort. We won't test this case in grading.

You can also assume that foreign functions will only be called with concrete values (i.e., no bvec argumments).

Path Conditions

Recall from Project 2 that as we executed a RubeVM program in the interpreter, we maintained a config keeping track of the program state. In symbolic execution, we also need to keep track of the path condition, which records the branches taken so far. We'll use Boolean.bexpr to represent path conditions. Thus, the type config in this project is now stack * Boolean.bexpr.

Path conditions will be used in three places in this project:

  1. At an if_zero instruction, use Boolean.sat to see if the branch might be taken, not taken, or both.
  2. At the new instruction assert_zero r, check if register r is guaranteed to always be zero under the current path condition.
  3. At the end of execution, we will print out the path conditions from each path taken.

Similarly to project 2, we'll implement a function run_instr : prog -> config -> inst_ret that takes a single step. In the normal case, it will return `Normal config. In other words, given a program and a (stack, path condition) pair, run_instr executes the current instruction and returns a new (stack, path condition) pair. The path condition will only be changed at if_zero instructions, following the symbolic execution logic we saw in class.

If the instruction is an assert_zero, and that instruction fails (i.e., the assertion is not implied by the path condition), then run_instr returns a `Failure result containing the current path condition.

If the instruction does anything else bad, like try to add an integer and a string, run_instr should throw an exception (any exception).

Hint: One of the slightly tedious parts of run_instr will be handling the case when a regular integer and a symbolic expression are used in the same instruction.

Schedulers

As we saw in class, symbolic execution might potentially fork when reaching a branch that could go either way. Once there is more than one possible execution, the policy for which one to execute next is a key design decision in symbolic execution.

In this project, we will represent this choice by implementing two schedulers. We will maintain the set of possible configurations as a list, with the invariant that the next configuration to execute is at the front of the list.

A scheduler has type config * config option * config list -> config list. Its input is one or two configurations to add to the given configuration list. Its output is a new configuration list ordered appropriately so that visiting the first configuration on the list should happen next.

For this project, you should implement two schedulers. Imagine that the execution tree (as we saw in class) in a complete tree, e.g.:

    a
   / \
  b   c
 /\   /\
d  e f  g
Then the scheduler sch_bfs should perform breadth-first traversal, i.e., it should visit the nodes in the order a, b, c, d, e, f, g. The schedule sch_dfs. should perform depth-first traversal, i.e., it should visit the nodes in the order a, b, d, e, c, f, g.

We won't use automated tests for this part of the project. We'll grade it by hand. So don't worry about the exact order, e.g., it's okay if sch_bfs visits c before b (especially because we haven't said which is the true branch and which is the false branch).

Running a Program

Finally, you'll implement function run_prog (p:prog) (sch:scheduler) (max_steps:int):prog_ret to symbolically execute a program. The first two arguments are obvious. The last argument is the maximum number of instructions that should be executed, i.e., the maximum number of times run_instr should be called. Once this count has been reached, run_prog will return even if symbolic execution hasn't explored all paths. Of course, it's also possible for symbolic execution to finish before reaching this count.

The run_prog function returns a list of results for each explored path. Results consist of the following:

The exact ordering of these results is undefined.

Hint: Be careful with imperative data structures, particularly the register file. If the symbolic exector forks execution, be sure not to share register files across different paths, otherwise surprising and incorrect things will happen.

As with previous projects, you must develop at least one test case by the date specified at the top of the project description, and post it on Piazza. Your test case should consist of (a) RubeVM code, written in RubeVM assembly language accepted by the parser, and (b) the output of symbolic execution on that code.

Note: You don't need to have a working implementation to write the test case. In fact, we want you to do this on paper so you can use it to double-check your implementation.

Don't worry about the exact output format for your test case, but if you want to get the details exactly right, you can take a look in main.ml.