Project 3: Dataflow Analysis

COMP 150-AVS, Fall 2018

Test case (part 4) due: Wed, Oct 10 @ 11:59pm

Main project due: Due Wed, Oct 17 @ 11:59pm

Oct 9 - Clarified that you only need to implement two analyses (avaiable expressions and very busy expressions).
Oct 7 - Added key argument to last several Bvset functions.

In this project, you will implement a general data flow analysis engine for RubeVM bytecode, and use it to build two of the data flow analyses we discussed in class: available expressions and very busy expressions.

To simplify the project, we have made some changes to the RubeVM bytecode format: (1) Instructions are no longer contained in functions. Instead, a program consists simply of a single list of instructions. (2) We have removed several instructions; see instr.ml for details (the removed instructions are commented out).

Once you compile the code skeleton, you'll be left with an executable main.byte that takes the input RubeVM file as an argument, plus an optional analysis name:

./main/byte r1.rubevm - Parse and print out bytecode
./main.byte -analysis ae r1.rubevm - Run available expressions on bytecode and display results
./main.byte -analysis vbe r1.rubevm - Run very busy expressions on bytecode and display results

In the project, we will represent sets of dataflow facts using bitvectors. The bitvector representation of sets admits all the usual operations, but has one slight twist: we need to pre-identify the range of the set, i.e., the possible values that may go in the set, so that we know how large a bitvector to create.

The file bvset.mli contains a module signature for bitvector sets (note that none of these functions should have externally visible side-effects):

type 'a bvset_key (* the key indicating which bits stand for what *)

(* Make the key for a bit vector set *)
val list_to_key : 'a list -> 'a bvset_key

(* Return a list of all members of the set. *)
val to_list : 'a bvset_key -> int -> 'a list

(* Return an empty set that ranges over the elements in the list. *)
val mkempty : 'a bvset_key -> int

(* Add an element to a set, or do nothing if elt already present.
   Raises an exception if elt not in range of set. *)
val insert : 'a bvset_key -> 'a -> int -> int

(* Remove an element to a set, or do nothing if elt not present.
   Raises an exception if elt not in range of set. *)
val remove : 'a bvset_key -> 'a -> int -> int

(* Return true iff element in set. Raises an exception if elt not in
   range of set. *)
val mem : 'a bvset_key -> 'a -> int -> bool

(* Set union. *)
val union : 'a bvset_key -> int -> int -> int

(* Set intersection. *)
val inter : 'a bvset_key -> int -> int -> int

(* diff a b returns a - b. *)
val diff : 'a bvset_key -> int -> int -> int

(* negate a returns 1 - a, where 1 is the set of all elements. Thus,
   negate (mkempty l) returns top. *)
val negate : 'a bvset_key -> int -> int

(* Return true iff sets are equal. *)
val equal : 'a bvset_key -> int -> int -> bool

For this part of the problem, you must implement this interface in bvset.ml. Recall that in bitvectors, each bit of an integer represents the presence or absence of a set member. Thus, somehow we need to keep track of which bits map to what. Inside of bvset.ml, we've given you a definition type 'a bvset_key = ('a, int) Hashtbl.t * (int, 'a) Hashtbl.t, which represents this mapping (You may not change this type.) Thus, a key for a bitvector is a pair (elt-to-int, int-to-elt), where elt-to-int is a hash table mapping each possible set member to its index in the bitvector, and int-to-elt is the inverse mapping.

For purposes of the project, bit-vectors are single integers, and you can assume that 63 bits is enough for anyone.

For example, if m is a bitvector key 'a' to position 0, 'b' to 1, and 'c' to 2, then under this key, 1 represents the set {'a'} and 5 represents the set {'a', 'c'}.

Note that for the last several functions (union through equal, you may or may not need to use the key argument; but you probably need to use it for at least one of {negate, equal}.

In class, we learned about several gen/kill dataflow problems. For this part of the project, you will implement gen/kill set computations for two of these analysis: available expressions and very busy expressions.

For these analyses, we will use the following representation of an expression, which is essentially the "right-hand side" of the bytecode instructions that compute results and store them in registers:

type kind_uop = U_is_int | U_is_str
type kind_binop = B_add | B_sub | B_mul | B_div | B_eq | B_lt | B_leq
type expr =
  | Uop of kind_uop * reg
  | Binop of kind_binop * reg * reg  (* op, src, src *)

To get started, we'll need to compute the expressions that occur in a RubeVM program. Implement the following function in dfa.ml:

exprs_of_prog (instrs:instr array):expr list - given a program, return all the expressions in the program.

Once we know all the expressions in the program, we can use Bvset.list_to_key to turn the list into a key for bitvector sets.Then given that key, we can compute the gen/kill sets of every statement in a RubeVM program. We'll define a type to represent the gen/kill bitvector sets:

type gen_kill = { gen : int; kill : int }

(Of course, to interpret this type, we'll need to know what the key is; but we'll assume we always have the key hanging around somewhere.) Now, you should implement the following two functions in dfa.ml:

avail_exprs (instrs:instr array) (k:expr bvset_key):gen_kill array - Return an array a the same size as instrs such that index a.(i) contains the gen/kill set for instrs.(i) for available expressions
very_busy_exprs - same as above, but for very busy expressions.

Hint: You'll probably need a list or set of all expressions in the program. You can either get that by running exprs_of_prog again or by using Bvset.to_list on max_int.

Finally, you will implement the dataflow analysis algorithm. Recall that a dataflow analysis is defined by its direction (forward or backward); whether it uses may (union) or must (intersection) at join points; its gen/kill sets; the sets of dataflow facts assumed at the program entry (for forward analysis) or program exit (for backward analysis); and "top," the initial sets of dataflow facts at all other program points. We define the following types to represent the direction and may/must choices:

type dir_type =
  | D_Forward
  | D_Backward

type may_must_type =
  | K_May
  | K_Must

Then, for this part of the project, implement the following functions:

pred : (instrs:instr array) -> (i:int) -> int list - return the predecessors of i in the program.
succ : (instrs:instr array) -> (i:int) -> int list - return the successors of i in the programm.

and then implement:

    dfa (instrs : instr array)
	(key : 'a bvset_key)
	(dir : dir_type)
	(may_must : may_must_type)
	(gen_kill : instr array -> 'a bvset_key -> gen_kill array)
	(entry_or_exit_facts : int)
	(top : int):int array =

The input to dfa is, in order: the program; a key for bitvectors of dataflow facts for the program (which for this project will be derived from exprs_of_prog); the direction; may/must; a function for computing the gen/kill sets for every program statement (which you implemented in part 2); the facts that hold at the entry (for available expressions) or exit (for very busy expressions); and top.

The return value of dfa is an array r that is the same size as instrs such that r.(i) is the set of dataflow facts at the out of statement i for a forward analysis, and at the in of statement i for a backward analysis.

We've already given you code in main.ml that calls dfa, so once you've written the function you can run it as described at the very top of the project description.

One of the hardest parts of this project is figuring out if the answers you're getting from the dataflow analysis are correct. The challenge is that the dataflow analysis is making various approximations (e.g., at a join points), so you really have to walk through the algorithm somewhat painstakingly to check your answers. We strongly suggest that you write a series of test cases that exercise each step: Try a program with just a single statement; try one with a couple of statements; try one with a couple of statements and a join point; try one with a couple of statements and loop; etc.

To help make things easier on everyone, you must develop at least one test case by the date specified at the top of the project description, and post it on Piazza. Your test case should consist of (a) RubeVM code, written in RubeVM assembly language accepted by the parser; (b) The output of available expressions analysis for the code, and (c) The output of very busy expressions analysis for the code.

Note: You don't need to have a working implementation to write the test case. In fact, we want you to do this on paper so you can use it to double-check your implementation.

Don't worry about the exact format of expressions for your test case, but if you want to get the details exactly right, you can take a look at output_expr in dfa.ml.

Project 3: Dataflow Analysis

Updates

Introduction

Part 1: Sets as Bitvectors

Part 2: Two transfer functions for expressions

Part 3: The dataflow analysis algorithm

Part 4: Sharing Test Cases