Discussion questions for COMP 150DAO (Dataflow Analysis and Optimization)

5 April 2010: Approximation of formulas as dataflow facts

Constant propagation

Section 3.1 of the July Hoopl paper talks about dataflow facts for constant propagation.

Define a Haskell type, each value of which (except for Haskell undefined) represents a dataflow fact useful for constant propagation.
Define a mapping, using some mix of Haskell and informal notation, from your type from part 1 to a standard logical formula.
Define an approximation function that takes any logical formula and computes a "best-effort" approximation that is representible using the restricted logic represented by your solution to part 1.
Prove that if you approximate an arbitrary logical formula P, the approximation is weaker than P (that is, the approximation is implied by P).
Prove that for any given program, there are no finite ascending chains of approximations.

Homework

Using the representations described in the April Hoopl paper, define a representation of nodes sufficient to represent insertion sort and binary search.

Using your representation, create a Haskell value that represents a graph for insertion sort.

12 April 2010: Dataflow facts about paths

Questions:

In the world of model theory, what is a good way to represent a path within a control-flow graph?
Given a general control-flow graph and a label L, what model-theoretic idea would you use to characterize the set of all possible paths from the entry point to L? In general, is this set finite or infinite?
Do paths have a lattice structure? If not, can you extend paths to have a lattice structure? If paths have a native lattice structure, or if they can be extended to have a latice structure, does the lattice structure have any infinite ascending chains?
Using your knowledge of computer science, show that for any program, for any label, the set of all paths from the entry to that label can be represented by a finite term in a domain-specific language called Mystery. What is the language Mystery?
(This language is found in many areas of computer science, and it is part of the required curriculum, but it maybe in a course you haven't encountered yet.) What well-known algorithm can be used to construct the appropriate term given a control-flow graph?

[Bonus question for outside of class: does the Mystery language have a lattice structure? If so, what is it?]
Given a set of recursion equations about paths, can a finite term in the Mystery language be computed by the method of successive approximations?
How could you approximate paths in a way that guarantees both a lattice structure and ensures that
- There are no infinite ascending chains?
- The method of successive approximations works?

A label L' reaches a label L if some path from the entry to L passes through L'. There are efficient special-purpose algorithms for computing reachability, but we can also compute reachability through a forward dataflow analysis.

What is the dataflow fact?
What assertion (about paths) does it represent?
What operation corresponds to the disjunction at a join point?
How would you represent the top and bottom elements of the lattice?
What should the transfer function look like?

Dominator analysis

If every path from the entry to label L must pass through another label D, we say that D dominates L. D is called a dominator of L. There are efficient special-purpose algorithms for computing dominators, but we can also compute dominators through a forward dataflow analysis.

What is the dataflow fact?
What assertion (about paths) does it represent?
What operation corresponds to the disjunction at a join point?
How would you represent the top and bottom elements of the lattice?
What should the transfer function look like?
For a control-flow graph of N labels, approximately what is the worst-case space cost?

Loops

Can you conclude anything about loops?

Efficient dominator analysis

Dominator exercises:

Suppose that both D and D' dominate L. Prove that exactly one of the following three relations holds:
- D equals D'
- D dominates D'
- D' dominates D
Hint: proof by contradiction.
Suppose that I dominates L. I is called an immediate dominator of L if whenever D dominates L, either D = I or D dominates I.
- Prove that if L has an immediate dominator, the immediate dominator is unique.
- Prove that if L has a dominator, L has an immediate dominator.
Prove that the set of labels dominating L can be represented as a list with the following properties:
- If the set of dominators is empty, the list is []
- If the set of dominators is nonempty, the list has the form d:ds, where d is the immediate dominator of L, and ds is the list representing the set of labels that dominate d.
Show that the list representation has a lattice structure and that it is an exact "approximation" of the representation you used above in its inefficient form.
Show that the lattice inf for the list representation can be computed without allocating any cons cells.
To compute dominators with the list representation, what should the transfer function look like?
What is the total space cost to calculate dominators using the list representation?

Homework

The results of a dominator analysis can be represented as a tree:

data DominatorNode = Entry | Labelled Label
data DominatorTree = Dominates DominatorNode [DominatorTree]

This data structure is a rose tree in which each node may have arbitrarily many children. The root of the tree is the Entry constructor, and each label is a child of its immediate dominator.

Using Hoopl, write a forward dataflow analysis that computes dominators at each node in a control-flow graph, using the list representation [Label] above.
Write code to take the set of all lists and compute a dominator tree:
```
dominatorTree :: [[Label]] -> DominatorTree
```
Your code need not work on arbitrary lists of Labels, but it must work on lists that come out of your analysis in part 1. (Hint: reverse the lists.)

14 April 2010: Induction variables

Induction-variable elimination

Dataflow analysis is strictly more powerful than Hoare logic, because dataflow analysis can reason about paths. This kind of reasoning is needed to identify induction variables.

The simplest form of basic induction variable is one whose value describes an arithmetic progression during the course of the loop. As mathematical formula, we can say that i is an induction variable if its value at the loop header satisfies

i == v + c * k

where v is any value, c is the count of the loop iterations, and k is a constant. (More generally, k can be a loop-invariant expression, but for now, let's not go there.)

Dataflow analysis doesn't know from loops, but perhaps we can subsituted some all-paths reasoning. For i to be an induction variable at a particular point L, what has to be true?

For all paths from the procedure entry to point L, not including L, what must be true about i?
For all paths from L to itself, not including L, what must be true about i?
How would you express this property as a logical formula, universally quantified over program paths?
How would you express this property in a forward dataflow analysis?
What would the transfer functions look like?

Back to class home page