Project 5: Parallel Regexp Matching

COMP 121, Spring 2021

Performance test case due: Tue, Apr 27 @ 11:59pm

Main project due: Tue, May 4 @ 11:59pm

A regular expression (or regex) is a pattern, written in a domain-specific language, that describes a set of strings. Standard regex algorithms can very efficiently check whether a given string is in the set described by the regex. Many languages, including Java, support regexs.

In this project, you will implement a multi-threaded regex string matcher. The exact threading design is up to you, though we will give you some parameters.

Regexs are expressions in the following language:

    E,F ::= a           Matches a single character
          | EF          Matches E followed by F
          | E|F         Matches either E or F (note the | on the right is a regexp symbol)
          | E*          Matches zero or more occurrences of E ("Kleene star")
          | (E)         Matches E (parentheses used for grouping, they do not match input symbols)

For example, here are some regexps and strings that match and do not match them. Below, "" represents the empty string.

Regexp	Matches	Doesn't match
`a`	`a`	`b`, `c`
`abc`	`abc`	`a`, `ab`, `bc`, `cba`
`(ab)\|(cd)`	`ab`, `cd`	`a`, `abcd`, `abc`
`a*`	`""`, `a`, `aa`, `aaaaaaa`	`b`, `ab`, `ba`
`(a\|b)*c`	`c`, `ac`, `bc`, `aac`, `abc`, `bac`, `bbc`	`ca`, `cc`, `aca`

In regexs, * has higher precedence than concatenation has higher precedence than |. For example, ba* = b(a*), and ab|bc = (ab)|(bc), and a|b* = (a)|(b*). We've written a parser for you, so you don't need to worry about implementing precedence, but you do need to know the rules when testing your code. When in doubt, add parentheses.

Another approach to matching strings is to use nondeterministic finite automata (NFAs). For example, here is an NFA that recognizes the language (a|b)*abb:

An NFA, which is usually drawn as a graph, consists of a set of states (the nodes of the graph) and transitions (the edges of the graph). Here we've labeled the states S0 through S3, but the labels are just for our convenience—they don't mean anything. Some of the states are special: There is exactly one start state, in this case S0, which is drawn with an incoming edge from nowhere. There are zero or more final states, in this case just S3, which are drawn with a double circle. It is possible for the start state to also be a final state. Each edge is labeled with one of more characters. For example, the edge from S0 to S1 is labeled with a, and the self-loop from S0 to S0 is labeled with a and b.

To check whether an NFA matches a string, imagine putting a token down on the start state and then scanning the string, moving the token from state to state following an edge that's labeled with the string's next character. If it's possible to make such a sequence of moves and wind up on a final state after reaching the end of the string, then the string matches. Otherwise—if you wind up in a non-final state, or if it's impossible to make a move at some point in the middle—then the string does not match.

For example, the string babb matches because we can make the following sequence of moves: S0 -b-> S0 -a-> S1 -b-> S2 -b-> S3. On the other hand, b does not match because there is no sequence of moves terminating in a final state (e.g., we can do S0 -b-> S0, but S0 is not final).

As another example, the string abb matches because we can make the following moves: S0 -a-> S1 -b-> S2 -b-> S3. Notice that abb is matched even though there are other non-matching sequences of moves, e.g., S0 -a-> S0 -b-> S0 -b-> S0 finishes in a non-final state. In an NFA, as long as there is some sequence of moves that ends in a final state, the string is matched.

NFAs can also have special ε-transitions, which can be taken without consuming a character from the string. For example, consider the following NFA:

This NFA accepts the string ababa using the following sequence of moves: S0 -a-> S1 -b-> S2 -ε-> S0 -a-> S1 -b-> S2 -a-> S0. Notice that we can sometimes take the ε-transition, and sometimes not take it. The presence of ε-transitions or the presence of states that have multiple outgoing transitions with the same character is what makes NFAs nondeterministic: while looking for a match, we are allowed to make choices, and there is a match if any set of choices works.

Directly matching strings using regexs is possible, but it's somewhat clunky and may have poor performance. A better idea is to convert the regex into an NFA, and then match using the NFA. In fact, regexs and NFAs are equivalent in power in terms of the strings they can match (as are deterministic finite automata (DFAs), which are NFAs but without ε-transitions and with at most one transition with a given character from each state).

Here are the rules for translating from regexs to NFAs. The algorithm is recursive over the structure of the regex. At each step, we'll draw the NFA for a regex E as a box: . When we draw edges going into the box, we mean the edges connect to the start state of the regex representing E. When we draw edges going from the box, we mean the edges connect to all of the final states of the regex representing E.

Regexp	NFA	Final State Channges
Character `a`
Sequence `EF`		Final states of `E` made non-final
Disjunction `E\|F`
Star `E*`		Final states of `E` made non-final

For example, consider translating ab*|c to an NFA. Structurally, the regexp looks like this:

To translate this regex to an NFA, we walk over this structure recursively, building the NFA from the bottom up, as follows:

Build the NFA for a:
Build the NFA for b:
Apply the rule for star to the NFA for b to yield the NFA for b*:
Apply the rule for sequence to the NFA for a and the NFA for b* to yield the NFA for ab*:
Build the NFA for c:
Apply the rule for disjunction to the NFA for ab* and the NFA for c to yield the NFA for ab*|c:

Notice that this algorithm doesn't necessarily give you the most compact NFA possible, and in general there are an unbounded number of equivalent NFAs. But, for this project we will not worry about making the NFAs small (unless you choose to do this as part of improving the performance for part 2 below).

For the first part of the project, you will implement the regex-to-NFA translation above. You must implement the exact translation above. We will check your code by examining the NFA structure directly, so even if you see opportunities for optimizing the output NFA, do not take them in this part of the project.

To start out, we've given you a code skeleton with several pieces:

Interface Regex with implementations with constructors
- RChar(a) - character a
- RSeq(re1, re2) - sequence with re1 followed by re2
- ROr(re1, re2) - disjunction of re1 and re2
- RStar(re) - star of re
For example, ab*|c can be built as new ROr(new RSeq(new RChar('a'), new RStar(new RChar('b'))), new RChar('c')).
A class Parser for turning a string like ab*|c into a Regex. Note: The only allowed characters for this project's regexs are a-z in lowercase, plus operators | and *, plus parentheses. If you try anything else the regex parser will give you an error (possibly a mysterious one...sorry, error handling is hard!).
A class Match for testing your regex matching from the command line. Running java Match "regex" str will print yes if the string matches the regex, and no otherwise. For example, java Match "ab*|c" ab should print yes. (The regex has quotes around it so the shell does not get confused by any *s or |s.)
We have not given you a test infrastructure for this project, but we highly recommend you add a testing facility. You can look back at previous projects for the class and do this in whatever style you like.

Your task is to implement the following constructor/methods of class NFA (which is already called by Match#main):

NFA() - Construct an NFA with one start state and no transitions.
Object newState() - Add a new state to the NFA and return it. It's up to you how to represent states. We will compare states using equals and possibly hashCode, so make sure that two states are equals and have the same hashCode if and only if they represent the same state. If you use something like Integer or String to represent states, this will happen automatically.
void newTransition(Object start, char c, Object end) - Make a new transition from state start to state end on character c. The states must exist in the NFA already, otherwise this method should raise an exception. The character c must be in the range a-z (lowercase) or # (to represent an ε-transition) or this method should raise an exception.
void makeFinal(Object s) - Make state s final. The state must exist in the NFA already, otherwise this method should raise an exception
List<Object> states() - Return a list of the states of the NFA.
Object start_state() - Returns the start state.
List<Object> final_states() - Returns a list of the final states.
List<Map.Entry<Character, Object>> transition(Object state) - returns a list of pairs of transitions from the given state. An ε-transition should be represented using the character #. For example, in this NFA:

The edges from S0 are ('a', S0), ('b', S0), and ('a', S1).
NFA(Regex re) - Construct an NFA from a regex. This is the most interesting constructor/method in this part!

For the second part of your project, you will implement multi-threaded NFA matching. More specifically, inside the NFA class, implement a method boolean match(String s, int nthreads) that returns true if and only if s is matched by the NFA, using at most nthreads to do the matching. And, it should go without saying, but just in case: you may not use Java's regexs in your implementation! (You might find them handy for testing. Warning: Java regexs will look for a match anywhere in the string, not just at the beginning, which is what your regexs will do.)

The key place you can use multi-threading is whenever there is non-determinism. For example, consider again the first NFA from above:

If we are scanning a string, and the first character we see is a b, then there's no choice: We have to take the transition from S0 to S0. But, if the first character we see is an a, then there are two choices: we could either go from S0 to S0, or from S0 to S1. At such choice points, we can conceptually fork execution of the matcher and try one choice in one thread and the other choice in another thread.

The exact way you implement multi-threading, and the number of threads you use at any one time, is up to you. In addition to your code, include a file design.txt in the project submission that describes, in a half page to one-and-a-half pages of text, the design of multi-threading in your system, with references to classes and methods in the code. We will grade this rather loosely, but your explanation should be clear enough that a skilled programmer can look at your discussion, match it up with your code, and understand what you did.

Here are some considerations in writing match:

Your code may not use more than nthreads threads, including the main thread. On the other hand, it should also use more than one thread in appropriate circumstances. If your code always only uses one thread, you will not get many points for multi-threading!
We highly recommend looking in java.util.concurrent for existing abstractions you can use. For example, you could use ThreadPoolExecutor to manage a set of worker threads, or possibly ForkJoinPool. You could use LinkedBlockingQueue (or Deque) to share information between the main and worker threads.
One key challenge is allocating the work among threads efficiently. For example, if you create a new thread each time there's a non-deterministic choice, then the overhead of doing so might swamp the benefit of having threads. Another challenge is determining what path to explore next. Should you go breadth-first? Depth-first? Use some kind of metric to estimate the chances of success down one path? These are tricky questions!
Another key challenge is termination: match should return as soon as possible after finding a match (since any work afterwards is wasted effort). So you need to think about how to cancel threads that are in the middle of trying to match. Your code also needs to terminate if all possibilities have been explored and there is no match.
To measure performance of your implementation, you are going to need large strings and tricky NFAs. If the strings are short, then it's going to run fast almost no matter what. See below for the requirement to submit a performance test case.
This is a hard project and you will get almost all, or possibly all, of the points for an implementation that works with small NFAs and small strings. So, make sure your code works for the basic cases, and think of the time spent improving performance for large-scale cases as more "fun" than required for the project.

You must also submit one performance test for your regex on campuswire. Your test should be in the form of either (a) a single command-line call java Match "regexp" str; (b) a shell script that runs java Match; or (c) Java code that uses NFA methods to directly construct an NFA and calls match on the resulting NFA with some string. Accompanying your test should be a brief explanation (a few sentences will do) about why this is a useful performance test.

Put all your code in the code skeleton we have supplied, and upload your solution using Gradescope. Don't forget to submit your design.txt file.

Project 5: Parallel Regexp Matching

Introduction

Regexs

NFAs

Regex to NFA Conversion

Part 1: Regex to NFA Implementation

Part 2: A Multi-Threaded Regex Matcher

Performance Test Case

What to Turn In