Project 5: Parallel Regexp Matching

COMP 121, Spring 2021

Performance test case due: Tue, Apr 27 @ 11:59pm

Main project due: Tue, May 4 @ 11:59pm

A regular expression (or regex) is a pattern, written in a domain-specific language, that describes a set of strings. Standard regex algorithms can very efficiently check whether a given string is in the set described by the regex. Many languages, including Java, support regexs.

In this project, you will implement a multi-threaded regex string matcher. The exact threading design is up to you, though we will give you some parameters.

Regexs are expressions in the following language:

    E,F ::= a           Matches a single character
          | EF          Matches E followed by F
          | E|F         Matches either E or F (note the | on the right is a regexp symbol)
          | E*          Matches zero or more occurrences of E ("Kleene star")
          | (E)         Matches E (parentheses used for grouping, they do not match input symbols)

For example, here are some regexps and strings that match and do not match them. Below, "" represents the empty string.

RegexpMatchesDoesn't match
aab, c
abcabca, ab, bc, cba
(ab)|(cd)ab, cda, abcd, abc
a*"", a, aa, aaaaaaab, ab, ba
(a|b)*cc, ac, bc, aac, abc, bac, bbcca, cc, aca

In regexs, * has higher precedence than concatenation has higher precedence than |. For example, ba* = b(a*), and ab|bc = (ab)|(bc), and a|b* = (a)|(b*). We've written a parser for you, so you don't need to worry about implementing precedence, but you do need to know the rules when testing your code. When in doubt, add parentheses.

Another approach to matching strings is to use nondeterministic finite automata (NFAs). For example, here is an NFA that recognizes the language (a|b)*abb:

An NFA, which is usually drawn as a graph, consists of a set of states (the nodes of the graph) and transitions (the edges of the graph). Here we've labeled the states S0 through S3, but the labels are just for our convenience—they don't mean anything. Some of the states are special: There is exactly one start state, in this case S0, which is drawn with an incoming edge from nowhere. There are zero or more final states, in this case just S3, which are drawn with a double circle. It is possible for the start state to also be a final state. Each edge is labeled with one of more characters. For example, the edge from S0 to S1 is labeled with a, and the self-loop from S0 to S0 is labeled with a and b.

To check whether an NFA matches a string, imagine putting a token down on the start state and then scanning the string, moving the token from state to state following an edge that's labeled with the string's next character. If it's possible to make such a sequence of moves and wind up on a final state after reaching the end of the string, then the string matches. Otherwise—if you wind up in a non-final state, or if it's impossible to make a move at some point in the middle—then the string does not match.

For example, the string babb matches because we can make the following sequence of moves: S0 -b-> S0 -a-> S1 -b-> S2 -b-> S3. On the other hand, b does not match because there is no sequence of moves terminating in a final state (e.g., we can do S0 -b-> S0, but S0 is not final).

As another example, the string abb matches because we can make the following moves: S0 -a-> S1 -b-> S2 -b-> S3. Notice that abb is matched even though there are other non-matching sequences of moves, e.g., S0 -a-> S0 -b-> S0 -b-> S0 finishes in a non-final state. In an NFA, as long as there is some sequence of moves that ends in a final state, the string is matched.

NFAs can also have special ε-transitions, which can be taken without consuming a character from the string. For example, consider the following NFA:

This NFA accepts the string ababa using the following sequence of moves: S0 -a-> S1 -b-> S2 -ε-> S0 -a-> S1 -b-> S2 -a-> S0. Notice that we can sometimes take the ε-transition, and sometimes not take it. The presence of ε-transitions or the presence of states that have multiple outgoing transitions with the same character is what makes NFAs nondeterministic: while looking for a match, we are allowed to make choices, and there is a match if any set of choices works.

Directly matching strings using regexs is possible, but it's somewhat clunky and may have poor performance. A better idea is to convert the regex into an NFA, and then match using the NFA. In fact, regexs and NFAs are equivalent in power in terms of the strings they can match (as are deterministic finite automata (DFAs), which are NFAs but without ε-transitions and with at most one transition with a given character from each state).

Here are the rules for translating from regexs to NFAs. The algorithm is recursive over the structure of the regex. At each step, we'll draw the NFA for a regex E as a box: . When we draw edges going into the box, we mean the edges connect to the start state of the regex representing E. When we draw edges going from the box, we mean the edges connect to all of the final states of the regex representing E.

RegexpNFAFinal State Channges
Character a
Sequence EFFinal states of E made non-final
Disjunction E|F
Star E*Final states of E made non-final

For example, consider translating ab*|c to an NFA. Structurally, the regexp looks like this:

To translate this regex to an NFA, we walk over this structure recursively, building the NFA from the bottom up, as follows:

Notice that this algorithm doesn't necessarily give you the most compact NFA possible, and in general there are an unbounded number of equivalent NFAs. But, for this project we will not worry about making the NFAs small (unless you choose to do this as part of improving the performance for part 2 below).

For the first part of the project, you will implement the regex-to-NFA translation above. You must implement the exact translation above. We will check your code by examining the NFA structure directly, so even if you see opportunities for optimizing the output NFA, do not take them in this part of the project.

To start out, we've given you a code skeleton with several pieces:

Your task is to implement the following constructor/methods of class NFA (which is already called by Match#main):

For the second part of your project, you will implement multi-threaded NFA matching. More specifically, inside the NFA class, implement a method boolean match(String s, int nthreads) that returns true if and only if s is matched by the NFA, using at most nthreads to do the matching. And, it should go without saying, but just in case: you may not use Java's regexs in your implementation! (You might find them handy for testing. Warning: Java regexs will look for a match anywhere in the string, not just at the beginning, which is what your regexs will do.)

The key place you can use multi-threading is whenever there is non-determinism. For example, consider again the first NFA from above:

If we are scanning a string, and the first character we see is a b, then there's no choice: We have to take the transition from S0 to S0. But, if the first character we see is an a, then there are two choices: we could either go from S0 to S0, or from S0 to S1. At such choice points, we can conceptually fork execution of the matcher and try one choice in one thread and the other choice in another thread.

The exact way you implement multi-threading, and the number of threads you use at any one time, is up to you. In addition to your code, include a file design.txt in the project submission that describes, in a half page to one-and-a-half pages of text, the design of multi-threading in your system, with references to classes and methods in the code. We will grade this rather loosely, but your explanation should be clear enough that a skilled programmer can look at your discussion, match it up with your code, and understand what you did.

Here are some considerations in writing match:

You must also submit one performance test for your regex on campuswire. Your test should be in the form of either (a) a single command-line call java Match "regexp" str; (b) a shell script that runs java Match; or (c) Java code that uses NFA methods to directly construct an NFA and calls match on the resulting NFA with some string. Accompanying your test should be a brief explanation (a few sentences will do) about why this is a useful performance test.

Put all your code in the code skeleton we have supplied, and upload your solution using Gradescope. Don't forget to submit your design.txt file.