COMP 121, Spring 2021
Performance test case due: Tue, Apr 27 @ 11:59pm
Main project due: Tue, May 4 @ 11:59pm
A regular expression (or regex) is a pattern, written in a domain-specific language, that describes a set of strings. Standard regex algorithms can very efficiently check whether a given string is in the set described by the regex. Many languages, including Java, support regexs.
In this project, you will implement a multi-threaded regex string matcher. The exact threading design is up to you, though we will give you some parameters.
Regexs are expressions in the following language:
E,F ::= a Matches a single character | EF Matches E followed by F | E|F Matches either E or F (note the | on the right is a regexp symbol) | E* Matches zero or more occurrences of E ("Kleene star") | (E) Matches E (parentheses used for grouping, they do not match input symbols)
For example, here are some regexps and strings that match and do not match them. Below, ""
represents the empty string.
Regexp | Matches | Doesn't match |
---|---|---|
a | a | b , c |
abc | abc | a , ab , bc , cba |
(ab)|(cd) | ab , cd | a , abcd , abc |
a* | "" , a , aa , aaaaaaa | b , ab , ba |
(a|b)*c | c , ac , bc , aac , abc , bac , bbc | ca , cc , aca |
In regexs, *
has higher precedence than concatenation has
higher precedence than |
. For example, ba* =
b(a*)
, and ab|bc = (ab)|(bc)
, and a|b* =
(a)|(b*)
. We've written a parser for you, so you don't need to
worry about implementing precedence, but you do need to know the rules
when testing your code. When in doubt, add parentheses.
Another approach to matching strings is to use nondeterministic
finite automata (NFAs). For example, here is an NFA that
recognizes the language (a|b)*abb
:
An NFA, which is usually drawn as a graph, consists of a set of
states (the nodes of the graph) and transitions (the
edges of the graph). Here we've labeled the states S0
through S3
, but the labels are just for our
convenience—they don't mean anything. Some of the states are
special: There is exactly one start state, in this case
S0
, which is drawn with an incoming edge from
nowhere. There are zero or more final states, in this case just
S3
, which are drawn with a double circle. It is possible for the
start state to also be a final state. Each edge is labeled
with one of more characters. For example, the edge from
S0
to S1
is labeled with a
, and
the self-loop from S0
to S0
is labeled with
a
and b
.
To check whether an NFA matches a string, imagine putting a token down on the start state and then scanning the string, moving the token from state to state following an edge that's labeled with the string's next character. If it's possible to make such a sequence of moves and wind up on a final state after reaching the end of the string, then the string matches. Otherwise—if you wind up in a non-final state, or if it's impossible to make a move at some point in the middle—then the string does not match.
For example, the string babb
matches because we can make
the following sequence of moves: S0 -b-> S0 -a-> S1 -b-> S2 -b->
S3
. On the other hand, b
does not match because
there is no sequence of moves terminating in a final state (e.g., we
can do S0 -b-> S0
, but S0
is not final).
As another example, the string abb
matches
because we can make the following moves: S0 -a-> S1 -b-> S2 -b->
S3
. Notice that abb
is matched even though there
are other non-matching sequences of moves, e.g., S0 -a-> S0 -b->
S0 -b-> S0
finishes in a non-final state. In an NFA, as long as
there is some sequence of moves that ends in a final state, the
string is matched.
NFAs can also have special ε-transitions, which can be taken without consuming a character from the string. For example, consider the following NFA:
This NFA accepts the string ababa
using the following
sequence of moves: S0 -a-> S1 -b-> S2 -ε-> S0 -a-> S1
-b-> S2 -a-> S0
. Notice that we can sometimes take the
ε-transition, and sometimes not take it. The presence of
ε-transitions or the presence of states that have multiple
outgoing transitions with the same character is what makes NFAs
nondeterministic: while looking for a match, we are allowed to make
choices, and there is a match if any set of choices works.
Directly matching strings using regexs is possible, but it's somewhat clunky and may have poor performance. A better idea is to convert the regex into an NFA, and then match using the NFA. In fact, regexs and NFAs are equivalent in power in terms of the strings they can match (as are deterministic finite automata (DFAs), which are NFAs but without ε-transitions and with at most one transition with a given character from each state).
Here are the rules for translating from regexs to NFAs. The algorithm
is recursive over the structure of the regex. At each step, we'll draw
the NFA for a regex E
as a box: . When we draw edges going into the box, we mean the
edges connect to the start state of the regex representing
E
. When we draw edges going from the box, we mean the
edges connect to all of the final states of the regex
representing E
.
Regexp | NFA | Final State Channges |
---|---|---|
Character a | ||
Sequence EF | Final states of E made non-final | |
Disjunction E|F | ||
Star E* | Final states of E made non-final |
For example, consider translating ab*|c
to an
NFA. Structurally, the regexp looks like this:
To translate this regex to an NFA, we walk over this structure recursively, building the NFA from the bottom up, as follows:
a
: b
: b
to yield the NFA for b*
: a
and the NFA for b*
to yield the NFA for ab*
: c
: ab*
and the NFA for c
to yield the NFA for ab*|c
:
Notice that this algorithm doesn't necessarily give you the most compact NFA possible, and in general there are an unbounded number of equivalent NFAs. But, for this project we will not worry about making the NFAs small (unless you choose to do this as part of improving the performance for part 2 below).
For the first part of the project, you will implement the regex-to-NFA translation above. You must implement the exact translation above. We will check your code by examining the NFA structure directly, so even if you see opportunities for optimizing the output NFA, do not take them in this part of the project.
To start out, we've given you a code skeleton with several pieces:
Regex
with implementations with constructors
RChar(a)
- character a
RSeq(re1, re2)
- sequence with re1
followed by re2
ROr(re1, re2)
- disjunction of re1
and re2
RStar(re)
- star of re
ab*|c
can be built as new ROr(new
RSeq(new RChar('a'), new RStar(new RChar('b'))), new
RChar('c'))
.Parser
for turning a string like
ab*|c
into a Regex
. Note: The only
allowed characters for this project's regexs are
a
-z
in lowercase, plus operators | and *,
plus parentheses. If you try anything else the regex parser will
give you an error (possibly a mysterious one...sorry, error handling
is hard!). Match
for testing your regex matching from
the command line. Running java Match "regex" str
will
print yes
if the string matches the regex, and
no
otherwise. For example, java Match "ab*|c"
ab
should print yes
. (The regex has quotes
around it so the shell does not get confused by any *
s
or |
s.)
Your task is to implement the following constructor/methods of class
NFA
(which is already called by Match#main
):
NFA()
- Construct an NFA with one start state and
no transitions.Object newState()
- Add a new state to the NFA and
return it. It's up to you how to represent states. We will compare
states using equals
and possibly hashCode
,
so make sure that two states are equals
and have the
same hashCode
if and only if they represent the same
state. If you use something like Integer
or
String
to represent states, this will happen
automatically.void newTransition(Object start, char c, Object
end)
- Make a new transition from state start
to
state end
on character c
. The states must
exist in the NFA already, otherwise this method should raise an
exception. The character c
must be in the range
a-z
(lowercase) or #
(to represent
an ε-transition) or this method should raise an exception.void makeFinal(Object s)
- Make state s
final. The state must exist in the NFA already, otherwise this
method should raise an exceptionList<Object> states()
- Return a list of the
states of the NFA.Object start_state()
- Returns the start
state.List<Object> final_states()
- Returns a list
of the final states.List<Map.Entry<Character, Object>>
transition(Object state)
- returns a list of pairs of
transitions from the given state. An ε-transition should
be represented using the character #
. For example, in
this NFA:
The edges from S0
are ('a', S0)
,
('b', S0)
, and ('a', S1)
.
NFA(Regex re)
- Construct an NFA from a regex. This
is the most interesting constructor/method in this part! For the second part of your project, you will implement
multi-threaded NFA matching. More specifically,
inside the NFA
class, implement a method boolean
match(String s, int nthreads)
that returns true
if and only if s
is matched by the NFA, using at most
nthreads
to do the matching. And, it should go without
saying, but just in case: you may not use Java's regexs in your
implementation! (You might find them handy for testing. Warning: Java
regexs will look for a match anywhere in the string, not just
at the beginning, which is what your regexs will do.)
The key place you can use multi-threading is whenever there is non-determinism. For example, consider again the first NFA from above:
If we are scanning a string, and the first character we see is a
b
, then there's no choice: We have to take the transition
from S0
to S0
. But, if the first character
we see is an a
, then there are two choices: we could
either go from S0
to S0
, or from
S0
to S1
. At such choice points, we can
conceptually fork execution of the matcher and try one choice in
one thread and the other choice in another thread.
The exact way you implement multi-threading, and the number of
threads you use at any one time, is up to you. In addition to your
code, include a file design.txt
in the project submission
that describes, in a half page to one-and-a-half pages of text, the design of
multi-threading in your system, with references to classes and methods
in the code. We will grade this rather loosely, but your
explanation should be clear enough that a skilled programmer can look
at your discussion, match it up with your code, and understand what you did.
Here are some considerations in writing match
:
nthreads
threads, including
the main thread. On the other hand, it should also use more than one
thread in appropriate circumstances. If your code always only uses
one thread, you will not get many points for multi-threading!java.util.concurrent
for existing abstractions you can use. For example, you could use
ThreadPoolExecutor
to manage a set of worker
threads, or possibly ForkJoinPool
.
You could use LinkedBlockingQueue
(or
Deque
) to share information between the main and
worker threads.match
should
return as soon as possible after finding a match (since any work
afterwards is wasted effort). So you need to think about how to
cancel threads that are in the middle of trying to match. Your code
also needs to terminate if all possibilities have been explored and
there is no match. You must also submit one performance test for your regex on
campuswire. Your test should be in the form of either (a) a single
command-line call java Match "regexp" str
; (b) a shell
script that runs java Match
; or (c) Java code that uses
NFA
methods to directly construct an NFA and calls
match
on the resulting NFA with some string. Accompanying
your test should be a brief explanation (a few sentences will do)
about why this is a useful performance test.
Put all your code in the code skeleton we have
supplied, and upload your solution using Gradescope. Don't forget to
submit your design.txt
file.