CS152 ML Homework Extra Credit: Beyond Regular Expressions

Part I: The death of the regular expression

In class, someone asked why I had a low opinion of regular expressions. The short version: they are not extensible, and they don't produce any interesting output (just success or failure). In this extra credit, you will explore more general abstractions. If you do any of this work, please submit it in a file named parcom.sml.

A classic regular expression is empty, a character, a concatenation, an alternation, or the closure of a regular expression. Because it's no harder, we will use "character classes" in place of single characters (represented as functions, of course).

<parcom.sml>= [D->]
datatype regexp
  = EMPTY
  | CHARCLASS of char -> bool
  | CAT       of regexp * regexp
  | STAR      of regexp
  | OR        of regexp * regexp

A simple character can be represented this way:

<parcom.sml>+= [<-D]
val char : char -> regexp = fn c => CHARCLASS (fn c' => c = c')
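
To get a feel for the datatype before starting the exercises, here is a small sketch of building values and recursing over their structure. It assumes the datatype begins with an EMPTY constructor; the names digit, letter, number, identifier, and nullable are made up for illustration and are not part of the assignment. The function nullable only asks whether a regexp can match the empty string, so it is not a recognizer.

```sml
datatype regexp
  = EMPTY
  | CHARCLASS of char -> bool
  | CAT       of regexp * regexp
  | STAR      of regexp
  | OR        of regexp * regexp

val char : char -> regexp = fn c => CHARCLASS (fn c' => c = c')

(* Hypothetical examples: character classes are just predicates. *)
val digit      : regexp = CHARCLASS Char.isDigit
val letter     : regexp = CHARCLASS Char.isAlpha
val number     : regexp = CAT (digit, STAR digit)                  (* one or more digits *)
val identifier : regexp = CAT (letter, STAR (OR (letter, digit)))

(* nullable r is true when r can match the empty string. *)
fun nullable EMPTY         = true
  | nullable (CHARCLASS _) = false
  | nullable (CAT (a, b))  = nullable a andalso nullable b
  | nullable (STAR _)      = true
  | nullable (OR (a, b))   = nullable a orelse nullable b
```

For instance, nullable number is false because at least one digit is required, while nullable (STAR digit) is true.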

  1. Write an ML function recognize of type regexp -> string -> bool which tells whether a string matches a regular expression. Hint: you may find success and failure continuations useful.
  2. Now write it again so that you may usefully partially apply it. That is, write a compiler that translates a regexp into an automaton.
  3. Now get rid of the regexp datatype entirely and instead define higher-order functions empty, charclass, cat, star, and or directly. Write a new version of recognize to go with them.
So who cares? What's the big deal here? Well, you can't extend a datatype, but you can write new functions!
  4. Write a function all that is like star except that it is greedy; that is, it insists on consuming as much input as possible. So the string xxx would match cat (star (char #"x"), char #"x") but not cat (all (char #"x"), char #"x").
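
If success and failure continuations are new to you, the following sketch shows the idea on a problem that has nothing to do with regular expressions: finding a sublist that sums to a target. Everything here (subsetSum, firstSolution, allSolutions) is a made-up illustration, not part of the assignment. The failure continuation says what to try next when the current choice does not work out; the success continuation receives a result together with a "resume" thunk that asks for the next alternative.

```sml
(* subsetSum xs target succ fail looks for a sublist of xs summing to target.
   On success it calls succ with the sublist and a thunk that resumes the search;
   when no choices remain it calls fail (). *)
fun subsetSum (xs : int list) (target : int)
              (succ : int list -> (unit -> 'r) -> 'r)
              (fail : unit -> 'r) : 'r =
  case xs
    of [] => if target = 0 then succ [] fail else fail ()
     | x :: rest =>
         (* First try including x; if that fails, back up and exclude x. *)
         subsetSum rest (target - x)
           (fn chosen => fn resume => succ (x :: chosen) resume)
           (fn () => subsetSum rest target succ fail)

(* Stop at the first answer by ignoring the resume thunk. *)
val firstSolution =
  subsetSum [3, 5, 2] 7 (fn s => fn _ => SOME s) (fn () => NONE)
  (* SOME [5, 2] *)

(* Collect every answer by calling resume after each success. *)
val allSolutions =
  subsetSum [3, 5, 2, 4] 7 (fn s => fn resume => s :: resume ()) (fn () => [])
  (* [[3, 4], [5, 2]] *)
```

The caller decides how much backtracking happens, which is the property that makes this style a reasonable fit for matching star.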

Part II: The birth of parsing combinators

OK, enough fooling around. Let's define a parser from 'a to 'b as a function that takes a sequence of values of type 'a and does one of two things: it either succeeds, producing a value of type 'b, or it fails.
  1. Define the function type ('a, 'b) parser to represent a parser. Hint: try the following types for success and failure continuations:
    type 'b fail = unit -> 'b
    type 'b resume = unit -> 'b
    type ('a, 'b) succ = 'a list -> 'b -> 'b resume -> 'b

  2. Write a function
    eof : ('a, unit) parser
    which succeeds when it has reached end of file and fails otherwise.

  3. Write a function
    return : 'b -> ('a, 'b) parser
    which always succeeds and never consumes any input.

  4. Write a function
    expect : ('a -> bool) -> ('a, 'a) parser
    such that expect p succeeds if the input is nonempty and the first item in the input satisfies p and fails otherwise. If expect p succeeds, it returns that first item.

  5. The grownup version of or operates on parsers:
    infix |||
    op ||| : ('a, 'b) parser * ('a, 'b) parser -> ('a, 'b) parser
    The parser p1 ||| p2 first tries p1, then p2. You will need to build a suitable failure continuation for p1.

  6. The most mathematically deep parsing combinator allows us to combine two parsers in sequence, where the output of one parser is used to determine the second parser:
    infix >>=
    op >>= : ('a, 'b) parser * ('b -> ('a, 'b) parser) -> ('a, 'b) parser
    (It would be pleasant if >>= had an even more general type, but to make it so would require a better way of managing backtracking than explicit success and failure continuations. One technique that is particularly effective is to have an ('a, 'b) parser return a value of type ('a list * 'b) list, but this is efficient only if the outer list is lazy.)

    To implement p >>= k successfully will require constructing a suitable success continuation for p.

  7. To implement regular expressions, write the following parsers and parser combinators:
    type regexp = (char, string) parser
    empty : regexp
    charclass : (char -> bool) -> regexp
    cat : regexp * regexp -> regexp
    star : regexp -> regexp
    If you lean heavily on the combinators you have already done (especially >>=, return, and |||), this will be trivial. (You will also find the predefined function str useful.)
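
For the curious, the alternative mentioned in the note under item 6 can be sketched as follows. This is not the representation the exercises ask for; it is a minimal illustration of the "lazy list of successes" idea, under the assumption that a lazy list is modeled with an explicit thunk for the tail. The names stream, lparser, append, concatMap, first, digit, twoDigits, and digitOrZero are all made up here. Notice that in this representation >>= gets the more general type, with the second parser free to produce a different result type from the first.

```sml
(* A lazy list: the tail is a thunk, so later alternatives are computed
   only when a consumer asks for them. *)
datatype 'a stream = Nil | Cons of 'a * (unit -> 'a stream)

(* Each success pairs the unread input with a result. *)
type ('a, 'b) lparser = 'a list -> ('a list * 'b) stream

fun append (Nil, ys)          = ys ()
  | append (Cons (x, xs), ys) = Cons (x, fn () => append (xs (), ys))

fun concatMap f Nil            = Nil
  | concatMap f (Cons (x, xs)) = append (f x, fn () => concatMap f (xs ()))

(* Always succeeds once, consuming nothing. *)
fun return (v : 'b) : ('a, 'b) lparser =
  fn input => Cons ((input, v), fn () => Nil)

(* Succeeds with the first item if it satisfies p. *)
fun expect (p : 'a -> bool) : ('a, 'a) lparser =
  fn []        => Nil
   | x :: rest => if p x then Cons ((rest, x), fn () => Nil) else Nil

infix 1 >>=
infix 0 |||

(* Alternatives from p come first; q is not run unless they are exhausted. *)
fun p ||| q = fn input => append (p input, fn () => q input)

(* Run p, then feed each of its results to k on the remaining input. *)
fun p >>= k = fn input => concatMap (fn (rest, v) => k v rest) (p input)

(* Take the first success, if any. *)
fun first (p : ('a, 'b) lparser) (input : 'a list) : 'b option =
  case p input of Nil => NONE | Cons ((_, v), _) => SOME v

val digit : (char, int) lparser =
  expect Char.isDigit >>= (fn c => return (Char.ord c - Char.ord #"0"))

val twoDigits : (char, int) lparser =
  digit >>= (fn tens => digit >>= (fn ones => return (10 * tens + ones)))

val digitOrZero : (char, int) lparser = digit ||| return 0
```

With these definitions, first twoDigits (explode "42x") evaluates to SOME 42, first twoDigits (explode "4x") evaluates to NONE, and first digitOrZero (explode "x") evaluates to SOME 0. Backtracking falls out of the list structure: a failure is just the empty stream, so no failure continuation has to be threaded through by hand.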