CS152 ML Homework Extra Credit: Beyond Regular Expressions

Part I: The death of the regular expression

In class, someone asked why I had a low opinion of regular expressions. The short version: they are not extensible, and they don't produce any interesting output (just success or failure). In this extra credit, you explore more general abstractions. If you do any of this work, please submit it in file [[parcom.sml]].

A classic regular expression is empty, a character, a concatenation, an alternation, or the closure of a regular expression. Because it's no harder, we will use ``character classes'' (represented as functions, of course) in place of single characters.
<>=
datatype regexp = EMPTY
                | CHARCLASS of char -> bool
                | CAT of regexp * regexp
                | STAR of regexp
                | OR of regexp * regexp
@
A simple character can be represented this way:
<>=
val char : char -> regexp = fn c => CHARCLASS (fn c' => c = c')
@
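As a small illustration of how values of this datatype are built and traversed, here is a sketch that constructs a regexp for optionally signed integers and defines a structurally recursive function over it. The names [[digit]], [[number]], [[sign]], [[integer]], and [[nullable]] are hypothetical and not part of the assignment, and the sketch assumes that [[EMPTY]] matches the empty string.

```sml
datatype regexp = EMPTY
                | CHARCLASS of char -> bool
                | CAT of regexp * regexp
                | STAR of regexp
                | OR of regexp * regexp

val char : char -> regexp = fn c => CHARCLASS (fn c' => c = c')

(* Hypothetical examples, not part of the handout. *)
val digit   = CHARCLASS Char.isDigit
val number  = CAT (digit, STAR digit)      (* one or more digits *)
val sign    = OR (char #"-", EMPTY)        (* optional minus sign *)
val integer = CAT (sign, number)

(* nullable r: could r match the empty string?
   Assumes EMPTY denotes the empty string. *)
fun nullable EMPTY         = true
  | nullable (CHARCLASS _) = false
  | nullable (CAT (a, b))  = nullable a andalso nullable b
  | nullable (STAR _)      = true
  | nullable (OR (a, b))   = nullable a orelse nullable b
```

Note that [[nullable]] only inspects the shape of the regexp; it never looks at input, so it is not a substitute for [[recognize]].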

  1. Write an ML function [[recognize]] of type [[regexp -> string -> bool]] which tells whether a string matches a regular expression. Hint: you may find success and failure continuations useful.
  2. Now write it again so that you may usefully partially apply it. That is, write a compiler that translates a regexp into an automaton.
  3. Now get rid of the [[regexp]] datatype entirely and instead define higher-order functions [[empty]], [[charclass]], [[cat]], [[star]], and [[or]] directly. Write a new version of [[recognize]] to go with them.
So who cares? What's the big deal here? Well, you can't extend a [[datatype]], but you can write new functions!
  1. Write a function [[all]] that is like [[star]] except that it is greedy; that is, it insists on consuming as much input as possible. So the string [[xxx]] would match [[cat (star (char #"x"), char #"x")]] but not [[cat (all (char #"x"), char #"x")]], because [[all]] consumes every [[x]] and leaves nothing for the final [[char #"x"]].
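The difference between [[star]] and [[all]] is easiest to see with explicit success and failure continuations. What follows is one possible sketch, not the intended solution: the type names and the driver [[matches]] are assumptions made for illustration, and the representation (a matcher as a function of input, success continuation, and failure continuation) is only one of several that would work. [[star]] hands its caller a way to resume with more iterations; [[all]] throws that option away.

```sml
(* Illustrative types, not from the handout.  A success continuation gets
   the unconsumed input and a thunk that resumes with the next alternative. *)
type 'r fail    = unit -> 'r
type 'r succ    = char list -> 'r fail -> 'r
type 'r matcher = char list -> 'r succ -> 'r fail -> 'r

fun charclass p (c :: cs) succ fail = if p c then succ cs fail else fail ()
  | charclass _ []        _    fail = fail ()

fun char c = charclass (fn c' => c = c')

(* Run m1, then m2 on what is left; if m2 fails, resume inside m1. *)
fun cat (m1, m2) cs succ fail =
  m1 cs (fn cs' => fn resume => m2 cs' succ resume) fail

(* Try zero iterations first; on failure downstream, try one more.
   The length test guards against looping when m consumes nothing. *)
fun star m cs succ fail =
  succ cs (fn () =>
    m cs (fn cs' => fn resume =>
            if length cs' < length cs then star m cs' succ resume
            else resume ())
         fail)

(* Greedy: keep iterating while m matches, and never back off.
   The resume thunk from m is discarded, so there is no way to
   return to a shorter match. *)
fun all m cs succ fail =
  m cs (fn cs' => fn _ =>
          if length cs' < length cs then all m cs' succ fail
          else succ cs' fail)
       (fn () => succ cs fail)

(* Hypothetical driver: succeed only if the whole string is consumed. *)
fun matches m s =
  m (explode s) (fn rest => fn resume => null rest orelse resume ())
    (fn () => false)
```

With these definitions, [[matches (cat (star (char #"x"), char #"x")) "xxx"]] evaluates to [[true]], while [[matches (cat (all (char #"x"), char #"x")) "xxx"]] evaluates to [[false]]. The matchers are applied inline rather than bound with [[val]] so that the value restriction does not get in the way.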

Part II: The birth of parsing combinators

OK, enough fooling around. Let's define a parser from [['a]] to [['b]] as a function that will take a sequence of values of type [['a]] and do one of two things: succeed, producing a value of type [['b]] along with the unconsumed input, or fail.
  1. Define the function type [[('a, 'b) parser]] to represent a parser. Hint: try the following types for success and failure continuations:
    type 'b fail = unit -> 'b
    type 'b resume = unit -> 'b
    type ('a, 'b) succ = 'a list -> 'b -> 'b resume -> 'b
    

  2. Write a function
    eof : ('a, unit) parser
    
    which succeeds when it has reached end of file and fails otherwise.

  3. Write a function
    return : 'b -> ('a, 'b) parser
    
    which always succeeds and never consumes any input.

  4. Write a function
    expect : ('a -> bool) -> ('a, 'a) parser
    
    such that [[expect p]] succeeds if the input is nonempty and the first item in the input satisfies [[p]] and fails otherwise. If [[expect p]] succeeds, it returns that first item.

  5. The grownup version of [[or]] operates on parsers:
    infix |||
    op ||| : ('a, 'b) parser * ('a, 'b) parser -> ('a, 'b) parser
    
    The parser [[p1 ||| p2]] first tries [[p1]], then [[p2]]. You will need to build a suitable failure continuation for [[p1]].

  6. The most mathematically deep parsing combinator allows us to combine two parsers in sequence, where the output of one parser is used to determine the second parser:
    infix >>=
    op >>= : ('a, 'b) parser * ('b -> ('a, 'b) parser) -> ('a, 'b) parser
    
    (It would be pleasant if [[>>=]] had an even more general type, but to make it so would require a better way of managing backtracking than explicit success and failure continuations. One technique that is particularly effective is to have an [[('a, 'b) parser]] return a value of type [[('a list * 'b) list]], but this is efficient only if the outer list is lazy.)

    To implement [[p >>= k]] successfully will require constructing a suitable success continuation for [[p]].

  7. To implement regular expressions, write the following parsers and parser combinators:
    type regexp = (char, string) parser
    empty : regexp
    charclass : (char -> bool) -> regexp
    cat : regexp * regexp -> regexp
    star : regexp -> regexp
    
    If you lean heavily on the combinators you have already done (especially [[>>=]], [[return]], and [[|||]]), this will be trivial. (You will also find the predefined function [[str]] useful.)
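To make the roles of the success and failure continuations in items 1 through 6 concrete, here is a minimal sketch under stated assumptions. The definition of [[parser]] is a guess at one shape consistent with the hint types, and the driver [[run]] is hypothetical. The definitions carry no type annotations, so ML infers types more general than the signatures above; in particular [[run]] instantiates the answer type to an option, which the hint types (where the answer type coincides with [['b]]) would not permit.

```sml
type 'b fail   = unit -> 'b
type 'b resume = unit -> 'b
type ('a, 'b) succ = 'a list -> 'b -> 'b resume -> 'b

(* Assumed shape: input, success continuation, failure continuation. *)
type ('a, 'b) parser = 'a list -> ('a, 'b) succ -> 'b fail -> 'b

(* Succeeds only at end of input. *)
fun eof []       succ fail = succ [] () fail
  | eof (_ :: _) _    fail = fail ()

(* Always succeeds with x and consumes nothing. *)
fun return x input succ fail = succ input x fail

(* Succeeds with the first item if it satisfies p. *)
fun expect p (x :: rest) succ fail = if p x then succ rest x fail else fail ()
  | expect _ []          _    fail = fail ()

(* The failure continuation handed to p1 tries p2 on the same input. *)
infix |||
fun (p1 ||| p2) input succ fail =
  p1 input succ (fn () => p2 input succ fail)

(* The success continuation handed to p feeds its result v to k and runs
   the resulting parser on the remaining input; if that parser fails,
   control goes back to p through resume. *)
infix >>=
fun (p >>= k) input succ fail =
  p input (fn rest => fn v => fn resume => k v rest succ resume) fail

(* Hypothetical driver: SOME of the first result, or NONE on failure. *)
fun run p s =
  p (explode s) (fn _ => fn v => fn _ => SOME v) (fn () => NONE)
```

For example, [[run (expect Char.isDigit >>= (fn d => expect (fn c => c = d))) "77"]] evaluates to [[SOME #"7"]], and the same parser on [["78"]] yields [[NONE]], because the second parser depends on the digit the first one returned.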