How parsing combinators work

If you’re reading this handout, you’ve seen one of the two videos on what parsing combinators can do for us, and you’re ready to learn how they work. The handout starts with the get functions used in the SVM loader, and from there it works up to something much more flexible and powerful.

Replicating our token parsers in a functional setting

The get functions from the tokens.h file in module 2 have no conditional logic: a get function either succeeds or halts the program with an error. And when it succeeds, a get function consumes a token from a list as a side effect.

To introduce combinator parsing, I’m first going to write the same functions in ML without side effects: when a get function consumes a token, it returns the value gotten, paired with a list containing the remaining unconsumed tokens.

datatype loader_token = TNAME of string | TU32 of | TDOUBLE of real

fun get_int (TU32 n :: tokens) = (n, tokens)
  | get_int _                  = error "expected an int"

fun get_name (TNAME x :: tokens) = (x, tokens)
  | get_name _                   = error "expected a name"

Functions get_int and get_name have these types:

val get_int  : loader_token list -> * loader_token list
val get_name : loader_token list -> string    * loader_token list

Functional parsing with the error monad

I’d rather a syntax error didn’t crash my assembler. To avoid the possibility of a crash, new version get_int' and get_int' return a result in the error monad. They still return the list of unconsumed tokens, but now that list is paired with a value in the error monad.

datatype 'a error = OK of 'a | ERROR of string

val get_int'  : loader_token list -> error * loader_token list
val get_name' : loader_token list -> string    error * loader_token list

These functions return results that use the error monad, so if an error occurs, the caller can see ERROR, and the program doesn’t crash.

These types are starting to get complicated. I’m going to simplify them by defining a type synonym (also called “type abbreviation”):

type 'a proto_producer = loader_token list -> 'a error * loader_token list
val get_int'  : int    proto_producer
val get_name' : string proto_producer

A value of type 'a proto_producer is function that takes tokens and will, if successful, produce a value of type 'a (along with leftover tokens).

The implementations don’t change much:

fun get_int' (TU32 n :: tokens) = (OK n, tokens)
  | get_int' tokens             = (ERROR "expected an int", tokens)

fun get_name' (TNAME x :: tokens) = (OK x, tokens)
  | get_name' tokens              = (ERROR "expected a name", tokens)

Pattern matching on token lists doesn’t scale

When I want to start parsing more complicated things, the computation I care about is going to start to be obscured by boilerplate code. As an example, here’s an approximation of get_literal:

datatype value = NUM of real | BOOL of bool | NIL | EMPTYLIST | STRING of string

fun get_literal (TU32 n :: tokens)    = (OK (NUM (real n)), tokens)
  | get_literal (TDOUBLE x :: tokens) = (OK (NUM x),        tokens)
  | get_literal (TNAME "true" :: tokens) = (OK (BOOL true), tokens)
  | get_literal (TNAME "string" :: TU32 count :: tokens) =
      let val bytes = map something (take count tokens)
      in  (OK (STRING (implode bytes)), drop count tokens)
  | get_literal tokens = (ERROR "not a literal", tokens)

Its type is OK, written in either of these two ways:

val get_literal : value proto_producer
val get_literal : loader_token list -> value error * loader_token list

But look at the "string" case. I am pattern matching on two tokens at the head of the list, not just one. Although the code isn’t terrible, you should have a bad feeling about this:

To fix these issues, we’ll start using some of the operators in the videos:

Parsing with failure and higher-order functions

The “proto-producer” above can say only two things: “I recognize this input” and “this input is wrong.” A full-fledged “producer” can say a third thing: “I don’t recognize this input.” The type of a producer is

type 'a producer = input list -> ('a error * input list) option

where input can be a token (for a parser) or a character (for a lexer) or even something like an S-expression.1 That result type can have three forms:

A parser is a producer, and parsing begins with “atomic” parsers—one true producer for each species of token:

val int  : producer
val name : string producer

fun int (TU32 n :: tokens) = SOME (OK n, tokens)
  | int _                  = NONE

fun name (TNAME x :: tokens) = SOME (OK x, tokens)
  | name _                   = NONE

True producers easily support both sequence and choice operations. A sequence of two parsers produces whatever the second parser produces:

val >> : 'a producer * 'b producer -> 'b producer

A sequence p1 >> p2 succeeds only if both p1 and p2 succeed.

When a parser chooses between two alternatives, both alternatives have to produce the same type of value:

val <|> : 'a producer * 'a producer -> 'a producer

A choice p <|> p' can succeed in two ways: p succeeds, or p fails and p' succeeds. If p errors, then p' is never used.

Combinators >> and <|> have sane definitions; you’ll write algebraic laws for them in lab. And they justify the phrase “parsing combinators”; a combinator is a function that builds other functions.

Example parsers and parsing combinators

Parsers are built from atomic parsers using combinators. Here are a couple of base cases:

val eos : unit producer   (* recognizes end of stream *)
val one : input producer  (* takes one input token, if present *)

fun eos []       = SOME (E.OK (), [])
  | eos (_ :: _) = NONE

fun one [] = NONE
  | one (y :: ys) = SOME (E.OK y, ys)

Parsers support a form of “map”: a function is applied to the result of a successful parse.

val <$> : ('a -> 'b) * 'a producer -> 'b producer

Because <$> is usually defined infix, it’s not Curried; It’s the same <$> as you see in the error monad (with type error instead of producer), and it obeys the same laws. Because the forms of producer are more complicated than the forms of error, so are the laws:

In the more general case, we have a parser that produces a function, a parser that produces an argument, and we apply one to the other:

val <*> : ('a -> 'b) producer * 'a producer -> 'b producer

This one is also the same idea as in the error monad, but because there are nine cases, the laws (and the code) are unpleasant. But the laws for the baby error monad will give you the right idea.

Here’s a weird trivial one: a parser that reads nothing and always succeeds:

val succeed : 'a -> 'a producer

fun succeed a = (fn ts => SOME (OK a, ts))

The <$> combinator can be viewed as an abbreviation:

f <$> p = succeed f <*> p

Parser sat f p recognizes a subset of what p recognizes: only those values that satisfy f:

val sat : ('a -> bool) -> 'a producer -> 'a producer

A maybe parser is built using a function that may transform a value’s type:

val maybe : ('a -> 'b option) -> 'a producer -> 'b producer

maybe f p = valOf <$> sat isSome p

Abstract parsers and recursive parsers

The type abbreviation for 'a producer makes it easy to understand what producers do and how to write one, but if you get anything wrong when you’re building a parser, the type-error messages come from left field. To solve this problem, I’ve defined a parser interface in which the types of the input and the producer are abstract:

type input
type 'a producer

The representation is exposed via two functions:

val asFunction : 'a producer -> input list -> ('a error * input list) option
val ofFunction : (input list -> ('a error * input list) option) -> 'a producer

Although the type checker doesn’t know it, these functions are both the identity function, and if you compile with MLton, they cost nothing at run time. But you are unlikely to need them; it’s more common to use produce:

val produce : 'a producer -> input list -> 'a error
  (* consumes the entire list to produce a single 'a, or errors *)

The other difficulty is with recursion: if a parser isn’t a function, how are we to define one that’s recursive? The answer is to use a fixed-point combinator:

  val fix : ('a producer -> 'a producer) -> 'a producer
     (* fix g = g (fix g) *)

A recursive parser is defined like this

val arithmetic = fix (fn exp =>
     <|> curry op + <$> exp <~> sat (curry op = "+") name <*> exp)

The <~> says, in effect, “parse input and ignore the result”; parser p1 <~> p2 recognizes the same inputs as p1 >> p2, but instead of producing the result of p2, it produces the result of p1:

val <~> : 'a producer * 'b producer -> 'a producer
val >>  : 'a producer * 'b producer -> 'b producer

A tiny case study

As a realistic example, here is the expression parser for vScheme. Since expressions may have subexpressions, the parser is recursive and is written with the fix operator.

val exp = P.fix (fn exp =>
  let fun pair x y = (x, y)
      val binding  = oflist (pair <$> name <*> exp)
      val bindings = oflist (many binding)
  in     bracket "set"    (curry  S.SET <$> name <*> exp)
     <|> bracket "if"     (curry3 S.IFX <$> exp <*> exp <*> exp)
     <|> bracket "while"  (curry  S.WHILEX <$> exp <*> exp)
     <|> bracket "begin"  (       S.BEGIN <$> many exp)
     <|> bracket "let"    (letx S.LET    <$> bindings <*> exp)
     <|> bracket "let*"   (letstar       <$> bindings <*> exp)
     <|> bracket "letrec" (letx S.LETREC <$> bindings <*> exp)
     <|> bracket "quote"  (      (S.LITERAL o sexp) <$> one)
     <|> bracket "lambda" (curry S.LAMBDA <$> oflist (many name) <*> exp) 
     <|> oflist eos >> P.perror "empty list as Scheme expression"
     <|> S.LITERAL <$> (    kw "#t" >> P.succeed (S.BOOLV true)
                        <|> kw "#f" >> P.succeed (S.BOOLV false)
                        <|> S.BOOLV <$> bool
                        <|> S.NUM <$> int
                        <|> (S.SYM o Real.toString) <$> sxreal
     <|> S.VAR <$> name
     <|> oflist (curry S.APPLY <$> exp <*> many exp) 

The bracket and oflist combinators are functions I defined myself. The oflist parser is very strange: it unpacks a single S-expression into a sequence of S-expressions, then parses the sequence. It’s used for every Scheme form that is written as a list of S-expressions. The bracket parser looks for a keyword in brackets; it works more or less like the thing in my introductory video:

fun bracket word parser =
  oflist (kw word >> (parser <~> eos <|> perror ("Bad " ^ word ^ " form")))

Error handling

For now, don’t worry about error outcomes. Just write parsers that recognize the syntax you’ve defined. In testing, there will come a time when your parser says “input not recognized” and you don’t know why not. To improve the error message, you can use a new combinator, perror, which always causes an error. It’s useful because you can pass an informative message.

Most likely you want to define new combinators that use perror. Here’s one, commitTo, which says that I commit to finding a particular input:

fun commitTo what p = p <|> perror ("I was expecting " ^ what)

Lots more is possible; for example, here’s one that shows the next token, if present:

fun expected what =
  let fun bad t = Error.ERROR ("expected " ^ what ^
                               ", but instead found this token: " ^ showToken t)
  in check (  bad <$> one
          <|> perror ("expected " ^ what ^ ", but there was no more input")

or all the tokens:

fun expected what =
  let fun bads ts = Error.ERROR ("looking for " ^ what ^
                               ", got this input: " ^ showTokens ts)
  in  check ( bads <$> many one )

A definition of showTokens appears in file asmparse.sml.

  1. I say “producer” because “parser” and “lexer” (or “scanner”) are traditionally used to refer to different tools built using different technology. But using parsing combinators, both are “producers”.