Hints for writing and debugging combinator parsers

Draft

Recursive parsers using fixed-point combinators

When a parser is just a function, it’s easy to define recursively, provided you’re careful to avoid an infinite recursion. But when a parser is an abstraction, we have to use a fixed-point combinator, just as we do in the lambda calculus.

Suppose that an expression exp can contain subexpressions with the same syntax as exp. Then we are likely to define it using a recursion equation. Here, for example, is Wirth’s classic grammar for four-function arithmetic:

exp  ::= term { ("+" | "-") term }
term :: = factor { ("*" | "/") factor }
factor ::= int | "(" exp ")"

We’re going to want to turn this into a recursion equation

exp = ... exp ...

At which point we can define the parser by

fix (fn exp => ... exp ...)

If we want to make it efficient, we’ll further transform the code as

fix' (fn expref => let val exp = !!expref in ... exp ... end)

Here’s a sketch, not even typechecked:

fix' (fn expref =>
  let val exp = !! expref
      val factor = int <|> the "(" >> exp <~> the ")"
      val term   = build <$> factor <*> 
                   many (pair <$> (the "*" <|> the "/") <*> factor)
  in               build <$> term <*> 
                   many (pair <$> (the "+" <|> the "-") <*> term)
  end)

where

pair x y (x, y)
build v [] = v
build v ((symbol, v1) :: pairs) = build (operator symbol (v, v1)) pairs
operator "+" = op +
operator "-" = op -
operator "*" = op *
operator "/" = op div

Error detection

Error detection is hell. It’s not just combinator parsers; Yacc is a mess, and there are many other issues. If you’re curious to see some evidence, start with Clint Jeffery’s paper on error messages in LR parsers. For combinator parsers, there’s not a lot in the professional literature on error detection or reporting. Evan Czaplicki is doing lovely work on Elm, but he’s got a language to grow and he doesn’t publish.

The best practice I know of is to identify when the parser should absolutely commit to a syntactic form, and at that point write something in roughly like this:

p1 >> (* commit point is here *) (p1 <|> perror "syntax error:)

The <|> and perror are almost certainly something you want to be done for you by a combinator.

The next level up is instead of just perror, you write a combinator that can gobble up some bad tokens and maybe get parsing back on track. That’s what I’m doing in my instruction parser in asmparse.sml: if things go horribly wrong, my parser eats every token up to the next newline, then continues.

My opinion is that for your very first combinator parser, you shouldn’t have to muck around with error detection. Focus on parsing one line correctly, and if something goes wrong, my instruction parser should spit out a message that shows the offending line. That should be good enough to go on with.

Debugging

The only way I know of to debug a combinator parser is so-called “printf debugging.” To get you started, I’ve provided two functions in asmparse.sml:

I hope these two are enough. If you need more, the next step is to wrap a parser p in a new parser that prints a message on entry, runs p, scrutinizes p’s result and reports something to standard error, then returns the result:

val veryVerbose : string -> 'a parser -> 'a parser
    = (fn what => fn p =>
         let fun shout s = app eprint ["looking for ", what, s, "\n"]
         in  P.ofFunction (fn ts =>
                              let val _ = shout "..."
                                  val answer = P.asFunction p ts
                                  val _ =
                                      case answer
                                        of NONE => shout ": failed"
                                         | SOME (Error.ERROR _, _) => shout ": errored"
                                         | SOME (Error.OK _, _) => shout ": succeeded"
                              in  answer
                              end)
         end)

Let us hope you don’t need to go there.

Draft