If you’re reading this handout, you’ve seen one of the two videos on what parsing combinators can do for us, and you’re ready to learn how they work. The handout starts with the get
functions used in the SVM loader, and from there it works up to something much more flexible and powerful.
Replicating our token parsers in a functional setting
The get
functions from the tokens.h
file in module 2 have no conditional logic: a get
function either succeeds or halts the program with an error. And when it succeeds, a get
function consumes a token from a list as a side effect.
To introduce combinator parsing, I’m first going to write the same functions in ML without side effects: when a get
function consumes a token, it returns the value gotten, paired with a list containing the remaining unconsumed tokens.
datatype loader_token = TNAME of string | TU32 of Int32.int | TDOUBLE of real
fun get_int (TU32 n :: tokens) = (n, tokens)
| get_int _ = error "expected an int"
fun get_name (TNAME x :: tokens) = (x, tokens)
| get_name _ = error "expected a name"
Functions get_int
and get_name
have these types:
val get_int : loader_token list -> Int32.int * loader_token list
val get_name : loader_token list -> string * loader_token list
Functional parsing with the error monad
I’d rather a syntax error didn’t crash my assembler. To avoid the possibility of a crash, new version get_int'
and get_int'
return a result in the error monad. They still return the list of unconsumed tokens, but now that list is paired with a value in the error monad.
datatype 'a error = OK of 'a | ERROR of string
val get_int' : loader_token list -> Int32.int error * loader_token list
val get_name' : loader_token list -> string error * loader_token list
These functions return results that use the error
monad, so if an error occurs, the caller can see ERROR
, and the program doesn’t crash.
These types are starting to get complicated. I’m going to simplify them by defining a type synonym (also called “type abbreviation”):
type 'a proto_producer = loader_token list -> 'a error * loader_token list
val get_int' : int proto_producer
val get_name' : string proto_producer
A value of type 'a proto_producer
is function that takes tokens and will, if successful, produce a value of type 'a
(along with leftover tokens).
The implementations don’t change much:
fun get_int' (TU32 n :: tokens) = (OK n, tokens)
| get_int' tokens = (ERROR "expected an int", tokens)
fun get_name' (TNAME x :: tokens) = (OK x, tokens)
| get_name' tokens = (ERROR "expected a name", tokens)
Pattern matching on token lists doesn’t scale
When I want to start parsing more complicated things, the computation I care about is going to start to be obscured by boilerplate code. As an example, here’s an approximation of get_literal
:
datatype value = NUM of real | BOOL of bool | NIL | EMPTYLIST | STRING of string
fun get_literal (TU32 n :: tokens) = (OK (NUM (real n)), tokens)
| get_literal (TDOUBLE x :: tokens) = (OK (NUM x), tokens)
| get_literal (TNAME "true" :: tokens) = (OK (BOOL true), tokens)
| get_literal (TNAME "string" :: TU32 count :: tokens) =
let val bytes = map something (take count tokens)
in (OK (STRING (implode bytes)), drop count tokens)
end
| get_literal tokens = (ERROR "not a literal", tokens)
Its type is OK, written in either of these two ways:
val get_literal : value proto_producer
val get_literal : loader_token list -> value error * loader_token list
But look at the "string"
case. I am pattern matching on two tokens at the head of the list, not just one. Although the code isn’t terrible, you should have a bad feeling about this:
The code looks repetitive. Fragments like
(⋯ :: tokens)
and(OK ⋯, tokens)
are everywhere.As things get complicated, the “remaining tokens” list is annoying to manipulate and easy to get wrong.
Pattern matching deep into a list of tokens gets hard to read, and it doesn’t allow for any abstraction. For example, it’s not possible to write a pattern that says “match any literal.”
It looks like the only way to make a choice is by choosing between clauses of a single function definition. This idiom also disallows abstraction, and it does not allow nested choices.
To fix these issues, we’ll start using some of the operators in the videos:
To parse things in sequence, we’ll write
p1 >> p2
, which means, “parsep1
and then parsep2
.”To make choices, instead of pattern matching on tokens, we’ll write
p <|> p'
, which means, “parsep
, and if that doesn’t work, tryp'
instead.”
Parsing with failure and higher-order functions
The “proto-producer” above can say only two things: “I recognize this input” and “this input is wrong.” A full-fledged “producer” can say a third thing: “I don’t recognize this input.” The type of a producer is
type 'a producer = input list -> ('a error * input list) option
where input
can be a token (for a parser) or a character (for a lexer) or even something like an S-expression.1 That result type can have three forms:
To say I recognize this input, a producer consumes the part of the input that it recognizes, and it returns a value of the form
SOME (OK a, remaining_inputs)
This result is usually called success.
To say this input is wrong, a producer consumes the wrong input, and it returns a value of the form
SOME (ERROR message, remaining_inputs)
This result is usually called error.
To say I don’t recognize this input, a producer returns
NONE
This result is usually called failure.
A parser is a producer, and parsing begins with “atomic” parsers—one true producer for each species of token:
val int : Int32.int producer
val name : string producer
fun int (TU32 n :: tokens) = SOME (OK n, tokens)
| int _ = NONE
fun name (TNAME x :: tokens) = SOME (OK x, tokens)
| name _ = NONE
True producers easily support both sequence and choice operations. A sequence of two parsers produces whatever the second parser produces:
val >> : 'a producer * 'b producer -> 'b producer
A sequence p1 >> p2
succeeds only if both p1
and p2
succeed.
When a parser chooses between two alternatives, both alternatives have to produce the same type of value:
val <|> : 'a producer * 'a producer -> 'a producer
A choice p <|> p'
can succeed in two ways: p
succeeds, or p
fails and p'
succeeds. If p
errors, then p'
is never used.
Combinators >>
and <|>
have sane definitions; you’ll write algebraic laws for them in lab. And they justify the phrase “parsing combinators”; a combinator is a function that builds other functions.
Example parsers and parsing combinators
Parsers are built from atomic parsers using combinators. Here are a couple of base cases:
val eos : unit producer (* recognizes end of stream *)
val one : input producer (* takes one input token, if present *)
fun eos [] = SOME (E.OK (), [])
| eos (_ :: _) = NONE
fun one [] = NONE
| one (y :: ys) = SOME (E.OK y, ys)
Parsers support a form of “map”: a function is applied to the result of a successful parse.
val <$> : ('a -> 'b) * 'a producer -> 'b producer
Because <$>
is usually defined infix
, it’s not Curried; It’s the same <$>
as you see in the error monad (with type error
instead of producer), and it obeys the same laws. Because the forms of producer
are more complicated than the forms of error
, so are the laws:
(f <$> p) ts = NONE
, whenp ts = NONE
(f <$> p) ts = SOME (ERROR msg, ts')
, whenp input = SOME (ERROR msg, ts')
(f <$> p) ts = SOME (OK (f a), ts')
, whenp input = SOME (OK a, ts')
In the more general case, we have a parser that produces a function, a parser that produces an argument, and we apply one to the other:
val <*> : ('a -> 'b) producer * 'a producer -> 'b producer
This one is also the same idea as in the error monad, but because there are nine cases, the laws (and the code) are unpleasant. But the laws for the baby error monad will give you the right idea.
Here’s a weird trivial one: a parser that reads nothing and always succeeds:
val succeed : 'a -> 'a producer
fun succeed a = (fn ts => SOME (OK a, ts))
The <$>
combinator can be viewed as an abbreviation:
f <$> p = succeed f <*> p
Parser sat f p
recognizes a subset of what p
recognizes: only those values that satisfy f
:
val sat : ('a -> bool) -> 'a producer -> 'a producer
A maybe
parser is built using a function that may transform a value’s type:
val maybe : ('a -> 'b option) -> 'a producer -> 'b producer
maybe f p = valOf <$> sat isSome p
Abstract parsers and recursive parsers
The type abbreviation for 'a producer
makes it easy to understand what producers do and how to write one, but if you get anything wrong when you’re building a parser, the type-error messages come from left field. To solve this problem, I’ve defined a parser interface in which the types of the input and the producer are abstract:
type input
type 'a producer
The representation is exposed via two functions:
val asFunction : 'a producer -> input list -> ('a error * input list) option
val ofFunction : (input list -> ('a error * input list) option) -> 'a producer
Although the type checker doesn’t know it, these functions are both the identity function, and if you compile with MLton, they cost nothing at run time. But you are unlikely to need them; it’s more common to use produce
:
val produce : 'a producer -> input list -> 'a error
(* consumes the entire list to produce a single 'a, or errors *)
The other difficulty is with recursion: if a parser isn’t a function, how are we to define one that’s recursive? The answer is to use a fixed-point combinator:
val fix : ('a producer -> 'a producer) -> 'a producer
(* fix g = g (fix g) *)
A recursive parser is defined like this
val arithmetic = fix (fn exp =>
int
<|> curry op + <$> exp <~> sat (curry op = "+") name <*> exp)
The <~>
says, in effect, “parse input and ignore the result”; parser p1 <~> p2
recognizes the same inputs as p1 >> p2
, but instead of producing the result of p2
, it produces the result of p1
:
val <~> : 'a producer * 'b producer -> 'a producer
val >> : 'a producer * 'b producer -> 'b producer
A tiny case study
As a realistic example, here is the expression parser for vScheme. Since expressions may have subexpressions, the parser is recursive and is written with the fix
operator.
val exp = P.fix (fn exp =>
let fun pair x y = (x, y)
val binding = oflist (pair <$> name <*> exp)
val bindings = oflist (many binding)
in bracket "set" (curry S.SET <$> name <*> exp)
<|> bracket "if" (curry3 S.IFX <$> exp <*> exp <*> exp)
<|> bracket "while" (curry S.WHILEX <$> exp <*> exp)
<|> bracket "begin" ( S.BEGIN <$> many exp)
<|> bracket "let" (letx S.LET <$> bindings <*> exp)
<|> bracket "let*" (letstar <$> bindings <*> exp)
<|> bracket "letrec" (letx S.LETREC <$> bindings <*> exp)
<|> bracket "quote" ( (S.LITERAL o sexp) <$> one)
<|> bracket "lambda" (curry S.LAMBDA <$> oflist (many name) <*> exp)
<|> oflist eos >> P.perror "empty list as Scheme expression"
<|> S.LITERAL <$> ( kw "#t" >> P.succeed (S.BOOLV true)
<|> kw "#f" >> P.succeed (S.BOOLV false)
<|> S.BOOLV <$> bool
<|> S.NUM <$> int
<|> (S.SYM o Real.toString) <$> sxreal
)
<|> S.VAR <$> name
<|> oflist (curry S.APPLY <$> exp <*> many exp)
end)
The bracket
and oflist
combinators are functions I defined myself. The oflist
parser is very strange: it unpacks a single S-expression into a sequence of S-expressions, then parses the sequence. It’s used for every Scheme form that is written as a list of S-expressions. The bracket
parser looks for a keyword in brackets; it works more or less like the thing in my introductory video:
fun bracket word parser =
oflist (kw word >> (parser <~> eos <|> perror ("Bad " ^ word ^ " form")))
Error handling
For now, don’t worry about error outcomes. Just write parsers that recognize the syntax you’ve defined. In testing, there will come a time when your parser says “input not recognized” and you don’t know why not. To improve the error message, you can use a new combinator, perror
, which always causes an error. It’s useful because you can pass an informative message.
Most likely you want to define new combinators that use perror
. Here’s one, commitTo
, which says that I commit to finding a particular input:
fun commitTo what p = p <|> perror ("I was expecting " ^ what)
Lots more is possible; for example, here’s one that shows the next token, if present:
fun expected what =
let fun bad t = Error.ERROR ("expected " ^ what ^
", but instead found this token: " ^ showToken t)
in check ( bad <$> one
<|> perror ("expected " ^ what ^ ", but there was no more input")
)
end
or all the tokens:
fun expected what =
let fun bads ts = Error.ERROR ("looking for " ^ what ^
", got this input: " ^ showTokens ts)
in check ( bads <$> many one )
end
A definition of showTokens
appears in file asmparse.sml
.