-
Notifications
You must be signed in to change notification settings - Fork 18
Quick Parser Overview
The parser is defined in the src/parse/ subdirectory. The entrypoint is a function Parser.parse, which takes a sequence of tokens as input and produces an Ast. The parser largely mirrors the Ast, so we'll first consider the Ast.
The Ast datatype is defined in src/ast/AstType.sml and then opened in src/ast/Ast.sml. It follows the formal SML grammar very closely. (Actually, it might be best to think of this as a concrete syntax tree, because all source tokens---except comments and whitespace---are accounted for in the data type.)
For example, here is the Ast node definition for a tuple of expressions. For an N-ary tuple, there are N elems and N-1 delims. We use this pattern (elems and delims) throughout the Ast.
datatype exp =
...
| Tuple of
{ left: Token.t (* open paren *)
, elems: exp Seq.t (* elements *)
, delims: Token.t Seq.t (* commas between elements *)
, right: Token.t (* close paren *)
}The Ast structure is partitioned into a few substructures to help organize the SML grammar. For example, Ast.Exp.exp is the type of an expression node in the Ast, and Ast.Fun.fundec is the type of a node that declares a functor. All of these substructures are:
- Ast.Ty: types
- Ast.Pat: patterns
- Ast.Exp: expressions and declarations (mutually recursive)
- Ast.Sig: signature expressions, specifications, and declarations (mutually recursive)
- Ast.Str: structure expressions and declarations (mutually recursive)
- Ast.Fun: functor declarations
The parser (defined in the src/parse/ subdirectory) is organized to mirror the Ast type, and therefore is broken up into a number of smaller structures. For example, the structure ParsePat produces Ast nodes of type Ast.Pat.pat, and the structure ParseTy produces nodes of type Ast.Ty.ty.
The parser is mostly just a standard recursive-descent parser. We define a parser type for parser combinators:
type ('state, 'result) parser = 'state -> ('state * 'result)And then each individual parsing function typically has type Token.t Seq.t -> (int, X) parser where X is the result type. For example, the function ParseTy.ty has the type Token.t Seq.t -> (int, Ast.Ty.ty) parser.
The integer in the parser type encodes the state of the parser, which is just its position in the token sequence. This enables a particular mode of use which is maybe best exemplified here, where we parse an if-then-else expression:
val (i, exp1) = consume_exp infdict Restriction.None i
val (i, thenn) = parse_reserved Token.Then i
val (i, exp2) = consume_exp infdict Restriction.None i
val (i, elsee) = parse_reserved Token.Else i
val (i, exp3) = consume_exp infdict Restriction.None i
val result = Ast.Exp.IfThenElse
{ iff = iff
, exp1 = exp1
, thenn = thenn
, exp2 = exp2
, elsee = elsee
, exp3 = exp3
}The term consume_exp infdict Restriction.None has type (int, Ast.Exp.exp) parser and the variable i is the current position of the parser. (The other arguments are just some additional details needed, e.g., an infix precedence lookup table.) We advance the parser by passing i as argument, and receive a new position and an Ast.Exp.exp node in return.
Almost all of the parser has this flavor, of passing around the current position i and advancing by recursive descent. The parser is purely functional, and does not have any implicit state.
The nice error messages come from functions like ParserUtils.error, which raises an exception describing the error. Here’s an example which informs the user about the record punning SuccessorML syntax:
ParserUtils.error
{ pos = Token.getSource lab
, what = "Incomplete row in record expression."
, explain = SOME
"In Standard ML, each element of a record expression must \
\look like: `<label> = <expression>`. Note that SuccessorML \
\allows for \"record punning\", where (for example) the syntax \
\`{x, y, z = foo}` is shorthand for `{x = x, y = y, z = foo}`. \
\To enable this feature, use the command-line argument \
\'-allow-record-pun-exps true'."
}These exceptions are caught at the top-level and handled by outputting a nice message: