Overview
Compilers and interpreters take as input programs in string form. Lexing and
parsing are the first two steps towards converting this string input into a
program in the language that can then be executed or interpreted.
Lexing is the process that takes the entire string input and breaks it into
smaller chunks, or tokens. Tokens are often separated by spaces, newlines,
operators, and other characters like ;, (, and }.
Once a string has been tokenized into a sequence of tokens, the parsing
process takes the sequence and groups tokens together according to the rules
defined by a grammar.
Implementations of lexers and parsers for different languages have a lot
similarities. To avoid duplicating effort, tools called lexer-generators
and parser-generators allow the compiler writer to focus on the interesting
details of his/her language -- namely, the syntax and semantics --
while taking care of the boilerplate code (think fold).
Used for writing compilers and interpreters in OCaml, ocamllex is
a lexer-generator and ocamlyacc a parser-generator.
The compiler writer implements two specification files in particular formats
that ocamllex and ocamlyacc process, and they generate pure
OCaml code that can be executed to lex and parse strings.
We will look at an example lexer and parser for a simple language of
binary expressions and assignments. Run ./build.sh to build
everything.
beParser.mly
In this file we define the different kinds of tokens for our language, the
grammar rules that define legal statements in our language, and how to process
the stream of tokens when groups of them match a grammar rule.
ocamlyacc will read this file and then generate an OCaml file called
beParser.ml, which will be the complete parser for our language.
The code between %{ and %} gets copied verbatim into
beParser.ml, so it is pure OCaml code. In this case, we just open
up the BeAst module that defines the abstract syntax for our language
of binary expressions and assignments.
Each %token line defines a kind of token. Note that multiple tokens
can be defined on the same line, and you can define a token to carry along with
it data of a particular type using <...>. Take a look at
beParser.ml to see the OCaml datatype that corresponds to these
token definitions.
The %start and %type lines define which of the rules (defined
later in the file) should be used as the top-most rule when trying to
parse an input string. The remainder of the file after the %% defines
the grammar rules. For each rule in the grammar, there is a corresponding
{ and }. What goes inside there is the OCaml "expression" that
gets returned when the particular rule is matched. It is normal OCaml code, with
the exception of things like $1 and $2, which correspond to
the values returned by the first and second subparts of the rule. For example,
when some sequence of tokens matches the fourth atom rule, the
expression matched by expr between two parentheses is referred to
as $2.
For more complete documentation about ocamlyacc, see
http://plus.kaist.ac.kr/~shoh/ocaml/ocamllex-ocamlyacc/ocamlyacc-tutorial/.
beLexer.mll
In this file we define how characters in the input string should be broken
up into the tokens defined in beParser.mly. The code within
the { and } at the top is pure OCaml code that gets copied
verbatim into a file called beLexer.ml. Notice that we open
the BeParser module (which will be generated by ocamlyacc
so that the token datatype is in scope).
The rest of the file defines how to convert ASCII characters into tokens.
We choose the name token for our definition, but we very well
could have named in anything. Whatever name we choose, however, we will
be the entrypoint to the lexer; look at the file beLexer.ml and look
for a function called token.
The syntax in this section of the specification should be readable because it
is very close to pure OCaml. The input string is matched against a series of
string literals and regular expressions. As soon as one of the patterns matches
the input string, the OCaml code within the corresponding { and }
is added to the list of tokens, and the token function is recursively
invoked on what remains of the input string. Note that in the first rule,
lexbuf is the implicit name of the input string, so token
lexbuf is a recursive call to the lexer that does not add any tokens
to the list whenever any whitespace is matched in the input string.
For more complete documentation about ocamllex, see
http://plus.kaist.ac.kr/~shoh/ocaml/ocamllex-ocamlyacc/ocamllex-tutorial/.
Putting it all together
After running ocamllex beLexer.ml and ocamlyacc beParser.mly,
the modules BeLexer and BeParser can now be used to lex
and parse strings and interpret them as statements in our little language.
The parse_string function in beMain.ml shows how to
use the generated lexer and parser to process a string.
Run ./be.top, which is an OCaml top-level that already has loaded
all the modules we need (look at build.sh to see how be.top
is created). We can now process strings in the language:
% ./be.top
Objective Caml version 3.10.0
# BeMain.token_list_of_string "true || false";;
- : BeParser.token list = [BeParser.TRUE; BeParser.OR; BeParser.FALSE]
# BeMain.token_list_of_string "(look ma no hands)";;
- : BeParser.token list =
[BeParser.LPAREN; BeParser.Id "look"; BeParser.Id "ma"; BeParser.Id "no"; BeParser.Id "hands"; BeParser.RPAREN]
# BeMain.parse_string "(true || false) && x";;
- : BeAst.bexpr = BeAst.And (BeAst.Or (BeAst.Const true, BeAst.Const false), BeAst.Var "x")
# BeMain.parse_string " true || (false && x)";;
- : BeAst.bexpr = BeAst.Or (BeAst.Const true, BeAst.And (BeAst.Const false, BeAst.Var "x"))
# BeMain.parse_string "look ma no hands";;
Exception: Parsing.Parse_error.