Sunday, December 28, 2008

Reading Camlp4, part 1: the OCaml AST

Camlp4 is one of the best things about OCaml, but because it isn't well documented (particularly the revised version released with OCaml 3.10 and later), it isn't used as widely as it might be. I use it in my orpc and jslib (part of ocamljs) projects, and to do so I've had to learn it the hard way, by reading the source code and experimenting. (Of course I have also found the Camlp4 wiki, and the documentation and tutorials covering the old version, useful.) This is the first of a projected series of posts on what I've learned, so that others can pick it up faster and so that I don't forget it myself.

A good place to start is the datatypes representing OCaml abstract syntax trees. Values of these datatypes are produced by parsing, manipulated by syntax extensions, and converted to concrete syntax by pretty-printing. The easiest way to understand ASTs is to get Camlp4 to show you the AST for a piece of OCaml code. Put some code in a file test.ml:

let q = <:str_item< let f x = x >>
then expand it with camlp4of test.ml -printer o. The << >> syntax introduces a Camlp4 quotation, which takes OCaml concrete syntax and replaces it with the corresponding AST values. The :str_item part says which of the mutually-recursive AST types we want to produce; a str_item is something that can appear at the top level of a module body. The invocation camlp4of runs Camlp4 with the quotation module loaded, among others, and the option -printer o means pretty-print the result using standard (or "original") OCaml syntax. The result is:
let q =
  Ast.StSem (_loc,
    Ast.StVal (_loc, Ast.BFalse,
      Ast.BiEq (_loc,
        Ast.PaId (_loc, Ast.IdLid (_loc, "f")),
        Ast.ExFun (_loc,
          Ast.McArr (_loc,
            Ast.PaId (_loc,
              Ast.IdLid (_loc, "x")),
            Ast.ExNil _loc,
            Ast.ExId (_loc,
              Ast.IdLid (_loc, "x")))))),
    Ast.StNil _loc)
Some things to notice here:

The expanded value is given by constructors from an Ast module. If you open Camlp4.PreCast at the top of the file (and give the flags -I +camlp4 to ocamlc, or -package camlp4 to ocamlfind ocamlc) you'll get an appropriate Ast module. The full datatype is defined in camlp4/Camlp4/Camlp4Ast.partial.ml in the OCaml source (this file, as with most of Camlp4, is written in the revised syntax).

All the constructors take a _loc argument, which is a value of type Loc.t (when using Camlp4.PreCast this is Camlp4.Struct.Loc). When a source file is parsed in the normal way (not via a quotation), this argument is the location in the source file where a piece of AST came from. When you're generating code with quotations you have to supply a location yourself, by binding a value to the name _loc before the quotation. For now you can use Loc.ghost, which is just a dummy.

The StSem and StNil constructors are used to collect parts of a module. When you parse ordinary code you'll always get a list: a chain of StSem nodes, each with some actual structure item in the first position (after the location) and the tail of the list in the second, terminated by an StNil. If you generate code with nested lists, they'll be flattened out when the AST is pretty-printed. This is convenient when you're building up a complicated quotation, because you don't have to worry about flattening it yourself.
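
To make the flattening idea concrete, here is a minimal sketch in plain OCaml. It does not use Camlp4 at all: the str_item type below is a cut-down stand-in invented for illustration (no locations, and StVal carries only a name), and flatten and names are hypothetical helpers, not Camlp4 functions. It only shows why a list-shaped chain and an arbitrarily nested tree can be treated as the same sequence of items.

```ocaml
(* Toy stand-in for Camlp4's str_item; NOT the real Ast type.
   Locations are dropped and StVal carries only a name. *)
type str_item =
  | StNil
  | StSem of str_item * str_item
  | StVal of string

(* Collect the actual items in order, ignoring how the StSem nodes nest.
   The accumulator holds the items that come after the current node. *)
let rec flatten item acc =
  match item with
  | StNil -> acc
  | StSem (a, b) -> flatten a (flatten b acc)
  | other -> other :: acc

(* Names of the items, in order. *)
let names item =
  List.map (function StVal n -> n | _ -> "?") (flatten item [])

(* A list-shaped chain, the shape described above for parsed code. *)
let chain = StSem (StVal "f", StSem (StVal "g", StNil))

(* A nested shape, as generated code might build it. *)
let nested = StSem (StSem (StVal "f", StNil), StSem (StNil, StVal "g"))

(* Both shapes denote the same sequence of items. *)
let () = assert (names chain = names nested)
```

Both chain and nested yield the items f and g in that order, which is the sense in which the nesting doesn't matter.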

Let's try another one:

let q = <:ctyp< ('a, 'b) foo >>
let r = <:ctyp< 'a 'b foo >>
(here a ctyp is a type expression) which results in
let q =
  Ast.TyApp (_loc,
    Ast.TyApp (_loc,
      Ast.TyId (_loc, Ast.IdLid (_loc, "foo")),
      Ast.TyQuo (_loc, "a")),
    Ast.TyQuo (_loc, "b"))

let r =
  Ast.TyApp (_loc,
    Ast.TyId (_loc, Ast.IdLid (_loc, "foo")),
    Ast.TyApp (_loc, Ast.TyQuo (_loc, "b"),
      Ast.TyQuo (_loc, "a")))
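This is a little strange: we have two constructs in the concrete syntax that both turn into TyApp nodes, one associated to the left and one to the right. In q, foo is applied to 'a and the result is applied to 'b; in r, foo is applied once to a TyApp that pairs up the two type variables.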
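
The difference in nesting is easier to see with the locations stripped away. The following is a sketch in plain OCaml using a toy ctyp type with just three constructors. It is a stand-in for illustration, not the real Ast type, and show is a hypothetical helper that prints every application with explicit parentheses.

```ocaml
(* Toy stand-in for a few ctyp constructors; NOT the real Ast type. *)
type ctyp =
  | TyId of string        (* a type constructor name such as foo *)
  | TyQuo of string       (* a type variable such as 'a *)
  | TyApp of ctyp * ctyp  (* application *)

(* Shape of q above: the applications nest to the left. *)
let q = TyApp (TyApp (TyId "foo", TyQuo "a"), TyQuo "b")

(* Shape of r above: the arguments nest to the right. *)
let r = TyApp (TyId "foo", TyApp (TyQuo "b", TyQuo "a"))

(* Render each TyApp with explicit parentheses so the nesting is visible. *)
let rec show = function
  | TyId s -> s
  | TyQuo s -> "'" ^ s
  | TyApp (a, b) -> "(" ^ show a ^ " " ^ show b ^ ")"

let () =
  print_endline (show q);
  print_endline (show r)
```

Code that pattern-matches on TyApp has to account for both shapes, since nothing in the constructor itself says which one it is looking at.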

I put in this example (and the StSem/StNil business) to point out that the AST type definition doesn't tell the whole story about how ASTs are put together. Although most of the time you can use quotations, sometimes you need to work directly with the AST: when debugging, when building a very dynamic AST, or when you can't figure out the right quotation mumbo-jumbo. No worries though; it's easy to get Camlp4 to tell you the right AST for a piece of concrete syntax.

Next post I'll cover quotations for OCaml syntax in more detail, and build a simple code generator.