Thursday, January 22, 2009

Reading Camlp4, part 3: quotations in depth

(I set myself the goal of posting every week, but the latest Skydeck release has kept me busy, and it turned out I didn't understand the following as well as I thought.)

After seeing the examples of Camlp4 quotations in my last post, you may wonder:

  • what are all the quotations (str_item, ctyp, etc.)?
  • what are all the antiquotations (uid, `str, etc.)?
  • which antiquotations are allowed where?
To answer these questions, we're going to look at how quotations are implemented in Camlp4. We'll need to learn a little about Camlp4's extensible parsers, and look at the OCaml parser in Camlp4.

Parsing OCaml

A small complication is that there is more than one concrete syntax for OCaml in Camlp4: the original (i.e. normal OCaml syntax) and revised syntaxes. The original syntax parser is given as an extension of the revised syntax one. So we'll begin in camlp4/Camlp4Parsers/Camlp4OCamlRevisedParser.ml (line 588 in the 3.10.2 source):

    expr:
      [ "top" RIGHTA
        [ (* ... *)
        | "if"; e1 = SELF; "then"; e2 = SELF; "else"; e3 = SELF ->
            <:expr< if $e1$ then $e2$ else $e3$ >>
You can read the parser more or less as a BNF grammar. This code defines a nonterminal expr by giving a bunch of cases. The cases are grouped together into levels, which can be labeled and given an associativity (that's what "top" and NONASSOC are). Levels are used to indicate the precedence of operators, and also to provide hooks into the parser for extending it; for our purpose here you can skip over them.

You can read a case like a pattern match: match the stuff to the left of the arrow, return the stuff to the right. (What's being matched is a stream of tokens from the lexer.) A parser pattern can contain literal strings like "if", backquoted data constructors like `INT (which can carry additional data), nonterminals, and some special keywords like SELF. You can bind variables using ordinary pattern-matching syntax within token literals, and use x = y syntax to bind the result of a call to a nonterminal.

The right side is a piece of AST representing what was parsed, and in most cases it is given as a quotation. This is pretty confusing, because often the left and right sides of a case look very similar, and you can't tell what AST node is produced. However, it gives us lots of examples of tricky quotations, and since we have already seen how to expand quotations we can deal with it. (If you're curious how Camlp4 is written using itself see camlp4/boot.)

Focusing on the if case: the keywords if, then, and else are parsed with an expression after each (at least we know that's the syntax of normal OCaml, and that gives a clue to what SELF means: parse the current nonterminal); the expressions are bound to a variables; then the pieces are put together into an ExIfe AST node.

(Some other special keywords you'll see are OPT, which makes the next item optional, and LIST0/LIST1, which parse a list of items separated by the token after SEP. LIST1 means there must be at least one item.)

OCaml allows you to leave off the else part; where is the code for that? Turns out this is not allowed in revised syntax, and the original syntax overrides this part of the parser. Take a look at camlp4/Camlp4Parsers/Camlp4OCamlParser.ml (line 292):

    expr: LEVEL "top"
      [ [ (* ... *)
        | "if"; e1 = SELF; "then"; e2 = expr LEVEL "top";
          "else"; e3 = expr LEVEL "top" ->
            <:expr< if $e1$ then $e2$ else $e3$ >>
        | "if"; e1 = SELF; "then"; e2 = expr LEVEL "top" ->
            <:expr< if $e1$ then $e2$ else () >>
(Notice how the expr definition is qualified with the level in the revised grammar where it should slot in.)

Quotations and antiquotations

Hopefully that is enough about parsing to muddle through; let's move on to quotations. Here's another piece of the revised parser (line 670)--these are still cases of expr:

  [ `QUOTATION x -> Quotation.expand _loc x Quotation.DynAst.expr_tag
The `QUOTATION token contains a record including the body of the quotation and the tag. The record is passed off to the Quotation module to be expanded. The actual expansion happens in camlp4/Camlp4Parsers/Camlp4QuotationCommon.ml. Looking to the bottom of that file, there are several lines like:
  add_quotation "sig_item" sig_item_quot ME.meta_sig_item MP.meta_sig_item;
This installs a quotation expander for the sig_item tag. The expander parses the quotation starting at the sig_item_quot nonterminal in the parser, then runs the result through the antiquotation expander (see below). (The last two arguments to add_quotation have to do with the context where a quotation appears: inside a pattern you get PaFoo nodes while inside an expression you get ExBar nodes.) So we can answer one of the questions posed at the beginning: what are all the quotation tags? We can see here that there is a quotation for each type in camlp4/Camlp4/Camlp4Ast.partial.ml.

Now let's look at antiquotations, which are more complicated (line 671):

        | `ANTIQUOT ("exp"|""|"anti" as n) s ->
            <:expr< $anti:mk_anti ~c:"expr" n s$ >>
The `ANTIQUOT token contains the tag and the body again (and the parser can choose a case based on the tag). The anti antiquotation creates a special AST node to hold the body of the antiquotation; each type in the AST has a constructor (ExAnt, TyAnt, etc.) for this purpose. The mk_anti function adds another tag, which is not always the same as the one we parsed; the ~c argument adds a suffix giving the context where the antiquotation appeared.

There are two places where antiquotations are interpreted. First, in camlp4/Camlp4Parsers/Camlp4QuotationCommon.ml (line 89):

            [ "`int" -> <:expr< string_of_int $e$ >>
This is one of a bunch of cases in a map over the syntax tree. It handles antiquotations like <:expr< $`int:5$ >>, which turns into an ExInt. You can also see cases here for the anti antiquotations, and some things to do with list antiquotations we haven't seen yet (more on this below).

Things that don't match these cases are handled when the AST is pretty-printed. Let's look at camlp4/Camlp4/Printers/OCaml.ml (line 510):

    | <:expr< $int:s$ >> -> o#numeric f s ""
This case handles antiquotations like <:expr< $int:"5"$ >>. Again, this produces an ExInt, but you give it a string instead of an int.

What we have learned

Teaching a person to fish is fine, unless that person starves while trying to finish their PhD in theoretical pescatology. But I hope that you can see how we might go about answering the remaining questions--what are all the antiquotations, and where are they allowed--by examining all the `ANTIQUOT cases in the parser and puzzling out where they get expanded.

Let's look at a particular example, by way of addressing the comment Nicolas Pouillard (aka Ertai) made on the last post. He points out that the final McOr in of_string can go outside the antiquotation. How could we learn this from the Camlp4 code? Let's find where the antiquotation is expanded, starting at the point where the function keyword is parsed (Camlp4OCamlParser.ml line 299):

  | "function"; a = match_case ->
      <:expr< fun [ $a$ ] >>
(the right side is revised syntax) which uses match_case (line 350):
    match_case:
      [ [ OPT "|"; l = LIST1 match_case0 SEP "|" -> Ast.mcOr_of_list l ] ]
You might think that match_case0 parses a single case, but let's check (Camlp4OCamlRevisedParser.ml line 778):
    match_case0:
      [ [ `ANTIQUOT ("match_case"|"list" as n) s ->
            <:match_case< $anti:mk_anti ~c:"match_case" n s$ >>
        | `ANTIQUOT (""|"anti" as n) s ->
            <:match_case< $anti:mk_anti ~c:"match_case" n s$ >>
We're interested in the second case for the moment: here's the antiquotation with no tag used in of_string. So the list of cases is returned by match_case0 (as an McAnt with match_case as its tag) and more cases can be parsed following it.

(Now we can see a justification for a puzzling design decision in the AST: instead of collecting match cases in a list, it collects them with McOr nodes. Many arrangements of McOr nodes correspond to the same list of cases. As the above possibility shows, this is useful: an antiquotation can return zero, one, or several match cases, and we don't have to worry about splicing them into the list. On the other hand, it makes consuming the AST a little more complicated.)

We can go one step further: if we use the list antiquotation, the first case in match_case0 returns an antiquotation with tag listmatch_case, and we get the following expansion (Camlp4QuotationCommon.ml line 117):

            | "listmatch_case" -> <:expr< Ast.mcOr_of_list $e$ >>
So our final of_string becomes:
let of_string = function
    $list:
      List.map
        (fun c -> <:match_case< $`str:c$ -> $uid:c$ >>)
        cons$
  | _ -> invalid_arg "bad string"
Can we do something similar with the generation of the variant type? No, as it turns out. In the revised syntax, the arms of a variant are given inside square brackets, so we can say:
type t = [ $list:List.map (fun c -> <:ctyp< $uid:c$ >>) cons$ ]
But in the original syntax, without at least one constructor to make clear that we're defining a variant, there's no context to interpret a list, and this is reflected in the parser, which doesn't allow a list antiquotation there. This kind of problem is apparently why the revised syntax was introduced.

So far I've talked only about generating OCaml code; next time I'll cover how to use Camlp4 to consume OCaml, and build a simple code analysis tool.

No comments:

Post a Comment