Friday, August 13, 2010

Reading Camlp4, part 10: custom lexers

As a final modification to our running JSON quotation example, I want to repair a problem noted in the first post—that the default lexer does not match the JSON spec—and in doing so demonstrate the use of custom lexers with Camlp4 grammars. We’ll parse UTF8-encoded JSON using the ulex library.

To use a custom lexer, we need to pass a module matching the Lexer signature (in camlp4/Camlp4/Sig.ml) to Camlp4.PreCast.MakeGram. (Recall that we get back an empty grammar which we then extend with parser entries.) Let’s look at the signature and its subsignatures, and our implementation of each:
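
Concretely, the goal is for the grammar to be built from our own lexer module, as we will do at the end of the post:

  module Gram = Camlp4.PreCast.MakeGram(Jq_lexer) 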

Error
  module type Error = sig 
    type t 
    exception E of t 
    val to_string : t -> string 
    val print : Format.formatter -> t -> unit 
  end 

First we have a module for packaging up an exception so it can be handled generically (in particular it may be registered with Camlp4.ErrorHandler for common printing and handling). We have simple exception needs so we give a simple implementation:

  module Error = 
  struct 
    type t = string 
    exception E of string 
    let print = Format.pp_print_string 
    let to_string x = x 
  end 
  let _ = let module M = Camlp4.ErrorHandler.Register(Error) in () 
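
In the lexer we can then signal errors by raising Error.E wrapped with the offending location (this is what the illegal function in the lexing code below does). A minimal sketch, where current_loc is a hypothetical accessor for the lexer’s current location:

  (* sketch: report an illegal character at the current location; 
     current_loc stands in for the real location accessor *) 
  let illegal c = 
    Loc.raise (current_loc c) (Error.E "Illegal character") 
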
Token

Next we have a module defining the tokens our lexer supports:

  module type Token = sig 
    module Loc : Loc 
  
    type t 
  
    val to_string : t -> string 
    val print : Format.formatter -> t -> unit 
    val match_keyword : string -> t -> bool 
    val extract_string : t -> string 
  
    module Filter : ... (* see below *) 
    module Error : Error 
  end 

The type t represents a token. It can be anything we like (in particular it does not need to be a variant with arms KEYWORD, EOI, etc., although that is the conventional representation) so long as we provide the specified functions: to_string converts a token to a string; print writes it to a formatter; match_keyword tests whether a token matches a string keyword (recall that we can use literal strings in grammars; this function is called to check whether the next token matches a literal string); and extract_string returns a token’s string contents (called when you bind a variable to a token in a grammar, e.g. n = NUMBER). Here’s our implementation:

  type token = 
    | KEYWORD  of string 
    | NUMBER   of string 
    | STRING   of string 
    | ANTIQUOT of string * string 
    | EOI 
 
  module Token = 
  struct 
    type t = token 
  
    let to_string t = 
      let sf = Printf.sprintf in 
      match t with 
        | KEYWORD s       -> sf "KEYWORD %S" s 
        | NUMBER s        -> sf "NUMBER %s" s 
        | STRING s        -> sf "STRING \"%s\"" s 
        | ANTIQUOT (n, s) -> sf "ANTIQUOT %s: %S" n s 
        | EOI             -> sf "EOI" 
  
    let print ppf x = Format.pp_print_string ppf (to_string x) 
  
    let match_keyword kwd = 
      function 
        | KEYWORD kwd' when kwd = kwd' -> true 
        | _ -> false 
  
    let extract_string = 
      function 
        | KEYWORD s | NUMBER s | STRING s -> s 
        | tok -> 
            invalid_arg 
              ("Cannot extract a string from this token: " ^ 
                 to_string tok) 
 
    module Loc = Camlp4.PreCast.Loc 
    module Error = Error 
    module Filter = ... (* see below *) 
  end 

Not much to it. KEYWORD covers true, false, null, and punctuation; NUMBER and STRING are JSON numbers and strings; as we saw last time, antiquotations are returned in ANTIQUOT; finally, we signal the end of the input with EOI.
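
For example, lexing the input { "foo": 1 } should produce the following tokens (locations omitted):

  KEYWORD "{"; STRING "foo"; KEYWORD ":"; NUMBER "1"; KEYWORD "}"; EOI 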

Filter
  module Filter : sig 
    type token_filter = 
      (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t 
 
    type t 
 
    val mk : (string -> bool) -> t 
    val define_filter : t -> (token_filter -> token_filter) -> unit 
    val filter : t -> token_filter 
    val keyword_added : t -> string -> bool -> unit 
    val keyword_removed : t -> string -> unit 
  end 

The Filter module provides filters over token streams. We don’t need it for the JSON example, but it’s interesting to see how it is implemented in the default lexer and used in the OCaml parser. The argument to mk is a function indicating whether a string should be treated as a keyword (i.e. whether the literal string is used in the grammar), and the default lexer uses it to filter the token stream, converting identifiers into keywords. If we wanted the JSON parser to be extensible, we would need to take this into account; instead we’ll just stub out the functions (a sketch of a working implementation follows the stub):

  module Filter = 
  struct 
    type token_filter = 
      (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t 
 
    type t = unit 
 
    let mk _ = () 
    let filter _ strm = strm 
    let define_filter _ _ = () 
    let keyword_added _ _ _ = () 
    let keyword_removed _ _ = () 
  end 
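
For comparison, here is a sketch of what a working Filter might look like, loosely following the shape of the default lexer’s implementation: mk stores the keyword predicate, define_filter composes a new filter onto the current one, and filter applies the composition. (In the default lexer, the initial filter uses the predicate to convert identifier tokens into KEYWORDs.) This is a sketch of the general shape, not the default lexer’s actual code:

  module Filter = 
  struct 
    type token_filter = 
      (t * Loc.t) Stream.t -> (t * Loc.t) Stream.t 
 
    (* the filter state: the keyword predicate supplied by the 
       grammar, and the current composed filter *) 
    type filt = { 
      is_kwd : string -> bool; 
      mutable fil : token_filter; 
    } 
    type t = filt 
 
    let mk is_kwd = { is_kwd = is_kwd; fil = (fun strm -> strm) } 
    let filter x strm = x.fil strm 
    let define_filter x f = x.fil <- f x.fil 
    let keyword_added _ _ _ = () 
    let keyword_removed _ _ = () 
  end 
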
Lexer

Finally we have Lexer, which packages up the other modules and provides the actual lexing function. The lexing function takes an initial location and a character stream, and returns a stream of token and location pairs:

module type Lexer = sig 
  module Loc : Loc 
  module Token : Token with module Loc = Loc 
  module Error : Error 
 
  val mk : 
    unit -> 
    (Loc.t -> char Stream.t -> (Token.t * Loc.t) Stream.t) 
end 
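
As a rough sketch of the plumbing (not the actual code), mk builds a lexing context from the initial location and character stream, then wraps the token function below in a stream; mk_context and loc_of are hypothetical stand-ins for the real helpers:

  (* sketch: produce the (token, location) stream required by Lexer.mk; 
     mk_context and loc_of are hypothetical helpers *) 
  let mk () init_loc cs = 
    let c = mk_context init_loc cs in 
    Stream.from 
      (fun _ -> 
         let tok = token c c.lexbuf in 
         Some (tok, loc_of c)) 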

I don’t want to go through the whole lexing function; it is not very interesting. But here is the main loop:

let rec token c = lexer 
  | eof -> EOI 
 
  | newline -> next_line c; token c c.lexbuf 
  | blank+ -> token c c.lexbuf 
 
  | '-'? ['0'-'9']+ ('.' ['0'-'9']* )? 
      (('e'|'E')('+'|'-')?(['0'-'9']+))? -> 
        NUMBER (L.utf8_lexeme c.lexbuf) 
 
  | [ "{}[]:," ] | "null" | "true" | "false" -> 
      KEYWORD (L.utf8_lexeme c.lexbuf) 
 
  | '"' -> 
      set_start_loc c; 
      string c c.lexbuf; 
      STRING (get_stored_string c) 
 
  | "$" -> 
      set_start_loc c; 
      c.enc := Ulexing.Latin1; 
      let aq = antiquot c lexbuf in 
      c.enc := Ulexing.Utf8; 
      aq 
 
  | _ -> illegal c 

The lexer syntax is an extension provided by ulex; the effect is similar to ocamllex. The lexer needs to keep track of the current location and return it along with the token (next_line advances the current location; set_start_loc is for when a token spans multiple ulex lexemes). The lexer also needs to parse antiquotations, taking into account nested quotations within them.
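
As an illustration of the style, here is a sketch of the string rule; the real one also handles escape sequences, and store is the buffer-accumulating helper paired with get_stored_string:

  and string c = lexer 
    | '"' -> ()  (* closing quote: the caller reads the buffer *) 
    | _ -> store c (L.utf8_lexeme c.lexbuf); string c c.lexbuf 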

(I think it is not actually necessary to lex JSON as UTF8. The only place that non-ASCII characters can appear is in a string. To lex a string we just accumulate characters until we see a double-quote, which cannot appear as part of a multibyte character. So it would work just as well to accumulate bytes. I am no Unicode expert though. This example was extracted from the Javascript parser in jslib, where I think UTF8 must be taken into account.)

Hooking up the lexer

There are a handful of changes we need to make to call the custom lexer:

In Jq_parser we make the grammar with the custom lexer module and open it so the token constructors are available; we also replace the INT and FLOAT cases with just NUMBER. For the other cases we used the same token constructor names as the default lexer, so nothing else needs to change.

  open Jq_lexer 
 
  module Gram = Camlp4.PreCast.MakeGram(Jq_lexer) 
 
  ... 
      | n = NUMBER -> Jq_number (float_of_string n) 

In Jq_quotations we have Camlp4.PreCast open (so references to Ast in the <:expr< >> quotations resolve), which means that a bare EOI refers to Camlp4.PreCast.EOI; we want Jq_lexer.EOI, so we need to write it explicitly:

  json_eoi: [[ x = Jq_parser.json; `Jq_lexer.EOI -> x ]]; 

(Recall that the backtick lets us match a constructor directly; for some reason we can’t module-qualify EOI without it.)

That’s it.
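
With the custom lexer hooked up, the quotation works as before but now follows the JSON spec; for example (assuming the json quotation name registered in Jq_quotations in the earlier posts):

  let x = <:json< [ 1, "foo", true ] >> 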

I want to finish off this series next time by covering grammar extension, with an example OCaml syntax extension.

(You can find the complete code for this example here.)
