rixed/ocaml-parsercombinator

Resumable Parser Combinators with error detection

1. What is this?

This document describes the implementation of a library performing text parsing according to a technique called parser combinators, because the parser for a given grammar is built by combining simpler parsers. Although old and slow, this technique remains fashionable because it is simple and permissive.

This is also an example of literate programming; in case the notations used in this document are unclear you can read about them in this document.

2. Requirements

2.1. An N+1th parser to rule them all

Over the years I’ve often written small dedicated parsing libraries, either using the lex+yacc approach for well behaved syntaxes or custom made combinatoric parsers, because they are fun to play with and more flexible. My needs ranged from parsing configuration files, parsing programming languages, parsing natural languages, parsing networking protocols…​

This parser is meant to be flexible enough to accommodate all those needs, when speed is not an issue.

2.2. Parsing binary

Although most of the time one wants to parse text, there are cases when the need arises to parse binary messages (such as when parsing some networking protocols). Therefore parsed input tokens must not be bound to characters.

2.3. Parsing recursively

I want to parse as much as possible from perl and other such footgun languages (without executing them). Perl has eval on strings. Therefore parsing must not stop as soon as we reach a string literal but instead recurse and try to parse that string. It must therefore be possible to write such a recursive set of rules.

2.4. Parsing with gaps

For the same reason I want to be able to parse as much as possible of an input even if some parts of it are (still) missing. This is useful both to parse string fragments into code and to parse a networking stream despite missing packets.

2.5. Best effort

It will not always be possible to know for sure the type of all symbols. In that case we want to present all possible alternatives as the result, rather than stopping the parse or selecting the most probable one.

2.6. Resumable parsers

The parser must be usable for parsing large messages either received from the network or stored on a file, without having to buffer everything. Input tokens must therefore be consumed as they arrive, and the parser must "return" when it has processed the given input as much as possible, and be resumable later on when more data is available.

For this, call_cc would help tremendously to freeze the state of affairs and return it to the original caller. Without that facility, though, alternatives do exist:

A conventional method to achieve something similar is to turn our API (aka our types) inside out: instead of taking a stream of input tokens and returning a result, parsers could take the next item and return, along with each result, a next-step parser to apply to the next token. This is rather inconvenient, though.

Alternatively, threading is another way to freeze one’s stack: a parser waiting for input could just be waiting on a condition variable, or the input stream could be a mailbox-style queue, etc. All this amounts to relying on the operating system to freeze the parser and resume it later.

We might want to use those parsers in an environment where using POSIX threads is not an option, though (such as: in a microkernel). In that case lightweight threads (akin to GNU Pth) could work, but that’s still a huge constraint on the environment. Instead, closure-based threads, a variety of cooperative threads where a thread context is saved in a function closure (and that are also sometimes referred to as "lightweight threads"), are cheap and do not depend on OS support; for this reason they are available in many shapes and forms. The downside is that the API needs to accommodate them (basically, each parser must return a future value rather than a value). But once this abstraction is in place one is free to use it for any variant of threads, or to bypass it entirely. That’s thus what this library will be using, at the cost of a bit of functorization.

2.7. Error detection

Error detection must be an option. The simplest method to reach this goal would be to have an 'error budget' and try different outcomes at every step, looking for the combination of alternatives that parses the most of the input stream.

It is hard to make this technique less awfully expensive than it sounds.

Setting this error budget to zero would effectively turn off the feature.

Instead of a global error budget, one may prefer to have a maximum error rate.

Many of those ideas come from Fast, Error Correcting Parser Combinators: A Short Tutorial, a 20th-century paper.

2.8. Error reporting

Although this is related to error detection, error reporting is a requirement on its own. Indeed, a parser could do a good job at correcting typos but still be unable to report a meaningful error message when it fails to process the input tokens.

2.9. Parser Combinators

Natural languages (and many dynamically typed programming languages too) do not bow to any rigid formal grammar. Parser combinators are appealing because they make it possible to add new valid constructs without rethinking the whole grammar, thus permitting one to build a good enough parser iteratively.

3. The type of the parser

To be clearer, let’s write down the usual type of a parser used with combinators, written in ML:

Typical parser type
type ('a, 'b) parser = 'a list -> ('b * 'a list) list

Which reads like this: calling α ('a) the type of the input tokens and β ('b) the type of the result, a parser is a function that takes a list of α and returns a list of pairs, each composed of a β and a list of α: the list of all possible solutions, each made of the result of parsing and the list of tokens that remain to be parsed.

Ideally, a successful top level parser will thus return a list composed of a single pair (non ambiguity of the outcome) made of the final result and an empty list (the whole input has been consumed).
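As a toy illustration of this classic type (outside the library, with char tokens), the smallest possible such parser recognizes a single given character:

```ocaml
(* Toy illustration of the classic combinator type above; this is NOT
   the library's final type, just the 'a list -> ('b * 'a list) list shape. *)
type ('a, 'b) parser = 'a list -> ('b * 'a list) list

(* A parser that consumes a single 'a' character: *)
let char_a : (char, char) parser = function
  | 'a' :: rest -> [ 'a', rest ]
  | _ -> []

let () =
  (* One solution, paired with the tokens that remain to be parsed: *)
  assert (char_a [ 'a' ; 'b' ] = [ 'a', [ 'b' ] ]) ;
  (* Failure is the empty list of solutions: *)
  assert (char_a [ 'x' ] = [])
```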

For resumable parsing in a possibly threaded context we need to introduce the α ct type (for a future value of type α), and make the input stream of tokens a possibly blocking function returning the next item, also turning our Parsers module into a functor depending on what mechanism we plan to use for threading between parsers and token intake.

Having now a functor and an abstract stream raises the question of whether to keep the type for tokens (α) universal or rather make it existential (an abstract but single type). Keeping it universal makes it easier to combine parsers (especially: we can devise combinators that feed a parser with something other than tokens, such as the result of another parser). But on the downside it would force the user to hand us a stream type that can handle any type, which is a strong constraint to bear; for one, it prevents the stream implementation from looking at the actual values. This would prove too inconvenient given we will enrich the stream with positions. We thus change the type of the stream to return an existential rather than universal type:

Parsers.ml: functor to parametrize over the threading mechanism
open Batteries

(* ...Parsers helpers... *)

module type CONFIG = sig
  type token
  (* ...Parser configuration... *)
end

module type S = sig
  include CONFIG

  (* ...Parser signature... *)
end

module Make (Conf : CONFIG) :
  S with type token = Conf.token
  (* ...parser public type constraints... *) =
struct
  include Conf

  (* ...Parser library... *)
end

Due to functorization we have to explicitly provide a signature for the result of Make so that we can use the resulting parser as input of further functors.

The configuration must thus provide not only the actual type for frozen computations (aka future values) but also a way to wrap a value into such a future value and a way to pipe one thread into another; these two operations are typically called return and bind, but here they are prefixed with ct_ because we reserve those names for the parser combinators:

threading types
type 'a ct
val ct_return : 'a -> 'a ct
val ct_bind : 'a ct -> ('a -> 'b ct) -> 'b ct
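As a sketch (the module names are mine, not the library’s), here are two possible instantiations of this triple: the identity monad used when no threading is needed, and a trivial delayed-computation monad standing in for a real future type:

```ocaml
(* Identity "threading": a future value is just the value itself. *)
module IdentityCt = struct
  type 'a ct = 'a
  let ct_return x = x
  let ct_bind x f = f x
end

(* Delayed computations: a future value is a thunk. *)
module LazyCt = struct
  type 'a ct = unit -> 'a
  let ct_return x = fun () -> x
  let ct_bind x f = fun () -> f (x ()) ()
end

let () =
  let open LazyCt in
  let fut = ct_bind (ct_return 21) (fun n -> ct_return (n * 2)) in
  (* Nothing runs until the thunk is forced: *)
  assert (fut () = 42)
```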

Thus the possibly blocking input mechanism:

Parser configuration with possible threading
(* ...threading types... *)
type stream
val take : stream -> (token ParsersMisc.stream_item * stream) ct

With the stream_item type being like an option type with more specific constructors, defined in a separate module for fear of circular dependencies:

ParsersMisc.ml: type of stream value
type 'a stream_item = Item of 'a | EndOfStream

Notice that this stream container must be free of side effects to the extent that any token read from it in one place must still be available for reading from previously stored streams. That is why take returns both the next token and the next (shorter) stream. In other words, it must be a persistent data structure.

For convenience, we had better not keep it secret where our Parsers take their types from:

parser public type constraints
and type 'a ct = 'a Conf.ct

Thus a parser now has this shape:

Resumable parser type
type 'b parser = stream -> ('b * stream) list ct

With this API, when the parser fails to find any way to parse the input it returns a minimally informative empty list. Introducing the error budget changes this somewhat: we will try to artificially force the failing parsers to succeed in order to sneak into that alternate reality and try to locate where a change would lead to a drastically better outcome. This means that each individual result must be accompanied with a description of the (few) changes required to reach that point:

Parser type with error detection
type 'b parser =
  ParsersCorrections.t -> stream -> ('b * ParsersCorrections.t * stream) list ct

Before taking a closer look at this new ParsersCorrections.t type that would encode the corrections, we must question the usage of a list of results as the return type. Firstly, a list is over-specified, since the order of the possible results is not important; what we really want here is a set, and we use a list only because it makes our code more terse. More importantly, an empty list to signal failure is not enough to explore the artificial failure of parsers (because we need to store that correction somewhere).

Consider for instance this excerpt from a fictitious programming language:

class form;
for x = new form(...);

Obviously the intent was to write form x = .... Imagine the rule to parse the second line is is_keyword XOR is_name. Once the parser has accepted the keyword for as valid without questioning it, it is likely that the error message pointing at what follows will be hard to comprehend. On the other hand, if the parser also tried to force the failure of the keyword parser at this location, it would notice that everything would parse properly henceforth, suggesting a better error message. So it seems beneficial to return that failure as a correction and move on to the next token.

The price to pay for testing the failure of successful parsers is obviously high, though, and not only because of the additional time spent. Returning error descriptions alongside failures forces us to give up the elegant list of results as the main return type.

But it seems that this problem arises only when we make use of the exclusive alternative. Should we decide not to implement such a combinator, the above example ``either a keyword or a variable name that is not a keyword'' could still be written with the inclusive alternative, at the price of redundant checks: (is_keyword AND (check (NOT is_name))) OR (is_name AND (check (NOT is_keyword))). In this case we could explore the failure of the checks and notice that if for were a valid variable name then the input would be valid, which makes for a much better error message.

So we will not implement the exclusive alternative and will instead explore forced success of the check parser. Hence, we turned exploring failure into exploring success and saved our list as the return type.

Now, what’s this ParsersCorrections.t type?

To be able to build a useful error message we must point at the position in the original stream of tokens where some change had to be made in order to parse the input stream (if not in full, at least more than without that change). What constitutes a position depends on the nature of the tokens. The obvious offset since the beginning might not always be appropriate, and it’s probably better to leave it open to the user. Let’s therefore assume that both tokens and positions are read from the input stream.

In addition to the location, a mere description of the parser that we forced to succeed (as a string) completes the ParsersCorrections.t:

ParsersCorrections.ml: type
open ParsersMisc
type ('pos, 'tok) t = ('pos * 'tok stream_item * string) ParsersBoundedSet.t

where ParsersBoundedSet is an unordered container with a maximum capacity (the maximum number of changes allowed) and whose API will become clearer as we encounter the few required functions.
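The document never shows ParsersBoundedSet itself; the following is a hypothetical minimal implementation covering only the operations used here (make, add, is_empty, is_full), just to make the API concrete:

```ocaml
(* Hypothetical ParsersBoundedSet: an unordered container with a maximum
   capacity. Here elements past the capacity are silently dropped; the
   real module may well behave differently. *)
module ParsersBoundedSet = struct
  type 'a t = { max_size : int ; items : 'a list }
  let make max_size = { max_size ; items = [] }
  let is_empty c = c.items = []
  let is_full c = List.length c.items >= c.max_size
  let add c x =
    if is_full c then c else { c with items = x :: c.items }
end

let () =
  let c = ParsersBoundedSet.make 1 in
  assert (ParsersBoundedSet.is_empty c) ;
  let c = ParsersBoundedSet.add c (0, "tok", "msg") in
  assert (ParsersBoundedSet.is_full c)
```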

Trivially, to add an error at a given position to the correction list, with message m:

ParsersCorrections.ml: recording a change
let change_at c pos tok m =
  ParsersBoundedSet.add c (pos, tok, List.hd m)

Now that we know what corrections look like and that we have to read the positions alongside the tokens from the input stream, we can write a better type for the parser:

Parser type with error correction
type 'b possible_result =
  'b * ((position, token) ParsersCorrections.t) * stream
type 'b t =
  (position, token) ParsersCorrections.t -> stream ->
  'b possible_result list ct

The type position has to be supplied by the functor configuration, and now we have the final type for stream/take:

Parser configuration: now also supplying position
(* ...threading types... *)
type position
type stream
val take : stream -> (position * token ParsersMisc.stream_item * stream) ct
parser public type constraints
and type position = Conf.position
and type stream = Conf.stream

We are not done yet. Above we saved the list as a container for the possible solutions, but this still leaves us with an empty list when no solution could be found, from which it’s not possible to devise an informative error message!

In addition to the cumulative list of all solutions a parser should also return an aggregated value containing the "best" error found.

What we call the "best" error is the error that caused the parser to give up (return []) the latest in the input stream, as measured by the position of the token. We thus need a comparison function between two positions. We could have a minimally informative greater_than : position -> position -> bool function, but it will later be convenient to have an idea of the actual (oriented) distance between any two positions, so let’s rather define:

Parser configuration: comparator for positions
val distance : position -> position -> int

An error is composed of the location in the stream where the parsing stopped and the stack of things the parser was trying to build:

error type:
type error =
  { where : position ; what : string list }

We can update the current best error from another stream item and stack of messages with this simple function:

keeping track of the best error
let new_error where what = function
  | None ->
    Some { where ; what }
  | Some err as e ->
    if distance err.where where >= 0 then
      Some { where ; what }
    else
      e

With the idea that a stack of messages describing the context is maintained from one parser to the next as the context gets deeper. Therefore, every parser must accept this current stack as an additional input and enrich it with whatever makes sense (which oftentimes requires some hint from the caller).
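Using integer offsets for positions (with distance p1 p2 = p2 - p1, the choice made by the SimpleConfig below), new_error indeed keeps whichever error is furthest in the stream; a standalone check:

```ocaml
type error = { where : int ; what : string list }

(* Integer offsets, as in the SimpleConfig below: *)
let distance p1 p2 = p2 - p1

let new_error where what = function
  | None -> Some { where ; what }
  | Some err as e ->
    if distance err.where where >= 0 then Some { where ; what }
    else e

let () =
  let e1 = new_error 3 [ "ident" ] None in
  let e2 = new_error 7 [ "number" ] e1 in
  (* The later error wins: *)
  assert (e2 = Some { where = 7 ; what = [ "number" ] }) ;
  (* An earlier error does not displace it: *)
  assert (new_error 2 [ "keyword" ] e2 = e2)
```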

So that the new (and final) type for parsers expands to:

final parser type
(* ...error type... *)

type 'b possible_result =
  'b * ((position, token) ParsersCorrections.t) * stream

type 'b t =
  string list -> error option ->
  (position, token) ParsersCorrections.t -> stream ->
  (error option * 'b possible_result list) ct

That we want both in the implementation and the signature:

Parser library
(* ...final parser type... *)
(* ...keeping track of the best error... *)
Parser signature
(* ...final parser type... *)

That’s a lot of inputs. For simplicity and conciseness those parameters will always be given the same one letter names:

  • m for the stack of messages describing the context (of type string list);

  • e for the optional error;

  • c for the ParsersCorrections.t;

  • s for the stream;

  • x for a token;

  • p for a parser (of type t);

  • r for a list of possible results (of type 'b possible_result list).

The output itself, (error option * 'b possible_result list) ct suits combining parsers but may be a bit of a mouthful for the end user to chew. This function turns it into a more edible result type:

Parser library: building a final result from possible results
let to_result x =
  ct_bind x (fun (e, r) ->
    (* If we have a single solution then that's the one! *)
    ct_return (match r with
    | [ b, c, s ] ->
      if ParsersBoundedSet.is_empty c then Ok (b, s)
      else Error (Approximation (b, c, s))
    | [] ->
      Error (NoSolution e)
    | lst ->
      Error (Ambiguous lst)))

of type:

Parser signature
val to_result : (error option * 'b possible_result list) ct ->
                ('b * stream, 'b failure) result ct

with:

final parser type: many ways to fail
type 'b failure =
  | Approximation of ('b * (position, token) ParsersCorrections.t * stream)
  | Ambiguous of ('b * (position, token) ParsersCorrections.t * stream) list
  | NoSolution of error option

Notice that we pay no attention to whether the stream has been emptied or not. If one wants to reach the end of the input stream, that must be part of the parser (see eof).

It will come in handy to have a configuration that’s as simple as possible, with no threading going on, for those cases where we do not need resumable parsers; for instance when testing. Since there is no threading, all the input has to be available already and can thus be passed as a mere list.

The SimpleConfig serves this purpose:

Parsers.ml: simple configuration for non-resumable parsers
open ParsersMisc
module SimpleConfig
  (Token : sig
    type t
    val print : 'o BatInnerIO.output -> t -> unit
  end) =
struct
  type 'a ct = 'a
  type token = Token.t
  let print_token = Token.print
  type position = int
  let distance p1 p2 = p2 - p1
  let print_position fmt p =
    Printf.fprintf fmt "offset %d" p
  type stream = position * token list
  let print_stream fmt (_, s) =
    List.print print_token fmt s
  let take = function
    | (pos, []) as s -> pos, EndOfStream, s
    | pos, (x::rest) -> pos, Item x, (pos+1, rest)
  let ct_return x = x
  let ct_bind x f = f x
  (* ...other SimpleConfig definitions... *)
end
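Extracted from the functor machinery, this take illustrates the persistence requirement stated earlier: reading from a stream never invalidates previously stored streams.

```ocaml
(* Standalone replica of SimpleConfig's stream (positions are offsets): *)
type 'a stream_item = Item of 'a | EndOfStream
type stream = int * char list

let take = function
  | (pos, []) as s -> pos, EndOfStream, s
  | pos, (x :: rest) -> pos, Item x, (pos + 1, rest)

let () =
  let s0 = 0, [ 'h' ; 'i' ] in
  let _, i1, s1 = take s0 in
  assert (i1 = Item 'h') ;
  (* s0 is untouched: re-reading it yields the same token. *)
  let _, i1', _ = take s0 in
  assert (i1' = Item 'h') ;
  let _, i2, s2 = take s1 in
  assert (i2 = Item 'i') ;
  let _, i3, _ = take s2 in
  assert (i3 = EndOfStream)
```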

We will devise a more elaborate configuration for LWT later on.

4. Base parsers (to be combined)

4.1. Fail, Return and check

The simplest parser that does nothing is return. It does not consume anything from the input but merely returns a single result:

Parser library: return
let return x _m e c s = ct_return (e, [x, c, s])

A similarly simple one is the parser that always fails:

Parser library: fail
let fail _m e _c _s = ct_return (e, [])

with signatures:

Parser signature
val return : 'b -> 'b t
val fail : 'b t

These first two parsers perform no error correction at all. But many other parsers will have to either terminate parsing abruptly (with fail) or add a change to the correction list and proceed, if the error budget is not already exhausted. We will abstract this in a fail_or_maybe_not function:

Parser library: fail with success exploration
open ParsersCorrections

let fail_or_maybe_not x (* <1> *) m e c s =
  if ParsersBoundedSet.is_full c then (
    (* no more errors permitted so fail for real *)
    ct_bind (take s) (fun (pos, _, _) ->
      fail m (new_error pos m e) c s)
  ) else (
    (* Here we insert x in the stream ;
       shall we also try to replace pos_tok with x?
       In any case beware that s may be EndOfStream.
       TODO: two types of correction: replace and insert *)
    (* Note that corrections still count as errors because there is no
       guarantee that we will keep track of them: *)
    ct_bind (take s) (fun (pos, tok, _) ->
      return x m (new_error pos m e) (change_at c pos tok m) s))
  1. Here we need an example value x of type β in order to change the outcome of a failure. Which value exactly is not really a concern, since only its type matters (although the error message could print it as an example, as the OCaml compiler does when complaining about an incomplete pattern matching).

Another parser that does not consume any input is the check parser that we mentioned earlier. It is actually a combinator, since it takes another parser as parameter. It checks that the given parser succeeds but then returns the input stream unchanged (with a unit result). The only interesting thing is that it explores forcing a success in case the check fails.

Parser library: check
let check ?(what="check") p m e c s =
  let m = what::m in
  ct_bind (p m e c s) (function
    | e', [] -> fail_or_maybe_not () m e' c s
    | e', _ -> return () m e' c s)
Parser signature
val check : ?what:string -> 'b t -> unit t

Another parser that will prove useful (despite contributing no value to the result) especially in coordination with check is the negation:

Parser library: negation
let nay p m e c s =
  let m = "nay"::m in
  ct_bind (p m e c s) (function
    | e', [] -> return () m e' c s
    | e', _ -> fail_or_maybe_not () m e' c s)

So that we could write check (nay p).

Notice that when nay succeeds, i.e. when p fails, the stream of input tokens s is not advanced (it does not consume any input). Indeed, we do not know by how many tokens to advance it (since p could operate on several).

Parser signature
val nay : 'b t -> unit t

4.2. Tests

It is important to have a test infrastructure in place before it’s needed. Given that literate programming allows us to mix code and tests at ease, we do not need to get this feature from a tool such as qtest and will use oUnit directly.

It would be convenient to have a function to convert a string into a stream, and not only during tests but also for real use when all the input is available at once. Let’s thus add this into the CONFIG signature:

Parser configuration: convert from string
val stream_of_string : string -> stream
other SimpleConfig definitions:
let stream_of_string s =
  0, String.to_list s

Supposing for now that we also have all the required printers, we can set up a satisfying environment for tests:

test.ml: the stage.
open Batteries
open OUnit2
open ParsersMisc
open ParsersCorrections

module P = Parsers.Make (Parsers.SimpleConfig (Char))
(* ...other tested modules... *)
open P

let max_changes = 3
let corr = ParsersBoundedSet.make max_changes
(* TODO: two types of correction: replace and insert *)
let correction_at pos tok m =
  let c = ParsersBoundedSet.make max_changes in
  change_at c pos tok [ m ]

let no_corr = Parsers.no_error_correction
let rest = stream_of_string "glop glop pas glop"
let no_input = 0, []
let drop ?(sep_len=0) n (o, lst) =
  let n' = 1 + (n-1)*(1+sep_len) in
  o + n', List.drop n' lst

(* ...other global functions or types for testing... *)

let uniq = function
  | [x, _, _] -> Some x
  | _ -> None

(* version of assert_equal specialized for parser results *)
let assert_same_results ?msg print_output exp got_ =
  ct_bind got_ (fun got ->
    let printer =
      IO.to_string (print_possible_results print_output) in
    let cmp (exp_err, exp) (act_err, got) =
      let same_err =
        match exp_err, act_err with
        | None, _ ->
          (* Many times we do not want to guess the errors (for instance
             when the parser actually succeeds): *)
          true
        | Some exp_err, Some act_err ->
          (* Unless we give a deep exp_err we want to compare only the heads: *)
          exp_err.where = act_err.where &&
          (match exp_err.what, act_err.what with
          | [exp_head], act_head::_ ->
            0 = compare exp_head act_head
          | exp_what, act_what ->
            0 = compare exp_what act_what)
        | exp, act ->
          0 = compare exp act in
      same_err &&
      0 = compare (List.sort compare exp)
                  (List.sort compare got) in
    ct_return (
      (* OUnit really should have an assert_same_elements *)
      assert_equal ~printer ~cmp ?msg exp got)) |>
  ignore

let () =
  run_test_tt_main (
    "test helpers" >:::
      [ "stream_of_empty" >:: (
          fun _ctx ->
            assert_equal ~printer:string_of_int
              0 (List.length (stream_of_string "" |> snd))) ;
        "stream_of_string basic" >:: (
          fun _ctx ->
            assert_equal ~printer:string_of_int
              2 (List.length (stream_of_string "ab" |> snd))) ;
        "(no) corrections allowed" >:: (
          fun _ctl ->
            assert_bool "no_corr is full"
              (ParsersBoundedSet.is_full no_corr) ;
            assert_bool "corr is not full"
              (not (ParsersBoundedSet.is_full corr))) ;
        ]) ;
  run_test_tt_main (
    "tests" >:::
      [ (* ...tests... *) ])

Notice that we have to force the type of assert_same_results to be unit (with ignore), otherwise it would be unit P.ct P.ct, which should be demonstrably equivalent to unit given SimpleConfig but still makes the compiler grumble.

Let’s warm this up with simple tests for return and fail (which really do not cause much worry):

tests
"return succeed" >:: (
  fun _ctx ->
    assert_same_results Int.print
      (None, [42, no_corr, rest])
      (return 42 [] None no_corr rest)
) ;
"return succeed even at EOF" >:: (
  fun _ctx ->
    assert_same_results Int.print
      (None, [42, no_corr, (0,[])])
      (return 42 [] None no_corr no_input)
) ;
"fail fails" >:: (
  fun _ctx ->
    assert_same_results Int.print
      (None, [])
      (fail [] None no_corr rest)
) ;
"fail fails even at EOF" >:: (
  fun _ctx ->
    assert_same_results Int.print
      (None, [])
      (fail [] None no_corr no_input)
) ;

4.3. Checking for end of stream

Another very useful and basic parser is the one that succeeds on EOF and fails everywhere else. It is useful to check that the input stream has been consumed entirely by the preceding parsers.

We do not engage in error detection in this parser: mimicking success implies pretending the stream stops there, but most input streams could be trivially declared valid if cut short (the empty string is often valid, for instance). Besides, in case of spurious input tokens at the end, shouldn’t the error message be trivial enough already? Also, when error detection is allowed we (should) also try to skip tokens, which is enough to correct a few spurious chars at the end.

The real reason, of course, is none of the above. It is that forcing eof to succeed would require an 'empty_stream' from the configuration, which was too inconvenient.

Parser library: checking for EOF
open ParsersMisc
let eof m e c s =
  let m = "eof"::m in
  ct_bind (take s) (function
    | _, EndOfStream, s' ->
      return () m e c s'
    | pos, _, _ ->
      fail m (new_error pos m e) c s)
Parser signature
val eof : unit t

And the accompanying tests:

tests
"eof succeed" >:: (
  fun _ctx ->
    assert_same_results Unit.print
      (None, [(), no_corr, (0,[])])
      (eof [] None no_corr no_input)
) ;
"eof fails" >:: (
  fun _ctx ->
    assert_same_results Unit.print
      (Some { where = 0 ; what = ["eof"] }, [])
      (eof [] None no_corr rest)
) ;
(*
"eof suggests truncation" >:: (
  fun _ctx ->
    assert_same_results Unit.print
      (None, [(), correction_at 0 (Item 'g') "eof", (18,[]) ])
      (eof [] None corr rest)
) ;
*)

4.4. First non trivial parser

The most general of the parsers that do consume some input is the cond parser, which tries to recognize a condition on the next token (for instance, that it is equal to a given value). So cond is a function that takes a predicate on tokens and returns a parser that, when given a token satisfying the predicate, returns it (and consumes it), or otherwise fails (with a message describing what it was looking for).

Now that we know the type, writing the code is rather easy:

A cond parser
let cond expl f x m e c s =
  let m = expl::m in
  ct_bind (take s) (function
    | _pos, Item tok, s' when f tok ->
      return tok m e c s'
    | _ ->
      fail_or_maybe_not x m e c s)
Parser signature
val cond : string -> (token -> bool) -> token -> token t

cond_map is like cond, except that its condition function returns an optional value instead of a mere boolean:

Parser library: cond_map
let cond_map expl f x m e c s =
  let m = expl::m in
  ct_bind (take s) (function
    | _pos, EndOfStream, _ ->
      fail_or_maybe_not x m e c s
    | _pos, Item tok, s' ->
      (match f tok with
       | Some v -> return v m e c s'
       | None   -> fail_or_maybe_not x m e c s))
Parser signature
val cond_map : string -> (token -> 'b option) -> 'b -> 'b t

from which a simpler cond parser can be written:

Parser library: the cond parser, revisited
let cond expl f =
  cond_map expl (fun c -> if f c then Some c else None)

It is possible to build many simpler and more convenient parsers on top of cond, such as item, which expects a specific token in the input, and range, which expects anything in the given token range (assuming the token’s behavior in the face of inequality operators makes sense):

Parser library: the item parser
let item ?what x =
  let expl =
    Option.default_delayed (fun () ->
      Printf.sprintf2 "%a" print_token x) what in
  cond expl ((=) x) x

let range a b expl =
  cond expl (fun c -> c >= a && c <= b) a
Parser signature
val item : ?what:string -> token -> token t
val range : token -> token -> string -> token t
tests
"item canonical success" >:: (
  fun _ctx ->
    assert_same_results Char.print
      (None, ['g', no_corr, (1, List.tl (snd rest))])
      (item 'g' [] None no_corr rest)
) ;
"item canonical failure" >:: (
  fun _ctx ->
    assert_same_results Char.print
      (Some { what = ["X"] ; where = 0 }, [])
      (item 'X' [] None no_corr rest)
) ;
"item fails at EOF" >:: (
  fun _ctx ->
    assert_same_results Char.print
      (Some { what = ["g"] ; where = 0 }, [])
      (item 'g' [] None no_corr no_input)
) ;
"item error exploration" >:: (
  fun _ctx ->
    (* Here we 'find' the item X just because we add it. *)
    assert_same_results Char.print
      (Some { what = ["X"] ; where = 0 },
       [ 'X', correction_at 0 (Item 'g') "X",
         rest (* since we add X in front of rest *) ])
      (item 'X' [] None corr rest)
) ;

5. Modifying parsers result

Before going too far we need to introduce functions that alter a parser’s result (equivalents of map, fold, filter…​) and come up with a convenient syntax for them, since they are going to be used pervasively.

Applying a function to all results of a parser, take 1
let map p f m e c s =
  ct_bind (p m e c s) (fun (e', results) ->
    let results' = List.map (fun (x, corr, rest) ->
      f x, corr, rest) results in
    ct_return (e', results'))

The order of parameters is important so that map p f is itself a parser. In practice, though, we often want to filter the results in addition to mapping them. For instance, suppose p is a parser for numbers and we want to convert the sequence of characters it outputs into a number, but accept only numbers below some limit. We will therefore allow f to throw a special exception to reject a solution:

Parser library: applying a function to all results of a parser, take 2
exception Reject of string

let map p f m e c s =
  ct_bind (p m e c s) (fun (e', results) ->
    let rejection = ref None in
    let results' =
      List.filter_map (fun (x, corr, rest) ->
        try Some (f x, corr, rest)
        with Reject msg ->
          if !rejection = None then
            rejection := Some msg ;
          None) results in
    match !rejection with
    | None -> ct_return (e', results')
    | Some msg ->
      ct_bind (take s) (fun (pos, _, _) ->
        ct_return (new_error pos (msg::m) e', results')))
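The rejection mechanism is easy to check in isolation. The sketch below is hypothetical (a standalone map_results helper over a plain list, not the library’s richer result type) but mimics the filter_map step above:

```ocaml
(* Hypothetical standalone sketch of the rejection step: filter the
   results, dropping those for which f raises Reject, as filter_map
   does in the library code above. *)
exception Reject of string

let map_results f results =
  List.filter_map (fun x -> try Some (f x) with Reject _ -> None) results

(* hypothetical filter: accept only numbers below some limit *)
let below_100 n = if n >= 100 then raise (Reject "too big") else n

let () = assert (map_results below_100 [ 1; 200; 42 ] = [ 1; 42 ])
```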

An infix operator makes it even more convenient:

Parser library: infix operator for map
let (>>:) = map
Parser signature
exception Reject of string
val map   : 'b t -> ('b -> 'd) -> 'd t
val (>>:) : 'b t -> ('b -> 'd) -> 'd t

6. Combinators

The first combinators to consider are the succession of two given parsers and the alternative of two parsers.

Notice that since we are now merely combining parsers we do not have to care about error correction any more: only the base parsers need to pretend to succeed when they fail.

The most general way to build a combinator for the succession of two parsers is to take the first parser p1 and a function f which, given the output of p1, returns a parser p2 to apply to the remainder of the input stream. Let’s call this combinator bind (by analogy with the type of the bind operation in the monad ``design pattern''). The values of bind p1 f are the values of p2, p1’s intermediate values being meaningful only to build p2.

Parser library: bind
let bind p1 f m e c s =
  ct_bind (p1 m e c s) (fun (e', r) ->
    (* For each possible result of p1, try to continue parsing with p2.
       Aggregate all encountered errors. *)
    List.fold_left (fun prev (x1, c', s') ->
        ct_bind prev (fun (e'', r') ->
          let p2 = f x1 in
          ct_bind (p2 m e'' c' s') (fun (e''', r'') ->
            ct_return (e''', List.rev_append r'' r')))
      ) (ct_return (e', [])) r)

With the conventional infix operator:

Parser library: infix operator for bind
let (>>=) = bind
Parser signature
val bind  : 'b t -> ('b -> 'd t) -> 'd t
val (>>=) : 'b t -> ('b -> 'd t) -> 'd t

Given this bind combinator, the concatenation of two given parsers p1 and p2 can be easily written as:

Parser library: succession of two parsers
let cons p1 p2 =
  p1 >>= (fun x1 -> p2 >>: fun x2 -> x1,x2)

Here, we want the final result set to be the product of each result of p1 with all following results of p2.
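This product behavior can be illustrated on a deliberately simplified model of the library, where a parser is just a function from the input to all possible (value, rest of input) pairs. All names below are hypothetical and the real types also thread context, errors and corrections:

```ocaml
(* Hypothetical simplified model: a parser maps the input to all
   possible (value, rest of input) pairs. *)
type 'a p = char list -> ('a * char list) list

let bind (p : 'a p) (f : 'a -> 'b p) : 'b p =
  fun s -> List.concat_map (fun (x, rest) -> f x rest) (p s)

let ret x : 'a p = fun s -> [ (x, s) ]

let cons p1 p2 = bind p1 (fun x1 -> bind p2 (fun x2 -> ret (x1, x2)))

let item c : char p = function
  | x :: rest when x = c -> [ (c, rest) ]
  | _ -> []

(* An ambiguous parser: zero or one 'a' *)
let opt_a : char option p =
  fun s -> (None, s) :: bind (item 'a') (fun c -> ret (Some c)) s

let () =
  (* Each of the two results of opt_a is continued with (item 'a'): *)
  assert (List.length (cons opt_a (item 'a') [ 'a'; 'a' ]) = 2)
```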

We’d better have a shorter infix alternative for cons, which is used very often:

Parser library: infix operator for cons
let (++) p1 p2 = cons p1 p2
Parser signature
val cons : 'b t -> 'd t -> ('b * 'd) t
val (++) : 'b t -> 'd t -> ('b * 'd) t

Also, we will often discard the result of one parser. For instance when parsing delimiters the only information is that the parser succeeds (there is a delimiter) but there is no value to attach to that success. Likewise when using the check parser, whose purpose is really not its return value. So here are three variants of cons: one that ignores the result of p1, one that ignores the result of p2 and one that ignores both (returning ()):

Parser library: other convenient infix operators
let (+-) p1 p2 = p1 ++ p2 >>: fst
let (-+) p1 p2 = p1 ++ p2 >>: snd
let (--) p1 p2 = p1 ++ p2 >>: ignore
Parser signature
val (+-) : 'b t -> 'd t -> 'b t
val (-+) : 'b t -> 'd t -> 'd t
val (--) : 'b t -> 'd t -> unit t

Conveniently, all those are left-associative.

Now let’s test that we can indeed sequence parsers:

tests
"Can parse a sequence" >:: (
  fun _ctx ->
    let ab = stream_of_string "ab" in
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('a', 'b'), no_corr, (2,[])])
      ((item 'a' ++ item 'b') [] None no_corr ab) ;
    assert_same_results Char.print
      (None, ['a', no_corr, (2,[])])
      ((item 'a' +- item 'b') [] None no_corr ab)
) ;

The next most useful combinator is the alternative:

The alternative, take 1
let oneof p1 p2 m e c s =
  ct_bind (p1 m e c s) (fun (e', r) ->
    ct_bind (p2 m e' c s) (fun (e'', r') ->
      ct_return (e'', List.rev_append r r')))

…​which simply tries each parser in turn, accumulates the best error and combines the possible results.

There is a small inconvenience here: if both terms p1 and p2 of the alternative fail right from the start, then the resulting error will be about p2. It is probably more insightful in this case to report that the oneof itself failed.

To obtain this behavior an additional error for the alternative itself can be registered for the starting position and it will overwrite the best error e'' if it’s also located there.

If either p1 or p2 managed to parse even a tiny bit of s then it will still be favored in the error report, which may or may not be desirable depending on how likely this parser is to fail late on random data.

Parser library: alternative
let oneof p1 p2 m e c s =
  ct_bind (p1 m e c s) (fun (e', r) ->
    ct_bind (p2 m e' c s) (fun (e'', r') ->
      ct_bind (take s) (fun (pos, _, _) ->
        let r'' = List.rev_append r r' in
        let e''' = if r'' = [] then new_error pos m e''
                               else e'' in
        ct_return (e''', r''))))

let (|||) = oneof
Parser signature
val oneof : 'b t -> 'b t -> 'b t
val (|||) : 'b t -> 'b t -> 'b t

Notice that results are really sets, not lists, so the order in which the alternatives are listed does not matter. Notice also that this is not an exclusive alternative: if both p1 and p2 can parse then both will contribute a result to the result set. As discussed in the beginning, we do not enforce that if p1 succeeds then p2 must fail, nor the other way around. If this is wanted though, it is easy enough to write:

Parser library: exclusive alternative
let either p1 p2 =
  (nay p2 -+ p1) ||| (nay p1 -+ p2)

let (|/|) = either
Parser signature
val either : 'b t -> 'b t -> 'b t
val (|/|)  : 'b t -> 'b t -> 'b t

Finally, we can introduce a left-leaning version of oneof that tries p2 only if p1 fails. This can considerably speed up parsing at places where the grammar is unambiguous:

Parser library: left-leaning oneof
let oneof_greedy p1 p2 m e c s =
  ct_bind (p1 m e c s) (fun (e', r) ->
    if r <> [] then ct_return (e', r) else
    ct_bind (p2 m e' c s) (fun (e'', r) ->
      ct_bind (take s) (fun (pos, _, _) ->
        let e''' = if r = [] then new_error pos m e''
                             else e'' in
        ct_return (e''', r))))

let (|<|) = oneof_greedy
Parser signature
val oneof_greedy : 'b t -> 'b t -> 'b t
val (|<|) : 'b t -> 'b t -> 'b t
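The difference between ||| and |<| can be seen on a simplified model of the library (all names below are hypothetical, and errors and corrections are ignored): with two overlapping branches, oneof keeps both results while oneof_greedy keeps only the first branch’s.

```ocaml
(* Hypothetical simplified model: a parser maps the input to all
   possible (value, rest of input) pairs. *)
type 'a p = char list -> ('a * char list) list

let lowercase : char p = function
  | x :: rest when x >= 'a' && x <= 'z' -> [ (x, rest) ]
  | _ -> []

let letter : char p = function
  | x :: rest when (x >= 'a' && x <= 'z') || (x >= 'A' && x <= 'Z') ->
    [ (x, rest) ]
  | _ -> []

(* Both branches contribute their results: *)
let oneof p1 p2 : 'a p = fun s -> p1 s @ p2 s

(* The second branch is tried only when the first yields nothing: *)
let oneof_greedy p1 p2 : 'a p =
  fun s -> match p1 s with [] -> p2 s | r -> r

let () =
  assert (List.length (oneof lowercase letter [ 'a' ]) = 2) ;
  assert (List.length (oneof_greedy lowercase letter [ 'a' ]) = 1)
```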

With sequences and alternatives we can start writing some interesting tests:

tests
"any: 'a' or 'b' but not 'z'" >:: (
  fun _ctx ->
    let a_or_b m =
      let m = "a or b" :: m in
      (item 'a' ||| item 'b') m in
    let z = stream_of_string "z" in
    assert_same_results Char.print
      (None, ['a', no_corr, (1,[])])
      (a_or_b [] None no_corr (stream_of_string "a")) ;
    assert_same_results Char.print
      (None, ['b', no_corr, (1,[])])
      (a_or_b [] None no_corr (stream_of_string "b")) ;
    assert_same_results Char.print
      (Some { what = ["a or b"] ; where = 0 }, [])
      (a_or_b [] None no_corr z) ;
    (* Same as above, greedy variant: *)
    let a_or_b' m =
      let m = "a or b" :: m in
      (item 'a' |<| item 'b') m in
    assert_same_results Char.print
      (None, ['a', no_corr, (1,[])])
      (a_or_b' [] None no_corr (stream_of_string "a")) ;
    assert_same_results Char.print
      (None, ['b', no_corr, (1,[])])
      (a_or_b' [] None no_corr (stream_of_string "b")) ;
    assert_same_results Char.print
      (Some { what = ["a or b"] ; where = 0 }, [])
      (a_or_b' [] None no_corr z) ;
    (* Here we try with corrections: we make up the requested item
       in front of the actual one (check the 2 possibilities): *)
    assert_same_results Char.print
      (Some { what = ["b"] ; where = 0 },
       ['a', correction_at 0 (Item 'z') "a", z ;
        'b', correction_at 0 (Item 'z') "b", z])
      (a_or_b [] None corr z)
) ;

7. Repeating parsers

Binding several parsers already gives us a way to harvest several values from the input stream, but often what is needed is to repeat the same parser an unspecified number of times.

Before that, a special case of repetition will prove very useful: having zero or one occurrence of p:

Parser library: zero or one
let optional ~def p = p ||| return def
let optional_greedy ~def p m e c s =
  ct_bind (p m e c s) (function
    | e', [] -> ct_return (e', [def, c, s])
    | x -> ct_return x)

The optional_greedy variant above avoids considering skipping a matching token as a possible solution.
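Here is the difference on a simplified list-of-results model (a hypothetical sketch, not the library’s types): optional always offers the ``skip'' solution, while optional_greedy offers it only when p fails.

```ocaml
(* Hypothetical simplified model: a parser maps the input to all
   possible (value, rest of input) pairs. *)
type 'a p = char list -> ('a * char list) list

let item c : char p = function
  | x :: rest when x = c -> [ (c, rest) ]
  | _ -> []

(* optional always offers the "did not consume" solution... *)
let optional def (p : 'a p) : 'a p = fun s -> (def, s) :: p s

(* ...while the greedy variant offers it only when p failed: *)
let optional_greedy def (p : 'a p) : 'a p =
  fun s -> match p s with [] -> [ (def, s) ] | r -> r

let () =
  assert (List.length (optional 'x' (item 'b') [ 'b' ]) = 2) ;
  assert (optional_greedy 'x' (item 'b') [ 'b' ] = [ ('b', []) ]) ;
  assert (optional_greedy 'x' (item 'b') [ 'c' ] = [ ('x', [ 'c' ]) ])
```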

Parser signature
val optional : def:'b -> 'b t -> 'b t
val optional_greedy : def:'b -> 'b t -> 'b t

Which behavior we’d better test:

tests
"optional behavior" >:: (
  fun _ctx ->
    let opt_char c = optional ~def:'x' (item c) in
    let ab = stream_of_string "ab" in
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('a', 'b'), no_corr, (2,[])])
      ((opt_char 'a' ++ item 'b') [] None no_corr ab) ;
    let b = stream_of_string "b" in
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('x', 'b'), no_corr, (1,[])])
      ((opt_char 'a' ++ item 'b') [] None no_corr b) ;
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('x', 'b'), no_corr, (1,[]) ;
              ('b', 'x'), no_corr, (1,[]) ;
              ('x', 'x'), no_corr, (0,['b'])])
      ((opt_char 'b' ++ opt_char 'b') [] None no_corr b)
) ;
"optional_greedy really is" >:: (
  fun _ctx ->
    let opt_char c = optional_greedy ~def:'x' (item c) in
    let ab = stream_of_string "ab" in
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('a', 'b'), no_corr, (2,[])])
      ((opt_char 'a' ++ item 'b') [] None no_corr ab) ;
    let b = stream_of_string "b" in
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('x', 'b'), no_corr, (1,[])])
      ((opt_char 'a' ++ item 'b') [] None no_corr b) ;
    assert_same_results (Tuple2.print Char.print Char.print)
      (None, [('b', 'x'), no_corr, (1,[])])
      ((opt_char 'b' ++ opt_char 'b') [] None no_corr b)
) ;

Sometimes it’s more convenient for the optional parser to return an optional value instead of a default one. One can then use None as the default and combine the parser with this some parser, making optional return an option type:

Parser library: some
let some p = p >>: fun x -> Some x
Parser signature
val some : 'a t -> 'a option t

The repeat combinator is a swiss-army knife for all variants of repetitions, requiring a parser p to succeed from min to max times consecutively, with an optional additional parser sep for a separator in between p occurrences. It returns a list of all values returned by the successive p.

By allowing min to be 0 (and making it the default value) we expect to cut down on the many optional (repeat p) that we would have otherwise.

Parser library: repetition of a parser
let repeat ~sep ?(min=0) ?max ?what p m =
  let m = match what with None -> m | Some w -> w::m in
  let rec loop ~min ?max m e c s =
    if max = Some 0 then (
      if min = 0 then return [] m e c s
      (* note: if fail was taking this token itself then we could
         get away with repeating the full m e c s parameters everywhere *)
      else
        ct_bind (take s) (fun (pos, _, _) ->
          fail m (new_error pos m e) c s)
    ) else (
      let pred_ma = match max with None -> None
                                 | Some m -> Some (m-1) in
      match min with
      | 0 ->
        (* we may stop here or continue *)
        (optional ~def:[] (loop ~min:1 ?max)) m e c s
      | 1 ->
        (* If max is greater or unset, we want at least one more and
           keep trying. But if max is also 1 then we just want one more
           and there is no point looking further; which may avoid to wait
           for another block of tokens in some settings: *)
        (
          if max = Some 1 then p >>: fun x -> [x]
          else (p ++ optional ~def:[]
                       (sep -+ (loop ~min:1 ?max:pred_ma))) >>:
                fun (x, xs) -> x::xs
        ) m e c s
      | _ ->
        (* above that, repetition is mandatory *)
        ((p +- sep ++ loop ~min:(min-1) ?max:pred_ma) >>:
          fun (x, xs) -> x::xs) m e c s
    ) in
  loop ~min ?max m

Notice there are two conditions that terminate the recursion: max reaching 0 (no more occurrences permitted) or, when min > 0, a failure of p.
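The min/max bookkeeping can be sketched on the same kind of simplified model (hypothetical standalone code, ignoring separators, errors and corrections):

```ocaml
(* Hypothetical simplified model, ignoring separators, errors and
   corrections: a parser maps the input to all (value, rest) pairs. *)
type 'a p = char list -> ('a * char list) list

let item c : char p = function
  | x :: rest when x = c -> [ (c, rest) ]
  | _ -> []

let ret x : 'a p = fun s -> [ (x, s) ]

let bind (p : 'a p) f : 'b p =
  fun s -> List.concat_map (fun (x, rest) -> f x rest) (p s)

let rec repeat ~min ~max p : 'a list p =
  if max = 0 then ret []  (* no more occurrences permitted *)
  else
    let more =
      bind p (fun x ->
        bind (repeat ~min:(Stdlib.max 0 (min - 1)) ~max:(max - 1) p)
          (fun xs -> ret (x :: xs))) in
    (* once min occurrences are consumed, stopping is also a solution *)
    if min > 0 then more else fun s -> ret [] s @ more s

let () =
  (* two 'a's available, min 1 and max 3: solutions are ['a'] and ['a';'a'] *)
  assert (List.length (repeat ~min:1 ~max:3 (item 'a') [ 'a'; 'a'; 'b' ]) = 2)
```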

Notice also that repeat builds a whole list before sending it to the next stage. We’d like to do away with this list, which more often than not will be mapped into something else. A variant of lazy lists would likely be preferable here (as in other places).

Parser signature
val repeat :
  sep:'d t -> ?min:int -> ?max:int -> ?what:string -> 'b t -> ('b list) t

We’d like to do away with the mandatory sep parameter using a default value of return (), but that would prevent the OCaml compiler from inferring that, since sep’s result is consistently discarded, any result type would do. A simpler example of this using the REPL:

# let f ?sep x = x ;;
val f : ?sep:'a -> 'b -> 'b = <fun>
# let f ?(sep=42) x = x;;
val f : ?sep:int -> 'a -> 'a = <fun>

Therefore we merely provide this short do-nothing constant parser to be used when there is no separator:

Parser library: none
let none m = return () m

You may be surprised by this notation, either because you were expecting let none corr rest = return () corr rest or the shorter let none = return (). Refer to the section about type generalization if that is the case.

Parser signature
val none : unit t

We can easily define the greedy version of repeat (that is, a version that swallows as many p occurrences as present in the input stream) using nay:

(erroneous) greedy repetition
let repeat_greedy ~sep ?min ?max ?what p =
  repeat ~sep ?min ?max ?what p +- nay (sep -+ p)

…​which unfortunately fails for min=0 because of the separator. We have to be more careful and return [] only when the input stream does not start with p:

Parser library: greedy repetition
let rec repeat_greedy ~sep ?min ?max ?what p =
  match min with
  | None | Some 0 ->
    repeat_greedy ~sep ~min:1 ?max ?what p |||
    (nay p >>: fun () -> [])
  | min ->
    repeat ~sep ?min ?max ?what p +- nay (sep -- p)
Parser signature
val repeat_greedy :
  sep:'d t -> ?min:int -> ?max:int -> ?what:string -> 'b t -> ('b list) t
tests
"repetition: canonical successes" >:: (
  fun _ctx ->
    let assert_ok ?(greedy=false) ~sep ?min ?max rest exp =
      assert_same_results (List.print Char.print) exp
        ((if greedy then repeat_greedy else repeat)
           ~sep ?min ?max (item 'a') [] None no_corr rest) in
    let test_with_sep sep sep_len rest =
      assert_ok ~sep rest
        (None,
         [['a';'a';'a'], no_corr, drop ~sep_len 3 rest ;
          ['a';'a'],     no_corr, drop ~sep_len 2 rest ;
          ['a'],         no_corr, drop ~sep_len 1 rest ;
          [],            no_corr, rest]) ;
      (* Same with min=2 *)
      assert_ok ~sep ~min:2 rest
        (None,
         [['a';'a';'a'], no_corr, drop ~sep_len 3 rest ;
          ['a';'a'],     no_corr, drop ~sep_len 2 rest]) ;
      (* Testing max=2 *)
      assert_ok ~sep ~max:2 rest
        (None,
         [['a';'a'],     no_corr, drop ~sep_len 2 rest;
          ['a'],         no_corr, drop ~sep_len 1 rest;
          [],            no_corr, rest]) ;
      (* Now with min and max *)
      assert_ok ~sep ~min:1 ~max:2 rest
        (None,
         [['a';'a'],     no_corr, drop ~sep_len 2 rest ;
          ['a'],         no_corr, drop ~sep_len 1 rest]) ;
      (* min = max *)
      assert_ok ~sep ~min:2 ~max:2 rest
        (None, [['a';'a'],     no_corr, drop ~sep_len 2 rest]) in
    let aaab = stream_of_string "aaab"
    and a_a_a_b = stream_of_string "a_a_a_b"
    and _a_a_a_b = stream_of_string "_a_a_a_b" in
    test_with_sep none       0 aaab ;
    test_with_sep underscore 1 a_a_a_b ;
    assert_ok ~greedy:true ~sep:none aaab
      (None,
       [['a';'a';'a'], no_corr, drop 3 aaab]) ;
    assert_ok ~greedy:true ~sep:underscore a_a_a_b
      (None,
       [['a';'a';'a'], no_corr, drop 5 a_a_a_b]) ;
    (* Do not allow a separator at start *)
    assert_ok ~greedy:true ~sep:underscore _a_a_a_b
      (None, [[], no_corr, _a_a_a_b])
) ;
"repetition: simplest failure" >:: (
  fun _ctx ->
    assert_same_results (List.print Char.print)
      (Some { what = ["a"] ; where = 0 }, [])
      (repeat ~sep:none ~min:1 (item 'a') [] None no_corr (stream_of_string "zaab")) ;
    assert_same_results (List.print Char.print)
      (Some { what = ["a"] ; where = 0 }, [])
      (repeat_greedy ~sep:none ~min:1 (item 'a') [] None no_corr (stream_of_string "zaab"))
) ;
"repetition: missing separator" >:: (
  fun _ctx ->
    assert_same_results (List.print Char.print)
      (Some { what = ["-"] ; where = 3 }, [])
      (repeat ~sep:(item '-') ~min:3 (item 'a') [] None no_corr (stream_of_string "a-aab")) ;
    assert_same_results (List.print Char.print)
      (Some { what = ["-"] ; where = 3 }, [])
      (repeat_greedy ~sep:(item '-') ~min:3 (item 'a') [] None no_corr (stream_of_string "a-aab"))
) ;
"repetition: below min" >:: (
  fun _ctx ->
    assert_same_results (List.print Char.print)
      (Some { what = ["a"] ; where = 2 }, [])
      (repeat ~sep:none ~min:3 (item 'a') [] None no_corr (stream_of_string "aab")) ;
    assert_same_results (List.print Char.print)
      (Some { what = ["a"] ; where = 2 }, [])
      (repeat_greedy ~sep:none ~min:3 (item 'a') [] None no_corr (stream_of_string "aab"))
) ;

Some variants of repeat can now be defined:

Parser library: repeat variants
let several ~sep = repeat ~sep ~min:1
let several_greedy ~sep = repeat_greedy ~sep ~min:1
let times n = repeat ~min:n ~max:n
Parser signature
val several : sep:'d t -> ?max:int -> ?what:string -> 'b t -> ('b list) t
val several_greedy : sep:'z t -> ?max:int -> ?what:string -> 'b t -> ('b list) t
val times : int -> sep:'z t -> ?what:string -> 'b t -> ('b list) t

With all these new combinators, more interesting tests can be devised:

tests
"several combinators bound together" >:: (
  fun _ctx ->
    let p = decimal_digit >>= (fun c ->
      let i = Char.code c - Char.code '0' in
      assert_bool "not a digit" (i >= 0 && i <= 9) ;
      (* match a sequence of i zeros *)
      times ~sep:none i (item '0')) in
    let rest1 = stream_of_string "105"
    and rest2 = stream_of_string "100"
    and rest3 = stream_of_string "30005"
    and rest4 = stream_of_string "3005" in
    assert_same_results (List.print Char.print)
      (None, [['0'], no_corr, drop 2 rest1])
      (p [] None no_corr rest1) ;
    assert_same_results (List.print Char.print)
      (None, [['0'], no_corr, drop 2 rest2])
      (p [] None no_corr rest2) ;
    assert_same_results (List.print Char.print)
      (None, [['0';'0';'0'], no_corr, drop 4 rest3])
      (p [] None no_corr rest3) ;
    assert_same_results (List.print Char.print)
      (Some { what = ["0"] ; where = 3 }, [])
      (p [] None no_corr rest4)
) ;
Parser library: trivial parsers and utilities
let replace x _ = x

let anything ?(what="anything") m e c s =
  let m = what::m in
  ct_bind (take s) (function
    | pos, EndOfStream, _s' ->
      let m = "unexpected end of stream"::m in
      let e' = new_error pos m e in
      fail m e' c s
    | _, Item tok, s' ->
      return tok m e c s')

Notice that anything can only fail at end of input.

Parser signature
val replace : token -> 'b -> token
val anything : ?what:string -> token t

8. Miscellaneous

8.1. Improving error reporting

We have touched on this topic earlier, when devising the alternative function oneof: we made it so that when both alternatives fail at the very start, the error reported is about the alternative, not the last failing term. We also acknowledged that in some cases this restriction about failing right from the first token should be relaxed.

We are now going to show a way to do this.

Intuitively, some parsers’ failures are not significant unless they managed to parse at least a sizeable number of tokens. A first parser combinator that comes to mind would thus, given a parser p, record the starting position and current best error before trying p on the input. Should p fail, it could unwind the error unless it meets some criterion, for instance that the position must have reached either a minimum distance from the starting position or EndOfStream.

Parser library: conditional dismissing of errors
let dismiss_error_if cond p m e c s =
  let start_err = e in
  ct_bind (p m e c s) (fun (e', r as res) ->
    ct_bind (take s) (fun (start_pos, _, _) ->
      if e' <> start_err && cond start_pos (Option.get e').where then
        ct_return (start_err, r)
      else
        ct_return res))

…​where cond is the condition dictating the dismissal of the new error; a simple implementation can just consider how many inputs have been successfully consumed before the failure:

Parser library: convenience functions to judge error significance
let parsed_fewer_than len pos1 pos2 =
  distance pos1 pos2 < len

Corresponding types have to be declared in the signature as well:

Parser signature
val dismiss_error_if : (position -> position -> bool) -> 'a t -> 'a t
val parsed_fewer_than : int -> position -> position -> bool

8.2. Type generalization

Let’s get back to why we haven’t defined none simply as let none = return (), letting automatic currying lighten the syntax:

# let none = return ();;
             ^^^^^^^^^
Error: The type of this expression, ('_a, unit, '_b) t,
       contains type variables that cannot be generalized

This is actually a limitation of the OCaml compiler. Here is what’s happening: normally, in an expression like let name = expr, expr will be typed first, leading in this case, where expr is actually return (), to the type ('_a, unit, '_b) t, where '_a and '_b are ``weak types'' (refer to the definition of return: it’s merely a function of 3 parameters returning a list of the triplet of these 3 parameters). Once expr is typed, OCaml follows this rule: if expr is a function (as in function ... ->), a constant or an identifier then generalize the weak types into universal types (the more familiar 'a, 'b etc). If expr is anything fancier, though, such as a partial application as is the case here, then do not generalize.

If instead we had let name params... = expr then, given it’s syntactic sugar for let name = function ... -> expr then the ``weak types'' would have been generalized.

So we have to make this look more like a function, by making at least one parameter explicit (a process famously known under the tickling name ``eta-expansion'').

This feels arbitrary because it is; apparently this is one of the minor disadvantages of a typing rule that has plenty of other advantages, such as simplifying something that’s already quite complex. See the OCaml FAQ for more details.

This is unfortunately going to hit us a lot when defining parser combinators because we’d like to do away with the many meaningless and repetitive parameters whose presence just obfuscates the intent of the code. C’est la vie.
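The same phenomenon can be reproduced outside the parser, for instance with List.map:

```ocaml
(* The value restriction at work with List.map.  The partial application
   (* let weak_id = List.map (fun x -> x) *)
   would get the weak type '_weak1 list -> '_weak1 list, usable at a
   single type only.  The eta-expanded version is fully polymorphic: *)
let poly_id l = List.map (fun x -> x) l

let () =
  assert (poly_id [ 1; 2 ] = [ 1; 2 ]) ;
  assert (poly_id [ "a" ] = [ "a" ])  (* ...so usable at another type *)
```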

8.3. Bounded set

We still have to provide an implementation for our set of fixed maximum size. The simplest implementation is that of a list with a current size:

ParsersBoundedSet.ml: type
type 'a t =
  { size : int ;
    max_size : int ;
    items : 'a list }

With the trivial constructor:

ParsersBoundedSet.ml: constructor
let make max_size =
  { size = 0 ; max_size ; items = [] }

And the only three operations we’ve met so far:

ParsersBoundedSet.ml: operations
let is_full t = t.size >= t.max_size
let is_empty t = t.size = 0

let add t x =
  { t with size = t.size + 1 ;
           items = x::t.items }
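A quick check of these operations (the definitions above are repeated so the snippet compiles on its own; note that add itself does not check is_full, callers are expected to test it first):

```ocaml
(* Self-contained copy of the ParsersBoundedSet definitions, with a
   usage check.  add does not enforce the bound itself: callers are
   expected to consult is_full first. *)
type 'a t = { size : int ; max_size : int ; items : 'a list }

let make max_size = { size = 0 ; max_size ; items = [] }
let is_full t = t.size >= t.max_size
let is_empty t = t.size = 0
let add t x = { t with size = t.size + 1 ; items = x :: t.items }

let () =
  let s = make 2 in
  assert (is_empty s && not (is_full s)) ;
  let s = add (add s 'x') 'y' in
  assert (is_full s && s.items = [ 'y'; 'x' ])
```

Note that a set of maximum size 0, as used by no_error_correction below, is full from the start.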

It would also be convenient to provide a simple shortcut in Parsers for cases where no error detection is required:

Parsers helpers: no error detection
let no_error_correction = ParsersBoundedSet.make 0

8.4. Printers

If there is something annoying about OCaml it’s the lack of default printers for types. Batteries provides dump but it is oblivious to constructors so the result is not pretty. So let’s write our own printers.

It would be best to provide formatters instead of mere printers to benefit from automatic typesetting, but unfortunately Batteries’ support for those is minimal, so it’s better to forget about formatters to cut down on typing.

First, parser configuration must supply printers of tokens, positions and streams:

Parser configuration:
val print_token : 'o BatInnerIO.output -> token -> unit
val print_position : 'o BatInnerIO.output -> position -> unit
val print_stream : 'o BatInnerIO.output -> stream -> unit

With a printer for ParsersBoundedSet.t we could also print corrections:

ParsersBoundedSet.ml: printer
open Batteries

let print print_value fmt t =
  List.print print_value fmt t.items
ParsersMisc.ml: printer
open Batteries

let print_stream_item print_token fmt = function
  | EndOfStream -> String.print fmt "end of input"
  | Item c -> print_token fmt c
ParsersCorrections.ml: printers
open Batteries

let print_correction print_position print_token fmt (pos, tok, msg) =
  Printf.fprintf fmt "%s at %a (near '%a')"
    msg
    print_position pos
    (print_stream_item print_token) tok

let print_corrections print_position print_token fmt corr =
  ParsersBoundedSet.print (print_correction print_position print_token) fmt corr

With all this we can print errors and results:

Parser library: printers
let print_error_context fmt =
  let cannot_find fmt = function
    | "eof" -> String.print fmt "Was expecting end-of-file"
    | s -> Printf.fprintf fmt "Cannot find %s" s in
  function
  | [] ->
    String.print fmt "No context known. This is bad. Good luck!"
  | [x] ->
    cannot_find fmt x
  | [x; rest] ->
    Printf.fprintf fmt "%a while looking for %s"
      cannot_find x rest
  | x :: next :: rest ->
    Printf.fprintf fmt "%a while looking for %s %a"
      cannot_find x next
      (List.print ~first:"(in " ~last:")" ~sep:" in " String.print) rest

let print_error fmt = function
  | None -> Printf.fprintf fmt "Ok"
  | Some e ->
    Printf.fprintf fmt "Error at %a: %a"
      print_position e.where
      print_error_context e.what

let print_possible_result print_output fmt (x, corr, rest) =
  Printf.fprintf fmt "(output=%a,corrections=%a,rest=%a)"
    print_output x
    (print_corrections print_position print_token) corr
    print_stream rest

let print_possible_results print_output fmt (e, r) =
  Printf.fprintf fmt "%a, %a"
    print_error e
    (List.print (print_possible_result print_output)) r

let print_bad_result print_output fmt = function
  | Approximation (b, c, _s) ->
    Printf.fprintf fmt "Approximately: %a (corrections: %a)"
      print_output b
      (print_corrections print_position print_token) c
  | Ambiguous lst ->
    Printf.fprintf fmt "Ambiguous: %a"
      (List.print (fun fmt (b, c, _s) ->
        Printf.fprintf fmt "%a (corrections: %a)"
          print_output b
          (print_corrections print_position print_token) c)) lst
  | NoSolution e ->
    print_error fmt e

let print_result print_output fmt = function
  | Ok (b, _s) -> print_output fmt b
  | Error e -> print_bad_result print_output fmt e
Parser signature
val print_error : 'o BatInnerIO.output -> error option -> unit

val print_error_context : 'o BatInnerIO.output -> string list -> unit

val print_possible_result :
  ('o BatInnerIO.output -> 'b -> unit) ->
  'o BatInnerIO.output ->
  'b possible_result -> unit

val print_possible_results :
  ('o BatInnerIO.output -> 'b -> unit) ->
  'o BatInnerIO.output ->
  (error option * 'b possible_result list) -> unit

val print_bad_result :
  ('o BatInnerIO.output -> 'b -> unit) ->
  'o BatInnerIO.output ->
  'b failure -> unit

val print_result :
  ('o BatInnerIO.output -> 'b -> unit) ->
  'o BatInnerIO.output ->
  ('b * stream, 'b failure) result -> unit

8.5. Simple string scanners

Thanks to the stream_of_string function we can also devise a very simple-to-use of_string function for any parser, which will come in handy for testing and other simple circumstances.

Parser library: printers
let of_string p str =
  let s = stream_of_string str in
  ct_bind
    ((p +- eof) [] None no_error_correction s |> to_result) (function
    | Error e -> ct_return (Error e)
    | Ok (res, _rest) -> ct_return (Ok res))

Notice that since we force the parser to reach the end of string we can discard from the result the unparsed rest of the stream, which will always be empty.

Simplifying further, this exception-raising variant will just return the result without further ado. Notice that, for the sake of simplicity, we don’t require a print_output argument, so the error report will not be as detailed as the one given by print_result:

Parser library: printers
exception ParseError of string

let of_string_exn p str =
  ct_bind (of_string p str) (function
  | Error (Approximation _) -> raise (ParseError "Approximate result")
  | Error (Ambiguous _)     -> raise (ParseError "Ambiguous result")
  | Error (NoSolution e)    -> raise (ParseError (IO.to_string print_error e))
  | Ok x -> ct_return x)
Parser signature
exception ParseError of string
val of_string : 'b t -> string -> ('b, 'b failure) result ct
val of_string_exn : 'b t -> string -> 'b ct

9. Parsing usual things

It might come in handy to have some ready-made parsers for common things such as words, numbers, etc…​ We will group those in a ParsersUsual module parameterized by a Parsers module for characters:

ParsersUsual.ml: Parsers for usual things
open Batteries
module Make (P : Parsers.S with type token = char) =
struct
  open P
  (* ...usual parsers... *)
end

…​that we will test along with the Parsers module:

other tested modules
module SimpleUsual = ParsersUsual.Make (P)
open SimpleUsual

9.1. Character types

It is common to check for various classes of character: blanks, numerics, alphanumerics, newlines…​

usual parsers: character classes
let blank m =
  cond "blank" (fun c -> c = ' ' || c = '\t') ' ' m

let carriage_return m = item ~what:"carriage return" '\r' m
let new_line m = item ~what:"new line" '\n' m

let newline m =
  (optional ~def:'\r' carriage_return -+ new_line) m

let whitespace m =
  repeat_greedy ~min:1 ~sep:none ~what:"whitespaces" (blank ||| newline) m

let opt_whitespace m =
  optional_greedy ~def:[] whitespace m

Notice we read the whitespace greedily because we want to avoid a whitespace -- whitespace ambiguity. optional_greedy is there for the same reason.

usual parsers: more character classes
let lowercase m = range 'a' 'z' "lowercase" m
let uppercase m = range 'A' 'Z' "uppercase" m
let letter m = (lowercase ||| uppercase) m
let decimal_digit m = range '0' '9' "digit" m
let alphanum m = (letter ||| decimal_digit) m

9.2. Strings

Starting from the char parser, which is item specialized to characters:

usual parsers: chars
let quoted s = Printf.sprintf "%S" s

let char ?what ?(case_sensitive=true) c =
  let other_case c =
    if Char.is_lowercase c then Char.uppercase c else
    if Char.is_uppercase c then Char.lowercase c else c in
  let c' = if case_sensitive then c else other_case c in
  let expl =
    Option.default_delayed (fun () -> quoted (String.of_char c)) what in
  cond expl (fun x -> x = c || x = c') c

…​the parser matching a given string can be written:

usual parsers: strings
open ParsersMisc

let string ?case_sensitive s =
  let rec loop i =
    if i >= String.length s then return ()
    else (
      (char ~what:(quoted s) ?case_sensitive s.[i]) -- (loop (i+1))
    ) in
  loop 0
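The recursion in loop mirrors this standalone character-by-character matcher (a hypothetical sketch returning a mere boolean instead of a parser):

```ocaml
(* Hypothetical standalone mirror of the loop recursion above: succeed
   once every character of the pattern has been matched in order. *)
let string_matches pat input =
  let rec loop i j =
    i >= String.length pat ||
    (j < String.length input && pat.[i] = input.[j] && loop (i + 1) (j + 1))
  in
  loop 0 0

let () =
  assert (string_matches "ab" "abc") ;
  assert (not (string_matches "ab" "a"))
```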

One often wants to parse string literals (represented as C-like quoted strings):

usual parsers: quoted strings
let quoted_char ?(base_num=8) ?(what="character") =
  let digit base_num = cond_map "digit" (fun c ->
    let check n = if n >= 0 && n < base_num then Some n else None in
    if c >= '0' && c <= '9' then check (Char.code c - Char.code '0') else
    if c >= 'a' && c <= 'z' then check (10 + Char.code c - Char.code 'a') else
    if c >= 'A' && c <= 'Z' then check (10 + Char.code c - Char.code 'A') else
    None) 0 in
  let hexdigit = digit 16 and digit = digit base_num
  in
  char ~what:"escape sequence" '\\' -+
    ((cond_map "escaped character" (function
     | 'n' -> Some '\n'  | 'r' -> Some '\r'  | 'b' -> Some '\b'
     | 't' -> Some '\t'  | '\\' -> Some '\\' | '"' -> Some '"'
     | '\'' -> Some '\''
     | c when (c < '0' || c > '9') && c <> 'x' -> Some c
     | _ -> None) 'n') |||
    ((* numeric escape sequence of 3 digits in base base_num: *)
     repeat ~sep:none ~what:"numeric escape sequence" ~min:3 ~max:3 digit >>:
       fun lst ->
         let n =
           List.fold_left (fun s c -> s * base_num + c) 0 lst in
         Char.chr n) |||
    ((* hexanumeric escape sequence: *)
     char ~what:"x" 'x' -+
     repeat ~sep:none ~what:"hexanumeric escape sequence" ~min:2 ~max:2 hexdigit >>:
       fun lst ->
        let n =
          List.fold_left (fun s c -> s * 16 + c) 0 lst in
        Char.chr n)) |||
  cond what (fun c -> c <> '\\' && c <> '"') 'x'

let quoted_string ?base_num =
  char ~what:"opening quote" '"' -+
  repeat_greedy ~sep:none ~what:"a quoted string" (quoted_char ?base_num) +-
  char ~what:"closing quote" '"' >>: String.of_list

Notice that if the base for the numeric escape sequences is greater than 10 then we can no longer be sure how to parse "\b", for instance.
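The ambiguity can be demonstrated with a small standalone sketch, independent of the library (digit_value is a hypothetical helper, not part of the parser): in base 8 the character 'b' is not a digit, so "\b" can only denote backspace, while in base 12 it is also a valid digit of value 11:

```ocaml
(* A standalone sketch (independent of the library) of the ambiguity:
   digit_value is a hypothetical helper returning the value of a digit
   in the given base, or None if the character is not a digit. *)
let digit_value base c =
  let v =
    if c >= '0' && c <= '9' then Char.code c - Char.code '0'
    else if c >= 'a' && c <= 'z' then 10 + Char.code c - Char.code 'a'
    else max_int in
  if v < base then Some v else None

let () =
  (* With base_num = 8, 'b' is not a digit: "\b" can only be backspace. *)
  assert (digit_value 8 'b' = None) ;
  (* With base_num = 12, 'b' is also a digit of value 11: "\b" is now
     ambiguous between backspace and a numeric escape digit. *)
  assert (digit_value 12 'b' = Some 11)
```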

With some tests for good measure:

tests
"quoted strings" >:: (
  fun _ctx ->
    [ "\"abc\"", "abc" ;
      "\"\\x61\\x62\\x63\"", "abc" ] |>
    List.iter (fun (input, output) ->
      assert_same_results String.print
        (None, [output, no_corr, (String.length input, [])])
        ((quoted_string +- eof) [] None no_corr (stream_of_string input)))
) ;

Another usual offender is parsing C-like identifiers (i.e. words made of letters, digits or underscores, but not starting with a digit):

usual parsers: identifier
let underscore m = item ~what:"underscore" '_' m

let identifier ?(what="identifier") =
  let first_char = letter ||| underscore in
  let any_char = first_char ||| decimal_digit in
  first_char ++ repeat_greedy ~sep:none ~what any_char >>: fun (c, s) ->
    String.of_list (c :: s)

9.3. Numbers

We will try to follow the most common conventions for parsing numbers. Notice that a simple base-10 integer must start with a non-zero digit (otherwise it is octal). We make no exception for the single digit '0', which will be parsed as octal.

We use the Num module from the standard library to represent arbitrary-precision integers, as we only need it to store numbers (if we cared about the efficiency of arithmetic operations we would use zarith instead).

usual parsers: integers
type integer = Num.num

let non_zero_decimal_digit m =
  range '1' '9' "non-zero digit" m

let num_of_char c =
  let cc = Char.code c in
  if cc >= Char.code '0' && cc <= Char.code '9' then
    Num.num_of_int (cc - Char.code '0')
  else if cc >= Char.code 'a' && cc <= Char.code 'f' then
    Num.num_of_int (cc - Char.code 'a' + 10)
  else if cc >= Char.code 'A' && cc <= Char.code 'F' then
    Num.num_of_int (cc - Char.code 'A' + 10)
  else invalid_arg "c"

let unsigned_decimal_number ?what ?(inc_zero=true) m =
  let ten = Num.num_of_int 10 in
  let m = may_add_context m what in
  let digits m = several_greedy ~sep:none decimal_digit m in
  ((if inc_zero then decimal_digit else non_zero_decimal_digit) +-
   optional ~def:' ' underscore ++
   repeat_greedy ~sep:underscore digits >>:
   fun (first, next) ->
   List.fold_left (fun c digits ->
     List.fold_left (fun c digit ->
       Num.add (Num.mul c ten) (num_of_char digit)) c digits) Num.zero ([first]::next)) m

let signed neg p =
  p                                |||
  item ~what:"sign" '+' -+ p       |||
  (item ~what:"sign" '-' -+ p >>: neg)

let decimal_number ?(inc_zero=true) m =
  signed Num.minus_num (unsigned_decimal_number ~inc_zero) m

with the help of:

ParsersMisc.ml: helper for enriching context
let may_add_context m = function
  | None -> m
  | Some w -> w::m

We have made num_of_char accept hexadecimal digits with foresight.

Octal, hexadecimal and binary numbers are then built similarly: a mandatory prefix, then some digits interleaved with underscores. Notice that only the prefix is mandatory, so that '0x' for instance is a valid immediate (representing zero, of course), as in the Perl language.

usual parsers: non decimal integers
let non_decimal_integer base prefix digit =
  let base = Num.num_of_int base in
  let digits m = several_greedy ~sep:none digit m in
  prefix -+ repeat ~sep:underscore ~what:"digits" digits >>:
     List.fold_left (fun c digits ->
       List.fold_left (fun c digit ->
         Num.add (Num.mul c base) (num_of_char digit)) c digits) Num.zero

let octal_digit m =
  range '0' '7' "octal digit" m

let octal_number m =
  (non_decimal_integer 8 (item '0') octal_digit |>
   signed Num.minus_num) m

let hexadecimal_digit m =
  cond "hexadecimal digit" (fun c ->
    (c >= '0' && c <= '9') ||
    (c >= 'a' && c <= 'f') ||
    (c >= 'A' && c <= 'F')) '1' m

let non_decimal_integer_prefix x =
  item '0' --
  cond "integer prefix" (fun c -> Char.lowercase c = x) x

let hexadecimal_number m =
  let prefix = non_decimal_integer_prefix 'x' in
  (non_decimal_integer 16 prefix hexadecimal_digit |>
   signed Num.minus_num) m

let binary_digit m =
  range '0' '1' "bit" m

let binary_number m =
  let prefix = non_decimal_integer_prefix 'b' in
  (non_decimal_integer 2 prefix binary_digit |>
   signed Num.minus_num) m

Finally, this parser can parse all kinds of integers seen so far:

usual parsers: any integer
let integer m =
  (decimal_number ~inc_zero:false |||
   octal_number                   |||
   hexadecimal_number             |||
   binary_number) m
tests
"integer immediate" >:: (
  fun _ctx ->
    [ "4", Num.num_of_int 4 ;
      "42", Num.num_of_int 42 ;
      "12345", Num.num_of_int 12345 ;
      "4_294_967_296", Num.num_of_int 4_294_967_296 ;
      "042", Num.num_of_int 0o42 ;
      "0x42", Num.num_of_int 0x42 ;
      "0X42", Num.num_of_int 0x42 ;
      "0xff", Num.num_of_int 0xff ;
      "0b10", Num.num_of_int 0b10 ;
      "0x", Num.num_of_int 0 ;
      "0x4_2", Num.num_of_int 0x4_2 ;
      "-4", Num.num_of_int ~-4 ;
      "+4", Num.num_of_int 4 ;
      "-042", Num.num_of_int ~-0o42 ;
      "+042", Num.num_of_int 0o42 ;
      "-0x42", Num.num_of_int ~-0x42 ;
      "-0b10", Num.num_of_int ~-0b10 ] |>
    List.iter (fun (input, output) ->
      assert_same_results Num.print
        (None, [output, no_corr, (String.length input, [])])
        ((integer +- eof) [] None no_corr (stream_of_string input)))
) ;
"not decimal number immediate" >:: (
  fun _ctx ->
    [ "0_" ; "0X_" ; "_123" ; "123_" ; "12__34" ; "_" ; "_0x123" ;
      "-0_" ; "-_42" ; "+" ; "-" ; "" ] |>
    List.iter (fun input ->
      assert_same_results Num.print
        (None, [])
        ((integer +- eof) [] None no_corr (stream_of_string input)))
) ;

The syntax for floating point numbers is more perly. Indeed, in addition to the usual decimal and scientific notations, Perl allows hexadecimal floating point, with a power of two as the exponent (and a "p" instead of an "e" to introduce the exponent, for obvious reasons).

Also, notice that you can omit either the integer or the fractional part but not both.

Finally, we also accept well known strings for NaN and infinity special floating point values.

usual parsers: floating point
let fractional_part inv_base digit =
  let digits m = several ~sep:none digit m in
  several ~sep:underscore digits >>: fun digits ->
    List.fold_left (fun c_scale digits ->
        List.fold_left (fun (c, scale) digit ->
            let n = num_of_char digit |> Num.to_float in
            c +. n *. scale, scale *. inv_base
          ) c_scale digits
      ) (0., inv_base) digits |>
    fst

let unsigned_decimal_fractional m =
  let dot m = item ~what:"fractional dot" '.' m in
  ((unsigned_decimal_number +- dot ++ fractional_part 0.1 decimal_digit) |||
   (return Num.zero +- dot ++ fractional_part 0.1 decimal_digit)         |||
   (unsigned_decimal_number +- dot ++ return 0.) >>:
       fun (n, p) -> Num.to_float n +. p
  ) m

let decimal_fractional m =
  signed Float.neg unsigned_decimal_fractional m

let decimal_scientific m =
  ((decimal_fractional |||
    (decimal_number >>: Num.to_float)) +-
   cond "exponent delimiter" (fun c -> c = 'e' || c = 'E') 'e' ++
   decimal_number >>: fun (m, e) ->
     m *. Float.pow 10. (Num.to_float e) (* FIXME *)
   ) m

let hexadecimal_scientific m =
  let dot m = item ~what:"fractional dot" '.' m in
  let sep = return () in
  let sign =
    optional ~def:'+' (item ~what:"sign" '+' ||| item ~what:"sign" '-') in
  (sign +-
   non_decimal_integer_prefix 'x' ++
   repeat_greedy ~sep hexadecimal_digit ++
   optional ~def:[] (dot -+ several_greedy ~sep hexadecimal_digit) +-
   cond "exponent delimiter" (fun c -> c = 'p' || c = 'P') 'p' ++
   sign ++
   several_greedy ~sep decimal_digit >>:
   fun ((((s1, m), n), s2), e) ->
     let print_lst oc lst =
       List.print ~first:"" ~last:"" ~sep:"" Char.print oc lst in
     Printf.sprintf2 "%c0x%a.%ap%c%a"
       s1
       print_lst m
       print_lst n
       s2
       print_lst e |>
     Float.of_string
  ) m

let special_float m =
  ((string ~case_sensitive:false "nan" >>:
      fun () -> Float.nan)      |||
   ((string ~case_sensitive:false "inf" |||
     string ~case_sensitive:false "+inf") >>:
      fun () -> Float.infinity) |||
   (string ~case_sensitive:false "-inf" >>:
      fun () -> Float.neg_infinity)
  ) m

let floating_point m =
  (decimal_fractional     |||
   decimal_scientific     |||
   hexadecimal_scientific |||
   special_float) m
tests
"floating point notation" >:: (
  fun _ctx ->
    [ "3.14", 3.14 ;
      "-3.14", -3.14 ;
      "314e2", 31400. ;
      "314e-2", 3.14 ;
      ".1", 0.1 ;
      "1.", 1.0 ;
      "0x1.1p1", 2.125 ;
      "0x1.fp3", 15.5 ;
      "-0XA.BCp-2", -2.68359375 ;
      "inf", Float.infinity ;
      "-inf", Float.neg_infinity ] |>
    List.iter (fun (input, output) ->
      assert_same_results Float.print
        (None, [output, no_corr, (String.length input, [])])
        ((floating_point +- eof) [] None no_corr (stream_of_string input)))
) ;

And finally the parser of any immediate number:

usual parsers: any number
type number = Int of integer
            | Float of float
let number m =
  ((integer        >>: fun x -> Int x) |||
   (floating_point >>: fun x -> Float x)) m

9.4. Operations

Operators are another frequent occurrence. Of course how to parse an "operation" is likely to depend on the problem at hand, but it’s still useful to discuss them here if only to demonstrate how to deal with recursive rules.

Indeed, the straightforward way to define a parser for operations would rely on left recursion, which a combinatoric parser cannot perform. Instead, we will have to force progress by defining a chain of terms and subterms in order of precedence.

The principle of such a chain is to replace a left recursing definition such as:

let term m = (term +- any_binary_op ++ term) m

with:

let term1 m = ((term2 +- low_precedence_op ++ term2) ||| term2) m
let term2 m = ((term3 +- higher_precedence_op ++ term3) ||| term3) m
(* etc... *)

Recursion is allowed only after some input has been consumed:

let rec this_is_ok m =
  (item '{' -+ this_is_ok ++ item '}') m

let rec this_is_infinite_recursion m =
  (this_is_infinite_recursion ++ anything_else) m

let rec this_is_still_infinite_recursion m =
  (check some_check ++
   this_is_still_infinite_recursion) m

Now this chain will always parse the left side first. While 1 + 2 * 3 will properly be parsed as 1 + (2 * 3) (because the parse would fail if term1 consumed only 1 + 2), the simple 3 - 2 - 1 would be erroneously parsed as 3 - (2 - 1) instead of (3 - 2) - 1. To help with left-associative operators, we need to group operators of the same precedence and associativity and use a repeat parser, whose associativity we are free to choose.
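The two groupings can be sketched with a plain fold, independent of the library (reduce_chain is a hypothetical helper): given a first term and a list of (operator, term) pairs, folding to the left yields (3 - 2) - 1 while folding to the right yields 3 - (2 - 1):

```ocaml
(* A standalone sketch (assuming nothing from the library) of combining a
   first term with a list of (operator, term) pairs, with either
   associativity. reduce_chain is a hypothetical helper. *)
let reduce_chain ~right_associative ~reduce first rest =
  if right_associative then
    (* Fold from the right: 3 - 2 - 1 becomes 3 - (2 - 1). *)
    let rec loop t = function
      | [] -> t
      | (op, t') :: rest -> reduce t op (loop t' rest) in
    loop first rest
  else
    (* Fold from the left: 3 - 2 - 1 becomes (3 - 2) - 1. *)
    List.fold_left (fun acc (op, t) -> reduce acc op t) first rest

let () =
  let reduce a op b = if op = '-' then a - b else a + b in
  (* 3 - 2 - 1 seen as a first term and a list of (op, term) pairs: *)
  let first, rest = 3, [ '-', 2 ; '-', 1 ] in
  assert (reduce_chain ~right_associative:false ~reduce first rest = 0) ;
  assert (reduce_chain ~right_associative:true ~reduce first rest = 2)
```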

Here is a binary_ops_reducer parser that takes a parser for binary operators of the same associativity and precedence (here called op) and a parser for terms (called term), and returns either the left or the right associative parser. It is expected that the term parser has higher precedence than op. It bears some resemblance to repeat, but does not discard the output of the separator (here: the operator) and, as a last stage, builds the final result out of the list of partial results, with the expected associativity. This situation occurs often enough in practice that it’s worth having a generic solution in the parser combinator library. It is made generic enough by the use of another parameter, the reduce function, which combines two terms and an operator result into a value of the same type as returned by term. Notice that this may force the user of binary_ops_reducer to lift the sub-term parser so that it returns a singleton term instead (if the sub-terms and terms do not share a common type).

Parser library: binary operations with selected associativity
let binary_ops_reducer ?(right_associative=false) ~op ~term ~sep ~reduce =
  term ++ optional_greedy ~def:[] (sep -+ repeat ~min:1 ~sep (op +- sep ++ term)) >>:
  fun (fst, lst) -> (* lst is a list of (op result * term result) *)
    let rec loop_lst last_term = function
      | [] -> last_term
      | (op, next_term)::rest ->
        if right_associative then
          reduce last_term op (loop_lst next_term rest)
        else
          loop_lst (reduce last_term op next_term) rest
        in
    loop_lst fst lst
Parser signature
val binary_ops_reducer :
  ?right_associative:bool ->
  op:'o t ->
  term:'b t ->
  sep:unit t ->
  reduce:('b -> 'o -> 'b -> 'b) ->
  'b t

Let’s see it in action:

tests
"binary_ops_reducer" >:: (
  fun _ctx ->
    let term m = (decimal_digit >>: fun c -> Term c) m in
    let op m = item '+' m in
    let reduce t1 _op t2 = Op (t1, t2) in
    [ "1+2",
        Op (Term '1', Term '2'),
        Op (Term '1', Term '2') ;
      "1+2+3",
        Op (Op (Term '1', Term '2'), Term '3'),
        Op (Term '1', Op (Term '2', Term '3')) ;
      "1+2+3+4",
        Op (Op (Op (Term '1', Term '2'), Term '3'), Term '4'),
        Op (Term '1', Op (Term '2', Op (Term '3', Term '4'))) ] |>
    List.iter (fun (input_str, exp1, exp2) ->
      (* exp1 is the expected result for left associative parsing and
         exp2 for right associative parsing. *)
      let input = stream_of_string input_str in
      assert_same_results ~msg:"left assoc." binary_ops_reducer_test_result_print
        (None, [exp1, no_corr, (String.length input_str, [])])
        ((binary_ops_reducer ~op ~term ~sep:none ~reduce ~right_associative:false
          +- eof) [] None no_corr input) ;
      assert_same_results ~msg:"right assoc." binary_ops_reducer_test_result_print
        (None, [exp2, no_corr, (String.length input_str, [])])
        ((binary_ops_reducer ~op ~term ~sep:none ~reduce ~right_associative:true
          +- eof) [] None no_corr input))
) ;

with type binary_ops_reducer_test_result defined globally, as required by OCaml:

other global functions or types for testing
type binary_ops_reducer_test_result =
    Term of Char.t
  | Op of (binary_ops_reducer_test_result *
           binary_ops_reducer_test_result)

let rec binary_ops_reducer_test_result_print fmt = function
  | Term c ->
     Printf.fprintf fmt "%c" c
  | Op (r1, r2) ->
     Printf.fprintf fmt "(%a+%a)"
       binary_ops_reducer_test_result_print r1
       binary_ops_reducer_test_result_print r2

Let’s also test the handling of precedence with a small calculator:

tests
"precedence and associativity" >:: (
  fun _ctx ->
    let value m = (decimal_digit >>: num_of_char) m in
    let reduce t1 op t2 =
      let op = match op with
        | '+' -> Num.add | '-' -> Num.sub
        | '*' -> Num.mul | '/' -> Num.div
        | '^' -> Num.pow | _ -> assert false in
      op t1 t2 in
    let rec left_assoc_low_prec m =
      binary_ops_reducer ~op:(item '+' ||| item '-')
                         ~term:left_assoc_high_prec
                         ~sep:none ~reduce m
    and left_assoc_high_prec m =
      binary_ops_reducer ~op:(item '*' ||| item '/')
                         ~term:right_assoc_higher_prec
                         ~sep:none ~reduce m
    and right_assoc_higher_prec m =
      binary_ops_reducer ~op:(item '^')
                         ~right_associative:true
                         ~term:left_assoc_highest_prec
                         ~sep:none ~reduce m
    and left_assoc_highest_prec m =
      (value |||
       item '(' -+ left_assoc_low_prec +- item ')') m in
    [ "0",       Num.num_of_int 0 ;
      "1+2",     Num.num_of_int 3 ;
      "1+2+3",   Num.num_of_int 6 ;
      "1+2+3+4", Num.num_of_int 10 ;
      "5-1",     Num.num_of_int 4 ;
      "5-4-1",   Num.num_of_int 0 ;
      "(5-4)-1", Num.num_of_int 0 ;
      "5-(4-1)", Num.num_of_int 2 ;
      "4^3^2",   Num.num_of_int 262144 ;
      "4^(3^2)", Num.num_of_int 262144 ;
      "(4^3)^2", Num.num_of_int 4096 ;
      "3*2+1",   Num.num_of_int 7 ;
      "1+3*2",   Num.num_of_int 7 ;
      "(1+3)*2", Num.num_of_int 8 ;
      "8/2/2",   Num.num_of_int 2 ] |>
    List.iter (fun (input_str, exp) ->
      let input = stream_of_string input_str in
      assert_same_results Num.print
        (None, [exp, no_corr, (String.length input_str, [])])
        ((left_assoc_low_prec +- eof) [] None no_corr input))
) ;

Hopefully this example gives some confidence that operators of any precedence and associativity can be parsed, despite using parser combinators.

10. Configurations

Here are provided some Parsers.CONFIG modules implementing more complex parsers than the SimpleConfig (which was tailored for unit tests), and a few simple position types.

10.1. Turning destructive reads into persistent streams

Let’s define a proper persistent stream type adapted to networking.

The stream should take its tokens from blocks chained together. Reception of a new block should append it at the end of the list (the pointer to the next block must be mutable), in such a way that when all streams are done with the first block of the chain no more pointers point to it and it can be reclaimed by the garbage collector.

Most of the time we will want bytes in there, but for generality we must provide streams of tokens. We will therefore ask for another configuration module providing the actual token container (maybe a Buffer.t, a bytes, an array or any indexed container) with an nth function returning the token at some designated index. This configuration module must also provide the function returning the (promise of a) next block (and the range we ought to parse). To limit functorization in end-user code we will assume that this function also takes an additional parameter representing the channel that is being read (such as a file descriptor).
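Here is a minimal standalone model of that design (not the library's actual types): a persistent cursor over a mutable, append-only chain of blocks. Copies of a cursor keep working, and transparently cross into blocks appended after the copy was taken:

```ocaml
(* Standalone sketch, independent of the library: an append-only chain of
   blocks with a mutable next pointer, read through immutable cursors. *)
type block = { data : string ; mutable next : block option }

(* Append a new block at the end of the chain. *)
let append b data =
  let rec last b = match b.next with None -> b | Some n -> last n in
  (last b).next <- Some { data ; next = None }

(* A cursor is a pure value: copying it costs nothing and every copy can
   be resumed independently. *)
type cursor = { block : block ; offset : int }

let rec take c =
  if c.offset < String.length c.block.data then
    Some (c.block.data.[c.offset], { c with offset = c.offset + 1 })
  else match c.block.next with
    | None -> None                              (* end of stream (so far) *)
    | Some b -> take { block = b ; offset = 0 } (* cross into next block *)

let tok c = match take c with Some (t, _) -> Some t | None -> None

let () =
  let b = { data = "ab" ; next = None } in
  assert (tok { block = b ; offset = 0 } = Some 'a') ;
  append b "c" ;  (* a new block "arrives" *)
  (* older cursors still work, and cross into the new block: *)
  assert (tok { block = b ; offset = 1 } = Some 'b') ;
  assert (tok { block = b ; offset = 2 } = Some 'c')
```

The real BlockList functor below follows the same shape, with the extra wrinkle that reaching the end of the chain triggers (and waits for) an actual read.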

ParsersConfig.ml: turn a destructive stream into a persistent one
open Batteries
open ParsersMisc

module type CONFIG = sig
  (* ...threading types... *)
  type token
  val print_token : 'o BatInnerIO.output -> token -> unit
  type buf
  val nth : buf -> int -> token
  type channel
  val read_buf : channel -> (buf * int * int) option ct
end

module BlockList (Conf : CONFIG) =
struct
  include Conf

  type block =
    { buffer : buf ;
      first : int ; (* <1> *)
      last : int ;
      mutable next_block : block option }

  type stream = Unstarted of { channel : channel ; mutable first_block : block option }
              | Started of { channel : channel ; block : block ; next : int (* <2> *) }
              | Finished

  let make_stream channel =
    Unstarted { channel ; first_block = None }

  let print_stream fmt = function
    | Unstarted _ -> Printf.fprintf fmt "Unstarted"
    | Started _  -> Printf.fprintf fmt "TODO"
    | Finished -> Printf.fprintf fmt "EOF"
  1. These first and last indexes are fixed. They tell us where in the buffer we should start and stop reading; this is not the stream pointer!

  2. This is the stream pointer.

Notice that the chain of blocks is not persistent (because of the mutable next pointer) but the stream itself is: a stream can be copied and reused later regardless of what other copies are doing.

So the stream reader can either read the next available token (wrapped into a ct) or wait until the next buffer to be ready:

ParsersConfig.ml: The take function
  let rec take = function
    | Finished as s ->
      ct_return ((), EndOfStream, s)
    | Started s as stream ->
      if s.next < s.block.last then
        ct_return (
          (),
          Item (nth s.block.buffer s.next),
          Started { s with next = s.next+1 })
      else (
        match s.block.next_block with
        | Some block ->
          take (Started { s with block = block ; next = block.first })
        | None ->

This is the problematic case. We have to request the next block and add it to the chain of blocks, so that other streams lagging behind us can also parse it in the future. Notice that since read_buf is a blocking operation, other threads could (and hopefully will) run while we wait. But none of those are competing parsers: although we use threads to freeze the parsing, at no point do we actually have more than one parsing thread; nowhere in the library have we spawned a new thread, and all other existing streams live in this thread's own stack. This is not to say that one cannot run two parsers simultaneously, but then, of course, they will have to parse different streams. In other words, nothing else in the running program must call our read_buf or we will miss some blocks. Hopefully that's obvious enough that nobody will try to do that.

So let’s call read_buf and wait, confident that nothing bad will happen:

ParsersConfig.ml: take next token from next block
          ct_bind (read_buf s.channel) (function
          | None ->
            ct_return ((), EndOfStream, Finished)
          | Some (buffer, first, last) ->
            let next_block = { buffer ; first ; last ;
                               next_block = None } in

Of course it costs nothing to check the above assumption:

ParsersConfig.ml: enqueue this block for laggers and retry
            assert (s.block.next_block = None) ;
            s.block.next_block <- Some next_block ;
            take stream
        ))

The case where the stream is not started yet is interesting: this stream has to be shareable as well, but we must have only one instance of the pointer to the next block (so that when one instance triggers the actual read, all other instances can find the data), and we do not want to keep a pointer to the head of the list forever (we want the list to be garbage collected). So we merely store a mutable pointer to the optional first block.

ParsersConfig.ml: take from the head of the stream
    | Unstarted u as stream ->
      (match u.first_block with
      | None ->
        ct_bind (read_buf u.channel) (function
        | None ->
          ct_return ((), EndOfStream, Finished)
        | Some (buffer, first, last) ->
          let next_block = { buffer ; first ; last ;
                             next_block = None } in
          assert (u.first_block = None) ;
          u.first_block <- Some next_block ;
          take stream)
      | Some block ->
        take (Started { channel = u.channel ; block ; next = block.first }))

We can now add a few more definitions in there so that this BlockList looks more like a Parsers.CONFIG:

ParsersConfig.ml: turning it into an actual Parsers CONFIG
  type position = unit
  let distance () () = 0
  let print_position fmt () =
    String.print fmt "some location"
  let stream_of_string _str =
    failwith "Not implemented"
end

10.2. LWT

LWT already provides a Lwt_stream.t type for streams, which you would expect to be persistent given the threading context. But those are actually destructive, and therefore useless for us. Let’s instead use the list of blocks constructed above. Here is an example implementation that reads bytes from a file descriptor:

ParsersLwtConfig.ml: file parsers for LWT
open Batteries

module FileReader :
  ParsersConfig.CONFIG with type token = char
                        and type 'a ct = 'a Lwt.t
                        and type channel = Lwt_unix.file_descr =
struct
  (* Lwt bindings: *)
  type 'a ct = 'a Lwt.t
  let ct_bind = Lwt.bind
  let ct_return = Lwt.return

  (* Reading characters: *)
  type token = char
  let print_token fmt c =
    Printf.fprintf fmt "%C" c

  (* From a file: *)
  type buf = bytes
  let nth = Bytes.get
  type channel = Lwt_unix.file_descr
  let read_buf channel =
    let max_len = 1024 in
    let buf = Bytes.create max_len in
    let%lwt r = Lwt_unix.read channel buf 0 max_len in
    Lwt.return (
      if r = 0 then None
      else Some (buf, 0, r))
end

If you prefer Lwt_io channels instead:

ParsersLwtConfig.ml: file parsers for LWT channels
module ChannelReader :
  ParsersConfig.CONFIG with type token = char
                        and type 'a ct = 'a Lwt.t
                        and type channel = Lwt_io.input_channel =
struct
  (* Same Lwt bindings as above: *)
  type 'a ct = 'a Lwt.t
  let ct_bind = Lwt.bind
  let ct_return = Lwt.return

  (* And same tokens *)
  type token = char
  let print_token fmt c =
    Printf.fprintf fmt "%C" c

  (* Same byte buffers: *)
  type buf = bytes
  let nth = Bytes.get

  (* But now we read from a Lwt_io.input_channel: *)
  type channel = Lwt_io.input_channel
  let read_buf channel =
    let max_len = 1024 in
    let buf = Bytes.create max_len in
    let%lwt r = Lwt_io.read_into channel buf 0 max_len in
    Lwt.return (
      if r = 0 then None
      else Some (buf, 0, r))
end

10.3. Blocking Unix CONFIG

This configuration assumes that it is OK to block when reading tokens (assuming the parser runs in its own POSIX thread, for instance). It’s given as a channel a mere Unix.file_descr from which bytes (the tokens) will be read.

ParsersConfig.ml: Reading a file using normal blocking Unix read
open Batteries

module FileReader :
  CONFIG with type token = char
          and type 'a ct = 'a
          and type channel = Unix.file_descr =
struct
  (* No continuation passing trickery needed: *)
  type 'a ct = 'a
  let ct_bind x f = f x
  let ct_return x = x

  (* Reading characters: *)
  type token = char
  let print_token fmt c =
    Printf.fprintf fmt "%C" c

  (* From a file: *)
  type buf = bytes
  let nth = Bytes.get
  type channel = Unix.file_descr
  let read_buf channel =
    let max_len = 1024 in
    let buf = Bytes.create max_len in
    let r =
      Unix.(
        try
          restart_on_EINTR (read channel buf 0) max_len
        with Unix_error(ECONNRESET, _, _) -> 0) in
    if r = 0 then None
    else Some (buf, 0, r)
end

10.4. Positioning in a stream

The take function from Parsers.CONFIG must return the position along with the token. Most use cases will want the position to be the location in the source text, presented as the conventional line + column numbers. Others will prefer a simpler offset from the beginning of the stream. Others still may have more specialized requirements.

The stream implementations proposed in the library cannot implement all of those, and therefore return no position (or rather: a unit value). Those implementations can then be wrapped into another module that computes and overwrites the position.

Again, we need those wrappers to be free of side effects, as we may read the stream several times; thus we cannot hide the required state in a closure. Instead we add the positioner's state to the stream type.

The simplest such wrapper function, that works for every type of token, just adds the offset from the beginning of the stream:

ParsersPositions.ml: pairing an offset with every token
open Batteries
open ParsersMisc
module Offset (Conf : Parsers.CONFIG )
  : Parsers.CONFIG with type token = Conf.token
                    and type position = int
                    and type stream = Conf.stream * int
                    and type 'a ct = 'a Conf.ct =
struct
  (* Include Conf but change the stream type *)
  type token = Conf.token
  let print_token = Conf.print_token
  type 'a ct = 'a Conf.ct
  let ct_bind = Conf.ct_bind
  let ct_return = Conf.ct_return
  type position = int
  type stream = Conf.stream * position

  let distance pos1 pos2 = pos2 - pos1
  let print_position fmt ofs =
    Printf.fprintf fmt "offset %d" ofs
  let print_stream fmt (str, _) = Conf.print_stream fmt str
  let take (stream, pos) =
    ct_bind (Conf.take stream) (function
      | _, Item tok, stream' ->
        ct_return (pos, Item tok, (stream', pos+1))
      | _, EndOfStream, stream' ->
        ct_return (pos, EndOfStream, (stream', pos)))
  let stream_of_string str = Conf.stream_of_string str, 0
end

Another simple wrapper, keeping track of the position as line and column numbers, must peek at the obtained tokens (looking for newlines) and therefore works only with streams of characters:

TODO: 1. Parsers.CONFIG should have a function to turn a string into a stream. 2. We should have a user manual or a few simple functors for end users.

ParsersPositions.ml: pairing a line number and column number with every character
type file_position = { offs : int ; line : int ; column : int }

module LineCol (Conf : Parsers.CONFIG with type token = char)
  : Parsers.CONFIG with type token = Conf.token
                    and type position = file_position
                    and type stream = Conf.stream * file_position
                    and type 'a ct = 'a Conf.ct =
struct
  (* Include Conf but change the stream type *)
  type token = Conf.token
  let print_token = Conf.print_token
  type 'a ct = 'a Conf.ct
  let ct_bind = Conf.ct_bind
  let ct_return = Conf.ct_return
  type position = file_position
  type stream = Conf.stream * position

  let distance pos1 pos2 = pos2.offs - pos1.offs
  let print_position fmt pos =
    Printf.fprintf fmt "line %d, col %d" pos.line pos.column
  let print_stream fmt (str, _) = Conf.print_stream fmt str
  let take (stream, pos) =
    ct_bind (Conf.take stream) (function
      | _, Item tok, stream' ->
        let line, column =
          if tok = '\n' then (pos.line+1, 1)
          else (pos.line, pos.column+1) in
        let pos' = { offs = pos.offs + 1 ; line ; column } in
        ct_return (pos', Item tok, (stream', pos'))
      | _, EndOfStream, stream' ->
        ct_return (pos, EndOfStream, (stream', pos)))
  let stream_of_string str =
    let start_pos = { offs = 0 ; line = 1 ; column = 1 } in
    Conf.stream_of_string str, start_pos
end