Parsing with Binary String Pattern Matching

If you've ever had to parse anything, (anything really), I want to show you a glimpse of what it could look like when you parse it binary string pattern matching in OCaml like you'd do on Erlang, Elixir, or Gleam.

Parsing with Binary String Pattern Matching

If you've ever had to parse anything, (anything really), I want to show you a glimpse of what it could look like when you parse it binary string pattern matching, like you'd do on Erlang, Elixir, or Gleam.

In OCaml, we can do this with the bitstring package.

To do this we'll use a small example of something I've had to parse recently: WebSocket Frames.

A WebSocket frame looks like this:


      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-------+-+-------------+-------------------------------+
     |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
     |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
     |N|V|V|V|       |S|             |   (if payload len==126/127)   |
     | |1|2|3|       |K|             |                               |
     +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
     |     Extended payload length continued, if payload len == 127  |
     + - - - - - - - - - - - - - - - +-------------------------------+
     |                               |Masking-key, if MASK set to 1  |
     +-------------------------------+-------------------------------+
     | Masking-key (continued)       |          Payload Data         |
     +-------------------------------- - - - - - - - - - - - - - - - +
     :                     Payload Data continued ...                :
     + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
     |                     Payload Data continued ...                |
     +---------------------------------------------------------------+

pretty ascii art for a websocket frame from rfc6455

This roughly means:

  • 1st byte includes the fin, rsv1,2,3 and opcode headers
  • 2nd byte includes a mask and a payload_length
  • 3rd-7th byte includes an extended payload (can be shorter too)
  • the byte after that is going to be a masking key
  • and then the rest is the actual data you packed in the WebSocket frame

The way you'd parse this with parse combinators looks a lot like the BNF grammar used to describe this packet. Let's look at some code, shall we?

We'll assume we have a function Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload that will create a record of type Frame.t. That part is the least interesting. The interesting thing is how do we get those values.

Here's how the websocketaf parses this frame using :

(* this example code taken almost verbatim from the `websocketaf`
   opam package.

   I've only amended the last `parse` function to make it more obvious how 
   the parsing is becoming a frame. In practice they do something 
   different after parsing.
*)

(* a Bigstringaf.t (i know, i giggle too) is an efficient type
   for a large string that doesn't have to be copied over and over.

   If you need something similar in your code, consider using Cstruct.t
*)
type t = Bigstringaf.t

(* this function takes a bigstring and checks if the first bit, which
   corresponds to the FIN bit is set *)
let is_fin t =
  let bits = Bigstringaf.unsafe_get t 0 |> Char.code in
  bits land (1 lsl 8) = 1 lsl 8
;;

(* this function extracts the reserverd bits from the bigstring *)
let rsv t =
  let bits = Bigstringaf.unsafe_get t 0 |> Char.code in
  (bits lsr 4) land 0b0111
;;

(* this function creates an int representation for an opcode *)
let opcode t =
  let bits = Bigstringaf.unsafe_get t 0 |> Char.code in
  bits land 0b1111 |> Opcode.unsafe_of_code
;;

(* this function gets the length of the payload.
   and yes that comment was already there.
*)
let payload_length_of_offset t off =
  let bits = Bigstringaf.unsafe_get t (off + 1) |> Char.code in
  let length = bits land 0b01111111 in
  if length = 126 then Bigstringaf.unsafe_get_int16_be t (off + 2)                 else
  (* This is technically unsafe, but if somebody's asking us to read 2^63
   * bytes, then we're already screwd. *)
  if length = 127 then Bigstringaf.unsafe_get_int64_be t (off + 2) |> Int64.to_int else
  length
;;

let payload_length t =
  payload_length_of_offset t 0
;;

(* this function checks if the bigstring has a specific mask set *)
let has_mask t =
  let bits = Bigstringaf.unsafe_get t 1 |> Char.code in
  bits land (1 lsl 7) = 1 lsl 7

(* and this one checks if the mask exists and if so, gets the value *)
let mask t =
  if not (has_mask t) 
  then None
  else
    Some (
      let bits = Bigstringaf.unsafe_get t 1 |> Char.code in
      if bits  = 254 then Bigstringaf.unsafe_get_int32_be t 4  else
      if bits  = 255 then Bigstringaf.unsafe_get_int32_be t 10 else
      Bigstringaf.unsafe_get_int32_be t 2)
;;

(* this computes the offset of the payload in bits *)
let payload_offset_of_bits bits =
  let initial_offset = 2 in
  let mask_offset    = (bits land (1 lsl 7)) lsr (7 - 2) in
  let length_offset  =
    let length = bits land 0b01111111 in
    if length < 126
    then 0
    else 2 lsl ((length land 0b1) lsl 2)
  in
  initial_offset + mask_offset + length_offset
;;

let payload_offset t =
  let bits = Bigstringaf.unsafe_get t 1 |> Char.code in
  payload_offset_of_bits bits
;;


let length_of_offset t off =
  let bits           = Bigstringaf.unsafe_get t (off + 1) |> Char.code in
  let payload_offset = payload_offset_of_bits bits in
  let payload_length = payload_length_of_offset t off in
  payload_offset + payload_length
;;

let length t =
      length_of_offset t 0
  ;;

let apply_mask mask bs ~off ~len =
  for i = off to off + len - 1 do
    let j = (i - off) mod 4 in
    let c = Bigstringaf.unsafe_get bs i |> Char.code in
    let c = c lxor Int32.(logand (shift_right mask (8 * (3 - j))) 0xffl |> to_int) in
    Bigstringaf.unsafe_set bs i (Char.unsafe_chr c)
  done
;;

let unmask_inplace t =
  if has_mask t then begin
    let mask = mask_exn t in
    let len = payload_length t in
    let off = payload_offset t in
    apply_mask mask t ~off ~len
  end
;;

(* this was confusing to me too *)
let mask_inplace = unmask_inplace

let parse =
  let open Angstrom in
  Unsafe.peek 2 (fun bs ~off ~len:_ -> length_of_offset bs off)
  >>= fun len -> Unsafe.take len Bigstringaf.sub
  >>| fun frame ->
      let fin = Websocket.Frame.is_fin frame in
      let opcode = Websocket.Frame.opcode frame in
      unmask_inplace frame;
      let len = payload_length t in
      let off = payload_offset t in
      Frame.make ~fin ~opcode ~payload:(len, off, frame)
;;

This process begins by defining from the bottom up all the smallest parts we need to use in our final combinator. Only at the very end do we stitch everything together. It also requires us to keep track of several small numbers in different positions to understand what is being read from where.

If you wanted to run this parser incrementally, you now need an external loop that would normally read something like this:

let parse ?(parser = Frame.parse) data =
  match parser with
  (* check if the parser is currently in a partial state *)
  | Angstrom.Buffered.Partial continue -> (
      (* continue by with the next piece of data *)
      match continue (`Bigstring data) with
      (* if we're done, return our frame *)
      | Angstrom.Buffered.Done (_unconsumed, frame) -> Ok (`frame frame)
      (* handle any parsing errors *)
      | Angstrom.Buffered.Fail (_, _, _) -> Error `bad_frame
      (* otherwise return that we need more data to continue *)
      | parser -> Ok (`more parser))
  | Angstrom.Buffered.Done (_unconsumed, frame) -> Ok (`frame frame)
  | Angstrom.Buffered.Fail (_, _, _) -> Error `bad_frame

(* helper function to read data and continue parsing until we are done
   or until the parser errors *)
let rec read_all_and_parse ?(parser=Frame.parse) () =
  let data = read_data () in
  match parse ~parser data with
  | Ok (`more parser) -> read_all_and_parse ~parser ()
  | Ok (`frame frame) -> Ok frame
  | Error reason -> Error reason

Not bad! I kinda like being able to control the parsing incrementally like this. It can make for really efficient code.

And here's how it can look when pattern matching on binary strings:

let deserialize ?(max_frame_size = 0) data =
  match%bitstring Bitstring.bitstring_of_string data with
  | {| fin : 1;
       compressed : 1;
       rsv : 2;
       opcode : 4;
       pad1 : 1 : check( pad1 = true );
       pad2 : 7 : check( pad2 = 127 );
       length : 64;
       mask : 32;
       payload : Int64.(mul length 8L |> to_int) : string;
       rest : -1 : string |}
    when max_frame_size = 0 || Int64.(length <= of_int max_frame_size) ->
      Some (Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload, rest)
      
  | {| fin : 1;
       compressed : 1;
       rsv : 2;
       opcode : 4;
       pad1 : 1 : check( pad1 = true );
       pad2 : 7 : check( pad2 = 126 );
       length : 16 : int;
       mask : 32 : int;
       payload : (length * 8) : string;
       rest : -1 : string |}
    when max_frame_size = 0 || length <= max_frame_size ->
      Some (Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload, rest)
      
  | {| fin : 1;
       compressed : 1;
       rsv : 2;
       opcode : 4;
       x : 1 : check( x = true );
       length : 7 : int;
       mask : 32 : int;
       payload : (length * 8) : string;
       rest : -1 : string |}
    when length <= 125 && (max_frame_size == 0 || length <= max_frame_size) ->
      Some (Frame.make ~fin ~compressed ~rsv ~opcode ~mask ~payload, rest)
      
  | {| data : -1 : string  |} -> Some (`more data, "")

This function also takes in a string as data, but it does a much clearer job of outlining what are the valid cases for frames on each matching branch.

Let's break this down.

A binary pattern match starts with match%bitstring data with and includes several cases. Every case is a regular OCaml string with the delimiters {| ... |} . The bitstring package will translate this string into a series of operations over the string data. This is why this is more declarative.

Okay but what goes inside each pattern?

Roughly every binary pattern follows this format: {| variable : size : type |} and they are all separated by semicolons.

For example, let's grab the first branch of the match:

{| fin : 1;
   compressed : 1;
   rsv : 2;
   opcode : 4;
   pad1 : 1 : check( pad1 = true );
   pad2 : 7 : check( pad2 = 127 );
   length : 64;
   mask : 32;
   payload : Int64.(mul length 8L |> to_int) : string;
   rest : -1 : string |}

In here we are saying that:

  • the first bit will be captured by the variable fin,
  • the second one by the variable compressed,
  • the next 2 bits by the variable rsv
  • the next 4 bits by the variable opcode
  • the next 1 bit by the variable pad1 , and we will check that the variable pad1 is true – since it is a single bit, the library maps this to a bool
  • the next 7 bits by the variable pad2, and we will check that it is the number 127
  • the next 64 bits by the variable length
  • the next 32 bits by the variable mask
  • and next length * 8 bits by the variable payload – and we specifically want this to be captured as a string
  • the rest of the bits (-1) as a string

Now compare that to our little diagram and tell me if this, while packed with info, is not easier to double-check that you are in fact parsing the right thing.


      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-------+-+-------------+-------------------------------+
     |F|R|R|R| opcode|M| Payload len |    Extended payload length    |
     |I|S|S|S|  (4)  |A|     (7)     |             (16/64)           |
     |N|V|V|V|       |S|             |   (if payload len==126/127)   |
     | |1|2|3|       |K|             |                               |
     +-+-+-+-+-------+-+-------------+ - - - - - - - - - - - - - - - +
     |     Extended payload length continued, if payload len == 127  |
     + - - - - - - - - - - - - - - - +-------------------------------+
     |                               |Masking-key, if MASK set to 1  |
     +-------------------------------+-------------------------------+
     | Masking-key (continued)       |          Payload Data         |
     +-------------------------------- - - - - - - - - - - - - - - - +
     :                     Payload Data continued ...                :
     + - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - +
     |                     Payload Data continued ...                |
     +---------------------------------------------------------------+

pretty ascii art for a websocket frame from rfc6455

With that explained, we can go back to calling our function.

Since we control the returning values in this function, it is straightforward to fit it into other abstractions that we may be using. For example, we can use it with Seq.unfold to generate a sequence of frames:

Seq.unfold deserialize data
(* `ok frame
   `ok frame
   `more rest *)

And with a little combinator for sequences, we can make it quite pleasant to incrementally parse over streams of data:

(* reduce a value until we explicitly halt or the sequence is finished *)
let rec reduce_while init fn t =
  match t () with
  | Seq.Nil -> init
  | Seq.Cons (v, t') -> (
      match fn v init with
      | `continue acc -> reduce_while acc fn t'
      | `halt acc -> acc)

(* do an incremental read/parse/handle_frame loop *)
let rec parse data =
  let next =
    Seq.unfold Frame.deserialize data
    |> reduce_while `ok (fun frame state ->
           match (state, frame) with
           | `ok, `more buf -> `halt (`more buf)
           | `ok, `ok frame -> `continue (handle_frame frame)
           | `ok, `error reason -> `halt (`error reason)
           | `error reason, _ -> `halt (`error reason)
           | `more buf, _ -> `halt (`more buf))
  in
  match next with
  | `ok -> Ok ()
  | `error reason -> Error reason
  | `more buffer ->
      let more_data = read_data () in
      parse (buffer ^ more_data)

Just to clarify, I'm sure you can fit Angstrom into a Seq.t as well. So this last reduce_while trick isn't a win specifically for binary pattern matching, but having full control over how the parser executes sure helps us build parsers that fit exactly where we need them to.

Conclusion

I like binary pattern matching better, it has way less cognitive overhead, it helps me parse instantly what a sequence of bits/bytes would look like, and its recursive descent approach helps me write parsers that fit exactly into the code around them.

They're not for everything, and some things can be rather painful to parse where you end up using tiny combinators anyway (like "parse all until new line"), but overall I think they're a major improvement in readability and declarativeness over parser combinators.

If you find this interesting, have a look at bitstring. I'll be writing more about this package soon.


That's all I've got for today folks, it's the last day of the year and I figured we should end it with one last 2023 issue.

If you liked what I've written you can support me on Github Sponsors

Here's to more Practical OCaml in 2024 🥂

Thanks to @dillon_mulroy for reviewing a draft of this post.