F#: Building a Regular Expression Pattern Parser

It has been a while since I did any real functional programming (about 10 years!) and so I thought I’d set myself a challenge with Microsoft’s F# language: build a regular expression engine.  Of course there are many of these off-the-shelf, but that would defeat the point of (re-)learning how to do real functional programming!

Here’s part one: parsing a pattern…

Caveats

I have not really got to grips with monads yet and so I’m passing state using parameters.  This makes some of the code pretty ugly but it is probably easier to read.  If anyone wants to help me improve the code, I’d love comments…

Features

One of the habits of a good functional programmer is to use tail-recursion.  This allows the compiler to convert recursion into iterative methods that can handle any length of input.  I have attempted to use this.  There are localised exceptions to this rule, such as the processing of escape codes, but these are finite processing that will not themselves recurse significantly.  During the development of this code I did need to remove mutual tail recursion and other undesirable processing behaviours.

I have also structured the code into modules to demonstrate some of F#’s features in this regard.

Example output

The following is an example output but the code will also handle more complicated regular expressions including most features except back tracking (partly because this has bad performance characteristics and I am not planning on using them!):

pattern: "^the (very )*lazy dog$"

Parse tree:

[StartMarker; Literal 't'; Literal 'h'; Literal 'e';
 Literal ' ';
 AnyNumber
   (SubPattern [Literal 'v'; Literal 'e'; Literal 'r';
      Literal 'y'; Literal ' ']);
 Literal 'l'; Literal 'a'; Literal 'z'; Literal 'y';
 Literal ' '; Literal 'd';
 Literal 'o'; Literal 'g'; EndMarker]

Structure

The code is split into seven files.  Three of these are signature files for Hex, Text and RegExParsing.  Another three are the implementation files for these modules.  Hex and Text provide helper methods used by the processing of escape codes.  RegExParsing contains the majority of the real code.  The final file is Program and this actually uses the RegExParsing module to parse a pattern and display the results.

It is important to realise that the Solution Explorer shows files in a compile order for F# projects.  The order of the files is important and should be as follows for this project:

  • Hex.fsi
  • Hex.fs
  • Text.fsi
  • Text.fs
  • RegExParsing.fsi
  • RegExParsing.fs
  • Program.fs

Program.fs

Starting from the end is sometimes easiest!

#light
open RegExParsing

let pattern = "^the (very )*lazy dog$"
pattern |> printf "pattern: %A\n\n"
pattern
  |> RegExParsing.parseRegExp
  |> printf "Parse tree:\n\n%A\n\n";;

This module is very simple. The first two lines tell the compiler how to interpret the rest of the code. Firstly, it uses lightweight syntax (as virtually all F# programs do). Secondly, there is a dependency on the RegExParsing module.

The final three lines can be read as:

  1. In what follows, treat the symbol “pattern” as the string “^the (very )*lazy dog$”
  2. Send pattern as an input into the partial function obtained by applying the string “pattern: %A\n\n” to the function “printf”
  3. Send pattern as an input to the function “parseRegExp” in the module “RegExParsing”, and then send the output of that as an input into the partial function obtained by applying the string “Parse tree:\n\n%A\n\n” to the function “printf”

I will not be explaining the code in depth for the remainder of this article but hopefully that will give you a starting point!  There are a list of introductory links to F# programming at the end of this post.

Hex.fsi

This is the signature file for the Hex module.  This module is used to convert character strings in hexadecimal notation into integer numbers.  It is pretty straightforward but I could not see a direct equivalent that was easily accessible.  Still, it serves as an example:

#light
module Hex =
  val hex4 : char -> int
  val hex8 : char -> char -> int
  val hex16 : char -> char -> char -> char -> int

Hex.fs

This is the implementation file for the Hex module:

#light
module Hex =
  let hex4 A =
    match A with
    | '0' -> 0
    | '1' -> 1
    | '2' -> 2
    | '3' -> 3
    | '4' -> 4
    | '5' -> 5
    | '6' -> 6
    | '7' -> 7
    | '8' -> 8
    | '9' -> 9
    | 'A' -> 10
    | 'B' -> 11
    | 'C' -> 12
    | 'D' -> 13
    | 'E' -> 14
    | 'F' -> 15
    | 'a' -> 10
    | 'b' -> 11
    | 'c' -> 12
    | 'd' -> 13
    | 'e' -> 14
    | 'f' -> 15
    | _ -> failwith "invalid hexadecimal digit"

  let hex8 A B =
    (16 * (hex4 A)) + (hex4 B)
  let hex16 A B C D =
    (256 * (hex8 A B)) + (hex8 C D)

Text.fsi

This is the signature file for the Text module.  It is used to convert numbers to Unicode characters.  It just serves as a wrapper for existing .NET framework functions:

#light
module Text =
  val asciiChar : int -> char
  val unicodeChar : int -> char

Text.fs

The implementation is equally simple:

#light
module Text =
  let asciiChar i =
    System.Convert.ToChar(byte i)
  let unicodeChar i =
    System.Convert.ToChar(int16 i)

RegExParsing.fsi

Now we get to the guts of the algorithm. Here is the code for the signature file:

#light
namespace RegExParsing

  (* RegExpCharacterRange represents either a single character
     or a contiguous range of characters within a regular
     expression chacter range *)
  type RegExpCharacterRange =
    | Character of char
    | Range of char*char

  (* RegExpCharacterRangeUnit represents the content of a
     regular expression chacter range *)
  type RegExpCharacterRangeUnit =
    | Inversion of RegExpCharacterRange list
    | Matching of RegExpCharacterRange list

  (* RegExpParseToken represents an element of a (partially)
     parsed regular expression.

     Note that a RegExpParseToken list may be valid or invalid
     depending on its content *)
  type RegExpParseToken =
    | StartSubPattern
    | SubPattern of RegExpParseToken list
    | Alternation of RegExpParseToken list list
    | Option of RegExpParseToken
    | AnyNumber of RegExpParseToken
    | AtLeastOne of RegExpParseToken
    | CharacterRange of RegExpCharacterRangeUnit
    | AnyCharacter
    | StartMarker
    | EndMarker
    | Literal of char

  module RegExParsing =
    val parseRegExp : string -> RegExpParseToken list

This module defines a namespace “RegExParsing”, some types and a single function “parseRegExp” that takes a string and returns a list of “RegExpParseToken”. Notably, all of these (types and functions) must also be defined in the implementation file. This differs from C++ where the types defined in header files cannot (or at least should not usually!) be redefined in the implementation file.

RegExParsing.fs

The implementation follows. There are some comments in the code. Try compiling the project and see what you think.

#light

namespace RegExParsing
  open Hex
  open Text

  (* RegExpCharacterRange represents either a single character
     or a contiguous range of characters within a regular
     expression chacter range *)
  type RegExpCharacterRange =
    | Character of char
    | Range of char*char

  (* RegExpCharacterRangeUnit represents the content of a
     regular expression chacter range *)
  type RegExpCharacterRangeUnit =
    | Inversion of RegExpCharacterRange list
    | Matching of RegExpCharacterRange list

  (* RegExpParseToken represents an element of a (partially)
     parsed regular expression.

     Note that a RegExpParseToken list may be valid or invalid
     depending on its content *)
  type RegExpParseToken =
    | StartSubPattern (* used during the build of a RegExpParseToken list *)
    | SubPattern of RegExpParseToken list
    | Alternation of RegExpParseToken list list
    | Option of RegExpParseToken
    | AnyNumber of RegExpParseToken
    | AtLeastOne of RegExpParseToken
    | CharacterRange of RegExpCharacterRangeUnit
    | AnyCharacter
    | StartMarker
    | EndMarker
    | Literal of char

  module RegExParsing =
    (* processes an escaped character in the body of the regular expression *)
    let parseRegExpEscape ps : RegExpParseToken * char list =
      match ps with
      | '\\'::ps' -> (Literal '\\', ps')
      | '['::ps' -> (Literal '[', ps')
      | ']'::ps' -> (Literal ']', ps')
      | '^'::ps' -> (Literal '^', ps')
      | '$'::ps' -> (Literal '$', ps')
      | '*'::ps' -> (Literal '*', ps')
      | '+'::ps' -> (Literal '+', ps')
      | '?'::ps' -> (Literal '?', ps')
      | '|'::ps' -> (Literal '|', ps')
      | '('::ps' -> (Literal '(', ps')
      | ')'::ps' -> (Literal ')', ps')
      | '.'::ps' -> (Literal '.', ps')
      | 't'::ps' -> (Literal '\t', ps')
      | 'n'::ps' -> (Literal '\n', ps')
      | 'r'::ps' -> (Literal '\r', ps')
      | 'x'::A::B::ps' -> (Literal (Text.asciiChar (Hex.hex8 A B)), ps')
      | 'u'::A::B::C::D::ps' -> (Literal (Text.unicodeChar (Hex.hex16 A B C D)), ps')
      | 'X'::A::B::ps' -> (Literal (Text.asciiChar (Hex.hex8 A B)), ps')
      | 'U'::A::B::C::D::ps' -> (Literal (Text.unicodeChar (Hex.hex16 A B C D)), ps')
      | _ -> failwith "invalid escape sequence"

    (* processes an escaped character within a character range block *)
    let processRangeEscape ps =
      match ps with
      | '\\'::ps' -> ('\\', ps')
      | ']'::ps' -> (']', ps')
      | '^'::ps' -> ('^', ps')
      | 't'::ps' -> ('\t', ps')
      | 'n'::ps' -> ('\n', ps')
      | 'r'::ps' -> ('\r', ps')
      | '-'::ps' -> ('-', ps')
      | 'x'::A::B::ps' -> (Text.asciiChar (Hex.hex8 A B), ps')
      | 'u'::A::B::C::D::ps' -> (Text.unicodeChar (Hex.hex16 A B C D), ps')
      | 'X'::A::B::ps' -> (Text.asciiChar (Hex.hex8 A B), ps')
      | 'U'::A::B::C::D::ps' -> (Text.unicodeChar (Hex.hex16 A B C D), ps')
      | _ -> failwith "invalid escape sequence"

    (*
      main processing for a character range with tail recursion

      finisher -> turns a list of terms into a RegExpParseToken
      current -> a character that has already been subject to
                 escape processing and other special handling
      ps -> the remainder of the pattern

      cases of 'ps':

      1)                    the range hasn't been closed
      2) ]...               end of the range
      3) -]...              end of the range with a literal '-'
      4) -\...              range of current to an escaped character
      5) -X]...             range of current to X and end of the range
      6) -X\...             range of current to X followed by an escaped character
      7) -XY...             range of current to X followed by normal processing of Y
      8) \...               literal of current followed by an escaped character
      9) X...               literal of current followed by normal processing of X

    *)
    let rec parseRegExpRangeBody1 finisher terms current ps =
      let unclosed () = failwith "unclosed character range expression"
      match ps with
      | [] -> unclosed ()
      | ']'::rs -> (finisher (Character current::terms), rs)
      | '-'::']'::rs -> (finisher (Character current::Character '-'::terms), rs)
      | '-'::'\\'::rs ->
        // Range of current to escaped character c and remainder cs
        let (c, cs) = processRangeEscape rs
        match cs with
        | [] -> unclosed ()
        | ']'::cs' -> (finisher (Range (current, c)::terms), cs')
        | '\\'::cs' -> parseRegExpRangeBody1 finisher (Character current::terms) c cs'
        | c'::cs' -> parseRegExpRangeBody1 finisher (Character current::terms) c' cs'
      | '-'::r::']'::rs ->
        // Range of current to normal character r followed by closure
        (finisher (Range (current, r)::terms), rs)
      | '-'::r::'\\'::rs ->
        // Range of current to normal character r followed by escaped character
        let (c, cs) = processRangeEscape rs
        parseRegExpRangeBody1 finisher (Range (current, r)::terms) c cs
      | '-'::r::r'::rs ->
        // Range of current to normal character r, next character r'
        parseRegExpRangeBody1 finisher (Range (current, r)::terms) r' rs
      | '\\'::rs ->
        let (c, cs) = processRangeEscape rs
        parseRegExpRangeBody1 finisher (Character current::terms) c cs
      | r::rs ->
        parseRegExpRangeBody1 finisher (Character current::terms) r rs

    (* starts the processing of a character range block *)
    let parseRegExpRangeBody finisher ps =
      match ps with
      | [] -> failwith "unclosed character range expression"
      | ']'::']'::rs -> (finisher [Character ']'], rs)
      | ']'::rs ->
        parseRegExpRangeBody1 finisher [] ']' rs
      | '-'::rs ->
        parseRegExpRangeBody1 finisher [] '-' rs
      | '\\'::rs ->
        let (c, cs) = processRangeEscape rs
        parseRegExpRangeBody1 finisher [] c cs
      | r::rs ->
        parseRegExpRangeBody1 finisher [] r rs

    (* a finisher function for an inverted character range *)
    let buildInvertedCharacterRange terms = CharacterRange (Inversion terms)
    (* a finisher function for an matching character range *)
    let buildMatchingCharacterRange terms = CharacterRange (Matching terms)

    (* processes a character range expression in the body of
       a regular expression pattern *)
    let parseRegExpRange ps =
      match ps with
      | '^'::rs ->
        parseRegExpRangeBody buildInvertedCharacterRange rs
      | _ ->
        parseRegExpRangeBody buildMatchingCharacterRange ps

    (* handles common functionality for multiplicity constructs *)
    let parseRegExpMultiplicity (context : RegExpParseToken list) =
      let error () = failwith "invalid position for multiplicity character (?, +, *)"
      let testMultiplicityValid (t:RegExpParseToken) =
        match t with
        | SubPattern _ -> ()
        | Literal _ -> ()
        | CharacterRange _ -> ()
        | AnyCharacter -> ()
        | _ -> error ()
      match context with
      | [] -> error ()
      | c::cs ->
        testMultiplicityValid c
        (c, cs)

    (* progressively builds a parsed pattern with tail recursion
       context contains the parsed content so far as a stack *)
    let rec parseRegExpFragment
      (pattern : char list) (context : RegExpParseToken list list) =
      match pattern with
        | [] ->
          match context with
          | c :: (Alternation alts::ccs)::ds ->
            let newAlt = Alternation (List.rev (List.rev c::alts))
            [newAlt] :: ds
          | _ -> context
        | p :: ps ->
          match context with
          | [] -> failwith "invalid input"
          | c :: cs ->
            match p with
            | '[' ->
                let (parseToken:RegExpParseToken, pattern2: char list) =
                    parseRegExpRange ps
                parseRegExpFragment pattern2 ((parseToken::c)::cs)
            | '(' ->
                parseRegExpFragment ps ([]::((StartSubPattern::c)::cs))
            | ')' ->
                match cs with
                | [] -> failwith "not in a sub expression"
                | (StartSubPattern::ccs)::ds ->
                  let previous = (SubPattern (List.rev c))::ccs
                  let newContext = previous::ds
                  parseRegExpFragment ps newContext
                | [Alternation alts]::(StartSubPattern::ccs)::ds ->
                  let newAlt = Alternation (List.rev ((List.rev c)::alts))
                  let previous = (SubPattern [newAlt])::ccs
                  let newContext = previous::ds
                  parseRegExpFragment ps newContext
                | _ -> failwith "algorithm error"
            | '\\' ->
                let (parseToken:RegExpParseToken, pattern2: char list) =
                    parseRegExpEscape ps
                parseRegExpFragment pattern2 ((parseToken::c)::cs)
            | '|' ->
                match cs with
                | [] ->
                  let newContext = []::([Alternation [List.rev c]])::cs
                  parseRegExpFragment ps newContext
                | (Alternation alts::ccs)::ds ->
                  let newAlt = Alternation ((List.rev c)::alts)
                  let newContext = []::(newAlt::ccs)::ds
                  parseRegExpFragment ps newContext
                | ccs::ds ->
                  let previous = Alternation [List.rev c]
                  let newContext = []::[previous]::(ccs::ds)
                  parseRegExpFragment ps newContext
            | '?' ->
                let (d, ds) = parseRegExpMultiplicity c
                parseRegExpFragment ps ((Option d::ds)::cs)
            | '*' ->
                let (d, ds) = parseRegExpMultiplicity c
                parseRegExpFragment ps ((AnyNumber d::ds)::cs)
            | '+' ->
                let (d, ds) = parseRegExpMultiplicity c
                parseRegExpFragment ps ((AtLeastOne d::ds)::cs)
            | '.' ->
                parseRegExpFragment ps ((AnyCharacter::c)::cs)
            | '^' ->
                parseRegExpFragment ps ((StartMarker::c)::cs)
            | '$' ->
                parseRegExpFragment ps ((EndMarker::c)::cs)
            | _ ->
                parseRegExpFragment ps ((Literal p::c)::cs)

    (* main function for converting a pattern into a
       parsed regular expression syntax *)
    let parseRegExp (pattern : string) =
      let ps = Seq.to_list pattern
      (parseRegExpFragment ps [ [] ])
        |> List.hd
        |> List.rev

Resources

Advertisements

3 thoughts on “F#: Building a Regular Expression Pattern Parser

  1. Pingback: F#: A Data Structure For Modelling Directional Graphs « Steve Horsfield

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s