F#: Compiling a Regular Expression Syntax

Following on from my previous posts, I now show how to build a regular expression compiler.  Later this week I will post the conclusion of this mini-series with a processor that uses this compiled representation.  The previous posts are:

Introduction

You can download the complete source code for this post here: Regular Expressions in FSharp draft 1.

The algorithm has been adapted from the following article by Russ Cox: “Regular Expression Matching Can Be Simple And Fast”, http://swtch.com/~rsc/regexp/regexp1.html.

Notes

The code for the Graph module introduced in a previous post has been adapted in two ways:

  • The generation of identifiers for vertices and edges has been split so that states are numbered sequentially
  • A new function has been added to allow mapping of data within a graph.  This is not required for this post but is used in the processing step described in my next post

The algorithm works without using GLEE but the main program module (program.fs) does use GLEE to visualise the result of compilation.  If you do not want to install GLEE then it is a simple change to omit that step.

Where I left off…

At the end of the previous post on regular expression parsing, I produced a RegExpParseSyntax.  This contains all of the building blocks for producing a regular expression representation as a non-deterministic finite state machine,  (or finite automata: NDFA) the next step.  The algorithm is described very clearly in Russ Cox’s article, so I recommend that you start there.  I will not be repeating that discussion.

The compilation step includes support for all of the features introduced in parsing, including submatch identification.

RegExCompiling.fsi

This is the signature file for the compilation module.  It exposes two methods for compiling the NDFA, from a string and from a previously parsed regular expression syntax.  It also exposes the types that describe an NDFA, especially NdfaGraph which is a type alias for the generic Graph type with the type parameters of NdfaNode and NdfaEdge.

#light

namespace RegExCompiling
  open RegExParsing
  open Graph

  (* RegExMatchMode determines matching or searching semantics *)
  type RegExMatchMode = FullMatch | FirstMatch

  (* Vertices represent states in the state machine.
     Start indicates the starting state.
     Normal is a non-finishing states
     Closure is a state that can complete the state machine *)
  type NdfaNodeType = Normal | Start | Closure

  (* This is used for additional semantic information to
     attach to states *)
  type NdfaNodeSubPatternData = StartSubPattern | EndSubPattern

  (* This is the type used for vertex data *)
  type NdfaNode =
    NdfaNodeType * NdfaNodeSubPatternData option

  (* This is the type used for edge data *)
  type NdfaEdge =
    | Auto (* used to represent free transitions *)
    | Simple of char
    | AnyChar
    | StartAssertion
    | EndAssertion
    | CharacterTest of RegExpCharacterRangeUnit
    | NonGreedyAnyCharacter (* used for partial matching *)

  (* shorthand for the state machine graph type *)
  type NdfaGraph = Graph<NdfaNode, NdfaEdge>

  module RegExCompiling =
    val compile : RegExMatchMode -> RegExpParseSyntax -> NdfaGraph
    val parseAndCompile : RegExMatchMode -> string -> NdfaGraph

RegExCompiling.fs

#light

namespace RegExCompiling
  open RegExParsing
  open Graph

  (* RegExMatchMode determines matching or searching semantics *)
  type RegExMatchMode = FullMatch | FirstMatch

  (* Vertices represent states in the state machine.
     Start indicates the starting state.
     Normal is a non-finishing states
     Closure is a state that can complete the state machine *)
  type NdfaNodeType = Normal | Start | Closure

  (* This is used for additional semantic information to
     attach to states *)
  type NdfaNodeSubPatternData = StartSubPattern | EndSubPattern

  (* This is the type used for vertex data *)
  type NdfaNode =
    NdfaNodeType * NdfaNodeSubPatternData option

  (* This is the type used for edge data *)
  type NdfaEdge =
    | Auto (* used to represent free transitions *)
    | Simple of char
    | AnyChar
    | StartAssertion
    | EndAssertion
    | CharacterTest of RegExpCharacterRangeUnit
    | NonGreedyAnyCharacter (* used for partial matching *)

  (* shorthand for the state machine graph type *)
  type NdfaGraph = Graph<NdfaNode, NdfaEdge>

  type CompileState =
    NdfaGraph * (* graph *)
    int list * (* trailing nodes *)
    int * (* next edge priority *)
    bool (* should force a node *)

  module RegExCompiling =

    (* curried function to build a CompileState tuple *)
    let toCompileState g outs initialPriority forceNode
        : CompileState =
      (g, outs, initialPriority, forceNode)

    (* connects trailing edges to the node v
       p is the edge priority
       e is the edge data *)
    let patch (g:NdfaGraph) (outs:int list) v p e =
      outs
        |> List.fold
            (fun g' out ->
              g'
                |> Graph.addEdge p out v e
                (* returns the new edge id
                   and the new graph, but
                   just want the graph *)
                |> snd)
            g

    (* simplifies the graph by joining multiple
       trailing edges to a single marker state,
       but does not change the graph for a single
       trailing edge unless forceNode is
       specified which is useful in some
       cases to prevent incorrect graphs
       resulting.*)
    let join
        ((g, outs, initialPriority, forceNode):CompileState) =
      match outs with
      | [] -> failwith "invalid graph"
      | [out] when forceNode = false ->
        (out, g)
      | _ ->
        let (v, g') = g |> Graph.addVertex (Normal, None)
        let g'' = patch g' outs v initialPriority Auto
        (v, g'')

    (* adds a single node to the end of the graph *)
    let add1
        ((g,outs,initialPriority,forceNode):CompileState)
        (finalPriority:int)
        (vertex:NdfaNode)
        (edge:NdfaEdge) =
      let (v, g') =
        g |> Graph.addVertex vertex
      let (out, g'') =
        join (toCompileState g' outs initialPriority forceNode)
      let (_, g''') =
        g'' |> Graph.addEdge initialPriority out v edge
      toCompileState g''' [v] finalPriority false

    (* progressively builds the NDFA graph
       the first parameter is the state going into the
       processing of a step.  It comprises:

       g: the graph so far
       outs: the tailing nodes that should be connected
       initialPriority: used to decide among multiple alternations
       forceNode: used to prevent errors involving
                  alternations and multiplicities

       It returns this tuple as the result and as the accumulator
       going into List.fold
    *)
    let rec stepNdfa
        (currentState:CompileState)
        (next:RegExpParseToken)
            : CompileState =
      let (g, outs, initialPriority, forceNode) = currentState

      (* Note: the Option, AnyNumber and AtLeastOne
         match types have this code in common.
         The function adds the subpattern to the existing
         graph and returns the new graph with the
         identifiers for the nodes representing the
         entry and exit from the subpattern.  Each of
         the match types handle these nodes slightly
         differently *)
      let multiplicity subpat =
        (* simplify the graph with join *)
        let (v, g') = join currentState
        (* connect the subpattern *)
        let state' =
          subpat
            |> stepNdfa
                (toCompileState g' [v] initialPriority false)
        (* simplify the output of the subpattern *)
        let (out', g'') = join state'
        (g'', v, out')

      match next with
      | Literal l ->
        add1 currentState 0 (Normal, None) (Simple l)
      | AnyCharacter ->
        add1 currentState 0 (Normal, None) AnyChar
      | CharacterRange range ->
        add1 currentState 0 (Normal, None) (CharacterTest range)
      | StartMarker ->
        add1 currentState 0 (Normal, None) StartAssertion
      | EndMarker ->
        add1 currentState 0 (Normal, None) EndAssertion
      | Option subpat ->
        let (g', vin, vout) = multiplicity subpat
        (* connect the input to the output directly since this
           is an optional step *)
        let g'' = patch g' [vin] vout 0 Auto
        toCompileState g'' [vout] 0 true
      | AnyNumber subpat ->
        let (g', vin, vout) = multiplicity subpat
        (* connect the output to the input to allow for more *)
        let g'' = patch g' [vout] vin 0 Auto
        (* the tailing vertex is the input in this case *)
        toCompileState g'' [vin] 0 true
      | AtLeastOne subpat->
        let (g', vin, vout) = multiplicity subpat
        (* connect the output to the input to allow for more *)
        let g'' = patch g' [vout] vin 0 Auto
        (* the tailing vertex is the output of the subpattern
           in this case *)
        toCompileState g'' [vout] 0 true
      | SubPattern subpat ->
          (* v and v' mark the subpattern region *)
          let (v, g') =
            g |> Graph.addVertex (Normal, Some StartSubPattern)
          let (v', g'') =
            g' |> Graph.addVertex (Normal, Some EndSubPattern)
          (* connect v to the tail of the graph *)
          let g''' =
            patch g'' outs v initialPriority Auto
          (* build the subpattern *)
          let (g'''', outs', _, _) =
            subpat
              |> Seq.fold
                  stepNdfa
                  (toCompileState g''' [v] 0 true)
          (* attach the subpattern closure *)
          let g''''' =
            patch g'''' outs' v' 0 Auto
          (* return the new process state *)
          toCompileState g''''' [v'] 0 true
      | Alternation alts ->
          (* priority is used by alternation to choose a
             match when multiple are possible *)
          (* first join the inputs *)
          let (v, g') = join currentState
          (* this creates the graph for an alternation *)
          let applyAlt g'alt p'alt subpat =
            subpat
              |> List.fold
                  stepNdfa
                  (toCompileState g'alt [v] p'alt true)
          (* but for folding, we need to build the
             combined list of tailing edges,
             which is what this function does *)
          let foldAlt
              (ins'alt, g'alt, p'alt, outs'alt:int list)
              subpat =
            let (g'alt', outs'alt':int list, _, _) =
              applyAlt g'alt p'alt subpat
            (ins'alt,
             g'alt',
             p'alt + 1,
             List.append outs'alt' outs'alt)
          (* apply foldAlt across the alternations *)
          let (_,g'',_,outs') =
              alts |>
                List.fold foldAlt
                          ([v], g', 0, [])
          toCompileState g'' outs' 0 false
      | _ -> failwith "invalid syntax tree"

    (* this function turns the trailing vertices to
       be closures *)
    let completeNdfa
          ((g, outs, initialPriority, forceNode):CompileState) =
      let mutate (vId:int) ((nodeType,subMatchData):NdfaNode) =
        if List.exists (fun x -> x = vId) outs then
          (Closure, subMatchData)
        else
          (nodeType, subMatchData)
      g |> Graph.mapVertices mutate

    (* this is the main module entry point for building the NDFA *)
    let toNDFA (mode:RegExMatchMode) (syntax:RegExpParseSyntax) =
      let (v, g) =
        Graph.empty
          |> Graph.addVertex (Start, None)
      let startState =
        match mode with
        | FullMatch ->
            toCompileState g [v] 0 false
        | FirstMatch ->
            add1
              (toCompileState g [v] 0 true)
              0
              (Normal, None)
              NonGreedyAnyCharacter
      syntax
        |> Seq.fold stepNdfa startState
        |> completeNdfa

    (* This produces the NDFA from a parsed syntax *)
    let compile (mode:RegExMatchMode) (syntax:RegExpParseSyntax) =
      syntax |> toNDFA mode

    (* This produces the NDFA from a pattern string *)
    let parseAndCompile (mode:RegExMatchMode) (pattern:string) =
      pattern |> RegExParsing.parseRegExp |> compile mode

I hope the code is fairly self-explanatory. Let me know if you would like further explanation.  The code has been through several iterations to address problems found in the processing stage, and it was for this reason that the “forceNode” boolean was added to several operations.

Program.fs

#light

open Graph
open RegExParsing
open RegExCompiling
open GleeGraph
open Microsoft.Glee.Drawing

let nodeAdapter
    (v:Vertex<NdfaNode, NdfaEdge>)
    (n:Microsoft.Glee.Drawing.Node) =
  let (nt:NdfaNodeType, sp:NdfaNodeSubPatternData option) = vertexData v
  n.Attr.Shape <-
    match nt with
    | Start -> Shape.Point
    | Closure -> Shape.DoubleCircle
    | Normal -> Shape.Circle
  n.Attr.Color <-
    match sp with
    | None -> Color.Black
    | Some StartSubPattern -> Color.Blue
    | Some EndSubPattern -> Color.Red

let edgeAdapter
    ((id,priority,target,data):EdgeData<NdfaEdge>)
    (e:Microsoft.Glee.Drawing.Edge) =
  match data with
  | Auto ->
    e.Attr.Color <- Color.Blue
  | Simple c ->
    e.Attr.Label <- c.ToString()
    e.Attr.Fontcolor <- Color.Red
    e.Attr.Color <- Color.Black
  | CharacterTest t ->
    e.Attr.Label <- (sprintf "$test %20A" t)
    e.Attr.Fontcolor <- Color.Blue
    e.Attr.Color <- Color.Black
  | AnyChar ->
    e.Attr.Label <- "$any"
    e.Attr.Fontcolor <- Color.Blue
    e.Attr.Color <- Color.Black
  | StartAssertion ->
    e.Attr.Label <- "$start"
    e.Attr.Fontcolor <- Color.Red
    e.Attr.Color <- Color.Red
  | EndAssertion ->
    e.Attr.Label <- "$end"
    e.Attr.Fontcolor <- Color.Red
    e.Attr.Color <- Color.Red
  | NonGreedyAnyCharacter ->
    e.Attr.Label <- "$any*[non-greedy]"
    e.Attr.Fontcolor <- Color.Blue
    e.Attr.Color <- Color.Blue

 
let run () =
  let frm = new System.Windows.Forms.Form()
  frm.Text <- "RegEx Compiler"
  let panel = new System.Windows.Forms.FlowLayoutPanel()
  panel.Size <- new System.Drawing.Size(100, 40)
  panel.Dock <- System.Windows.Forms.DockStyle.Top
  let txtBoxRegex = new System.Windows.Forms.TextBox()
  txtBoxRegex.Text <- ""
  let chkboxFullMatch = new System.Windows.Forms.CheckBox()
  chkboxFullMatch.Text <- "Initial matches"
  chkboxFullMatch.Checked <- true
  let btnGo = new System.Windows.Forms.Button()
  btnGo.Text <- "Generate"
  let label l =
    let c = new System.Windows.Forms.Label()
    c.Text <- l
    c :> System.Windows.Forms.Control
  panel.Controls.AddRange
    [| (label "regular expression") ;
       (txtBoxRegex :> System.Windows.Forms.Control) ;
       (chkboxFullMatch:>System.Windows.Forms.Control) ;
       (btnGo :>System.Windows.Forms.Control) |]
  let gc = new Microsoft.Glee.GraphViewerGdi.GViewer()
  gc.Graph <- new Microsoft.Glee.Drawing.Graph("empty")
  gc.Dock <- System.Windows.Forms.DockStyle.Fill;
  frm.Controls.Add gc
  frm.Controls.Add panel
  frm.Size <- new System.Drawing.Size(600, 600)
  let getSyntax () =
    let syntax = txtBoxRegex.Text |> RegExParsing.parseRegExp
    let mode =
      if (chkboxFullMatch.Checked) then FullMatch else FirstMatch
    let g = syntax |> RegExCompiling.compile mode
    printf "Pattern: %A\n\n" txtBoxRegex.Text
    printf "Syntax: %A\n\n" syntax
    printf "Graph: %A\n\n" g
    g
  let applyGleeGraph () =
    let g = getSyntax()
    gc.Graph <- (Graph.toGlee txtBoxRegex.Text nodeAdapter edgeAdapter g)
    g
  let applyGraph () =
    let g = applyGleeGraph ()
    printf "-------------------------------------------------------------\n\n"
  btnGo.Click.Add (fun _ -> applyGraph () )
  applyGraph ()
  System.Windows.Forms.Application.Run frm

run ()

Example
Here is some sample output for a regular expression pattern:

Pattern: "a|(b*)c|^H[e|E][^A-Z]?5+c$"

Syntax: [Alternation
   [[Literal 'a']; [SubPattern [AnyNumber (Literal 'b')]; Literal 'c'];
    [StartMarker; Literal 'H';
     CharacterRange (Matching [Character 'E'; Character '|'; Character 'e']);
     Option (CharacterRange (Inversion [Range ('A','Z')]));
     AtLeastOne (Literal '5'); Literal 'c'; EndMarker]]]

Graph: ((21, 23),
 [((20, (Closure, null)), []); ((19, (Normal, null)), [(21, 0, 18, Simple 'c')]);
  ((18, (Normal, null)), [(22, 0, 20, EndAssertion)]);
  ((17, (Normal, null)), [(20, 0, 19, Auto); (19, 0, 16, Auto)]);
  ((16, (Normal, null)), [(18, 0, 17, Simple '5')]);
  ((15, (Normal, null)), [(17, 0, 16, Auto)]);
  ((14, (Normal, null)),
   [(16, 0, 15, Auto); (15, 0, 15, CharacterTest (Inversion [Range ('A','Z')]))]);
  ((13, (Normal, null)),
   [(14, 0, 14,
     CharacterTest (Matching [Character 'E'; Character '|'; Character 'e']))]);
  ((12, (Normal, null)), [(12, 2, 11, StartAssertion)]);
  ((11, (Normal, null)), [(13, 0, 13, Simple 'H')]);
  ((10, (Normal, null)), [(10, 0, 9, Simple 'c')]); ((9, (Closure, null)), []);
  ((8, (Normal, null)), [(7, 0, 7, Auto)]);
  ((7, (Normal, null)), [(8, 0, 6, Auto); (6, 0, 8, Simple 'b')]);
  ((6, (Normal, Some EndSubPattern)), [(9, 0, 10, Auto)]);
  ((5, (Normal, Some StartSubPattern)), [(5, 0, 7, Auto)]);
  ((4, (Normal, null)), [(3, 0, 3, Simple 'a')]); ((3, (Closure, null)), []);
  ((2, (Normal, null)), [(1, 0, 1, NonGreedyAnyCharacter)]);
  ((1, (Normal, null)), [(11, 2, 12, Auto); (4, 1, 5, Auto); (2, 0, 4, Auto)]);
  ((0, (Start, null)), [(0, 0, 2, Auto)])])

Sample regular expression

Sample regular expression

1 thought on “F#: Compiling a Regular Expression Syntax

  1. Pingback: F#: A Complete Regular Expression Processor « Steve Horsfield

Leave a comment