Following on from my previous posts, I now show how to build a regular expression compiler. Later this week I will post the conclusion of this mini-series with a processor that uses this compiled representation. The previous posts are:
- F#: Graphing with GLEE (only needed for visualisation)
- F#: A Data Structure For Modelling Directional Graphs (used by the code in this post)
- F#: Building a Regular Expression Pattern Parser (where I previously showed how to build an abstract syntax tree for a regular expression pattern)
Introduction
You can download the complete source code for this post here: Regular Expressions in FSharp draft 1.
The algorithm has been adapted from the following article by Russ Cox: “Regular Expression Matching Can Be Simple And Fast”, http://swtch.com/~rsc/regexp/regexp1.html.
Notes
The code for the Graph module introduced in a previous post has been adapted in two ways:
- The generation of identifiers for vertices and edges has been split so that states are numbered sequentially
- A new function has been added to allow mapping of data within a graph. This is not required for this post but is used in the processing step described in my next post
The algorithm works without using GLEE but the main program module (program.fs) does use GLEE to visualise the result of compilation. If you do not want to install GLEE then it is a simple change to omit that step.
Where I left off…
At the end of the previous post on regular expression parsing, I produced a RegExpParseSyntax. This contains all of the building blocks for producing a regular expression representation as a non-deterministic finite state machine, (or finite automata: NDFA) the next step. The algorithm is described very clearly in Russ Cox’s article, so I recommend that you start there. I will not be repeating that discussion.
The compilation step includes support for all of the features introduced in parsing, including submatch identification.
RegExCompiling.fsi
This is the signature file for the compilation module. It exposes two methods for compiling the NDFA, from a string and from a previously parsed regular expression syntax. It also exposes the types that describe an NDFA, especially NdfaGraph which is a type alias for the generic Graph type with the type parameters of NdfaNode and NdfaEdge.
#light namespace RegExCompiling open RegExParsing open Graph (* RegExMatchMode determines matching or searching semantics *) type RegExMatchMode = FullMatch | FirstMatch (* Vertices represent states in the state machine. Start indicates the starting state. Normal is a non-finishing states Closure is a state that can complete the state machine *) type NdfaNodeType = Normal | Start | Closure (* This is used for additional semantic information to attach to states *) type NdfaNodeSubPatternData = StartSubPattern | EndSubPattern (* This is the type used for vertex data *) type NdfaNode = NdfaNodeType * NdfaNodeSubPatternData option (* This is the type used for edge data *) type NdfaEdge = | Auto (* used to represent free transitions *) | Simple of char | AnyChar | StartAssertion | EndAssertion | CharacterTest of RegExpCharacterRangeUnit | NonGreedyAnyCharacter (* used for partial matching *) (* shorthand for the state machine graph type *) type NdfaGraph = Graph<NdfaNode, NdfaEdge> module RegExCompiling = val compile : RegExMatchMode -> RegExpParseSyntax -> NdfaGraph val parseAndCompile : RegExMatchMode -> string -> NdfaGraph
RegExCompiling.fs
#light namespace RegExCompiling open RegExParsing open Graph (* RegExMatchMode determines matching or searching semantics *) type RegExMatchMode = FullMatch | FirstMatch (* Vertices represent states in the state machine. Start indicates the starting state. Normal is a non-finishing states Closure is a state that can complete the state machine *) type NdfaNodeType = Normal | Start | Closure (* This is used for additional semantic information to attach to states *) type NdfaNodeSubPatternData = StartSubPattern | EndSubPattern (* This is the type used for vertex data *) type NdfaNode = NdfaNodeType * NdfaNodeSubPatternData option (* This is the type used for edge data *) type NdfaEdge = | Auto (* used to represent free transitions *) | Simple of char | AnyChar | StartAssertion | EndAssertion | CharacterTest of RegExpCharacterRangeUnit | NonGreedyAnyCharacter (* used for partial matching *) (* shorthand for the state machine graph type *) type NdfaGraph = Graph<NdfaNode, NdfaEdge> type CompileState = NdfaGraph * (* graph *) int list * (* trailing nodes *) int * (* next edge priority *) bool (* should force a node *) module RegExCompiling = (* curried function to build a CompileState tuple *) let toCompileState g outs initialPriority forceNode : CompileState = (g, outs, initialPriority, forceNode) (* connects trailing edges to the node v p is the edge priority e is the edge data *) let patch (g:NdfaGraph) (outs:int list) v p e = outs |> List.fold (fun g' out -> g' |> Graph.addEdge p out v e (* returns the new edge id and the new graph, but just want the graph *) |> snd) g (* simplifies the graph by joining multiple trailing edges to a single marker state, but does not change the graph for a single trailing edge unless forceNode is specified which is useful in some cases to prevent incorrect graphs resulting.*) let join ((g, outs, initialPriority, forceNode):CompileState) = match outs with | [] -> failwith "invalid graph" | [out] when forceNode = false -> (out, g) | _ -> let (v, g') = g |> Graph.addVertex (Normal, None) let g'' = patch g' outs v initialPriority Auto (v, g'') (* adds a single node to the end of the graph *) let add1 ((g,outs,initialPriority,forceNode):CompileState) (finalPriority:int) (vertex:NdfaNode) (edge:NdfaEdge) = let (v, g') = g |> Graph.addVertex vertex let (out, g'') = join (toCompileState g' outs initialPriority forceNode) let (_, g''') = g'' |> Graph.addEdge initialPriority out v edge toCompileState g''' [v] finalPriority false (* progressively builds the NDFA graph the first parameter is the state going into the processing of a step. It comprises: g: the graph so far outs: the tailing nodes that should be connected initialPriority: used to decide among multiple alternations forceNode: used to prevent errors involving alternations and multiplicities It returns this tuple as the result and as the accumulator going into List.fold *) let rec stepNdfa (currentState:CompileState) (next:RegExpParseToken) : CompileState = let (g, outs, initialPriority, forceNode) = currentState (* Note: the Option, AnyNumber and AtLeastOne match types have this code in common. The function adds the subpattern to the existing graph and returns the new graph with the identifiers for the nodes representing the entry and exit from the subpattern. Each of the match types handle these nodes slightly differently *) let multiplicity subpat = (* simplify the graph with join *) let (v, g') = join currentState (* connect the subpattern *) let state' = subpat |> stepNdfa (toCompileState g' [v] initialPriority false) (* simplify the output of the subpattern *) let (out', g'') = join state' (g'', v, out') match next with | Literal l -> add1 currentState 0 (Normal, None) (Simple l) | AnyCharacter -> add1 currentState 0 (Normal, None) AnyChar | CharacterRange range -> add1 currentState 0 (Normal, None) (CharacterTest range) | StartMarker -> add1 currentState 0 (Normal, None) StartAssertion | EndMarker -> add1 currentState 0 (Normal, None) EndAssertion | Option subpat -> let (g', vin, vout) = multiplicity subpat (* connect the input to the output directly since this is an optional step *) let g'' = patch g' [vin] vout 0 Auto toCompileState g'' [vout] 0 true | AnyNumber subpat -> let (g', vin, vout) = multiplicity subpat (* connect the output to the input to allow for more *) let g'' = patch g' [vout] vin 0 Auto (* the tailing vertex is the input in this case *) toCompileState g'' [vin] 0 true | AtLeastOne subpat-> let (g', vin, vout) = multiplicity subpat (* connect the output to the input to allow for more *) let g'' = patch g' [vout] vin 0 Auto (* the tailing vertex is the output of the subpattern in this case *) toCompileState g'' [vout] 0 true | SubPattern subpat -> (* v and v' mark the subpattern region *) let (v, g') = g |> Graph.addVertex (Normal, Some StartSubPattern) let (v', g'') = g' |> Graph.addVertex (Normal, Some EndSubPattern) (* connect v to the tail of the graph *) let g''' = patch g'' outs v initialPriority Auto (* build the subpattern *) let (g'''', outs', _, _) = subpat |> Seq.fold stepNdfa (toCompileState g''' [v] 0 true) (* attach the subpattern closure *) let g''''' = patch g'''' outs' v' 0 Auto (* return the new process state *) toCompileState g''''' [v'] 0 true | Alternation alts -> (* priority is used by alternation to choose a match when multiple are possible *) (* first join the inputs *) let (v, g') = join currentState (* this creates the graph for an alternation *) let applyAlt g'alt p'alt subpat = subpat |> List.fold stepNdfa (toCompileState g'alt [v] p'alt true) (* but for folding, we need to build the combined list of tailing edges, which is what this function does *) let foldAlt (ins'alt, g'alt, p'alt, outs'alt:int list) subpat = let (g'alt', outs'alt':int list, _, _) = applyAlt g'alt p'alt subpat (ins'alt, g'alt', p'alt + 1, List.append outs'alt' outs'alt) (* apply foldAlt across the alternations *) let (_,g'',_,outs') = alts |> List.fold foldAlt ([v], g', 0, []) toCompileState g'' outs' 0 false | _ -> failwith "invalid syntax tree" (* this function turns the trailing vertices to be closures *) let completeNdfa ((g, outs, initialPriority, forceNode):CompileState) = let mutate (vId:int) ((nodeType,subMatchData):NdfaNode) = if List.exists (fun x -> x = vId) outs then (Closure, subMatchData) else (nodeType, subMatchData) g |> Graph.mapVertices mutate (* this is the main module entry point for building the NDFA *) let toNDFA (mode:RegExMatchMode) (syntax:RegExpParseSyntax) = let (v, g) = Graph.empty |> Graph.addVertex (Start, None) let startState = match mode with | FullMatch -> toCompileState g [v] 0 false | FirstMatch -> add1 (toCompileState g [v] 0 true) 0 (Normal, None) NonGreedyAnyCharacter syntax |> Seq.fold stepNdfa startState |> completeNdfa (* This produces the NDFA from a parsed syntax *) let compile (mode:RegExMatchMode) (syntax:RegExpParseSyntax) = syntax |> toNDFA mode (* This produces the NDFA from a pattern string *) let parseAndCompile (mode:RegExMatchMode) (pattern:string) = pattern |> RegExParsing.parseRegExp |> compile mode
I hope the code is fairly self-explanatory. Let me know if you would like further explanation. The code has been through several iterations to address problems found in the processing stage, and it was for this reason that the “forceNode” boolean was added to several operations.
Program.fs
#light open Graph open RegExParsing open RegExCompiling open GleeGraph open Microsoft.Glee.Drawing let nodeAdapter (v:Vertex<NdfaNode, NdfaEdge>) (n:Microsoft.Glee.Drawing.Node) = let (nt:NdfaNodeType, sp:NdfaNodeSubPatternData option) = vertexData v n.Attr.Shape <- match nt with | Start -> Shape.Point | Closure -> Shape.DoubleCircle | Normal -> Shape.Circle n.Attr.Color <- match sp with | None -> Color.Black | Some StartSubPattern -> Color.Blue | Some EndSubPattern -> Color.Red let edgeAdapter ((id,priority,target,data):EdgeData<NdfaEdge>) (e:Microsoft.Glee.Drawing.Edge) = match data with | Auto -> e.Attr.Color <- Color.Blue | Simple c -> e.Attr.Label <- c.ToString() e.Attr.Fontcolor <- Color.Red e.Attr.Color <- Color.Black | CharacterTest t -> e.Attr.Label <- (sprintf "$test %20A" t) e.Attr.Fontcolor <- Color.Blue e.Attr.Color <- Color.Black | AnyChar -> e.Attr.Label <- "$any" e.Attr.Fontcolor <- Color.Blue e.Attr.Color <- Color.Black | StartAssertion -> e.Attr.Label <- "$start" e.Attr.Fontcolor <- Color.Red e.Attr.Color <- Color.Red | EndAssertion -> e.Attr.Label <- "$end" e.Attr.Fontcolor <- Color.Red e.Attr.Color <- Color.Red | NonGreedyAnyCharacter -> e.Attr.Label <- "$any*[non-greedy]" e.Attr.Fontcolor <- Color.Blue e.Attr.Color <- Color.Blue let run () = let frm = new System.Windows.Forms.Form() frm.Text <- "RegEx Compiler" let panel = new System.Windows.Forms.FlowLayoutPanel() panel.Size <- new System.Drawing.Size(100, 40) panel.Dock <- System.Windows.Forms.DockStyle.Top let txtBoxRegex = new System.Windows.Forms.TextBox() txtBoxRegex.Text <- "" let chkboxFullMatch = new System.Windows.Forms.CheckBox() chkboxFullMatch.Text <- "Initial matches" chkboxFullMatch.Checked <- true let btnGo = new System.Windows.Forms.Button() btnGo.Text <- "Generate" let label l = let c = new System.Windows.Forms.Label() c.Text <- l c :> System.Windows.Forms.Control panel.Controls.AddRange [| (label "regular expression") ; (txtBoxRegex :> System.Windows.Forms.Control) ; (chkboxFullMatch:>System.Windows.Forms.Control) ; (btnGo :>System.Windows.Forms.Control) |] let gc = new Microsoft.Glee.GraphViewerGdi.GViewer() gc.Graph <- new Microsoft.Glee.Drawing.Graph("empty") gc.Dock <- System.Windows.Forms.DockStyle.Fill; frm.Controls.Add gc frm.Controls.Add panel frm.Size <- new System.Drawing.Size(600, 600) let getSyntax () = let syntax = txtBoxRegex.Text |> RegExParsing.parseRegExp let mode = if (chkboxFullMatch.Checked) then FullMatch else FirstMatch let g = syntax |> RegExCompiling.compile mode printf "Pattern: %A\n\n" txtBoxRegex.Text printf "Syntax: %A\n\n" syntax printf "Graph: %A\n\n" g g let applyGleeGraph () = let g = getSyntax() gc.Graph <- (Graph.toGlee txtBoxRegex.Text nodeAdapter edgeAdapter g) g let applyGraph () = let g = applyGleeGraph () printf "-------------------------------------------------------------\n\n" btnGo.Click.Add (fun _ -> applyGraph () ) applyGraph () System.Windows.Forms.Application.Run frm run ()
Example
Here is some sample output for a regular expression pattern:
Pattern: "a|(b*)c|^H[e|E][^A-Z]?5+c$" Syntax: [Alternation [[Literal 'a']; [SubPattern [AnyNumber (Literal 'b')]; Literal 'c']; [StartMarker; Literal 'H'; CharacterRange (Matching [Character 'E'; Character '|'; Character 'e']); Option (CharacterRange (Inversion [Range ('A','Z')])); AtLeastOne (Literal '5'); Literal 'c'; EndMarker]]] Graph: ((21, 23), [((20, (Closure, null)), []); ((19, (Normal, null)), [(21, 0, 18, Simple 'c')]); ((18, (Normal, null)), [(22, 0, 20, EndAssertion)]); ((17, (Normal, null)), [(20, 0, 19, Auto); (19, 0, 16, Auto)]); ((16, (Normal, null)), [(18, 0, 17, Simple '5')]); ((15, (Normal, null)), [(17, 0, 16, Auto)]); ((14, (Normal, null)), [(16, 0, 15, Auto); (15, 0, 15, CharacterTest (Inversion [Range ('A','Z')]))]); ((13, (Normal, null)), [(14, 0, 14, CharacterTest (Matching [Character 'E'; Character '|'; Character 'e']))]); ((12, (Normal, null)), [(12, 2, 11, StartAssertion)]); ((11, (Normal, null)), [(13, 0, 13, Simple 'H')]); ((10, (Normal, null)), [(10, 0, 9, Simple 'c')]); ((9, (Closure, null)), []); ((8, (Normal, null)), [(7, 0, 7, Auto)]); ((7, (Normal, null)), [(8, 0, 6, Auto); (6, 0, 8, Simple 'b')]); ((6, (Normal, Some EndSubPattern)), [(9, 0, 10, Auto)]); ((5, (Normal, Some StartSubPattern)), [(5, 0, 7, Auto)]); ((4, (Normal, null)), [(3, 0, 3, Simple 'a')]); ((3, (Closure, null)), []); ((2, (Normal, null)), [(1, 0, 1, NonGreedyAnyCharacter)]); ((1, (Normal, null)), [(11, 2, 12, Auto); (4, 1, 5, Auto); (2, 0, 4, Auto)]); ((0, (Start, null)), [(0, 0, 2, Auto)])])
Pingback: F#: A Complete Regular Expression Processor « Steve Horsfield