F#: Sequence Comprehensions and Iteration

This is a quick post about sequence comprehensions in F#, and also a quick introduction to a few of the standard collection types.  I hit some issues while developing the regular expression engine (see my previous and upcoming posts!) and this post describes the issues and the solution.

Context

I wanted to produce a significant test for my regular expression engine, one that I knew performed badly on certain other regular expression engines that are known to use backtracking algorithms.  So, to do this I generated a sequence of a million ‘a’ characters followed by a single ‘b’ character.  The regular expression pattern I was to test was ‘a+b’.  This should match the whole input and should take time and space directly proportional to the number of characters (O(n) time and O(n) space).  This is because every character is analysed once and the whole input string should match the pattern.

Initially, I had problems with performance that meant that more than 200 characters of input took a long time to process.  Then a recursion that had not been optimised by the compiler (it was a simple refactoring exercise to make the algorithm tail-recursive) caused a stack overflow at around 4000 characters.  Once I had fixed that, I found that I still had very poor performance and I found that the input iteration was at fault.  Finally, I found that the collection I was using to store the match capture was not effective for large numbers of additions and removals.

Once I had addressed these issues, the performance was still not as high as I was expecting but it could still process a million characters in around twenty-two seconds.  If I was really doing this for performance then I would produce repeatable and comparable results, but this is more about learning F# than that!  Still, that is not bad for a first attempt on a pre-release compiler.

Basic collections in F#

There are three basic collections in F#:

  • Arrays, specified using [| and |]
  • Lists, specified using [ and ]
  • Sequences, specified using { and }

Of these, lists are the default choice for the functional programmer, arrays are common to many programming platforms and are mutable, and sequences (seq) are a type alias for System.Collections.Generic.IEnumerable<T>.

You can find out more about the characteristics of these types here: http://en.wikibooks.org/wiki/F_Sharp_Programming/Mutable_Collections.

The three most important things to know are:

  • Arrays are mutable, fixed-length containers (okay, that’s two things already!)
  • Removal of items from lists is expensive
  • Sequences can be lazily generated (and they share many properties with streams)

Now, if I want to simulate input from a stream then I need to use sequences.  And if I want a very long sequence then lazy generation is appropriate, so seq seems a good choice.

Sequence comprehensions

Sequence comprehensions are a way to describe a sequence using a generator function.  The generator function may be simple or complex and generators can be used for any kind of container, as the mechanism is the same as for computation expressions.  However, you will need to write your own expression builder class if you want to support a custom container.  The builders are used as follows:

  • let myArray = [| cexpr |]
  • let myList = [ cexpr ]
  • let mySeq = seq { cexpr }

cexpr is a computation expression and reflects the generator function.  The function can be expressed in a number of ways, including the simplistic “1..20” which generates a collection of the integer values between 1 and 20, or a multi-step generation function.  For more information, check the links in the references section at the end of the post.

Understanding sequences (seq) and the Seq module functions

Depending on the object that the seq represents, you may get different behaviour.  However, in my case I was generating a sequence of characters.  The value of my result was a char seq.  (Actually, I further developed this to provide a combination of start and end markers and position identification as well.  That is not important for this discussion.)  Now this sequence had a million characters in it, but how do I access them?  My initial attempt was to use:

  • let c = mySeq |> Seq.hd
  • let newSeq = mySeq |> Seq.skip 1

This achieves the outcome I desired.  I had a new sequence that I could use to refer to the items one step along the sequence and I could access the character at the index.  What was not apparent was that I was actually enumerating the sequence from scratch at each execution of these steps.  Why?  Because I was modifying the behaviour of an IEnumerable.  An IEnumerable is an object that can provide an enumerator.  Each time I was accessing the sequence, I was actually creating a new enumerator with a slightly modified generator function.  This was something like:

  • Initial state: (IEnumerable : generatorFunc = f ())
      -> getEnumerator()
  • 1 transition: (IEnumerable : generatorFunc = (f () |> skip 1)
      -> getEnumerator()
  • 2 transition: (IEnumerable : generatorFunc = (f () |> skip 1 |> skip 1) 
      -> getEnumerator()
  • nth transition: (IEnumerable : generatorFunc = (f () |> skip 1 |> … |>
      skip 1)  -> getEnumerator()

This is clearly not an efficient way to enumerate the characters represented by the enumerable.  Instead, I needed to manually iterate the collection.  This was the most significant problem behind my earliest performance limitation.

Manual iteration

 To manually enumerate the sequence, it is necessary to use the .NET class library types directly.  The following function converts a sequence into a counted iterator:

type FlowState<'T> =
    { iter: System.Collections.Generic.IEnumerator<'T> ;
      isStart: bool ; isEnd: bool; count: int }
let seqToFlow (s:'T seq) =
    let enumerator = s.GetEnumerator()
    { iter = enumerator; isStart = true; isEnd = false; count = 0 }

Next, a function is needed to move the iterator on one:

let flowChar (s:FlowState<char>) =
    if (s.isStart) then
      (StartOfInput,
      {iter = s.iter; isStart = false;
       isEnd = false; count = 0 })
    elif (s.isEnd) then
      (EndOfInput s.count,
      {iter = null; isStart = false;
       isEnd = false; count = s.count })
    else
      let hasNext = s.iter.MoveNext()
      if hasNext <> true then
        (EndOfInput s.count,
         {iter = s.iter; isStart = false;
          isEnd = false; count = s.count })
      else
        (Char (s.iter.Current, s.count),
        {iter = s.iter; isStart = false;
         isEnd = false; count = s.count + 1 })

Finally, a third function to identify when the sequence has completed is helpful.

let flowEmpty (s:FlowState<'T>) =
    s.isEnd

Put all together, these can be used to model a manual iteration. Of course, this is flexible as you can choose how to design the type and the methods.

Other considerations

As with any algorithm, the choice of data structure is imperative.  Once I realised that the performance of the F# List was not appropriate for removals, I decided to use the .NET doubly-linked LinkedList collection.  However, I am also planning to compare the performance of StringBuilder for this specific purpose.  Changing the data structure can be very straightforward if carefully planned for.  Of course, this is just a hobby project so the planning was slightly limited…

I hope you have found this post helpful.  Sequence comprehensions are very useful, but this is a significant potential problem with using Seq.skip to iterate a generated sequence exactly once.  If there is a standard approach to this then do let me know!

Caveat

I have not been using the most recent version of F# and I am using pre-release code.  One area that has recently been improved is sequence comprehensions and so I am not sure to what extent these problems still occur, however I suspect that they do as they reflect a functional view of a sequence rather than an iterative one.

References

Leave a comment