WS-BPEL, Windows Workflow Foundation and Race Conditions

This post asks how to deal with race conditions in asynchronous service invocations. I have seen possible solutions in BizTalk and WF scenarios, but both are pretty messy and have race condition issues of their own. What do you do to address these issues in your solutions?

Basics

A typical operation in a composite service or business process is to invoke an operation on another service and then receive a response. There are several reasons to consider doing this asynchronously:

  • Queued messaging is used to meet reliability concerns when the client and the service may not be available at the same time
  • Scalability and performance criteria
  • The interaction is long-running
  • The service responsible for responding may not be the invoked service

So, in order to model an asynchronous interaction the naive approach is as follows:

  1. Create a sequence flow or use an outer sequence to group the following three activities:
    1. Create a correlation identifier
    2. Create a send task responsible for invoking the remote service operation
    3. Create a correlated receive task responsible for receiving the reply

From a logic perspective this is fine, but it introduces a possible race condition:

  • The service response may arrive before the correlated receive task has started, and this may cause the message to be rejected. In that case no further message is received and the process hangs
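The failure can be reproduced with a small Python sketch. Everything here is hypothetical and only models the behaviour described above: `CorrelationRouter` stands in for the engine's message correlation, and `fast_service` is an imaginary remote service that replies before the caller has set up its receive.

```python
class CorrelationRouter:
    """Toy router: delivers a reply only if a receive is already registered."""

    def __init__(self):
        self.waiting = {}    # correlation id -> list that collects the reply
        self.rejected = []   # replies that found no registered receive

    def register_receive(self, correlation_id):
        inbox = []
        self.waiting[correlation_id] = inbox
        return inbox

    def deliver(self, correlation_id, payload):
        inbox = self.waiting.get(correlation_id)
        if inbox is None:
            # Nobody is listening yet, so the message is rejected.
            self.rejected.append((correlation_id, payload))
            return False
        inbox.append(payload)
        return True


def fast_service(router, correlation_id):
    """Hypothetical remote service that replies immediately."""
    return router.deliver(correlation_id, "reply")


def naive_process(router):
    correlation_id = "order-1"                       # 1. create correlation identifier
    fast_service(router, correlation_id)             # 2. send
    return router.register_receive(correlation_id)   # 3. correlated receive (too late)


router = CorrelationRouter()
inbox = naive_process(router)
print(inbox)            # [] : the process would wait forever
print(router.rejected)  # [('order-1', 'reply')]
```

The reply is lost because the receive is registered only after the send has completed; nothing in the sequence prevents the service from answering first.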

Possible race condition solutions

One way of solving this problem is as follows:

  1. Create a sequential task comprising:
    1. Create a correlation identifier that is specific to this exchange
    2. Create a parallel task comprising:
      1. Create a correlated receive activity in the first branch
      2. Create the send activity in the second branch
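As a rough illustration, the parallel arrangement might look like this in Python, with asyncio tasks standing in for the two branches. The names `pending`, `deliver`, `receive_branch` and `send_branch` are made up for this sketch. It works only because asyncio starts tasks in the order they are passed to `gather`, so the receive is registered before the send runs; a scheduler that made no such promise would not guarantee the outcome.

```python
import asyncio

pending = {}    # correlation id -> Future waiting for the reply (hypothetical registry)
rejected = []   # correlation ids whose reply arrived with no receive registered


def deliver(correlation_id, payload):
    future = pending.pop(correlation_id, None)
    if future is None:
        rejected.append(correlation_id)
    else:
        future.set_result(payload)


async def receive_branch(correlation_id):
    # Register the correlated receive, then wait for the reply.
    future = asyncio.get_running_loop().create_future()
    pending[correlation_id] = future
    return await future


async def send_branch(correlation_id):
    # Hypothetical service that replies as soon as it is invoked.
    deliver(correlation_id, "reply")


async def process():
    correlation_id = "order-1"
    # First branch: receive. Second branch: send.
    reply, _ = await asyncio.gather(
        receive_branch(correlation_id),
        send_branch(correlation_id),
    )
    return reply


result = asyncio.run(process())
print(result, rejected)
```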

Unfortunately, this solution only moves responsibility for the race condition to the orchestration/workflow system. In WF this may not be an issue, because the parallel activity does not execute its branches concurrently and enforces a precise ordering on branch execution. In general, however, the ordering can be made explicit as follows:

  1. Create a sequential task comprising:
    1. Create a correlation identifier that is specific to this exchange
    2. Create a parallel task comprising:
      1. Create a sequential task in each of two branches
      2. In the first sequential add the following two activities:
        1. An event wait (that will be triggered by the second branch)
        2. A send activity
      3. In the second sequential add the following two activities:
        1. A correlated receive activity
        2. An event trigger to allow the first branch to continue
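Here is a minimal sketch of the event-gated version using Python threads, so that the ordering no longer depends on the scheduler. All names are hypothetical. One assumption is worth stating loudly: the sketch splits the correlated receive into "subscribe" and "wait for the message", and fires the trigger once the subscription is active, since a receive that blocked until the reply arrived could never release the send.

```python
import queue
import threading

subscriptions = {}   # correlation id -> queue for the reply (hypothetical registry)
rejected = []        # correlation ids whose reply found no subscription
lock = threading.Lock()


def deliver(correlation_id, payload):
    with lock:
        inbox = subscriptions.get(correlation_id)
    if inbox is None:
        rejected.append(correlation_id)
    else:
        inbox.put(payload)


def run_exchange(correlation_id):
    receive_ready = threading.Event()
    result = {}

    def send_branch():
        receive_ready.wait()                # event wait
        deliver(correlation_id, "reply")    # send; the imaginary service replies at once

    def receive_branch():
        inbox = queue.Queue()
        with lock:
            subscriptions[correlation_id] = inbox   # receive subscription is now active
        receive_ready.set()                         # event trigger
        result["reply"] = inbox.get(timeout=5)      # wait for the correlated reply

    branches = [
        threading.Thread(target=send_branch),
        threading.Thread(target=receive_branch),
    ]
    for branch in branches:
        branch.start()
    for branch in branches:
        branch.join()
    return result["reply"]


reply = run_exchange("order-1")
print(reply, rejected)
```

Whichever thread the operating system runs first, the send cannot happen until the subscription exists.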

This solves the problem of ordering the two activities. Unfortunately, it may not remove every race condition, especially if the composing process service is deployed redundantly with automatic network load balancing (although this is just one case where there may be a problem). If the workflow is executing the send activity at the moment the corresponding reply arrives (on another thread), the workflow may be in a locked state in which the reply cannot be processed. In that case the message must be queued for later delivery, which in turn limits the possibility of a synchronous response to the responding service. In a recent WF 3.5 solution I created a fault-tolerant framework that handled this kind of interaction correctly, but it was nevertheless a complex, implementation-specific solution to an intrinsic problem.
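The queuing idea can be sketched as a router that buffers early replies by correlation identifier instead of rejecting them. This is not the framework mentioned above, just a hypothetical, single-threaded illustration; a real engine would also need locking and durable storage. `BufferingRouter` and its method names are invented for the example.

```python
from collections import defaultdict, deque


class BufferingRouter:
    """Toy router that queues replies arriving before their receive exists."""

    def __init__(self):
        self.buffered = defaultdict(deque)   # correlation id -> early replies
        self.handlers = {}                   # correlation id -> waiting receive callback

    def deliver(self, correlation_id, payload):
        handler = self.handlers.pop(correlation_id, None)
        if handler is None:
            # No receive yet: keep the message rather than rejecting it.
            self.buffered[correlation_id].append(payload)
            return "queued"
        handler(payload)
        return "delivered"

    def register_receive(self, correlation_id, handler):
        early = self.buffered.get(correlation_id)
        if early:
            handler(early.popleft())         # a reply was already waiting
        else:
            self.handlers[correlation_id] = handler


router = BufferingRouter()
received = []

# The reply arrives before the receive is registered.
status_early = router.deliver("order-1", "reply-1")
router.register_receive("order-1", received.append)

# The receive is registered before the reply arrives.
router.register_receive("order-2", received.append)
status_late = router.deliver("order-2", "reply-2")

print(status_early, status_late, received)
```

Note that in the early case the responder learns only that its message was queued, which is the loss of a synchronous response described above.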

As you can see, we end up introducing substantial logic just to mitigate race conditions. If technically savvy people see this logic as necessary, it also limits the ability of non-technical people to author process definitions, which loses part of the benefit of the workflow or business process itself.

It is worth pointing out that WS-BPEL 2.0 (http://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html) acknowledges these race conditions but does not require an implementation to provide a specific resolution to these issues (or indeed any resolution at all).

So go on, tell me your experiences, or how a particular WS-BPEL or workflow engine addresses these message-exchange-pattern issues.
