Wednesday, July 25, 2012

Golf Tournament - 5th SS Bajwa

Couple of years back (Aug, 2010) played my first Golf Tournament and playing with a partner we finished 3rd. It was 5th SS Bajwa Golf Tournament organized by Kang Brothers in memory of their late uncle. The tournament format was two man Scramble. Playing back tee and Flight A (handicap 12 or under), we shot 3 under 69. Tournament home page -

Monday, July 23, 2012

Good Magazine

Few years back I got introduced to Good Magazine during one of Consumer Marketing class I took. Ever since I'm hooked on to this great magazine. In this blog entry I'll review this magazine and outline what I like about it.
Good Magazine, henceforth referred to as GOOD (as they like to call them as), sets up their mission in their ‘about’ section on their website. It says and I quote “GOOD is a collaboration of individuals, businesses, and nonprofits pushing the world forward. Since 2006 we've been making a magazine, videos, and events for people who give a damn."
Further their website defines their magazine as following in a very powerful and no-nonsense manner:
In a world where things too often don’t work, GOOD seeks a path that does. Left, right. In, out. Greed, altruism. Us, them. These are the defaults and they are broken. We are the alternative model. We are the reasonable people who give a damn. No dogma. No party lines. No borders. We care about what works--what is sustainable, prosperous, productive, creative, and just--for all of us and each of us. This isn’t easy, but we are not afraid to fail. We’ll figure it out as we go.
Wikipedia states GOOD "is a media platform that promotes, connects, and reports on individuals, businesses, and non-profits "pushing the world forward.... produces online videos, and events highlighting examples of what is sustainable, prosperous, productive, creative, equitable. The content covers a variety of topics, including the environment, education, urban planning, design, food, politics, culture, lifestyle, technology, and health."
From start they followed unconventional business model of donating all their subscription money to charities. Upon further exploration of the website, I find that they deliver what they clearly state in their mission. Each of their main sections contain sub-sections that talk about not just news and stories of the day, but also thought provoking and stimulating articles. With excellent array of articles on Lifestyle, Culture and Business are bound to invoke keen interest from their target readers. 
While over the years I have seen magazine evolve to more broader areas, the general appeal and their still remains very much the same - providing platform for the ideas, people, and businesses that are driving change in the world. 
GOOD provides platform to share ideas around common good that their readers value. GOOD readers predominately care for a cause and the first reason to pick this magazine is perhaps this realization that in the process they are helping a cause, hence the good feeling. It is expected that readers of GOOD are more educated, employed and perhaps financially more stable. 
The popularity and growth of GOOD’s in large part is dependent on two factors, their network of charity they deal with and the quality of the content in their magazine. Since readers would like to see their money go to the charity close to their heart, lack of association of the charity with GOOD can be a put off for some. The revenue for the magazine is mainly advertising, hence the readership/viewer-ship is key for success of such venture. In essence, the challenge for GOOD is producing work (content) that is challenging and inspiring.
GOOD, in its unique do good – feel good – read good model is unique and being the leader in this concept along with very appropriate title should be a very formidable competitor to any other similar non-profit. The key to understand here is that although, they contribute 100% of the subscription to charity of readers choice, they themselves are not `non-profit. Thus, they offer best of both worlds, they look like non-profit, hence offer great appeal, but they can function like for-profit business.
In closing, if you haven't checked out good yet, you must give it a try and judge for yourself. I promise you won't be disappointed.

BPEL SE (Java CAPS) Monitoring Console

The last major task I did during Sun days was building of new Monitoring console for BPEL Service engine. It was complete rewrite from scratch of console to provide insight about running BPEL SE. Note that this product was not released as open-source as part of project open-esb. Anyway, was wanting to write a blog about it for a very long time, just did not sit down to do it. Working with documentation group we prepared a user guide, but was not able to found any online link to share (old Sun docs links have moved). Anyway, this blog would capture some of that.
The new monitor console is independent web application which was developed using open JSF framework Icefaces ( which provide rich ajax enabled components. It runs on Glassfish server in a Glassfish ESB environment. Here are some excepts from user guide:

The BPELMonitoring Console monitors your BPEL Service Engine's applications and business
processes, allowing you to quickly discern the health of your system. The console is designed to provide a comprehensive view of your current applications. It provides a real time representation of your business processes throughout the life cycle of each instance. The console also enables you to drill down to see what is happening with any specific application or process. It allows you to track down a business process based on customer information and to suspend and resume an instance for system maintenance.
The BPEL Monitoring Console runs as an independent web application and can run from a
remote computer. The console runs on the Glassfish server in a GlassFish ESB environment.
The console is designed to provide maximum responsiveness and uses Ajax to ensure that you see each point in process as it happens.

The Dashboard
The BPELMonitoring Console starts with the Dashboard, a top-level window that provides a holistic picture of all your deployed applications.
From the Dashboard you can see:
  1. Which processes are running
  2. The number of process instances that have completed, faulted, or have been suspended, or terminated
  3. The time at which the most recent business process instance occurred
  4. The Instance Processing Rate or number of instances for a specified period
The Business Process Home Page
The Business Process Home displays statistical information for the selected business process, similar to the Dashboard. In addition, the Business ProcessHome provides a graphical model of the business process, as well as a textual display of the business process code for an even finer level of information. From this window, you can drill down into instance information for all instances, oldest instances, or most current instances.

The Instance Home Page
On the InstanceHome you can view a group of business process instances or select a specific instance. From the Instance Home you can:
  1. View all completed, suspended, or terminated business process instances
  2. Choose specific instances to view, from the oldest to the most recent
  3. View the Service Engine, instance ID, start, end, updated time, and status for an instance
  4. Look at the life cycle of an instance in real time
  5. View the variables for the instance
  6. Click on a process instance to display a Process Scalable Vector Graphic (SVG) model of the instance in the real-time current state of execution
  7. Suspend, resume, or terminate one or all instances for maintenance or customer service

Sunday, July 22, 2012

BPEL SE (Open-ESB) Scalability Support

Scalability Support for BPEL Service Engine

The complete solution for scalability of system is co-coordinated work of all the constituent parts; service engines, binding components and NMR. When the system starts maxing out heap space, the full GC kicks in and ‘pauses’ the JVM. This has debilitating effect on the system performance. A properly designed system is expected to know expected usage and provide enough resources for capacity utilization.
The specific problem in the context of business process orchestration involving interactions with external systems is the in-determinism in terms of responses for the outbound requests. When that happen the business process instances continue to hog memory and new requests for such processes would only aggravate the memory situation.
Since BPEL SE is the only component currently in the open-esb that can be configured to be used with database, we can leverage this to provide scalability for long running processes. Also the BPEL SE scalability solution needs to be thought in conjunction with its recovery and monitor capabilities. In my opinion, the best solution is to have very large heap space and nothing else would be required. But, if that cannot happen, at least with 32-bit boxes, and also for other reason s scalability solution as outlined here for BPELSE can help.
The scalability solution can be called in phases based on early detection and also following some patterns of business process behavior. The complete solution may be complex, but taking a trivial approach may not solve most (if not all) the use cases. Since the primary mechanism for scalability solution of BPEL Service Engine is passivation, there will be some overhead associated with that happens. Also, in order to minimize the overhead and not be overly aggressive about it, attempt will be made to do the passivation in phases with each phase providing incremental relief.

Figure 1: Phase 1 (Variable) and Phase 2 (Instance) Passivation for Phase 2 Scalability Solution

Configurations and Factory Defaults

Whenever the persistence is on, the Scalability Solution (Phase-1 and Phase-2) is enabled by default. Using system properties the solution can be turned off. Also, the phase-1 and Phase -2 solution can be selectively turned off. In addition there are other system properties that allows for configuring the memory levels and the age criterion used by the scalability for memory release. Please note that default values for all these properties are already supplied. Here are the System Properties to use and the explanation thereof.

Scalability Levels (Default=phase2)(Valid options - none, phase1, phase2) For example to turn off Scalability: com.sun.jbi.bpelse.scalabilityLevel=none
Upper Memory Threshold (% of Total Heap)(Default 75%). For Example to set it to 85% of heap size: com.sun.jbi.bpelse.scalability.upperMemoryThreshold=0.85
Lower Memory Threashold (% of Total Heap)(Default 50%). For Example to set it to 60% of heap size: com.sun.jbi.bpelse.scalability.lowerMemoryThreshold=0.60
To Configure memory release age criterion to start with (Default 2 Seconds). For example to set it to 1 hour (enter value in seconds): com.sun.jbi.bpelse.scalability.idleThreshold=3600
Recovery Batch Size (Scalability need to be turned on)(Default=50). For recovering large number of instances in batches. For example to set it to 25 : com.sun.jbi.bpelse.recoveryBatchSize=25
Experiments to Analyze BPEL SE Scalability Issues also post Phase-1 and Phase-2 Improvement Measurements
Details can be found here.

Design Overview

Phase 1: Variable Passivation

When the free memory falls to certain levels of max heap size, identify all the instances that are waiting for any of the massaging activity (response, status) and wait activity for some defined period of time (hard coded value of 10 minutes, can be made configurable). Scan through these qualifying instances and if the variables are dirty, store them in the database (same variable table with special mark, if not dirty, they are already persisted due to last persistable activity) and de-reference all the variables. Such instances (in-memory) will be marked scalability passivated to avoid the scalability thread from visiting them again. When the messaging event or timer goes off for waits, the instance during its execution will load the variable as and when needed, if the variable is marked as passivated. When that happens, the instance would be marked back as non-scalability passivated.
If the available free memory does not meet this threshold limit, then halve the time duration for identifying the waiting/held instances and do the 1 above again. Repeat this till the wait time is zero.

Phase 2: Instance Passivation

If Phase 1 does not yield the required free memory (i.e. all the variables of paused instances are knocked off), Phase 2 scalability solution will be called.

Start removing the instances that are not not moving forward (due to defined wait, waiting on status/response or correlating receive) completely from memory. Note that these instances are already persisted in the database. Before, removing these instances, appropriate information (please refer design details below) is stored in database. These instances can be resurrected back, when needed.
Again, as for Phase 1, the passivation will happen on time inverse scale with the instances waiting for most of the time will be taken out first before removing the newer such instance. The only exception to this rule is for waiting instances (waiting for the defined wait) where there is predictability of wait, hence the instances that are going to wait the most (based on the alarm expiration) will be taken out first.
In addition to providing support for scalability for BPEL engine, we need to take care of the scalability of recover/Monitor and Fail-over in the case of cluster.

Recovery Scalability

Without the scalability solution, during recovery the bpel service engine loads all the instances in the memory, and try to recover them all. If there are very large number of instances to be recovered, engine may run into out of memory issues.

When there are very large number of instances to be recovered, the instances will be recovered in batches with check on available memory before scheduling next batch. If the available memory falls below as defined by phase/level 1 solution (refer related integration notice), the level 1 scalability solution will be called to free up memory. If enough memory is freed, the recovery will continue, otherwise the recovery will paused and will be periodically visited to finish up the recovery work.

Fail-over Scalability (Cluster Case)
With this solution in place, BPELSE in cluster performs fail-over by acquiring all the instances of the failed engine(s). When the engine is running with low available memory (heap space), this could be result in destabilizing engine/system. The scalability solution enables the engine in cluster to fail-over the instances without causing memory issues and also better distribute the fail-over work amongst the live engines.

Monitor Scalability

BPELSE Monitoring API provides BPEL process instance query that may return thousands of instances and runs a risk of memory exhaustion. The scalability design for monitoring takes it into account and imposes maximum instances restriction (1000) in conjunction with with sorting and ordering to help user identify the relevant instances.User can specify a lower maximum instance count (<1000) and sort on startTime, endTime, and updatedTime in either ascending or descending orders.
Limitations of current solution
If the defined process is synchronous process (receive-reply), the instances cannot be passivated while in the middle of execution of receive-reply. Hence, if the BPEL engine runs into memory issues do to inordinate wait of request-reply, this solution cannot help. The reason why we cannot passivate the instances in the middle request-reply is due to fact that by passivating the requester reference (Message Exchange) will be lost (currently, the ME cannot be passivated).
This solution assumes that the synchronous processes do not take long time to complete. One way to ensure scalability for Synchronous outbound invocations to external web service (that can take long time to complete), is by configuring throttling property on the connection. Throttling can limit the number of simultaneous outstanding requests at a given time, hence limiting the memory usage.

Design Details:

During BPEL Service Engine initialization an instance of BPELMemoryMonitorImpl class is set on the engine. This class provide utility methods to do the memory levels checks and also defines the Phase1 Upper and Lower Free Memory ratio limits.
When the free memory falls below the threshold as defined by Lower Free Memory Ratio, the phase 1 solution will be called first. If phase 1 solution does not succeed in releasing enough memory to meet this threshold, phase 2 will be employed.

Phase 1 Scalability Solution

When scalability is enabled, the main entry point for Scalability Solution is method doMemoryMangement() on EngineImpl. This gets called during processing of any request by BPEL SE.
The current memory levels are checked; if enough free memory (as defined by Phase1 Free Memory Ratio) is available, this method returns.
If the memory level check fails, recursing through all the instances of deployed processes, a call is made to passivate variables of the instances not moving forward (due to wait/corelateable receive/pending status/pending response).
This variable passivation is also done in phases, where in during the initial call only instances that are waiting for 10 minutes or more is picked for variable passivation.
Even during variable passivation for a given time criterion for a business process, the memory levels are checked while iterating though the identified instances. This ensures that just enough variable passivation happen and we don't aggressively do this passivation.
During execution of process instance, it pass through various persistence points. At these persistence points the state of the process instance along with all the dirty variables gets persisted. In order to take advantage of this fact, the variables used by the process instance will maintain the information regarding the fact that the variable is dirty (whenever it gets changed) or not (no change since last persistence).
The persistence schema is changed to add a new column SCALABILITYPASSIVATED for the variables table.
During phase 1 scalability passivation of instance variables, if the variable is found dirty, it will be saved in the database, with SCALABILITYPASSIVATED marked as 'Y', otherwise the variable data (payload) will be de-referenced (and hence removed from memory) and variable marked as Scalability Passivated.
When all the business processes are recursed and variable passivation for time value of 10 minutes returns, another memory check is made. If still enough memory is not available, the time interval is halved (5 minutes) and step 2 is called again. This process repeats till the time period is zero.
If even after de-referencing all the in-memory variables of the held process instances, still Phase 1 free memory criterion is not met, phase 2 solution will be called.
As a result of Phase 1 Scalability Solution call interleaving the regular persistence calls, process variables can have various states. Each variable maintain three state flags - namely:

Inserted - When the variable is created this is FALSE. If the variable is inserted into database due to regular persistence point during process execution (not Scalability Passivation) this changes to TRUE.
Persisted - When the variable value is changed (dirty), this is FALSE. When the variable is persisted, it is marked TRUE.
Passivated - By default this is FALSE. When the variable is scalability passivated, this changes to TRUE.
Also, the persistence and scalability call on the variable would result in altering the above state of the variable. Following figure explains the various states and shows all the possible combinations of the before and after state as a result of scalability/persistence and in-memory read/update calls.

Figure 2: Variable States for Phase 1 Scalability Solution

Phase 2 Scalability Solution

Since the phase 1 solution (above) could not make enough free memory available, more aggressive action will be attempted to free up memory. In this phase all those instances that are not moving forward (same as identified by phase 1) will be progressively passivated, starting with most aged ones.

Passivation of instances held due to wait activity

  1. Recurse through all the waiting instances and passivate them.
  2. The passivation will be gradual starting with time criterion of instances that need to wait for 10 minutes or more. If this does not produce enough free memory, this time will be reduced by half and the processes will be repeated again repeatedly with memory check between each iteration.
  3. The idea is not to be aggressive with passivation and do only enough passivation.

Passivation of Instances held due to pending response for outstanding two way invokes


  1. When the scalability call is made to passivate instances pending on response, persist the msg ex id to inst id in a new table, OUTSTANDINGMSGEX which will keep this info.
  2. When the response/status happens, the instance will be identified by this table and scheduled for recovery.
  3. When the execution reaches invoke, it will construct CRMPID, at this time this CRMPID will be checked to see if there is any response available for this. For this case there will be one, hence no new invoke will be made and the invoke unit with further process the response message.

Passivation of Instances held due to absence of message for correlating IMA (receive/pick)

This relates to passivation of instances that are held in mRequestPendingInstances queue for waiting for the correlatable message for further execution.

  1. The instances waiting on mRequestPendingInstances that qualify for phase 2 passivation would be removed from memory.
  2. Currently, the correlation values for the IMA are not persisted along-with persistence of the IMA. The persistence point for correlation values being next persistence point. Since the instance is scalability passivated, these correlation values will be persisted before getting rid of the instance.
  3. Also, if the engine is running in cluster mode, the cluster logic will passivate the instance when the execution reaches correlatable IMA and if the message for such IMA is missing. Since the could be chance that the scalability logic to be kicking same time, the scalability passivation (for engines in cluster) will use the same passivate instance to do scalability passivation. This ensures that the scalability logic will not interfere with cluster logic of passivation and that the passivated instance can be executed by any engine. The cluster passivation api also persists the correlation values. For single engine case, only the correlation values will be persisted. Also note that for single engine case the waiting IMA table need not be populated (as done for cluster)
  4. When the correlating message arrives on the engine, the instance will be activated (recovered from the database). A new api on state manager will return the instance id for the given correlation values. Note for cluster activation uses different api that in addition to finding instance for recovery, does the ownership transfer of the instance as well.

Not to scalability passivate instance if there exists active (outstanding) request for in-out message

If the instance execution is in the middle of active request-reply, the instance should not be passivated (even if the instance qualifies for scalability passivation, due to wait/outstanding response/status) as in doing so the requester context (message exchange) will be lost. In order to scalability passivate such processes (that are defined with receive-reply) user need to configure the throttling property on the service connection (to be available on CASA editor, work under progress)


  1. Currently the outstanding requests are maintained as part of the context object (IMAScope). One such IMAScope is constructed when the process instance is created and also for each scope thereafter.
  2. For scalability support, such IMAScope's will be added to a list maintained at process instance level.
  3. Two new apis on process instance would allow the IMAScopeBride to be added to process instance.
  4. Whenever an instance need to be scalability passivated, a check using the new api on ProcessInstance would return existence of such outstanding request at any given point during execution, and for positive result the instance would not be scalability passivated.

BPEL SE Monitoring in conjunction with Scalability (instance suspend and resume functionality)

Splitting of ready to run queue introduced issue of scheduling of instance even if marked suspended. After the split the instances on waiting queue are not checked for suspended status causing this issue. Also, to support the instance suspend and resume in conjunction with scalability passivation (phase 2) following design will be used:

  1. When a waiting instance (due to defined wait) is identified for scalability passivation, the suspended flag is saved in the bpit (along-with the instance id) and the callframe is dereferenced (for releasing instance and associated memory).
  2. During normal execution the locally cached waiting bpit is first checked for timer expiration. But, before this check is made, the instance is verified that is not marked for suspension. If it is marked for suspension, the next expired and non-suspended waiting bpit is picked from the waiting bpit collection maintained at ReadyToRunQueue class.
  3. When an instance (which is suspended earlier) is called resume from monitor, also if the instance is scalability passivated, the same is activated (recovered) from database and scheduled again for execution.

Splitting ready to run queue - providing a special collection for waiting business process instances

At various points during the execution of a process instance a special instance wrapper Business Process Instance Thread (BPIT) is created and put on a queue (LinkedList in ReadyToRunQueue class). This same queue is also used for adding BPITs waiting for wait activity in the process definition. This current design has following pitfalls:

  1. All the engine need threads need to do a full scan of this queue to get the least wait time before call accept on delivery channel with wait time (if any such instance(s) exist on the queue). This will happen even though there may not be any waiting BPIT on the queue. This means additional processing and would have performance implications for highly loaded engine with very large number of concurrent instances under execution.
  2. During execution for wait type BPIT the first expired BPIT is picked for execution, even though there might be another wait BPIT which happen to have expired even earlier.

When instances that are waiting on ready to run queue (due to wait activity) are scalability passivated, keeping them on the regular queue will have performance implications. Although this is strictly required for regular execution, but found benefit of have two queues, hence added for regular execution also.


A separate data structure (a HashSet) is used to store the waiting BPITs.
The class ReadyToRunqueue will have local member variable which will store the BPIT with most recently expiring time.
Whenever a new wait type BPIT is to be added, a comparison will be made with this local object and if the new BPIT has more recent timeout, the new BPIT will be locally cached and the previously cached BPIT will be added to the set, otherwise the new BPIT will be added to the set.
Whenever a ready instance is requested for execution, this locally cached BPIT will be checked for expiry, if expired the same will be returned. There is no need to scan the Waiting BPITs Set.

Failover Scalability Solution (BPELSE in Cluster)

NOTE: This support is only added for Oracle as the equivalent of rownum function (that allows limiting the resultset for a update query) is not available for Derby at this moment. Also note that this feature for Derby is being worked on and planned for release early next year. Once this is available support will be added for Derby. Clustering will still work with Derby though, only scalability of failover support will not be available.


Currently, the engine used GUID for engine id and every time the engine restarts it acquires a new id value. It will be changed to use static engine id. The cluster’s instance id will be used as engine id. The advantage of using static engine id is that prevent the engine entries from growing in the engine table and also it will also allow engine to load its instances in the event of crash followed by restart (Note: It is not necessarily required that the engine upon restart load its instance, but wont hurt if it does). Since, currently we don’t support multiple bpelse in same cluster instances, using cluster’s instance id as engine id should not cause any issue.
During engine start, engine will update the heartbeat using the static id, success of the call indicate that id already exists, otherwise, this engine will be
registered in the database against the static id (instance id of cluster).
Currently in cluster mode, each engine after heartbeat update queries for the failed engine(s), and if found, acquires the running instances of such engine(s).
The change would be to pass the BATCH_RECOVERY_SIZE (default value 50, Can be changed using System property ‘com.sun.jbi.bpelse.recoveryBatchSize’; also used for batch recovery for single engine case) and the engine during this check would only acquire batch number of such instances of the failed engine(s).
The update return count for the above call to update the dangling instances will provide hint that the failover of instances completed or not (a return count equal to BATCH_RECOVERY_SIZE indicate there could be more).
After acquiring the instances for failover, engine would call recover methods to load the instance data (from database) into memory. If the return count was equal
to BATCH_RECOVERY_SIZE, the update dangling instances call will be made recursively till the return count is less than BATCH_RECOVERY_SIZE.
If the failed engine were to come back up, it will be able to acquire its remaining running instances prior to fail over (assuming not all the instances of this failed engine are already taken by other live engines in the interim).

BPEL SE (Open-ESB) Clustering Design

Architecture and Design Document of BPEL Service Engine Support for clustering & Failover

To achieve Scalability and High Availability for BPEL Service Engine running in JBI environment. A cluster is defined as a set of engines running in multiple JBI Environments running in Application Server Cluster. When a business process needs to be scaled to meet heavier processing needs, you can deploy Service Assemblies to BPEL Service Engine running on application server cluster to increase throughput.

To achieve transparent failover of running instances of the failed engine in cluster over to any live engine for continued processing.

How to Enable: When the app server is running in cluster mode, it sets some System properties indicating this fact. BPEL Service engine during initialization queries this property to discover engine running in cluster mode and starts behaving accordingly. There is no explicit flag on engine that user need to configure to enable BPEL SE Clusterig.
Engine Heartbeat Update Time: Each engine participating in cluster periodically updates database table to indicate its livliness. The perodicity of this update can be cofigured through property EngineExpiryInterval on BPELSE (refer this page  for engine configuratios).

Design Goals:
Enable installation of BPEL Service engines on Application Server Cluster.
Enable BPEL Service engine component installed on Application Server Cluster to be able to process incoming requests
The overall throughput of cluster should increase linearly (or almost close to) in relation to the number of engine instances in the cluster.
In the event of failure of any one the application server or engine, the other engines should continue processing new request. The in process (in-flight) requests of the failed engine should continue to be processed by any of the live engine.
Support business process with instance correlation.
Single database will be used for implementing this feature. Database will be common for all the BPEL service engines in cluster and all the service engines need to be configured to use same database
Database is considered to be highly available. The database failure will constitute single point of failure.
BPEL Service Engine uses Application Server System property com.sun.jbi.isClustered to determine if the engine is participating in cluster environment. This property is extended by Glassfish Application Server. For other application server user need to set this system property for clustered environment.
It is assumed that BPEL Service engines are running in multiple JBI runtime environments that in turn are running inside distinct application server instances which can be on the same machine or different servers. Clustering support is not available for Multiple BPEL Service engines within same JBI environment on one Application server.
Load balancing will be specific to protocol/binding component in use.
The load from the failed engine will not be distributed equally to the live engines. Instead, in the event of failure of one of the engine, any one of the live engines will take over all the instances of the failed engine. Since the current design does not distribute the load across all the live engines, there is possibility of engine overload in the event when huge number of instances that needs to be failed over.
The servers in a cluster as assumed to be on the same time-zone.
Current Limitations:
Cluster support is not available for business process that contains event handlers.
For reliability and recovery, BPEL Service engine persist the state of the instance as soon as the non-replay able activity is executed. In the event of crash, upon restart of the BPEL Service engine, the persisted state can be loaded back in the memory and the execution continues from the last persisted point. Clustering support of BPEL service engine is leveraging engine's persistence and recovery feature.

An important aspect of clustering support is high availability, which in essence means that in the event of failure of any of the application server instance/BPEL Service Engine of cluster, continue processing new and already in process requests by the live engines, and also transparent failover of in-process instances of the failed engine (by any of live engine) for further processing. Also, In order to support correlation based projects where in multiple requests need to participate in stateful interaction with same instance, mechanism needs to be created that the correlated message(s) are able to join the already created instance.

Multiple design approaches were considered primarily to solve the correlation feature of business process definition (see appendix for more details on other options considered) including Intelligent Message Router, Message Routing and Engine leasing with instance activation/passivation mechanism.

Current implementation is based Engine leasing and instance passivation/activation mechanism and was chosen for simplicity of design and implementation and also to leverage already robust persistence and recovery support for BPEL Service Engine. See Appendix A for details on other design considerations.

Design Details:

Clustering support for BPEL Service engine entails design for the following two main features and are discussed in the subsequent sections:

Failover Support
Correlation Support
Correlation Support is further discussed in following sub-sections for various types of Inbound Messaging Activities (IMA) as defined by the BPEL Specification and use case scenarios.

  • Support for Multiple Receives
  • Support for Out of Order correlated messages
  • Support for Pick Activity with no on-Alarm defined
  • Support for Pick Activity with Alarm defined
  • Support for Multiple Messaging Activities (IMAs and Invokes)in Same Flow Activity
  • Support for two way invoke to Sub Business Process containing Correlated Messaging Activities (Receive/On-Message)

The design of engine support for clustering is based on central common database used by all the active engines participating in a cluster. This database is assumed to be scalable and highly available. BPEL Service engine will use Application Server data source JNDI resource (therefore underlying connection pool) for its persistence and recovery and cluster related operations.

Before installation of BPEL Service engine on Application Server cluster, user needs to create a JNDI resource and pass this to engine configuration during installation. Please refer this page  for the details on how to configure data source for currently supported databases. During installation of BPEL service engine on Application server cluster engine will check if the persistence schema is created, if not, it will create the persistence schema and register itself in the ENGINES table. (For details on persistence schema refer this page ).

Failover Support
Each engine actively updates its lease using a heartbeat signal to ENGINE table of this common database. After updating the lease, active engines would query this table (ENGINES) to find out if any engine has not updated its lease in the specified (configurable) interval. If any such engine(s) is (are) found whose last lease is expired, that engine(s) is deemed failed and any active (running, in process) instance of that failed engine will be acquired by the first querying (live) engine. These instances will then be loaded in the memory for further execution.

BPEL Service engine does not create any separate (daemon) threads to updating its lease and recovery of the failed engines instances, but uses the same thread pool (configurable parameter MaxThreadCount) used for processing incoming requests on the DeliveryChannel. On no loaded engine (no in-coming requests) these thread block on the delivery channel for a period of next lease update time, which is currently set to 60% (non-configurable) of engine expiry interval (configurable).

As part of optimization, the process of identifying the failed engine and loading of instances in memory (if any, of failed engine, if found) is not done in one single SQL, but rather in following independent steps.

Overall the following steps are involved in lease update and assumed failover.

If the time elapsed since last update of lease is within 60% of engine expiry interval

Update engine lease.
Update the dangling instances (running instances of failed engine) of failed engine with this (querying) engine's id and return the update count.
If count is greater than 0 (which means, there were running instances of failed engine, updated with this engine's id), do the recovery.
In the recovery call, query all the running instances of this engine from database. From this list removing the instances in-memory would result in the remaining instances. Load the instance data and schedule them for recovery (further execution).

Correlation Support

Figure 1 : Instance Passivation and Activation. Correlated Messages (M2) arriving on different Engine (E2) at later time (T2) than the engine that created the instance (Engine E1 for Message M1 at Time T1)

As mentioned in the overview section, the current implementation of clustering support of BPEL Service engine does not use any intelligent routing of the messages. This in essence means that for a project defined with correlation, the correlated message(s) for the same instance may in fact end on different engine than the one that created the instance. Hence we have two options, either to route the correlated message(s) arriving on different engine to the engine which created the instance, or route the instance to the engine that received the correlated message. We chose the later, again for simplicity of design and implementation.

The design and implementation further explained using the following sub-sections

1. Support for Multiple Receives

a) The messages (first message and also the correlated message) get distributed randomly across the engines.

b) The IMA with defined attribute createinstance as true would create the instance and continue the execution till the execution hits the point where it reaches the correlated IMA, at this point one of two conditions can happen

i) The correlated message is already available at this engine, it will be consumed by the instance and further execution will continue.

ii) It is not. The instance will be *PASSIVATED* (see Figure 1 above), i.e. instance will be marked in the database, the correlated waiting IMA registered and the instance removed from memory. The correlated message arriving on any of the live engine will fist try to find the instance to correlate in memory based on the correlation value calculated (as per process definition). If instance is not available in memory, engine will query the database and if instance is available in the database for the correlation key values and also if message's IMA matches the waiting IMA registered for the instance during passivation, the instance will be loaded in the memory. This ensures that only the correct incoming message (of multiple correlated IMA that might be defined) is handled for processing and not one of other messages (IMA's defined down-steam in process definition) where the process execution has not yet reached. Once found, the instance is loaded in memory and the execution continues. We call this process of finding and loading the instance as ACTIVATION.

2. Support for Out of Order correlated messages

a) Since the messages arriving on the BPEL service engine might be out or order, i.e. the correlated message for an instance may in fact arrive before the message that actually created the instance on the same or different engine (another case when message arrives on other engine even if instance is created, but not persisted/passivated yet); these cases need to be handled as well.

b) Such (out-of order) messages are stored in a special data structure and this data structure is checked periodically to see if contains correlated message events, if it does, query the database to get matching instances (based on correlation id and matching IMA type on passivated instances). If any such instance(s) are found, engine would acquire ownership (see section Failover Support above for details), ACTIVATE the instance and process recovery for further execution.

c) At some point the message that create the instance arrives on some engine and the instance would be created and this instance upon reaching the correlated IMA will first check this data structure to find out-of-order correlated message, it still not found, would passivate the instance.

d) Such passivated instances, will be found by the above defined periodic poll (step 2.b) by the engine that got the out-of-order correlated message.

e) The polling for out-of-order message is not done by any special thread, but tied to the engine lease update thread (see failover support above).

3. Support for Pick Activity with no on-Alarm defined

The pick activity, as defined by BPEL 2.0 Spec, waits for occurrence of exactly one event from a set of event then executes the activity associated with that event. Pick activity is comprised of set of branches, each containing an event-activity pair. The event can be of on-Message type (similar to receive activity) or an on-Alarm event. An on Alarm is timer based event. Pick must have at least on on-Message activity defined. The on-Alarm activity on Pick is optional.

For a business process that does not have on-Alarm defined, the behavior of pick is same as that of multiple receives. The above design and discussion (Section 2) also applies for pick with no-on Alarm defined.

4. Support for Pick Activity with Alarm defined

A pick activity defined with on-Alarm poses special challenges to design and implementation. This is due to the fact that no only the correlated message can arrive on any of the engine, the time for on Alarm starts as soon as the pick activity enters execution. In the event that the time expires before any of the on-Message events were to happen, the execution will chose the on-Alarm branch. Hence, we need to keep track of the on-Alarm(s) timer(s) in conjunction with the messaging events. The design also takes care of the case where while the on-Alarm is active, the engine might crash.

a) During execution when a pick activity with correlated IMA (on-Message) is encountered a special timer type object3 is created for each on-Alarm defined and scheduled with the defined alarm duration. The instance is passivated (see discussion for section 1 for details)

b) If on-Message, if defined, were to arrive on any of the engine before the expiration of on-Alarm, the instance is activated and the processing continues.

c) The case where in the instance is passivated and correlated on message event has not arrived and also on-alarm not expired and the engine (that has on-Alarm timer) crashes is also handled. During of recovery of such passivated instance, all on-Alarm special timer type ojject3 are reconstructed for the remaining duration (non-expired) and scheduled in the memory such that the pre-crash state of the instance is constructed in memory.

5. Support for Multiple Messaging Activities (IMAs and Invokes)in Same Flow Activity

a) In the case of the business process containing multiple messaging activities (receives/onMessage or invokes) on different branches of same flow activity, the instance will not be passivated (during IMA execution, for the absence of messaging event) if any of the messaging activity (on another branch of same flow) is under active execution. Only when the messaging activity completes will the instance be allowed to passivate. Refer the following for the receive (Figure 2) and invoke (Figure 3) cases in flow.

b) The Figure 4 pertains to special Business Process Instance Thread created and put on ready to run queue for instance passivation when the instance cannot be passivated because of another flow branch in active execution of messaging activity. The IMA unit under execution of the flow branch is put on waiting state. This BPIT an 2000 d is is periodically checked by the engine execution threads and when the active IMA count is zero, once more a check is done to see if the message for the event(s) for this flow branch arrived on the engine. If message exists, the IMA unit (Receive or Pick) is played again. If not, then the instance is passivated and cleared from memory.

Figure 2 : Flow Diagram for Execution of ReceiveUnit and PickUnit inside a Flow Activity in Cluster (case 5)

Figure 3 : Flow Diagram for Execution of InvokeUnit inside a Flow Activity in Cluster (case 5)

Figure 4 : Special Business Process Instance Thread for Instance Passivation in Cluster (case 5)

6. Support for two way invoke to Sub Business Process containing Correlated Messaging Activities (Receive/On-Message)

In case the sub business process instance is executed in another engine (as a result of passivation and activation mechanism) when the reply is encountered, using the CRMP mechanism the response object is persisted in the database. The invoking business process (for clustered cases only) would create a special Business Process Instance Thread and would periodically (currently tied with the heartbeat updated thread) check for the response object for the two way invoke. When the response is available in the database, the response would be constructed directly from the database and further execution of the parent process will continue. The Sub BP can continue execution after persisting the response object.


A. Alternate Design Choices
Intelligent router design


Best design for clustering. The way it should be. Obviously overcomes all technical idiosyncrasies of any alternative solution.

Router sits outside of the BPEL SE and it is a dependency on some other teams and projects.
Load Balancer is specific to protocols, Messaging protocol, http, file protocol and so on. This could be an issue during deployments.
Deployment complexity involved. Can't foresee how much, but compared to the current design this will definitely be more.
Primary disadvantage is that we rely on other teams for this to work successfully which is the main deterrent here. If we can overcome this challenge and get all the teams to work towards this that is ideal (again).
Message routing

We have talked about this a few times but never ventured deep enough to see its feasibility (or its ugliness). We always averted this solution because of the lack of simple transactional support.

This design makes use of the existing DB as the message pass through. Jerry suggested similar design using a queue. Both of them are very similar but with different message pass through mechanisms.


In terms of execution of an instance it is similar to the intelligent router design, which is the way it should be in cluster.
Agnostic of an external router, from a project feasibility perspective this is good.

Transactional support for correlating messages is no longer simple if the correlating message needs to be routed. Also remember that we only support 1 ways and not ways in clustering. If we want to support 2ways we go for CRMP and in that case the transactional issue we talk about is no longer an issue.
Timeouts of transactions could be an issue here, depending on the implementation choices we make to enable the engine communication through DB.

Assumes another table say CLUSTER_MESG_INFO (clob, source, destination, status) Status can have values

- send = message sent from one engine to another but not yet consumed by the destination engine
- done = message consumed by destination and is successful in consumption
- error = message consumed by destination and is resulted in a failure
- closedSuccess = source engine closed the entire transaction as part of its transaction
- closedFailure = transaction commit failed and rolledback

Flow Chart

Different (Engine) Failure Scenarios

Sunny day case:
- A sunny scenario is possible to do without any issues. As part of E1's receive activity's (consuming M2) persistence we mark the completion of the communication by logical deletion of the row in CLUSTER_MESG_INFO.
Case 1: If Engine 2(E2) fails (system crash) during the cycle of M2 consumption

If E2 fails before the status in CLUSTER_MESG_INFO is "send", then there is no issue
If E2 fails when the status is "send", then the failover logic for engine will set the status to "closedFailure" after E1 updates the status as "done" or "error".
If E2 fails when the status is "error", then the failover logic kicks in to take over the instances run in E2. As part of the fail over logic, which ever engine takes over, it should update the status to "closedFailure".
If E2 fails when the status is "done", need to consider 3 scenarios based on when E2 crashed
If E2 didn't read the "done" or E2 did read but didn't update the record with "closedSuccess"
If E2 did read, updated the record with "closedSuccess". E2 crashed before TM called prepare on the transaction.
If E2 did read, updated the record with "closedSuccess". E2 crashed after TM prepared the transaction but before commit.
One way to address this would be implementing time out. - on time out the fail-over logic will update the status to "closedFailure".
If E2 fails after the status is updated to "closedSuccess"/"closedFailure", then there is no issue.
Case 2: If Engine 1 (E1) fails during the cycle of M2 consumption

If E1 fails after E2 decides to route it to E1 and before it can insert a row in CLUSTER_MESG_INFO with status "send"???
Maybe have one more column in CLUSTER_MESG_INFO which points to the instance ID. Whichever engine owns this instance at after whatever time, will then look at this column to know that this message is to be consumed by the new engine for that instance
If E1 fails after E2 inserts a row in CLUSTER_MESG_INFO with the status as "send". Then the failover logic kicks in and the new engine that owns the instances of E1 will also reflect that the message in the inserted row is to be routed to the new Engine.
If E1 crashes after the status is "done". Failover kicks in and replays M2 consumption (but doesn't update CLUSTER_MESG_INFO with the status "done") and waits for the status to be changed to "closedSuccess"/"closedFailure".
If E1 crashes after the status is "error", it is not an issue.
If E1 crashes after the status is "closedSuccess"/"closedFailure" then there is no issue. In the case of "closedSuccess" Failover will consume the message M2 and proceed further in processing the instance.
Case 3: If E1 and E2 crash

If E1 & E2 fail before status is "send", there is no issue
If E1 & E2 fail when the status is "send" or "error", then it falls into Case1's similar situation where E2 crashes in similar state.
If E1 & E2 crash after the status is "done". We should in this case use the trick like "rollback" or "rollforward". Most likely "roll forward" taking an optimistic approach.
If E1 & E2 crash after the status is "closedSuccess"/"closedFailure", then there is no issue. This falls into Case2's similar situation when E1 crashes in similar state.

BPEL Service Engine (Open-ESB)

Post oracle acquisition of Sun Microsystems, open source project Open-ESB has been end-of-lifed. There is no active development happening on Open-ESB core or components (the service engines or binding components). The project pages are also being moved around. To facilitate continued access to these pages, in next series of posts I will reproduce some pages or some of the features that I actively worked on, including but not limited to Clustering and Failover, Scalability and new monitoring console for BPEL Service Engine.

Wednesday, July 18, 2012

Two-Phase Memory Management/Scalability Solution for Stateful Business Processes

Reproducing article I first published on dzone on Nov 11, 2010


The popularity of web services and large scale adoption of SOA by industry has led to increasing use of business processes for integration/orchestration. Such processes in the context of Work-flows, B2B, B2C or EAI involve complex, stateful interactions that can span organizational and functional boundaries. A slow or non responsive interaction can cause the instances to pile up and such runaway process can potentially max out the available heap space. This can have adverse impact on performance and can have very debilitating effect on system and may lead to out of memory and system crashes.

Scalability Challenges

Is your business process running out of memory/not scaling well? Memory requirements depend upon concurrent sessions, process variables, complexity of flow, timers, and so on. This coupled with fact that for loosely coupled stateful long running interaction with business partners, processes cannot make any assumptions regarding responses time from interacting partners. So, how do you determine memory resources for your production loads in advance? How do you scale when you are limited by your memory resources? How do you design a scalability solution for processes that does not require many configurations and is self-managing? If you are confronted with these questions, you are not alone. While working on BPEL Service engine part of open source project Open-ESB (, managed by Sun Microsystems (now Oracle), we set out to provide built-in support as best-effort approach to help our customers to overcome some of these challenges.

Scaling Business Processes

Various strategies can be employed for scaling business processes. Traditionally, these are broadly categorized into horizontal and vertical scalability. Horizontal scalability is achieved by adding more hardware nodes to the system, whereas vertical scalability essentially means adding resources, such as memory or processing power to a single system. Each of these solutions poses specific overhead and implementation complexities.

The design options for horizontal scaling also referred to as clustering or high availability can range from traditional approach of using of intelligent message router, to that of instance routing between nodes to handle stateful interactions spanning multiple message exchanges (no routing of messages).

While vertical scalability is often employed for scaling synchronous processes; for large systems using asynchronous communications, this solution is far from optimal.

This article will present the self managing optimal memory management solution used by BPEL Service Engine. It is designed with specific domain knowledge by employing multi-phase passivation techniques that does not require any reconfigurations from user. It incorporates our experiences while working on solution for improving scalability of BPEL Engine (based on JBI Specification) for project Open-ESB. Finally, I will also present results of our test simulations for realized benefits of above solutions. The article will also touch upon Horizontal Scalability solution used by BPEL Service engine.


Terms performance and scalability is often used interchangeably but is distinct. I am quoting the definition of performance, scalability and capacity from book: Pro Java EE 5: Performance Management and Optimization book by Steven Haines.

Performance: speed at which a single request can be executed. Scalability: ability of the request to maintain performance under increasing loads the performance of the system under increasing load.

The book gives some example of service engine to explain the above. If the performance of the search engine is to produce valid response for the request in say 1 sec, the scalability would be defined as the requests ability to maintain that 1 sec response as the load increases. The performance (TPS and response times) are defined in SLAs (service level agreements). The capacity assessment of the system is performed at the production staging environment once the application can sustain expected usage though a performance load test and establishes the following among other things

the response time at expected usage
the usage when the first request exceeds its SLA
usage when the application reaches its saturation point
the degradation model from the first missed SLA to the applications saturation point
While the above definitions can be better understood in terms of concurrent users accessing a web application, in BPEL Processing Engine terms, what we are trying to achieve is improvement to all of the above (response time/TPS/capacity) for one of the very common use case - long running business processes. For such cases the instances takes long time to complete (either because it is waiting for correlated receive or response from the partner services etc) the available memory resources can be effectively increased if these in-memory object related to the long running instances can be persisted and released. These objects can then be recreated whenever the process instance is ready to execute further. Thus for such business processes we should be able to derive following benefits:

Increased performance - At full capacity usage; less memory would mean less full garbage collections and better GC throughput
Increased Capacity - Ability handle more instances before reaching the saturation limits
Better response time/latency

JBI/Open ESB Quick Overview

JBI (JSR 208) is an attempt to solve the integration challenges using declarative model whereby the systems are exposed as web-services (defined by wsdls), communicating with each other using standard message format. This prevents vendor lock in and promotes loose-coupling, flexibility and reuse. JBI define standards for plugging in service engines and binding components.

Service engine typically is a place where business logic is implemented and it could be BPEL or Workflow Engine, or Transformation such as XSLT Engine. Binding components deal with the protocol specific communications and act as entry and exit points. Components use standardized message format (called Normalized Message) for inter component communication over service bus. Each component listens to a dedicated queue (called Delivery Channel) and it is responsibility of the service bus to route the message to correct destination queue. JBI Specification does not mandate any threading Model. Each component is free to implement its threading model. All messages are exchanged asynchronously. Asynchronous communication model even for synchronous requests allows for loose-coupling and building robust systems with simpler design also allowing for building higher order features such as persistence and recovery.

BPEL Service Engine Horizontal Scalability Architecture
The clustering implementation in Open ESB BPEL Service engine does not use intelligent router (also called Sticky router) to keep track of correlated messages. The message router routes the messages randomly to any engine in the cluster and if the message end up on wrong engine, the instance (upon reaching the execution point where it need the correlated message) is then removed from memory and recreated on other engine. In essence, the effect is routing the instance across the engines. This implementation of de-hydrating and recreating the instance leverages the persistence/recovery framework, hence does add minimal complexity.

Passivation strategy for Scalability & Availability

Open ESB BPEL Engine’s scalability implementation also leverages persistence/recovery framework. As performance optimization, BPEL Engine persists only certain points during execution (after execution of non-replayable activities, such as partner invoke), and the instance state including variables are persisted in the database. For performance reasons, persistence can be turned off also.

With persistence turned on, engine checks the available heap during request processing. Should the available memory fall below certain levels (configurable) due to inactive instances (either due to response from a partner or correlating/callback message or due to some timer activity), the running instances are scanned and instances waiting for longer (configurable) than certain period are picked and the in-memory variables swapped out (de-referenced for garbage collection). Note that since variable might be dirty (not persisted yet), additional steps need to be performed to save the state of variable in database. Such out-of-band persisted variables are specially marked so that in the event of crash, the recovery starts from last good state (including state variables). If the first pass does not increase the available heap beyond the defined level, the swap-out time interval is reduced by half and process repeated again recursively.

For cases where the process is dealing with smaller size messages there might not be appreciable gain in memory due to release of memory held by the variables. In nutshell, The solution kicks in when available heap falls to certain levels. It works in phases based on severity of memory levels.

Phase 1 – Variable Passivation (Figure 2)
Phase 2 – Instance Passivation (Figure 3)
Uses database for persistence of in-memory objects
Works for Asynchronous Business Processes only

Tests Performed

Following are some of results of actual tests performed on some representative process for 1 Mb and 10 Mb load size. The JConsole snapshots (below) capture the memory growth and benefit of the two-phase scalability solution. Also note Variable passivation benefits most when the message size is large, where as Instance passivation helps when message is small but numbers of instances are large.
Test 1: Large Messages (1 Mb)
Without Two-Phase Scalability Solution

  • Heap Max at 60 Msgs
  • Total Heap: 512 MB
  • Message Size: 1 MB

  • Figure 4 Results with Variable Passivation 
    Total Message Processed: 890+

    Figure 5 
    Test 2: Small Message (10 Kb) 
    Without Two-Phase Scalability Solution 

  • Heap Max-out : 3000 messages
  • Total Heap : 512 MB

  • Figure 6 
    Results with Variable Passivation Only (Phase 1)

  • Total Message Processed: 7600
  • TPS Dropped/Processing Stopped

  • Properly Sized application

    This means providing enough hardware for the expected concurrent execution of instances for the representative message size for all business processes. This calculation should also include the number of concurrent variables in the business process definition(s). For example, for single business process case if the expected concurrent execution is 1000 instances and each instance carry a message of 100 KB size* and the process defines 10 distinct variables, than simple calculation would result in 1 GB of heap space. Also, to account for short peaks some buffer need to be provided, as appropriate. * Note that we need multiply this with a factor for in-memory DOM size for xml message.


    For stateful services/components such as BPEL Engine incorporating two phase passivation strategy can help scale the system.
    It is worth reiterating that the solution outlined above is qualified solution for processes involved in asynchronous long running interactions and would not have any benefit for short running synchronous processes. For short running processes you would achieve the scalability by either hardware addition (vertical) or by clustering (horizontal).
    In addition designing scalable business process consideration need to be given to process definition, message size and expected peaks for proper pre-production hardware sizing