Workflow Engine
Overview
This article provides a general overview of the Workflow Engine architecture and explains the common principles of how designed Workflows are executed.
Engine
Once a Workflow is designed in the Workflow Studio, it is stored in the Workflow Repository in the Production database and is ready for execution. The Workflow Engine is a dedicated module of the Workspace Management application that manages the execution of Workflows. It handles tasks such as starting, resuming, and terminating Workflows, as well as monitoring and persisting the Workflow Instance state.
Monitoring
The Workflow Engine catches the Workflow Instance events and persists them in the Monitoring database (by default, the name of the monitoring database is "M42Monitoring").
To analyze how a Workflow Instance is being executed (or has been executed), the System provides the Visual Tracking action, available in the Workflow Studio. This module uses data from the Monitoring database and shows how each Workflow activity has been executed, which input arguments it received, and what result it produced as the values of its output arguments.
The System keeps all Workflow Instance events in the database until the workflow instance is completed, and for some time afterward.
For more details, see Configure Workflow Instance Monitoring.
Persistence
A Workflow is a long-running process that in many cases requires human interaction, so the time between the start and finish of a Workflow Instance can be years. Over such a timeframe the Application is restarted many times. For that reason, the Workflow Engine saves the Workflow Instance state to storage every time the state changes.
The System uses the Instance Store database for storing the serialized Workflow Instance (by default, the database name is "M42InstanceStore").
The instances are stored in the database only until the moment they are completed.
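The persistence behavior described above can be illustrated with a minimal sketch. It is not the product's implementation: the `InstanceStore` and `WorkflowInstance` classes, the JSON serialization, and the status names are all hypothetical and only show the idea of saving the state on every change and dropping it on completion.

```python
import json


class InstanceStore:
    """In-memory stand-in for the Instance Store database (hypothetical)."""

    def __init__(self):
        self.rows = {}  # instance_id -> serialized state

    def save(self, instance_id, state):
        self.rows[instance_id] = json.dumps(state)

    def load(self, instance_id):
        return json.loads(self.rows[instance_id])

    def delete(self, instance_id):
        self.rows.pop(instance_id, None)


class WorkflowInstance:
    """Hypothetical instance that persists itself whenever its state changes."""

    def __init__(self, instance_id, store):
        self.instance_id = instance_id
        self.store = store
        self.state = {"status": "Running", "variables": {}}
        self.store.save(instance_id, self.state)

    def set_variable(self, name, value):
        self.state["variables"][name] = value
        self.store.save(self.instance_id, self.state)  # persist on every change

    def complete(self):
        self.state["status"] = "Completed"
        self.store.delete(self.instance_id)  # completed instances are not kept


store = InstanceStore()
instance = WorkflowInstance("wf-1", store)
instance.set_variable("approved", True)
restored = store.load("wf-1")  # what a restarted application would read back
instance.complete()
```

In this sketch, `restored` holds the last saved state, and the store no longer contains the instance once `complete()` has run.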
Reliability
This chapter is relevant only for the Workflow Worker Engine.
Workflow Instances are long-running processes that can take years to finish, which also means a Workflow Instance can be started on one Product Version, resumed on a second one, and finished on a third one. To mitigate the risk of Workflow Instance failures in such a hostile environment, the Workflow Engine has a range of mechanisms that guarantee correct handling of the Workflow commands (start, resume, terminate, etc.) or, in case of failure, provide an approach for recovery.
Queuing
Any request to the Workflow Engine is placed in the Queue (table [dbo].[QueueTask]) and stays there until the first available Matrix42 Worker polls and executes it. This approach guarantees that each request will, sooner or later, be processed.
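As an illustration of the queuing principle only, the following sketch uses an in-memory queue instead of the real queue table. The `TaskQueue` and `Worker` classes, the task fields, and the worker names are hypothetical and do not reflect the actual table structure or the Worker API.

```python
from collections import deque


class TaskQueue:
    """In-memory stand-in for the queue table (hypothetical structure)."""

    def __init__(self):
        self._pending = deque()

    def enqueue(self, command, instance_id):
        self._pending.append({"command": command, "instance_id": instance_id})

    def poll(self):
        """Return the oldest pending task, or None when the queue is empty."""
        return self._pending.popleft() if self._pending else None

    def __len__(self):
        return len(self._pending)


class Worker:
    """Hypothetical worker that takes one task per poll."""

    def __init__(self, name):
        self.name = name
        self.executed = []

    def poll_once(self, queue):
        task = queue.poll()
        if task is None:
            return False
        self.executed.append((task["command"], task["instance_id"]))
        return True


queue = TaskQueue()
queue.enqueue("start", "wf-1")
queue.enqueue("resume", "wf-2")
queue.enqueue("terminate", "wf-1")

workers = [Worker("worker-a"), Worker("worker-b")]
# Requests wait in the queue until some worker picks them up.
while len(queue) > 0:
    for worker in workers:
        worker.poll_once(queue)

processed = [item for worker in workers for item in worker.executed]
```

Because a request is removed only when a worker polls it, every request is eventually processed as long as at least one worker keeps polling.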
Reanimate
There are many reasons a Workflow Instance can enter the Failed state. In some cases, when the Workflow Instance is business-critical, the reason for the failure can be determined and resolved, and the Workflow Instance can then continue its execution. For more details, see Reanimate Workflow Instances.
Self-Recovery
- Stop service: when the Application Server becomes unavailable during the execution of a Workflow (e.g. due to an IIS restart or entering Maintenance Mode), or the Matrix42 Worker service has been stopped (not killed), the Workflow Engine proceeds with the "graceful shutdown" procedure. This means that all running Workflows are automatically suspended and persisted in the Instance Store, and are then automatically resumed as soon as the overall infrastructure (AppServer and Worker) is up again.
The graceful shutdown procedure runs until all Workflows are unloaded. If a Workflow is executing a long-running Activity (e.g. "Invoke Powershell" with a heavy script), stopping the Worker can take a while, because it waits until the Activity execution is finished.
- Kill process: if the Workflow Worker Windows Service is killed (not stopped), the "graceful shutdown" procedure is not executed, and all Workflow Instances that were running on the Worker go, after some delay, to the Failed state. Afterward, they can be manually reanimated from the last Persistence point.
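The two scenarios above can be summarized as state transitions. The sketch below is a simplified model under stated assumptions: the function names, the dictionary-based instance store, and the `step` field are hypothetical, and the real engine applies the Failed state only after some delay.

```python
import copy


def graceful_stop(instances, instance_store):
    """Stop service: suspend and persist every running instance."""
    for instance in instances.values():
        if instance["status"] == "Running":
            instance["status"] = "Suspended"
            instance_store[instance["id"]] = copy.deepcopy(instance)


def service_started(instances):
    """Infrastructure is up again: resume what the shutdown suspended."""
    for instance in instances.values():
        if instance["status"] == "Suspended":
            instance["status"] = "Running"


def process_killed(instances):
    """Kill process: no graceful shutdown, running instances end up Failed."""
    for instance in instances.values():
        if instance["status"] == "Running":
            instance["status"] = "Failed"


def reanimate(instance_id, instances, instance_store):
    """Manual recovery: continue from the last persisted snapshot."""
    snapshot = copy.deepcopy(instance_store[instance_id])
    snapshot["status"] = "Running"
    instances[instance_id] = snapshot
    return snapshot


instances = {"wf-1": {"id": "wf-1", "status": "Running", "step": 3}}
instance_store = {}

graceful_stop(instances, instance_store)
status_after_stop = instances["wf-1"]["status"]
service_started(instances)
status_after_restart = instances["wf-1"]["status"]

instances["wf-1"]["step"] = 4  # progress made after the last persistence point
process_killed(instances)
status_after_kill = instances["wf-1"]["status"]
reanimated = reanimate("wf-1", instances, instance_store)
```

In this model the reanimated instance continues from step 3, the last persisted point, so work done after that point is repeated.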
Implementation
The Workspace Management application supports two alternative implementations of the Workflow Engine: the basic one, based on Microsoft AppFabric, and a new one, based on Matrix42 Workers.
AppFabric
The AppFabric technology and all related components are completely discontinued and no longer present in the system since ESMP v.12.1.0. If your system still has Workflows or Workflow Instances running on AppFabric, migrate the system to the Worker as described here.
AppFabric was the first version of the Workflow Engine. It is fully based on the Microsoft AppFabric module, which out-of-the-box provides an implementation of all the basic tasks of the Workflow Engine, such as hosting Workflows, monitoring, and persistence. What is specific to the AppFabric Engine is the way prepared Workflows are deployed and hosted. Using the "Publish" or "Publish Repository" action, available either in Matrix42 Software Asset and Service Management or in the Workflow Studio, the prepared Workflows are deployed to the Application Server, to the folder "/svc/WF/", and the System dynamically creates Web Service endpoints for each deployed Workflow version. As a result, each version of a Workflow is represented by its own Workflow Service.
Matrix42 Workers
The previous Workflow Engine implementation based on AppFabric had several downsides, such as poor performance, high consumption of system resources, and problems with horizontal scaling of the Workflow execution. For this reason, a new concept of Workflow execution called Matrix42 Workers has been introduced.
The Matrix42 Workers engine is designed to fully replace the AppFabric engine. The Worker roll-out strategy included the gradual replacement of AppFabric components from version to version, with an always available option to fall back to AppFabric if something went wrong. Starting from ESMP v.12.1.0, all AppFabric-related components are discontinued.
For more information on the Matrix42 Worker architecture, worker management, the default worker, and the installation and update process, see the Matrix42 Worker Engine page.
Configuration
Workflow Engine Definition
The System settings define which Workflow Engine is used for the execution of a Workflow Instance. The settings can be found in the Administration application → Settings → edit Global System Setting dialog → Workflows:
Starting from ESMP v.12.1.0 all Workflows are processed by the Matrix42 Worker Workflow Engine.
Configure Workflow Instance Monitoring
The Workflow Engine based on Matrix42 Workers uses an individual implementation of the Workflow Instance Monitoring, which not only replicates the functionality of the earlier used AppFabric Monitoring module for the Workflow Instances running on Matrix42 Worker but also adds additional features that allow tailoring the monitoring for each specific environment.
The System provides two levels of monitoring capabilities for Workflow Instances running on Matrix42 Worker Engine. All the Workflow Instances regardless of durability can be configured to utilize Workflow event collection capabilities, allowing data at varying verbosity to be collected for monitoring and troubleshooting purposes.
The engine differentiates two different levels for collecting events (see the image above):
- Error Only
- Troubleshooting
The Workflow Engine Definition property of the Global System Settings also requires appropriate configuration of the Monitoring Level in the Workflow settings. For more details, see Manage Workflows: General Dialog Page settings.
Error Only
The Workflow Engine records a minimal set of Workflow Instance events to the Monitoring database: the start and finish of the Workflow, suspend and resume points, and, in case of an error, all events accumulated from the last resume point up to the failed Workflow activity.
Troubleshooting
The Workflow Engine records ALL the events thrown by the Instance.
An environment that runs a massive number of Workflows generates a huge number of events, and processing and recording them can cause serious performance problems and resource shortages. Usually, most of these events are recorded, automatically cleaned up a few days later, and never reviewed in the Visual Tracking. The Error Only level was introduced to optimize the System resource usage; it provides enough information to identify the issue in most cases. For special cases, when the Error Only level does not provide enough data, the Troubleshooting level can be used.
If the Error Only option is set for the overall System, the level can still be elevated for a particular Workflow. This allows troubleshooting individual Workflows without putting extra pressure on the overall System. In that case, set the Monitoring Level on the general page of the Workflow dialog (see Manage Workflows for more details).
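The difference between the two levels can be sketched as an event filter. This is an illustrative model only: the `EventRecorder` class, the event kind names, the level strings, and the `effective_level` helper are hypothetical and are not taken from the product.

```python
# Event kinds that the Error Only level always records (names are hypothetical).
ALWAYS_RECORDED = {"started", "finished", "suspended", "resumed"}


class EventRecorder:
    """Decides which instance events reach the Monitoring database."""

    def __init__(self, level):
        self.level = level   # "ErrorOnly" or "Troubleshooting"
        self.recorded = []   # stand-in for the Monitoring database
        self._buffer = []    # events accumulated since the last resume point

    def on_event(self, kind, detail=""):
        event = (kind, detail)
        if self.level == "Troubleshooting":
            self.recorded.append(event)  # record everything
            return
        if kind == "error":
            # Flush everything accumulated since the last resume point.
            self.recorded.extend(self._buffer)
            self.recorded.append(event)
            self._buffer.clear()
        elif kind in ALWAYS_RECORDED:
            self.recorded.append(event)
            if kind in ("started", "resumed"):
                self._buffer.clear()  # a new accumulation window begins
        else:
            self._buffer.append(event)


def effective_level(global_level, workflow_level=None):
    """Assumption: a level set on the Workflow overrides the global one."""
    return workflow_level or global_level


events = [
    ("started", ""), ("activity", "A"), ("suspended", ""), ("resumed", ""),
    ("activity", "B"), ("activity", "C"), ("error", "C failed"),
]
error_only = EventRecorder(effective_level("ErrorOnly"))
verbose = EventRecorder(effective_level("ErrorOnly", "Troubleshooting"))
for kind, detail in events:
    error_only.on_event(kind, detail)
    verbose.on_event(kind, detail)
```

Here the Error Only recorder drops activity A, which ran before the last resume point, while the recorder elevated to Troubleshooting keeps all seven events.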
Workflows running on the AppFabric Workflow Engine use the AppFabric Monitoring module. See "Monitoring Applications using Windows Server AppFabric" for more details.
Workflow Databases
The Workflow Worker Engine stores the persisted Workflow Instances and the Workflow Monitoring events in a database. By default, the persisted instances are stored in the Production database in the table [dbo].[WFPInstancesTable]. The monitoring events of the Workflow Worker Engine are recorded in the table [dbo].[WorkerEvents], which is by default hosted in the AppFabric Monitoring database (e.g. M42Monitoring).
In environments with a high volume of executed Workflows, the Instances and Monitoring data can become substantial and significantly increase the size of the Production database. To improve the System performance and maintainability in such cases, it is recommended to relocate the Workflow tables to a dedicated database.
The Setup API provides a PowerShell cmdlet for relocating the Workflow tables to a specified database.
New-WMWorkerDatabase -WorkerDBName "NewDBName"
WorkerDBName - the name of the database, on the same SQL Server, to which the Workflow data is moved. The cmdlet creates the database if it does not exist.
To prevent data loss when relocating the Workflow data to a new database, put the System into Maintenance mode first. Use the "Move Worker Workflow Data" as a template.
Workflow processing optimization
Cleaning obsolete Workflow Instances
The System automatically runs a background engine that periodically cleans up obsolete Workflow Instances from the Production database, together with all related data from the Persistence and Monitoring databases. To change how long such Workflow Instances stay in the System:
- In the Administration application, open the Engine Activations management area;
- Find and edit the Clean Up Obsolete Objects engine activation;
- Open the Active Engines dialog view and edit the related Clean Up Obsolete Objects engine;
- Set the number of days for:
  - Successfully completed workflow instances
  - Failed workflow instances
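The effect of the two retention settings can be illustrated with a short sketch. The function name, the instance fields, and the day values used here are hypothetical and chosen only for the example; the actual defaults are configured in the engine dialog.

```python
from datetime import datetime, timedelta


def obsolete_instances(instances, now, completed_days, failed_days):
    """Return ids of finished instances that exceeded their retention period."""
    keep_for = {
        "Completed": timedelta(days=completed_days),
        "Failed": timedelta(days=failed_days),
    }
    return [
        item["id"]
        for item in instances
        # Instances in other states (e.g. still running) are never selected.
        if item["status"] in keep_for
        and now - item["finished_at"] > keep_for[item["status"]]
    ]


now = datetime(2024, 1, 31)
instances = [
    {"id": "wf-1", "status": "Completed", "finished_at": datetime(2024, 1, 1)},
    {"id": "wf-2", "status": "Completed", "finished_at": datetime(2024, 1, 29)},
    {"id": "wf-3", "status": "Failed", "finished_at": datetime(2024, 1, 1)},
    {"id": "wf-4", "status": "Running", "finished_at": None},
]
to_clean = obsolete_instances(instances, now, completed_days=7, failed_days=60)
```

Keeping failed instances longer than completed ones, as in this example, leaves more time to investigate and reanimate them.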
Workflow Infinite Loops Protection
A badly designed Workflow can lead to infinite loops during Workflow Instance execution and to overall blocking of the Workflow Engine, because some instances are always running and there is no capacity left to execute new Workflow commands. To prevent this, the System uses a protection mechanism that automatically terminates a Workflow Instance when an infinite loop is detected, that is, when the number of iterations exceeds the number configured in the ActivityLoopLimit attribute of the SPSGlobalConfigurationClassWorkflowEngine table in the Production database.
By default, the System allows 10000 iterations in a Workflow Instance before the loop is classified as infinite.
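The protection principle can be sketched as an iteration counter around a loop. This is a simplified illustration, not the engine's code: the `run_loop` function and the `InfiniteLoopDetected` exception are hypothetical, and only the default limit of 10000 comes from the text above.

```python
class InfiniteLoopDetected(Exception):
    """Raised when a loop exceeds the configured iteration limit."""


def run_loop(condition, body, activity_loop_limit=10000):
    """Run body while condition() holds, terminating past the limit."""
    iterations = 0
    while condition():
        if iterations >= activity_loop_limit:
            raise InfiniteLoopDetected(
                f"loop exceeded {activity_loop_limit} iterations"
            )
        body()
        iterations += 1
    return iterations


# A well-behaved loop finishes normally.
counter = {"value": 0}
finished_after = run_loop(
    condition=lambda: counter["value"] < 5,
    body=lambda: counter.update(value=counter["value"] + 1),
)

# A loop whose condition never changes is terminated at the limit.
calls = {"count": 0}
try:
    run_loop(
        condition=lambda: True,
        body=lambda: calls.update(count=calls["count"] + 1),
        activity_loop_limit=100,
    )
    terminated = False
except InfiniteLoopDetected:
    terminated = True
```

A limit that is too low would terminate legitimate long loops, so the value should be raised only for Workflows that are known to iterate more often.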
Regulate the maximum size of the Monitoring Database
Infinite Loops & Database Purge
When the System runs monitoring in Troubleshooting mode and some Workflow Instances enter infinite loops, the Monitoring database can grow drastically and exhaust the free disk space on the Database Server. To prevent this scenario, the Workflow Engine supports an automatic Monitoring database purge mechanism, which removes the oldest Monitoring records when the database exceeds the maximum allowed size.
By default, the allowed size is 1 GB. You can change it in the MonitoringMaxTableSize attribute of the SPSGlobalConfigurationClassWorkflowEngine table in the Production database; the attribute defines the database size in megabytes.
The System uses engine activation Workflow Monitoring Autopurge to automatically start the purge function, which is configured out-of-the-box to start once a day at night.
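The purge principle, removing the oldest records until the data fits the allowed size, can be sketched as follows. The function name and the record layout are hypothetical, and the limit of 1000 megabytes in the example is illustrative rather than the exact stored default.

```python
def purge_oldest(records, max_size_mb):
    """Drop the oldest records until the total size fits the limit.

    records: list of (timestamp, size_mb) tuples (hypothetical layout).
    Returns the surviving records, oldest first.
    """
    kept = sorted(records, key=lambda record: record[0])
    total = sum(size for _, size in kept)
    while kept and total > max_size_mb:
        _, size = kept.pop(0)  # remove the oldest record
        total -= size
    return kept


records = [(3, 400), (1, 400), (2, 400)]  # 1200 MB in total
remaining = purge_oldest(records, max_size_mb=1000)
```

Only as many of the oldest records are removed as is needed to get back under the limit, so the most recent monitoring data is preserved.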
Workflow Activities & Variables Max Length
When the System runs monitoring in Troubleshooting mode and some Workflow activities output large variable values, the Monitoring database can grow drastically and exhaust the free disk space on the Database Server. To prevent this scenario, you can change the MonitoringMaxVariableSize attribute of the SPSGlobalConfigurationClassWorkflowEngine table in the Production database. By default, the maximum length of the variable data is 2000 characters.
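A length limit of this kind can be pictured as a simple cut-off applied before a value is recorded. The sketch below assumes that longer values are truncated; the text above does not confirm how the engine treats them, and the function name is hypothetical.

```python
def limit_variable(value, max_length=2000):
    """Return the text form of value, cut to at most max_length characters."""
    text = str(value)
    return text if len(text) <= max_length else text[:max_length]


short_value = limit_variable("approved")
long_value = limit_variable("x" * 5000)
```

With the default of 2000, a 5000-character value would be stored as its first 2000 characters under this assumption, while short values are unaffected.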