HITL-AB-BPM: Business Process Improvement with AB Testing and Human-in-the-Loop

Aaron F. Kurz1,3, Bennet Santelmann1, Timo Großmann1, Timotheus Kampik3, Luise Pufahl2,∗ and Ingo Weber2

1 Technische Universitaet Berlin, Berlin, Germany
2 Software & Business Engineering, Technische Universitaet Berlin, Berlin, Germany
3 SAP Signavio, Berlin, Germany

Abstract
Organizations improve their business processes regularly to react to clients' requests, changing business environments, or regulations. However, applying a changed process design directly in an organization is risky: frequently, process changes do not result in actual improvements. In this demo, we present a tool for AB testing of process versions with reinforcement learning techniques and a new approach to human control of the procedure. The goal of the tool is a fair and fast comparison in the production environment. As the system might not have all information required to make informed decisions, we propose how human experts can be put in control.

Keywords
Process redesign, AB testing, Multi-armed bandit, Human-in-the-loop

Proceedings of the Demonstration & Resources Track, Best BPM Dissertation Award, and Doctoral Consortium at BPM 2022, co-located with the 20th International Conference on Business Process Management (BPM 2022), Münster, Germany, September 11-16, 2022.
∗ Corresponding author.
aaron.f.kurz@campus.tu-berlin.de (A. F. Kurz); bennet.santelmann@campus.tu-berlin.de (B. Santelmann); timo.grossmann@campus.tu-berlin.de (T. Großmann); timotheus.kampik@sap.com (T. Kampik); luise.pufahl@tu-berlin.de (L. Pufahl); ingo.weber@tu-berlin.de (I. Weber)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org.

1. Introduction

A key challenge of Business Process Management (BPM) is the implementation of continuous improvement practices that ensure business processes do not stagnate. Two core issues when introducing a new version of a process concern risk and fairness. As discussed in earlier research [1, 2], new versions of a process often do not amount to actual improvements, entailing the risk of switching to an inferior version. Fairness of comparison between two process versions is an issue in the traditional BPM lifecycle [3], where the old version is replaced with a new version. Any comparison in such a setting suffers from confounding factors and uncontrolled variables.

To counter these issues, earlier work [1] proposed the AB-BPM methodology, in which two versions of a business process (A and B) are run in production side by side, allowing a fair comparison. AB testing is a standard method from DevOps and is nowadays widely used in Business-to-Consumer (B2C) applications. During AB tests, two versions of a product (e.g., a website) are deployed in parallel, and the decision on which version is preferential depends on the data obtained during these experiments. The application of AB testing to business processes brought about new challenges, which have been addressed in the earlier work. Specifically, in AB-BPM, process instantiation requests are routed dynamically to either one of the versions, using approaches from Reinforcement Learning (RL) and observations of the performance of the instances.
This dynamic routing is done to minimize the risk of exposing customers to sub-optimal process versions for longer than necessary. The AB-BPM methodology is geared towards iterative process changes, but the authors note that it may also be used for more radical improvements. However, so far, AB-BPM tools have not been released, and the earlier work did not study how to involve humans in the AB testing procedure. This demo of our Human-In-The-Loop AB-BPM (HITL-AB-BPM) tool closes these two gaps: we present an open-source tool (https://github.com/aaronkurz/hitl-ab-bpm, accessed 2022-06-22; this paper refers to release v0.1.0) that implements the AB-BPM approach and amends it with human control. In the remainder, we present the architecture and the functionality of the HITL-AB-BPM tool in Sect. 2 and demonstrate it in Sect. 3, along with a discussion of the tool's maturity.

2. Tool Description

Architecture. To realize the HITL-AB-BPM tool, we have developed an instance router that decides, using a Reinforcement Learning (RL) algorithm and under consideration of feedback from a human expert, which case is executed through which process version. It interacts with the Camunda execution engine (https://camunda.com/platform-7/workflow-engine/) as shown in Fig. 1. The HITL-AB-BPM tool consists of three main components: (1) the front end for the interaction with the process expert/analyst, (2) the back end, and (3) the database (PostgreSQL, https://postgresql.org, accessed 2022-03-19). The front end is the single point of interaction with the user (the human in the loop) and is built using Streamlit (https://streamlit.io/, accessed 2022-03-19). It communicates with the back end via RESTful API endpoints. The back end is built using the micro-framework Flask (https://flask.palletsprojects.com/en/2.0.x/, accessed 2022-03-19). It handles the logic of i) communicating with the process engine through Camunda's API to deploy process models, start instances, and retrieve relevant execution information; ii) storing meta-information about the processes and their performance in the database; iii) training the RL agent and extracting relevant information from it; iv) hosting the routing algorithm used to map incoming process instantiation requests to process versions; and v) handling the requests from the front end. We use an RL-based recommender system algorithm (more specifically, a so-called contextual multi-armed bandit algorithm) that is part of Vowpal Wabbit, an open-source online-learning machine learning library with Python bindings (https://vowpalwabbit.org/, accessed 2022-03-19). The Camunda engine (with a simulator plugin for demonstration purposes) executes process instances and provides the data necessary to train the RL agent.

Figure 1: Architecture and used technologies.

Figure 2: Operational flow of the application. User tasks denote activities that have to be executed by the process expert, whereas service tasks are executed by the application itself.

Functionalities. The HITL-AB-BPM tool incorporates a set of functionalities that aim to strike a balance between exploration and exploitation when evaluating process versions while supporting human supervision and intervention. The operational flow of the application can be seen in a process-oriented view in Figure 2. The user of the tool is a process expert; therefore, the terms user, process expert, and human expert are used interchangeably.
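To make the interplay between the RL agent and the routing more concrete, the sketch below illustrates how a contextual multi-armed bandit from Vowpal Wabbit's Python bindings could pick a process version for a given customer-category context and learn from the observed cost (e.g., a normalized duration) of finished instances. This is a minimal illustration under assumed settings (two arms, epsilon-greedy exploration, a single categorical feature, VW 9.x API); the function names and parameters are ours and do not mirror the tool's internal code.

```python
# Minimal contextual-bandit sketch (illustrative, not the tool's actual code).
import random
import vowpalwabbit  # VW 9.x Python bindings: pip install vowpalwabbit

# Two arms: action 1 = process version A, action 2 = process version B.
vw = vowpalwabbit.Workspace("--cb_explore 2 --epsilon 0.2", quiet=True)

def choose_version(customer_category: str) -> tuple[int, float]:
    """Sample a version from the bandit's probability distribution for the
    given context; also return the sampling probability, needed for learning."""
    pmf = vw.predict(f"| category={customer_category}")
    action = random.choices([1, 2], weights=pmf, k=1)[0]
    return action, pmf[action - 1]

def learn_from_instance(action: int, cost: float, prob: float,
                        customer_category: str) -> None:
    """Feed back the observed cost of a finished instance
    (lower cost = better, e.g., duration relative to a baseline)."""
    vw.learn(f"{action}:{cost}:{prob} | category={customer_category}")
```

In the tool itself, such learning only takes place after a batch has finished, using the duration information polled from the process engine, as described below.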
First, the process expert uploads BPMN files for each process version and historical process execution data about the default ("old") version. The execution data is used to evaluate new instances that are part of the experiment, i.e., the reward function judges the duration of experimental instances in relation to the production duration data from the old version. This reward strategy is in line with prior work [1]. Note that although the framework supports different performance indicators in theory, the current version of the tool uses only duration for the evaluation; duration is relevant in most scenarios and available for any process. The process versions are then deployed to and executed in Camunda.

The routing of incoming process instantiation requests to either version is based on so-called batch policies. A batch contains a certain number of process instances that are created in response to incoming process instantiation requests and then monitored and evaluated. In a batch policy, the user can specify how many of the following instances will be part of the next batch and with which probability each version will be instantiated. To allow for more control, the user may differentiate these probabilities by contextual factors. For this prototype, the customer category is used as the relevant contextual factor; in theory, different contexts could be used here. Batch policy proposals are provided by the RL agent and suggest how instances should be routed so as to most efficiently explore and then exploit the knowledge about which process version is preferential. The first batch policy proposal is naïve, suggesting a 50/50 split for each customer category, since no experimental data has been obtained to inform a more specific decision. It is presented to the user immediately after the upload of the data. Then, the user has to submit the first batch policy by setting the batch size and adjusting the routing probabilities. The following process instances are then routed according to this policy until the batch size is reached. Instantiation requests between batch policies (when the user has to check the new batch policy proposal) or before the first batch policy are not part of the experiment; they are routed to the default version and not evaluated by the RL agent.

After all instances of one batch policy have been started, the process engine is polled for information about the already finished instances, which is then used to train the RL agent. Subsequently, the agent creates a new batch policy proposal for the user to review. Along with this proposal, the process expert can also view additional data about the past experimental instances. After the process expert has analyzed the results, they have the following choices: i) continuing with another experimental batch, i.e., either accepting or modifying the proposal and then setting the batch policy; or ii) ending the experiment and starting the cool-off phase. When they decide to continue the experiment, they can either accept the proposal presented by the RL agent or modify it before setting the new batch policy. This means that another batch of incoming process instantiation requests is routed according to this batch policy and considered by the RL agent and the analysis.
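As a complement to the textual description, the following sketch shows one possible representation of a batch policy and the routing decision it implies, including the rule that requests arriving before the first policy or between batches go to the default version. The data structure and function names are hypothetical and chosen for illustration; they are not the tool's API.

```python
# Illustrative batch policy representation (hypothetical names, Python 3.10+).
import random
from dataclasses import dataclass

@dataclass
class BatchPolicy:
    batch_size: int               # number of instances in the next batch
    prob_b: dict[str, float]      # P(route to version B) per customer category

# The naïve first proposal: a 50/50 split for each customer category.
first_proposal = BatchPolicy(batch_size=20, prob_b={"priv": 0.5, "gov": 0.5})

def route_request(policy: BatchPolicy | None, customer_category: str,
                  started_in_batch: int) -> str:
    """Route one instantiation request. Requests arriving before the first
    batch policy or between batches go to the default version A and are
    not part of the experiment."""
    if policy is None or started_in_batch >= policy.batch_size:
        return "A"
    return "B" if random.random() < policy.prob_b[customer_category] else "A"
```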
This loop of reviewing the batch policy proposal, setting the batch policy, and running and learning from the new batch can be repeated as often as the user chooses. If they decide to end the experiment, the cool-off period starts, which lasts until all experimental instances have terminated. Afterward, the user is asked to set a final policy. This decision entails setting a winning version for each customer category; subsequently, all incoming instantiation requests will be routed per this decision. The user can consult two primary resources while deciding: the data insights (an overview of the collected data) and the final proposal by the RL agent. The final proposal is the routing decision the RL agent would make if it had to route an additional instance. It is important to note that this proposal does not necessarily reflect pure exploitation of the collected knowledge; it may also be driven by the agent's need for further exploration. Additionally, we added the functionality to conclude the experiment at any time and make a manual decision, i.e., to route statically to either version of the process. The manual decision terminates the experiment. This functionality corresponds to a manual override, which might become necessary due to severe flaws discovered in one version, vocal customer complaints, or regulatory reasons. HITL-AB-BPM can be seen as an Augmented BPM approach [4], where the user provides a clear frame for the RL agent in terms of batches and manual override, thus enabling joint human/agent decision-making.

3. Demonstration and Future Work

The exemplary helicopter licensing process from Satyal et al.'s evaluation of AB-BPM [1] is used to demonstrate the prototype's functionality. For simulation purposes, two process versions are created based on data from the relevant business domain. The initial version, Version A, works through the required steps (tests and exams) of attaining a license sequentially; if one of the steps fails, the process ends in rejection. These steps are parallelized in the supposedly improved business process version (Version B). A person attempting to obtain a helicopter license can, for example, already take the medical exam while waiting for the eligibility test results. This parallelization should, in theory, reduce the execution time. To provide historical data of the default version, 100 instances of Version A are simulated beforehand, and the results are uploaded to the HITL-AB-BPM tool at the beginning of the experiment.

For demonstration purposes, process execution is simulated in the process engine; one day is scaled down to one second in the simulation. Customer categories are introduced as contextual attributes that may affect process performance: a process instantiation request can come either from a private sector customer (priv) or from a government entity (gov). Since our demonstration does not serve real client requests, the front end offers access to a client simulator (development mode). For the evaluation, the client simulator sends bundles of requests at regular intervals (a minimal sketch of such a simulator is shown below).

To demonstrate HITL-AB-BPM, we ran the following concrete experiment. Initially, the routing to the new version (Version B) was set lower for government customers to reduce the potential risk of losing these important customers: 80% of priv and 20% of gov instantiation requests were routed to Version B.
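Before turning to the outcome of this experiment, the sketch below approximates such a development-mode client simulator: it periodically sends bundles of instantiation requests, each tagged with a customer category. The endpoint URL, payload fields, and timing are hypothetical placeholders chosen for illustration, not the tool's documented REST API.

```python
# Hedged sketch of a client simulator (hypothetical endpoint and payload).
import random
import time

import requests

BACKEND_URL = "http://localhost:5000/instantiate"  # placeholder, not the tool's real route

def send_bundle(bundle_size: int = 10) -> None:
    """Send one bundle of instantiation requests with random customer categories."""
    for _ in range(bundle_size):
        category = random.choice(["priv", "gov"])
        requests.post(BACKEND_URL, json={"customer_category": category}, timeout=5)

if __name__ == "__main__":
    # One simulated day corresponds to one second, so send one bundle per second.
    for _ in range(60):
        send_bundle()
        time.sleep(1)
```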
After the first batch finished, the RL agent presented a new batch policy proposal. The proposal indicated that Version B was faster and preferable to Version A: for priv, the RL agent proposed to route 91% of instances to Version B, and 79% for gov. The performance data supported the proposal; the average duration of Version A was 72 seconds, while the average duration of Version B was only 19 seconds. Due to the small batch size, the RL agent might still require more exploration before fully committing to B. For the second batch, the human expert accepted the proposed batch policy. The final proposal was in line with expectations: the RL agent's probability of choosing B was 96% for the priv category and 95% for the gov category, and the performance data matched the final proposal.

The presented software system is a proof-of-concept (research) prototype; automated tests have been developed and run in a continuous integration pipeline, including process-level tests with the above helicopter licensing process models. The prototype has generic capabilities, i.e., it integrates with a BPMN 2.x standard-compliant process execution engine, and it can be used to optimize any process that can run on that engine. Still, the following conceptual and technical challenges need to be addressed to further facilitate its adoption. i) HITL-AB-BPM would benefit from looser coupling regarding technical execution aspects, i.e., decoupling not only from the specifics of a particular engine but from BPMN-based model execution in general. In alignment with this, future work could detach the optimization core component of HITL-AB-BPM, for example by making it available as a Python library (PyPI package). ii) The current implementation focuses on AB testing: it always experiments with exactly two process versions. An extension could expand the testing capabilities to larger numbers of process versions, i.e., multivariate testing.

Acknowledgments. We thank Diana Baumann, Omar Sharif, Jiefu Zhu, and Konstantinos Poulios for their contributions to the implementation.

References

[1] S. Satyal, I. Weber, H. Paik, C. Di Ciccio, J. Mendling, Business process improvement with the AB-BPM methodology, Information Systems 84 (2019) 283–298.
[2] S. Satyal, I. Weber, H. Paik, C. Di Ciccio, J. Mendling, AB testing for process versions with contextual multi-armed bandit algorithms, in: Advanced Information Systems Engineering, 2018, pp. 19–34.
[3] M. Dumas, M. La Rosa, J. Mendling, H. A. Reijers, Fundamentals of Business Process Management, 2nd ed., Springer, 2018.
[4] M. Dumas, F. Fournier, L. Limonad, A. Marrella, M. Montali, J.-R. Rehse, R. Accorsi, D. Calvanese, G. De Giacomo, et al., Augmented business process management systems: A research manifesto, 2022. arXiv:2201.12855.