VDD: A Visual Drift Detection System for Process Mining

Anton Yeshchenko, Jan Mendling
Vienna University of Economics and Business, Vienna, Austria
firstname.lastname@wu.ac.at

Claudio Di Ciccio
Sapienza University of Rome, Rome, Italy
claudio.diciccio@uniroma1.it

Artem Polyvyanyy
The University of Melbourne, Melbourne, Australia
artem.polyvyanyy@unimelb.edu.au


Abstract: Research on concept drift detection has inspired recent advancements in process mining and expanded the growing arsenal of process analysis tools. What has so far been missing in this new research stream are techniques that support comprehensive process drift analysis in terms of localizing, drilling down, quantifying, and visualizing process drifts. In our research, we build on ideas from concept drift, process mining, and visualization research and present a novel web-based software tool to analyze process drifts, called Visual Drift Detection (VDD). Addressing the comprehensive analysis requirements, our tool is of benefit to researchers and practitioners in the business intelligence and process analytics area. It constitutes a valuable aid to those who are involved in business process redesign projects.

I. INTRODUCTION

Process mining is a research field that is concerned with leveraging real-world event data for providing transparency of how business processes operate. Process discovery is a branch of process mining that takes as input event logs, i.e., collections of event sequences (traces) wherein every event corresponds to an activity execution, and returns the model that best describes the process generating the event log. However, process discovery analyzes event logs without distinguishing executions that are recent from those that are far in the past. Therefore, it does not explicitly show the behavioral changes that occur in the time lapse during which those data are gathered.

These behavioral changes are commonplace in real-world scenarios and introduce additional challenges for the existing process mining techniques, which usually assume stable patterns of behavior. If a drift is present in the data, it affects all stages of process mining, namely discovery, conformance and enhancement [1]. As a consequence, the discovered models are much more complex since they integrate behavior that is present at different points in time. Using data affected by behavioral changes for process conformance also hinders the results, because behavior that might have been the norm for a particular time span is flagged as non-compliant when checked against the aggregated data. Process enhancement using an event log containing changes would produce process models annotated with information that is not significant at all time stages. All these issues with process mining techniques could be alleviated or turned into strengths by first analyzing behavior changes during process mining projects [2]. This is to the benefit of the process analyst who might quickly suffer from models being too complex, inducing too high a cognitive load to be comprehended in an accurate way [3].

Figure 1: Drift types, cf. [7, Fig. 2]

Research on data mining has discussed changes over time and distinguishes different types of so-called drift. Drift analysis has been considered in prior research on process mining in the following way. Recent works include Maaradji et al. [4], who use statistical tests to find sudden and gradual drifts; Zheng et al. [5], who transform the event logs into relationship matrices and find sudden drifts with change point detection algorithms; and Ostovar et al. [6], who describe a sudden drift detection algorithm that relies on discovering a number of process trees from the event log and calculating the number of change operations needed to transform one tree into another. These papers focus on the identification of some specific drift types, limited to sudden drifts and gradual drifts. These papers also do not provide an interpretable solution for visualizing the content of the drifts.

In this paper, we present a technique for process drift detection, called Visual Drift Detection (VDD). VDD extends existing techniques with the following features. First, our technique not only finds sudden drifts but also helps the user to interpret the four different types of drifts shown in Fig. 1. Second, it facilitates assessment of drifts through visual interpretation [8] with the help of an interactive visualization system. The Visual Drift Detection (VDD) system is built to explain input data on different levels of granularity and supports brushing and linking of the visualization views. Its back-end builds on the formal rigor of temporal logic of DECLARE constraints [9], [10] and time series analysis [11]. Key strengths of VDD are the clustering of declarative behavioral constraints that exhibit similar trends of change over time, the automatic detection of drift points, and the automated characterization of


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
the drift types. We leverage this information about the trends in the data and represent the changes on the process behavior entailed by the drifts by means of enhanced Directly-Follows graphs [12], to provide further analysis features. These features allow us to detect and explain drifts that would otherwise go undetected by other techniques. We illustrate the usage of the VDD system on a real-world data set publicly available on the 4TU Data Centre (https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460). The event log contains events from sepsis patients’ pathways in the hospital [13]. We will henceforth refer to that data set as the Sepsis log.

This is a tool demonstration paper illustrating the new software implementation of the VDD system. The theoretical design and evaluations of the presented system have been partially described in [14], [15]. We remark that our earlier work did not include the advanced features we present here for drift type characterization and for the visualization of the entailed change on the process behavior.

II. THE VDD APPROACH

Our technique takes an event log (henceforth, log for short) as an input and conducts a step-by-step visual analysis of process drifts. It consists of five steps, which we shall explain through the application of our tool on the case study of the Sepsis log. Figure 2 depicts the visualization system with connected views, showing the results of these steps.

1) Input and setting of parameters

In the first step, the user provides an event log in the XES format [16] and sets the parameters of the technique that will influence what can be observed. In particular, the Win size parameter determines the granularity of the drift analysis, and more specifically the number of traces that will be included in each time window. Slide size describes the number of traces that should be skipped to calculate the next window. The system offers hover-on explanations about each parameter. The in-depth analysis of the parameters is described in [14]. After that, the technique calculates the event log statistics and automatically proposes default parameters as shown in Fig. 2(h). The Sepsis log has 1050 cases and 15 214 events with 16 event variants. We chose a Win size of 50, a Slide size of 25, and a Cut threshold of 420 for our analysis.

2) Window-based constraints mining and time series clustering

This is a preprocessing step for the visual analysis. We split the log into sub-logs (windows). For each window, we measure the degree to which a set of behavioral relations in the form of declarative process constraints hold true. In particular, we resort to the well-known declarative language DECLARE, whose full repertoire of constraints is described in [17]. The DECLARE constraints represent the behavior of a process by binding the occurrence of activities to the verification of certain conditions over other events in the trace. For example, PRECEDENCE(Release C, IV Liquid) states that IV Liquid can occur in the trace only if Release C occurred earlier. CHAINPRECEDENCE(Release C, Leucocytes) is a chaining constraint, which imposes that Leucocytes can occur only if Release C is the activity that occurs immediately before it (i.e., no other activities can occur in between). NOTSUCCESSION(ER Registration, IV Liquid) is a negative constraint as it imposes that ER Registration cannot be followed by IV Liquid. For all constraints, we measure their support, confidence and interest factor. Based on established metrics of association rule mining [18], they indicate the extent to which the constraints are satisfied in the log traces. The detailed explanation of how those measures are computed is out of scope for this paper. For further information on that matter, the interested reader can refer to [10].

Specifically, the VDD system runs a background process to calculate the measures of DECLARE constraints and group the resulting time series into behavior clusters. First, traces in the log are sorted by the timestamp of their respective first events. Thereupon, we extract a sub-log of the given Win size from the first traces. We let the window slide over the log at the given Slide size. From each sub-log we mine the set of DECLARE constraints and compute their measures. In our case study, with the window size set to 50 and the sliding step to 25, we mine DECLARE constraints out of 41 sub-logs. For each sub-log, we compute the confidence of 3424 constraints. This step proceeds with the extraction of multi-variate time series that represent the trends of the constraints’ confidence.

As a result of this step, we obtain numerous time series (one per constraint and measure), which we cluster into groups that exhibit similar confidence trends. Henceforth, we will refer to those groups as behavior clusters. In particular, we resort to hierarchical clustering [19] to find these groups. Figure 2(a) shows the values of the time series (i.e., the confidence measures) through the plasma color-blind friendly color map [8], from blue (low peak) to yellow (high peak). The y-axis lists the constraints, and the starting timestamps of the sub-logs lie on the x-axis. Constraints are sorted vertically by the similarity of their measures’ trends. White dotted horizontal lines visually separate the behavior clusters. On the Sepsis data set, the Drift Map shows 18 behavior clusters.

3) Visualization of drifts

In this step, we detect change points in the set of time series, both for the whole log and for each cluster separately. Those change points are what we identify as drift points. In the following, we will interchangeably name them change points or drift points depending on the context. We plot drift points in Drift Maps (Figure 2(a)) and Drift Charts (Figure 2(b)) to effectively communicate the drifts to the user.

The Drift Map shown in Fig. 2(a) illustrates the detected drift points over time in the event log, which we shall collectively name the drift situation. We add vertical lines to mark such drift points. Drift Charts (e.g., those in Fig. 2(b)) have time on the x-axis and the average confidence of the constraints in a behavior cluster on the y-axis. We add vertical lines to denote drift points as in Drift Maps. In Fig. 2(b) we focus on behavior cluster 18 of the Sepsis log. We can observe
Figure 2: The user interface of the VDD system, running on the Sepsis event log [13]. (a) Drift Map. (b) Drift Chart. (c)
Autocorrelation plot. (d) Erratic measure. (e) Spread of constraints view. (f) Incremental drifts test. (g) Extended Directly-Follows
Graph. (i) Behavior cluster selection menu.
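The window-based mining of Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the VDD or MINERful implementation: the function names (`sliding_windows`, `precedence_holds`, `precedence_confidence`) and the toy log are hypothetical, and the confidence computed here is a simplified trace-level stand-in (the share of traces containing the target activity that satisfy the constraint), whereas the actual measures are defined in [10].

```python
from typing import List, Optional, Sequence

Trace = Sequence[str]  # a trace is an ordered sequence of activity names


def sliding_windows(traces: List[Trace], win_size: int, slide_size: int) -> List[List[Trace]]:
    """Split a chronologically sorted list of traces into overlapping sub-logs."""
    last_start = len(traces) - win_size
    return [traces[s:s + win_size] for s in range(0, last_start + 1, slide_size)]


def precedence_holds(trace: Trace, a: str, b: str) -> bool:
    """True if every occurrence of b has an earlier occurrence of a in the trace."""
    seen_a = False
    for activity in trace:
        if activity == a:
            seen_a = True
        elif activity == b and not seen_a:
            return False
    return True


def precedence_confidence(window: List[Trace], a: str, b: str) -> Optional[float]:
    """Assumed simplification: share of traces containing b that satisfy Precedence(a, b)."""
    activated = [t for t in window if b in t]
    if not activated:
        return None  # the constraint is never triggered in this window
    return sum(precedence_holds(t, a, b) for t in activated) / len(activated)


# Hypothetical toy log: the first half satisfies the constraint, the second half violates it.
toy_log = [("Release C", "IV Liquid")] * 525 + [("IV Liquid", "Release C")] * 525
windows = sliding_windows(toy_log, win_size=50, slide_size=25)
series = [precedence_confidence(w, "Release C", "IV Liquid") for w in windows]

print(len(windows))                      # 41
print(series[19], series[20], series[21])  # 1.0 0.5 0.0
```

With 1050 traces, a Win size of 50 and a Slide size of 25, the sketch yields 41 windows, which matches the number of sub-logs reported for the Sepsis log. The resulting list is one confidence time series for one constraint; in the toy data it drops from 1.0 to 0.0 around the middle of the log, which is the kind of sudden change that a change point detection step would flag.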


two drift points.

We also compute the values of two measures, called spread of constraints and erratic measure, to quantify the extent of the drifting behavior [14]. The spread of constraints (shown in Fig. 2(e)) intuitively indicates how variable and subject to change the event log is. The measure ranges from 0 to 1: the more the behavior changes over time, the higher the value gets. In the Sepsis log, the measured spread of constraints is 0.247, which indicates a relatively small rate of change in the behavior. The erratic measure (shown in Fig. 2(d)) shows how a chosen cluster (Fig. 2(i)) compares to the cluster with the maximum degree of change in the same log.

4) Drift type detection

In this step, we use a range of methods to analyze drift types (such as those shown in Fig. 1) and visualize them in the connected views. We use multi-variate time series change point detection algorithms to detect sudden drifts. In particular, we resort to the Pruned Exact Linear Time (PELT) algorithm [20] to detect change points in the whole multi-variate time series as well as within the behavior clusters. Thereupon, we make use of stationarity analysis together with the visual inspection of Drift Charts to highlight gradual and incremental drifts. With the aid of autocorrelation plots, we look for the behavior clusters exposing reoccurring drifts.

To show the results of this step, we resort to a mix of graphical and numerical representations: the aforementioned Drift Map together with Drift Charts, autocorrelation plots, and stationarity tests. In the chosen cluster 18, the system automatically identifies two sudden drifts as shown in the Drift Chart (Fig. 2(b)). To check for incremental drifts, we inspect the results of the stationarity test (shown in Fig. 2(f)). For the chosen behavior cluster, the VDD system reports no incremental drift. Figure 2(c) depicts an autocorrelation plot that shows how the time series correlates with itself with a step defined in the y-axis. The blue area on this plot shows the significant region of the analysis. Cluster 18 reveals an autocorrelation on step 2, meaning that the drift shows signs of seasonality and can thus be classified as a reoccurring drift.

5) Understanding the drift behavior

To get an understanding of the effect of drifts on the process behavior, we visually represent the general behavior found in the log extended with the specific behavior shown in a chosen behavior cluster. In particular, we use the gathered information on the measured DECLARE constraints in a behavior cluster and draw it on top of Directly-Follows graphs [12] such as the one in Fig. 2(g). A Directly-Follows graph connects via arcs the activities (nodes) with those other activities that directly followed them at least once in a trace. Arcs are weighted by the number of such sequences. Nodes are weighted by the frequency with which the related activities occur in the log. The Directly-Follows graph depicts the behavior that is common to the entire event log. We add arcs highlighted
with different colors that represent additional, cluster-specific DECLARE constraints. Negative DECLARE constraints are colored in red. Chaining constraints are in green. All other relationships are in blue. For cluster 18 we see from Fig. 2(g) that activities Release C and Leucocytes occur in sequence, bound by the CHAINPRECEDENCE(Release C, Leucocytes) constraint. Furthermore, PRECEDENCE(Release C, IV Liquid) and PRECEDENCE(Release C, IV Antibiotics) suggest that IV Liquid and IV Antibiotics require Release C to occur before, unlike in the general behavior.

III. MATURITY, DOCUMENTATION AND SCREENCAST

We implemented the VDD system as a Python-based stand-alone program for command line execution, and as a web application with back-end and front-end parts. The algorithms are implemented using Python 3, resorting to the scipy library for time-series clustering and to the ruptures library for change point identification. We use PM4Py (http://pm4py.org, https://github.com/pm4py) [21] for the Directly-Follows Graph visualization. We use the MINERful (https://github.com/cdc08x/MINERful) Java package for the discovery and measuring of DECLARE constraints [10]. The front-end of the tool is implemented with the React JavaScript library. The back-end is implemented with the Flask Python library. We ran our experiments using a laptop equipped with an Intel Core i5 at 2.40 GHz × 2 with 8 GB of RAM. With this modest hardware, the tool was able to process data and produce the analysis outcome in about 17 seconds using a real-size event log with 15 214 events from 16 activities over 1050 traces. This indicates that the VDD system has reached a fairly large degree of maturity as it performs well in terms of scalability.

We have created a project website for the VDD system, from which it can be downloaded together with its sources at https://github.com/yesanton/Process-Drift-Visualization-With-Declare. It is free for academic and non-commercial use under the MIT license. On the project website, we provide documentation on its installation and first run. The web tool with a graphical interface is also available at https://yesanton.github.io/driftvis, to be used for testing without the need to install the software on a local machine. A screencast documenting its usage is available at https://youtu.be/mHOgVBZ4Imc. The GitHub project page contains a step-by-step tutorial on how to use the web-based tool. It is available at https://github.com/yesanton/Process-Drift-Visualization-With-Declare/blob/master/publications/icpm-2020-demo-tutorial.pdf

In future work, we will focus on the prediction of drifts in running processes and on improving the interactivity of the visualization system. Furthermore, we will conduct user studies to assess the perceived quality of the tool.

Acknowledgements.

This work is partially funded by the EU H2020 program under MSCA-RISE agreement 645751 (RISE BPM). Artem Polyvyanyy is partly supported by the Australian Research Council Discovery Project DP180102839. Claudio Di Ciccio is partly supported by the MIUR under grant “Dipartimenti di eccellenza 2018-2022” of the Department of Computer Science of Sapienza University of Rome. Anton Yeshchenko thanks Maryna Zadoianchuk and Oleksii Tkachenko for their assistance during the development of the web application.

REFERENCES

[1] W. M. P. van der Aalst, Process Mining - Data Science in Action. Springer, 2016.
[2] M. L. van Eck, X. Lu, S. J. J. Leemans, and W. M. P. van der Aalst, “PM²: A process mining project methodology,” in CAiSE. Springer, 2015, pp. 297–313.
[3] R. Moreno and R. E. Mayer, “Visual presentations in multimedia learning: Conditions that overload visual working memory,” in VISUAL, D. P. Huijsmans and A. W. M. Smeulders, Eds. Springer, 1999, pp. 793–800.
[4] A. Maaradji, M. Dumas, M. La Rosa, and A. Ostovar, “Detecting sudden and gradual drifts in business processes from execution traces,” IEEE TKDE, vol. 29, no. 10, pp. 2140–2154, 2017.
[5] C. Zheng, L. Wen, and J. Wang, “Detecting process concept drifts from event logs,” in OTM. Springer, 2017, pp. 524–542.
[6] A. Ostovar, S. J. J. Leemans, and M. La Rosa, “Robust drift characterization from event streams of business processes,” ACM Trans. Knowl. Discov. Data, vol. 14, no. 3, pp. 30:1–30:57, 2020. [Online]. Available: https://doi.org/10.1145/3375398
[7] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, “A survey on concept drift adaptation,” ACM Comput. Surv., vol. 46, no. 4, pp. 44:1–44:37, 2014.
[8] C. Ware, Information visualization: perception for design. Elsevier, 2012.
[9] W. M. P. van der Aalst and M. Pesic, “DecSerFlow: Towards a truly declarative service flow language,” in WS-FM, ser. Lecture Notes in Computer Science, vol. 4184. Springer, 2006, pp. 1–23.
[10] C. Di Ciccio and M. Mecella, “On the discovery of declarative control flows for artful processes,” ACM TMIS, vol. 5, no. 4, pp. 24:1–24:37, 2015.
[11] G. C. Reinsel, Elements of multivariate time series analysis. Springer, 1993.
[12] S. J. Leemans, D. Fahland, and W. M. van der Aalst, “Discovering block-structured process models from event logs - A constructive approach,” in PETRI NETS. Springer, 2013, pp. 311–329.
[13] F. Mannhardt and D. Blinde, “Analyzing the trajectories of patients with sepsis using process mining,” in BPMDS/EMMSAD. CEUR-WS.org, 2017, pp. 72–80.
[14] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, “Comprehensive process drift detection with visual analytics,” in ER. Springer, 2019, in print.
[15] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, “Comprehensive process drift analysis with the visual drift detection tool,” in ER Demos. CEUR-WS.org, 2019, pp. 108–112.
[16] “IEEE standard for extensible event stream (XES) for achieving interoperability in event logs and event streams,” pp. 1–50, Nov 2016. [Online]. Available: http://dx.doi.org/10.1109/IEEESTD.2016.7740858
[17] W. M. P. van der Aalst and M. Pesic, “DecSerFlow: Towards a truly declarative service flow language,” in WS-FM. Springer, 2006, pp. 1–23.
[18] J. Adamo, Data mining for association rules and sequential patterns - sequential and parallel algorithms, J. Adamo, Ed. Springer New York, 2001.
[19] S. Aghabozorgi, A. Seyed Shirkhorshidi, and T. Ying Wah, “Time-series clustering - a decade review,” IS, vol. 53, no. C, pp. 16–38, Oct. 2015.
[20] R. Killick, P. Fearnhead, and I. A. Eckley, “Optimal detection of changepoints with a linear computational cost,” Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012.
[21] A. Berti, S. J. van Zelst, and W. M. P. van der Aalst, “Process mining for python (pm4py): Bridging the gap between process- and data science,” CoRR, vol. abs/1905.06169, 2019.
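The Directly-Follows graph described in Step 5 of Section II (arcs weighted by the number of direct successions, nodes weighted by activity frequency) can be sketched in a few lines. This is a minimal illustration and not the PM4Py implementation used by the tool; the function name `directly_follows_graph` and the toy traces are hypothetical.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def directly_follows_graph(traces: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Return (node_weights, arc_weights) of a Directly-Follows graph."""
    nodes: Counter = Counter()  # activity -> number of occurrences in the log
    arcs: Counter = Counter()   # (a, b) -> number of times b directly follows a
    for trace in traces:
        nodes.update(trace)
        arcs.update(zip(trace, trace[1:]))  # consecutive activity pairs
    return nodes, arcs


# Hypothetical toy traces using activity names from the Sepsis log.
toy_traces = [
    ("ER Registration", "Release C", "Leucocytes"),
    ("ER Registration", "Release C", "Leucocytes", "IV Liquid"),
    ("ER Registration", "IV Liquid"),
]
nodes, arcs = directly_follows_graph(toy_traces)

print(nodes["ER Registration"])           # 3
print(arcs[("Release C", "Leucocytes")])  # 2
print(arcs[("IV Liquid", "Release C")])   # 0
```

Cluster-specific DECLARE constraints could then be overlaid on this base graph as additional colored arcs, for instance by keeping a separate mapping from activity pairs to the constraint that binds them; that overlay is left out of the sketch.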