VDD: A Visual Drift Detection System for Process Mining

Anton Yeshchenko, Jan Mendling
Vienna University of Economics and Business, Vienna, Austria
firstname.lastname@wu.ac.at

Claudio Di Ciccio
Sapienza University of Rome, Rome, Italy
claudio.diciccio@uniroma1.it

Artem Polyvyanyy
The University of Melbourne, Melbourne, Australia
artem.polyvyanyy@unimelb.edu.au

Abstract: Research on concept drift detection has inspired recent advancements in process mining and expanded the growing arsenal of process analysis tools. What has so far been missing in this new research stream are techniques that support comprehensive process drift analysis in terms of localizing, drilling down, quantifying, and visualizing process drifts. In our research, we build on ideas from concept drift, process mining, and visualization research and present a novel web-based software tool to analyze process drifts, called Visual Drift Detection (VDD). Addressing the comprehensive analysis requirements, our tool is of benefit to researchers and practitioners in the business intelligence and process analytics area. It constitutes a valuable aid to those who are involved in business process redesign projects.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

I. INTRODUCTION

Process mining is a research field that is concerned with leveraging real-world event data for providing transparency of how business processes operate. Process discovery is a branch of process mining that takes as input event logs, i.e., collections of event sequences (traces) wherein every event corresponds to an activity execution, and returns the model that best describes the process generating the event log. However, process discovery analyzes event logs without distinguishing recent executions from those that lie far in the past. Therefore, it does not explicitly show the behavioral changes that occur in the time lapse during which those data are gathered.

These behavioral changes are commonplace in real-world scenarios and introduce additional challenges for existing process mining techniques, which usually assume stable patterns of behavior. If a drift is present in the data, it affects all stages of process mining, namely discovery, conformance and enhancement [1]. As a consequence, the discovered models are much more complex, since they integrate behavior that is present at different points in time. Using data affected by behavioral changes for process conformance also distorts the results, since behavior that was the norm during a particular time span may be flagged as non-compliant in the aggregated data. Process enhancement using an event log that contains changes would produce process models annotated with information that is not significant at all time stages. All these issues with process mining techniques could be alleviated, or even turned into strengths, by first analyzing behavior changes during process mining projects [2]. This is to the benefit of the process analyst, who might otherwise quickly suffer from models being too complex and inducing too high a cognitive load to be comprehended accurately [3].

Research on data mining has discussed changes over time and distinguishes different types of so-called drift. Drift analysis has been considered in prior research on process mining in the following way. Maaradji et al. [4] use statistical tests to find sudden and gradual drifts. Zheng et al. [5] transform the event logs into relationship matrices and find sudden drifts with change point detection algorithms. Ostovar et al. [6] describe a sudden drift detection algorithm that relies on discovering a number of process trees from the event log and calculating the number of change operations needed to transform one tree into another. These papers focus on the identification of specific drift types, namely sudden and gradual drifts. They also do not provide an interpretable solution for visualizing the content of the drifts.

In this paper, we present a technique for process drift detection, called Visual Drift Detection (VDD). VDD extends existing techniques with the following features. First, our technique not only finds sudden drifts but also helps the user to interpret the four different types of drifts shown in Fig. 1. Second, it facilitates assessment of drifts through visual interpretation [8] with the help of an interactive visualization system. The VDD system is built to explain input data on different levels of granularity and supports brushing and linking of the visualization views. Its back-end builds on the formal rigor of the temporal logic of DECLARE constraints [9], [10] and on time series analysis [11]. Key strengths of VDD are the clustering of declarative behavioral constraints that exhibit similar trends of change over time, the automatic detection of drift points, and the automated characterization of the drift types. We leverage this information about the trends in the data and represent the changes in the process behavior entailed by the drifts by means of enhanced Directly-Follows graphs [12], to provide further analysis features. These features allow us to detect and explain drifts that would otherwise go undetected by other techniques. We illustrate the usage of the VDD system on a real-world data set publicly available on the 4TU Data Centre (https://doi.org/10.4121/uuid:915d2bfb-7e84-49ad-a286-dc35f063a460). The event log contains events from sepsis patients' pathways in the hospital [13]. We will henceforth refer to that data set as the Sepsis log.

Figure 1: Drift types, cf. [7, Fig. 2]

This is a tool demonstration paper illustrating the new software implementation of the VDD system. The theoretical design and evaluations of the presented system have been partially described in [14], [15]. We remark that our earlier work did not include the advanced features we present here for drift type characterization and for the visualization of the entailed change in the process behavior.

II. THE VDD APPROACH

Our technique takes an event log (henceforth, log for short) as an input and conducts a step-by-step visual analysis of process drifts. It consists of five steps, which we explain through the application of our tool to the case study of the Sepsis log. Figure 2 depicts the visualization system with connected views, showing the results of these steps.

Figure 2: The user interface of the VDD system, running on the Sepsis event log [13]. (a) Drift Map. (b) Drift Chart. (c) Autocorrelation plot. (d) Erratic measure. (e) Spread of constraints view. (f) Incremental drifts test. (g) Extended Directly-Follows Graph. (i) Behavior cluster selection menu.

1) Input and setting of parameters

In the first step, the user provides an XES [16] event log and sets the parameters of the technique, which influence what can be observed. In particular, the Win size parameter determines the granularity of the drift analysis, and more specifically the number of traces that will be included in each time window. Slide size describes the number of traces that should be skipped to calculate the next window. The system offers hover-on explanations about each parameter. The in-depth analysis of the parameters is described in [14]. After that, the technique calculates the event log statistics and automatically proposes default parameters as shown in Fig. 2(h). The Sepsis log has 1050 cases and 15 214 events with 16 distinct activities. We chose a Win size of 50, a Slide size of 25, and a Cut threshold of 420 for our analysis.

2) Window-based constraints mining and time series clustering

This is a preprocessing step for the visual analysis. We split the log into sub-logs (windows). For each window, we measure the degree to which a set of behavioral relations in the form of declarative process constraints holds true. In particular, we resort to the well-known declarative language DECLARE, whose full repertoire of constraints is described in [17]. DECLARE constraints represent the behavior of a process by binding the occurrence of activities to the verification of certain conditions over other events in the trace. For example, Precedence(Release C, IV Liquid) states that IV Liquid can occur in the trace only if Release C occurred earlier. ChainPrecedence(Release C, Leucocytes) is a chaining constraint, which imposes that Leucocytes can occur only if Release C is the activity that occurs immediately before it (i.e., no other activities can occur in between). NotSuccession(ER Registration, IV Liquid) is a negative constraint, as it imposes that ER Registration cannot be followed by IV Liquid. For all constraints, we measure their support, confidence and interest factor. Based on established metrics of association rule mining [18], they indicate the extent to which the constraints are satisfied in the log traces. The detailed explanation of how those measures are computed is out of scope for this paper. For further information on that matter, the interested reader can refer to [10].

Specifically, the VDD system runs a background process to calculate the measures of DECLARE constraints and group the resulting time series into behavior clusters. First, traces in the log are sorted by the timestamp of their respective first events. Thereupon, we extract a sub-log of the given Win size from the first traces. We let the window slide over the log at the given Slide size. From each sub-log we mine the set of DECLARE constraints and compute their measures. In our case study, with the window size set to 50 and the sliding step set to 25, we mine DECLARE constraints out of 41 sub-logs. For each sub-log, we compute the confidence of 3424 constraints. This step proceeds with the extraction of multi-variate time series that represent the trends of the constraints' confidence.

As a result of this step, we obtain numerous time series (one per constraint and measure). We resort to hierarchical clustering [19] to group them into sets of constraints that exhibit similar confidence trends. Henceforth, we will refer to those groups as behavior clusters.

Figure 2(a) shows the values of the time series (i.e., the confidence measures) through the color-blind-friendly plasma color map [8], ranging from blue (low values) to yellow (high values). The y-axis lists the constraints; the starting timestamps of the sub-logs lie on the x-axis. Constraints are sorted vertically by the similarity of their measures' trends. White dotted horizontal lines visually separate the behavior clusters. On the Sepsis data set, the Drift Map shows 18 behavior clusters.

3) Visualization of drifts

In this step, we detect change points in the set of time series, both for the whole log and for each cluster separately. Those change points are what we identify as drift points. In the following, we will call them change points or drift points interchangeably, depending on the context. We plot drift points in Drift Maps (Fig. 2(a)) and Drift Charts (Fig. 2(b)) to effectively communicate the drifts to the user.

The Drift Map shown in Fig. 2(a) illustrates the detected drift points over time in the event log, which we collectively call the drift situation. We add vertical lines to mark such drift points. Drift Charts (e.g., those in Fig. 2(b)) have time on the x-axis and the average confidence of the constraints in a behavior cluster on the y-axis. We add vertical lines to denote drift points as in Drift Maps. In Fig. 2(b) we focus on behavior cluster 18 of the Sepsis log. We can observe two drift points.

We also compute the values of two measures, called spread of constraints and erratic measure, to quantify the extent of the drifting behavior [14]. The spread of constraints (shown in Fig. 2(e)) intuitively indicates how variable and subject to change the event log is. The measure ranges from 0 to 1: the more the behavior changes over time, the higher the value gets. In the Sepsis log, the measured spread of constraints is 0.247, which indicates a relatively small rate of change in the behavior. The erratic measure (shown in Fig. 2(d)) shows how a chosen cluster (Fig. 2(i)) compares to the cluster with the maximum degree of change in the same log.

4) Drift type detection

In this step, we use a range of methods to analyze drift types (such as those shown in Fig. 1) and visualize them in the connected views. We use multi-variate time series change point detection algorithms to detect sudden drifts. In particular, we resort to the Pruned Exact Linear Time (PELT) algorithm [20] to detect change points in the whole multi-variate time series as well as within the behavior clusters. Thereupon, we make use of stationarity analysis together with the visual inspection of Drift Charts to highlight gradual and incremental drifts. With the aid of autocorrelation plots, we look for behavior clusters that expose reoccurring drifts.

To show the results of this step, we resort to a mix of graphical and numerical representations: the aforementioned Drift Map together with Drift Charts, autocorrelation plots, and stationarity tests. In the chosen cluster 18, the system automatically identifies two sudden drifts, as shown in the Drift Chart (Fig. 2(b)). To check for incremental drifts, we inspect the results of the stationarity test (shown in Fig. 2(f)). For the chosen behavior cluster, the VDD system reports no incremental drift. Figure 2(c) depicts an autocorrelation plot that shows how the time series correlates with itself with a step defined in the y-axis. The blue area on this plot shows the significant region of the analysis. Cluster 18 reveals an autocorrelation on step 2, meaning that the drift shows signs of seasonality and can thus be classified as a reoccurring drift.

5) Understanding the drift behavior

To get an understanding of the effect of drifts on the process behavior, we visually represent the general behavior found in the log, extended with the specific behavior shown in a chosen behavior cluster. In particular, we use the gathered information on the measured DECLARE constraints in a behavior cluster and draw it on top of Directly-Follows graphs [12] such as the one in Fig. 2(g). A Directly-Follows graph connects via arcs the activities (nodes) with those other activities that directly followed them at least once in a trace. Arcs are weighted by the number of such sequences. Nodes are weighted by the frequency with which the related activities occur in the log. The Directly-Follows graph depicts the behavior that is common to the entire event log. We add arcs highlighted with different colors that represent additional, cluster-specific DECLARE constraints. Negative DECLARE constraints are colored in red. Chaining constraints are in green. All other relationships are in blue. For cluster 18 we see from Fig. 2(g) that activities Release C and Leucocytes occur in sequence, bound by the ChainPrecedence(Release C, Leucocytes) constraint. Furthermore, Precedence(Release C, IV Liquid) and Precedence(Release C, IV Antibiotics) suggest that IV Liquid and IV Antibiotics require Release C to occur before, unlike in the general behavior.

III. MATURITY, DOCUMENTATION AND SCREENCAST

We implemented the VDD system as a Python-based stand-alone program for command line execution, and as a web application with back-end and front-end parts. The algorithms are implemented in Python 3, resorting to the scipy library for time-series clustering and to the ruptures library for change point identification. We use PM4Py (http://pm4py.org, https://github.com/pm4py) [21] for the Directly-Follows Graph visualization. We use the MINERful (https://github.com/cdc08x/MINERful) Java package for the discovery and measuring of DECLARE constraints [10]. The front-end of the tool is implemented with the React JavaScript library. The back-end is implemented with the Flask Python library. We ran our experiments on a laptop equipped with an Intel Core i5 at 2.40 GHz × 2 and 8 GB of RAM. With this modest hardware, the tool was able to process data and produce the analysis outcome in about 17 seconds using a real-size event log with 15 214 events from 16 activities over 1050 traces. This indicates that the VDD system has reached a fair degree of maturity and handles a log of this size efficiently.

We have created a project website for the VDD system, from which it can be downloaded together with its sources, at https://github.com/yesanton/Process-Drift-Visualization-With-Declare. It is free for academic and non-commercial use under the MIT license. On the project website, we provide documentation on its installation and first run. The web tool with a graphical interface is also available at https://yesanton.github.io/driftvis, to be used for testing without the need to install the software on a local machine. A screencast documenting its usage is available at https://youtu.be/mHOgVBZ4Imc. The GitHub project page contains a step-by-step tutorial on how to use the web-based tool. It is available at https://github.com/yesanton/Process-Drift-Visualization-With-Declare/blob/master/publications/icpm-2020-demo-tutorial.pdf

In future work, we will focus on the prediction of drifts in running processes and on improving the interactivity of the visualization system. Furthermore, we will conduct user studies to assess the perceived quality of the tool.

Acknowledgements. This work is partially funded by the EU H2020 program under MSCA-RISE agreement 645751 (RISE BPM). Artem Polyvyanyy is partly supported by the Australian Research Council Discovery Project DP180102839. Claudio Di Ciccio is partly supported by the MIUR under grant "Dipartimenti di eccellenza 2018-2022" of the Department of Computer Science of Sapienza University of Rome. Anton Yeshchenko thanks Maryna Zadoianchuk and Oleksii Tkachenko for their assistance during the development of the web application.

REFERENCES

[1] W. M. P. van der Aalst, Process Mining - Data Science in Action. Springer, 2016.
[2] M. L. van Eck, X. Lu, S. J. J. Leemans, and W. M. P. van der Aalst, "PM²: A process mining project methodology," in CAiSE. Springer, 2015, pp. 297–313.
[3] R. Moreno and R. E. Mayer, "Visual presentations in multimedia learning: Conditions that overload visual working memory," in VISUAL, D. P. Huijsmans and A. W. M. Smeulders, Eds. Springer, 1999, pp. 793–800.
[4] A. Maaradji, M. Dumas, M. La Rosa, and A. Ostovar, "Detecting sudden and gradual drifts in business processes from execution traces," IEEE TKDE, vol. 29, no. 10, pp. 2140–2154, 2017.
[5] C. Zheng, L. Wen, and J. Wang, "Detecting process concept drifts from event logs," in OTM. Springer, 2017, pp. 524–542.
[6] A. Ostovar, S. J. J. Leemans, and M. La Rosa, "Robust drift characterization from event streams of business processes," ACM Trans. Knowl. Discov. Data, vol. 14, no. 3, pp. 30:1–30:57, 2020. [Online]. Available: https://doi.org/10.1145/3375398
[7] J. Gama, I. Zliobaite, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Comput. Surv., vol. 46, no. 4, pp. 44:1–44:37, 2014.
[8] C. Ware, Information visualization: perception for design. Elsevier, 2012.
[9] W. M. P. van der Aalst and M. Pesic, "DecSerFlow: Towards a truly declarative service flow language," in WS-FM, ser. Lecture Notes in Computer Science, vol. 4184. Springer, 2006, pp. 1–23.
[10] C. Di Ciccio and M. Mecella, "On the discovery of declarative control flows for artful processes," ACM TMIS, vol. 5, no. 4, pp. 24:1–24:37, 2015.
[11] G. C. Reinsel, Elements of multivariate time series analysis. Springer, 1993.
[12] S. J. Leemans, D. Fahland, and W. M. van der Aalst, "Discovering block-structured process models from event logs - A constructive approach," in PETRI NETS. Springer, 2013, pp. 311–329.
[13] F. Mannhardt and D. Blinde, "Analyzing the trajectories of patients with sepsis using process mining," in BPMDS/EMMSAD. CEUR-WS.org, 2017, pp. 72–80.
[14] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, "Comprehensive process drift detection with visual analytics," in ER. Springer, 2019, in print.
[15] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, "Comprehensive process drift analysis with the visual drift detection tool," in ER Demos. CEUR-WS.org, 2019, pp. 108–112.
[16] "IEEE standard for extensible event stream (XES) for achieving interoperability in event logs and event streams," pp. 1–50, Nov 2016. [Online]. Available: http://dx.doi.org/10.1109/IEEESTD.2016.7740858
[17] W. M. P. van der Aalst and M. Pesic, "DecSerFlow: Towards a truly declarative service flow language," in WS-FM. Springer, 2006, pp. 1–23.
[18] J. Adamo, Data mining for association rules and sequential patterns - sequential and parallel algorithms, J. Adamo, Ed. Springer New York, 2001.
[19] S. Aghabozorgi, A. Seyed Shirkhorshidi, and T. Ying Wah, "Time-series clustering - a decade review," IS, vol. 53, no. C, pp. 16–38, Oct. 2015.
[20] R. Killick, P. Fearnhead, and I. A. Eckley, "Optimal detection of changepoints with a linear computational cost," Journal of the American Statistical Association, vol. 107, no. 500, pp. 1590–1598, 2012.
[21] A. Berti, S. J. van Zelst, and W. M. P. van der Aalst, "Process mining for python (pm4py): Bridging the gap between process- and data science," CoRR, vol. abs/1905.06169, 2019.
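
The window-based splitting described in step 2 of Section II (sort traces by the timestamp of their first event, then slide a window of Win size traces forward by Slide size traces) can be sketched as follows. This is only an illustration, not the VDD implementation: the in-memory trace representation and the function names `sort_by_first_event` and `sliding_sublogs` are assumptions made for this sketch, as is the choice to keep only full windows. That choice is at least consistent with the 41 sub-logs reported for 1050 traces with a window of 50 and a slide of 25.

```python
from typing import List, Sequence, Tuple

# Hypothetical in-memory representation (not the XES format used by the tool):
# an event is (timestamp, activity); a trace is a time-ordered list of events.
Event = Tuple[float, str]
Trace = List[Event]


def sort_by_first_event(log: Sequence[Trace]) -> List[Trace]:
    """Order traces by the timestamp of their first event; empty traces are dropped."""
    return sorted((t for t in log if t), key=lambda t: t[0][0])


def sliding_sublogs(log: Sequence[Trace], win_size: int, slide_size: int) -> List[List[Trace]]:
    """Return all full windows of win_size traces, advancing by slide_size traces."""
    if win_size <= 0 or slide_size <= 0:
        raise ValueError("win_size and slide_size must be positive")
    ordered = sort_by_first_event(log)
    last_start = len(ordered) - win_size  # assumption: partial trailing windows are skipped
    return [ordered[s:s + win_size] for s in range(0, last_start + 1, slide_size)]


# Synthetic log: 1050 single-event traces, deliberately given in reverse time order.
toy_log = [[(float(1050 - i), "ER Registration")] for i in range(1050)]
windows = sliding_sublogs(toy_log, win_size=50, slide_size=25)
print(len(windows))  # (1050 - 50) / 25 + 1 = 41 windows
```

Each returned sub-log would then be handed to the constraint miner, producing one confidence value per constraint and window.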
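
The DECLARE examples given in step 2 of Section II can be made concrete with a few trace-level checks. The sketch below follows only the informal readings stated in the text for Precedence, ChainPrecedence and NotSuccession. The function names are hypothetical, the traces are made up, and `satisfied_fraction` is a deliberately simplified stand-in: it is not the support, confidence or interest factor of [10], whose definitions are out of scope here.

```python
from typing import Callable, Sequence

Trace = Sequence[str]  # a trace reduced to its sequence of activity names


def precedence_holds(trace: Trace, a: str, b: str) -> bool:
    """Precedence(a, b): b may occur only if a occurred earlier in the trace."""
    seen_a = False
    for activity in trace:
        if activity == a:
            seen_a = True
        elif activity == b and not seen_a:
            return False
    return True


def chain_precedence_holds(trace: Trace, a: str, b: str) -> bool:
    """ChainPrecedence(a, b): every b is immediately preceded by a."""
    return all(i > 0 and trace[i - 1] == a
               for i, activity in enumerate(trace) if activity == b)


def not_succession_holds(trace: Trace, a: str, b: str) -> bool:
    """NotSuccession(a, b): once a has occurred, b never occurs afterwards."""
    seen_a = False
    for activity in trace:
        if activity == b and seen_a:
            return False
        if activity == a:
            seen_a = True
    return True


def satisfied_fraction(traces: Sequence[Trace],
                       check: Callable[[Trace, str, str], bool],
                       a: str, b: str) -> float:
    """Share of traces in which the constraint holds (simplified, illustrative measure)."""
    return sum(check(t, a, b) for t in traces) / len(traces)


# Made-up traces using activity names from the Sepsis log.
t1 = ["ER Registration", "Release C", "Leucocytes"]
t2 = ["ER Registration", "IV Liquid", "Release C"]
t3 = ["Leucocytes", "Release C"]
share = satisfied_fraction([t1, t2, t3], precedence_holds, "Release C", "IV Liquid")
print(round(share, 3))
```

Evaluating such a measure once per window yields the per-constraint time series that are later clustered and scanned for change points.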
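
The Directly-Follows graph described in step 5 of Section II (arcs weighted by how often one activity directly follows another, nodes weighted by activity frequency) can be computed with plain counters. This is a minimal sketch under stated assumptions: the tool itself uses PM4Py for this view, whereas the function `directly_follows` and the toy traces below are hypothetical and use only the Python standard library.

```python
from collections import Counter
from typing import Iterable, Sequence, Tuple


def directly_follows(traces: Iterable[Sequence[str]]) -> Tuple[Counter, Counter]:
    """Return (node_freq, arc_freq) for a collection of activity sequences.

    node_freq[x]     = number of occurrences of activity x in the log
    arc_freq[(x, y)] = number of times y occurs immediately after x in a trace
    """
    node_freq: Counter = Counter()
    arc_freq: Counter = Counter()
    for trace in traces:
        node_freq.update(trace)
        # Pair every activity with its immediate successor.
        arc_freq.update(zip(trace, trace[1:]))
    return node_freq, arc_freq


# Made-up traces using activity names from the Sepsis log.
toy_traces = [
    ["ER Registration", "Release C", "Leucocytes"],
    ["ER Registration", "Release C", "Leucocytes", "IV Liquid"],
    ["ER Registration", "IV Liquid"],
]
nodes, arcs = directly_follows(toy_traces)
print(arcs[("Release C", "Leucocytes")], nodes["ER Registration"])
```

Cluster-specific DECLARE constraints could then be overlaid as additional colored arcs on top of this base graph, as the text describes.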