The CDESF Toolkit: An Introduction Davide Mora∗ , Paolo Ceravolo∗ , Ernesto Damiani† , Gabriel Marques Tavares∗ ∗ Università degli Studi di Milano (UNIMI), Milan, Italy Email: davide.mora@studenti.unimi.it, {paolo.ceravolo, gabriel.tavares}@unimi.it † Khalifa University (KUST), Abu Dhabi, UAE Email: ernesto.damiani@kustar.ac.ae Abstract—Real-time response is crucial in many business pro- its associated class over time, and also consider incomplete cess scenarios, however, few tools support the online processing traces. Lastly, an updated version of the process model is of Process Mining tasks. In this paper, we present Concept Drift needed to assess the differences between reference and real in Event Stream Framework (CDESF), a tool focused on concept drift detection that also supports several online Process Mining executions correctly. Therefore, methods should be aware of tasks. CDESF highlights the process model evolution during the the process model and detect deviations at the same time while stream processing and alerts the detection of new drifts aided by maintaining an updated model and being prepared to handle an online clustering layer. This paper presents CDESF as a tool concept drifts [2]. that is available for the community and process practitioners. Several methods for online PM have been proposed in the Index Terms—Online process mining, clustering, concept drift detection, process model graph last few years, ranging from online process discovery [3]– [5], to conformance checking [6]–[8] and concept drift de- I. I NTRODUCTION tection [9]–[11], but they usually tackle only one of the required needs in online environments. Moreover, though Process Mining (PM) aims at extracting information from many works provide theoretical and practical advancements, event logs, which are recorded by information systems during at the same time, they do not provide frameworks and tools for the execution of business processes. Further, PM provides intermediate or end-user. To overcome the reported issues, we insights from recorded business processes, such as the cre- propose the use of Concept Drift in Event Stream Framework ation of a process model, detection of anomalous executions, (CDESF), a tool for the monitoring of concept drift in online and extraction of process metrics [1]. A relevant aspect in process mining. The formal background of the tool was pro- traditional PM is the offline assessment of event logs. That posed in [12] and advanced in [13]. We developed a framework is, the analyzed event logs depict past process executions, with the main characteristics being: which have already finished. However, on many occasions, stakeholders are interested in understanding the current state • Online process discovery: a process model graph (PMG) of the process, i.e., while the process is being performed. is maintained online and updated using the most recent Many real applications cannot wait for a complete process events. The updating step keeps the model relevance cycle to take action, which leverages the necessity for online while it serves as the representation of the business processing of event logs. This way, online PM has emerged as process behavior. an area with the goal of handling event streams, that is, events • Online conformance checking: the framework models are processed as their generation occurs. both the process model and the traces as graphs. The In offline settings, PM tasks can be performed asyn- decision is based on the broad support of graph opera- chronously, i.e., one might first discover a process model, then tions. This way, metrics are extracted by comparing trace apply a conformance checking technique to detect deviations and process model. Further, the obtained metrics allows and extract metrics, and so on. However, online PM im- the detection of deviations from normal behavior. poses constraints that make traditional processing unfeasible. • Online clustering: the tool is equipped with an online Namely, like in stream mining, online scenarios must assume clustering technique. The clustering step places cases in that the flow of events is continuous, fast, and possibly infinite. the feature space, identifying regions of interest, such as This poses the first constraint, the limitation of time and common and anomalous behavior. memory. Algorithms are required to have forgetting mecha- • Concept drift detection: the clustering step is based on nisms to keep memory consumption viable (as it is impossible the concept of micro-clusters. Given the micro-clusters to store an infinite stream) and low time consumption (as behavior over time, a concept drift is detected. the processing of events should be faster than the arrival To support the listed characteristics, CDESF follows two rate of events). Moreover, PM practiced in online scenarios main guidelines: must also consider concept drifts, a phenomenon characterized • Memory consumption: a forgetting mechanism controls by the change of the relation between a feature vector and the release of old cases from memory, thus, handling This study was also partly supported by the program “Piano di sostegno constraints posed by online scenarios. alla ricerca 2019” funded by Università degli Studi di Milano. • Support to event stream processing: the approach handles Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). event streams (not to be mistaken with trace streams) Event Cases: ingesting one event at a time unit. ... CDESF tackles several problems faced in online PM and Event Stream provides in-depth insights into the business process for stake- holders. This way, moving forward the current state of tools 1 4 available for users and PM enthusiasts. Recently, CDESF has B been released as a Python library for the community1 . A Transformation A E Check Point video demonstrating the framework’s use and its capabilities is Trace Graph available2 . Moreover, a tutorial showing detailed information on how to use the framework, along with a deeper discussion of its visualizations is also available3 . We also note that CDESF has been used in a pipeline with PM4Py [14] in 2 B D previous works, an integration that provides further support Distance A E F Cases Computation for online PM tasks. C The remaining of this paper is organized as follows: Sec- Memory tion II presents CDESF architecture and its main features. Process Graph Then, Section III shows the tool maturity, including several supported analysis by giving some examples. Finally, Sec- GDtime GDtrace tion IV leaves the concluding remarks. Distances II. CDESF A RCHITECTURE Figure 1 exposes CDESF architecture and how each step 3 is performed. Notice that the event stream is possibly infinite c-micro-cluster Stream and that events from different cases are interspersed. This way, Processing p-micro-cluster o-micro-cluster at the arrival of a new event, its case must be retrieved and complemented with the event. For each new event, CDESF DenStream Clustering steps 1 to 3 are processed, i.e., the framework does not consume data in chunks, such as window-based techniques. When a new event arrives, the Transformation step adds the Online Online Online event into its corresponding case and models the case into a Process Discovery Conformance Checking Process Enhancement graph-based representation. If the new event belongs to a new Anomaly Detection Drift Detection case (i.e. a case that has never been seen before), the case is initialized with the event. Then, the graph representation will Fig. 1: CDESF architecture is divided into four main steps. contain a single event. Instead, if the event is from an already First, the event case is retrieved and transformed into a graph. existing case, the case is updated with the event. Consequently, Then, distances between the trace and process model are the graph representation is also updated by adding a new node. extracted. These metrics are fed to a density-based online Graph nodes and edges represent events and directly-follow clustering for anomaly and concept drift detection. Finally, relations, respectively. a seasonal step updates the model and releases inactive cases The second step, Distance Computation, aims to extract from memory. features that describe the case behavior. For that, the case is compared with the PMG, which captures the current business process nature. The comparison catches two perspectives:  PT  tr i=1 |P M Gtime (tr[i])−tr[i]time | trace and time, noted as graph-distance trace (GDtrace ) and GDtime (tr) = log10 PTtr j=1 P M Gtime (tr[j]) graph-distance time (GDtime ). Distances are measured in a (2) normalized PMG. From a trace perspective, the normalization Given a trace tr, Ttr is the total amount of edges in tr, is the occurrence of a specific transition divided by the value tr[i] corresponds to the ith edge in tr, tr[i]time is the time of the most occurred transition. The result of this operation delta in tr[i] and tr[i]weight is the weight of tr[i]. The graph is referred to as weight. For the time perspective, edge nor- distances capture two essential aspects of traces, the control- malization computes the mean time of a transition between flow and the time between activities. And given the flexibility activities. Equations 1 and 2 show the computing of GDtrace of the framework, it is possible to add multiple dimensions and GDtime , respectively. considering additional perspectives, e.g., event attributes. PTtr i=1 1−P M Gweight (tr[i]) The graph distances previously calculated (GDtrace and GDtrace (tr) = Ttr (1) GDtime ) are fed to a density-based clustering algorithm (Den- 1 https://github.com/gbrltv/cdesf2 Stream [15]), which identifies common behavior over time 2 https://youtu.be/Hq3xLyZOmlg and highlights outliers. DenStream provides an aggregation 3 https://github.com/gbrltv/cdesf2/blob/master/tutorial.pdf of similar process behavior, where denser regions represent common patterns and sparser regions represent anomalous which controls CP frequency, and the PMG update. Higher patterns. Moreover, it detects cases that do not belong to any time horizons imply in a longer time between updates, this cluster or are not dense enough to form one. These points is advised in processes where more constant behavior is are also considered anomalous since they are isolated in the expected. Contrarily, lower time horizons have more frequent feature space. DenStream uses the concept of micro-clusters CPs triggering more updates, which is advised for faster and further classifies micro-clusters in three types: outlier streams or processes where change is expected. Other than micro-cluster (anomalies), potential micro-cluster (potentially the visual output, CDESF also saves the PMGs during every common behavior), and core micro-cluster (common behav- CP in a JSON format, increasing the possibilities of use in ior). The detection of concept drift relies on the detection of different software. new core micro-clusters, i.e., if a new behavior has appeared in the stream and is dense enough to form a core micro-cluster, then a drift has been detected. The Check Point (CP) step has two main goals: memory control and model update. It is controlled by a hyperparameter (in time unit) that sets a time horizon to how frequent CPs happen. Regarding model update, all new cases from the last time horizon are used to build a temporary graph. The temporary graph represents the newest behavior of the process since it is built with the newest events. Then, the temporary graph is merged into the PMG to update the process representation. Before merging, a decay factor is (a) CP 1 applied to the PMG to balance new and historical data. If a directly-follows relation stops appearing, its weight decreases gradually. This step supports process model enhancement by (b) CP 3 using new information to maintain the PMG faithful to real data. Regarding memory control, older cases are released from (c) CP 28 memory according to the Nyquist sampling theorem [16]. The use of Nyquist facilitates the specialist role, taking away the Fig. 2: Evolution of the PMG through several check points need for manual setting for a forgetting mechanism. (CP). In the first CP, the PMG is still very basic since very few events have arrived in the stream. With the consumption III. T OOL M ATURITY of more events over time, the PMG is updated to represent As an algorithm, CDESF has been used in several scientific new behavior better, as seen in CPs 3 and 28. researches [7], [12], [13], [17]. It was also tested using 942 event logs with several types of concept drift and anomalous Regarding drift alerts, CDESF has two main responses. behavior [18] obtaining great performance [2]. During the stream processing, CDESF triggers drift alerts One of the main interests in PM analysis is to evaluate the when a drift is detected. This approach facilitates decision process model, i.e., process discovery. Traditionally, given a making for organizations using the framework since the drift is complete event log, the relations between events are inferred detected and alerted on the fly. Moreover, CDESF also creates to build a process model. In a stream of events, algorithms a visual output after stream processing showing the number have to periodically update the process model since recent of detected drifts and their positions. This visualization is events may represent new behavior. This way, a constant shown in Figure 3. Vertical dashed lines represent the point in model update component is required, so the process always time where drifts were detected. At the same time, the figure represents the newest behavior seen in the stream. In CDESF, demonstrates the cumulative number of drifts detected in the the CP step controls model updates using the most recent stream. In this scenario, further behavioral analysis is possible. events that have arrived in the last time window. Figure 2 All the seven drifts were detected before half of the stream. shows the PMG in different stages of the stream. In the first This means that the process suffered many changes in the CP (Figure 2a), the PMG is still very simple given the low initial stages, and with time, the process execution got a stabler amount of events that have arrived. Note that CDESF does form. not start stream processing with a pre-built model, instead, CDESF is also capable of representing the evolution of it learns the model during the stream processing. This way, clusters in the stream. Figure 4 shows the clustering feature it is expected that during the first CPs, the process model is space after 2950 processed events. Core, potential, and outlier simple since the presented behavior of the stream is minimal. micro-clusters are represented in black, blue and red, respec- Already on CP 3 (Figure 2b), we can observe how the PMG tively. The micro-clusters contain two rings, the dashed line has embedded more information. And on CP 28 (Figure 2c), is the maximum range a micro-cluster can achieve (controlled an even more complete PMG has been created. There is a by a hyperparameter), while the continuous line is the actual clear relationship between the time horizon hyperparameter, radius of the micro-cluster. It follows that the points (cases) are Drifts and cumulative number of drifts in the stream process model and supports drift detection using an online 7 clustering layer. CDESF is an available open-source tool 6 implemented as a library in Python. Number of drifts 5 R EFERENCES 4 [1] W. M. P. van der Aalst, Process Mining: Data Science in Action, 2nd ed. 3 Heidelberg: Springer, 2016. [2] P. Ceravolo, G. M. Tavares, S. Barbon, and E. Damiani, “Evaluation 2 goals for online process mining: a concept drift perspective,” IEEE Drift points Transactions on Services Computing, pp. 1–1, 2020. 1 Number of drifts [3] S. J. van Zelst, B. F. van Dongen, and W. M. van der Aalst, “Event 0 1000 2000 3000 4000 5000 stream-based process discovery using abstract representations,” Knowl- Event stream edge and Information Systems, vol. 54, no. 2, pp. 407–435, 2018. [4] V. Leno, A. Armas-Cervantes, M. Dumas, M. La Rosa, and F. M. Fig. 3: Relation of the event stream processing, number of Maggi, “Discovering process maps from event streams,” in Proceedings detected drifts and their positions. Vertical lines represent the of the 2018 International Conference on Software and System Process, drift positions and the curve shows the cumulative number of ser. ICSSP ’18. New York, NY, USA: Association for Computing Machinery, 2018, p. 86–95. drifts. [5] A. Burattin, A. Sperduti, and W. M. van der Aalst, “Control-flow discovery from event streams,” in Evolutionary Computation (CEC), 2014 IEEE Congress on. IEEE, 2014, pp. 2420–2427. represented in two ways. The black dot is a normal case falling [6] A. Burattin and J. Carmona, “A framework for online conformance checking,” in Business Process Management Workshops, E. Teniente under a core micro-cluster area, while the yellow marker and M. Weidlich, Eds. Cham: Springer International Publishing, 2018, represents anomalous behavior, i.e., cases that do not conform pp. 165–177. to the PMG. [7] G. M. Tavares, V. G. T. da Costa, V. E. Martins, P. Ceravolo, and S. Barbon, Jr., “Anomaly detection in business process based on data stream mining,” in Proceedings of the XIV Brazilian Symposium on Information Systems, ser. SBSI’18. New York, NY, USA: ACM, 2018, pp. 16:1–16:8. [Online]. Available: http: //doi.acm.org/10.1145/3229345.3229362 [8] P. Koenig, J. Mangler, and S. Rinderle-Ma, “Compliance monitoring on process event streams from multiple sources,” in 2019 International Conference on Process Mining (ICPM), 2019, pp. 113–120. [9] R. P. J. C. Bose, W. M. P. van der Aalst, I. Žliobaitė, and M. Pechenizkiy, “Handling concept drift in process mining,” in Advanced Information Systems Engineering, H. Mouratidis and C. Rolland, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 391–405. [10] A. Yeshchenko, C. Di Ciccio, J. Mendling, and A. Polyvyanyy, “Com- prehensive process drift detection with visual analytics,” in Conceptual Modeling, A. H. F. Laender, B. Pernici, E.-P. Lim, and J. P. M. de Oliveira, Eds. Cham: Springer International Publishing, 2019, pp. 119–135. [11] A. Ostovar, S. J. Leemans, and M. L. Rosa, “Robust drift character- ization from event streams of business processes,” ACM Transactions on Knowledge Discovery from Data (TKDD), vol. 14, no. 3, pp. 1–57, 2020. [12] S. Barbon Junior, G. M. Tavares, V. G. T. da Costa, P. Ceravolo, and E. Damiani, “A framework for human-in-the-loop monitoring of concept-drift detection in event log stream,” in Companion Proceedings of the The Web Conference 2018, ser. WWW ’18. International World Fig. 4: Feature space after 2950 events processed. Core micro- Wide Web Conferences Steering Committee, 2018, pp. 319–326. clusters (black regions) represent common process behavior. [13] G. M. Tavares, S. Barbon Junior, P. Ceravolo, and E. Damiani, Normal and anomalous cases are shown in black dots and “Overlapping analytic stages in online process mining,” in 2019 IEEE International Conference on Service Computing (SCC 2019), July 2019. yellow markers, respectively. [14] A. Berti, S. J. van Zelst, and W. van der Aalst, “Process mining for python (pm4py): Bridging the gap between process and data science,” 2019. All visualizations provided by CDESF are complemented [15] F. Cao, M. Estert, W. Qian, and A. Zhou, “Density-based clustering over by the metrics which are saved after the processing. CDESF an evolving data stream with noise,” in Proceedings of the 2006 SIAM provides the evolution of clusters and case metrics, which international conference on data mining. SIAM, 2006, pp. 328–339. [16] L. Lévesque, “Nyquist sampling theorem: understanding the illusion of opens opportunities for further analysis. Both visualizations a spinning wheel captured with a video camera,” Physics Education, and metrics information can be crossed in many ways to offer vol. 49, no. 6, pp. 697–705, 2014. in-depth insights into the business process. [17] N. J. Omori, G. M. Tavares, P. Ceravolo, and S. Barbon, “Comparing concept drift detection with process mining software,” iSys - Revista IV. C ONCLUSION Brasileira de Sistemas de Informação, vol. 13, no. 4, pp. 101–125, 2020. [18] G. M. Tavares, S. Barbon, and P. Ceravolo, “Synthetic event streams,” In this paper, we present CDESF, a tool for detection 2019. [Online]. Available: http://dx.doi.org/10.21227/2kxd-m509 of concept drifts and monitoring of business processes in online scenarios. CDESF handles conflicting goals (memory consumption and accuracy) in stream processing, maintains a