CauseCheck: A Tool for Simulating Deviations in Event Logs with Known Root Causes

Frederik Hake1,*, Simon Schneider1,*, Nikolaos Theofanopoulos1,*, Camila Gonzalez1,*, Poey Sie Chuah1,*, Michael Grohs1,* and Jana-Rebecca Rehse1,*

1 University of Mannheim, L15 1-6, 68161 Mannheim, Germany

Abstract
Conformance checking compares process executions in an event log to a process model to detect where and how the executions deviate from the model. However, these techniques cannot explain why deviations occur, i.e., what the root causes of deviations are. Root cause analysis techniques have been proposed for this purpose, but their suitability for conformance deviations is uncertain due to a lack of appropriate evaluation data. To address this gap, this paper presents CauseCheck, a tool that simulates event logs with deviations from a process model for which the causes are known. In particular, the user defines such deviations and assigns corresponding root causes in the form of trace and event attributes. The resulting logs can thus be used to demonstrate a technique's ability to re-discover the known root causes of deviations.

Keywords
Process Mining, Conformance Checking, Root Cause Analysis, Event Log Generation

Metadata description: Value
Tool name: CauseCheck
Current version: 1.0
Legal code license: Apache 2.0
Languages, tools, and services used: React.tsx, Python, PM4Py, GraphViz
Supported operating environment: Microsoft Windows
Download/Demo URL: https://github.com/FrederikHake/CauseCheck
Documentation URL: https://github.com/FrederikHake/CauseCheck/blob/main/src/frontend/public/Manual.pdf
Source code repository: https://github.com/FrederikHake/CauseCheck
Screencast video: https://github.com/FrederikHake/CauseCheck/blob/main/Demo%20CauseCheck.mp4

ICPM 2024 Tool Demonstration Track, October 14-18, 2024, Kongens Lyngby, Denmark
* Corresponding author.
frederik.hake@students.uni-mannheim.de (F. Hake); simon.schneider@students.uni-mannheim.de (S. Schneider); nikolaos.theofanopoulos@students.uni-mannheim.de (N. Theofanopoulos); camila.gonzalez.de.aranda@students.uni-mannheim.de (C. Gonzalez); poey.chuah@students.uni-mannheim.de (P. S. Chuah); michael.grohs@uni-mannheim.de (M. Grohs); rehse@uni-mannheim.de (J. Rehse)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073

1. Introduction

Conformance checking aims to analyze the relation between the intended behavior of a process, captured in a process model, and the observed behavior of a process, captured in an event log [1]. Over the last years, multiple conformance checking techniques have been developed, such as rule checking, token-based replay, and alignments [2]. As output, these techniques often quantify the degree of conformance, the so-called fitness. Some also provide more detailed insights; for example, alignments identify which events have been inserted or skipped [1]. Existing conformance checking techniques are thus able to identify how and where process executions deviate, but they do not provide any insights into why a deviation occurred [2]. Insights into why deviations occur could help process managers to prevent them. Consequently, it would be desirable to support managers by deriving causes for the deviations directly from event data, i.e., by deriving which attributes causally increase the likelihood of deviations [3]. Although some techniques aim to unravel root causes for problems (e.g., [3, 4]), they have not been shown to detect such causes for conformance deviations.
To assess and compare the quality of approaches that derive root causes for deviations, appropriate evaluation data is required in the form of event logs that contain deviations from a process model for which the root causes are available as a ground truth. Publicly available real-life event logs do not provide such a ground truth of root causes for deviations. This is a common problem when evaluating root cause analysis techniques, not only when analyzing causes of deviations. Thus, evaluations are often based on illustrations on these real-life logs rather than on comparisons of a technique's capabilities against a ground truth [3, 4]. The lack of appropriate evaluation data can be countered by using simulated data with a ground truth, which has not been done for conformance deviations.

To address this gap, we present the CauseCheck tool for simulating event logs with deviations from a process model for which the root cause is known. Given a process model that captures the intended process behavior as input, the tool simulates an event log and synthetically injects different types of deviations that can occur in a process. In particular, it lets users define deviations based on five patterns commonly used to characterize deviations [5]: inserted, skipped, repeated, replaced, and swapped activities (or sequences thereof). These deviations are assigned root causes in the form of trace and event attributes. For that, the user first defines these trace and event attributes. Then, whenever a particular attribute has a particular value, a deviation occurs with a user-defined likelihood. For example, one potential cause-deviation pair could be "whenever the bank is equal to Bank A, activity Z is skipped in 50% of the traces although it is required according to the model". Further, users can define noise levels, i.e., random occurrences of deviations that are not attributed to a root cause.
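The cause-deviation mechanism can be sketched in a few lines of plain Python. The following is a minimal illustration, not CauseCheck's actual implementation; the dictionary-based trace representation, the attribute name "bank", and the activity names are assumptions based on the running example.

```python
# Illustrative sketch of injecting a "skipped" deviation whose root cause is
# a trace attribute value. Trace structure and names are hypothetical.
import random


def inject_skip(traces, activity, cause_attr, cause_value, likelihood, rng):
    """Skip `activity` in traces whose attribute matches the cause value,
    with the given likelihood. Returns the ids of affected traces, i.e.,
    the ground truth of the injected deviation."""
    ground_truth = []
    for trace in traces:
        if trace["attrs"].get(cause_attr) == cause_value and rng.random() < likelihood:
            trace["events"] = [e for e in trace["events"] if e != activity]
            ground_truth.append(trace["id"])
    return ground_truth


rng = random.Random(42)
traces = [{"id": i,
           "attrs": {"bank": "Bank A" if i % 2 == 0 else "Bank B"},
           "events": ["A_SUBMITTED", "A_PARTLYSUBMITTED", "A_PREACCEPTED"]}
          for i in range(1_000)]

# "Whenever the bank is Bank A, A_PARTLYSUBMITTED is skipped in 50% of the traces."
ground_truth = inject_skip(traces, "A_PARTLYSUBMITTED", "bank", "Bank A", 0.5, rng)
```

Since half of the traces belong to Bank A and half of those are hit, roughly a quarter of all traces end up deviating, while the returned ids preserve the ground truth against which a root cause analysis technique can later be scored.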
The tool returns an event log that contains the deviations and the root causes within the trace and event attributes.

2. The CauseCheck Tool

The CauseCheck tool can be accessed at https://github.com/FrederikHake/CauseCheck. In this repository, the user can find the source code, instructions on how to run the tool, and further documentation. A demo video is available at https://github.com/FrederikHake/CauseCheck/blob/main/Demo%20CauseCheck.mp4.

2.1. Functional Components

As illustrated in Fig. 1, a process model is required as input to set up an evaluation experiment with CauseCheck. This process model describes the to-be behavior, and a playout of it is synthetically changed to contain the deviations and causes. Then, the user defines general characteristics of the desired event log as well as decision point probabilities within the process model. After that, all event and trace attributes are created. Subsequently, the user defines which deviations from the process model occur and by which trace and event attributes they are caused. Finally, the user has the option to include a level of noise. As output, the tool generates an XES file with the simulated deviations and causes. In the following, we present the steps of the tool in more detail, using a loan application process as running example.

Figure 1: Overview of CauseCheck's functional components

General Characteristics. After providing a process model which captures the intended behavior in BPMN or PNML format as input, the user is requested to specify the time frame of the event log, as illustrated in Fig. 2. Further, the size of the event log should be defined.

Figure 2: General Characteristics
Figure 3: Decision Point Probabilities

Decision Point Probabilities. Process models often include decision points where only one of multiple paths should be followed.
There are processes where certain paths are less common than others, e.g., a cancellation of a loan application is less common than its eventual payout. To account for these cases and define the intended behavior in detail, CauseCheck requires the user to assign decision point probabilities to all XOR-choices in the process model. As shown in Fig. 3, the tool displays the process model as a Petri net in the upper part of the screen to identify decision points. Then, the user should define probabilities for each path. Per default, the simulation assumes an equal likelihood for all possible paths.

Trace & Event Attributes. The third page of the application prompts the user to define trace and event attributes, which can later be selected as root causes for deviations, starting with the event attributes. For that, the user can select attributes of the types numerical, categorical, and time-related from a drop-down menu. For all event attributes, the user defines which attribute values are possible and how likely each value is for each activity the event is associated with. Thereby, only the timestamp is mandatory to be selected so that the user is able to proceed to the next page of the application. After the successful definition of all event attributes, the tool shows a summary of them. Similarly to the event attributes, the user is then prompted to define trace attributes. This particular page is optional. Again, the user may select a trace attribute type from the drop-down menu, define all possible attribute values, and specify their corresponding likelihoods. For instance, Fig. 4 shows the trace attribute "Bank" with the three possible values "Bank A", "Bank B", and "Bank C".

Figure 4: Trace & Event Attributes

Deviations & Causes. In the next step, the user defines the deviations as well as the corresponding ground truth of causes.
For that, users can select out of five commonly used deviation types [5]:

(1) inserted: an activity (or a sequence) is executed in addition to the intended model behavior at a point in the trace, which is defined by the user
(2) skipped: an activity (or a sequence) is not executed although required by the model
(3) repeated: an activity (or a sequence) is wrongly re-executed after it has previously been executed in accordance with the model
(4) replaced: an activity (or a sequence) is executed instead of another activity (or sequence)
(5) swapped: two activities (or sequences) are performed in the wrong order

For each deviation, the user enters a unique identifier and the activity (or sequence) it refers to. For example, the user can specify that activity "A_partly submitted" is skipped. Then, a cause (or multiple causes) of the deviation should be defined. For that, users can select any trace or event attribute and choose a particular value as the cause. Further, a likelihood of deviation occurrence is required. For example, as illustrated in Fig. 5, the user can specify that "A_partly submitted" is skipped with a likelihood of 50% whenever the trace attribute "Bank" is equal to "Bank A". This likelihood corresponds to the causal effect of the attribute value "Bank A" on the skip. After specifying these details, the user adds the new deviation. This can be done for any number of deviations.

Figure 5: Deviations & Causes

Noise. In the final step, the user can add noise to the event log. This noise is defined as random occurrences of deviations with no associated cause. The user can add general noise (i.e., random occurrence of an undefined deviation), type-specific noise (i.e., random occurrence of a deviation type like skipped), and deviation-specific noise (i.e., random occurrence of a previously defined deviation). For all options, the level of noise is assigned as a probability.
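Semantically, the five deviation patterns can be read as list operations on an activity sequence. The sketch below only illustrates the pattern semantics, not CauseCheck's implementation; all positions and activity names are hypothetical.

```python
# Illustrative list-operation semantics of the five deviation patterns [5].
# Positions and activity names are hypothetical examples.

def inserted(trace, act, pos):
    """Execute an extra activity at a user-defined position."""
    return trace[:pos] + [act] + trace[pos:]

def skipped(trace, act):
    """Drop an activity that the model requires."""
    return [a for a in trace if a != act]

def repeated(trace, act):
    """Wrongly re-execute an activity right after its first occurrence."""
    i = trace.index(act)
    return trace[:i + 1] + [act] + trace[i + 1:]

def replaced(trace, old, new):
    """Execute one activity instead of another."""
    return [new if a == old else a for a in trace]

def swapped(trace, a, b):
    """Perform two activities in the wrong order."""
    i, j = trace.index(a), trace.index(b)
    out = trace[:]
    out[i], out[j] = out[j], out[i]
    return out

trace = ["A", "B", "C", "D"]
```

Each function returns a new deviating variant of the input sequence, which is the shape of change the tool injects into the playout of the process model.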
The last screen of the application lets the user download the simulated event log with all its deviations and causes. The user can download both the event log with the deviations they created and a deviation-free log.

2.2. Tool Architecture

The CauseCheck tool features a Python-based back-end and a React-based front-end that communicate through request and response mechanisms. The back-end uses Flask-Session to preserve the session state even if the front-end is closed during use. Further, PM4Py [6] handles the process model and its playout. The front-end, built with TypeScript, incorporates the Material UI React component library for flexibility and easy customization.

3. Maturity

We used the tool on process models of the BPI Challenges 2012 (sub-process with A_ activities only; 12A) and 2020 (International Declarations; Int.) obtained from [7]. 12A is rather straightforward with only 10 activities, whereas Int. is more complex with 34 activities and a potential loop. Based on these models, we simulated 32 logs in total, 16 for each of 12A and Int. In particular, we inserted either 5, 10, 15, or 20 different deviations into logs with either 100, 1,000, 10,000, or 100,000 traces. Executing these 4 × 4 combinations led to 16 logs per model, which we uploaded to our repository. The execution times in seconds for all 32 logs in Tab. 1 indicate reasonable computational efficiency. Times are not influenced by the number of deviations and scale approximately linearly with the number of traces. For Int., the times are substantially longer due to the more complex process model, but still reasonable at around 2 hours for the largest logs.

To show the functionality of CauseCheck, consider Fig. 6. It illustrates the occurrences of a skip of the first activity in the 12A log within the simulation of 1,000 traces. This activity should occur in every trace but is synthetically skipped in 50% of the traces associated with Bank A.
Since only 50% of the traces are associated with Bank A, the deviation should on average exist in 25% of the traces. In our simulation, the skip occurs in 233 of 1,000 traces, indicating that, after accounting for randomness in the probabilities, the correct number of traces contains the deviation.

Figure 6: Distribution of Skip Based on Trace Attribute Bank

Table 1: Execution Times in Seconds for 12A and Int., subdivided by No. of Deviations and No. of Traces

12A:
# Traces    5 Devs   10 Devs   15 Devs   20 Devs
100              2         3         2         1
1,000            3         2         6         3
10,000          21         9        47        25
100,000        337       323       359       322

Int.:
# Traces    5 Devs   10 Devs   15 Devs   20 Devs
100             12        10        10         8
1,000          113       109        85       118
10,000         940       753       732       646
100,000      7,258     7,115     6,808     7,142

Further, consider Fig. 7, which shows the alignment of a different deviating trace. Concretely, A_FINALIZED and A_ACCEPTED are swapped with each other, visible as a log and a model move on A_FINALIZED with a synchronous move on A_ACCEPTED in between.

Figure 7: Alignment of Trace with Swap of A_FINALIZED and A_ACCEPTED in BPIC12
Log:   A_SUBMITTED  A_PARTLYSUBMITTED  A_PREACCEPTED  A_FINALIZED  A_ACCEPTED  ≫            A_APPROVED  A_REGISTERED  A_ACTIVATED
Model: A_SUBMITTED  A_PARTLYSUBMITTED  A_PREACCEPTED  ≫            A_ACCEPTED  A_FINALIZED  A_APPROVED  A_REGISTERED  A_ACTIVATED

4. Conclusion

We presented CauseCheck, a tool for generating synthetic event logs with conformance deviations for which the ground truth of causes is known. It aims to provide a realistic simulation by assigning probabilities to decision points and by incorporating noise. This allows researchers to evaluate tools that aim to uncover root causes for conformance deviations. In particular, the output of root cause analysis techniques can be compared to the ground truth, quantifying whether the correct causes for the deviations are detected. In the future, we want to analyze the capabilities of techniques to re-discover deviation causes and potentially propose our own solution for the task.

References

[1] J. Carmona, B. van Dongen, A. Solti, M. Weidlich, Conformance Checking - Relating Processes and Models, Springer, 2018.
[2] M. Grohs, J.-R. Rehse, Attribute-based conformance diagnosis: Correlating trace attributes with process conformance, in: ICPM Workshops, 2023, pp. 203–215.
[3] M. Qafari, W. van der Aalst, Feature recommendation for structural equation model discovery in process mining, Prog Artif Intell (2022) 1–25.
[4] Z. Bozorgi, I. Teinemaa, M. Dumas, M. La Rosa, A. Polyvyanyy, Process mining meets causal machine learning: Discovering causal rules from event logs, in: ICPM, 2020, pp. 129–136.
[5] M. Hosseinpour, M. Jans, Auditors' categorization of process deviations, Journal of Information Systems 38 (2024) 67–89.
[6] A. Berti, S. J. van Zelst, W. van der Aalst, Process mining for python (PM4Py): Bridging the gap between process- and data science, ICPM Demos (2019).
[7] M. Grohs, P. Pfeiffer, J.-R. Rehse, Business process deviation prediction: Predicting non-conforming process behavior, in: ICPM, 2023, pp. 113–120.