PURPLE: a PURPose-guided Log GEnerator (Extended Abstract)

Andrea Burattin¹, Barbara Re², Lorenzo Rossi²,∗ and Francesco Tiezzi³

¹ Dpt. of Applied Mathematics and Computer Science, Technical University of Denmark, Kgs. Lyngby, Denmark
² School of Science and Technology, University of Camerino, Camerino, Italy
³ Dpt. of Statistica, Informatica, Applicazioni, University of Florence, Florence, Italy

Abstract
Process mining encompasses a variety of techniques. To test and compare these techniques, we need event logs tailored to their specific mining purposes, e.g., process discovery and conformance checking. To this end, we propose the PURPLE tool, a generator of event logs supporting different mining purposes. PURPLE performs guided simulations of a business model, shaping the resulting event log according to the selected mining purpose.

Keywords
Process mining, Event log, Simulation

1. Introduction

Process mining techniques play a crucial role in extracting non-trivial information from the execution data of business processes. The effectiveness and the precision of process mining depend on the reliability of the underlying algorithms, whose development requires testing them against event logs that suit the purpose for which the algorithm has been devised [1, 2]. Obtaining event logs that fit a specific purpose is a complex yet necessary task, since poor-quality logs hamper the use of process mining [3]. In this regard, several approaches, e.g., [4, 5], propose the automated generation of artificial event logs via the simulation of models in a predetermined language. However, most of them are not meant to produce logs fulfilling specific properties. Instead, they are purpose-agnostic, i.e., they simulate random execution traces, producing a different event log every time. To address this problem, we propose PURPLE: a PURPose-guided Log gEneration tool that implements the homonymous framework we introduced in [6].
PURPLE is a web application able to simulate BPMN [7] and Petri net [8] models and produce event logs tailored to process discovery and conformance checking purposes. Its key feature lies in its architecture, which can be adapted to many process modeling languages and mining purposes. The tool performs guided simulations of an input process model to incrementally generate specific execution traces until the produced log satisfies the desired mining purpose. We assess the maturity of PURPLE through experiments in which we measure the quality of the logs it generates.

The rest of the paper is structured as follows. Section 2 presents the PURPLE tool and its features, then Section 3 discusses the tool maturity. Finally, Section 4 indicates where to download and access the tool.

ICPM 2022 Doctoral Consortium and Tool Demonstration Track
∗ Main contributor and corresponding author.
Email: andbur@dtu.dk (A. Burattin); barbara.re@unicam.it (B. Re); lorenzo.rossi@unicam.it (L. Rossi); francesco.tiezzi@unifi.it (F. Tiezzi)
ORCID: 0000-0002-0837-0183 (A. Burattin); 0000-0001-5374-2364 (B. Re); 0000-0002-6872-0616 (L. Rossi); 0000-0003-4740-7521 (F. Tiezzi)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), http://ceur-ws.org, ISSN 1613-0073

Figure 1: PURPLE interface.

2. PURPLE main features

The PURPLE tool (Figure 1) is a progressive web app that can be used online as a service or executed on a local machine. It implements the PURPLE framework described in [6]; thus, unlike tools that generate logs from random simulations of process models, it uses guided simulations to produce event logs tailored to the mining purpose under investigation. The key innovation of PURPLE lies in its architecture, grounded on three components: a semantic engine, an evaluator, and a simulator.
Except for the simulator, which is fixed, the tool can be instantiated with a semantic engine supporting a given modeling language (e.g., BPMN, Petri nets) and an evaluator tailored to a desired mining purpose (e.g., process discovery, compliance checking).

The semantic engine implements the semantics of the input model. Given an execution state, it returns the set of reachable states (and thus the next activities that the model can perform); hence, it lets the tool generate a labeled transition system (LTS) of the input model. The evaluator checks ‘how much’ an event log satisfies the properties needed for the purpose under consideration. The result of the evaluation is the delta that will drive the simulator. A delta is a sub-trace that acts as a bias for the simulation, indicating the parts of the LTS to be traversed and thus influencing the produced traces. The simulator explores the LTS to produce execution traces to be added to the log. Taking the delta as input, the simulator tries to include the suggested sub-trace in the LTS traversal. This guarantees the production of traces, and hence of a log, that satisfy the desired mining purpose.

By fixing a modeling language and a mining purpose, we get an instantiation of PURPLE ready for producing logs. The event log is created in iterative steps. In each of them, the tool adds a new trace to the log and checks whether the purpose is satisfied. If not, the next step is guided to produce a new trace that better shapes the log for the target purpose.

Besides implementing the architecture, the PURPLE tool provides different instantiations of the framework. It implements four evaluators addressing mining purposes related to process discovery and conformance checking, and two semantic engines that allow simulating BPMN and Petri net models. In the following, we describe these instantiations from the evaluator’s point of view, since the evaluator is the component devised to shape the produced log.
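
The interplay of the three components can be illustrated with a minimal Python sketch. It is not the PURPLE implementation: the toy LTS, the function names, and the evaluator's purpose (log every activity at least once) are assumptions made for illustration, and the simulator here follows shortest paths instead of random walks so that the result is reproducible.

```python
from collections import deque

# Hypothetical toy LTS, as a semantic engine might expose it:
# state -> list of (activity, next_state) pairs.
LTS = {
    "s0": [("A", "s1")],
    "s1": [("B", "s2"), ("C", "s2")],
    "s2": [("D", "end")],
    "end": [],
}


def run(lts, state, activities):
    """Follow `activities` from `state`; return the reached state or None."""
    for wanted in activities:
        targets = [nxt for activity, nxt in lts[state] if activity == wanted]
        if not targets:
            return None
        state = targets[0]
    return state


def shortest_path(lts, start, accept):
    """Breadth-first search from `start` to the first state accepted by
    `accept`; returns (activities, state), or None if no state is accepted."""
    queue, seen = deque([(start, [])]), {start}
    while queue:
        state, path = queue.popleft()
        if accept(state):
            return path, state
        for activity, nxt in lts[state]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [activity]))
    return None


def simulate(lts, initial, final, delta):
    """Simulator: produce one trace, trying to include `delta` in it."""
    found = shortest_path(lts, initial, lambda s: run(lts, s, delta) is not None)
    if found is None:  # the suggestion cannot be followed: ignore it
        delta, found = [], ([], initial)
    prefix, state = found
    # Assumption: the final state is reachable from every state.
    suffix, _ = shortest_path(lts, run(lts, state, delta), lambda s: s == final)
    return prefix + list(delta) + suffix


def evaluate(lts, log):
    """Toy evaluator: the purpose is met once every activity is logged.
    Returns a delta (a sub-trace to include next) or None when satisfied."""
    logged = {activity for trace in log for activity in trace}
    for state in sorted(lts):
        for activity, _ in lts[state]:
            if activity not in logged:
                return [activity]
    return None


def generate_log(lts, initial, final, max_traces=100):
    """Add one trace per step until the evaluator reports no delta."""
    log, delta = [], []  # an empty delta means an unguided first step
    while delta is not None and len(log) < max_traces:
        log.append(simulate(lts, initial, final, delta))
        delta = evaluate(lts, log)
    return log


print(generate_log(LTS, "s0", "end"))
# [['A', 'B', 'D'], ['A', 'C', 'D']]
```

In this sketch the first trace is unguided; the evaluator then notices that activity C is missing and returns it as the delta, so the second simulation is steered through C and the loop stops.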

The first instantiation, process discovery via order relations, is devised for algorithms, like the Alpha miner, that rely on the order relations between activities to obtain an accurate version of the original model. This instantiation generates event logs covering the order relations seen in the input model without logging the same trace, and hence the same order relations, multiple times. Therefore, PURPLE produces the smallest log covering the relations in the footprint matrix. The evaluator component implemented for this purpose compares the order relations found in the input model with the ones generated in the log up to that moment, producing a delta accordingly. For instance, if the log misses a sequence relation between activities 𝐴 and 𝐵, the delta will contain the sub-trace ⟨𝐴, 𝐵⟩. In this way, the next simulation tries to produce a trace containing ⟨𝐴, 𝐵⟩. Once all the order relations of the input model are covered, PURPLE outputs the log as a .xes file.

The second instantiation, process discovery via frequencies, aims at generating event logs for discovery algorithms based on frequencies, such as the Heuristics miner. To address this purpose, it produces logs where some traces can be more or less frequent than others. During the log generation, the evaluator calculates the occurrences of traces in the current log and the thresholds for the loops chosen by the user, then generates a delta accordingly. If some of these values are lower than requested, the evaluator passes to the simulator a delta containing the entire traces that are still too infrequent in the log. Once the requested occurrence percentages are satisfied, PURPLE returns the log as a .xes file.

The other two instantiations regard conformance checking. To check the reliability of such techniques or to compare their performance, it is necessary to have logs embedding traces with deviations from the normal behavior, i.e., noisy behaviors.
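
Returning to the first instantiation, the delta computation of the order-relation evaluator can be sketched as follows. This is a simplification under stated assumptions: only directly-follows pairs are considered rather than the full footprint matrix, the relations required by the model are given as a ready-made set, and the function names are hypothetical.

```python
def directly_follows(traces):
    """Pairs (a, b) such that b immediately follows a in some trace."""
    return {(a, b) for trace in traces for a, b in zip(trace, trace[1:])}


def order_relation_delta(required, log):
    """Return a missing relation as a sub-trace [a, b], or None if the
    log already covers every required relation."""
    missing = sorted(required - directly_follows(log))
    return list(missing[0]) if missing else None


# Hypothetical relations of a model where A is followed by B or C, then D.
required = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}

log = [["A", "B", "D"]]
print(order_relation_delta(required, log))  # ['A', 'C']

log.append(["A", "C", "D"])
print(order_relation_delta(required, log))  # None
```

A `None` result plays the role of "purpose satisfied": no relation is missing, so no further guided simulation is needed.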
We propose two instantiations of PURPLE producing event logs from BPMN and Petri net models with a precise amount of noisy behavior or with a precise alignment cost.

The conformance checking via noise frequencies instantiation generates event logs with the desired percentages of noisy traces. The user can choose the noise percentages for missing head, missing tail, missing episode, order perturbation, and additional event. In this case, the evaluator passes an empty delta to the simulator so that it returns a random trace without noise. Then, it compares the percentage of occurrences of each type of noise in the current log with the requested one. Consequently, the trace is modified by introducing the type of noise whose percentage is farthest from the requested one. Once PURPLE reaches the desired percentages, it returns the log.

The last instantiation, conformance checking via fixed align cost, generates event logs with a precise amount of noise that involves a specific cost for the alignment. For this purpose, the tool extracts from the model the set of traces that can be produced and uses them later for calculating the alignment costs. Then, similarly to the purpose above, the evaluator receives from the simulator traces without noise, perturbs them with a type of noise, and updates the reached alignment cost. Every time a noisy trace is added to the current log, the evaluator calculates the optimal alignment cost between the noisy trace and the traces previously extracted from the model.

3. Tool maturity

To show the maturity of the tool, we provide the results of experiments in which we measured the quality of the event logs produced by each PURPLE instantiation. When possible, we compare our results with the ones achieved by reference tools, such as PLG2, BIMP (https://bimp.cs.ut.ee/), and the ProM plugin of the GED methodology (https://www.promtools.org/). For the experiments, we used BPMN and Petri net models whose size ranges from 8 to 53 elements.
They are both structured and unstructured, and some of them contain loops. Regarding process discovery via order relations, PURPLE covers 100% of the relations for all the models, while PLG2, GED, and BIMP collected 93.1%, 77.1%, and 44.1% of the relations, respectively. In process discovery via frequencies, PURPLE always produces logs with the required number of traces per type. In this case, PURPLE was not compared with any other tool, since none of PLG2, GED, and BIMP produces logs based on frequencies. In conformance checking via noise frequencies, PURPLE produces exactly the required frequencies of noisy traces, while PLG2, used for the comparison, approximates this result with an error of 20.8%. Finally, in conformance checking via fixed align cost, PURPLE produces logs with alignment costs that, in the worst case, differ by only 2.3% from the required cost.

During the experiments, we measured the time spent by each tool for generating the logs. The reference tools take about 5 seconds on average, while PURPLE takes around 15 seconds, depending on the purpose. Even if PURPLE is less efficient than the other tools, it produces better logs in a reasonable amount of time.

4. Screencast and Website

The PURPLE tool, together with its source code, examples, artifacts generated during the experiments, user guide, and screencast, is available at https://pros.unicam.it/purple/.

References

[1] B. Van Dongen, A. De Medeiros, L. Wen, Process mining: Overview and outlook of Petri net discovery algorithms, in: ToPNoC, volume 5460 of LNCS, Springer, 2009, pp. 225–242.
[2] A. de Medeiros, C. Günther, Process Mining: Using CPN Tools to Create Test Logs for Mining Algorithms, in: Practical Use of Coloured Petri Nets and CPN Tools, volume 576, 2005, pp. 177–190.
[3] R. Bose, R. Mans, W. van der Aalst, Wanna improve process mining results?, in: CIDM, IEEE, 2013, pp. 127–134.
[4] A. Burattin, PLG2: Multiperspective Process Randomization with Online and Offline Simulations, in: BPM Demo Track, volume 1789, CEUR-WS.org, 2016, pp. 1–6.
[5] E. Esgin, P. Karagoz, Process Profiling based Synthetic Event Log Generation, in: IC3K, volume 1, SCITEPRESS, 2019, pp. 516–524.
[6] A. Burattin, B. Re, L. Rossi, F. Tiezzi, A Purpose-Guided Log Generation Framework, in: Business Process Management, volume 13420 of LNCS, 2022, pp. 181–198.
[7] F. Corradini, C. Muzi, B. Re, L. Rossi, F. Tiezzi, Formalising and animating multiple instances in BPMN collaborations, Information Systems 103 (2022).
[8] J. Meseguer, U. Montanari, V. Sassone, On the semantics of Petri nets, in: CONCUR, volume 630 of LNCS, Springer, 1992, pp. 286–301.