CC4Spark: Distributing Event Logs and Big Complex Conformance Checking Problems

Álvaro Valencia-Parra¹, Ángel Jesús Varela-Vaca¹, María Teresa Gómez-López¹ and Josep Carmona²

¹ Departamento de Lenguajes y Sistemas Informáticos, Universidad de Sevilla, Spain
² Universitat Politècnica de Catalunya, Spain

Abstract
Conformance checking is one of the disciplines that best exposes the power of process mining, since it allows detecting anomalies and deviations in business processes, helping to assess and improve their quality. This is an indispensable task, especially in Big Data environments, where large amounts of data are generated and the complexity of processes keeps increasing. CC4Spark enables companies to face this challenging scenario in two ways. First, it supports distributing conformance checking alignment problems by means of a Big Data infrastructure based on Apache Spark, allowing users to import, transform and prepare event logs stored in distributed data sources, and to solve the alignment problems in a distributed environment. Secondly, the tool supports decomposed Petri nets, which helps to noticeably reduce the complexity of the models. Both characteristics help companies face increasingly frequent scenarios with large amounts of logs and highly complex business processes. CC4Spark is not tied to any particular conformance checking algorithm, so users can employ customised algorithms.

Keywords
Conformance Checking, Big Data, Event Log Distribution

Proceedings of the Demonstration & Resources Track, Best BPM Dissertation Award, and Doctoral Consortium at BPM 2021, co-located with the 19th International Conference on Business Process Management (BPM 2021), Rome, Italy, September 6-10, 2021
avalencia@us.es (Á. Valencia-Parra); ajvarela@us.es (Á. J. Varela-Vaca); maytegomez@us.es (M. T. Gómez-López); jcarmona@upc.edu (J. Carmona)
ORCID: 0000-0002-0645-1000 (Á. Valencia-Parra); 0000-0001-9953-6005 (Á. J. Varela-Vaca); 0000-0002-3562-875X (M. T. Gómez-López); 0000-0001-9656-254X (J. Carmona)
© 2021 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

It is increasingly common for organisations to define complex business processes that must be followed to achieve their objectives [1]. These processes can include different kinds of interactions with humans, third-party services, device information, etc. Such interactions can inadvertently enable unexpected deviations with respect to the expected process model. Furthermore, the problem is aggravated when the interactions come from multiple, heterogeneous sources, such as Internet-of-Things (IoT) devices or devices involved in Cyber-Physical Systems [2]. Nowadays, conformance checking [3] techniques provide mechanisms to relate modelled and observed behaviour, so that deviations between the footprints left by process executions and the process models that formalise the expected behaviour can be revealed.

One of the major challenges in conformance checking is the alignment problem: given an observed trace σ, compute an end-to-end model run that most closely resembles σ. Computing alignments is an extremely difficult problem, with a complexity exponential in the size of the model or the trace [4]. Intuitively, computing an alignment requires a search through the state space of the model which, in certain cases, implies an extensive exploration when the process model is large and/or highly concurrent.
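As a concrete baseline, the minimal sketch below computes optimal alignments for a whole log on a single machine, exactly the setting that becomes prohibitive at scale. It assumes PM4Py's simplified 2.x interface (the library CC4Spark also builds on); the file names are placeholders.

# Minimal single-node alignment computation with PM4Py; file names are
# placeholders, and PM4Py's simplified 2.x interface is assumed.
import pm4py

log = pm4py.read_xes("event_log.xes")                # observed behaviour
net, im, fm = pm4py.read_pnml("process_model.pnml")  # expected behaviour (Petri net)

# A*-based optimal alignments: one state-space search per trace, which is
# exponential in the worst case in the size of the model and the trace.
alignments = pm4py.conformance_diagnostics_alignments(log, net, im, fm)
for result in alignments[:3]:
    print(result["cost"], result["alignment"])  # deviation cost and moves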
Traditionally, alignment problems are solved on a single node, which increases the impact of the issue exposed above, especially in two situations: (i) when the dataset containing the event logs is too large, and (ii) when the model (Petri net) is too complex. In those cases, calculating the alignment of even a single trace can be very expensive in terms of computation time. In both situations, a solution is to distribute the alignment problems and solve them in parallel. As demonstrated in [5], distributing the resolution of the alignment problems reduces the computation time, since it tends to eliminate bottlenecks. Likewise, when Petri nets are too complex, their horizontal decomposition helps reduce that complexity. We propose a tool that tackles the complexity of the alignment problem based on this principle. Thanks to the use of Apache Spark, CC4Spark can read distributed event logs and solve alignment problems in distributed environments.

2. Innovation

CC4Spark¹ is the first tool that allows distributing conformance checking alignment problems across Big Data clusters, a feature supported by the use of Apache Spark. It offers a highly scalable way of processing both raw and XES event logs and of solving conformance checking alignment problems. The tool is the result of a previous study [5]. Its innovative features are the following:

Distributed event logs. CC4Spark is able to load raw process logs from distributed data sources and to transform them into the XES-standard format. This allows event logs to be processed in parallel, so CC4Spark can handle volumes of event logs that could not be processed by a single machine. CC4Spark is also able to read non-distributed event logs contained in formatted XES files.

Processing horizontally-decomposed Petri nets. The horizontal decomposition of Petri nets is proposed in [5], and CC4Spark provides support for processing such models. Horizontal decomposition produces partial models, which are less complex than the original Petri net due to their acyclicity. These partial models are generated from runs of the original Petri net, and the set of partial models is equivalent to the original model. In this way, the complexity of cyclic Petri nets is noticeably reduced. To horizontally decompose Petri nets, an external tool (the VIP tool²) is required.

Solving distributed alignment problems. The alignment problem is distributed between the nodes of the cluster in partitions of alignment subproblems. Each subproblem consists of a specific trace and a Petri net (or a partial model if the Petri net is decomposed). In this way, all the possible combinations of traces and Petri nets (or partial models) are generated and computed, producing partial solutions (i.e., candidate solutions composed of a trace and the best alignment found for that subproblem). The configuration of the partitions (i.e., the number of partitions and the number of subproblems per partition) can be customised. CC4Spark also includes a heuristic to avoid computing unnecessary alignment subproblems: it estimates a lower bound for each subproblem, below which the optimal alignment cost of that subproblem cannot fall. Hence, if the remaining subproblems of a certain trace produce estimates higher than the cost of the last alignment computed for that trace, they are skipped. Finally, the partial solutions are combined; the solution for each trace is the best alignment found for it. Note that CC4Spark is not tied to any particular conformance checking algorithm: users are free to plug in their own implementations. Up to now, two approaches have been successfully tested (i.e., the A* algorithm implemented in the Python library PM4Py, and an approach based on the Constraint Optimisation paradigm [5]). This distribution scheme is sketched below.

¹ CC4Spark source code and documentation: https://github.com/IDEA-Research-Group/conformancechecking4spark, http://www.idea.us.es/confcheckingbigdata/
² VIP tool: https://www.fernuni-hagen.de/sttp/forschung/vip_tool.shtml
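The following PySpark sketch illustrates this map-and-reduce scheme. It is not CC4Spark's actual implementation: read_traces, load_partial_models and compute_alignment are hypothetical stand-ins for a log loader, the decomposed model files, and any pluggable conformance checking algorithm.

from pyspark.sql import SparkSession

# Hypothetical helpers, NOT CC4Spark's API:
def read_traces():
    # e.g. parse a (distributed) log into (trace_id, list-of-events) pairs
    return [("t1", ["A", "B", "C"]), ("t2", ["A", "C"])]

def load_partial_models():
    # e.g. the partial models produced by the VIP tool
    return ["partial_model_1", "partial_model_2"]

def compute_alignment(trace, model):
    # any pluggable conformance algorithm (e.g. PM4Py's A*); it must
    # return at least the alignment cost
    return {"trace": trace, "model": model, "cost": len(trace)}

spark = SparkSession.builder.appName("alignment-sketch").getOrCreate()
sc = spark.sparkContext

traces = sc.parallelize(read_traces())        # traces spread over the cluster
models = sc.broadcast(load_partial_models())  # partial models on every node

def solve_subproblems(item):
    trace_id, trace = item
    # one subproblem per (trace, partial model) combination; CC4Spark also
    # skips subproblems whose lower-bound estimate exceeds the last
    # alignment cost computed for the trace (heuristic omitted here)
    for model in models.value:
        yield (trace_id, compute_alignment(trace, model))

# combine partial solutions by trace: the cheapest alignment wins
best = (traces.flatMap(solve_subproblems)
              .reduceByKey(lambda a, b: a if a["cost"] <= b["cost"] else b))
print(best.collect())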
3. Case Studies

Case Study A: Computing deviations on distributed event logs. This case study comes from the milk manufacturing industry. The process describes how milk cans are handled from the moment they leave the farm until they pass through the factory and are bottled for sale. Each machine in the process generates logs, which include the milk can being processed, the timestamp, and the values read by the sensors (e.g., temperature, pressure). The IoT infrastructure stores these logs in a MongoDB cluster, from which CC4Spark reads the data. For the logs to be formatted as XES, the attributes that represent the case ID, the task ID, and the timestamp must be specified, together with a path to a Petri net file. The results of the alignment problems are stored in the storage system specified by the user. Thanks to CC4Spark, it is possible to monitor the whole process and detect potential abnormalities in milk bottles. The ingestion step is sketched below.
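The sketch below is a hedged illustration of this ingestion step. It assumes the MongoDB Spark connector is on the classpath (the "mongo" source and option names of connector 3.x); the URI, collection and attribute names are placeholders for the plant's actual schema.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (SparkSession.builder
         .appName("milk-plant-logs")
         .config("spark.mongodb.input.uri",
                 "mongodb://iot-cluster:27017/plant.machine_logs")  # placeholder
         .getOrCreate())

raw = spark.read.format("mongo").load()

# Map the raw attributes onto the XES-standard fields: the milk can acts as
# case ID, the machine as task/activity ID, plus the event timestamp.
# The source column names (can_id, machine, ts) are placeholders.
events = raw.select(
    F.col("can_id").alias("case:concept:name"),
    F.col("machine").alias("concept:name"),
    F.col("ts").cast("timestamp").alias("time:timestamp"),
)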
Case Study B: Computing deviations with complex Petri nets. In this example, we present the evaluations performed in [5]. Several datasets are tested, each with a specific XES event log file and a Petri net. While the event logs are smaller than in Case Study A, the Petri nets are highly complex. For this reason, in the distributed approaches, the Petri nets were decomposed using the technique described previously. The results of the tests are depicted in Figure 1, comparing the best configuration obtained for the standalone A* algorithm (the classical approach, without distribution), the A* algorithm with the best distributed configuration, and a novel approach based on Constraint Programming for solving the alignment problems (refer to [5] for further information). The elapsed real time depicted in the graph is the average over 10 executions. The results demonstrate that distributing the alignment problems and solving them in parallel improves the execution time. This is especially noticeable for the datasets M2, M5, CCC20d, and prGm6 (in the case of prGm6, neither of the other two approaches was able to solve the alignment problems in a reasonable amount of time, the distributed A* algorithm being the only one capable of obtaining the alignments).

Figure 1: Comparison of the best configuration for each algorithm. The abscissa has a logarithmic scale.

4. Tool Packaging and Maturity

CC4Spark is developed on top of PySpark 3.0.1. It requires Python 3.7 and PM4Py 2.2.8. It is packaged as a Python module which can be downloaded and imported into a PySpark project, as the sketch below illustrates.
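As orientation, the sketch below shows how such a module might be wired into a PySpark session. The commented import is a hypothetical stand-in (the real entry points are documented in the project's README), and the master setting anticipates the two deployment modes discussed next.

from pyspark.sql import SparkSession
# from conformancechecking4spark import ...  # hypothetical import; see the
#                                            # project README for real entry points

spark = (SparkSession.builder
         .appName("cc4spark-job")
         # "local[*]" runs standalone on one machine; point .master() at a
         # cluster manager (e.g. YARN or a Spark standalone master URL) for
         # the distributed mode described below.
         .master("local[*]")
         .getOrCreate())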
Although the tool facilitates the process of importing event logs and Petri net models and calculating the alignments, certain skills in Apache Spark are required to use it properly. For this reason, in future releases we intend to extend the tool with a graphical user interface. CC4Spark can be run standalone inside a PySpark project on a local computer. However, for the tool to work properly in distributed environments, a cluster based on Apache Spark is required. Our manuals include instructions on deployment in both modes. Figure 2 depicts the architecture when CC4Spark is deployed in distributed mode. Three layers are clearly differentiated: (i) Storage layer: it holds the event logs and the Petri net models. As mentioned before, the event logs can reside in distributed data sources or in a raw format; technically, this layer may be any type of distributed database, or a distributed or local file system. (ii) Computing layer: it distributes the event logs into partitions, transforms them into the XES format, generates the alignment subproblems, and solves them. CC4Spark relies on PySpark for this layer. (iii) Persistence layer: it stores the results of the alignment problems. It may be any database or storage system supported by PySpark.

5. Conclusion

CC4Spark enables users to perform conformance checking computations with large amounts of event logs and highly complex Petri nets. As reported in Section 3 and in [5], it has been tested with distributed event logs, non-distributed event logs, standard Petri nets, and decomposed Petri nets. In those cases, the tool has demonstrated scalability and an improvement with respect to the classical approaches when it runs in distributed mode. A demonstration video is available³.

Figure 2: Architecture of CC4Spark in distributed mode. The storage layer holds the logs and partial model files; the computing layer (cluster manager, driver, and executors) creates the subproblems for each assigned partition, solves them, and reduces the results by trace; the persistence layer stores the alignment results.

Acknowledgments

Ministry of Science and Technology of Spain: ECLIPSE (RTI2018-094283-B-C33) project; European Regional Development Fund (ERDF/FEDER); MINECO (TIN2017-86727-C2-1-R); University of Seville: VI Plan Propio de Investigación y Transferencia (VI PPIT-US).

References

[1] M. Dumas, W. M. P. van der Aalst, A. H. M. ter Hofstede, Process-aware Information Systems: Bridging People and Software through Process Technology, Wiley, 2005. URL: http://eu.wiley.com/WileyCDA/WileyTitle/productCd-0471663069.html.
[2] H. Roehm, J. Oehlerking, M. Woehrle, M. Althoff, Model conformance for cyber-physical systems: A survey, TCPS 3 (2019) 30:1–30:26. URL: https://doi.org/10.1145/3306157. doi:10.1145/3306157.
[3] J. Carmona, B. F. van Dongen, A. Solti, M. Weidlich, Conformance Checking - Relating Processes and Models, Springer, 2018. URL: https://doi.org/10.1007/978-3-319-99414-7. doi:10.1007/978-3-319-99414-7.
[4] A. Adriansyah, Aligning observed and modeled behavior, Ph.D. thesis, Technische Universiteit Eindhoven, 2014.
[5] Á. Valencia-Parra, Á. J. Varela-Vaca, M. T. Gómez-López, J. Carmona, R. Bergenthum, Empowering conformance checking using Big Data through horizontal decomposition, Information Systems 99 (2021) 101731. URL: https://linkinghub.elsevier.com/retrieve/pii/S0306437921000077. doi:10.1016/j.is.2021.101731.

³ Demo video: http://tiny.cc/cc4spark-video