Privacy-Preserving Process Mining with PM4Py (Extended Abstract) Henrik Kirchmann1,∗ , Stephan A. Fahrenkrog-Petersen1 , Martin Kabierski1 , Han van der Aa2 and Matthias Weidlich1 1 Humboldt-Universität zu Berlin, Unter den Linden 6, 10117 Berlin, Germany 2 University of Mannheim, 68131 Mannheim, Germany Abstract Process Mining allows for the data-driven analysis of business processes based on logs that contain fine-granular data from the process’ execution. However, such logs can potentially be exploited to extract sensitive information about process participants. To mitigate this risk, techniques that anonymize event logs to guarantee the privacy of process participants have recently been proposed. In this paper, we report on the integration of anonymization techniques for event logs into PM4Py, one of the leading process mining tools. Specifically, we incorporated several state-of-the-art solutions for differential privacy-based protection. By presenting the first integration of anonymization techniques into a general process mining toolkit, we make the respective algorithms accessible to the wider community of process mining experts and data scientists. Keywords Process Mining, Privacy-preserving Data Publishing, Differential Privacy, Event Logs 1. Introduction Process mining is a family of techniques to analyze the data recorded in information systems during the execution of business processes. The data is stored in so-called event logs that may include sensitive information, e.g., if they represent the clinical workflow of patients in a hospital. Privacy regulations such as the GDPR and the CCPA enforce the protection of such information [1]. Since it was shown that individuals can be re-identified within such datasets [2, 3], anonymization of event logs is needed to mitigate privacy risks. Recently, the development of anonymization techniques for event logs gained a lot of atten- tion [4, 5, 6]. Nonetheless, the adoption and uptake of these techniques has been limited. One reason being the lack of an easy-to-use integration of anonymization techniques into existing process mining toolkits [7, 8]. Specifically, many of the techniques for privacy-preserving process mining have been published in stand-alone tools [9, 10, 11], and they have, so far, not been accessible as part of the toolkits commonly used to realize process mining projects. ICPM 2022 Doctoral Consortium and Tool Demonstration Track ∗ Corresponding author. Envelope-Open henrik.kirchmann@hu-berlin.de (H. Kirchmann); stephan.fahrenkrog-petersen@hu-berlin.de (S. A. Fahrenkrog-Petersen); martin.kabierski@hu-berlin.de (M. Kabierski); han.van.der.aa@uni-mannheim.de (H. van der Aa); matthias.weidlich@hu-berlin.de (M. Weidlich) © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) 85 In this demo, we address this gap with the first integration of anonymization techniques for event logs in a leading process mining toolkit, i.e., PM4Py [7]. Particularly, we incorporate techniques that protect event logs with differential privacy, which is considered the state-of- the-art privacy guarantee, as also adopted by SAP [12], and the US Census Bureau [13]. Below, we first review the features that have been added to the PM4Py library in Section 2. Then, we provide information on the usage of our tool and its maturity (Section 3), before we conclude (Section 4). 2. Feature Overview We chose to integrate our anonymization techniques into PM4Py [7] due to the rich ecosystem provided by the toolkit. This includes, for instance, the ability to handle event logs in different file formats, such as IEEE XES and CSV files. Our tool facilitates two anonymization steps: Control-flow anonymization and the anonymiza- tion of contextual information. While the control-flow anonymization can be performed indepen- dently, the anonymization of contextual information requires the control-flow anonymization as a first step. In any case, we protect the privatized data with differential privacy through the insertion of noise into the event logs. 2.1. Control-Flow Anonymization Our tool offers control-flow anonymization through different algorithms that implement so- called trace variant queries, such as the Laplacian mechanism [14] and SaCoFa [15]. Both algorithms insert noise into a trace-variant count, through the step-wise construction of a prefix tree. Given an event log, the algorithms are configured with the following parameters: • 𝜖: The strength of the differential privacy guarantee. The smaller the value of 𝜖, the stronger the privacy guarantee that is provided. • 𝑘: The maximal length of considered traces in the prefix tree. We note that this parameter governs the runtime complexity of both algorithms, which is 𝒪(|𝐴|𝑘 ) with 𝐴 being the set of activities for which events have been recorded in the log. We recommend setting 𝑘, so that roughly 80% of all traces from the original event log are covered. Setting 𝑘 higher, might lead to event logs that overfit towards long traces. • 𝑝: The pruning parameter, which denotes the minimum count a prefix has to have to not be discarded. The 𝑘 dependent exponential runtime of the algorithms is mitigated by the pruning parameter. 2.2. Anonymization of Contextual Information In many application scenarios, an analyst might not only study control-flow information, but also incorporate contextual information, such as timestamps and resources. If that is the case, a solution that solely anonymizes the control-flow is not sufficient. Our tool handles these scenarios by the application of PRIPEL [16], an algorithm that enriches a control-flow 86 1 import pm4py 2 from pm4py.algo.anonymization.trace_variant_query import algorithm as trace_variant_query 3 from pm4py.algo.anonymization.pripel import algorithm as pripel 4 5 log = pm4py.read_xes(”logName.xes”) 6 epsilon = 0.01 7 8 sacofa_result = trace_variant_query.apply(log=log, variant=trace_variant_query.Variants.SACOFA, ↪ parameters={”epsilon”: epsilon, ”k”: 15, ”p”: 20}) 9 10 anonymized_log = pripel.apply(log=log, trace_variant_query=sacofa_result, epsilon=epsilon) Algorithm 1: An example how to anonymize a given log with a SaCoFa-based trace variant query and PRIPEL anonymized event log with contextual information, while still achieving differential privacy. In our tool, PRIPEL can be combined with both aforementioned control-flow anonymization techniques. For this reason, the implementation of PRIPEL requires the original event log and the corresponding result of the control-flow anonymization as input. The approach is fine-tuned by setting the following parameters: • 𝜖: The strength of the differential privacy guarantee. The 𝜖 value for PRIPEL and the 𝜖 value for the adopted control-flow anonymization should be the same. • Blocklist: Some event logs contain attributes that are equivalent to a case ID. For privacy reasons, such attributes must be deleted from the anonymized log. We handle such attributes with this list. As an example, in a hospital, the case ID could be based on a patient visit. However, the patient ID could be equivalently serving as a case ID and should therefore be omitted. 3. Availability and Usage of the Tool Our tool is publicly available on GitHub 1 . Algorithm 1 illustrates the application of it to anonymize an event log. First, the listing shows how a trace variant query is applied to anonymize the control flow of the event log. Our tool adopts a factory design pattern, which enables later extensions with novel types of trace variant queries. Afterwards, PRIPEL is also executed to anonymize the log’s contextual information. We showcase the usage in more detail in a screencast2 . Turning to the maturity of the tool, we note that it is based on algorithms that have been published in peer-reviewed venues. Moreover, we are currently in the process of publishing our tool as part of the official release of PM4Py. 1 https://github.com/samadeusfp/pm4py-core-anonymization/tree/Demo-Track 2 https://youtu.be/BRLMG_Bvdbs 87 4. Conclusion In this paper, we presented an enhancement for a leading process mining toolkit, PM4Py, which enables the anonymization of event logs. As such, we make the state of the art in privacy- preserving process mining more accessible for researchers and practitioners. In future work, we want to keep expanding the list of algorithms covered by our tool. References [1] G. Elkoumy, S. A. Fahrenkrog-Petersen, M. F. Sani, A. Koschmider, F. Mannhardt, S. N. Von Voigt, M. Rafiei, L. V. Waldthausen, Privacy and confidentiality in process mining: threats and research challenges, ACM TMIS 13 (2021) 1–17. [2] S. Nuñez von Voigt, S. A. Fahrenkrog-Petersen, D. Janssen, A. Koschmider, F. Tschorsch, F. Mannhardt, O. Landsiedel, M. Weidlich, Quantifying the re-identification risk of event logs for process mining, in: International Conference on Advanced Information Systems Engineering, Springer, 2020, pp. 252–267. [3] K. Maatouk, F. Mannhardt, Quantifying the re-identification risk in published process models, in: ICPM Workshops, Springer, 2021, pp. 382–394. [4] G. Elkoumy, A. Pankova, M. Dumas, Mine me but don’t single me out: Differentially private event logs for process mining, in: C. D. Ciccio, C. D. Francescomarino, P. Soffer (Eds.), 3rd International Conference on Process Mining, ICPM 2021, Eindhoven, The Netherlands, October 31 - Nov. 4, 2021, IEEE, 2021, pp. 80–87. doi:1 0 . 1 1 0 9 / I C P M 5 3 2 5 1 . 2 0 2 1 . 9 5 7 6 8 5 2 . [5] M. Rafiei, W. M. P. van der Aalst, Group-based privacy preservation techniques for process mining, Data Knowl. Eng. 134 (2021) 101908. doi:1 0 . 1 0 1 6 / j . d a t a k . 2 0 2 1 . 1 0 1 9 0 8 . [6] E. Batista, A. Solanas, A uniformization-based approach to preserve individuals’ privacy during process mining analyses, Peer-to-Peer Netw. Appl. 14 (2021) 1500–1519. doi:1 0 . 1007/s12083- 020- 01059- 1. [7] Alessandro Berti, Sebastiaan J. van Zelst, Wil M. P. van der Aalst, Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science, CoRR abs/1905.06169 (2019). [8] G. Janssenswillen, B. Depaire, M. Swennen, M. Jans, K. Vanhoof, bupar: Enabling repro- ducible business process analysis, Knowledge-Based Systems 163 (2019) 927–930. [9] Martin Bauer, Stephan A. Fahrenkrog-Petersen, Agnes Koschmider, Felix Mannhardt, Han van der Aa, Matthias Weidlich, ELPaaS: Event Log Privacy as a Service, in: BPM Demos 2019, volume 2420 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 159–163. [10] M. Rafiei, A. Schnitzler, W. M. P. van der Aalst, PC4PM: A Tool for Privacy/Confidentiality Preservation in Process Mining, 2021. [11] Gamal Elkoumy, Stephan A. Fahrenkrog-Petersen, Marlon Dumas, Peeter Laud, Al- isa Pankova, Matthias Weidlich, Shareprom: A Tool for Privacy-Preserving Inter- Organizational Process Mining, in: BPM Demo 2020, volume 2673 of CEUR Workshop Proceedings, CEUR-WS.org, 2020, pp. 72–76. [12] S. Kessler, J. Hoff, J.-C. Freytag, Sap hana goes private: from privacy research to privacy aware enterprise analytics, Proceedings of the VLDB Endowment 12 (2019) 1998–2009. 88 [13] J. M. Abowd, The us census bureau adopts differential privacy, in: KDD, 2018, pp. 2867–2867. [14] F. Mannhardt, A. Koschmider, N. Baracaldo, M. Weidlich, J. Michael, Privacy-Preserving Process Mining, Business & Information Systems Engineering 61 (2019) 595–614. [15] S. A. Fahrenkog-Petersen, M. Kabierski, F. Rosel, H. van der Aa, M. Weidlich, SaCoFa: Semantics-aware Control-flow Anonymization for Process Mining, in: ICPM 2021, 2021, pp. 72–79. doi:1 0 . 4 8 5 5 0 / A R X I V . 2 1 0 9 . 0 8 5 0 1 . [16] S. A. Fahrenkrog-Petersen, H. van der Aa, M. Weidlich, PRIPEL: Privacy-Preserving Event Log Publishing Including Contextual Information, BPM 2020 (2020). doi:1 0 . 1 0 0 7 / 978- 3- 030- 58666- 9_7. 89