Privacy-Preserving Process Mining with PM4Py
(Extended Abstract)
Henrik Kirchmann¹,∗, Stephan A. Fahrenkrog-Petersen¹, Martin Kabierski¹, Han van
der Aa² and Matthias Weidlich¹
¹ Humboldt-Universität zu Berlin, Unter den Linden 6, 10117 Berlin, Germany
² University of Mannheim, 68131 Mannheim, Germany


Abstract
Process Mining allows for the data-driven analysis of business processes based on logs that contain
fine-granular data from the process’ execution. However, such logs can potentially be exploited to extract
sensitive information about process participants. To mitigate this risk, techniques that anonymize event
logs to guarantee the privacy of process participants have recently been proposed. In this paper, we
report on the integration of anonymization techniques for event logs into PM4Py, one of the leading
process mining tools. Specifically, we incorporated several state-of-the-art solutions for differential
privacy-based protection. By presenting the first integration of anonymization techniques into a general
process mining toolkit, we make the respective algorithms accessible to the wider community of process
mining experts and data scientists.

Keywords
Process Mining, Privacy-preserving Data Publishing, Differential Privacy, Event Logs


1. Introduction
Process mining is a family of techniques to analyze the data recorded in information systems
during the execution of business processes. The data is stored in so-called event logs that
may include sensitive information, e.g., if they represent the clinical workflow of patients in
a hospital. Privacy regulations such as the GDPR and the CCPA enforce the protection of
such information [1]. Since it was shown that individuals can be re-identified within such
datasets [2, 3], anonymization of event logs is needed to mitigate privacy risks.
   Recently, the development of anonymization techniques for event logs has gained a lot of
attention [4, 5, 6]. Nonetheless, the adoption and uptake of these techniques have been limited. One
reason is the lack of an easy-to-use integration of anonymization techniques into existing
process mining toolkits [7, 8]. Specifically, many of the techniques for privacy-preserving
process mining have been published in stand-alone tools [9, 10, 11], and they have, so far, not
been accessible as part of the toolkits commonly used to realize process mining projects.


ICPM 2022 Doctoral Consortium and Tool Demonstration Track
∗ Corresponding author.
henrik.kirchmann@hu-berlin.de (H. Kirchmann); stephan.fahrenkrog-petersen@hu-berlin.de
(S. A. Fahrenkrog-Petersen); martin.kabierski@hu-berlin.de (M. Kabierski); han.van.der.aa@uni-mannheim.de
(H. van der Aa); matthias.weidlich@hu-berlin.de (M. Weidlich)
© 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
CEUR Workshop Proceedings (CEUR-WS.org), ISSN 1613-0073, http://ceur-ws.org


   In this demo, we address this gap with the first integration of anonymization techniques
for event logs into a leading process mining toolkit, namely PM4Py [7]. In particular, we incorporate
techniques that protect event logs with differential privacy, which is considered the
state-of-the-art privacy guarantee and has been adopted by SAP [12] and the US Census Bureau [13].
   Below, we first review the features that have been added to the PM4Py library in Section 2.
Then, we provide information on the usage of our tool and its maturity (Section 3), before we
conclude (Section 4).


2. Feature Overview
We chose to integrate our anonymization techniques into PM4Py [7] due to the rich ecosystem
provided by the toolkit. This includes, for instance, the ability to handle event logs in different
file formats, such as IEEE XES and CSV files.
   Our tool facilitates two anonymization steps: control-flow anonymization and the anonymization
of contextual information. While control-flow anonymization can be performed independently,
the anonymization of contextual information requires control-flow anonymization
as a first step. In either case, the data is protected with differential privacy through the
insertion of noise into the event logs.

2.1. Control-Flow Anonymization
Our tool offers control-flow anonymization through different algorithms that implement
so-called trace variant queries, such as the Laplacian mechanism [14] and SaCoFa [15]. Both
algorithms insert noise into trace-variant counts through the step-wise construction of a prefix
tree.
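
To make the idea of a noisy prefix tree concrete, the following is a minimal sketch and not the PM4Py implementation. It grows prefixes level by level, adds Laplace noise to each count, and discards prefixes whose noisy count is too small. The function names, the `scale = k / epsilon` noise calibration, and the list-of-lists log format are assumptions made for illustration; the sketch makes no claim to a formal privacy guarantee. The arguments `epsilon`, `k`, and `p` play the roles of the parameters described in this section.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponential samples with mean `scale`
    # follows a Laplace distribution centred at 0 with that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_trace_variants(traces, epsilon, k, p, seed=None):
    """Illustrative prefix-tree query; not the PM4Py implementation."""
    rng = random.Random(seed)
    activities = sorted({a for trace in traces for a in trace})
    # Assumption: noise grows with the depth bound k and shrinks with epsilon.
    scale = k / epsilon

    # True counts of complete variants and of all prefixes up to length k.
    variant_counts = Counter(tuple(t) for t in traces)
    prefix_counts = Counter(
        tuple(t[:i]) for t in traces for i in range(1, min(len(t), k) + 1)
    )

    def noisy(count):
        return round(count + laplace_noise(scale, rng))

    result = {}
    frontier = [()]  # start from the empty prefix (root of the tree)
    for depth in range(k + 1):
        next_frontier = []
        for prefix in frontier:
            # Noisy number of traces that end exactly at this prefix.
            if prefix:
                ended = noisy(variant_counts[prefix])
                if ended >= p:
                    result[prefix] = ended
            # Extend the prefix by every activity unless depth k is reached.
            if depth < k:
                for activity in activities:
                    child = prefix + (activity,)
                    # Pruning: keep only prefixes with noisy count >= p.
                    if noisy(prefix_counts[child]) >= p:
                        next_frontier.append(child)
        frontier = next_frontier
    return result


# Hypothetical toy log: each trace is a list of activity names.
log = [["register", "check", "pay"]] * 40 + [["register", "pay"]] * 25
print(noisy_trace_variants(log, epsilon=1.0, k=3, p=5, seed=42))
```

Without pruning, every surviving prefix is extended by each activity at every level, which is where the exponential growth in the depth bound comes from.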
   Given an event log, the algorithms are configured with the following parameters:
    • 𝜖: The strength of the differential privacy guarantee. The smaller the value of 𝜖, the
      stronger the privacy guarantee that is provided.
    • 𝑘: The maximal length of considered traces in the prefix tree. We note that this parameter
      governs the runtime complexity of both algorithms, which is 𝒪(|𝐴|^𝑘), with 𝐴 being the
      set of activities for which events have been recorded in the log. We recommend setting 𝑘
      so that roughly 80% of all traces from the original event log are covered. Setting 𝑘 higher
      might lead to event logs that overfit towards long traces.
    • 𝑝: The pruning parameter, which denotes the minimum count a prefix must have in order
      not to be discarded. The pruning parameter mitigates the runtime of the algorithms, which
      is exponential in 𝑘.
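
The recommendation for 𝑘 can be turned into a small helper. The sketch below is an illustration under our own assumptions: `suggest_k` is a hypothetical name and not part of PM4Py, and it takes a plain list of trace lengths. It returns the smallest length that covers the requested share of traces.

```python
import math


def suggest_k(trace_lengths, coverage=0.8):
    """Smallest k such that at least `coverage` of the traces have length <= k."""
    if not trace_lengths:
        raise ValueError("need at least one trace length")
    ordered = sorted(trace_lengths)
    # 1-based rank of the trace that completes the requested share.
    rank = max(math.ceil(coverage * len(ordered)), 1)
    return ordered[rank - 1]


# Hypothetical trace lengths; the two very long traces are left uncovered.
lengths = [2, 3, 3, 4, 5, 6, 7, 8, 20, 40]
print(suggest_k(lengths))
```

In this toy example the two outliers of length 20 and 40 do not influence the result, which reflects the intent of avoiding a bias towards long traces.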


2.2. Anonymization of Contextual Information
In many application scenarios, an analyst might not only study control-flow information, but
also incorporate contextual information, such as timestamps and resources. If that is the
case, a solution that solely anonymizes the control-flow is not sufficient. Our tool handles
these scenarios by the application of PRIPEL [16], an algorithm that enriches a control-flow


    import pm4py
    from pm4py.algo.anonymization.trace_variant_query import algorithm as trace_variant_query
    from pm4py.algo.anonymization.pripel import algorithm as pripel

    # Load the event log and fix the privacy budget used by both steps
    log = pm4py.read_xes("logName.xes")
    epsilon = 0.01

    # Step 1: anonymize the control flow with a SaCoFa-based trace variant query
    sacofa_result = trace_variant_query.apply(
        log=log,
        variant=trace_variant_query.Variants.SACOFA,
        parameters={"epsilon": epsilon, "k": 15, "p": 20},
    )

    # Step 2: enrich the anonymized control flow with contextual information
    anonymized_log = pripel.apply(log=log, trace_variant_query=sacofa_result, epsilon=epsilon)


Algorithm 1: An example of how to anonymize a given log with a SaCoFa-based trace variant
             query and PRIPEL


anonymized event log with contextual information, while still achieving differential privacy.
In our tool, PRIPEL can be combined with both aforementioned control-flow anonymization
techniques. For this reason, the implementation of PRIPEL requires the original event log and
the corresponding result of the control-flow anonymization as input. The approach is fine-tuned
by setting the following parameters:

        • 𝜖: The strength of the differential privacy guarantee. The 𝜖 value for PRIPEL and the 𝜖
          value for the adopted control-flow anonymization should be the same.
        • Blocklist: Some event logs contain attributes that are equivalent to a case ID. For privacy
          reasons, such attributes must be deleted from the anonymized log. The blocklist names the
          attributes to be removed. As an example, in a hospital, the case ID could be based on a
          patient visit. However, the patient ID could equally serve as a case ID and
          should therefore be omitted.
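
The effect of a blocklist can be illustrated with a short sketch. It assumes that events are available as plain dictionaries; the helper `drop_blocklisted` and the attribute names are hypothetical and do not reflect how PRIPEL represents logs internally.

```python
def drop_blocklisted(events, blocklist):
    """Return copies of the events without the blocklisted attributes."""
    blocked = set(blocklist)
    return [
        {key: value for key, value in event.items() if key not in blocked}
        for event in events
    ]


# Hypothetical hospital events: "patient_id" acts as a second case identifier.
events = [
    {"case_id": "visit-1", "activity": "Triage", "patient_id": "P-17"},
    {"case_id": "visit-1", "activity": "X-Ray", "patient_id": "P-17"},
]
cleaned = drop_blocklisted(events, blocklist=["patient_id"])
print(cleaned)
```

The original events are left untouched, so the unfiltered log remains available as input to the other steps.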


3. Availability and Usage of the Tool
Our tool is publicly available on GitHub¹. Algorithm 1 illustrates how it is applied to
anonymize an event log. First, the listing shows how a trace variant query is applied to
anonymize the control flow of the event log. Our tool adopts a factory design pattern, which
enables later extensions with novel types of trace variant queries. Afterwards, PRIPEL is
executed to anonymize the log’s contextual information. We showcase the usage in more detail
in a screencast².
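
The factory design pattern mentioned above can be sketched as follows. This is a simplified stand-in and not the code of our tool: the query functions are placeholders that only report which variant was selected, and the registry dictionary is an assumption made for illustration. Only the shape of the `apply` call mirrors Algorithm 1.

```python
from enum import Enum


def _laplace_query(log, parameters):
    # Placeholder: a real implementation would return noisy variant counts.
    return {"variant": "laplace", "epsilon": parameters.get("epsilon")}


def _sacofa_query(log, parameters):
    # Placeholder: a real implementation would return noisy variant counts.
    return {"variant": "sacofa", "epsilon": parameters.get("epsilon")}


class Variants(Enum):
    LAPLACE = "laplace"
    SACOFA = "sacofa"


# A new type of trace variant query is added by registering one more entry.
_REGISTRY = {
    Variants.LAPLACE: _laplace_query,
    Variants.SACOFA: _sacofa_query,
}


def apply(log, variant=Variants.SACOFA, parameters=None):
    """Dispatch to the implementation registered for the chosen variant."""
    return _REGISTRY[variant](log, parameters or {})


print(apply(log=[], variant=Variants.LAPLACE, parameters={"epsilon": 0.01}))
```

Callers only depend on `apply` and the `Variants` enumeration, so adding a query type does not change existing calling code.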
   Turning to the maturity of the tool, we note that it is based on algorithms that have been
published in peer-reviewed venues. Moreover, we are currently in the process of publishing our
tool as part of the official release of PM4Py.


¹ https://github.com/samadeusfp/pm4py-core-anonymization/tree/Demo-Track
² https://youtu.be/BRLMG_Bvdbs


4. Conclusion
In this paper, we presented an enhancement for a leading process mining toolkit, PM4Py, which
enables the anonymization of event logs. As such, we make the state of the art in
privacy-preserving process mining more accessible to researchers and practitioners. In future work,
we plan to extend the list of algorithms covered by our tool.


References
 [1] G. Elkoumy, S. A. Fahrenkrog-Petersen, M. F. Sani, A. Koschmider, F. Mannhardt, S. N.
     Von Voigt, M. Rafiei, L. V. Waldthausen, Privacy and confidentiality in process mining:
     threats and research challenges, ACM TMIS 13 (2021) 1–17.
 [2] S. Nuñez von Voigt, S. A. Fahrenkrog-Petersen, D. Janssen, A. Koschmider, F. Tschorsch,
     F. Mannhardt, O. Landsiedel, M. Weidlich, Quantifying the re-identification risk of event
     logs for process mining, in: International Conference on Advanced Information Systems
     Engineering, Springer, 2020, pp. 252–267.
 [3] K. Maatouk, F. Mannhardt, Quantifying the re-identification risk in published process
     models, in: ICPM Workshops, Springer, 2021, pp. 382–394.
 [4] G. Elkoumy, A. Pankova, M. Dumas, Mine me but don’t single me out: Differentially private
     event logs for process mining, in: C. D. Ciccio, C. D. Francescomarino, P. Soffer (Eds.), 3rd
     International Conference on Process Mining, ICPM 2021, Eindhoven, The Netherlands,
     October 31 - Nov. 4, 2021, IEEE, 2021, pp. 80–87. doi:10.1109/ICPM53251.2021.9576852.
 [5] M. Rafiei, W. M. P. van der Aalst, Group-based privacy preservation techniques for process
     mining, Data Knowl. Eng. 134 (2021) 101908. doi:10.1016/j.datak.2021.101908.
 [6] E. Batista, A. Solanas, A uniformization-based approach to preserve individuals’ privacy
     during process mining analyses, Peer-to-Peer Netw. Appl. 14 (2021) 1500–1519.
     doi:10.1007/s12083-020-01059-1.
 [7] A. Berti, S. J. van Zelst, W. M. P. van der Aalst, Process Mining for Python
     (PM4Py): Bridging the Gap Between Process- and Data Science, CoRR abs/1905.06169
     (2019).
 [8] G. Janssenswillen, B. Depaire, M. Swennen, M. Jans, K. Vanhoof, bupaR: Enabling
     reproducible business process analysis, Knowledge-Based Systems 163 (2019) 927–930.
 [9] M. Bauer, S. A. Fahrenkrog-Petersen, A. Koschmider, F. Mannhardt, H. van der Aa,
     M. Weidlich, ELPaaS: Event Log Privacy as a Service, in: BPM Demos
     2019, volume 2420 of CEUR Workshop Proceedings, CEUR-WS.org, 2019, pp. 159–163.
[10] M. Rafiei, A. Schnitzler, W. M. P. van der Aalst, PC4PM: A Tool for Privacy/Confidentiality
     Preservation in Process Mining, 2021.
[11] G. Elkoumy, S. A. Fahrenkrog-Petersen, M. Dumas, P. Laud, A. Pankova,
     M. Weidlich, Shareprom: A Tool for Privacy-Preserving Inter-Organizational
     Process Mining, in: BPM Demo 2020, volume 2673 of CEUR Workshop
     Proceedings, CEUR-WS.org, 2020, pp. 72–76.
[12] S. Kessler, J. Hoff, J.-C. Freytag, SAP HANA goes private: from privacy research to privacy
     aware enterprise analytics, Proceedings of the VLDB Endowment 12 (2019) 1998–2009.


[13] J. M. Abowd, The US Census Bureau adopts differential privacy, in: KDD, 2018, pp.
     2867–2867.
[14] F. Mannhardt, A. Koschmider, N. Baracaldo, M. Weidlich, J. Michael, Privacy-Preserving
     Process Mining, Business & Information Systems Engineering 61 (2019) 595–614.
[15] S. A. Fahrenkrog-Petersen, M. Kabierski, F. Rosel, H. van der Aa, M. Weidlich, SaCoFa:
     Semantics-aware Control-flow Anonymization for Process Mining, in: ICPM 2021, 2021,
     pp. 72–79. doi:10.48550/ARXIV.2109.08501.
[16] S. A. Fahrenkrog-Petersen, H. van der Aa, M. Weidlich, PRIPEL: Privacy-Preserving
     Event Log Publishing Including Contextual Information, BPM 2020 (2020).
     doi:10.1007/978-3-030-58666-9_7.

