<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Privacy-Preserving Process Mining with PM4Py (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Henrik Kirchmann</string-name>
          <email>henrik.kirchmann@hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stephan A. Fahrenkrog-Petersen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Kabierski</string-name>
          <email>martin.kabierski@hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Han van der Aa</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matthias Weidlich</string-name>
          <email>matthias.weidlich@hu-berlin.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Process Mining, Privacy-preserving Data Publishing, Differential Privacy, Event Logs</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Humboldt-Universität zu Berlin</institution>
          ,
          <addr-line>Unter den Linden 6, 10117 Berlin</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Mannheim</institution>
          ,
          <addr-line>68131 Mannheim</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <fpage>85</fpage>
      <lpage>89</lpage>
      <abstract>
        <p>Process mining allows for the data-driven analysis of business processes based on logs that contain fine-granular data from the process' execution. However, such logs can potentially be exploited to extract sensitive information about process participants. To mitigate this risk, techniques that anonymize event logs to guarantee the privacy of process participants have recently been proposed. In this paper, we report on the integration of anonymization techniques for event logs into PM4Py, one of the leading process mining tools. Specifically, we incorporated several state-of-the-art solutions for differential privacy-based protection. By presenting the first integration of anonymization techniques into a general process mining toolkit, we make the respective algorithms accessible to the wider community of process mining experts and data scientists.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Process mining is a family of techniques to analyze the data recorded in information systems
during the execution of business processes. The data is stored in so-called event logs that
may include sensitive information, e.g., if they represent the clinical workflow of patients in
a hospital. Privacy regulations such as the GDPR and the CCPA enforce the protection of
such information [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Since it was shown that individuals can be re-identified within such
datasets [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ], anonymization of event logs is needed to mitigate privacy risks.
      </p>
      <p>
        The development of anonymization techniques for event logs has recently gained considerable
attention [
        <xref ref-type="bibr" rid="ref4 ref5 ref6">4, 5, 6</xref>
        ]. Nonetheless, the adoption and uptake of these techniques has been limited. One
reason is the lack of an easy-to-use integration of anonymization techniques into existing
process mining toolkits [
        <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
        ]. Specifically, many of the techniques for privacy-preserving
process mining have been published in stand-alone tools [
        <xref ref-type="bibr" rid="ref10 ref11 ref9">9, 10, 11</xref>
        ], and they have, so far, not
been accessible as part of the toolkits commonly used to realize process mining projects.
      </p>
      <p>
        In this demo, we address this gap with the first integration of anonymization techniques
for event logs in a leading process mining toolkit, i.e., PM4Py [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]. In particular, we incorporate
techniques that protect event logs with differential privacy, which is considered the
state-of-the-art privacy guarantee, as also adopted by SAP [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], and the US Census Bureau [13].
      </p>
      <p>Below, we first review the features that have been added to the PM4Py library in Section 2.
Then, we provide information on the usage of our tool and its maturity (Section 3), before we
conclude (Section 4).</p>
    </sec>
    <sec id="sec-2">
      <title>2. Feature Overview</title>
      <p>
        We chose to integrate our anonymization techniques into PM4Py [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] due to the rich ecosystem
provided by the toolkit. This includes, for instance, the ability to handle event logs in different
file formats, such as IEEE XES and CSV.
      </p>
      <p>Our tool facilitates two anonymization steps: control-flow anonymization and the
anonymization of contextual information. While the control-flow anonymization can be performed
independently, the anonymization of contextual information requires the control-flow anonymization
as a first step. In both cases, we protect the published data with differential privacy through the
insertion of noise into the event logs.</p>
      <sec id="sec-2-1">
        <title>2.1. Control-Flow Anonymization</title>
        <p>Our tool offers control-flow anonymization through different algorithms that implement
so-called trace variant queries, such as the Laplacian mechanism [14] and SaCoFa [15]. Both
algorithms insert noise into trace-variant counts through the step-wise construction of a prefix
tree.</p>
        <p>Given an event log, the algorithms are configured with the following parameters:
• ε: The strength of the differential privacy guarantee. The smaller the value of ε, the
stronger the privacy guarantee that is provided.
• <italic>k</italic>: The maximal length of considered traces in the prefix tree. We note that this parameter
governs the runtime complexity of both algorithms, which is <italic>O</italic>(|<italic>A</italic>|<sup><italic>k</italic></sup>), with <italic>A</italic> being the
set of activities for which events have been recorded in the log. We recommend setting <italic>k</italic>
so that roughly 80% of all traces from the original event log are covered. Setting <italic>k</italic> higher
might lead to event logs that overfit towards long traces.
• <italic>p</italic>: The pruning parameter, which denotes the minimum count a prefix must have in order not
to be discarded. The runtime of the algorithms, which is exponential in <italic>k</italic>, is mitigated by the
pruning parameter.</p>
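        <p>To make the interplay of the three parameters concrete, the following is a minimal, self-contained sketch of a Laplace-based trace variant query over a prefix tree. It is not the implementation shipped with our tool: the function names (choose_k, laplace_noise, noisy_prefix_counts), the representation of traces as tuples of activity names, and the even split of the privacy budget over the tree levels are assumptions made purely for illustration.</p>
        <preformat>
```python
import math
import random
from collections import Counter


def choose_k(traces, coverage=0.8):
    """Smallest k such that about `coverage` of all traces have at most k events."""
    lengths = sorted(len(trace) for trace in traces)
    index = max(0, math.ceil(coverage * len(lengths)) - 1)
    return lengths[index]


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential samples."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_prefix_counts(traces, activities, epsilon, k, p, seed=None):
    """Build a pruned prefix tree with Laplace-perturbed counts (illustrative only)."""
    rng = random.Random(seed)
    # Assumption of this sketch: the budget is split evenly over the k tree levels.
    scale = k / epsilon
    true_counts = Counter(
        tuple(trace[:i]) for trace in traces for i in range(1, min(len(trace), k) + 1)
    )
    kept, frontier = {}, [()]
    for _ in range(k):
        next_frontier = []
        for prefix in frontier:
            # Every activity is a candidate extension, hence |A|^k in the worst case.
            for activity in activities:
                candidate = prefix + (activity,)
                noisy = true_counts[candidate] + laplace_noise(scale, rng)
                if noisy >= p:  # prune prefixes whose noisy count is below p
                    kept[candidate] = noisy
                    next_frontier.append(candidate)
        frontier = next_frontier
    return kept
```
        </preformat>
        <p>In this sketch, a call such as noisy_prefix_counts(traces, activities, epsilon=0.01, k=choose_k(traces), p=20) expands only those prefixes that survive pruning, which illustrates why a larger pruning parameter shortens the runtime at the cost of discarding rare behavior.</p>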
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Anonymization of Contextual Information</title>
        <p>In many application scenarios, an analyst might not only study control-flow information, but
also incorporate contextual information, such as timestamps and resources. If that is the
case, a solution that solely anonymizes the control-flow is not sufficient. Our tool handles
these scenarios by the application of PRIPEL [16], an algorithm that enriches a control-flow
anonymized event log with contextual information, while still achieving differential privacy.
In our tool, PRIPEL can be combined with both aforementioned control-flow anonymization
techniques. For this reason, the implementation of PRIPEL requires the original event log and
the corresponding result of the control-flow anonymization as input.</p>
        <preformat>import pm4py
from pm4py.algo.anonymization.trace_variant_query import algorithm as trace_variant_query
from pm4py.algo.anonymization.pripel import algorithm as pripel

# Load the original event log from an XES file.
log = pm4py.read_xes("logName.xes")
epsilon = 0.01

# Step 1: anonymize the control flow with a SaCoFa-based trace variant query.
sacofa_result = trace_variant_query.apply(
    log=log,
    variant=trace_variant_query.Variants.SACOFA,
    parameters={"epsilon": epsilon, "k": 15, "p": 20},
)

# Step 2: enrich the anonymized control flow with contextual information via PRIPEL.
anonymized_log = pripel.apply(log=log, trace_variant_query=sacofa_result, epsilon=epsilon)</preformat>
        <p>Algorithm 1: An example of how to anonymize a given log with a SaCoFa-based trace variant
query and PRIPEL.</p>
        <p>The approach is fine-tuned by setting the following parameters:
• ε: The strength of the differential privacy guarantee. The ε value for PRIPEL and the ε
value for the adopted control-flow anonymization should be the same.
• Blocklist: Some event logs contain attributes that are equivalent to a case ID. For privacy
reasons, such attributes must be deleted from the anonymized log. We handle such
attributes with this list. As an example, in a hospital, the case ID could be based on a
patient visit. However, the patient ID could equivalently serve as a case ID and
should therefore be omitted.</p>
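        <p>The effect of the blocklist can be illustrated with a small sketch in plain Python. It assumes, purely for illustration, that a log is a list of traces and that each trace is a list of event dictionaries; the function name drop_blocklisted_attributes and the attribute name patient_id are hypothetical and do not reflect how PRIPEL is implemented in our tool.</p>
        <preformat>
```python
def drop_blocklisted_attributes(log, blocklist):
    """Return a copy of the log in which blocklisted event attributes are removed."""
    blocked = set(blocklist)
    return [
        [
            {key: value for key, value in event.items() if key not in blocked}
            for event in trace
        ]
        for trace in log
    ]


# Hypothetical hospital log: one trace (a patient visit) with two events.
example_log = [
    [
        {"concept:name": "Admission", "patient_id": "P-17"},
        {"concept:name": "Discharge", "patient_id": "P-17"},
    ]
]

# The patient ID acts like a second case ID, so it is put on the blocklist.
cleaned_log = drop_blocklisted_attributes(example_log, ["patient_id"])
```
        </preformat>
        <p>The original log is left untouched in this sketch, so the same input can still be passed to the control-flow anonymization.</p>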
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Availability and Usage of the Tool</title>
      <p>Our tool is publicly available on GitHub<sup>1</sup>. Algorithm 1 illustrates how it is applied to
anonymize an event log. First, the listing shows how a trace variant query is applied to
anonymize the control flow of the event log. Our tool adopts a factory design pattern, which
enables later extensions with novel types of trace variant queries. Afterwards, PRIPEL is
executed to anonymize the log’s contextual information. We showcase the usage in more detail
in a screencast<sup>2</sup>.</p>
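        <p>The factory design pattern mentioned above can be sketched as follows. This is a simplified illustration under our own assumptions, not the code of the PM4Py integration: the placeholder functions laplace_query and sacofa_query and the registry dictionary are hypothetical, and only the idea of selecting an implementation through a Variants enumeration mirrors Algorithm 1.</p>
        <preformat>
```python
from enum import Enum


def laplace_query(log, parameters):
    """Placeholder for a Laplace-based trace variant query (hypothetical)."""
    return {"variant": "laplace", "parameters": dict(parameters)}


def sacofa_query(log, parameters):
    """Placeholder for a SaCoFa-based trace variant query (hypothetical)."""
    return {"variant": "sacofa", "parameters": dict(parameters)}


class Variants(Enum):
    LAPLACE = "laplace"
    SACOFA = "sacofa"


# Registry that maps each variant to its implementation.
_IMPLEMENTATIONS = {
    Variants.LAPLACE: laplace_query,
    Variants.SACOFA: sacofa_query,
}


def apply(log, variant=Variants.SACOFA, parameters=None):
    """Dispatch the query to the implementation registered for `variant`."""
    implementation = _IMPLEMENTATIONS[variant]
    return implementation(log, parameters or {})


result = apply([], variant=Variants.LAPLACE, parameters={"epsilon": 0.01})
```
        </preformat>
        <p>With such a design, adding a novel trace variant query only requires a new enumeration member and a registry entry, while calling code remains unchanged.</p>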
      <p>Turning to the maturity of the tool, we note that it is based on algorithms that have been
published in peer-reviewed venues. Moreover, we are currently in the process of publishing our
tool as part of the official release of PM4Py.</p>
      <p><sup>1</sup> https://github.com/samadeusfp/pm4py-core-anonymization/tree/Demo-Track</p>
      <p><sup>2</sup> https://youtu.be/BRLMG_Bvdbs</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusion</title>
      <p>In this paper, we presented an enhancement for a leading process mining toolkit, PM4Py, which
enables the anonymization of event logs. As such, we make the state of the art in
privacy-preserving process mining more accessible for researchers and practitioners. In future work,
we want to keep expanding the list of algorithms covered by our tool.</p>
      <p>[13] J. M. Abowd, The US Census Bureau adopts differential privacy, in: KDD, 2018, pp. 2867–2867.</p>
      <p>[14] F. Mannhardt, A. Koschmider, N. Baracaldo, M. Weidlich, J. Michael, Privacy-Preserving
Process Mining, Business &amp; Information Systems Engineering 61 (2019) 595–614.</p>
      <p>[15] S. A. Fahrenkrog-Petersen, M. Kabierski, F. Rosel, H. van der Aa, M. Weidlich, SaCoFa:
Semantics-aware Control-flow Anonymization for Process Mining, in: ICPM 2021, 2021,
pp. 72–79. doi:10.48550/ARXIV.2109.08501.</p>
      <p>[16] S. A. Fahrenkrog-Petersen, H. van der Aa, M. Weidlich, PRIPEL: Privacy-Preserving
Event Log Publishing Including Contextual Information, BPM 2020 (2020). doi:10.1007/978-3-030-58666-9_7.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G.</given-names>
            <surname>Elkoumy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Fahrenkrog-Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Sani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koschmider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mannhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Nuñez von Voigt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>von Waldthausen</surname>
          </string-name>
          ,
          <article-title>Privacy and confidentiality in process mining: threats and research challenges</article-title>
          ,
          <source>ACM TMIS</source>
          <volume>13</volume>
          (
          <year>2021</year>
          )
          <fpage>1</fpage>
          -
          <lpage>17</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>S.</given-names>
            <surname>Nuñez von Voigt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Fahrenkrog-Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Janssen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Koschmider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Tschorsch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mannhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Landsiedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Weidlich</surname>
          </string-name>
          ,
          <article-title>Quantifying the re-identification risk of event logs for process mining</article-title>
          ,
          <source>in: International Conference on Advanced Information Systems Engineering</source>
          , Springer,
          <year>2020</year>
          , pp.
          <fpage>252</fpage>
          -
          <lpage>267</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>K.</given-names>
            <surname>Maatouk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Mannhardt</surname>
          </string-name>
          ,
          <article-title>Quantifying the re-identification risk in published process models</article-title>
          ,
          <source>in: ICPM Workshops</source>
          , Springer,
          <year>2021</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>394</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Elkoumy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pankova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumas</surname>
          </string-name>
          ,
          <article-title>Mine me but don't single me out: Differentially private event logs for process mining</article-title>
          , in: C. D. Ciccio, C. D. Francescomarino, P. Soffer (Eds.),
          <source>3rd International Conference on Process Mining, ICPM 2021, Eindhoven, The Netherlands, October 31 - Nov. 4, 2021</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>80</fpage>
          -
          <lpage>87</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1109/ICPM53251.2021.9576852</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Group-based privacy preservation techniques for process mining</article-title>
          ,
          <source>Data Knowl. Eng.</source>
          <volume>134</volume>
          (
          <year>2021</year>
          )
          <elocation-id>101908</elocation-id>
          . doi:
          <pub-id pub-id-type="doi">10.1016/j.datak.2021.101908</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>E.</given-names>
            <surname>Batista</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Solanas</surname>
          </string-name>
          ,
          <article-title>A uniformization-based approach to preserve individuals' privacy during process mining analyses</article-title>
          ,
          <source>Peer-to-Peer Netw. Appl.</source>
          <volume>14</volume>
          (
          <year>2021</year>
          )
          <fpage>1500</fpage>
          -
          <lpage>1519</lpage>
          . doi:
          <pub-id pub-id-type="doi">10.1007/s12083-020-01059-1</pub-id>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Alessandro</given-names>
            <surname>Berti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Sebastiaan J.</given-names>
            <surname>van Zelst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Wil M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science</article-title>
          ,
          <source>CoRR</source>
          abs/1905.06169 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>G.</given-names>
            <surname>Janssenswillen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Depaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Swennen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Jans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Vanhoof</surname>
          </string-name>
          ,
          <article-title>bupaR: Enabling reproducible business process analysis</article-title>
          ,
          <source>Knowledge-Based Systems</source>
          <volume>163</volume>
          (
          <year>2019</year>
          )
          <fpage>927</fpage>
          -
          <lpage>930</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
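          <string-name>
            <given-names>Martin</given-names>
            <surname>Bauer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stephan A.</given-names>
            <surname>Fahrenkrog-Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Agnes</given-names>
            <surname>Koschmider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Felix</given-names>
            <surname>Mannhardt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Han</given-names>
            <surname>van der Aa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Weidlich</surname>
          </string-name>
          ,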
          <article-title>ELPaaS: Event Log Privacy as a Service</article-title>
          ,
          <source>in: BPM Demos</source>
          <year>2019</year>
          , volume
          <volume>2420</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>159</fpage>
          -
          <lpage>163</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rafiei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Schnitzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <article-title>PC4PM: A Tool for Privacy/Confidentiality Preservation in Process Mining</article-title>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
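          <string-name>
            <given-names>Gamal</given-names>
            <surname>Elkoumy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Stephan A.</given-names>
            <surname>Fahrenkrog-Petersen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Marlon</given-names>
            <surname>Dumas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Peeter</given-names>
            <surname>Laud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Alisa</given-names>
            <surname>Pankova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Matthias</given-names>
            <surname>Weidlich</surname>
          </string-name>
          ,
          <article-title>Shareprom: A Tool for Privacy-Preserving Inter-Organizational Process Mining</article-title>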
          ,
          <source>in: BPM Demo</source>
          <year>2020</year>
          , volume
          <volume>2673</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>72</fpage>
          -
          <lpage>76</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>S.</given-names>
            <surname>Kessler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hoff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-C.</given-names>
            <surname>Freytag</surname>
          </string-name>
          ,
          <article-title>SAP HANA goes private: from privacy research to privacy aware enterprise analytics</article-title>
          ,
          <source>Proceedings of the VLDB Endowment</source>
          <volume>12</volume>
          (
          <year>2019</year>
          )
          <fpage>1998</fpage>
          -
          <lpage>2009</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>