<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alessandro Berti</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sebastiaan J. van Zelst</string-name>
        </contrib>
      </contrib-group>
      <abstract>
        <p>Process mining, i.e., a sub-field of data science focusing on the analysis of event data generated during the execution of (business) processes, has seen a tremendous change over the past two decades. Starting off in the early 2000s with limited to no tool support, the field nowadays offers several software tools, both open-source, e.g., ProM and Apromore, and commercial, e.g., Disco, Celonis, and ProcessGold. The commercial process mining tools provide limited support for implementing custom algorithms. Moreover, both commercial and open-source process mining tools are often only accessible through a graphical user interface, which hampers their usage in large-scale experimental settings. Initiatives such as RapidProM provide process mining support in the scientific workflow-based data science suite RapidMiner. However, these offer limited to no support for algorithmic customization. In light of this, we present a novel process mining library, i.e., Process Mining for Python (PM4Py), that aims to bridge this gap, providing integration with state-of-the-art data science libraries, e.g., pandas, numpy, scipy and scikit-learn. We provide a global overview of the architecture and functionality of PM4Py, accompanied by some representative examples of its usage.</p>
      </abstract>
      <kwd-group>
        <kwd>Process Mining</kwd>
        <kwd>Data Science</kwd>
        <kwd>Python</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>I. INTRODUCTION</title>
      <p>
        The field of process mining [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] provides tools and
techniques to increase the overall knowledge of a (business)
process by analyzing the event data stored during
the execution of the process. Process mining has received a lot of
attention from both academia and industry, which has led to the
development of several commercial and open-source process
mining tools. The majority of these tools support process
discovery, i.e., discovering a process model that accurately
describes the process under study, as captured within the
analyzed event data. However, process mining also comprises
conformance checking, i.e., checking to what degree a given
process model accurately describes event data, and process
enhancement, i.e., techniques that enhance process models
by projecting interesting information, e.g., case flow and/or
performance measures, on top of a model. Support for
these types of process mining analysis is typically limited to
open-source, academic process mining tools such as the ProM
Framework [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and Apromore [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        Both ProM and Apromore put a significant emphasis on
non-expert usability by providing an easy-to-use
graphical user interface. Whereas such an interface helps to
engage non-expert users and helps to showcase
process mining to a larger audience, it hampers the usability
of the tools for large-scale scientific
experimentation [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. To this end, the RapidProM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] initiative
allows for repeated execution of large-scale experiments with
process mining algorithms in the RapidMiner<sup>1</sup> suite. However,
RapidProM provides neither easy algorithmic customization
nor an easy way to integrate custom-developed algorithms. As
such, the aforementioned tools fail to support both customizable
process mining algorithms and large-scale experimentation and
analysis.
      </p>
      <p>To bridge the aforementioned gap, i.e., the lack of process
mining software that i) is easily extendable, ii) allows for
algorithmic customization and iii) allows us to easily conduct
large-scale experiments, we propose the Process Mining for
Python (PM4Py) framework. A fresh look at the currently
available programming languages and libraries indicates that
the Python programming language<sup>2</sup>, along with its ecosystem,
is the most suitable for achieving these goals. In
particular, the data science world relies heavily on Python, both
for classic data science (pandas, numpy, scipy, ...) and for
cutting-edge machine learning research (tensorflow, keras, ...).</p>
      <p>
        Other process mining libraries already exist for the Python
language, albeit with fewer features (PMLAB [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], OpyenXES [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]).
      </p>
      <p>
        The bupaR library [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] supports process mining in the statistical
language R, which is widely used in data science. The main focal
points of the novel PM4Py library are:
      </p>
      <p>
        Lowering the barrier for algorithmic development and
customization when performing a process mining analysis
compared to existing academic tools such as ProM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ],
RapidProM [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and Apromore [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Allowing for easy integration of process mining algorithms
with algorithms from other data science fields,
implemented in various state-of-the-art Python packages.</p>
      <p>Creating a collaborative eco-system that easily allows
researchers and practitioners to share valuable code and
results with the process mining community.</p>
      <p>Providing accurate user support by means of a rich body
of documentation on the process mining techniques made
available in the library.</p>
      <p>Ensuring algorithmic stability by means of rigorous testing.</p>
      <p><sup>1</sup> http://rapidminer.com</p>
      <p><sup>2</sup> http://python.org</p>
      <fig id="fig-1">
        <label>Figure 1</label>
        <caption>
          <p>Factory method of the Alpha Miner.</p>
        </caption>
        <preformat>
from pm4py.algo.discovery.alpha import versions
from pm4py.objects.conversion.log import factory as log_conversion

# Names of the available variants of the algorithm
ALPHA_VERSION_CLASSIC = 'classic'
ALPHA_VERSION_PLUS = 'plus'

# Map each variant name to the function implementing it
VERSIONS = {ALPHA_VERSION_CLASSIC: versions.classic.apply,
            ALPHA_VERSION_PLUS: versions.plus.apply}


def apply(log, parameters=None, variant=ALPHA_VERSION_CLASSIC):
    # Convert the input to an event log, then dispatch to the chosen variant
    return VERSIONS[variant](
        log_conversion.apply(log, parameters, log_conversion.TO_EVENT_LOG),
        parameters)
        </preformat>
      </fig>
      <p>The remainder of this paper is structured as follows. In
Section II, we present the architecture and an overview of
the features provided by PM4Py. In Section III, we present
some representative examples (process discovery, conformance
checking). Section IV discusses the maturity of the tool and
Section V concludes this paper.</p>
    </sec>
    <sec id="sec-2">
      <title>II. ARCHITECTURE AND FEATURES</title>
      <p>To maximize the possibility to understand and re-use
the code, and to be able to execute large-scale experiments,
the following architectural guidelines have been adopted in
the development of PM4Py:</p>
      <p>
        A strict separation between objects (event logs, Petri
nets, DFGs, ...), algorithms (Alpha Miner [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Inductive
Miner [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], alignments [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], ...) and visualizations in
different packages. The pm4py.objects package provides classes
to import/export and to store the information related to
the objects, along with some utilities to
convert objects, e.g., process trees into Petri nets. The
pm4py.algo package provides algorithms for discovery,
conformance checking, enhancement and evaluation.
All visualizations of objects are provided in the
pm4py.visualization package.
      </p>
      <p>Most functionality in PM4Py has been realized through
factory methods. These factory methods provide a single
access point for each algorithm, with a standardized set
of input objects, e.g., event data and a parameters object.
Consider the factory method of the Alpha Miner, depicted
in Fig. 1, which makes both the classic Alpha Miner
(variant='classic') and the Alpha+ Miner (variant='plus') available.
Factory methods allow for the extension of existing
algorithms whilst ensuring backward compatibility. They
typically accept the name of the variant
of the algorithm to use, and some parameters (shared
among variants, or variant-specific).</p>
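      <p>The factory pattern described above can be illustrated with a minimal, self-contained sketch. It does not use the PM4Py API: the functions discover_classic and discover_plus are hypothetical toy stand-ins for variant implementations, and a log is assumed to be a plain list of traces, each trace being a list of activity names. Only the dispatch mechanism (a dictionary from variant name to function, behind a single apply entry point) mirrors the design discussed in the text.</p>
      <preformat>
```python
# Toy illustration of a factory method; NOT PM4Py code.
# discover_classic / discover_plus are hypothetical stand-ins.

CLASSIC = "classic"
PLUS = "plus"


def discover_classic(log, parameters=None):
    """Toy 'classic' variant: report the activities seen in the log."""
    activities = sorted({activity for trace in log for activity in trace})
    return {"variant": CLASSIC, "activities": activities}


def discover_plus(log, parameters=None):
    """Toy 'plus' variant: additionally report activities that directly
    follow themselves (length-one loops)."""
    result = discover_classic(log, parameters)
    result["variant"] = PLUS
    result["self_loops"] = sorted(
        {a for trace in log for a, b in zip(trace, trace[1:]) if a == b}
    )
    return result


# Single registry: adding a variant means adding one entry here,
# so existing callers of apply() keep working unchanged.
VERSIONS = {CLASSIC: discover_classic, PLUS: discover_plus}


def apply(log, parameters=None, variant=CLASSIC):
    """Single access point that dispatches to the requested variant."""
    if variant not in VERSIONS:
        raise ValueError(f"unknown variant: {variant!r}")
    return VERSIONS[variant](log, parameters)


if __name__ == "__main__":
    example_log = [["a", "b", "b", "c"], ["a", "c"]]
    print(apply(example_log))
    print(apply(example_log, variant=PLUS))
```
      </preformat>
      <p>Because callers only name a variant, a new variant can be registered without changing any existing call site, which is the backward-compatibility property mentioned above.</p>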
      <p>In the remainder of this section, we present the main
features of the library, organized in objects, algorithms, and
visualizations.</p>
      <sec id="sec-2-1">
        <title>A. Object Management</title>
        <p>Within process mining, the main source of data is event
data, often referred to as an event log. Such an event log
represents a collection of events, describing what activities
have been performed for different instances of the process
under study. PM4Py provides support for different types of
event data structures:</p>
        <p>Event logs, i.e., lists of traces. Each trace,
in turn, is a list of events. The events are structured as
key-value maps.</p>
        <p>Event streams, i.e., single lists of events (again
represented as key-value maps) that are not (yet) organized
in cases.</p>
        <p>Conversion utilities are provided to convert event data objects
from one format to the other. Furthermore, PM4Py supports
the use of pandas data frames, which are efficient when
handling larger event data. Other objects currently
supported by PM4Py include heuristic nets, accepting Petri nets,
process trees and transition systems.</p>
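        <p>The two event data structures and the conversion between them can be sketched with plain Python lists and dictionaries. The sketch is illustrative only and does not use the PM4Py API; the attribute names case_id, activity and timestamp, as well as the helper functions stream_to_log and log_to_stream, are hypothetical choices made for this example.</p>
        <preformat>
```python
# Illustrative sketch; NOT the PM4Py conversion utilities.
# Attribute names ("case_id", "activity", "timestamp") are assumptions.


def stream_to_log(stream, case_key="case_id"):
    """Group a flat list of events into traces (one list per case).

    Traces appear in order of the first event of each case, and events
    keep their relative order from the stream.
    """
    traces = {}
    for event in stream:
        traces.setdefault(event[case_key], []).append(event)
    return list(traces.values())


def log_to_stream(log, time_key="timestamp"):
    """Flatten a list of traces into one list of events ordered by time."""
    events = [event for trace in log for event in trace]
    return sorted(events, key=lambda event: event[time_key])


if __name__ == "__main__":
    stream = [
        {"case_id": "c1", "activity": "register request", "timestamp": 1},
        {"case_id": "c2", "activity": "register request", "timestamp": 2},
        {"case_id": "c1", "activity": "decide", "timestamp": 3},
    ]
    log = stream_to_log(stream)
    for trace in log:
        print([event["activity"] for event in trace])
```
        </preformat>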
      </sec>
      <sec id="sec-2-2">
        <title>B. Algorithms</title>
        <p>The PM4Py library provides several mainstream process
mining techniques, including:</p>
        <p>
          Process discovery: Alpha(+) Miner [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and Inductive
Miner (IMDF [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]).
        </p>
        <p>
          Conformance Checking: Token-based replay and
alignments [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ].
        </p>
        <p>Measurement of fitness, precision, generalization and
simplicity of process models.</p>
        <p>Filtering based on time-frame, case performance, trace
endpoints, trace variants, attributes, and paths.</p>
        <p>Case management: statistics on variants and cases.</p>
        <p>Graphs: case duration, events per time, distribution of a
numeric attribute's values.</p>
        <p>
          Social Network Analysis [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]: handover of work, working
together, subcontracting and similar activities networks.
        </p>
        <p>NetworkX: static representation of social networks.</p>
        <p>Pyvis: web-based, dynamic representation of social
networks (see Fig. 4).</p>
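        <p>As an illustration of the social network analysis listed above, the handover-of-work idea can be sketched as counting how often one resource is directly followed by another within the same case. This is a simplified sketch under stated assumptions, not the PM4Py implementation: the attribute name resource and the function handover_of_work are hypothetical, and every direct succession is counted, including a resource handing over to itself.</p>
        <preformat>
```python
# Simplified handover-of-work count; NOT the PM4Py implementation.
# The "resource" attribute name is an assumption of this sketch.
from collections import Counter


def handover_of_work(log, resource_key="resource"):
    """Count direct successions between resources within each trace.

    `log` is a list of traces; each trace is a list of event dicts.
    Returns a Counter keyed by (from_resource, to_resource).
    """
    counts = Counter()
    for trace in log:
        for current, following in zip(trace, trace[1:]):
            counts[(current[resource_key], following[resource_key])] += 1
    return counts


if __name__ == "__main__":
    example_log = [
        [{"resource": "Ann"}, {"resource": "Bob"}, {"resource": "Bob"}],
        [{"resource": "Ann"}, {"resource": "Bob"}],
    ]
    for pair, count in handover_of_work(example_log).items():
        print(pair, count)
```
        </preformat>
        <p>The resulting weighted pairs can be read as the edges of a directed graph, which is the kind of structure that graph libraries such as NetworkX or Pyvis can then display.</p>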
      </sec>
    </sec>
    <sec id="sec-3">
      <title>III. EXAMPLES</title>
      <p>In this section, we provide some examples of the use of
PM4Py.</p>
      <p>The first example discovers a process model
using Alpha Miner and visualizes the process model. The
factory methods that are needed (XES importer, Alpha Miner
and Petri net visualization) are loaded (lines 1-3). Then, an XES
log is imported (line 4), the Alpha Miner is applied to
the log object (line 7), and the visualization is obtained: a
factory method is applied to lay out the graph (line 8), and the
result is shown in a window (line 9). The result is shown in [...]</p>
      <fig id="fig-5">
        <label>Figure 5</label>
        <caption>
          <p>PM4Py in action: process discovery with the Alpha Miner.</p>
        </caption>
      </fig>
      <fig id="fig-3">
        <label>Figure 3</label>
        <caption>
          <p>[...] of a trace on an example log and model is reported.</p>
        </caption>
        <preformat>
1 from pm4py.algo.conformance.alignments import factory as alignments
2 # alignments accepts a log and an accepting Petri net, i.e.,
3 # a Petri net along with an initial (im) and a final (fm) marking
4 aligned_traces = alignments.apply(log, net, im, fm)
5 for index, result in enumerate(aligned_traces):
6     print(index, result['alignment'])

[('register request', 'register request'), ('&gt;&gt;', None), ('check ticket', 'check ticket'),
('examine thoroughly', 'examine thoroughly'), ('&gt;&gt;', None), ('decide', 'decide'), ('&gt;&gt;', None),
('reject request', 'reject request')]
        </preformat>
      </fig>
      <p>The second example computes alignments and displays
the result. First, the alignments factory method is loaded (line
1). Then, the alignments between a log object and a process
model are obtained (line 4). For each aligned trace (line 5),
the alignment result is displayed on the screen (line 6). The
alignment of a trace is reported in the lower part of Fig. 3.</p>
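      <p>To help read an alignment such as the one reported above, the following sketch classifies each pair of the printed list. It is an illustration under stated assumptions rather than part of PM4Py: it assumes that the first component of each pair refers to the log, the second to the model, and that the symbol '>>' marks the side that does not move. The function classify_moves and the category names are hypothetical.</p>
      <preformat>
```python
# Illustrative reading aid for an alignment given as a list of pairs.
# Assumptions (not confirmed by the text): pair = (log side, model side),
# and ">>" marks the side that does not advance.

SKIP = ">>"


def classify_moves(alignment):
    """Count synchronous moves, log-only moves and model-only moves."""
    summary = {"sync": 0, "log_only": 0, "model_only": 0}
    for log_label, model_label in alignment:
        if log_label == SKIP:
            # Only the model advances (e.g., a step absent from the trace).
            summary["model_only"] += 1
        elif model_label == SKIP:
            # Only the log advances (an event the model cannot replay here).
            summary["log_only"] += 1
        else:
            summary["sync"] += 1
    return summary


if __name__ == "__main__":
    alignment = [
        ("register request", "register request"), (SKIP, None),
        ("check ticket", "check ticket"),
        ("examine thoroughly", "examine thoroughly"), (SKIP, None),
        ("decide", "decide"), (SKIP, None),
        ("reject request", "reject request"),
    ]
    print(classify_moves(alignment))
```
      </preformat>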
    </sec>
    <sec id="sec-4">
      <title>IV. MATURITY OF THE TOOL</title>
      <p>PM4Py has been used by 200 students in the “Introduction to Data Science” course
held by the Process and Data Science group at RWTH
Aachen University. Two academic projects have already been
supported by PM4Py and are publicly available:</p>
      <p>Usage of probabilistic automata for compliance checking [...]</p>
      <p>Prefix [...] for [...] event [...]</p>
      <p>[...] features. There are some integrations of the PM4Py library in [...]</p>
      <p>The bupaR R process mining library uses PM4Py to handle
alignments and to obtain models using the Inductive Miner.</p>
      <p>A data analytics web interface was written in Vue.JS [...]</p>
      <p>In Fig. 6, some statistics taken from Google Analytics are
reported about the number of accesses to the PM4Py web site during
the month of February 2019. In Fig. 7, some statistics about
the downloads of the PM4Py library from PIP are reported.</p>
      <p>[...] with maximum score, has been awarded to the PM4Py library.</p>
    </sec>
    <sec id="sec-5">
      <title>V. CONCLUSION</title>
      <p>In this paper, the PM4Py process mining library
(http://www.pm4py.org) has been introduced. PM4Py supports
a rapidly growing set of process mining techniques (discovery,
conformance checking, enhancement, ...). A video presenting
the library and some example applications (log management,
process discovery, conformance checking) has been made
available. The library can be installed through the
command pip install pm4py; additional prerequisites, available at the page
http://pm4py.pads.rwth-aachen.de/installation/, have to be installed.
Extensive documentation is provided
through the official website of the library. Moreover, the
GitHub repository supports a collaborative eco-system where
users can report problems or contribute to the code.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <surname>W. van der Aalst</surname>
          </string-name>
          ,
          <source>Process Mining - Data Science in Action, Second Edition</source>
          . Springer,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. K. A.</given-names>
            <surname>de Medeiros</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Weijters</surname>
          </string-name>
          , and W. van der Aalst, “
          <article-title>The ProM framework: A new era in process mining tool support</article-title>
          ,” in
          <source>International Conference on Application and Theory of Petri Nets</source>
          . Springer,
          <year>2005</year>
          , pp.
          <fpage>444</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>M.</given-names>
            <surname>La Rosa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. A.</given-names>
            <surname>Reijers</surname>
          </string-name>
          ,
          <string-name>
            <surname>W. van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Dijkman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mendling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dumas</surname>
          </string-name>
          , and
          <string-name>
            <given-names>L.</given-names>
            <surname>García-Bañuelos</surname>
          </string-name>
          , “
          <article-title>Apromore: An advanced process model repository</article-title>
          ,”
          <source>Expert Systems with Applications</source>
          , vol.
          <volume>38</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>7029</fpage>
          -
          <lpage>7040</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolt</surname>
          </string-name>
          , M. de Leoni, and W. M. van der Aalst, “
          <article-title>Scientific workflows for process mining: building blocks, scenarios, and implementation</article-title>
          ,”
          <source>International Journal on Software Tools for Technology Transfer</source>
          , vol.
          <volume>18</volume>
          , no.
          <issue>6</issue>
          , pp.
          <fpage>607</fpage>
          -
          <lpage>628</lpage>
          ,
          <year>2016</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Mans</surname>
          </string-name>
          , W. van der Aalst, and
          <string-name>
            <given-names>H. E.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          , “
          <article-title>Supporting process mining workflows with RapidProM</article-title>
          ,” in
          <source>BPM (Demos)</source>
          ,
          <year>2014</year>
          , p.
          <fpage>56</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <surname>W. van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolt</surname>
          </string-name>
          , and
          <string-name>
            <surname>S. J. van Zelst</surname>
          </string-name>
          , “
          <article-title>RapidProM: Mine your processes and not just your data</article-title>
          ,”
          <source>CoRR</source>
          , vol. abs/1703.03740
          ,
          <year>2017</year>
          . [Online]. Available: http://arxiv.org/abs/1703.03740
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Carmona Vargas</surname>
          </string-name>
          and
          <string-name>
            <given-names>M.</given-names>
            <surname>Solé</surname>
          </string-name>
          , “
          <article-title>PMLAB: a scripting environment for process mining</article-title>
          ,” in
          <source>Proceedings of the BPM Demo Sessions 2014: Co-located with the 12th International Conference on Business Process Management (BPM 2014), Eindhoven, The Netherlands, September 10, 2014</source>
          . CEUR-WS.org,
          <year>2014</year>
          , pp.
          <fpage>16</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Valdivieso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. L. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Munoz-Gama</surname>
          </string-name>
          , and M. Sepúlveda, “
          <article-title>OpyenXES: A complete Python library for the eXtensible Event Stream standard</article-title>
          .”
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>G.</given-names>
            <surname>Janssenswillen</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Depaire</surname>
          </string-name>
          , “
          <article-title>bupaR: Business process analysis in R</article-title>
          ,”
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          , T. Weijters, and L. Maruster, “
          <article-title>Workflow mining: Discovering process models from event logs</article-title>
          ,”
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          , vol.
          <volume>16</volume>
          , no.
          <issue>9</issue>
          , pp.
          <fpage>1128</fpage>
          -
          <lpage>1142</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Leemans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Fahland</surname>
          </string-name>
          , and W. van der Aalst, “
          <article-title>Scalable process discovery with guarantees</article-title>
          ,” in
          <source>International Conference on Enterprise, Business-Process and Information Systems Modeling</source>
          . Springer,
          <year>2015</year>
          , pp.
          <fpage>85</fpage>
          -
          <lpage>101</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Adriansyah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Sidorova</surname>
          </string-name>
          , and
          <string-name>
            <surname>B. F. van Dongen</surname>
          </string-name>
          , “
          <article-title>Cost-based fitness in conformance checking</article-title>
          ,” in
          <source>2011 Eleventh International Conference on Application of Concurrency to System Design</source>
          . IEEE
          ,
          <year>2011</year>
          , pp.
          <fpage>57</fpage>
          -
          <lpage>66</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          W. van der Aalst and
          <string-name>
            <given-names>M.</given-names>
            <surname>Song</surname>
          </string-name>
          , “
          <article-title>Mining social networks: Uncovering interaction patterns in business processes</article-title>
          ,” in
          <source>International Conference on Business Process Management</source>
          . Springer,
          <year>2004</year>
          , pp.
          <fpage>244</fpage>
          -
          <lpage>260</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <surname>S. J. van Zelst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bolt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hassani</surname>
          </string-name>
          ,
          <string-name>
            <surname>B. F. van Dongen</surname>
          </string-name>
          , and W. van der Aalst, “
          <article-title>Online conformance checking: relating event streams to process models using prefix-alignments</article-title>
          ,”
          <source>International Journal of Data Science and Analytics</source>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>16</lpage>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>