Process Mining for Python (PM4Py): Bridging the Gap Between Process- and Data Science

Alessandro Berti∗, Sebastiaan J. van Zelst∗†, Wil M.P. van der Aalst∗†

∗ RWTH Aachen University, Process and Data Science group, Lehrstuhl für Informatik 9, 52074 Aachen, Germany
{a.berti,s.j.v.zelst,wvdaalst}@pads.rwth-aachen.de

† Fraunhofer Gesellschaft, Institute for Applied Information Technology (FIT), Sankt Augustin, Germany
{sebastiaan.van.zelst,wil.van.der.aalst}@fit.fraunhofer.de

Abstract: Process mining, i.e., a sub-field of data science focusing on the analysis of event data generated during the execution of (business) processes, has seen a tremendous change over the past two decades. Whereas the field started off in the early 2000s with limited to no tool support, several software tools exist nowadays, both open-source, e.g., ProM and Apromore, and commercial, e.g., Disco, Celonis, and ProcessGold. The commercial process mining tools provide limited support for implementing custom algorithms. Moreover, both commercial and open-source process mining tools are often only accessible through a graphical user interface, which hampers their usage in large-scale experimental settings. Initiatives such as RapidProM provide process mining support in the scientific workflow-based data science suite RapidMiner. However, these offer limited to no support for algorithmic customization. In light of the aforementioned, we present in this paper a novel process mining library, i.e., Process Mining for Python (PM4Py), that aims to bridge this gap, providing integration with state-of-the-art data science libraries, e.g., pandas, numpy, scipy and scikit-learn. We provide a global overview of the architecture and functionality of PM4Py, accompanied by some representative examples of its usage.

Index Terms: Process Mining; Data Science; Python.

I. INTRODUCTION

The field of process mining [1] provides tools and techniques to increase the overall knowledge of a (business) process, by means of analyzing the event data stored during the execution of the process. Process mining has received a lot of attention from both academia and industry, which led to the development of several commercial and open-source process mining tools. The majority of these tools support process discovery, i.e., discovering a process model that accurately describes the process under study, as captured within the analyzed event data. However, process mining also comprises conformance checking, i.e., checking to what degree a given process model accurately describes event data, and process enhancement, i.e., techniques that enhance process models by projecting interesting information, e.g., case flow and/or performance measures, on top of a model. The support of such types of process mining analysis is typically limited to open-source, academic process mining tools such as the ProM Framework [2] and Apromore [3].

Both ProM and Apromore put a significant emphasis on non-expert usability, i.e., by means of providing an easy-to-use graphical user interface. Whereas such an interface helps to engage non-expert users and, furthermore, helps to showcase process mining to a larger audience, it hampers the usability of the tools for the purpose of large-scale scientific experimentation [4]. To this end, the RapidProM [5], [6] initiative allows for repeated execution of large-scale experiments with process mining algorithms in the RapidMiner¹ suite. However, RapidProM provides neither easy algorithmic customization nor an easy way to integrate custom-developed algorithms. As such, the aforementioned tools fail to support customizable process mining algorithms and large-scale experimentation and analysis.

To bridge the aforementioned gap, i.e., the lack of process mining software that i) is easily extendable, ii) allows for algorithmic customization and iii) allows us to easily conduct large-scale experiments, we propose the Process Mining for Python (PM4Py) framework. To achieve the aforementioned goals, a fresh look at the currently available programming languages and libraries indicates that the Python programming language², along with its ecosystem, is most suitable. In particular, the data science world, both for classic data science (pandas, numpy, scipy, ...) and for cutting-edge machine learning research (tensorflow, keras, ...), is heavily using Python. Other libraries, albeit with a lower number of features, already exist for the Python language (PMLAB [7], OpyenXES [8]). The bupaR library [9] supports process mining in the statistical language R, which is widely used in data science. The main focal points of the novel PM4Py library are:

• Lowering the barrier for algorithmic development and customization when performing a process mining analysis, compared to existing academic tools such as ProM [2], RapidProM [5] and Apromore [3].
• Allowing for easy integration of process mining algorithms with algorithms from other data science fields, implemented in various state-of-the-art Python packages.
• Creating a collaborative eco-system that easily allows researchers and practitioners to share valuable code and results with the process mining community.
• Providing accurate user support by means of a rich body of documentation on the process mining techniques made available in the library.
• Ensuring algorithmic stability by means of rigorous testing.

The remainder of this paper is structured as follows. In Section II, we present the architecture and an overview of the features provided by PM4Py. In Section III, we present some representative examples (process discovery, conformance checking). Section IV discusses the maturity of the tool and Section V concludes this paper.

¹ http://rapidminer.com
² http://python.org

II. ARCHITECTURE AND FEATURES

In order to maximize the possibility to understand and re-use the code, and to be able to execute large-scale experiments, the following architectural guidelines have been adopted in the development of PM4Py:

• A strict separation between objects (event logs, Petri nets, DFGs, ...), algorithms (Alpha Miner [10], Inductive Miner [11], alignments [12], ...) and visualizations in different packages. In the pm4py.objects package, classes to import/export and to store the information related to the objects are provided, along with some utilities to convert objects, e.g., process trees into Petri nets; while in the pm4py.algo package, algorithms to perform discovery, conformance checking, enhancement and evaluation are provided. All visualizations of objects are provided in the pm4py.visualization package.
• Most functionality in PM4Py has been realized through factory methods. These factory methods provide a single access point for each algorithm, with a standardized set of input objects, e.g., event data and a parameters object. Consider the factory method of the Alpha Miner, depicted in Fig. 1. The Alpha (variant='classic') and the Alpha+ (variant='plus') are made available. Factory methods allow for the extension of existing algorithms whilst ensuring backward compatibility. The factory methods typically accept the name of the variant of the algorithm to use, and some parameters (shared among variants, or variant-specific).

    from pm4py.algo.discovery.alpha import versions
    from pm4py.objects.conversion.log import factory as log_conversion

    ALPHA_VERSION_CLASSIC = 'classic'
    ALPHA_VERSION_PLUS = 'plus'
    VERSIONS = {ALPHA_VERSION_CLASSIC: versions.classic.apply,
                ALPHA_VERSION_PLUS: versions.plus.apply}

    def apply(log, parameters=None, variant=ALPHA_VERSION_CLASSIC):
        return VERSIONS[variant](
            log_conversion.apply(log, parameters, log_conversion.TO_EVENT_LOG),
            parameters)

Figure 1: Example factory method (Alpha Miner). Different variants (the Alpha and the Alpha+) are made available.

In the remainder of this section, we present the main features of the library, organized in objects, algorithms, and visualizations.

A. Object Management

Within process mining, the main source of data is event data, often referred to as an event log. Such an event log represents a collection of events, describing what activities have been performed for different instances of the process under study. PM4Py provides support for different types of event data structures:

• Event logs, i.e., representing a list of traces. Each trace, in turn, is a list of events. The events are structured as key-value maps.
• Event streams, representing one list of events (again represented as key-value maps) that are not (yet) organized in cases.

Conversion utilities are provided to convert event data objects from one format to the other. Furthermore, PM4Py supports the use of pandas data frames, which are efficient when handling larger event data. Other objects currently supported by PM4Py include: heuristic nets, accepting Petri nets, process trees and transition systems.

B. Algorithms

The PM4Py library provides several mainstream process mining techniques, including:

• Process discovery: Alpha(+) Miner [10] and Inductive Miner (IMDF [11]).
• Conformance checking: token-based replay and alignments [12].
• Measurement of fitness, precision, generalization and simplicity of process models.
• Filtering based on time-frame, case performance, trace endpoints, trace variants, attributes, and paths.
• Case management: statistics on variants and cases.
• Graphs: case duration, events per time, distribution of a numeric attribute's values.
• Social Network Analysis [13]: handover of work, working together, subcontracting and similar activities networks.

C. Visualizations

The following Python visualization libraries have been used in the project:

• GraphViz: representation of directly-follows graphs, Petri nets, transition systems, process trees.
• NetworkX: static representation of social networks.
• Pyvis: web-based, dynamic representation of social networks (see Fig. 4).

Figure 4: Social Network Analysis (Handover of Work metric) using Pyvis visualization.

III. EXAMPLES

In this section, we provide some examples of the use of PM4Py.

A. Process Discovery

Fig. 2 shows example code to perform process discovery using the Alpha Miner and visualize the process model. The factory methods that are needed (XES importer, Alpha Miner and Petri net visualization) are loaded (lines 1-3). Then, an XES log is imported (line 4), the Alpha Miner is applied to the log object (line 7), and the visualization is obtained: a factory method is applied to lay out the graph (line 8), and the result is shown in a window (line 9). The result is shown in Fig. 5.

    1  from pm4py.objects.log.importer.xes import factory as xes_importer
    2  from pm4py.algo.discovery.alpha import factory as alpha_miner
    3  from pm4py.visualization.petrinet import factory as pn_vis_factory
    4  log = xes_importer.apply("C:\\receipt.xes")
    5  # discovers a Petri net along with an initial (im)
    6  # and a final marking (fm)
    7  net, im, fm = alpha_miner.apply(log)
    8  gviz = pn_vis_factory.apply(net, im, fm)
    9  pn_vis_factory.view(gviz)

Figure 2: PM4Py code to load a log, apply Alpha Miner and visualize a Petri net.

Figure 5: PM4Py in action: process discovery with the Alpha Miner.

B. Conformance Checking

Fig. 3 shows example code to apply alignments and display the result. First, the alignments factory method is loaded (line 1). Then, the alignments between a log object and a process model are obtained (line 4). For each aligned trace (line 5), the alignment result is displayed on the screen (line 6). The alignment of a trace is reported in the lower part of Fig. 3.

    1  from pm4py.algo.conformance.alignments import factory as alignments
    2  # alignments accepts a log and an accepting Petri net, i.e.
    3  # a Petri net along with an initial (im) and a final (fm) marking
    4  aligned_traces = alignments.apply(log, net, im, fm)
    5  for index, result in enumerate(aligned_traces):
    6      print(index, result['alignment'])

    [('register request', 'register request'), ('>>', None),
     ('check ticket', 'check ticket'),
     ('examine thoroughly', 'examine thoroughly'), ('>>', None),
     ('decide', 'decide'), ('>>', None),
     ('reject request', 'reject request')]

Figure 3: PM4Py code to perform alignments between a log and a model, and print the alignments. The output of the alignment of a trace on an example log and model is reported.

IV. MATURITY OF THE TOOL

PM4Py 1.0 was released on 21/12/2018 and was used by 200 students in the "Introduction to Data Science" course held by the Process and Data Science group at RWTH Aachen University. Two academic projects have already been supported by PM4Py and are publicly available:

• Usage of probabilistic automata for compliance checking (https://github.com/lvzheqi/StreamingEventCompliance).
• Prefix alignments for streaming event data [14] (https://gitlab.com/prefal/confo).

PM4Py 1.1 was released on 22/02/2019 with additional features. There are some integrations of the PM4Py library in other projects:

• The bupaR R process mining library uses PM4Py to handle alignments and to obtain models using the Inductive Miner.
• A data analytics web interface was written in Vue.JS (https://git.bogdan.co/b0gdan/beratungsleistungen).

In Fig. 6, some statistics taken from Google Analytics are reported about the number of accesses to the PM4Py web site during the month of February 2019. In Fig. 7, some statistics about the downloads of the PM4Py library from PyPI are reported. Issues are managed through GitHub. The XES certification, with maximum score, has been awarded to the PM4Py library.

Figure 6: Users that accessed the PM4Py website in February 2019.

Figure 7: Daily downloads of PM4Py from PyPI during the month of February 2019.

V. CONCLUSION

In this paper, the PM4Py process mining library (http://www.pm4py.org) has been introduced. PM4Py supports a rapidly growing set of process mining techniques (discovery, conformance checking, enhancement, ...). A video presenting the library and some example applications (log management, process discovery, conformance checking) has been made available³. The library can be installed⁴ through the command pip install pm4py. Extensive documentation is provided through the official website of the library. Moreover, the GitHub repository supports a collaborative eco-system where users can report problems or contribute to the code.

³ http://pm4py.pads.rwth-aachen.de/pm4py-demo-video/
⁴ Additional prerequisites, available at the page http://pm4py.pads.rwth-aachen.de/installation/, have to be installed.

REFERENCES

[1] W. van der Aalst, Process Mining - Data Science in Action, Second Edition. Springer, 2016.
[2] B. F. van Dongen, A. K. A. de Medeiros, H. Verbeek, A. Weijters, and W. van der Aalst, "The ProM framework: A new era in process mining tool support," in International Conference on Application and Theory of Petri Nets. Springer, 2005, pp. 444–454.
[3] M. La Rosa, H. A. Reijers, W. van der Aalst, R. M. Dijkman, J. Mendling, M. Dumas, and L. García-Bañuelos, "Apromore: An advanced process model repository," Expert Systems with Applications, vol. 38, no. 6, pp. 7029–7040, 2011.
[4] A. Bolt, M. de Leoni, and W. M. van der Aalst, "Scientific workflows for process mining: building blocks, scenarios, and implementation," International Journal on Software Tools for Technology Transfer, vol. 18, no. 6, pp. 607–628, 2016.
[5] R. Mans, W. van der Aalst, and H. E. Verbeek, "Supporting process mining workflows with RapidProM," in BPM (Demos), 2014, p. 56.
[6] W. van der Aalst, A. Bolt, and S. J. van Zelst, "RapidProM: Mine your processes and not just your data," CoRR, vol. abs/1703.03740, 2017. [Online]. Available: http://arxiv.org/abs/1703.03740
[7] J. Carmona Vargas and M. Solé, "PMLAB: a scripting environment for process mining," in Proceedings of the BPM Demo Sessions 2014: Co-located with the 12th International Conference on Business Process Management (BPM 2014), Eindhoven, The Netherlands, September 10, 2014. CEUR-WS.org, 2014, pp. 16–20.
[8] H. Valdivieso, W. L. J. Lee, J. Munoz-Gama, and M. Sepúlveda, "OpyenXES: A complete Python library for the eXtensible Event Stream standard."
[9] G. Janssenswillen and B. Depaire, "bupaR: Business process analysis in R," 2017.
[10] W. van der Aalst, T. Weijters, and L. Maruster, "Workflow mining: Discovering process models from event logs," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 9, pp. 1128–1142, 2004.
[11] S. J. Leemans, D. Fahland, and W. van der Aalst, "Scalable process discovery with guarantees," in International Conference on Enterprise, Business-Process and Information Systems Modeling. Springer, 2015, pp. 85–101.
[12] A. Adriansyah, N. Sidorova, and B. F. van Dongen, "Cost-based fitness in conformance checking," in 2011 Eleventh International Conference on Application of Concurrency to System Design. IEEE, 2011, pp. 57–66.
[13] W. van der Aalst and M. Song, "Mining social networks: Uncovering interaction patterns in business processes," in International Conference on Business Process Management. Springer, 2004, pp. 244–260.
[14] S. J. van Zelst, A. Bolt, M. Hassani, B. F. van Dongen, and W. van der Aalst, "Online conformance checking: relating event streams to process models using prefix-alignments," International Journal of Data Science and Analytics, pp. 1–16, 2017.
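The two event data structures described under Object Management (an event log as a list of traces, an event stream as one flat list of events) and the conversion between them can be illustrated with plain Python containers. The following is a minimal sketch and not PM4Py's own implementation: the function names log_to_stream and stream_to_log, the representation of a trace as a (case identifier, events) pair, and the use of the XES-style key case:concept:name to carry the case identifier are all assumptions made for illustration.

```python
from collections import defaultdict

# Assumed attribute key that carries the case identifier on each event
# (XES-style naming; chosen here for illustration only).
CASE_KEY = "case:concept:name"


def log_to_stream(event_log):
    """Flatten a list of traces into one flat list of events.

    Each trace is assumed to be a (case_id, events) pair, where events is
    a list of key-value maps. The case id is copied onto every event so
    that the grouping into cases can be recovered later.
    """
    stream = []
    for case_id, events in event_log:
        for event in events:
            flat_event = dict(event)  # copy, so the input is not mutated
            flat_event[CASE_KEY] = case_id
            stream.append(flat_event)
    return stream


def stream_to_log(event_stream):
    """Group a flat list of events back into traces, keyed by case id."""
    traces = defaultdict(list)
    for event in event_stream:
        attributes = {k: v for k, v in event.items() if k != CASE_KEY}
        traces[event[CASE_KEY]].append(attributes)
    # dicts preserve insertion order, so cases appear in first-seen order
    return list(traces.items())
```

Under these assumptions, converting a log to a stream and back yields the original log, provided that the events of each case appear in the stream in their original relative order.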
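The alignment printed in the lower part of Fig. 3 is a list of (log label, model label) pairs. As a reading aid, the sketch below classifies each pair and counts the move types. It relies on an assumption that the paper does not spell out, namely the common alignment convention that '>>' marks a skipped side and that a pair ('>>', None) is a move on an unlabeled (silent) transition of the model. The names classify_move and summarize_alignment are hypothetical helpers, not part of PM4Py.

```python
from collections import Counter

SKIP = ">>"  # assumed marker for "no move on this side"

# The alignment reported in Fig. 3, copied as data.
FIGURE_3_ALIGNMENT = [
    ("register request", "register request"), (SKIP, None),
    ("check ticket", "check ticket"),
    ("examine thoroughly", "examine thoroughly"), (SKIP, None),
    ("decide", "decide"), (SKIP, None),
    ("reject request", "reject request"),
]


def classify_move(move):
    """Classify one (log_label, model_label) pair of an alignment."""
    log_label, model_label = move
    if log_label == SKIP and model_label is None:
        return "silent model move"  # model fires an unlabeled transition
    if log_label == SKIP:
        return "model move"         # model advances, log does not
    if model_label == SKIP:
        return "log move"           # log advances, model does not
    return "synchronous move"       # log and model agree on the activity


def summarize_alignment(alignment):
    """Count how often each type of move occurs in an alignment."""
    return dict(Counter(classify_move(move) for move in alignment))
```

Under this reading, the trace in Fig. 3 contains only synchronous moves and silent model moves, i.e., no deviations between the trace and the model.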
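The handover of work metric listed under Social Network Analysis [13] counts how often one resource performs an activity that is directly followed, within the same case, by an activity of another resource. The sketch below shows the basic counting step on an event log given as a list of traces of key-value maps. It is an illustrative simplification rather than PM4Py's implementation: the function name handover_of_work is hypothetical, the resource is assumed to be stored under the XES-style key org:resource, and no normalization of the counts is applied.

```python
from collections import Counter


def handover_of_work(event_log, resource_key="org:resource"):
    """Count direct handovers between resources.

    event_log is a list of traces; each trace is a list of events, and
    each event is a key-value map holding the resource under resource_key
    (an assumed key). Returns a Counter that maps (from_resource,
    to_resource) to the number of times the first resource's event is
    directly followed by the second resource's event in the same trace.
    Pairs with the same resource on both sides are counted as well.
    """
    handovers = Counter()
    for trace in event_log:
        # pair every event with its direct successor in the same trace
        for current, following in zip(trace, trace[1:]):
            pair = (current[resource_key], following[resource_key])
            handovers[pair] += 1
    return handovers
```

The resulting counts can serve as edge weights of a directed graph over resources, which is the kind of network shown in Fig. 4.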