PMLAB: An Scripting Environment for Process Mining Josep Carmona1 and Marc Solé2? 1 Universitat Politècnica de Catalunya, Barcelona (Spain) 2 CA Technologies, Barcelona (Spain) Abstract In a decade of process mining research, several algorithms have been proposed to solve particular process mining tasks. At the same pace, tools have appeared both in the academic and the commercial domains. These tools have enabled the use of process mining practices to a rather limited extent. In this paper we advocate for a change in the mentality: process mining may be an exploratory discipline, and PMLAB – a Python-based scripting environment supporting this – is proposed. This demo presents the main features of the PMLAB environment 1 Introduction In scientific computing, one is not only building computational models and com- plex algorithms that enable quantitative analysis, but also continuously exposing these models and algorithms to real data in order to have a better understand- ing of the reality being studied. This exploratory view of the field (algorithms help but nobody in the field assumes that are sufficient to solve their particular problems) made environments like MATLAB R or Mathematica R to be tremen- dously successful in helping the progress of research. We advocate for having a similar environment in the novel discipline of process mining. In a nutshell, we believe process mining should be programmed and not only used. There are several tools available to use process mining algorithms, being ProM [3] the state-of-the-art tool. ProM is a great academic effort that incorporates around three hundreds plugins programmed from very different universities. It allows normal users, i.e., those ones not familiar with process mining, to use a graphical user-friendly front-end to process mining algorithms. This strength (process mining algorithms accessible for the masses) has become, in our opinion, a weakness: the expert user is restricted to work with strict GUIs, and it takes a considerable effort when the particular task to achieve is not fully satisfying the requirements of the plugins. As a programmer, a deep knowledge of the internals of ProM is required in order to create a new plugin, even if it represents a slight modification of the ones available. PMLAB is an interactive programming environment for (exploratory) process mining computing and/or research on top of a process-oriented language. In this ? Copyright c 2014 for this paper by its authors. Copying permitted for private and academic purposes. language, logs, models and many other high-level objects/tasks are first-class cit- izens, meaning that one can compute (interactively or not) on the basis of these elements. Importantly, there can be different granularities on the view of these high-level elements, e.g., a log can be simply passed to a discovery algorithm (coarse-level view), or analyzed to derive the most frequent cases (introspective view). The following is a list of PMLAB features: – Interactive shell: as happens in Mathematica, a shell where every object used/computed is available is provided, and process mining algorithms may be applied to these objects to create new ones. The typical session may start by importing the libraries to be used, and to continuously enrich the environment by computing new objects from the existing ones. – Process mining elements as first-class citizens of the language: importantly, the environment offers a solid and consistent library for some of the main tasks required in process mining, e.g., importing a log in XES format. Once a log is imported into a variable, algorithms can be applied on the variable to produce new elements (e.g., a discovery algorithm to derive a BPMN model). – Programmer friendly: the environment not only provides the necessary help for using the elements, but more importantly describes them in a way a programmer can incorporate these objects onto her/his programs. – Extendable: new functionalities can be added by means of new library mod- ules. – Irredundant: to have thirty algorithms to perform the same task maybe is not the ideal situation for using that functionality. As a policy, we believe the core environment should limit the amount of redundancy in order to simplify the usage. – Simple Programming: the syntax and semantics of the language should be easy, in order to allow for easy programming. One example of this is types in programming languages: although useful for programming and compilation, the learning curve required to master a statically-typed language is signif- icantly higher than the one for a dynamically-typed language. This makes dynamically-typed languages as Python a good candidate. – OS exposed: there is a good marriage between the operating system elements (files, directories, databases, etc ...) and the elements of the environment. This will easy the management and manipulation of the data within the environment. – With support to distributed/parallel computing: it is fairly easy to distribute or parallelize the computations to take advantage of the computing resources available. 2 PMLAB Tool Description 2.1 Architechture Figure 1 describes informally the three computing resources available in the environment: Python Programming PMLAB Modules IPython Figure 1. The PMLAB environment. – IPython [1]: this environment offers the shell to perform the computations needed to carry out process mining tasks. Many of the functionalities re- quired in the previous section exist in IPython. It is an open-source environ- ment (license BSD). – PMLAB modules: a set of process mining modules will form the basis for pro- cess mining computing. These modules will contain algorithms for the three process mining dimensions: discovery, conformance and enhancement. In the current version of PMLAB, only discovery and enhancement algorithms are included. – Python Programming: finally, for any task not considered in the PMLAB mod- ules, one can always use a python program on top of IPython. The results and intermediate computations (e.g., program’s variable assignments) may be, if wanted, incorporated into the IPython shell after running the program. In the next subsection we provide an example illustrating the user perspective of PMLAB. 2.2 Maturity The tool is currently under development, and therefore it may evolve towards a more stable version in the near future. The current version can be seen as a prototype of the ideas underlying scripting process mining algorithms. Moreover, the tool has only been tested with small or medium sized examples. In spite of this, there are few universities and companies using the tool in its current state. In terms of features, the tools provides support for the following objects: logs, Petri nets, transition systems, Causal nets, BPMN models. There exist trans- formations between some of these elements (e.g. from Causal nets to BPMN, or from logs to transition systems), and discovery techniques for Petri nets, Causal nets, and BPMN models. All elements can be graphically visualized and some of them simulated. High-level algorithms like log clustering, filtering, project- ing and event encoding are also available. Since PMLAB supports some of the standard process mining formats, it can be used to interact with other tools. 2.3 Getting the Tool In the following web page: http://www.lsi.upc.edu/~jcarmona/PMLAB/ one can find all the required information: a tutorial including installation in- structions and the distribution. Currently the tool is distributed in two forms: – Virtualized: We have created a VirtualBox virtual machine in Lubuntu which can be easily downloaded and installed in few steps. – Sources: We provide the python library together with the installation script. It is expected to be installed in a Linux distribution, since some binaries that are also provided are compiled for this platform. 3 Example A session that uses the already available functionalities in the environment is now described. We simply begin by starting the IPython environment in a directory that contains a XES file named exercise5.xes, which belongs to the example files distributed with ProM. The log contains sequences representing the typical actions in the process of reviewing papers for scientific publication. Our first action is loading the module that handles the logs, and reading the file. >>> import pmlab.log >>> log = pmlab.log.log_from_file(’exercise5.xes’) >>> log.statistics() Alphabet size: 14 Number of cases: 100 Number of unique cases: 96 Length of shortest case: 11 Length of largest case: 50 Average case length: 23.0 Imagine that we want to communicate the model to a company whose mem- bers are only familiar with the BPMN notation. The PMLAB package has a module that allows discovering BPMN diagrams from C-nets [2], so first of all we will discover a C-net and then transform it into a BPMN diagram. To discover a C-net we must load the C-net module and condition the log (C-net have some particular conditions that have to be fulfilled by the logs). The corresponding instructions are shown below: >>> import pmlab.cnet >>> clog = pmlab.cnet.condition_log_for_cnet(log) >>> cn, bfreq = pmlab.cnet.cnet_from_log(clog) >>> cn.save(’exercise5.cn’) >>> pmlab.cnet.save_frequencies(bfreq, ’exercise5.bfreq’) Additionally, after discovering the C-net we have saved the net and the binding frequencies discovered in two files just in case we need them in another occasion. Finally we will transform the C-net into a BPMN diagram. To do so we will first load the appropriate module and call the transformation function. Then we will add the frequency information to the diagram (so that most frequent paths appear thicker than infrequent ones), saving the diagram in a graphviz DOT file. >>> import pmlab.bpmn >>> bp = pmlab.bpmn.bpmn_from_cnet(cn) >>> bp.add_frequency_info(clog, bfreq) >>> bp.print_dot(’exercise5freq.bpmn.dot’) Then the DOT file can be used to generate a graphic file using the Graphviz suite, that can be called directly from IPython using the following command: >>> !dot -Tps exercise5freq.bpmn.dot > exercise5freq.bpmn.ps That produces the BPMN diagram of Figure 2 (bottom). Figure 2. BPMN diagrams of the example produced by the tool: without fre- quencies (top), with frequencies (bottom). Up to this point we have shown a classical use of the environment as a simple user. However, for this kind of tasks, a more user-friendly GUI would be nicer and would also save typing. What are the advantages of the environment ? Let us illustrate some of them. Assume that you want to repeat the previous processing with 1,000 different files. No problem. IPython offers a save command in which you can indicate which instruction numbers you want to save to a file. Using that command we can save all the previous typed instructions to a text file named, for instance, discoverBPMN.py. Now in this script we can change the literal ’exercise5.xes’ for a variable inside a loop that takes as value the name of each one of the 1,000 files. This script can be executed inside IPython with a simple run discoverBPMN.py References 1. F. Pérez and B. E. Granger. IPython: a System for Interactive Scientific Computing. Comput. Sci. Eng., 9(3):21–29, May 2007. 2. W. van der Aalst, A. Adriansyah, and B. van Dongen. Causal nets: a modeling language tailored towards process discovery. In CONCUR, pages 28–42, 2011. 3. W. M. P. van der Aalst, B. F. van Dongen, C. W. Günther, A. Rozinat, E. Verbeek, and T. Weijters. Prom: The process mining toolkit. In A. K. A. de Medeiros and B. Weber, editors, BPM (Demos), volume 489 of CEUR Workshop Proceedings. CEUR-WS.org, 2009.