<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PMLAB: A Scripting Environment for Process Mining</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Josep Carmona</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marc Solé</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CA Technologies</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Universitat Politècnica de Catalunya</institution>
          ,
          <addr-line>Barcelona</addr-line>
          ,
          <country country="ES">Spain</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In a decade of process mining research, several algorithms have been proposed to solve particular process mining tasks. At the same pace, tools have appeared in both the academic and the commercial domains. These tools have enabled the use of process mining practices only to a rather limited extent. In this paper we advocate a change in mentality: process mining may be an exploratory discipline, and PMLAB, a Python-based scripting environment supporting this view, is proposed. This demo presents the main features of the PMLAB environment.</p>
        <p>In scientific computing, one is not only building computational models and complex algorithms that enable quantitative analysis, but also continuously exposing these models and algorithms to real data in order to better understand the reality being studied. This exploratory view of the field (algorithms help, but nobody in the field assumes that they are sufficient to solve their particular problems) has made environments like MATLAB® or Mathematica® tremendously successful in helping the progress of research. We advocate having a similar environment in the novel discipline of process mining. In a nutshell, we believe process mining should be programmed and not only used.</p>
        <p>There are several tools available to use process mining algorithms, with ProM [<xref ref-type="bibr" rid="ref3">3</xref>] being the state-of-the-art tool. ProM is a great academic effort that incorporates around three hundred plugins programmed at many different universities. It allows normal users, i.e., those not familiar with process mining, to use a graphical user-friendly front-end to process mining algorithms. This strength (process mining algorithms accessible to the masses) has become, in our opinion, a weakness: the expert user is restricted to working with strict GUIs, and considerable effort is needed when the particular task at hand does not fully match the requirements of the plugins. A programmer needs deep knowledge of the internals of ProM in order to create a new plugin, even if it is only a slight modification of the ones available.</p>
        <p>PMLAB is an interactive programming environment for (exploratory) process mining computing and/or research on top of a process-oriented language.</p>
        <p>Copyright © 2014 for this paper by its authors. Copying permitted for private and academic purposes.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>In this language, logs, models and many other high-level objects/tasks are first-class
citizens, meaning that one can compute (interactively or not) on the basis of these
elements. Importantly, these high-level elements can be viewed at different
granularities, e.g., a log can be simply passed to a discovery algorithm
(coarse-level view), or analyzed to derive the most frequent cases (introspective
view). The following is a list of PMLAB features:
- Interactive shell: as in Mathematica, the environment provides a shell where
every object used/computed is available, and process mining algorithms may
be applied to these objects to create new ones. A typical session
starts by importing the libraries to be used, and then continuously enriches the
environment by computing new objects from the existing ones.
- Process mining elements as first-class citizens of the language: importantly,
the environment offers a solid and consistent library for some of the main
tasks required in process mining, e.g., importing a log in XES format. Once
a log is imported into a variable, algorithms can be applied to the variable to
produce new elements (e.g., a discovery algorithm to derive a BPMN model).
- Programmer friendly: the environment not only provides the necessary help
for using the elements, but more importantly describes them in a way that lets a
programmer incorporate these objects into her/his programs.
- Extendable: new functionalities can be added by means of new library
modules.
- Irredundant: having thirty algorithms that perform the same task is probably
not the ideal situation for using that functionality. As a policy, we believe
the core environment should limit the amount of redundancy in order to
simplify usage.
- Simple programming: the syntax and semantics of the language should be
simple, in order to allow for easy programming. One example of this is types in
programming languages: although useful for programming and compilation,
the learning curve required to master a statically-typed language is
significantly steeper than that of a dynamically-typed language. This makes
dynamically-typed languages such as Python good candidates.
- OS exposed: there is a good marriage between the operating system elements
(files, directories, databases, etc.) and the elements of the environment.
This eases the management and manipulation of the data within the
environment.
- Support for distributed/parallel computing: it is fairly easy to distribute
or parallelize the computations to take advantage of the available computing
resources.
      </p>
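      <p>As a rough illustration of the introspective view mentioned above, the following sketch derives the most frequent cases of a log in plain Python. It does not use the PMLAB API: the log is assumed to be a list of cases, each a sequence of activity names, and the helper <monospace>most_frequent_cases</monospace> and the toy data are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def most_frequent_cases(cases, k=3):
    """Return the k most frequent cases as (case, count) pairs."""
    # Cases are converted to tuples so they can be used as dictionary keys.
    counts = Counter(tuple(case) for case in cases)
    return counts.most_common(k)


# Toy log (invented for illustration): each case is a list of activity names.
toy_log = [
    ["invite", "review", "accept"],
    ["invite", "review", "reject"],
    ["invite", "review", "accept"],
    ["invite", "timeout", "review", "accept"],
]

top = most_frequent_cases(toy_log, k=2)
for case, count in top:
    print(count, " -> ".join(case))
```
]]></preformat>
      <p>In an interactive session, the resulting list is simply another object in the environment and can be inspected or passed on to further computations.</p>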
    </sec>
    <sec id="sec-2">
      <title>PMLAB Tool Description</title>
      <sec id="sec-2-1">
        <title>Architecture</title>
        <p>
          PMLAB builds on three components:
- IPython [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]: this environment offers the shell to perform the computations
needed to carry out process mining tasks. Many of the functionalities
listed in the previous section exist in IPython. It is an open-source
environment (BSD license).
- PMLAB modules: a set of process mining modules will form the basis for
process mining computing. These modules will contain algorithms for the three
process mining dimensions: discovery, conformance and enhancement. In the
current version of PMLAB, only discovery and enhancement algorithms are
included.
- Python programming: finally, for any task not considered in the PMLAB
modules, one can always use a Python program on top of IPython. The results
and intermediate computations (e.g., the program's variable assignments) may
be, if wanted, incorporated into the IPython shell after running the program.
        </p>
        <p>An example illustrating the user perspective of PMLAB is provided in the
Example section.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Maturity</title>
        <p>The tool is currently under development, and it is expected to evolve towards
a more stable version in the near future. The current version can be seen as a
prototype of the ideas underlying the scripting of process mining algorithms. Moreover,
the tool has only been tested with small or medium-sized examples. In spite of
this, a few universities and companies are already using the tool in its current state.</p>
        <p>In terms of features, the tool provides support for the following objects: logs,
Petri nets, transition systems, Causal nets and BPMN models. There exist
transformations between some of these elements (e.g., from Causal nets to BPMN, or
from logs to transition systems), and discovery techniques for Petri nets, Causal
nets, and BPMN models. All elements can be graphically visualized and some
of them simulated. High-level algorithms like log clustering, filtering,
projecting and event encoding are also available. Since PMLAB supports some of the
standard process mining formats, it can be used to interact with other tools.</p>
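        <p>The high-level log operations listed above can be pictured with a small plain-Python sketch of projection and filtering. It does not call PMLAB: the log representation (a list of cases, each a list of activity names) and the helpers <monospace>project_log</monospace> and <monospace>filter_log</monospace> are assumptions made for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def project_log(cases, activities):
    """Keep only the events whose activity belongs to `activities`."""
    keep = set(activities)
    return [[a for a in case if a in keep] for case in cases]


def filter_log(cases, predicate):
    """Keep only the cases for which `predicate` returns True."""
    return [case for case in cases if predicate(case)]


# Toy log invented for illustration.
toy_log = [
    ["a", "b", "c", "d"],
    ["a", "c", "b", "d"],
    ["a", "d"],
]

# Projection onto the activities a, b and d removes every c event.
projected = project_log(toy_log, ["a", "b", "d"])
# Filtering keeps only the cases with more than two events.
long_cases = filter_log(toy_log, lambda case: len(case) > 2)
print(projected)
print(long_cases)
```
]]></preformat>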
      </sec>
      <sec id="sec-2-3">
        <title>Getting the Tool</title>
        <p>All the required information can be found on the following web page:
http://www.lsi.upc.edu/~jcarmona/PMLAB/
It includes a tutorial with installation instructions, and the distribution.
Currently the tool is distributed in two forms:
- Virtualized: we have created a VirtualBox virtual machine running Lubuntu, which
can be easily downloaded and installed in a few steps.
- Sources: we provide the Python library together with the installation script.</p>
        <p>The sources are expected to be installed on a Linux distribution, since some of
the binaries that are also provided are compiled for this platform.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Example</title>
      <p>We now describe a session that uses functionalities already available in the
environment. We simply begin by starting the IPython environment in a directory
that contains an XES file named exercise5.xes, which belongs to the example
files distributed with ProM. The log contains sequences representing the typical
actions in the process of reviewing papers for scientific publication.</p>
      <p>Our first action is loading the module that handles the logs, and reading the
file.
&gt;&gt;&gt; import pmlab.log
&gt;&gt;&gt; log = pmlab.log.log_from_file('exercise5.xes')
&gt;&gt;&gt; log.statistics()
Alphabet size: 14
Number of cases: 100
Number of unique cases: 96
Length of shortest case: 11
Length of largest case: 50
Average case length: 23.0</p>
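      <p>To make the meaning of these figures concrete, the following sketch computes the same kind of summary in plain Python. It is not PMLAB's implementation, which may differ: the log is assumed to be a list of cases, each a list of activity names, and the function <monospace>log_statistics</monospace> and the toy data are hypothetical.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def log_statistics(cases):
    """Compute summary figures of the kind printed by the session above."""
    lengths = [len(case) for case in cases]
    return {
        # Number of distinct activity names appearing in the log.
        "alphabet_size": len({a for case in cases for a in case}),
        "cases": len(cases),
        # Two cases are equal when they contain the same activity sequence.
        "unique_cases": len({tuple(case) for case in cases}),
        "shortest": min(lengths),
        "longest": max(lengths),
        "average": sum(lengths) / len(lengths),
    }


# Toy log invented for illustration.
toy_log = [["a", "b", "c"], ["a", "c", "b", "d"], ["a", "b", "c"]]
stats = log_statistics(toy_log)
for name, value in stats.items():
    print(name, value)
```
]]></preformat>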
      <p>
        Imagine that we want to communicate the model to a company whose
members are only familiar with the BPMN notation. The PMLAB package has a module
that allows discovering BPMN diagrams from C-nets [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], so first of all we will
discover a C-net and then transform it into a BPMN diagram. To discover a
C-net we must load the C-net module and condition the log (C-nets have some
particular conditions that have to be fulfilled by the logs). The corresponding
instructions are shown below:
&gt;&gt;&gt; import pmlab.cnet
&gt;&gt;&gt; clog = pmlab.cnet.condition_log_for_cnet(log)
&gt;&gt;&gt; cn, bfreq = pmlab.cnet.cnet_from_log(clog)
&gt;&gt;&gt; cn.save('exercise5.cn')
&gt;&gt;&gt; pmlab.cnet.save_frequencies(bfreq, 'exercise5.bfreq')
Additionally, after discovering the C-net we have saved the net and the binding
frequencies discovered in two files, in case we need them on another occasion.
      </p>
      <p>Finally we will transform the C-net into a BPMN diagram. To do so we will
first load the appropriate module and call the transformation function. Then we
will add the frequency information to the diagram (so that the most frequent paths
appear thicker than infrequent ones), saving the diagram in a Graphviz DOT file.
&gt;&gt;&gt; import pmlab.bpmn
&gt;&gt;&gt; bp = pmlab.bpmn.bpmn_from_cnet(cn)
&gt;&gt;&gt; bp.add_frequency_info(clog, bfreq)
&gt;&gt;&gt; bp.print_dot('exercise5freq.bpmn.dot')
The DOT file can then be used to generate a graphic file using the Graphviz
suite, which can be called directly from IPython using the following command:
&gt;&gt;&gt; !dot -Tps exercise5freq.bpmn.dot &gt; exercise5freq.bpmn.ps
That produces the BPMN diagram of Figure 2 (bottom).</p>
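      <p>When the rendering step has to run from a Python script rather than from the IPython prompt, the same Graphviz call could be issued through the standard <monospace>subprocess</monospace> module. The sketch below is one possible way to do this; the helper names <monospace>dot_command</monospace> and <monospace>render_dot</monospace> are hypothetical and not part of PMLAB.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import shutil
import subprocess


def dot_command(dot_file, out_file, fmt="ps"):
    """Build a Graphviz command line equivalent to the shell call above."""
    # "-o" writes to a file instead of redirecting standard output.
    return ["dot", "-T" + fmt, dot_file, "-o", out_file]


def render_dot(dot_file, out_file, fmt="ps"):
    """Run Graphviz if it is installed; return True when it was run."""
    if shutil.which("dot") is None:
        return False  # Graphviz is not available on this machine
    subprocess.run(dot_command(dot_file, out_file, fmt), check=True)
    return True


cmd = dot_command("exercise5freq.bpmn.dot", "exercise5freq.bpmn.ps")
print(" ".join(cmd))
```
]]></preformat>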
      <p>Up to this point we have shown a classical use of the environment as a simple
user. However, for this kind of task, a more user-friendly GUI would be nicer
and would also save typing. What are the advantages of the environment? Let us
illustrate some of them. Assume that you want to repeat the previous processing
with 1,000 different files. No problem. IPython offers a save command in which
you can indicate which instruction numbers you want to save to a file. Using that
command we can save all the previously typed instructions to a text file named,
for instance, discoverBPMN.py. Now in this script we can replace the literal
'exercise5.xes' with a variable inside a loop that takes as value the name of
each one of the 1,000 files. This script can be executed inside IPython with a
simple run discoverBPMN.py</p>
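      <p>A possible shape for such a script is sketched below. The PMLAB calls are the ones typed in the session above; everything else is an assumption made for illustration, in particular the helper <monospace>output_name</monospace>, the glob pattern used to find the logs, and the naming of the output files.</p>
      <preformat preformat-type="code"><![CDATA[
```python
import glob
import os


def output_name(xes_path, suffix):
    """Derive an output file name from the log name, e.g. a.xes -> a.cn."""
    base, _ = os.path.splitext(xes_path)
    return base + suffix


def discover_bpmn(xes_path):
    """Repeat the steps of the interactive session for one log file."""
    # PMLAB is imported here so that the helper above works without it.
    import pmlab.log
    import pmlab.cnet
    import pmlab.bpmn

    log = pmlab.log.log_from_file(xes_path)
    clog = pmlab.cnet.condition_log_for_cnet(log)
    cn, bfreq = pmlab.cnet.cnet_from_log(clog)
    cn.save(output_name(xes_path, ".cn"))
    pmlab.cnet.save_frequencies(bfreq, output_name(xes_path, ".bfreq"))
    bp = pmlab.bpmn.bpmn_from_cnet(cn)
    bp.add_frequency_info(clog, bfreq)
    bp.print_dot(output_name(xes_path, "freq.bpmn.dot"))


def main(pattern="*.xes"):
    # Assumption: all logs sit in the current directory with a .xes extension.
    for xes_path in sorted(glob.glob(pattern)):
        discover_bpmn(xes_path)


if __name__ == "__main__":
    main()
```
]]></preformat>
      <p>Keeping the per-file work in a single function also makes it straightforward to hand the list of files to a parallel map later on, in line with the parallel computing feature listed in the Introduction.</p>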
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>F.</given-names>
            <surname>Perez</surname>
          </string-name>
          and
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Granger</surname>
          </string-name>
          .
          <article-title>IPython: a System for Interactive Scientific Computing</article-title>
          .
          <source>Comput. Sci. Eng.</source>
          ,
          <volume>9</volume>
          (
          <issue>3</issue>
          ):
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
          , May
          <year>2007</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>W.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Adriansyah</surname>
          </string-name>
          , and
          <string-name>
            <given-names>B.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          .
          <article-title>Causal nets: a modeling language tailored towards process discovery</article-title>
          . In
          <source>CONCUR</source>
          , pages
          <fpage>28</fpage>
          -
          <lpage>42</lpage>
          ,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>W. M. P.</given-names>
            <surname>van der Aalst</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>van Dongen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. W.</given-names>
            <surname>Günther</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rozinat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          , and
          <string-name>
            <given-names>T.</given-names>
            <surname>Weijters</surname>
          </string-name>
          .
          <article-title>ProM: The process mining toolkit</article-title>
          . In
          <string-name>
            <given-names>A. K. A.</given-names>
            <surname>de Medeiros</surname>
          </string-name>
          and
          <string-name>
            <given-names>B.</given-names>
            <surname>Weber</surname>
          </string-name>
          , editors,
          <source>BPM (Demos)</source>
          , volume
          <volume>489</volume>
          of
          <series>CEUR Workshop Proceedings</series>
          .
          <publisher-name>CEUR-WS.org</publisher-name>
          ,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>