<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Unearthing the Real Process Behind the Event Data: The Case for Increased Process Realism (Extended Abstract)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Gert Janssenswillen</string-name>
          <email>gert.janssenswillen@uhasselt.be</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>UHasselt - Hasselt University</institution>
          ,
          <addr-line>Martelarenlaan 42, 3500 Hasselt</addr-line>
          ,
          <country country="BE">Belgium</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Companies in the 21st century possess a large amount of data about their products, customers and transactions. The increase in available event data gave rise to process mining, a discipline that focuses on extracting insights about processes from event logs. However, correctly displaying business processes is not a trivial task. This dissertation introduces the concept of process realism, which stresses the need for reliable process analysis results for evidence-based decision making, and approaches it from two angles. Firstly, quality dimensions and measures for process discovery are analysed on a large scale and compared with each other on the basis of empirical experiments. Secondly, by developing a transparent and extensible tool-set, a framework is offered to analyse process data from different perspectives. Exploratory and descriptive analysis of process data and testing of hypotheses again lead to increased process realism. Based on both approaches, recommendations are made for future research, and a call is made to give the process realism mindset a central place within process mining analyses.</p>
      </abstract>
      <kwd-group>
        <kwd>Process mining</kwd>
        <kwd>Event data</kwd>
        <kwd>Conformance Checking</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In current times, organisations possess a tremendous amount of data concerning
their customers, products and processes. Many activities which are taking place
in their operational processes are being recorded in event logs [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Techniques
from the process mining field, which has grown steadily over the last decades,
can be applied to gain insights into these event data [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Over the past decade, a
lot of attention has been given to the discovery of process models from logs [
        <xref ref-type="bibr" rid="ref4 ref5">4,5</xref>
        ],
and the quality measurement of these models [
        <xref ref-type="bibr" rid="ref3 ref6 ref7">3, 6, 7</xref>
        ].
      </p>
      <p>The results of process mining analyses, if acted upon, can have important
ramifications for business operations in two ways. Firstly, they can lead to improvements in the
performance of processes. Performance, or a lack thereof, can be expressed
in many different ways, such as the time spent on the process or the incurred
operational costs. Secondly, they can lead to improvements in compliance with rules and
regulations, whether imposed internally by organisations or by (inter)national laws,
which is important to prevent fraud and other types of risk. Both of these aspects
rely strongly on the ability to accurately delineate the process and all its
relevant characteristics based on the process data that have been extracted from the
organisation's information systems.</p>
      <p>This dissertation aims to contribute to this accurate representation of
processes from several angles, introducing the concept of process
realism. Process realism can be defined as the interest in or concern for the
actual or real process, as distinguished from the abstract or speculative one, or as the
tendency to view or represent processes as they really are. In order to optimise
processes, evidence-based decision making is needed. Consequently, it is essential
to map these processes in a realistic way. Blindly relying on partial or
inconsistent data and on algorithms can lead to wrong actions being taken.</p>
      <p>Process realism is approached from two perspectives. First, quality
dimensions and measures for process discovery results are analysed on a large scale
and compared with each other on the basis of empirical experiments (Part II of
the thesis). Which measures are best suited to assess the quality of a discovered
process model? What are their weaknesses and strengths? Which challenges still
need to be overcome in order to evolve towards reliable quality measurement?
The results of these experiments are discussed in Section 2.</p>
      <p>In addition to the focus on process models, process realism is also approached
from a data point of view. By developing a transparent and extensible tool-set, a
framework is offered to analyse process data from different perspectives (Part III
of the thesis). Exploratory and descriptive analysis of process data and testing
of hypotheses again leads to increased process realism. This led to the creation
of bupaR, an open-source software suite for process analysis in R. This part of
the dissertation is further discussed in Section 3.</p>
    </sec>
    <sec id="sec-2">
      <title>Process Model Quality</title>
      <p>When looking at the results of a process discovery algorithm, the outcome is
often too easily (mis)taken for absolute truth about the underlying process.
However, the fact that it was discovered from a sample of event data, which
probably also contains measurement errors, tells us that this is not necessarily
the case. Simultaneously, it is not a reliable representation of the original event
data either, because of the filters and other choices and assumptions imposed
by the discovery algorithm used. As such, awareness about whether you are
describing the event data, making assertions about the underlying process, or
an ambiguous mix of both is currently missing.</p>
      <p>Being able to accurately quantify the quality of discovered process models,
which is an important component of conformance checking, is critical for process
discovery. Only through accurate quality measurement can the trustworthiness
of discovered process models be assessed, to see whether the insights they deliver
are reliable. It is crucial to know whether a discovered process model is a precise
and fitting representation of the event data or the underlying process. Many
quality measures, covering fitness, precision and generalization, have been developed
over the past years, but they have so far only been evaluated narrowly on how
they compare to each other. Moreover, it is not clear how to interpret or combine
different dimensions such as precision and generalization. The research objective
related to process model quality is therefore twofold:</p>
      <list list-type="order">
        <list-item>
          <p>examine the measures in terms of validity, sensitivity and feasibility, and</p>
        </list-item>
        <list-item>
          <p>analyse their ability to quantify the quality of the model as a representation
of the underlying process, i.e. the system.</p>
        </list-item>
      </list>
      <p>A summary of the results of Part II is shown in Table 1. The unbiased
estimator column for fitness and precision measures refers to the ability of measures to
act as an unbiased estimator of the correspondence between the process model
and the underlying process (i.e. the system), when applied between a log and a
model. We refer to the correspondence with the system as system-fitness
(system-precision) and to the correspondence with the log as log-fitness
(log-precision). For generalization measures, unbiased estimator refers
to the ability of generalization measures to unbiasedly estimate system-fitness,
which is how generalization is most often defined.<sup>1</sup></p>
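      <p>The distinction between log-level and system-level measurement can be illustrated with a minimal Python sketch. It is not one of the measures evaluated in Part II: the system, the log and the model are hypothetical, a model is simplified to a finite set of allowed traces, and <monospace>fitness</monospace> and <monospace>precision</monospace> are naive trace-level ratios assumed here purely for illustration.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def fitness(behaviour, model):
    """Frequency-weighted share of the behaviour that the model allows."""
    total = sum(behaviour.values())
    allowed = sum(w for trace, w in behaviour.items() if trace in model)
    return allowed / total


def precision(behaviour, model):
    """Share of the model's traces that actually occur in the behaviour."""
    seen = sum(1 for trace in model if behaviour.get(trace, 0) > 0)
    return seen / len(model)


# Hypothetical system: each trace with its (assumed) probability.
system = {
    ("a", "b", "c"): 0.5,
    ("a", "c", "b"): 0.3,
    ("a", "d"): 0.1,
    ("a", "e"): 0.1,
}

# Hypothetical log: a small sample that misses the two rarer traces.
log = Counter({("a", "b", "c"): 7, ("a", "c", "b"): 3})

# Hypothetical discovered model, simplified to a set of allowed traces.
# ("a", "x") is behaviour that the system never exhibits.
model = {("a", "b", "c"), ("a", "c", "b"), ("a", "d"), ("a", "x")}

log_fitness = fitness(log, model)
system_fitness = fitness(system, model)
log_precision = precision(log, model)
system_precision = precision(system, model)
```
]]></preformat>
      <p>In this toy setting log-fitness equals 1.0 while system-fitness is roughly 0.9, and log-precision equals 0.5 while system-precision equals 0.75. A value computed between a log and a model can thus differ from the correspondence between the model and the system.</p>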
      <p>The experiments of Part II and their conclusions indicate several important
challenges to be tackled by future research. These challenges relate both to process
quality measurement itself (e.g. how can system-quality be estimated in an unbiased
manner?) and to the experimental set-up for the empirical evaluation: which types of
models to generate and how many systems to generate are questions for this
kind of experiment to which the current literature does not provide an answer.</p>
    </sec>
    <sec id="sec-3">
      <title>Process Analytics</title>
      <p>A process model, however high its quality, will always
abstract away certain information, such as information on resources,
time, or other attributes, thereby partly sacrificing the realism one has about
the process. While a model can indicate certain surprising or interesting patterns
with regard to the process, practitioners will want a means to further
investigate such a pattern, to understand why and how it came about, before they
can decide whether, and which, corrective actions are required to improve
the performance or compliance of the process.<fn id="fn1"><label>1</label><p>Note that, in contrast to fitness and precision, there is no universally agreed-upon
definition of generalization.</p></fn></p>
      <p>As such, this dissertation is also motivated by the need for a tool-set to
analyse process data in a flexible and powerful way, able to focus on very specific
segments or perspectives of the processes. Important in this respect is the
capability to use proven data analytics techniques, from statistics to contemporary
data mining tools, in order to truly unravel these patterns and confirm that they are
real. Although many process analysis tools have
already been developed, important limitations remain which prevent this
type of flexible and transparent inquiry.</p>
      <p>The contribution of the second part of this dissertation is therefore the
development of a tool-set that answers to specific requirements, which are identified
based on an inventory of state-of-the-art tools, both open-source and
commercial. In particular, the following characteristics are considered:</p>
      <list list-type="bullet">
        <list-item>
          <p>Flexibility: the ability of the tool to analyse multiple perspectives of the
process besides the omnipresent focus on control-flow. Non-standard
case and event attributes should also receive their place in the analysis of the
process.</p>
        </list-item>
        <list-item>
          <p>Connectivity: the ability to use existing tools and techniques. Being
connected with these existing functionalities prevents process analysis
from ending up isolated from the advances in the broader data science field.</p>
        </list-item>
        <list-item>
          <p>Transparency: abolishing the often obscuring characteristics of process
analytics tools, such as hidden assumptions and ambiguous,
behind-the-scenes pre- or post-processing steps. In order to bring about process realism,
the tool should clearly document the workings of all the functionalities and
allow for reproducible workflows.</p>
        </list-item>
      </list>
      <p>Based on these aspects, the framework bupaR is introduced. bupaR is an
extensible set of R-packages for business process analysis, developed in order to
support flexible, reproducible and extensible process analytics. As an evaluation
of the framework, it is applied to two case studies. First, how can we use
process data to better understand students' study trajectories and to better guide
students? Secondly, how can we apply process analysis in a railway context, in
order to achieve a smoother service for passengers?</p>
      <p>Both case studies show that the framework clearly has added value, and
that the answers to the questions asked can help to improve the processes under
consideration. At the same time, unresolved challenges within process mining are
also emphasised, such as the analysis of processes at the right level of granularity,
and the assumption that process instances are independent of each other.</p>
      <p>From both perspectives, process model and process data, recommendations are
made for future research, and a call is made to give the process realism mindset a
central place within process mining analyses. The research objective of the first
part, with a focus on process model quality, was to analyse quality measures
to examine their usefulness in terms of validity, sensitivity and feasibility, as
well as their ability to quantify the quality of the model as a representation
of the underlying process. So far, little research has been done concerning the
evaluation and comparison of quality measures. The empirical analyses done in
this dissertation shed light on this poorly understood area.</p>
      <p>Secondly, reproducible analysis is an important requirement for tools
nowadays. It is relevant both for academics, allowing them to reproduce
experiments, and for industry, allowing analyses to be rerun on new data or under new
assumptions. Reproducibility can be obtained using different approaches. One is
to support the creation of graphical workflows, as RapidMiner and other
graphical data analysis environments do. Another approach is scripting,
which is the approach taken by bupaR. Through investments in
documentation, tutorials, examples and a website, as well as a straightforward API
design, bupaR has been made as accessible as possible,
thereby helping to spread the adoption of process mining to a broader audience.</p>
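      <p>The scripting approach can be sketched as follows. The example is written in plain Python rather than R and does not use or reproduce the bupaR API; the event log, its field names and the helper functions are hypothetical. It only illustrates how a script makes each analysis step explicit and rerunnable, and how perspectives other than control-flow (time, resources) can be analysed from the same data.</p>
      <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter, defaultdict
from datetime import datetime

# Hypothetical event log: one record per event (all values are illustrative).
EVENTS = [
    {"case": "c1", "activity": "register", "resource": "ann", "time": "2019-01-07 09:00"},
    {"case": "c1", "activity": "check", "resource": "bob", "time": "2019-01-07 11:30"},
    {"case": "c1", "activity": "decide", "resource": "ann", "time": "2019-01-08 10:00"},
    {"case": "c2", "activity": "register", "resource": "ann", "time": "2019-01-07 09:15"},
    {"case": "c2", "activity": "decide", "resource": "bob", "time": "2019-01-07 15:15"},
]


def parse(timestamp):
    """Turn the log's timestamp strings into datetime objects."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M")


def throughput_hours(events):
    """Time perspective: hours between the first and last event of each case."""
    times = defaultdict(list)
    for event in events:
        times[event["case"]].append(parse(event["time"]))
    return {case: (max(ts) - min(ts)).total_seconds() / 3600
            for case, ts in times.items()}


def traces(events):
    """Control-flow perspective: the ordered activity sequence of each case."""
    by_case = defaultdict(list)
    for event in sorted(events, key=lambda e: parse(e["time"])):
        by_case[event["case"]].append(event["activity"])
    return {case: tuple(acts) for case, acts in by_case.items()}


def resource_activity_counts(events):
    """Resource perspective: how often each resource performs each activity."""
    return Counter((e["resource"], e["activity"]) for e in events)


throughput = throughput_hours(EVENTS)
variants = Counter(traces(EVENTS).values())
workload = resource_activity_counts(EVENTS)
```
]]></preformat>
      <p>Rerunning such a script on a new event log, or after changing a single step such as the definition of throughput time, repeats the whole analysis without manual intervention.</p>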
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reijers</surname>
            ,
            <given-names>H.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weijters</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Medeiros</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verbeek</surname>
            ,
            <given-names>H.M.W.</given-names>
          </string-name>
          :
          <article-title>Business process mining: An industrial application</article-title>
          .
          <source>Information Systems</source>
          <volume>32</volume>
          (
          <issue>5</issue>
          ),
          <fpage>713</fpage>
          –
          <lpage>732</lpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weijters</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maruster</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Workflow mining: Discovering process models from event logs</article-title>
          .
          <source>IEEE Transactions on Knowledge and Data Engineering</source>
          <volume>16</volume>
          (
          <issue>9</issue>
          ),
          <fpage>1128</fpage>
          –
          <lpage>1142</lpage>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Adriansyah</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Munoz-Gama</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Carmona</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Alignment based precision checking</article-title>
          .
          In:
          <source>Business Process Management Workshops</source>
          . pp.
          <fpage>137</fpage>
          –
          <lpage>149</lpage>
          .
          <publisher-name>Springer</publisher-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Garcia-Banuelos</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>La Rosa</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>De Weerdt</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ekanayake</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          :
          <article-title>Controlled automated discovery of collections of business process models</article-title>
          .
          <source>Information Systems</source>
          <volume>46</volume>
          ,
          <fpage>85</fpage>
          –
          <lpage>101</lpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Leemans</surname>
            ,
            <given-names>S.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fahland</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Discovering block-structured process models from event logs: A constructive approach</article-title>
          . In:
          <source>International conference on applications and theory of Petri nets and concurrency</source>
          , pp.
          <fpage>311</fpage>
          –
          <lpage>329</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>de Leoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Maggi</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>An alignment-based framework to check the conformance of declarative process models and to preprocess event-log data</article-title>
          .
          <source>Information Systems</source>
          <volume>47</volume>
          ,
          <fpage>258</fpage>
          –
          <lpage>277</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Senderovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weidlich</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yedidsion</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mandelbaum</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kadish</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bunnell</surname>
            ,
            <given-names>C.A.</given-names>
          </string-name>
          :
          <article-title>Conformance checking and performance improvement in scheduled processes: A queueing-network perspective</article-title>
          .
          <source>Information Systems</source>
          <volume>62</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>