<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Heuristic Mining Revamped: An Interactive, Data-aware, and Conformance-aware Miner</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Felix Mannhardt</string-name>
          <email>f.mannhardt@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimiliano de Leoni</string-name>
          <email>m.d.leoni@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hajo A. Reijers</string-name>
          <email>h.a.reijers@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Eindhoven University of Technology</institution>
          ,
          <addr-line>Eindhoven</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vrije Universiteit Amsterdam</institution>
          ,
          <addr-line>Amsterdam</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Process discovery methods automatically infer process models based on events logs that are recorded by information systems. Several heuristic process discovery methods have been proposed to cope with less structured processes and the presence of noise in the event log. However, (1) a large parameter space needs to be explored, (2) several of the many available heuristics can be chosen from, (3) data attributes are not used for discovery, (4) discovered models are not visualized as described in literature, and (5) existing tools do not give reliable quality diagnostics for discovered models. We present the interactive Data-aware Heuristics Miner (iDHM), a modular tool that attempts to address those five issues. The iDHM enables quick interactive exploration of the parameter space and several heuristics. It uses data attributes to improve the discovery procedure and provides built-in conformance checking to get direct feedback on the quality of the model. It is the first tool that visualizes models using the concise Causal Net (C-Net) notation. We provide a walk-through of the iDHM by applying it to a large event log with hospital billing information.</p>
      </abstract>
      <kwd-group>
        <kwd>Process Discovery</kwd>
        <kwd>Heuristics Miner</kwd>
        <kwd>Data-aware Processes</kwd>
        <kwd>Interactive</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Data on the execution of processes is recorded by information systems that are used
to support processes, i.e., events and contextual data are stored for the execution of
activities. Process discovery methods use this recorded execution data (i.e., event logs)
to automatically infer a process model. Both less structured processes and the presence
of noise in the event log (i.e., out-of-order events, missing events, infrequent variants
etc.) are challenging problems for process discovery methods [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        Several heuristic process discovery methods have been proposed to cope with these
challenges and provide insights into processes at the desired level of detail [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. Most
heuristic methods are based on frequency-based measures that determine the strength
of dependencies between any two activities based on the observations in the event log.
Examples are the Fuzzy Miner (FM) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the Flexible Heuristics Miner (FHM) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Fodina
Miner (FM) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and the Data-aware Heuristic Miner [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Moreover, most commercial
process mining tools (e.g., Fluxicon, Lexmark, Celonis) rely on some form of heuristic
discovery. Thus, heuristic process discovery methods are widely used.
      </p>
      <p>Event Log</p>
      <p>Dependency
Relations (1)</p>
      <p>Conditional
Relations (2)</p>
      <p>Splits and
Joins (3)</p>
      <p>This paper reports on a number of improvements over shortcomings of existing
heuristic-based process-discovery tools: (1) the large parameter space of heuristic
methods needs to be explored manually; (2) several of the many available heuristics can be
chosen from; (3) data attributes (i.e., the event payload) are not used for process discovery;
(4) discovered Causal Nets (C-Nets) are not visualized as described in literature; and (5)
existing tools do not give reliable quality diagnostics for the discovered models despite
the lack of quality guarantees offered by heuristic approaches.</p>
      <p>
        We present the interactive Data-aware Heuristics Miner (iDHM), an open-source tool in
the ProM framework3. The iDHM aims to help tackling the described issues by
integrating several existing methods in a user-friendly tool. The iDHM provides an interactive
exploration of the parameter space and includes built-in conformance checking to
diagnose the quality of the discovered model. Thus, it is easier to explore a large parameter
space and directly spot conformance issues of the discovered model, e.g., deviating and
missing behavior in the event log. Furthermore, the iDHM uses data attributes of the
event log to reveal infrequent conditional dependencies. In [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we show that this helps
uncovering potentially interesting process behavior while still filtering random noise.
The discovered models are, then, visualized as Causal Nets (C-Nets), a concise graphical
notation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] with clear semantics, which includes information on split and join gateways.
Existing tools rely on tabular descriptions of the splits and joins (e.g., FHM and FM),
i.e., the exact semantics are not directly visible in the process model. Finally, the iDHM
provides a plug-in architecture that allows for new heuristics to be added.
      </p>
      <p>
        We applied the iDHM in a case study on an event log with hospital billing
information [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. In the remainder of this paper, we provide a walk-through of the iDHM using
this event log. We distinguish five steps of data-aware process discovery (Fig. 1): (1)
mining dependencies, (2) mining data-aware conditional dependencies, (3) mining
splitand join information, (4) discovering decision rules, and (5) checking conformance. Each
step can be configured with heuristics implemented as plug-ins to the iDHM.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2 Interactive Data-aware Heuristic Miner</title>
      <p>
        We present a walk-through of the iDHM using an event log obtained from the billing
process of a regional hospital in The Netherlands. Additional documentation and a
screencast of this walk-through are available at: https://fmannhardt.de/g/dhm.
Figure 2 shows the main screen of the iDHM. The required input is an event log in the
XES format. Traces from the event log may be filtered by using a SQL-like query syntax
À. The figures in this paper show the output in form of a C-Net, but other output formats
3 Available in the DataAwareCNetMiner package of ProM 6.7: http://promtools.org
such as Petri net or Data Petri net can also be chosen Á. The heuristic used for each step
of the iDHM can be configured Â. Several thresholds are used to configure the heuristics
Ã. At the bottom of the screen a legend explains the used color coding Ä. Edges can be
selected to show more information Å.
(Step 1) Discover Dependency Relations. First, the set of dependency relations is
determined. A dependency relation (a; b) between two activities represents a causal
dependency from activity a to activity b, e.g., in Fig. 2 the activity DELETE depends on a
prior execution of the activity NEW. This is depicted as a directed edge between both
activities. Only strong dependencies that exceed a configured dependency threshold and
are observed more often than a configured observation frequency threshold are included.
Several heuristics have been proposed to discover dependency relations and their strength.
We allow four choices: the FHM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the Alpha Miner [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], the FM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and a combination
of these three miners by taking the average dependency value obtained. In Fig. 2, we
used the FHM method. New methods can be added as plug-ins to the iDHM.
(Step 2) Discover Conditional Dependency Relations. Second, the data perspective is
taken into account when discovering the control flow of a process. Classification
techniques are used to reveal data dependencies between activities. Those data dependencies
are used to distinguish random noise from infrequent conditional dependencies as
described in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. This greatly extends existing techniques, which are solely based on the
control-flow perspective and, hence, would disregard such infrequent behavior as noise.
Conditional dependencies may provide insights for process analysts since they could
indicate, e.g., workarounds and deviations from normal process behavior. In Fig. 2, e.g., the
dependency between FIN and the end of the process Å is a conditional dependency that
was discovered based on the data condition closeCode = H with a quality score of 0.98.
Conditional edges are highlighted in red. Further statistics on the discovered condition
can be obtained by right-clicking on the edge. We only include conditional dependencies
for which the quality of the underlying data condition exceeds the configured condition
threshold. We implemented two plug-ins, one that uses Cohen’s kappa as proposed in [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
and one that uses F1-score as quality measure.
(Step 3) Discover Split- and Join Information Third, the split- and
join information of the C-Net, i.e., its input- and output bindings,
need to be discovered to obtain a model with precise semantics.
Bindings are visualized as dots on the edges of the C-Net. Disconnected
dots represent exclusive splits (XOR) and connected dots represent
parallel splits (AND). In Fig. 3, after executing RELEASE an
exclusive choice between either BILLED or CODE OK is modeled,
i.e., the binding dots are not connected with each other. Currently
the heuristic proposed by the FHM in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is implemented and only
frequent bindings exceeding the bindings threshold are displayed. In
future work, other heuristics may be added, such as the structuring
approach presented in [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
(Step 4) Discover Decision Rules. Fourth, decision rules that
determine which of the output bindings may be activated are discovered.
      </p>
      <p>
        Bindings with attached decision rules, i.e., guarded bindings, are de- Fig. 3. Binding
frepicted by a double border. In Fig. 3 decision rules could be discovered quencies, guarded
for both output bindings of RELEASE. The binding from RELEASE bindings and
conto CODE OK is activated only for cases with the caseType attribute formance
informavalues A,B, and D. The binding from RELEASE to BILLED is ac- tion.
tivated only for the values C, F, and G. Additional information on the decision rule can
be accessed by right-clicking on the binding. Decision rules may be filtered based on
a decision-rule quality threshold. We implemented two decision mining methods, one
based on C4.5 decision trees and one based on overlapping decision rules [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ].
(Step 5) Check Conformance of the Model. Finally, we apply conformance checking
techniques on the C-Net and the event log to project frequencies on the activities and bindings.
Activities and bindings are color-coded based on their frequency of occurrence according
to the event log. The size of the binding dots also scales with their frequency of occurrence.
In Fig. 3 the binding from RELEASE to CODE OK occurs much more often than the
binding from RELEASE to BILLED. Moreover, conformance problems are diagnosed: some
events regarding the activity BILLED are not represented in the model (yellow bar) and
very few executions of BILLED are not observed in the log (purple bar). Additionally, we
add a small circle of the respective color to the top left of the activity to indicate the
presence of conformance issues. Several heuristic and exact approaches are possible. Currently,
we implement both the nearest activity heuristic introduced by the FHM [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] and an optimal
multi-perspective alignment [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] based on a Data Petri net representation of the C-Net. A
possible addition in future work would be, e.g., the heuristic execution semantics proposed
in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] or approaches that approximate an optimal alignments to provide faster feedback.
3
      </p>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>
        We presented the iDHM, an interactive data-aware heuristic process discovery tool that
is integrated with the ProM framework. The tool integrates several available heuristics
and is envisioned as platform for future research on new heuristics. We showed how the
iDHM improves upon several shortcomings of academic heuristic process discovery tools.
Its modularity and interactive UI allow for a quick exploration of the parameter space (1)
and the available heuristics (2). It makes use of data attributes to improve the discovery of
the control-flow (3). The result is visualized as C-Nets with clear semantics (4). Finally,
built-in conformance checking directly reveals quality problems of the models discovered
(5). We successfully tested the iDHM on several real-life event logs with up to 6 million
events, which confirms that it has reached a high degree of maturity. In this paper, we
applied it to an event log of a hospital billing process with more than 450,000 events [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. A
limitation is the dependence on an automatic layout based on the GraphViz software. Very
complex C-Nets are unlikely to be readable and it is not yet possible to manually adjust
their layout. As future work we plan to further improve its usability and enrich the C-Net
notation with perspectives different than control-flow. We would also like to investigate the
understandability of C-Nets among users and aim to add additional heuristics as plug-ins.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          : Process Mining - Data Science in Action,
          <source>Second Edition</source>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Günther</surname>
            , C.W., van der Aalst,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Fuzzy mining - adaptive process simplification based on multi-perspective metrics</article-title>
          .
          <source>In: BPM 2007</source>
          .
          <article-title>Volume 4714 of LNCS</article-title>
          ., Springer (
          <year>2007</year>
          )
          <fpage>328</fpage>
          -
          <lpage>343</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Weijters</surname>
            ,
            <given-names>A.J.M.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ribeiro</surname>
          </string-name>
          , J.T.S.:
          <article-title>Flexible Heuristics Miner (FHM)</article-title>
          .
          <source>In: CIDM</source>
          <year>2011</year>
          ,
          <article-title>IEEE</article-title>
          , IEEE (
          <year>2011</year>
          )
          <fpage>310</fpage>
          -
          <lpage>317</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4. vanden Broucke,
          <string-name>
            <given-names>S.K.</given-names>
            ,
            <surname>Weerdt</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.D.</surname>
          </string-name>
          :
          <article-title>Fodina: A robust and flexible heuristic process discovery technique</article-title>
          .
          <source>Decis Support Syst</source>
          (
          <year>2017</year>
          )
          <article-title>(to appear).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Mannhardt</surname>
          </string-name>
          , F.,
          <string-name>
            <surname>de Leoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reijers</surname>
          </string-name>
          , H.A.,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Data-driven process discovery - revealing conditional infrequent behavior from event logs</article-title>
          .
          <source>In: CAiSE 2017</source>
          .
          <article-title>Volume 10253 of LNCS</article-title>
          . (
          <year>2017</year>
          )
          <fpage>545</fpage>
          -
          <lpage>560</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Mannhardt</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <string-name>
            <surname>Hospital Billing - Event Log</surname>
          </string-name>
          . Eindhoven University of Technology. Dataset. (
          <year>2017</year>
          ) https://doi.org/10.4121/uuid:
          <fpage>76c46b83</fpage>
          -c930
          <string-name>
            <surname>-</surname>
          </string-name>
          4798
          <string-name>
            <surname>-</surname>
          </string-name>
          a1c9-4be94dfeb741.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Augusto</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Conforti</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dumas</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosa</surname>
            ,
            <given-names>M.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bruno</surname>
          </string-name>
          , G.:
          <article-title>Automated discovery of structured process models: Discover structured vs. discover and structure</article-title>
          .
          <source>In: ER 2016</source>
          .
          <article-title>Volume 9974 of LNCS</article-title>
          . (
          <year>2016</year>
          )
          <fpage>313</fpage>
          -
          <lpage>329</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Mannhardt</surname>
          </string-name>
          , F.,
          <string-name>
            <surname>de Leoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reijers</surname>
          </string-name>
          , H.A.,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Decision mining revisited - discovering overlapping rules</article-title>
          .
          <source>In: CAiSE 2016</source>
          .
          <article-title>Volume 9694 of LNCS</article-title>
          ., Springer (
          <year>2016</year>
          )
          <fpage>377</fpage>
          -
          <lpage>392</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Mannhardt</surname>
          </string-name>
          , F.,
          <string-name>
            <surname>de Leoni</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reijers</surname>
          </string-name>
          , H.A.,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Balanced multi-perspective checking of process conformance</article-title>
          .
          <source>Computing</source>
          <volume>98</volume>
          (
          <issue>4</issue>
          ) (
          <year>2016</year>
          )
          <fpage>407</fpage>
          -
          <lpage>437</lpage>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>