Robust Process Mining with Guarantees

                              Sander J.J. Leemans

             Queensland University of Technology, Brisbane, Australia

   Organisations store a lot of data from their business processes nowadays.
This execution data is often stored in an event log. In many cases, the inner
workings of the process are not known to management, or, for instance, only
vaguely resemble their three-year old PowerPoint design. Gaining insights into
the process as it is executed using only an event log is the aim of process mining.
In this thesis, we address three aspects of process mining: process discovery,
conformance checking and enhancement.


Process Discovery
Once an event log has been obtained, process discovery is an essential next
step in many process mining projects. Process discovery aims to automatically
discover a process model from the event log, such that this model describes
the underlying (unknown) process well. As we do not want to assume that all
possible behaviour is included in the event log, process discovery algorithms
inherently make a trade-off between several quality criteria. For instance, fitness
denotes the fraction of the event log is described in the model, and log precision
denotes the fraction of the model is described in the event log. Simplicity denotes
whether the model needs few and simple constructs to express its behaviour.
For some event logs, process models that score well on fitness, log precision and
simplicity might not exist in the representational bias of the algorithm. Other
quality criteria of algorithms include how well a discovered model resembles the
running process that underlies the event log. Even though the process is typically
unknown, studying under which conditions a discovery algorithm rediscovers the
process allows us to compare discovery algorithms. In this thesis, we argue that
process discovery algorithms should provide several guarantees, such as that the
discovered models are free of deadlocks and other anomalies, that, depending on
the use case, fitness is perfect, and that the processes underlying the event logs
are rediscovered (the algorithm should provide rediscoverability). With these
guarantees, algorithms become more robust to real-life behaviour and larger
event logs.
    In order to ease generalising over the behaviour in an event log, many process
discovery algorithms use an abstraction of the behaviour in the event log, such
as directly follows graphs. We conducted a systematic study of four of these
abstractions and identified classes of models for which the abstractions uniquely
represent systems and logs. That is, we proved that any difference in behaviour
is detectable in the abstraction. The four abstractions studied were the directly-
follows relation, the minimum self-distance relation, the concurrent-optional-or
relation and the co-occurrence relation.
    Use cases might require several different process discovery techniques, but
these techniques might offer similar guarantees and require similar proofs. There-
fore, we introduced the Inductive Miner (IM) framework, which aids algorithms
in providing several guarantees and enables algorithm designers to consider only
the most important behaviour in an event log, instead of all behaviour. The IM
framework searches recursively for the most important behaviour in an event
log. That is, the framework searches for the combination of the root process tree
operator and a proper division of the activities in the event log (a cut). If such
a cut can be found, the event log is split into several sublogs and the framework
is applied to each sublog recursively until a base case (for instance, an event log
containing only a single activity) is encountered. If no such cut can be found, a
fall through (a model allowing for a superset of the behaviour in the event log)
is returned. An algorithm that uses the IM framework only needs to provide im-
plementations for these four steps. The guarantees aided by the IM framework
are soundness, which is guaranteed for any algorithm by the use of process trees,
and fitness, log-precision and rediscoverability, for which proof obligations have
been expressed as local properties.
     Using the IM framework and the studied abstractions, we introduced sev-
eral discovery algorithms: we introduced a fitness-guaranteeing basic algorithm,
a deviating- and infrequent-behaviour handling algorithm and an incomplete-
behaviour handling algorithm. Deviating behaviour should not be possible in
the process but appears in the event log, while infrequent behaviour denotes
little-used parts of the process. Such behaviour needs to be filtered to avoid
complex models. An event log that does not fully witness a language abstraction
is incomplete, and this challenges discovery algorithms. However, we show that
in some cases guarantees can still be given. Furthermore, we introduced algo-
rithms to handle more process tree constructs, such as inclusive choices (OR)
and skipping (τ ) constructs. These algorithms highlight the flexibility of the IM
framework: one can take an existing algorithm and improve it locally with little-
impacting changes, and likely rediscoverability and perhaps fitness guarantees
will be preserved. For all these algorithms, we proved rediscoverability, using the
proof obligations identified for the IM framework.
    Some event logs contain non-atomic executions of activities: activities take
time, as the event contains the starts and ends of activity executions. To handle
such event logs, we introduced a family of algorithms that uses knowledge of non-
atomicity in the construction of a directly follows graph, but further resembles
the other algorithms: a fitness-guaranteeing basic algorithm, a deviating and
infrequent behaviour handling algorithm and an incomplete behaviour handling
algorithm. Also for these algorithms, we proved rediscoverability.
    Most of the proposed algorithms have a run time that is polynomial in the
number of activities and run quick on real-life event logs. However, the IM frame-
work requires the event log to be copied for every recursion, thus larger event
logs might be problematic. To handle larger event logs, that is, logs containing
tens of millions of events and thousands of activities, we adapted the IM frame-
work to recurse on a directly follows abstraction instead of on event logs, such
that in each recursion, only a directly follows abstraction needs to be copied
instead of an event log. This framework, the IM directly follows based (IMd)
framework, was instantiated in three algorithms: a basic algorithm, a infrequent
and deviating behaviour handling algorithm and an incompleteness handling
algorithm.
    Even though the algorithms presented in this thesis apply different strategies,
instead of searching for the entire behaviour while worrying about soundness,
the Inductive Miner framework allowed us to focus on strategies to find the
most important behaviour in an event log (that is, the root operator and root
activity partition) and have soundness and, in some cases, fitness guaranteed.
By using the IM framework in different settings and for different algorithms, we
have shown that it, and the proofs for guarantees, can be reused. Therefore, the
IM framework can be seen as a starting point for more algorithms that leverage
the algorithms and proofs we provided.


Conformance Checking

While discovering a model, process discovery algorithms might need to exclude
behaviour of the event log from the model, or include behaviour that is not
in the event log into the model, in order to obtain a model with the “right”
balance. Therefore, discovered models should be evaluated before further us-
age, for which a conformance checking technique could be used. Two types of
conformance checking techniques were addressed in this thesis: log-conformance
checking, which compares a process model to an event log and advises on their
differences, and model-conformance checking, which compares two process mod-
els. For instance, the discovered model and another model representing a refer-
ence implementation, representing a different geographical area or representing
a different time period could be compared. We identified three levels on which
conformance checking techniques provide information about the correspondence
between logs and models: a summarised measure (e.g. a fitness or precision num-
ber), information on the model level, and information on the log level. Many
existing conformance checking techniques are either not robust enough to deal
with the complexity of real-life event logs and the models discovered from these
logs, do not support all features of such discovered models, or use an abstraction
that is too coarse to capture the behaviour of logs and models well.
    Therefore, we introduced a new robust approach to conformance checking:
the Projected Conformance Checking (PCC) framework. This framework is ap-
plicable to compare event logs to models and models to models, and checks for
conformance by constructing the language of these logs and models explicitly in
DFAs and compares their behaviour to measure fitness and precision. Thus, the
PCC framework supports all regular languages, regardless of the model formal-
ism used. To avoid constructing the entire state space and consequently take a
lot of time, the PCC framework constructs DFAs of all subsets of activities of a
user-specified length in the logs and the model. This allows the PCC framework
to consider behaviour on a scale from fine-grained to very coarse, depending on
the size of the subsets. These partial measures provide insight in the locations
of deviations between the logs and models, that is average fitness and precision
can be computed for each activity and this can be visualised on the model. Fur-
thermore, the partial measures can be averaged over all subsets of activities to
provide a summarised fitness and precision measure.


Evaluation
To evaluate the introduced process discovery and conformance checking tech-
niques, we performed several experiments, using both real-life and synthetic
data.
    The new process discovery techniques were compared to 10 existing process
discovery techniques: techniques that guarantee soundness and other techniques.
First of all, we performed a scalability experiment, using several artificially gen-
erated logs of increasing size and complexity in a limited-RAM environment,
which showed that most existing techniques cannot handle the complexity and
size of logs that the algorithms of the IM framework can handle. This experiment
did not challenge the directly follows-based IMd framework, as the disk space to
generate the event logs became the bottleneck.
    In a second discovery experiment, the techniques were applied to 9 real-life
public event logs, and were compared both quantitatively and qualitatively. We
found that of the 9 real-life logs, an algorithm of the IMd framework was pareto
optimal for all 9 logs, and other Inductive Miner algorithms were pareto optimal
for 8 logs. Existing techniques achieving pareto optimality were the Evolutionary
Tree Miner [2] (8 times pareto optimal, however many activities were left out of
the models) and the Structured Miner [1] (2 times).
    In a third discovery experiment, we evaluated the new techniques on redis-
coverability challenges: we applied the algorithms to artificially generated event
logs that contained several challenges: deviating, incomplete and infrequent be-
haviour. We found that the introduced algorithms targeting these challenges
handled them better than other algorithms. For instance, the algorithms de-
signed to be robust against incompleteness of information required only half of
the information as other techniques to rediscover the intended behaviour.
    Furthermore, we evaluated the PCC framework on real-life event logs and
discovered models, and found that it is applicable to large event logs that can-
not be handled by current conformance checking techniques. Furthermore, we
found that in many cases, the measures of the PCC framework rank process
models discovered by discovery techniques similar to existing alignment-based
techniques.


Enhancement & Tool Support
Given a discovered process model and the result of a log-conformance checking
technique, a process model and an event log can be enhanced with additional
information. We addressed four types of information to enhance models and
event logs: deviations, frequency, performance and animation. Performance in-
formation, for instance the sojourn, waiting, queueing and service times, enables
analysts to discover time-consuming activities in the process. For log animation,
the event log is visually replayed on the model: each case can be visually tracked
as it traverses the process, which enables the detection of changes in the process
(concept drift) and bottlenecks. Queues might determine the majority of waiting
times in a business process, so analysing queues might reveal bottlenecks. Devi-
ations between log and model, that is, log moves and model moves, are essential
to evaluate a model and should be considered before drawing conclusions about
a process using a model.
    We introduced a software tool, the Inductive visual Miner (IvM), which per-
forms several steps, all fully automated. First, IvM discovers a process model
using one of the newly introduced process discovery algorithms of the IM frame-
work. Second, it performs conformance checking, that is, computes an alignment,
between the event log and the discovered model. Third, it enhances the model
based on this alignment with the four described types of information: options
include performance information on the model and the event log, animation on
the model, and deviations projected on both event log and model.
    The IvM combines the strong points of commercial products with strong
points of academic software. For instance, commercial products offer ease-of-
use and practical applicability, while academic software provides reliability and
semantic results that allow an analyst to validate any gained insights. Using IvM,
analysts can explore the event log by repeatedly discovering a model (which
is guaranteed sound, and potentially is fitting and language equivalent to the
system), evaluate this model to ensure its validity, filter the event log and enhance
it with performance information. Inductive visual Miner shows that it is possible
to use powerful techniques with formal guarantees in a user-friendly package,
and we hope that IvM will inspire commercial vendors to consider models with
executable semantics and support deviation analysis.1


References
1. Augusto, A., Conforti, R., Dumas, M., Rosa, M.L., Bruno, G.: Automated discov-
   ery of structured process models: Discover structured vs. discover and structure.
   In: Conceptual Modeling - 35th International Conference, ER 2016, Gifu, Japan,
   November 14-17, 2016, Proceedings. pp. 313–329 (2016)
2. Buijs, J.C.A.M.: Flexible Evolutionary Algorithms for Mining Structured Process
   Models. Ph.D. thesis, Eindhoven University of Technology (2014)


1
    Update April 2018: commercial parties, such as Celonis, are starting to support
    deviation analyses.