Accurate and Efficient Discovery of Process Models
          from Event Logs (Extended Abstract)

                                    Adriano Augusto1,2
                                1 University of Tartu, Estonia
                           2   University of Melbourne, Australia
                                 a.augusto@unimelb.edu.au

Keywords: automated process discovery · process mining · event log · business process
management · optimization metaheuristic · BPMN


1   Introduction

Everyday, organizations deliver services and products to their customers by enacting
their business processes, the quality and efficiency of which directly influence the cus-
tomer experience. In competitive business environments, achieving a great customer
experience is fundamental to be a successful company. For this reason, companies rely
on programmatic business process management in order to discover, analyse, improve,
automate, and monitor their business processes [8].
    One of the core activities of business process management is process discovery. The
goal of process discovery is to generate a graphical representation of a business process,
namely business process model, which is then used for analysis and optimization pur-
poses. Traditionally, process discovery has been a time-consuming activity performed
either by interviewing relevant process stakeholders, or by observing process partici-
pants in action, or by analysing process reference documentation. However, with the
diffusion of information systems and specialised software in organizational settings, a
new form of process discovery is slowly emerging, which goes by the name of auto-
mated process discovery [15].
    Automated process discovery allows business analysts to exploit process’ execution
data (recorded into so-called event logs) to automatically generate process models. Dis-
covering high-quality process models is extremely important to reduce the time spent
to enhance them and avoid mistakes during process analysis. In the past two decades,
many research studies have addressed the problem of automated discovery of business
process models from event logs [18, 11, 7, 3, 17]. Despite the richness of proposals, the
state-of-the-art automated process discovery approaches (APDAs) struggle to systemat-
ically discover accurate process models. In general, the quality of a discovered process
model depends on both the input log and the APDA that is applied. When the input
event log is simple, some state-of-the-art APDAs can output accurate and simple pro-
cess models. However, as the complexity of the input data increases, the quality of the
discovered process models can worsen quickly. Given that real-life event logs tend to
be highly complex (i.e. containing noise and incomplete event records), state-of-the-
art APDAs turn to be unreliable. In addition, some of these APDAs do not scale up to
real-life logs in terms of time performance.


Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons
License Attribution 4.0 International (CC BY 4.0).
2

    Against this backdrop, in this thesis, we address the following three research ques-
tions.
    • RQ1 - What are the state-of-the-art automated process discovery approaches, their
      strengths and limitations?
    • RQ2 - How to strike a trade-off between the various quality dimensions in auto-
      mated process discovery in an effective and efficient manner?
    • RQ3 - How can the accuracy of an automated process discovery approach be effi-
      ciently optimized?


2     Summary of Contributions
In order to answer RQ1, we conducted a Systematic Literature Review (SLR) [4] through
a scientific, rigorous and replicable approach as specified by Kitchenham [10]. We for-
mulated a set of research questions to scope the search and developed a list of search
strings that we ran on different academic databases. We applied inclusion criteria to
select the studies retrieved through the search. Our SLR highlights a growing interest in
the field of automated process discovery, and confirms the existence of a wide and het-
erogeneous number of proposals. Between 2012 and 2017 more than 80 studies propos-
ing an automated process discovery approach have been published. We grouped these
studies into 34 groups, each referring to a specific APDA, and we described their fea-
tures, such as: the type of model they can discover, semantic they are able to represent,
type of implementation and available artefacts and tools, and type of evaluation. De-
spite the variety of proposals, we could clearly identify two main streams: approaches
that output procedural process models, and approaches that output declarative process
models. Furthermore, while the latter ones only rely on declarative statements to rep-
resent a process, the former provide various language alternatives, although most of
these methods output Petri nets. The predominance of Petri nets is driven by the se-
mantic power of this language, and by the requirements of the methods used to assess
the quality of the discovered process models (chiefly, fitness and precision). Finally, we
designed a benchmark framework to enable researchers to empirically compare new
APDAs against existing ones in a unified setting. Our benchmark shows that existing
APDAs are affected by one (or more) of the following three limitations when used with
their default input parameters: (i) they achieve limited accuracy; (ii) they are compu-
tationally inefficient to be used in practice; (iii) they discover syntactically incorrect
process models. To analyse the importance and the impact of the APDAs input parame-
ters, we also ran a hyper-parameter optimization evaluation. The latter highlighted that
existing APDAs can achieve better results when their input parameters are finely tuned.
However, such a tuning is expensive in terms of computational power and time, some-
times requiring up to 24 hours to discover a process model as opposed to seconds (or
minutes) when choosing default parameters.
    To address the limitations identified in our benchmark, and to answer RQ2, we pro-
posed a novel APDA, namely Split Miner [5]. Split Miner operates over six steps: DFG
and loops discovery; concurrency discovery; filtering; splits discovery; joins discovery;
and OR-joins minimization. The algorithms that implement these six steps have been
designed with the goal to strike a trade-off between the various quality dimensions in
                                                                                          3

automated process discovery in an effective and efficient manner. In other words, each
of the six steps plays a fundamental role in the discovery of a highly fitting and precise
process model, that is at the same time simple and deadlock-free. About the latter prop-
erty, we formally proved that Split Miner guarantees to discover sound acyclic process
models, and deadlock-free cyclic process models. Differently than other APDAs that
guarantee soundness [11, 7], Split Miner does not achieve such a result by enforcing
full structuredness on its discovered process models, making SM the very first3 APDA
that guarantees to discover sound (or deadlock-free) process models that are not (neces-
sarily) fully structured. Furthermore, we showed that the time complexity of Split Miner
is polynomial, guaranteeing efficiency in terms of execution times. Finally, in line with
its formal properties, the results of the empirical evaluation of Split Miner highlighted
that our approach performs better than the state-of-the-art APDAs. In particular, Split
Miner achieved with statistical significance higher F-scores (of fitness and precision)
than state-of-the-art APDAs. This is achieved by balancing fitness and precision, while
maintaining a low complexity of the discovered process models. The empirical evalu-
ation also highlighted that despite SM cannot guarantee soundness for cyclic process
models, SM is still able to discover sound cyclic process models. Lastly, SM proved to
be the fastest APDA, discovering process models in less than a second regardless of the
size of the input event logs.
     As in the majority of existing APDAs, Split Miner also requires input parameters.
Such parameters can be optimized to achieve considerable improvements in the quality
of the process models produced. As already noted, the quality improvement APDAs
achieve via hyper-parameter optimization comes at the cost of longer execution times
and higher computational requirements. This inefficiency is the direct consequence of
two different problems: (i) the low scalability of the accuracy measures, especially pre-
cision; and (ii) a brute-force exploration of the solution space. By addressing both prob-
lems we can tackle RQ3.
     Over the past decade, several measures of fitness and precision have been proposed
and empirically evaluated in the literature [9]. Existing precision measures (and to a
lesser extent also fitness measures) suffer from scalability issues when applied to mod-
els discovered from real-life event logs. In addition, Tax et al. [14] have shown that none
of the existing precision measures fulfils a set of five intuitive properties, namely preci-
sion axioms. These axioms were then revised by Syring et al. [13], who also proposed
a set of desirable properties of fitness measures, namely fitness propositions, showing
that only the latest fitness and precision measures designed by Polyvyanyy et al. [12]
can fulfil the proposed properties.
     In this thesis, we designed a family of fitness and precision measures [2] based on
the idea of comparing the kth -order Markovian abstraction of a process model against
that of an event log. We showed that the proposed measures fulfil all aforementioned
properties for a suitable k dependent on the log.
     While fulfilling the proposed properties is desirable, this does not guarantee that
the proposed measures provide intuitive results in practice. To validate the intuitiveness
of our measures, we compared the ranking they induce against those induced by exist-
ing fitness and precision measures using a collection of model-log pairs proposed by
3 According to our literature review and benchmark.
4

van Dongen et al. [16] (extended to cover fitness in addition to precision). For k ≥ 4,
the proposed fitness measures induce rankings that coincide with alignment-based fit-
ness [1] – a commonly used fitness measure. Meanwhile, for k ≥ 3, the proposed pre-
cision measures induce rankings consistent with those of the anti-alignment precision
measure [16], previously posited as a ground-truth for precision measures.
     A second evaluation using real-life event logs showed that the execution times of the
proposed fitness measures (for k ≤ 5) are considerably lower than existing fitness mea-
sures, except when applied to process models that contain flower structures, in which
case alignment-based fitness offers the best performance. Similarly, the execution times
of the proposed precision measures (for k ≤ 5) are considerably lower than existing
precision measures, except for models that contain flower structures.
     The comparison between the process model and the event log Markovian abstrac-
tions allows us to identify the behaviour to add to (remove from) the process model
in order to improve its fitness (precision). Indeed, such information is encoded in the
edges of the event log Markovian abstraction that cannot be found in the process model
Markovian abstraction (and viceversa), and it is independent from the value of k and
not prone to approximation by design.
     By employing our Markovian accuracy measures to efficiently explore the solution-
space of APDAs based on directly-follows graphs (DFGs), we designed an optimization
framework powered by single-solution-based metaheuristics [6], that boosts the capa-
bility of DFG-based APDAs to discover process models with higher accuracy. The core
idea of our framework is to perturb the intermediate representation of an event log used
by the majority of the available APDAs, namely the Directly-follows Graph (DFG).
Specifically, we consider perturbations that add or remove edges with the aim of im-
proving fitness or precision, and allowing the underlying APDA to discover a process
model from the perturbed DFG.
     We instantiated our framework for three state-of-the-art APDAs: Split Miner, Fo-
dina, and Inductive Miner, and using a benchmark of 20 real-life logs, we compared
the accuracy gains yielded by four optimization metaheuristics relative to each other
and relative to the hyper-parameter optimized APDA. The evaluation of our optimiza-
tion framework highlights its effectiveness in optimizing DFG-based APDAs, allowing
APDAs to explore their solution-space beyond the boundaries of hyper-parameter opti-
mization and most of the times in a faster manner, ultimately discovering more accurate
process models in less time, compared to the traditional hyper-parameters optimization.
    In order to foster reproducibility and reuse, all the artefacts designed and developed
for this thesis are publicly available as standalone open-source Java command-line ap-
plications.4 Split Miner, our core contribution to the field of automated process discov-
ery, has also been integrated into Apromore, an open-source business process analytics
platform used by academics and practitioners alike.

References
 1. A. Adriansyah, B. van Dongen, and W. van der Aalst. Conformance checking using cost-
    based fitness analysis. In EDOC. IEEE, 2011.
4 Applications available at https://apromore.org/platform/tools/
                                                                                            5

 2. A. Augusto, A. Armas Cervantes, R. Conforti, M. Dumas, M. La Rosa, and D. Reissner.
    Measuring fitness and precision of automatically discovered process models: A principled
    and scalable approach. Technical report, University of Melbourne, 2019.
 3. A. Augusto, R. Conforti, M. Dumas, M. La Rosa, and G. Bruno. Automated discovery of
    structured process models: Discover structured vs. discover and structure. In Conceptual
    Modeling: 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016,
    Proceedings, pages 313–329. Springer, 2016.
 4. A. Augusto, R. Conforti, M. Dumas, M. La Rosa, F.M. Maggi, A. Marrella, M. Mecella, and
    A. Soo. Automated discovery of process models from event logs: Review and benchmark.
    IEEE TKDE, 31(4), 2019.
 5. A. Augusto, R. Conforti, M. Dumas, M. La Rosa, and A. Polyvyanyy. Split miner: automated
    discovery of accurate and simple business process models from event logs. KAIS, 2018.
 6. A. Augusto, M. Dumas, and M. La Rosa. Metaheuristic optimization for automated business
    process discovery. In BPM. Springer, 2019.
 7. J. Buijs, B.F. van Dongen, and W.M.P. van der Aalst. Quality dimensions in process dis-
    covery: The importance of fitness, precision, generalization and simplicity. International
    Journal of Cooperative Information Systems, 23(01):1440001, 2014.
 8. M. Dumas, M. La Rosa, J. Mendling, and H.A. Reijers. Fundamentals of business process
    management (2nd Edition). Springer, 2018.
 9. G. Janssenswillen, N. Donders, T. Jouck, and B. Depaire. A comparative study of existing
    quality measures for process discovery. Information Systems, 71:1–15, 2017.
10. B. Kitchenham. Procedures for performing systematic reviews. Keele, UK, Keele University,
    33(2004):1–26, 2004.
11. S.J.J Leemans, D. Fahland, and W.M.P. van der Aalst. Discovering Block-Structured Process
    Models from Incomplete Event Logs. In Application and Theory of Petri Nets and Concur-
    rency: 35th International Conference, PETRI NETS 2014, Tunis, Tunisia, June 23-27, 2014.
    Proceedings, pages 91–110. Springer International Publishing, 2014.
12. A. Polyvyanyy, A. Solti, M. Weidlich, C. Di Ciccio, and J. Mendling. Monotone precision
    and recall measures for comparing executions and specifications of dynamic systems. CoRR,
    abs/1812.07334, 2018.
13. A.F. Syring, N. Tax, and W.M.P. van der Aalst. Evaluating conformance measures
    in process mining using conformance propositions (extended version). arXiv preprint
    arXiv:1909.02393, 2019.
14. N. Tax, X. Lu, N. Sidorova, D. Fahland, and W.M.P. van der Aalst. The imprecisions of
    precision measures in process mining. Information Processing Letters, 135, 2018.
15. W.M.P. van der Aalst. Process Mining - Data Science in Action. Springer, 2016.
16. B.F. van Dongen, J. Carmona, and T. Chatain. A unified approach for measuring precision
    and generalization based on anti-alignments. In BPM. Springer, 2016.
17. S.K.L.M. vanden Broucke and J. De Weerdt. Fodina: a robust and flexible heuristic process
    discovery technique. Decision Support Systems, 2017.
18. A.J.M.M. Weijters and J.T.S. Ribeiro. Flexible heuristics miner (FHM). In Computational
    Intelligence and Data Mining (CIDM), 2011 IEEE Symposium on, pages 310–317. IEEE,
    2011.