Robust Discovery of Complex Process-Structures
                                                           (Extended Abstract)

                                                                Lisa Luise Mannel
                                        Chair of Process and Data Science (PADS)
              Department of Computer Science, RWTH Aachen University, Ahornstr. 55, 52074, Aachen, Germany
                                               mannel@pads.rwth-aachen.de


   Abstract—More and more companies gather and analyze the                                               TABLE I
data generated by their processes. An often crucial step within               OVERVIEW OF THE DESIRED PROPERTIES OF THE DEVELOPED ALGORITHM
this analysis is the discovery of the process based on the collected
data. Unfortunately, there are still real-live scenarios where algo-             Representational Bias         Frequencies & Noise        Guarantees
rithms developed for this purpose fail to meet expectations. The
research project presented in this paper develops the discovery                Precise Petri nets:             Robustness:               Reliability:
                                                                               1) long-term dependencies       1) filter out noise       1) fitness
algorithm eST-Miner, which introduces new concepts to address
                                                                               2) self-loops                      (errors, deviations)   2) precision
well-known challenges and aims to flexibly balance the strengths               3) silent transitions (skips,   2) retain infrequent      3) rediscover-
and weaknesses known from existing approaches.                                    loops, etc)                     behaviour                 ability
   Index Terms—process discovery, Petri nets, implicit places                  4) duplicate transitions           (rare but
                                                                                  (counting, process              well-defined)
                         I. I NTRODUCTION                                         variations)

   In process discovery, a process model is constructed aiming
to reflect and summarize behavior defined by the example
executions in a given event log. This task is challenging for                 in-depth explanations deferred to the corresponding papers.
a variety of reasons. Ideally, a discovery algorithm returns                  Future research is described in more detail in Section IV.
a model which is able to produce the important behavior                       Finally, Section V positions the research project with respect
represented within the event log (fitness) and at the same                    to the state of the art and concludes the paper.
time does not allow for much unobserved behavior (precision),                    The research methodology is guided by the principles of
while remaining simple enough to be understood by a human                     Design Science Research [1]. For each subgoal, the developed
interpreter. In real life logs, noise (errors, deviations) often              solutions are formalized and implemented. Theoretical proofs
further complicates this task. It is rarely possible to fulfill               of correctness and applicability are complemented by practical
all requirements simultaneously. Based on the capabilities and                experiments based on adequate artificial examples as well as
focus of the used algorithm, the discovered models can vary                   real-life data. Finally, the results are compared to existing
greatly, and different trade-offs are possible.                               algorithms to evaluate the contribution.

                        II. R ESEARCH G OAL
                                                                                                 III. C OMPLETED R ESEARCH
   The presented research project aims to develop a discovery
algorithm that is able to reasonably balance the different                       To allow for accurate process models, the chosen repre-
quality aspects described above and additionally provides the                 sentational bias (Table I) includes long-term dependencies, as
flexibility to the users to skew the algorithm’s behavior to fit              illustrated by the places p9 and p10 in Figure 1. The basic
their needs. In particular, it focuses on the discovery of models             idea of the eST-Miner presented in [2] is to evaluate all
with high precision and guaranteed minimal fitness, which are                 possible places in an efficient way using a noise parameter τ ,
not overfitting to include noise but still represent well-defined             and to return the places fitting a fraction of traces based
infrequent behavior. Additionally, the algorithm should main-                 on τ . Therefore, it can reliably find frequent dependencies,
tain reasonable model simplicity and computational efficiency.                including long-term dependencies, without being hampered by
Figure 1 shows an example of a model, which the developed                     noise. Heuristic approaches based on log characteristics can
discovery algorithm should return based on the given log, even                be utilized to immensely increase efficiency further, with very
if noise were added.                                                          minor drawbacks to fitness and precision [3]. A variant of the
   To achieve this overall research goal, several subgoals and                algorithm discovers the novel class of uniwired Petri nets [3],
desired properties have been defined as shown in Table I. In                  further boosting efficiency and simplifying the returned models
the following these goals will be explained and motivated.                    while still being able to discover long-term dependencies. Fi-
Section III summarizes the results achieved so far, with                      nally, to retain only meaningful places, the eST-Miner removes
                                                                              implicit places, i.e., places that do not further restrict the
  I would like to thank my doctoral advisor Prof. Wil van der Aalst for his
constant support. We thank the Alexander von Humboldt (AvH) Stiftung for      models behavior. The novel approach introduced in [4] uses
supporting our research.                                                      the even log as an additional source of information.

  Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).
                                                                                       model is build and adapted to comply with traces that become
                                                                                       available over time. Another possible extension would be the
                                                                                       evaluation of places with respect to a partial orders rather than
                                                                                       single traces [6], which could be applied, for example, in the
                                                                                       context of process discovery on uncertain event data [7].
Fig. 1. The model above can be considered a desirable result for the event log
L = [hI, a, c, c, c, d, i104 , hI b, c, c, c, e, i90 , hI a, c, c, c, d, f, i17 ,                V. R ELATED W ORK AND S UMMARY
hI b, c, c, c, e, f, i21 ], but is hard to discover for existing algorithms.
                                                                                          The combination of properties and capabilities listed in
                                                                                       Table I and detailed in the previous sections set the eST-Miner
                       IV. P LANNED R ESEARCH                                          apart from existing discovery approaches. Algorithms based
   Considering the research goals in Table I, several directions                       on simple log abstractions, like the well-known Inductive
remain open for investigation. The present lack of silent and                          Miner [8] and Alpha Miner [9], lack the ability to discover
duplicate transition labels in the returned models severely                            complex control-flow structures and are thus limited in their
hampers the attainability of high precision values. Certain                            precision. Region-based approaches [10], [11] guarantee high
common log structures, some of which are illustrated in                                precision only for models with unique transition labels, and
Figure 1, cannot be expressed by Petri nets restricted to                              have issues with overfitting as well as time and space re-
uniquely labeled transitions without resorting to flowerlike                           quirements. Heuristic and genetic approaches do not provide
submodels. Examples are activities that can either happen once                         guarantees and returned models are often hard to interpret.
or be skipped (f ), or activities that happen at least once but can                       The eST-Miner combines the ability to reliably discover
be repeated, activities that happen a certain number of times                          long-term dependencies with unique noise handling capabili-
(c), or logs that include data on very dissimilar subprocesses                         ties and can provide guarantees with respect to fitness, preci-
that would result in completely different submodels. The                               sion and rediscoverability. Future projects will investigate the
identification of such patterns and their inclusion in the model                       targeted addition of silent and duplicate transitions to improve
by the adequate use of non-unique transitions labels is bound                          fitness and precision, optimize the heuristic approaches and
to significantly increase the models precision.                                        explore the adaption of the concept to other application areas.
   Noise, and its differentiation from infrequent behavior, is                         Results will be evaluated theoretically and experimentally with
often an issue in real life logs. The eST-Miner aims to achieve                        respect to the formulated goals as well as previous solutions.
robustness to noise by requiring a place to replay only a certain                      The existing extensions, variants and parameters allow to
fraction of traces to be considered fitting. This allows the                           flexibly tailor the basic concept of the eST-Miner towards
algorithm to disregard patternless infrequent behavior while                           the unique needs of each applicant. The developed ideas and
rare, yet well-defined, behavior is still discovered. An issue                         concepts can be valuable in other contexts as well, and thus
caused by this approach is that the traces replayable by the                           contribute to the field of process mining as a whole.
model are limited to the intersection of the traces replayable by                                                    R EFERENCES
all places. To increase the number of replayable traces without                         [1] A. Hevner, S. March, J. Park, and S. Ram, “Design science in infor-
jeopardizing simplicity or precision, the targeted insertion of                             mation systems research,” Management Information Systems Quarterly,
                                                                                            vol. 28, pp. 75–, 03 2004.
duplicate or silent transitions will be investigated.                                   [2] L. L. Mannel and W. M. P. van der Aalst, “Finding complex process-
   The novel implicit place removal strategy introduced in [4]                              structures by exploiting the token-game,” in Application and Theory of
works well for places perfectly fitting the event log. Worth                                Petri Nets and Concurrency. Springer Nature Switzerland AG, 2019.
                                                                                        [3] L. L. Mannel, Y. Epstein, and W. M. P. van der Aalst, Improving the
further investigation is its combination with the eST-Miner’s                               State-Space Traversal of the eST-Miner by Exploiting Underlying Log
noise handling approach or, in a more general setting, with                                 Structures, 01 2020, pp. 334–347.
Petri nets that do not perfectly fit the log. In this context its                       [4] L. L. Mannel, R. Bergenthum, and W. M. P. van der Aalst, “Removing
                                                                                            implicit places using regions for process discovery,” in Proceedings of
use for repairing a given model or adapting a model within an                               the International Workshop on Algorithms & Theories for the Analysis
online discovery setting would be of interest as well.                                      of Event Data (ATAED) 2020, vol. 2625. CEUR-WS.org, pp. 20–32.
   The algorithm allows for a large variety of optional ex-                             [5] L. L. Mannel and W. M. P. van der Aalst, “Finding uniwired Petri nets
                                                                                            using eST-miner,” in Business Process Management Workshops, 2019.
tensions and applications, some of which use heuristics to                              [6] R. Bergenthum, “Firing partial orders in a petri net,” in Application and
improve the quality of the returned model or to increase                                    Theory of Petri Nets and Concurrency. Springer Int. Publishing, 2021.
efficiency. Such approaches usually employ some kind of                                 [7] M. Pegoraro, M. S. Uysal, and W. M. P. van der Aalst, “Discovering
                                                                                            Process Models from Uncertain Event Data,” Sep 2019. [Online].
ranking on the activities or candidate places. Prior research                               Available: https://publications.rwth-aachen.de/record/782564
has focused on simple rankings based on the directly and                                [8] S. Leemans, D. Fahland, and W. van der Aalst, “Discovering block-
eventually follows relations for candidate places and the                                   structured process models from event logs - a constructive approach,”
                                                                                            Application and Theory of Petri Nets and Concurrency, 2013.
occurrence patterns of transitions [3], [5]. Besides refining                           [9] E. Badouel, “On the α-reconstructibility of workflow nets,” in Applica-
these basic approaches, the use of alternative concepts can be                              tion and Theory of Petri Nets. Springer Berlin Heidelberg, 2012.
investigated, which would be reusable for several extensions                           [10] E. Badouel, L. Bernardinello, and P. Darondeau, Petri Net Synthesis, ser.
                                                                                            Text in Theoretical Computer Science, EATCS Series. Springer, 2015.
of the eST-Miner.                                                                      [11] J. M. van der Werf, B. van Dongen, C. Hurkens, and A. Serebrenik,
   The approach can be adapted towards new application                                      “Process discovery using integer linear programming,” in Applications
areas. One such option would be the online setting, where the                               and Theory of Petri Nets. Berlin, Heidelberg: Springer, 2008.


  Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0
  International (CC BY 4.0).