Applications of Extended Prefix Automata in Process
Mining
Maxim Vidgof1,∗
1
    Vienna University of Economics and Business, Welthandelsplatz 1, 1020, Vienna, Austria


                                         Abstract
                                         Process mining is a research area focusing on extracting insights from process execution data called
                                         event logs. Different areas of process mining have developed own methods and algorithms suited for
                                         specific purposes. In this PhD project, we are developing Extended Prefix Automata – a behavioral model
                                         serving as unified representation of a business process – and study its applicability in various areas
                                         of process mining. We analyze the applicability of Extended Prefix Automata for measuring process
                                         complexity, for predictive process monitoring and automated process discovery. In this paper, we report
                                         the current state and outline future plans for each of these applications.

                                         Keywords
                                         Process mining, Extended prefix automata, Process complexity, Predictive process monitoring, Process
                                         discovery


1. Introduction
Process mining is a research area focusing on the design of techniques that can automatically
provide insights into business processes by analyzing historic process execution data, referred
to as event logs [1, 2]. Over the years, different areas of process mining have developed different
tools and algorithms for specific purposes.
   This PhD project started with a study of connection between process complexity and quality
of discovered process models [3]. In this study, we developed a new process complexity metric
that would overcome the limitations of the existing ones by combining different complexity
aspects. Extended Prefix Automata were developed merely as a process representation that was
used to compute this metric. As this study was finished, however, it has not only been proven
that the newly developed complexity metric correlates with model quality metrics and can be
used to predict them, but two further research directions emerged. First, it was proposed that
the complexity metrics can be used to predict other process characteristics, such as process
performance or accuracy of predictive process monitoring. Second, Extended Prefix Automata
turned out to be a powerful and extensible technique, which made it reasonable to apply it in
other areas of process mining.
   Thus in this PhD project, the applicability of Extended Prefix Automata to various areas
of process mining is studied. This project follows the framework for algorithm engineering

BPM’22: Doctoral Consortium, September 11–16, 2022, Münster, Germany
Envelope-Open maxim.vidgof@wu.ac.at (M. Vidgof)
Orcid 0000-0003-2394-2247 (M. Vidgof)
                                       © 2022 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).
    CEUR
    Workshop
    Proceedings
                  http://ceur-ws.org
                  ISSN 1613-0073
                                       CEUR Workshop Proceedings (CEUR-WS.org)


                                                                                                         72
proposed in [4].
   This paper is structured as follows. Section 2 introduces Extended Prefix Automata. Then,
Section 3 reviews the applications related to measuring process complexity and reports the
results. Sections 4 and 5 outline the proposed applications in predictive process monitoring and
automated process discovery, respectively. Finally, Section 6 concludes the paper.


2. Extended Prefix Automata
The concept of a prefix automaton was introduced and formalized in [5] based on the transition
systems described in [6]. [6] introduced an approach to building transition systems from
event logs using 3 parameters: decision of a state (past, future or both), representation of the
information (sequence, set and multiset), and horizon (limited or in�nite). Prefix automata use
past,sequence and infinite settings since it allows to derive a transition system with the same
language as the respective event log, i.e. a behavioral representation of the log [5]. Thus, a
prefix automaton is a transition system with the states corresponding to the states of the event
log and transitions corresponding to the activities. In [6], transition systems were used for
abstraction in order to balance between overfitting and underfitting in process discovery. In [5],
prefix automata were used for conformance checking.
   Based on these concepts, we introduced Extended Prefix Automata (or EPAs) [3]. An example
of an EPA is shown in Figure 1. Same as the prefix automata, EPAs describe event logs they
are constructed from without loss of information and abstraction. However, we introduce two
important extensions. First, we map each state of an EPA to a set of events in the input event log
having the same prefix, so that every event in the log corresponds to exactly one state. Second,
we introduce a partitioning function that splits an EPA into 0 ≤ 𝑘 ≤ |𝐿| partitions, where |𝐿| is
the number of traces in the log.

                                                                               e
                                                                         f1           e1
                                                                   f
                                                                                      P1
                                         c            d            e
                                 b1,b4        c1,c4        d1,d4         e4    P2
                            b
              a    a1,a2,                                                             P3
                   a3,a4
      root                  c            b            d            f           e
                                 c2,c3        b2,b3        d2,d3         f2           e2


                                                                   e
                                                                         e3    P4

Figure 1:              Extended prefix automaton with partitions derived from the event log
⟨𝑎, 𝑏, 𝑐, 𝑑, 𝑓 , 𝑒⟩, ⟨𝑎, 𝑐, 𝑏, 𝑑, 𝑓 , 𝑒⟩, ⟨𝑎, 𝑐, 𝑏, 𝑑, 𝑒⟩, ⟨𝑎, 𝑏, 𝑐, 𝑑, 𝑒⟩, reused from [3]


                                               73
   As will be shown in the next sections, these extensions are helpful in various regards. The
partitioning function, for instance, is essential for computing complexity metrics using EPAs.
The mapping between the states of EPA and events is not only helpful for complexity metrics as
well, but also generally adds more transparency as for each state one can directly observe which
events are responsible for the very existence of this state and use this information on the state
level. Finally, it is also important to mention that the EPA is defined in such way that further
extensions are also possible. This feature is extremely useful for enabling various applications,
probably even beyond those covered in this PhD project.


3. Process Complexity
Research on process mining algorithms, including process discovery, has up to present moment
taken the perspective that better algorithms would always yield better results [3]. However, as
seen in data mining, this assumption does not necessarily hold. The characteristics of the input
data can also have influence on the output.
    In [3], we studied which process complexity metrics correlate with the quality of models
discovered by process mining algorithms. However, before doing so, we realized that existing
complexity metrics always focus on one of the three complexity aspects: size—how many events,
traces, etc. the event log contains, variation—how many variants of executing the process the
event log contains, and distance—how much these variants differ. We further observed that
if two event logs are similar in one complexity dimension, e.g. size but different on the other
dimensions, size-based complexity metrics will not capture these differences. Same applies to
other categories of complexity metrics, i.e. variation-based metrics will not capture differences
in size, etc. Thus, we first needed to construct a complexity metric that would capture all the
three dimensions simultaneously.
    We used Extended Prefix Automata and graph entropy to develop our complexity metrics.
On a side note, EPAs were in fact initially developed in order to enable such metrics. First, we
introduced variant entropy by directly applying the idea of graph entropy to EPAs. To calculate
it, we use the number of states in each partition. We also introduced normalized variant entropy
having the range from 0 to 1. However, variant entropy did not take the events, which were
one of the main extensions in the EPA, into account, and thus did non suffice to capture all
complexity dimensions, especially size. Thus, we also introduced sequence entropy. The main
difference between sequence entropy and variant entropy is that in sequence entropy we are
interested not in the number of states but in the number of events corresponding to states in
each partition. Finally, we introduced normalized sequence entropy, which is also in range [0, 1].
It is especially helpful if one wants to compare structural complexity of event logs that differ in
support or magnitude.
    We have seen that complexity of the event logs correlates with the quality metrics of the
models discovered from these logs, namely precision, fitness, F-score as well as size and control
flow complexity. Out of 20 complexity metrics used in our study only 6 correlate highly with
most model quality metrics, including our newly developed normalized sequence entropy.
    We have also observed that complexity metrics can be used to suggest the most suitable process
discovery algorithm since different complexity metrics have different impact on different process


                                               74
discovery algorithms, e.g. the quality of models produced by Evolutionary Tree Miner [7] and
Inductive Miner [8] mainly depend on average edit distance [9], while quality of Split Miner [10]
models depends on normalized sequence entropy.
   Process complexity, however, is not only influencing model quality. We have also conducted a
study [11] of how process complexity influences process performance. For this, we took multiple
event logs, split them into time periods and observed the development of process complexity
and performance over the timeframe captured in the event logs. As expected, we found a
dependence between normalized sequence entropy and throughout time. These results have
strong implications for research and practice. Previously, process complexity was evaluated
based on interviews with process stakeholders – an approach that does not scale. Our technique
allows for automated computation of process complexity based on process mining. In addition,
our metric is objective and not perceptual as the interview-based ones.
   It is also expected that process complexity will influence the accuracy of predictive process
monitoring algorithms. In the best case, we may be able to recommend an algorithm based
on complexity. We have also incorporated the concept of forgetting in our sequence entropy
metrics, such that more recent events receive more weight and add more to complexity than
the older ones. This study is, however, in its initial phase yet.


4. Predictive Process Monitoring
While initially developed to measure process complexity, EPAs have a number of features that
make it promising to apply them in other process mining areas. Among such features is the
extensibility of EPAs as well as the fact that EPAs represent the event logs without loss of
information or abstraction.
   Predictive process monitoring (PPM) is a technique that continuously provides information
(next activity, outcome, risk, remaining time, etc.) about the future of ongoing traces [12, 13].
[14] shows that prefix-based trace clustering combined with classification can accurately predict
whether some predicate will be fulfilled upon completion of a running trace. Another popular
technique in predictive process monitoring are recurring neural networks, especially Long
Short-Term Memory (LSTM) networks [15]. The latter approach, however, also brings the
common disadvantages of neural networks: long training times and/or high computational
power requirements as well as black-box nature.
   The basic idea of a PPM technique proposed in this project is to attach a separate predictor to
every state of the Extended Prefix Automaton. A predictor is a machine learning algorithm with
unified interface. Such setting, first of all, allows for flexibility in choosing the machine learning
algorithm (decision tree, nearest-neighbours, regression, neural networks), especially allowing
to have different types of algorithms used in different states of the EPA. This means that in
some states with simpler choices simpler algorithms may be used to save up computational
power for states what require more complex algorithms. Moreover, it is possible that different
algorithms of comparable complexity still perform differently for different traces, also making
it more lucrative to use different algorithms for such different cases instead of applying just
one to all traces. To summarize, EPA can act as a framework that allows to transparently use
multiple predictive algorithms.


                                                 75
   This is, however, not the only potential advantage of PPM on EPAs. Such framework uses
multiple ML models instead of one, but these models are essentially simpler and in addition can
be trained separately. This may reduce the time required for training, especially for re-training
with additional data, as it is very probable that not all models will be affected. In the best case,
this can even allow for efficient incremental training where each newly observed event can be
simultaneously incorporated into the predictive model instead of training in batches.
   Finally, EPAs are transparent by design as they show which actual events in the input log
correspond to an EPA state. Combined with simpler ML models like kNN or decision trees, this
can lead to a predictive process monitoring algorithm that is explainable by design as opposed
to black-box neural networks.
   It is expected that EPAs will primarily be used for next activity and suffix prediction. However,
given the flexibility of the approach, by merely changing the predictors one can apply such
framework for other types of predictions, e.g. categorial outcome or even remaining time
predictions.
   Preliminary evaluation on some of the BPIC logs has shown that even the most simple
algorithm ZeroR that always predicts the majority class can yield 13 to 68% accuracy in next
activity prediction depending on the event log. While this result is by no means impressive, it
provides a good baseline. It is expected that using slightly more complex but still white-box ML
algorithms can increase the accuracy without sacrificing explainability.


5. Automated Process Discovery
To close the circle, EPAs also have potential to be used in process discovery. Automated process
discovery is an area of process mining aiming to produce business process models from event
logs [1]. [6] already showed how transition systems similar to EPAs can be transformed into
Petri Nets. Thus, adapting this method for actual EPAs is trivial.
   However, without any additional modifications EPAs and process models constructed from
them are merely enumerations of observed traces, meaning excellent precision and fitness,
however no generalization and oftentimes also low readability. [6] used ”massaging” the
transition system and abstraction to overcome the latter two issues. [16] also highlights the
necessity of abstraction and proposes to use sets for state representation instead of sequences
and also to consider suffixes for automaton construction instead of prefixes.
   While these proposals will almost certainly improve process discovery, it must be noted,
however, that they destroy some crucial properties of EPAs. Thus these techniques should only
be applied as the last step and only for process discovery, and automata obtained in such way
can only be seen as an intermediary step towards a process modes but cannot be used for other
purposes described in the previous sections.
   Nevertheless, for process discovery, we plan to incorporate some of these techniques in their
exact or modified form as well as some additional steps not mentioned before. In addition, since
EPAs also include actual events from the input event log, frequency-based filtering of branches
can be incorporated.
   Preliminary evaluation on PDC 20211 data showed that the algorithm gives perfect (1.0)
1
    https://icpmconference.org/2021/process-discovery-contest/


                                                        76
negative accuracy, i.e. classifies all not-fitting traces correctly, but extremely low (0.01) positive
accuracy, i.e. it does not recognize most of the conforming traces. This effectively means it
rejects almost all test traces. Given that at current stage EPAs in their unmodified state were
used, such results are not surprising. However, as already mentioned, it is relatively simple to
add e.g. loop recognition, parallel execution recognition or exclusive choice recognition before
producing a final Petri Net. Also more advanced techniques such as automated frequency-based
or statistical significance-based filtering may increase the performance even further. With these
and probably even further modifications, process discovery with EPAs theoretically can perform
at least better than the alpha-algorithm. The upper boundaries of this method are yet to be
discovered.


6. Conclusion
In this paper, we have reported the current state and outlined the future plans of the proposed
PhD project. We introduced Extended Prefix Automata (EPAs) and gave an overview of process
mining areas where EPAs can be applied. We reported the first results of applications related to
measuring process complexity as well as outlined the planned applications in predictive process
monitoring and automated process discovery.


References
 [1] W. M. van der Aalst, Process Mining: Data Science in Action, Second Edition, Springer,
     2016.
 [2] M. Dumas, M. L. Rosa, J. Mendling, H. A. Reijers, Fundamentals of Business Process
     Management, Second Edition, Springer, 2018.
 [3] A. Augusto, J. Mendling, M. Vidgof, B. Wurm, The connection between process complexity
     of event sequences and models discovered by process mining, CoRR abs/2106.07990 (2021).
     URL: https://arxiv.org/abs/2106.07990. a r X i v : 2 1 0 6 . 0 7 9 9 0 .
 [4] J. Mendling, B. Depaire, H. Leopold, Theory and practice of algorithm engineering, CoRR
     abs/2107.10675 (2021). URL: https://arxiv.org/abs/2107.10675. a r X i v : 2 1 0 7 . 1 0 6 7 5 .
 [5] J. Munoz-Gama, J. Carmona, A fresh look at precision in process conformance, in: R. Hull,
     J. Mendling, S. Tai (Eds.), Business Process Management - 8th International Conference,
     BPM 2010, Hoboken, NJ, USA, September 13-16, 2010. Proceedings, volume 6336 of Lecture
     Notes in Computer Science, Springer, 2010, pp. 211–226.
 [6] W. M. P. van der Aalst, V. A. Rubin, H. M. W. Verbeek, B. F. van Dongen, E. Kindler,
     C. W. Günther, Process mining: a two-step approach to balance between underfitting and
     overfitting, Software and Systems Modeling 9 (2010) 87–111.
 [7] J. Buijs, B. van Dongen, W. van der Aalst, On the role of fitness, precision, generalization
     and simplicity in process discovery, in: CoopIS, Springer, 2012.
 [8] S. Leemans, D. Fahland, W. van der Aalst, Discovering block-structured process models
     from event logs containing infrequent behaviour, in: BPM Workshops, Springer, 2014.
 [9] B. T. Pentland, Conceptualizing and measuring variety in the execution of organizational
     work processes, Management Science 49 (2003) 857–870.


                                                 77
[10] A. Augusto, R. Conforti, M. Dumas, M. La Rosa, A. Polyvyanyy, Split miner: automated
     discovery of accurate and simple business process models from event logs, Knowledge
     and Information Systems (2018).
[11] M. Vidgof, B. Wurm, J. Mendling, A study of the connection between process complexity
     and performance based on digital trace data, 2022.
[12] F. M. Maggi, C. D. Francescomarino, M. Dumas, C. Ghidini, Predictive monitoring of
     business processes, in: M. Jarke, J. Mylopoulos, C. Quix, C. Rolland, Y. Manolopou-
     los, H. Mouratidis, J. Horkoff (Eds.), Advanced Information Systems Engineering - 26th
     International Conference, CAiSE 2014, Thessaloniki, Greece, June 16-20, 2014. Proceed-
     ings, volume 8484 of Lecture Notes in Computer Science, Springer, 2014, pp. 457–472. URL:
     https://doi.org/10.1007/978-3-319-07881-6_31. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 3 1 9 - 0 7 8 8 1 - 6 \ _ 3 1 .
[13] C. D. Francescomarino, C. Ghidini, F. M. Maggi, F. Milani, Predictive process monitoring
     methods: Which one suits me best?, in: M. Weske, M. Montali, I. Weber, J. vom Brocke (Eds.),
     Business Process Management - 16th International Conference, BPM 2018, Sydney, NSW,
     Australia, September 9-14, 2018, Proceedings, volume 11080 of Lecture Notes in Computer
     Science, Springer, 2018, pp. 462–479. URL: https://doi.org/10.1007/978-3-319-98648-7_27.
     doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 3 1 9 - 9 8 6 4 8 - 7 \ _ 2 7 .
[14] C. D. Francescomarino, M. Dumas, F. M. Maggi, I. Teinemaa, Clustering-based predictive
     process monitoring, IEEE Trans. Serv. Comput. 12 (2019) 896–909. URL: https://doi.org/10.
     1109/TSC.2016.2645153. doi:1 0 . 1 1 0 9 / T S C . 2 0 1 6 . 2 6 4 5 1 5 3 .
[15] N. Tax, I. Verenich, M. L. Rosa, M. Dumas, Predictive business process monitoring with
     LSTM neural networks, in: E. Dubois, K. Pohl (Eds.), Advanced Information Systems
     Engineering - 29th International Conference, CAiSE 2017, Essen, Germany, June 12-16, 2017,
     Proceedings, volume 10253 of Lecture Notes in Computer Science, Springer, 2017, pp. 477–492.
     URL: https://doi.org/10.1007/978-3-319-59536-8_30. doi:1 0 . 1 0 0 7 / 9 7 8 - 3 - 3 1 9 - 5 9 5 3 6 - 8 \ _ 3 0 .
[16] A. Adriansyah, J. Munoz-Gama, J. Carmona, B. F. van Dongen, W. M. P. van der Aalst,
     Measuring precision of modeled behavior, Inf. Syst. E Bus. Manag. 13 (2015) 37–67. URL:
     https://doi.org/10.1007/s10257-014-0234-7. doi:1 0 . 1 0 0 7 / s 1 0 2 5 7 - 0 1 4 - 0 2 3 4 - 7 .


                                                         78