1. Introduction

A. Küsters);

Process Discovery Applications

Aaron Küsters

Wil M.P. van der Aalst

wvdaalst@pads.rwth-aachen.de 0

Process Discovery

Process Mining

Process Models

Petri Nets

0 Chair of Process and Data Science (PADS), RWTH Aachen University , Germany

2023

000 0 0002

The Alpha algorithm was the first process discovery algorithm that was able to discover process models with concurrency based on incomplete event data while still providing formal guarantees. However, as was stated in the original paper, practical applicability is limited when dealing with exceptional behavior and processes that cannot be described as a structured workflow net without short loops. This paper presents the Alpha+++ algorithm that overcomes many of these limitations, making the algorithm competitive with more recent process mining approaches. The diferent steps provide insights into the practical challenges of learning process models with concurrency, choices, sequences, loops, and skipping from event data. The approach was implemented in ProM and tested on various publicly available, real-life event logs.

1. Introduction

The original Alpha algorithm was developed over twenty years ago [ 1, 2 ]. The goal of the algorithm was to show the challenges related to discovering process models with concurrency from example traces. It was formally proven that, a process modeled as a structured workflow net without short loops, can be rediscovered from an event log that is directly-follows complete [ 2 ]. Despite this remarkable theoretical result, the Alpha algorithm has limited practical relevance for two main reasons: • The original algorithm did not attempt to filter out infrequent behavior. Since exceptional behavior is not separated from frequent behavior, it is generally impossible to uncover structure from real-life event logs. • The original algorithm assumed that the process can be modeled as a free-choice Petri net with unique visible activity labels. Most real-life processes can not be modeled as a structured workflow net without short loops and unique visible labels.

These limitations were already acknowledged in the papers proposing the algorithm, e.g., the focus of [ 2 ] was on showing the theoretical limits of process discovery based on directly-follows complete event logs. Many of the later process discovery approaches use these insights. Various Algorithms and Theories for the Analysis of Event Data (ATAED’23), 2023, Portugal nEvelop-O extensions of the Alpha algorithm have been proposed, e.g., [ 3 ] extends the core algorithm to deal with long-term dependencies, and [ 4 ] extends the core algorithm to deal with invisible activities (e.g., skipping). Region-based process-discovery approaches provide formal guarantees. State-based regions were introduced by Ehrenfeucht and Rozenberg [ 5 ] in 1989 and generalized by Cortadella et al. [ 6 ]. In [ 7 ], it is shown how these state-based regions can be applied to process mining by first creating a log-based automaton using diferent abstractions. In [ 8, 9 ], refinements are proposed to tailor state-based regions toward process discovery. Languagebased regions work directly on traces without creating an automaton first; see, for example, the approaches presented in [ 10, 11, 12 ].

Variants of the Alpha algorithm and the region-based approaches have problems dealing with infrequent behavior and are rarely used in practice. The region-based approaches are also infeasible for larger models and logs. Approaches such as the eST-Miner [ 13 ] and the diferent variants of the inductive miner [ 14, 15 ] aim to provide formal guarantees but can also handle infrequent behavior. Variants of the inductive miner have also been implemented in various commercial systems (e.g., Celonis). The so-called split-miner uses a combination of approaches to balance recall and precision [16].

The goal of this paper is to go back to the original ideas used by the Alpha algorithm and make the algorithm work in practical settings. The result is the Alpha+++ algorithm, which, not only extends the core algorithm, but also removes problematic noisy activities, adds invisible activities, repairs loops, and post-processes the resulting Petri net. The approach uses a broad combination of novel ideas, making the Alpha algorithm competitive when compared with the state-of-the-art. The ideas incorporated in the Alpha+++ algorithm may also be used in combination with other approaches (e.g., identifying problematic activities and introducing artificially created invisible activities).

The remainder of this paper is organized as follows. Section 2 introduces event logs, directlyfollows graphs, and the original Alpha algorithm. Section 3 describes the Alpha+++ algorithm. The algorithm has been implemented in ProM (cf. Section 4) and evaluated using various event logs (cf. Section 5). Section 6 concludes the paper.

2. Preliminaries 2.1. Event Logs

Process mining starts from event data. An event may have many diferent attributes. However, here we focus on discovering the control flow and assume that each event has a case attribute, an activity attribute, and a timestamp attribute. We only use the timestamps to order events related to the same case. Therefore, each case can be described as a sequence of activities, also called trace. An event log is a multiset of traces, as diferent cases can exhibit the same trace. Definition 1 (Event Log). Uact is the universe of activity names. A trace = ⟨ ⟩ ∈ Uact ∗ is a sequence of activities. An event log ∈ B(Uact ∗)is a multiset of traces. 1, 2, … ,

For example, 1 = [⟨, , , ⟩ 400, ⟨, , ⟩ 250, ⟨, , , ⟩ taining 656 cases with 4 diferent variants. Variant ⟨, , , ⟩ 1(⟨, , , ⟩) = 400 . 4, ⟨, , ⟩ 2] is an event log conis the most frequent one, i.e., log and () = { ∣ ∈

actMult()}for the set of activities.

We write actMult() = [ () ∣ ∈ ∧ 1 ≤ ≤ || ] for the multiset of activities in an event

2.2. Directly-Follows Graphs

node, are added additionally.

A Directly-Follows Graph (DFG) is a graph showing how often one activity is followed by another.

A DFG consists of the activities as nodes and has an arc from an activity ∈

∈

Uact if is directly followed by . Two special nodes, corresponding to a start and an end Uact to an activity Definition 2 (Directly-Follows Graph). (, =⇒), where ⊆

Uact is a set of activities and =⇒ {■ }) ∪ (▶{} × {■})i)sa multiset of arcs. ▶ is the start node and ■ is the end node.

A Directly-Follows Graph (DFG) is a pair = ∈ B(( × ) ∪ ({▶} × ) ∪ ( × =⇒ holds if and only if =⇒(, ) ≥ .

≥ The construction of a DFG from an event log is straightforward.

Note that a DFG has arc weights. Hence, =⇒ is a multiset, where =⇒(, ) denotes how often is followed by . We write =⇒ if and only if =⇒(, ) > 0 holds. Similarly, we say that Definition 3 (Constructing DFGs from Event Logs). can construct a DFG discdfg () = (, =⇒)based on the directly-follows relations of event Let ∈

B(Uact ∗)be an event log. We log , with the set of activities =

{ ∈ ∣ ∈ } where artificial start and end activities have been added. [( , +1 ) ∣ ∈ ′ ∧ 1 ≤ < ||] , where ′ = [⟨▶⟩ ⋅ ⋅ ⟨ ■ ⟩ ∣ ∈ ] and the multiset of arcs =⇒

= denotes the event log refer to the directly-follows relations in represented by =⇒ directly.

Given an event log , we can construct a DFG discdfg () = (, =⇒), and in the context of

2.3. Petri Nets

We would like to discover process models which can represent more complex control-flow structures, like choices, loops, and concurrency. Therefore, we use labeled Petri nets as a target format for process discovery. The reader is assumed to be familiar with the Petri net basics. a labeling function ∈ ↛ that cannot be observed).

Definition 4 (Labeled Petri Net).

A labeled Petri net is a tuple = ( , , , ) with a set of places , a set of transitions (where ∩ = ∅ ), a flow relation ⊆ ( × ) ∪ ( × ) , and Uact . We write () = if ∈ ∖ dom()(i.e., is a silent transition ∪ ∣ (, ) ∈ } initial and final state. define the preset of as • =

{ ∈ ∪ ∣ (, ) ∈ }

A marking is represented by a multiset of places ∈

B( ). For a node ∈ ∪ and the postset of as • = { ∈ , we . We focus on so-called accepting Petri nets, i.e., Petri nets with a defined Definition 5 (Accepting Petri Net).

An accepting Petri net is a triplet AN = ( , init , final ) B( ) is the final marking. UAN is the set of accepting Petri nets. where = ( , , , ) is a labeled Petri net,

init ∈ B( ) is the initial marking, and final ∈

The language defined by an accepting Petri net is then simply given by the set of traces marking . A firing sequence leading from to corresponding to all firing sequences that start in the initial marking ifring . Hence, the language of an accepting Petri net is a set of traces. a sequence of activities. Note that transitions that fire are mapped onto the corresponding activities. If a transition is silent (i.e., () = ), no corresponding activity is created when and end in the final is converted into a trace, i.e.,

2.4. Alpha Algorithm

A process discovery algorithm aims to discover a model from event data such that the language of the model best characterizes the example behavior seen in the event log.

Definition 6 (Process Discovery Algorithm).

A process discovery algorithm is a function disc ∈ B(Uact ∗) →UAN , i.e., based on a multiset of traces, an accepting Petri net is discovered.

The classical Alpha process discovery algorithm was introduced in [ 2 ]. To be able to better explain the extensions presented in this paper, we split the description into three main parts. From an input event log , place candidates are constructed based on the directly-follows relations of the log. The resulting set of place candidates is pruned to remove dominated candidates. Finally, the discovered Petri net is constructed.

Cnd = {(, ) ∣ ∅ ⊊ , ⊆ () ∧ ∀ ∈ ∀∈ ( =⇒ )

∧ ∀, ′∈ ( =⇒⧸ ′) ∧ ∀, ′∈ ( =⇒⧸ ′)} Candidate Building Candidate Pruning

Sel = {(1, 2) ∈Cnd ∣ ∀( ′1, ′2)∈Cnd ((1 ⊆ ′1 ∧ 2 ⊆ ′2) Petri Net Construction Let = (( , , , ), • = { ∣ ∈ ()} • = {( , (,) • = {( , ) ∣ ∈ ()} = [ ] ) ∣ (, ) ∈ Sel ∧ ∈ } ∪ {( (,) , ) ∣ (, ) ∈ Sel ∧ ∈ } ∪ {( , ) ∣ ∃ ⟨⟩ ⋅ ∈ } ∪ {( , ) ∣ ∃ ⋅ ⟨⟩ ∈ } 3. Alpha+++ In this section, we introduce the Alpha+++ process discovery algorithm based on the classical Alpha algorithm. Through certain pre-processing steps on the event log and a corresponding DFG, as well as fitness-based place filtering, this algorithm is especially well suited for real-life event logs.

The input for this process discovery algorithm is an event log . In particular, only ordered traces of activities with corresponding frequencies are required. For the main steps of the algorithm, a DFG based on the event log is used exclusively. Traces of the event log are only used for replay to remove unfitting place candidates. For simplicity, we assume that the traces of already include artificial start and end activities, in particular, we assume {▶, ■ } ⊆ () .

We introduce the steps of the algorithm in the following order: 1. Determine Activities, where the set of activities used throughout the algorithm is determined. Problematic activities are removed from the event log and artificial activities are added, resulting in a repaired event log ̂. 2. Create an Advising DFG, where an advising DFG is constructed based on the DFG corresponding to the repaired log ̂, retaining only some of the original DFG edges. 3. Candidate Building, where a set of place candidates is built based on the directly-follows relation of the activities. 4. Candidate Pruning, where through eficient multistep filtering unfit or undesirable place candidates are discarded. 5. Petri Net Construction, where a Petri net is constructed based on the activities of the event log, the added artificial activities and the remaining place candidates. 6. Post-Processing Petri Net, where the repaired event log is replayed on the Petri net to remove problematic places.

3.1. Determine Activities

First, we determine the set of activities used in the later steps. Initially, starting with the set of activities occurring in the log, we first remove problematic activities that can cause issues with discovering place candidates later on. Next, we also add artificial activities to allow discovery of place candidates for certain loop and skip constructs. 3.1.1. Removing Problematic Activities Problematic activities can significantly alter the directly-follows relations of an event log, which are used in the later steps to identify place candidates. In the most extreme case, if a problematic activity randomly occurs between any other two activities in all traces, all the directly-follows information between two other activities would be lost.

We select a subset A ⊆ () of activities to keep and remove the other problematic activities () ∖ A . There are many possible approaches to identifying problematic activities, such as calculating a problem-score per activity and considering all values above a certain threshold as problematic. For instance, for a very simple problem-score, the fraction of directlyfollows relation involving an activity which are parallel, i.e., also occur in the opposite direction, could be considered. This would, for example, allow to correctly identify the problem in the aforementioned extreme case. 3.1.2. Adding Artificial Activities Discovering Petri net constructs involving silent transitions is a non-trivial task for a DFG-based algorithm. Additionally, in later steps, we want to use traces from the log to assess the fitness of place candidates. Silent activities make calculating fitness scores significantly harder, as then token-based replay is no longer suficient and computationally expensive alignments have to be computed. As a solution, we propose adding artificial activities to traces. They are not part of the activity set of the event log and are only used to find and evaluate place candidates. In the final discovered Petri net, these artificial activities are then translated as silent transitions. This allows discovering Petri nets with silent transitions, while still retaining the advantages of token-based replay fitness evaluation during the algorithm steps. We add artificial activities for two types of constructs: Loops and Skips.

Adding artificial activities for loops is necessary, as the directly-follows relation between an end activity and a start activity of a loop can cause the discovery of problematic places. For example, consider the event log ⟲ = [⟨, , , ⟩, ⟨, , , , , , ⟩] . Clearly, this event log can be nicely expressed by a Petri net containing a loop construct, which allows repeating the activities , , . However, the directly-follows relation ==⇒⟲ prevents discovering this loop accurately, as shown in Figure 1.

We detect loop constructs based on the directly-follows relations of the input event log .

For a given threshold ∈ R+, we can define the set of detected loops:

Definition 7 (Detected Loops).

Let loops be the function that maps an event log to the set of detected loop start and end activities.

loops() = {(, ) ∈ () × () ∣ ∃ ( 1,…, )∈() ∧ ∀∈{1,…,−1} ∧ ==⇒ ∧ ≥ ∗,∈{1,…,}

( = ≥ ( ==⇒ +1 ) ≥ ==⇒ )} to a trace ′ ∈ (A ∪ A

)∗. loop

, trace ∈

The parameter determines the minimal DFG edge weight to consider when looking for loops. For example, with a threshold =1 and the event log ⟲ , we can calculate loops( ⟲ ) = {(, )}. As loop constructs can make a process model very imprecise, we do not want to falsely detect loop behavior from rather infrequent directly-follows relations. For convenience, we can also consider threshold values relative to the mean directly-follows weight.

For each detected loop endpoint pair (, ) ∈ loops(), we want to add an artificial activity artificial loop activities. Additionally, we define a transformation function which transforms a ∉ () . We write A = {loop ,

∣ (, ) ∈ loops()}to denote the set of added Definition 8 (Loop Repair Function).

Let repair into a repaired trace with added artificial loop activities. ⟲

be the function that transforms a trace repair ⟲ (, ) = ⎧⟨, loop { ⎨⟨⟩ {⎩⟨⟩ ⋅ repair

⟲ ( ′, ) , , ⟩ ⋅ repair ⟲ ( ′, ) if ∃(,)∈ loops() = ⟨, ⟩ ⋅ which artificial activities have been added to relevant traces.

This function will be later used to transform the input event log into a repaired event log, in

Next, we describe how artificial activities can assist in correctly discovering activity Skips, as shown in Figure 2.

For a directly-follows-weight threshold ∈

R+, the detected skips for event log are defined by the following function, which provides the set of activities that have been detected as being “skippable” after an activity ∈ ( A ∪ A appropriate model discovery in the rest of the algorithm, the log is repaired using a new artificial activity , ∉ () . The set of all artificial skip activities is denoted by A

This artificial skip activity is inserted everywhere in a trace of , where activity is not directly followed by an activity ∈

in (i.e., was skipped). For that, we define the following by replacing the directly-follows relation between and . directly-follows relation between and would suggest considering place candidates with poor fitness. The second log, where an artificial activity is inserted where and are skipped, mitigates this problem transformation function: repair (, ) = ⎧⟨⟩ { {⎩⟨, ⟩ ⋅ {⟨⟩ ⋅ repair ( ′, ) ⎨{⟨, skipa,B⟩ ⋅ repair (⟨⟩ ⋅ ′, ) repair ( ′, )

We can now construct a repaired event log ̂ from the input event log based on the previously identified set of detected loops loops() and skips skips(). For that, we use their corresponding artificial activity set formation functions repair ⟲

and A event log ̂. Note that ( )̂ = (A ⊍ A ⊍ A and repair to transform the input event log into a repaired =̂ [ repair (repair ⟲ (, ↾

A ), ↾ A ) ∣ ∈ ↾

A ]

3.2. Create an Advising DFG

DFG (abbreviated as aDFG) as follows: are also removed.

Next, we extract a pruned DFG from the repaired event log ̂, which ignores infrequent directlyfollows relations. This DFG is used as guidance using the following algorithm steps. Note that this step does not modify the repaired event log: The output of this step is a pruned DFG containing the activities (

)̂ as nodes. Edges between activities and are retained if their weight corresponds to at least 1% of the sum of the weights of all incoming edges to or 1% of the sum of all outgoing edges from . The value of 1% was determined as a good cutof through experimentation. In addition, edges with weights below an absolute threshold value ∈ N 0

For the repaired event log ̂ and a given DFG-weight threshold ∈

N0, we define the advising minW (, ) = 0.01 ⋅ min{ ∈( ∑ ) ̂ ̂ =⇒(, ),

∑ ∈( ) ̂ ̂ =⇒(, )} aDFG = (( )̂ , [(, ) ∈ ( )̂ 2∣ =⇒̂(, ) ≥ max {, minW (, )}])

3.3. Candidate Building

With the repaired event log and the aDFG, we can continue with building place candidates. Place candidates are composed of two sets of activities: The first set corresponds to the transitions that should add a token to this place in a Petri net. The second set corresponds to transitions that should remove a token from this place.

The set of all place candidates is given by: aDFG Cnd0 = {(1, 2) ∣ 1, 2 ⊆ ( )̂ ∧ ∀ 1∈ 1 ∀ 2∈ 2 ( 1 ==⇒ 2) aDFG ∧ ∀ 1∈ 1 ∀ 2∈ 1∖ 2 ( 1 =⧸ ⇒ 2)

aDFG ∧ ∀ 1∈ 2∖ 1 ∀ 2∈ 2 ( 1 =⧸ ⇒ 2)

aDFG ∧ ∃ 1∈ 1∖ 2 ∃ 2∈ 2∖ 1 ( 2 =⧸ ⇒ 1)}

3.4. Candidate Pruning

The set of place candidates Cnd0 includes many unfit places, which would produce process models with very low fitness. Furthermore, some place candidates might be dominated by others (e.g., the place candidate ({}, { }is)dominated by the candidate ({, }, {, })). Pruning the set of place candidates requires an eficient approach, as the number of place candidates can easily grow huge. We propose a three-step pruning approach. First, place candidates are ifltered purely based on activity counts. If the diference in frequency of the input and output activity set is relatively large, the place candidate is rather unfit. This condition can be checked very eficiently. Next, the local fitness of the place candidate is calculated based on local trace replay. Local trace replay takes the order of the activities in the traces into account, and thus can detect even more unfit place candidates. Finally, to remove dominated place candidates, we retain only maximal place candidates. 3.4.1. Balance-based Pruning: For the balance-based pruning, we consider the number of activity occurrences in the log ̂ using actMult()̂ . For a set of activities, ⊆ ( )̂ we can then sum the frequencies together (as 1co,un2t)(:, )̂ = ∑ ∈ actMult()̂ () . Based on that, we define the balance of a candidate balance(, ̂ Let fit (, ( 1, 2), )be defined as follows: ⎧1 if = ⟨⟩, = 0 { {0 if = ⟨⟩, ≠ 0 { {0 if = ⟨⟩ ⋅ ′, = 0, ∉ 1, ∈ 2 ift (, ( 1, 2), ) = {⎨fit ( ′, ( 1, 2), + 1) if = ⟨⟩ ⋅ ′, ∈ 1, ∉ 2 {{fit ( ′, ( 1, 2), − 1) if = ⟨⟩ ⋅ ′, ≥ 1, ∉ 1, ∈ 2 {⎩ift ( ′, ( 1, 2), ) if = ⟨⟩ ⋅ ′, ( ∈ 1 ∩ 2 ∨ ∉ Note that fit (, ( 1, 2), 0) = 1if the place candidate ( 1, 2)fits the trace; otherwise it takes the value 0.

The traces relevant for a place candidate ( 1, 2)are defined by the following function: 1 ∪ 2) rel( 1, 2) = [ = ⟨ 1, … , ⟩ ∈ ∣̂ ∃ ∈{1,…,} ( ∈ 1 ∨ ∈ 2)] We consider traces relevant for a place candidate, if they contain at least one activity that is in the set of outgoing or ingoing activities of that place candidate. For a single activity, we use the notation rel() ∶= rel({}, ∅)to denote the traces containing that activity.

We write fit (, ( 1, 2)) ≔fit (, ( 1, 2), 0)and ift (, (̂ 1, 2)) ≔ ∑ ∈ rel( 1, 2) ift (, ( 1, 2)) for ease of notation.

For a given local candidate fitness threshold ∈ [ 0, 1 ], the candidates remaining after the local fitness replay pruning are then given as:

mfit ( 1, 2) =min { Cnd 2 = {( 1, 2) ∈Cnd 1 ∣ ∑∈ rel() ift (, ( 1, 2))∣ ∈ |rel()|

1 ∪ 2} ift (, (̂ 1, 2))≥ ∧ mfit ( 1, 2) ≥ }

|rel( 1, 2)| 3.4.3. Maximal Candidate Selection: Finally, as the last candidate pruning step, all dominated place candidates are removed, just like in the original Alpha algorithm.

Sel = {(1, 2) ∈Cnd 2 ∣ ∀( ′1, ′2)∈Cnd2 ((1 ⊆ ′1 ∧ 2 ⊆ ′2) • = { ( 1, 2) ∣ ( 1, 2) ∈Sel} • = { ∣ ∈ (

)̂ ∖ {▶, ■ }} Sel ∧ ∈

2 ∖ {▶, ■ }} • = {( , ( 1, 2)) ∣ (1, 2) ∈Sel ∧ ∈ 1 ∖ {▶, ■ }} ∪ {((1 , 2), ) ∣ (1, 2) ∈

3.5. Petri Net Construction

Based on the remaining place candidates, an accepting Petri net is constructed as the tuple (( , , , ), • = {( , ) ∣ ∈

A } ∪ {(, ) ∣ ∈ ( A ∪ A } are the components defined using the results of the previous steps.

3.6. Post-Processing Petri Net

time when replaying on

). is given by: the post-process replay as (( ′, , Let replay(, , ) of the Petri net

be the replay function, which takes the value 1 exactly when the place can replay trace (i.e., there is no missing or remaining token in at any For a given local place replay fitness threshold ∈ [ 0, 1 ], we can then define the result of ′, ), ′ , ′ ), where the set of updated places ′ ′ = { ( 1, 2) ∣ ( 1, 2) ∈Sel ∧ ∑∈( 1, 2) |( replay(, , ) 1, 2)| ≥ } The flow relation and initial and final markings are also updated correspondingly: • ′ = {(, ) ∈ ∣ ∈ ′ ∧ ∈

′} • ′

The final accepting Petri net discovered is (( ′, ,

′, ), ′ , ′ . )

4. Implementation

We implemented the Alpha+++ algorithm as a ProM1 plugin (Java) and also created a Python implementation2 for large-scale evaluation on a variety of real-life event logs. The ProM plugin (AlphaRevisitExperiments) can be installed in ProM Nightly versions and can be used in standard mode to simply discover a Petri net or in interactive mode to experiment with diferent algorithm step options and view additional information (e.g., how many place candidates were pruned in which step). In both versions, the Alpha+++ preset can be selected out of the preset list on the top. The parameters used throughout the algorithm steps can then be changed. Additionally, the diferent algorithm steps can be swapped with alternatives or skipped, allowing for further experimentation.

5. Evaluation

To evaluate the proposed Alpha+++ algorithm ( +++), we discovered Petri nets for five real-life event logs, shown in Table 1. For comparison, we also discovered models using the Inductive Miner Infrequent (IMf) and the standard Alpha algorithm ( ). We subsequently calculated alignment-based fitness, precision and F1-scores using PM4Py 3.

For IMf, we evaluated four models per event log using noise thresholds of 0.1, 0.2, 0.3 and 0.4. For , we used four variant filtering approaches upfront: Either only selecting the 10 most common variants or the most common variants to cover at least 10%, 50% or 80% of traces. For +++, we chose artificial activity thresholds of 2 and 4 (relative to the mean directly-follows weight) for the log repair steps. Here, a lower threshold value causes more artificial activities to be added. For each artificial activity threshold, we selected five combinations of the balance , local candidate fitness and local place replay fitness thresholds. Note, that for and a value closer to 1 and for a value closer to 0 is more restrictive. We did not apply problematic activities filtering.

The evaluation results are shown in Table 2. Overall, the fitness and F1-scores of +++ are competitive compared to the IMf. 8 of the 20 models discovered with are not easy sound (i.e., no final marking is reachable), and thus no alignment scores could be computed. The remaining 12 models exhibit rather low fitness for some logs but very high precision across the board, significantly boosting the corresponding F1-values. Although our approach does not guarantee 3https://pm4py.fit.fraunhofer.de/ (Version 2.6.1) easy soundness, all 50 Petri nets discovered with +++ are easy sound and allow computation of alignments. There are notable diferences across the diferent event logs: +++ performs significantly worse compared to the IMf on the Sepsis log in terms of F1-score, caused by lower precision scores, as the models discovered with +++ seem to be underfitting. On the two BPI Challenge 2020 logs, +++ outperforms the IMf in most configurations, often also exhibiting better fitness and precision scores simultaneously.

The influence of the parameters of +++ is mostly as expected: More restrictive , , values improve the fitness of the models while decreasing the precision.

Manual inspection of the discovered models reveals that the models discovered with +++ are mostly rather simple and often consist of several disconnected model fragments. Furthermore, multiple models exhibit redundant structures involving silent transitions (e.g., a place with one labeled transition as preset and one silent transition as postset). Such constructs could be removed by further post-processing of the Petri net. For further details, a comprehensive list of the discovered models, as well as simplicity and generalization results, we refer interested readers to the extended report version of this paper at [22].

6. Conclusion

In this paper, we revisited the Alpha algorithm to overcome its limitations, focusing on real-life event logs. For that, we presented the Alpha+++ algorithm which, like the Alpha algorithm, primarily uses directly-follows relations to discover Petri nets. Alpha+++ pre-processes event logs by adding artificial activities for potential loop or skip constructs. This allows discovering silent transitions while still assessing the fitness of places by easily computable token-based replay instead of expensive alignment computations. Subsequently, place candidates are generated based on a pruned DFG. A multistep candidate filtering approach eficiently removes place candidates with low fitness, configurable through parameters. We implemented the Alpha+++ algorithm both as a ProM plugin and in Python. The ProM plugin is available in ProM nightly builds and also features an interactive mode to allow experimenting with diferent algorithm steps and parameters. We evaluated the Alpha+++ on five real-life event logs and compared the results to the classical Alpha algorithm and the widely adapted Inductive Miner Infrequent. Overall, the results indicate that the Alpha+++ algorithm is competitive in terms of fitness and precision. In general, the diferent step parameter configurations tested reliably determine the trade-of between fitness and precision.

Further research should include further evaluation of the algorithm. For that, additional performance metrics like simplicity or generality could be included and also compared to other process discovery algorithms. It is particularly interesting to see if there are any patterns regarding algorithm parameters, event log properties, and model performance. Such observations could enable automatic parameter selection based on the log, and thus simplify Alpha+++ to a well-performing one-in-all algorithm. Additionally, a more comprehensive qualitative analysis of the discovered models is needed. Furthermore, the assumptions and efects of the algorithm steps should be studied, e.g., using artificial event logs with relevant control structures. Further research could also explore if any theoretical guarantees, such as easy-soundness, are attainable, e.g., using more sophisticated post-processing of the discovered Petri net. s10270- 016- 0545- x. [16] A. Augusto, R. Conforti, M. Marlon, M. La Rosa, A. Polyvyanyy, Split Miner: Automated Discovery of Accurate and Simple Business Process Models from Event Logs, Knowledge Information Systems 59 (2) (2019) 251–284. [17] M. de Leoni, F.Mannhardt, Road Trafic Fine Management Process, https://data.4tu.nl/ articles/dataset/Road_Traffic_Fine_Management_Process/12683249 (2015). doi:10.4121/ uuid:270fd440- 1057- 4fb9- 89a9- b699b47990f5. [18] F. Mannhardt, Sepsis Cases - Event Log, https://data.4tu.nl/articles/ dataset/Sepsis_Cases_-_Event_Log/12707639 (2016). doi:10.4121/uuid: 915d2bfb- 7e84- 49ad- a286- dc35f063a460. [19] B. van Dongen, BPI Challenge 2019, https://data.4tu.nl/articles/dataset/BPI_Challenge_ 2019/12715853 (2019). doi:10.4121/uuid:d06aff4b- 79f0- 45e6- 8ec8- e19730c248f1. [20] B. van Dongen, BPI Challenge 2020: Request For Payment, https://data.4tu.nl/articles/ dataset/BPI_Challenge_2020_Request_For_Payment/12706886 (2020). doi:10.4121/uuid: 895b26fb- 6f25- 46eb- 9e48- 0dca26fcd030. [21] B. van Dongen, BPI Challenge 2020: Domestic Declarations, https://data.4tu.nl/articles/ dataset/BPI_Challenge_2020_Domestic_Declarations/12692543 (2020). doi:10.4121/ uuid:3f422315- ed9d- 4882- 891f- e180b5b4feb5. [22] A. Küsters, W.M.P. van der Aalst, Revisiting the Alpha Algorithm To Enable Real-Life Process Discovery Applications – Extended Report (2023). arXiv:2305.17767.

[1] W.M.P. van der Aalst , B.F. van Dongen , Discovering Workflow Performance Models from Timed Logs , in: Y. Han,

Tai , D. Wikarski (Eds.), International Conference on Engineering and Deployment of Cooperative Information Systems (EDCIS 2002) , Vol. 2480 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 2002 , pp. 45 - 63 .

[2] W.M.P. van der Aalst , A. J.M.M. Weijters , L. Maruster, Workflow Mining: Discovering Process Models from Event Logs , IEEE Transactions on Knowledge and Data Engineering 16 ( 9 ) ( 2004 ) 1128 - 1142 .

[3]

Wen , W.M.P. van der Aalst , J. Wang , J. Sun , Mining Process Models with Non-Free- Choice

Constructs

, Data Mining and Knowledge Discovery 15 ( 2 ) ( 2007 ) 145 - 180 .

[4]

Wen ,

Wang , W.M.P. van der Aalst , B.

Huang , J.

Sun , Mining Process Models with Prime Invisible Tasks, Data and Knowledge Engineering 69 ( 10 ) ( 2010 ) 999 - 1021 .

[5]

Ehrenfeucht , G. Rozenberg, Partial (Set) 2-Structures - Part 1 and Part 2, Acta Informatica 27 ( 4 ) ( 1989 ) 315 - 368 .

[6]

Cortadella ,

Kishinevsky ,

Lavagno ,

Yakovlev , Deriving Petri Nets from Finite Transition Systems, IEEE Transactions on Computers 47 ( 8 ) ( 1998 ) 859 - 882 .

[7] W.M.P. van der Aalst , V.

Rubin , H.M.W.

Verbeek , B.F. van Dongen ,

Kindler ,

C.W.

Günther , Process Mining: A Two-Step Approach to Balance Between Underfitting and Overfitting , Software and Systems Modeling 9 ( 1 ) ( 2010 ) 87 - 111 .

[8]

Carmona ,

Cortadella ,

Kishinevsky , A Region-Based Algorithm for Discovering Petri Nets from Event Logs, in: Business Process Management (BPM 2008 ), 2008 , pp. 358 - 373 .

[9]

Solé ,

Carmona , Process Mining from a Basis of State Regions , in: J. Lilius , W. Penczek (Eds.), Applications and Theory of Petri Nets 2010 , Vol. 6128 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 2010 , pp. 226 - 245 .

[10]

Bergenthum ,

Desel ,

Lorenz ,

Mauser , Process Mining Based on Regions of Languages , in: G. Alonso,

Dadam , M. Rosemann (Eds.), International Conference on Business Process Management (BPM 2007 ), Vol. 4714 of Lecture Notes in Computer Science, Springer-Verlag, Berlin, 2007 , pp. 375 - 383 .

[11] J.M.E.M. van der Werf , B.F. van Dongen ,

C.A.J.

Hurkens ,

Serebrenik , Process Discovery using Integer Linear Programming , Fundamenta Informaticae 94 ( 2010 ) 387 - 412 .

[12] S.J. van Zelst , B.F. van Dongen , W.M.P. van der Aalst , H.M.W Verbeek, Discovering Worklfow Nets Using Integer Linear Programming , Computing 100 ( 5 ) ( 2018 ) 529 - 556 .

[13]

L.L.

Mannel , W.M.P. van der Aalst , Discovering Process Models with Long-Term Dependencies While Providing Guarantees and Handling Infrequent Behavior , in: L. Bernardinello , L. Petrucci (Eds.), Application and Theory of Petri Nets and Concurrency (Petri Nets 2022 ), Vol. 13288 of Lecture Notes in Computer Science, 2022 , pp. 303 - 324 .

[14]

S.J.J.

Leemans ,

Fahland , W.M.P. van der Aalst , Discovering Block-Structured Process Models from Event Logs Containing Infrequent Behaviour , in: N. Lohmann , M. Song , P. Wohed (Eds.), Business Process Management Workshops, International Workshop on Business Process Intelligence (BPI 2013 ), Vol. 171 of Lecture Notes in Business Information Processing, Springer-Verlag, Berlin, 2014 , pp. 66 - 78 .

[15]

S.J.J.

Leemans ,

Fahland , W.M.P. van der Aalst , Scalable Process Discovery and Conformance Checking, Software and Systems Modeling 17 ( 2 ) ( 2018 ) 599 - 631 . doi: 10 .1007/