=Paper=
{{Paper
|id=Vol-3618/forum_paper_23
|storemode=property
|title=Discovering high-quality process models despite data scarcity
|pdfUrl=https://ceur-ws.org/Vol-3618/forum_paper_23.pdf
|volume=Vol-3618
|authors=Jan Niklas Adams,Jari Peeperkorn,Tobias Brockhoff,Isabelle Terrier,Heiko Göhner,Merih Seran Uysal,Seppe vanden Broucke,Jochen De Weerdt,Wil M.P. van der Aalst
|dblpUrl=https://dblp.org/rec/conf/er/AdamsPBTGUBWA23
}}
==Discovering high-quality process models despite data scarcity==
Discovering high-quality process models despite data scarcity Jan Niklas Adams1,∗ , Jari Peeperkorn2 , Tobias Brockhoff1 , Isabelle Terrier3 , Heiko Göhner3 , Merih Seran Uysal1 , Seppe vanden Broucke2,4 , Jochen De Weerdt2 and Wil M.P. van der Aalst1 1 Chair of Process and Data Science, RWTH Aachen University, Aachen, Germany 2 Research Center for Information Systems Engineering (LIRIS), KU Leuven, Leuven, Belgium 3 Heidelberger Druckmaschinen AG, Heidelberg, Germany 4 Department of Business Informatics and Operations Management, Ghent University, Ghent, Belgium Abstract Process discovery algorithms learn process models from executed activity sequences, describing concurrency, causality, and conflict. C oncurrent a ctivities r equire o bserving multiple permu- tations, increasing data requirements, especially for processes with concurrent subprocesses such as hierarchical, composite, or distributed processes. While process discovery algorithms traditionally use sequences of activities as input, recently introduced object-centric process discovery algorithms can use graphs of activities as input, encoding partial orders between activities. As such, they contain the concurrency information of many sequences in a single graph. In this paper, we address the research question of reducing process discovery data requirements when using object-centric event logs for process discovery. We classify different real-life processes according to the control-flow complexity within and between subprocesses and introduce an evaluation framework to assess process discovery algorithm quality of traditional and object-centric process discovery based on the sample size. We complement this with a large-scale production process case study. Our results show reduced data requirements, enabling the discovery of large, concurrent processes such as manufacturing with little data, previously infeasible with traditional process discovery. Our findings s uggest t hat o bject-centric process mining could revolutionize process discovery in various sectors, including manufacturing and supply chains. Keywords Process Mining, Process Discovery, Object-Centric Process Mining ER2023: Companion Proceedings of the 42nd International Conference on Conceptual Modeling: ER Forum, 7th SCME, Project Exhibitions, Posters and Demos, and Doctoral Consortium, November 06-09, 2023, Lisbon, Portugal ∗ Corresponding author. Envelope-Open niklas.adams@pads.rwth-aachen.de (J. N. Adams); jari.peeperkorn@kuleuven.be (J. Peeperkorn); brockhoff@pads.rwth-aachen.de (T. Brockhoff); uysal@pads.rwth-aachen.de (M. S. Uysal); seppe.vandenbroucke@kuleuven.be (S. vanden Broucke); jochen.deweerdt@kuleuven.be (J. De Weerdt); wvdaalst@pads.rwth-aachen.de (W. M.P. van der Aalst) Orcid 0000-0001-8954-4925 (J. N. Adams); 0000-0003-4644-4881 (J. Peeperkorn); 0000-0003-1115-6601 (M. S. Uysal); 0000-0002-8781-3906 (S. vanden Broucke); 0000-0001-6151-0504 (J. De Weerdt); 0000-0002-0955-6940 (W. M.P. van der Aalst) © 2023 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings http://ceur-ws.org ISSN 1613-0073 CEUR Workshop Proceedings (CEUR-WS.org) CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 1. Introduction Throughout time, business endeavors have been supported by processes, ranging from bookkeeping to production processes. With the advent of modern information systems, data traces of process executions are recorded in the underlying databases [1]. While businesses mostly have some idea of how their processes look like, analyzing the recorded data allows them to uncover their real execution. Algorithms transforming the data of process executions, i.e., event logs, into process models are called process discovery algorithms [2]. A plethora of process discovery algorithms have been proposed over the last two decades [3]. Process discovery techniques assume the existence of an extracted event log composed of a set of event sequences (cases) describing different process executions. In general, these algorithms project each event sequence to its activity sequence and aim to uncover the sequentiality/causality, concurrency, or conflict between activities. While causality and conflict can be uncovered using relatively few activity sequences, concurrency needs to be confirmed using a large set of observations. For example, four concurrent activities could manifest in 24 different sequences, and seven concurrent activities already in 5040 possible sequences. The number of sequences grows factorial with the number of concurrent activities. Therefore, larger numbers of concurrent activities require more observations for process discovery. This becomes infeasible for large numbers of concurrent activities or event logs with very few cases. Problems with highly concurrent behavior are especially present in processes that are composed of multiple, concurrently running and lightly interacting subprocesses. The amount of interaction between subprocesses is called the inter-object complexity. High inter-object complexity indicates a tight coupling between subprocesses, i.e., large amounts of shared control flow. Low inter-object complexity indicates concurrently running subprocesses. We depict an overview of typical processes and their inter-object complexity in Fig. 1. As inter-object complexity only captures control-flow complexity between subprocesses, we complement this dimension with intra-object complexity, capturing the control-flow complexity within subprocesses. We collect descriptions of typical real-life processes from the literature and depict their mapping onto these two dimensions in Fig. 1. Specifically, we show processes considered in traditional process mining (workflows like ticketing processes and business process case management [4, 5]), discrete manufacturing systems [6] (job shop/mass customization [7], (flexible) assembly lines [6]), and composite workflows (business process like P2P/O2C [8], end-to-end order processing [9]), and supply chains [10]. Additionally, we depict where a traditional, completely linear process would be positioned. While traditional event logs record cases as sequences of events, Object-Centric Event Logs (OCELs) [11] encode cases as graphs of events [12], preserving partial orders between events. By extracting the event data of compound processes as OCELs instead of traditional event logs, the concurrency between the subprocesses can be preserved using these graph-based process executions. Object-centric process discovery was recently introduced [13], using an OCEL as input and discovering a model composed of multiple interacting subprocesses. This enables process owners to discover a process directly from Figure 1: Categorization of different real-life processes based on inter- and intra-object complexity. the graph-based process executions, rather than forcing them into the sequential structure needed for traditional process discovery and removing concurrency information. In this paper, we address the following research question: How do data requirements reduce when using object-centric process discovery instead of traditional process discovery? Specifically, we want to break down the effect of employing object-centric discovery for different groups of processes with respect to their control-flow complexity between and within subprocesses, mapping the reduction in data requirements back to real-life processes. We tackle these research questions through two main contributions: First, we formally define inter- and intra-object complexity and an evaluation framework that can exactly assess the discovered model quality based on the sample size of an event log. We instantiate this evaluation framework with 45.000 model and sample size combinations and compare the quality of the discovered model between traditional discovery and object-centric discovery. We map the results back to the dimension of inter- and inter-object complexity and assess which processes have the highest data requirement reduction when using object-centric process discovery. Second, we complement this experimental evaluation with a large-scale case study. Since the experimental evaluation is very computation heavy, it is only applicable to smaller models. Our case study shows the successful application of object-centric process discovery on a large real-life manufacturing process with hundreds of activities. It also shows that traditional process discovery fails to deliver the same quality. 2. Related work Object-centric process mining addresses the problem of distinguishing multiple objects involved in a process. These objects can be used to identify subprocesses. In the past, this problem has extensively been studied from the modeling side. Different process modeling languages/notations to capture object-centric processes have been proposed, such as artifact-centric process models [14], Object-Centric Behavioral Constraints (OCBC) [15], proclets [16], DB-Nets [17], COA-Nets [18], and t-PNIDs [19]. Some of these modeling techniques have been accompanied by a data format that is able to capture event data for processes of this kind. Artifact-centric event logs [20, 21], and eXtensible Object- Centric event logs XOC [22] have been proposed. Furthermore, Esser and Fahland have recently proposed a general-purpose graph database to store object-centric event data [23]. Object-centric process mining tackles the problem of object-centricity with the modeling language of object-centric Petri nets [13] and the accompanying data format of OCELs [11]. Both of these approaches have been developed to improve the complexity and scalability issues of existing modeling and data storage formats. The core motivation of process discovery is the uncovering of as-is processes from real-life data. As such, process-discovery techniques aim to connect event data with process models. For the case of object-centric process mining, discovery techniques have been proposed for the modeling languages where data storage formats exist, namely artifact-centric discovery [20] and OCBC discovery from XOC [24]. Van der Aalst and Berti have introduced a general approach for discovering object-centric Petri nets from object-centric event data [13]. Subsequently, these process models are utilized in other data-driven process mining tasks, such as conformance checking [25] or enhancement [26]. This approach applies a traditional process discovery to each object type and merges the resulting models at interaction points. Any traditional process discovery algorithm can be employed in object-centric discovery. Traditional process discovery describes the problem of mapping a set of activity sequences to a process model [2]. Many different techniques have been proposed [3], where the Split Miner [27] and the Inductive Miner [28] remain among the most popular algorithms. In our paper, we use the inductive miner due to its guarantee of sound workflow nets for every individual object type. This paper provides a quantitative and qualitative study as to how much process discovery results can be improved by individually discovering and merging process models for each object type rather than discovering one model on the flattened event data of all object types. 3. Traditional and object-centric process discovery We introduce the foundations of object-centric and traditional event data and process discovery in this section. We link OCELs to traditional event logs and define the flattening of a traditional event log. Analogously, we introduce object-centric Petri nets and their relationship to traditional Petri nets. 3.1. Linking object-centric and traditional event data First, we introduce some notations used throughout this paper. ℰ is the universe of event identifiers, 𝒪𝒯 is the universe of object types, and 𝒪 is the universe of objects. Each object is associated with exactly one object type through 𝜋type ∶ 𝒪 → 𝒪𝒯. 𝒯 denotes the universe Figure 2: An object-centric event log (a) and its flattened counterpart (b). of event timestamps and 𝒜 the universe of event activities. We denote the powerset of a set 𝑋, the set of all subsets, with 𝒫 (𝑋 ). A sequence of length 𝑛 ∈ ℕ orders elements of a set 𝑋 and is denoted with 𝜎 ∶ {1, … , 𝑛} → 𝑋 and 𝜎 = ⟨𝑥1 , … , 𝑥𝑛 ⟩. The set of directly follows relationships in a sequence is denoted by df (⟨𝑥1 , … , 𝑥𝑛 ⟩) = {(𝑥𝑖 , 𝑥𝑖 + 1) ∣ 𝑖 ∈ {1, … , 𝑛 − 1}}. Definition 1 (Object-Centric Event Log). An object-centric event log is a tuple 𝐿 = (𝐸, 𝑂 , 𝑂𝑇 , 𝜋act , 𝜋time , 𝜋trace ) consisting of • events 𝐸 ⊆ ℰ, objects 𝑂 ⊆ 𝒪, object types 𝑂𝑇 = {𝜋type (𝑜) ∣ 𝑜 ∈ 𝑂}, • activities 𝜋act ∶ 𝐸 → 𝒜, timestamps 𝜋time ∶ 𝐸 → 𝒯, and • ordering 𝜋trace ∶ 𝑂 → 𝐸 ∗ mapping each object to a sequence of events such that ∀𝑜∈𝑂 𝜋trace (𝑜) = ⟨𝑒1 , … , 𝑒𝑛 ⟩ ∧ ∀𝑖∈{1,…,𝑛−1} 𝜋time (𝑒𝑖 ) ≤ 𝜋time (𝑒𝑖+1 ) Each event is linked to objects 𝜋obj (𝑒) = {𝑜 ∈ 𝑂 ∣ 𝑒 ∈ 𝜋trace (𝑜)}. An example of an OCEL is depicted in Fig. 2. Each row is an event associated with objects of different types, an activity, and a timestamp. Each object is associated with a sequence of events, e.g., 𝜋trace (Tire2) = ⟨𝑒2 , 𝑒7 , 𝑒10 ⟩. A traditional event log is a special case of an OCEL where objects are all of the same type and an event is associated with exactly one object. Definition 2 (Traditional Event Log). An object-centric event log 𝐿=(𝐸, 𝑂, 𝑂𝑇 , 𝜋act , 𝜋time , 𝜋trace ) is a traditional event log iff |𝑂𝑇 | = 1 ∧ ∀𝑒∈𝐸 |𝜋obj (𝑒)| = 1. An event log consists of different executions of the same process. Each execution contains multiple events. In traditional process mining, a process execution is called a case and is a sequence of events for one object. We generalize this notion such that a process execution is a graph of events for multiple, connected objects [12]. The graph describes a partial order of events induced by the precedence constraints of each 𝜋trace of the involved objects. Definition 3 (Process Executions [12]). Let 𝐿 = (𝐸, 𝑂, 𝑂𝑇 , 𝜋act , 𝜋time , 𝜋trace ) be an object- centric event log. The object graph of object co-appearances is defined by OG 𝐿 = (𝑂, 𝐼 ) Figure 3: Top: A graph-based process execution from the object-centric event log of Fig. 2 a) and the sequential process execution from the flattened, traditional event log of Fig. 2 b). Bottom: The process models discovered from both of these process executions. with 𝐼 = {{𝑜, 𝑜 ′ } ∣ 𝑜 ≠ 𝑜 ′ ∧ ∃𝑒∈𝐸 {𝑜, 𝑜 ′ } ⊆ 𝜋obj (𝑒)}. The set of interdependent ob- ject sets are the connected components of the object graph cc(𝐿) = {𝑂 ′ ⊆ 𝑂 ∣ 𝑂 ′ form a connected component in OG 𝐿 }. One set of dependent objects 𝑋 ∈ cc(𝐿) spans a process execution. A process execution is a graph 𝑝𝑋 = (𝐸𝑋 , 𝐷𝑋 ) of nodes 𝐸𝑋 = {𝑒 ∈ 𝐸 ∣ 𝜋obj (𝑒) ∩ 𝑋 ≠ ∅} and edges 𝐷𝑋 = {(𝑒, 𝑒 ′ ) ∈ 𝐸𝑋 × 𝐸𝑋 ∣ ∃𝑜∈𝑋 (𝑒, 𝑒 ′ ) ∈ df (𝜋trace (𝑜))}. The set of all process executions of an event log is defined by px(𝐿) = {𝑝𝑋 ∣ 𝑋 ∈ cc(𝐿)}. The upper left part of Fig. 3 contains an example of the object-centric process execution contained in Fig. 2. The process execution shows the production of different parts that run concurrently and are assembled in a bottom-up way. An OCEL can be transformed into the traditional event log format by flattening [11] it. This involves the choice of a case notion and the sequentializing of events for each object of that case notion. The case notion can be chosen freely, typically either a single object type is chosen or a set of connected objects. Flattening transforms a graph-based structure of process executions into a less expressive sequential structure of process execution, therefore, some information loss will never be preventable. We choose to flatten with all connected objects, i.e., flattening a process execution from a graph into a sequence. By doing this, we prevent events from disappearing or being duplicated as it would be the case if one would flatten on a single object type [11]. Definition 4 (Flattening). Let 𝐿 = (𝐸, 𝑂, 𝑂𝑇 , 𝜋act , 𝜋time , 𝜋trace ) be an object-centric event log. ′ 𝐿flat = (𝐸, 𝑂 ′ , 𝑂𝑇 ′ , 𝜋act , 𝜋time , 𝜋trace ) is the flattened event log with three modified elements: • a new, single object type 𝑂𝑇 ′ = {𝑜𝑡} for an 𝑜𝑡 ∈ 𝒪𝒯 ∧ 𝑜𝑡 ∉ 𝑂𝑇. • new object identifiers 𝑂 ′ ⊆ {𝑜 ∈ 𝒪 ∣ 𝜋type (𝑜) ∈ 𝑂𝑇 ′ } associated with the objects of the process executions cc(𝐿) through a bijection 𝜋flat ∶ 𝑂 ′ → cc(𝐿). • event sequences for the new objects 𝜋trace ′ (𝑜 ′ ) = ⟨𝑒1 , … , 𝑒𝑛 ⟩ with {𝑒1 , … , 𝑒𝑛 } = ⋃𝑜∈𝜋flat (𝑜 ′ ) 𝜋trace (𝑜) and 𝜋time (𝑒1 ) ≤ ⋯ ≤ 𝜋time (𝑒𝑛 ) for 𝑜 ′ ∈ 𝑂 ′ . We show the flattened event log in Fig. 2 b). The corresponding process execution, a sequence, is depicted in the upper right of Fig. 3. The concurrency information from the graph-based process execution is lost when flattening to a sequence. 3.2. Object-centric process models A process can be represented as a process model, typically a Petri net. Object-centric processes are represented using an object-centric Petri net [13], a Petri net with places of different types and variable arcs which are able to consume more than one token. Definition 5 (Object-Centric Petri Net). An object-centric Petri net is a tuple OCPN = (𝑃, 𝜋pt , 𝐹var ) of • a Petri net 𝑃 = (𝑇 , 𝑃, 𝐹 , 𝑙) consisting of transitions 𝑇, places 𝑃, arcs 𝐹 ⊆ (𝑇 ×𝑃)∪(𝑃 ×𝑇 ), and a labeling function 𝑙 ∶ 𝑇 ↛ 𝒜, • a type function mapping each place to an object type 𝜋pt ∶ 𝑃 → 𝒪𝒯, and • a set of variable arcs 𝐹var ⊆ 𝐹. We define the following notations for object-centric Petri nets: • the preset of a transition •𝑡={𝑝∈𝑃 ∣ (𝑝, 𝑡)∈𝐹 } for 𝑡 ∈ 𝑇. • the postset of a transition 𝑡•={𝑝∈𝑃 ∣ (𝑡, 𝑝)∈𝐹 } for 𝑡 ∈ 𝑇. • 𝑡𝑝𝑙(𝑡)={𝜋𝑝𝑡 (𝑝) ∣ 𝑝∈•𝑡∪𝑡•} are the object types associated to transition 𝑡 ∈ 𝑇. • 𝑡𝑝𝑙𝑛𝑣 (𝑡)={𝜋𝑝𝑡 (𝑝)∣𝑝∈𝑃 ∧ {(𝑝, 𝑡), (𝑡, 𝑝)}∩(𝐹⧵𝐹var )≠∅} are the object types with non-variable arcs associated to transition 𝑡 ∈ 𝑇. An example of an object-centric Petri net is depicted in the lower left of Fig. 3. Places of different types are depicted with different colors, variable arcs are depicted with double lines. Please note that we dropped the output places of an object type’s last transition for presentation purposes. Analogously to object-centric and traditional event logs, a traditional Petri net is a special case of an object-centric Petri net where all places have the same type and no variable arcs exist. The bottom right of Fig. 3 depicts a traditional Petri net. Definition 6 (Traditional Petri Net). An object-centric Petri net OCPN = (𝑃, 𝜋pt , 𝐹var ) is a traditional Petri net iff |range(𝜋pt )| = 1 ∧ 𝐹var = ∅. We describe the state of an object-centric Petri net with a marking. Definition 7 (Marking). Let OCPN = ((𝑇 , 𝑃, 𝐹 , 𝑙), 𝜋pt , 𝐹var ) be an object-centric Petri net. A token associates an object with a place. The set of tokens is define by 𝑄OCPN = {(𝑝, 𝑜) ∈ 𝑃 × 𝒪 ∣ 𝜋pt (𝑝) = 𝜋type (𝑜)}. A marking is a multiset of tokens 𝑀 ∈ ℬ(𝑄OCPN ). Given a marking, a transition can be enabled. If it is enabled, it can be executed. The execution of a transition is bound to objects. These are consumed in the input places and produced in the output places. Definition 8 (Binding Execution). Let OCPN = ((𝑇 , 𝑃, 𝐹 , 𝑙), 𝜋pt , 𝐹var ) be an object-centric Petri net. 𝐵OCPN ={(𝑡, 𝑏) ∈ 𝑇×(𝒪𝒯↛𝒫 (𝒪)) ∣ 𝑑𝑜𝑚(𝑏)=𝑡𝑝𝑙(𝑡) ∧ ∀𝑜𝑡∈𝑡𝑝𝑙𝑛𝑣 (𝑡) |𝑏(𝑜𝑡)|=1} defines the set of all possible bindings. The consumed tokens of a binding (𝑡, 𝑏)∈𝐵OCPN are defined by 𝑐𝑜𝑛𝑠(𝑡, 𝑏)=[(𝑝, 𝑜)∈𝑄OCPN ∣ 𝑝∈•𝑡 ∧ 𝑜∈𝑏(𝜋pt (𝑝))] and the produced tokens are defined by 𝑝𝑟𝑜𝑑(𝑡, 𝑏)=[(𝑝, 𝑜)∈𝑄OCPN ∣ 𝑝∈𝑡• ∧ 𝑜∈𝑏(𝜋pt (𝑝))]. A binding (𝑡, 𝑏)∈𝐵 in marking 𝑀 ∈ ℬ(𝑄OCPN ) is enabled if 𝑐𝑜𝑛𝑠(𝑡, 𝑏) ≤ 𝑀. Executing an enabled binding in 𝑀 leads to marking 𝑀 ′ =𝑀 − 𝑐𝑜𝑛𝑠(𝑡, 𝑏) + 𝑝𝑟𝑜𝑑(𝑡, 𝑏). Executing an enabled binding (𝑡, 𝑏) ∈ 𝐵OCPN in marking 𝑀 is denoted (𝑡,𝑏) with 𝑀−−−→𝑀 ′ . Multiple subsequent enabled bindings can be encoded as a binding ∗ sequence 𝜎=⟨(𝑡1 , 𝑏1 ), … , (𝑡𝑛 , 𝑏𝑛 )⟩∈𝐵OCPN . A sequence that starts in marking 𝑀0 and results 𝜎 in marking 𝑀𝑛 is encoded as 𝑀0 − →𝑀𝑛 . To define a process model that allows certain behavior, we need start and end points. These are sets of marking for an object-centric Petri net. Definition 9 (Accepting Object-Centric Petri Net). Let OCPN = ((𝑇 , 𝑃, 𝐹 , 𝑙), 𝜋pt , 𝐹var ) be an object-centric Petri net and let 𝑀init , 𝑀final ⊆ ℬ(𝑄OCPN ) be initial and final markings of the object-centric Petri net. OCPN 𝐴 = (OCPN , 𝑀init , 𝑀final ) is an accepting object-centric Petri net. All binding sequences that lead from an initial marking to a final marking define the possible end-to-end behavior of the accepting object-centric Petri net. We define the language of the accepting object-centric Petri net as the set of all possible visible loop-free label sequences in the end-to-end behavior of the object-centric Petri net. In this way, we exclude binding sequences with loops that would render the language of the model infinite. Definition 10 (Petri Net Language). Let OCPN 𝐴 = (OCPN , 𝑀init , 𝑀final ) be an ac- cepting object-centric Petri net. All loop-free valid binding sequences of the ∗ object-centric Petri net are given by 𝑆(OCPN 𝐴 ) = {⟨(𝑡1 , 𝑏1 ), … , (𝑡𝑛 , 𝑏𝑛 )⟩ ∈ 𝐵OCPN ∣ (𝑡1 ,𝑏1 ) (𝑡𝑛 ,𝑏𝑛 ) ∃𝑀𝑖 ∈𝑀init ∃𝑀𝑓 ∈𝑀final 𝑀𝑖 −−−−−→ ⋯ −−−−→𝑀𝑓 ∧ ∀𝑖,𝑗∈{1,…,𝑛}∧𝑖≠𝑗 𝑀𝑖 ≠ 𝑀𝑗 }. The language of the object- centric Petri net is the set of visible transition firing sequences Σ(𝑂𝐶𝑃𝑁𝐴 ) = {⟨𝑙(𝑡1 ), … , 𝑙(𝑡𝑛 )⟩ ∣ ⟨(𝑡1 , 𝑏1 ), … , (𝑡𝑛 , 𝑏𝑛 )⟩ ∈ 𝑆(OCPN 𝐴 )} An accepting object-centric Petri net can be discovered from an OCEL using the general approach introduced by van der Aalst and Berti [13]. A Petri net is discovered for each object type. Given the cardinalities of different objects, the individual Petri nets are merged into one object-centric Petri net, connecting the individual nets at interconnecting transitions (transitions including multiple object types). Place types and variable arcs are assigned according to the types and cardinalities stemming from the individual subnets. Definition 11 (Process Discovery). Let 𝐿 = (𝐸, 𝑂𝑇, 𝑂, 𝜋act , 𝜋time , 𝜋trace ) be an object-centric event log. A process discovery algorithm 𝑑(𝐿) = OCPN 𝐴 returns an accepting object- centric Petri net for the object-centric event log. This general approach discovers a traditional Petri net if a traditional event log is the input. The object-centric Petri nets in Fig. 3 are discovered from the OCEL and the flattened event log of Fig. 2. 3.3. Inter- and intra-object complexity In this section, we formally define the inter- and intra-object complexity which were already informally introduced in the introduction. We later use these dimensions in our experimental framework to differentiate which process characteristics and, therefore, which real-life processes benefit most from employing object-centric discovery. Definition 12 (Inter-Object Complexity). Let OCPN 𝐴 = (((𝑇 , 𝑃, 𝐹 , 𝑙), 𝜋pt , 𝐹var ), 𝑀init , 𝑀final ) be an accepting object-centric Petri net. We define the following model attributes: • numt(OCPN 𝐴 ) = |{𝑡 ∈ 𝑇 ∣ 𝑡 ∈ dom(𝑙)}| the number of non-silent transitions, • numot(OCPN 𝐴 ) = |range(𝜋pt )| the number of object types, and 1 ∑𝑡∈𝑇 |{𝜋pt (𝑝)∣𝑝∈•𝑡}|−1 • inter(OCPN 𝐴 ) = max(1,|𝑂𝑇 |−1) |𝑇 | is the inter-object complexity of the model. The inter-object complexity is defined as the amount of shared transitions between objects. The more transitions are shared between objects, the more the object’s control flows are interwoven with each other. Based on the definition, we make two observations. Observation 13. High inter-object complexities point to many shared transitions. In the extreme case, all objects share all transitions. This would make objects redundant, as they all describe the same control flow. Only one object is necessary to encode the control flow, i.e., a process with an inter-object complexity of 1 is equivalent to a traditional process described by a Petri net. Observation 14. A traditional process has an inter-object complexity of 1, as the single object of the process is involved in all transitions. The inter-object complexity only describes the complexity of control flow between objects. To fully capture the complexity of the control flow, we further define the intra-object-complexity, capturing the control flow within objects. Definition 15 (Intra-Object Complexity). Let OCPN 𝐴 = (((𝑇 , 𝑃, 𝐹 , 𝑙), 𝜋pt , 𝐹var ), 𝑀init , 𝑀final ) be an accepting object-centric Petri net. For an object type ot ∈ range(𝜋pt ), we retrieve the subnet OCPN 𝐴,ot = (𝑃ot , 𝑇ot , 𝐹ot , 𝑙ot ) with • 𝑃ot = {𝑝 ∈ 𝑃 ∣ 𝜋pt (𝑝) = ot}, • 𝑇ot = {𝑡 ∈ 𝑇 ∣ ∃𝑝∈•𝑡∪𝑡• 𝜋pt (𝑝) = ot}, • 𝐹ot = {(𝑠, 𝑡) ∈ 𝐹 ∣ (𝑠 ∈ 𝑃ot ∧ 𝑡 ∈ 𝑇ot ) ∨ (𝑠 ∈ 𝑇ot ∧ 𝑡 ∈ 𝑃ot )}, and • 𝑙ot (𝑡) = 𝑙(𝑡) for 𝑡 ∈ 𝑇ot . Figure 4: Experimental framework. We generate many different system models with varying character- istics. We define the intra-object complexity for an individual type as tioc(OCPN 𝐴,ot ) = |Σ(OCPN 𝐴,ot )| numt(OCPN 𝐴,ot )! . The intra-object complexity of the model is defined as the av- 1 erage intra-object complexity of all object types intra(OCPN 𝐴 ) = numot(OCPN 𝐴 ) ⋅ ∑ot∈range(𝜋pt ) tioc(OCPN 𝐴,ot ). The intra-object complexity quantifies the objects’ average deviation from a completely linear control flow. An intra-object complexity of 0 means that all objects follow a strictly linear control flow, while an intra-object complexity of 1 means that all objects would have a completely concurrent control flow. 4. Experimental framework This section introduces the experimental setup to compare process discovery on object- centric and flattened event data which is depicted in Fig. 4. The experimental framework consists of three steps: Model generation, event log sampling, and discovery quality assessment. Model generation We use a process model called the system model resembling a ground truth process. To cover a wide range of combinations of inter- and intra-object complexity, we generate a large set of different models. We depict the distribution of our generated system model along with some exemplary models in Fig. 5. In total, we have randomly generated 44570 system models with 6-8 visible activities (additionally, they can have silent transitions to model AND constructs) and two object types. These models do not have loops, as this would conflict with the next part of the framework, the event log sampling. We address this limitation in the threats to validity section at the end of Sec. 5. The model sizes are limited to maximum 8 visible transitions, as larger models would produce so many possible traces that an exact computation of the model language, as described in the next section, would become infeasible. However, we complement the Figure 5: Distribution of generated model characteristics and example models for different combinations. experiments with a case study on a large production process, showing the applicability and improvements of discovery also for very large models. Event log sampling We extract an event log by sampling from the language of the system model. To do so, we compute the language of the system model, i.e., all possible process executions. We sample a subset of this language to simulate extracting an event log that only contains a part of the possible behavior. Definition 16 (Event Log Extraction). Let OCPN 𝐴 =(OCPN , 𝑀init , 𝑀final ) be an accepting object-centric Petri net and 𝑠 ∈ [0, 1] be an extraction sampling rate. sample(OCPN 𝐴 , 𝑠) ⊆ Σ(𝑂𝐶𝑃𝑁𝐴 ) samples a subset of the Petri net language corresponding to the sampling rate 𝑠. The language subset is mapped to an OCEL by gen(sample(OCPN 𝐴 , 𝑠), OCPN 𝐴 ) = (𝐸, 𝑂 , 𝑂𝑇 , 𝜋act , 𝜋time , 𝜋trace ). The presented approach allows us to quantify a sampling rate, i.e., how much of the possible model behavior is contained in the event log. Since we computed the full system model language, we can also compare the language of a discovered model to the language of a system model, quantifying how well the discovered model corresponds to the ground-truth system model given the sampling rate. This is explained in the next paragraph Quantifying discovered model quality We compare the quality of discovered process models from the object-centric and flattened event logs. Starting from the OCEL, we discover a model before and after flattening the event log using Inductive Miner [28]. The model quality is defined as the recall/precision of the discovered model language with respect to the language of the system model. When the model is discovered from a sampled event log of the system model, the fitness defined here also measures the generalization capabilities of the discovery technique [29]. Definition 17 (Model Quality). Let OCPN 𝐴 = (OCPN , 𝑀init , 𝑀final ) be a system model and 𝑠 ∈ [0, 1] be a sampling rate. The sampled OCEL is given by 𝐿 = 𝑑(gen(sample( OCPN 𝐴 , 𝑠), OCPN 𝐴 ). The discovered model on the object-centric event log is denoted with OCPN 𝑑𝐴 = 𝑑(𝐿) and the discovered model on the flattened event log is denoted with flat,𝑑 OCPN 𝐴 = 𝑑(𝐿flat ). For any of the two discovered models PN ∈ {OCPN 𝑑 , OCPN flat,𝑑 }, |Σ(OCPN 𝐴 )∩Σ(PN )| we define the fitness fit(OCPN 𝐴 , PN ) = |Σ𝑑 (OCPN 𝐴 )| and the precision prec(OCPN 𝐴 , |Σ(OCPN 𝐴 )∩Σ(PN )| PN ) = |Σ(PN )| . Using this setup, we retrieve the discovered model quality for traditional and object- centric process discovery for different combinations of inter-object complexity, intra-object complexity, and sample sizes. 5. Experimental results We present our experimental results in this section. The implementation is publicly available1 . We present the dependency between discovered model quality and event log sample size for traditional and object-centric discovery on system models with varying inter- and intra-object complexity, as described in Sec. 4. The results are depicted in Fig. 6 (fitness) and Fig. 7 (precision). We split the models according to their characteristics using low/high inter- and intra- object complexity. The decision for low and high values are made according to Fig. 5, i.e., an inter-object complexity of more than 0.2 is considered high and an intra-object complexity of less than 0.15 is considered to be low. We choose both values for the following reasons: First, high inter-object complexity indicates high redundancies between object types, i.e., the system model is behaviorally very close to a traditional Petri net. By choosing a low threshold we single out system models that are significantly different from traditional process models. Second, high intra-object complexity models indicate a lack of structure in the process, i.e., no order is enforced. By choosing a low threshold, we can single out system models that show high degrees of sequentiality in one bin and less structured models in the other bin. The highest quality differences can be observed when the system model exhibits low levels of inter-object complexity, especially when the intra-object complexity is also low, i.e., each object has a relatively sequential path and only some interaction points with other objects. In this case, the average fitness of the discovered process model for extracted OCELs with a low sample rate is already 0.9. The fitness of the discovered model from the flattened event log is very low, starting at almost 0. Mapping this back to the initial taxonomy depicted in Fig. 1, object-centric discovery can significantly 1 https://github.com/jn-adams/ObjectCentricDiscovery Figure 6: Discovered model fitness depending on the sample size of the event log for the four different quadrants of our process taxonomy. Figure 7: Discovered model precision depending on the sample size of the event log for the four different quadrants of our process taxonomy. reduce the data required for manufacturing processes, supply chains, and composite workflows. Especially for processes with relatively sequential subprocesses, like assembly lines, object-centric discovery allows for discovery with much less data compared to traditional discovery. This means, that discovery for larger models, which was prohibitive due to the data requirements in traditional discovery, is now enabled using object-centric discovery. We will also see this in the case study in Sec. 6. Models with high inter-object complexity cannot benefit that much from switching to object-centric discovery. Since high inter-object complexity points to redundant control flows and is, in the extreme case, equivalent to traditional process models, these results are not surprising. It is notable, that – within this evaluation – object-centric discovery does not worsen results. When considering the precision of discovered models, the quality differences are larger for models with high intra-object complexity, although precision levels for low intra-object complexity are, generally, higher. This can be explained by low levels of intra-object complexity allowing more concurrent behavior, such that the discovery of flower-like models will be more precise than for more restrictive system models. In general, we observe that the high amounts of sequences produced by concurrent behavior push the discovery algorithm to discover flower models for the flattened event logs, significantly reducing the precision. 5.1. Threats to validity As our experimental framework computes the exact languages of the models, it is computationally quite demanding, leading us to restrict the system model generation in two major ways: without loops and only limited to eight visible transitions. These are also the main limitations of this experimental framework: First, our results cannot be generalized to process models containing many loops. Second, our generated models only cover relatively small models of 8 visible transitions. While the results are already very clear and the difference should only increase for larger models, our experiments cannot prove this claim. Therefore, we complement the experiments with a case study performing object-centric discovery on a large-scale production process and comparing the results to traditional process discovery. We aim to mitigate the limitations of our experimental framework using this case study, as it shows that object-centric discovery is feasible for large processes and provides better results than traditional discovery. 6. Case study We use object-centric event data from a production process in the German manufacturing company Heidelberger Druckmaschinen AG [30]. The confidential original event log contains hundreds of process executions with more than 800 activities. Object-centric discovery can be applied to the original event log, however, the resulting visualization is too large for human comprehension. Therefore, we limit our analysis to a subprocess of 105 activities. Our sublog contains 13 process executions. We, first, apply object-centric process discovery and, second, flatten the event log and apply traditional process discovery. Even though we do not know the true ground truth process to provide an exact sample rate, it would be close to 0 given the large number of concurrent activities and the low number of process executions. The results are depicted in Fig. 8. The structure of the process can be rediscovered using object-centric discovery: In- dividual concurrent object paths are visible, ending up in one assembly activity that combines individual components into an output component. This output component follows its individual path afterward. The low number of process executions is not an issue when using object-centric discovery, it can still rediscover the concurrency between objects. This is not the case for process discovery on flattened event data. The flattened process discovery algorithm shows a flower model, i.e., the structure of the process is completely lost. Furthermore, it is generally infeasible to expect that a flattened event log would contain enough process executions such that concurrency could completely be Figure 8: Discovered process model of a production process using the object-centric (left) and flattened event log (right). For the model discovered from the OCEL, different components follow their individual paths and are assembled into a merged component. For the model discovered from the flattened log, the process is mostly a flower model and the screenshot only shows part of the width of this flower model. rediscovered: 100 partly concurrent activities can generate a large number of possible activity sequences, such that the data requirements of traditional process discovery would be beyond any feasible amount of produced products, i.e., the available sample size. 7. Conclusion In this paper, we investigated the data requirement reduction that object-centric process discovery yields over traditional discovery. We categorize different real-life processes according to control-flow complexity across and within subprocesses and formally define these dimensions. We define an experimental framework that generates models across these dimensions and assess the reduction in data requirements when using object-centric discovery over traditional discovery. We show that the data requirements are drastically reduced, especially for low inter-object complexity process models like production pro- cesses or supply chains. To address limitations of the experimental framework w.r.t. the size of generated models we complement our evaluation with a large-scale production process case study. We show that object-centric discovery captures the production pro- cess much better than traditional process discovery. In this case, the data requirements of traditional discovery would even exceed the number of produced items, rendering traditional process discovery infeasible. Our results have significant implications for the application of process discovery to large-scale processes: Areas once infeasible for discovery due to large data requirements are now accessible using object-centric process mining. Acknowledgments We thank the Alexander von Humboldt (AvH) Stiftung for supporting our research. Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) under Germanys Excellence Strategy—EXC-2023 Internet of Production390621612. References [1] M. Dumas, M. L. Rosa, J. Mendling, H. A. Reijers, Fundamentals of Business Process Management, Second Edition, Springer, 2018. doi:10.1007/978- 3- 662- 56509- 4 . [2] W. M. P. van der Aalst, Process Mining - Data Science in Action, Second Edition, Springer, 2016. doi:10.1007/978- 3- 662- 49851- 4 . [3] A. Augusto, R. Conforti, M. Dumas, M. L. Rosa, F. M. Maggi, A. Marrella, M. Mecella, A. Soo, Automated discovery of process models from event logs: Review and benchmark, IEEE Trans. Knowl. Data Eng. 31 (2019) 686–705. doi:10.1109/ TKDE.2018.2841877 . [4] W. M. P. van der Aalst, B. F. van Dongen, J. Herbst, L. Maruster, G. Schimm, A. J. M. M. Weijters, Workflow mining: A survey of issues and approaches, Data Knowl. Eng. 47 (2003) 237–267. doi:10.1016/S0169- 023X(03)00066- 1 . [5] J. D. Weerdt, M. T. Wynn, Foundations of process event data, in: Process Mining Handbook, Springer, 2022, pp. 193–211. doi:10.1007/978- 3- 031- 08848- 3\_6 . [6] Y. Kim, J. Lee, Manufacturing strategy and production systems: An integrated framework, Journal of Operations Management 11 (1993) 3–15. doi:https://doi. org/10.1016/0272- 6963(93)90029- O . [7] G. Da Silveira, D. Borenstein, F. S. Fogliatto, Mass customization: Literature review and research directions, International Journal of Production Economics 72 (2001) 1–13. doi:https://doi.org/10.1016/S0925- 5273(00)00079- 7 . [8] S. Zeng, P. Melville, C. A. Lang, I. M. Boier-Martin, C. Murphy, Using predic- tive analysis to improve invoice-to-cash collection, in: SIGKDD, ACM, 2008, pp. 1043–1050. doi:10.1145/1401890.1402014 . [9] G. Schuh, A. Gützlaff, S. Cremer, S. Schmitz, A. Ayati, A data model to apply process mining in end-to-end order processing processes of manufacturing companies, in: IEEM, IEEE, 2020, pp. 151–155. doi:10.1109/IEEM45057.2020.9309946 . [10] M. Milgate, Supply chain complexity and delivery performance: an international exploratory study, Supply chain management: An international Journal 6 (2001) 106–118. [11] W. M. P. van der Aalst, Object-centric process mining: Dealing with divergence and convergence in event data, in: SEFM, Springer, 2019, pp. 3–25. doi:10.1007/ 978- 3- 030- 30446- 1\_1 . [12] J. N. Adams, D. Schuster, S. Schmitz, G. Schuh, W. M. P. van der Aalst, Defining cases and variants for object-centric event data, in: (ICPM), 2022, pp. 128–135. doi:10.1109/ICPM57379.2022.9980730 . [13] W. M. P. van der Aalst, A. Berti, Discovering object-centric Petri nets, Fundam. Informaticae 175 (2020) 1–40. doi:10.3233/FI- 2020- 1946 . [14] D. Cohn, R. Hull, Business artifacts: A data-centric approach to modeling business operations and processes, IEEE Data Eng. Bull. 32 (2009) 3–9. [15] W. M. P. van der Aalst, A. Artale, M. Montali, S. Tritini, Object-centric behavioral constraints: Integrating data and declarative process modelling, in: Description Logics, CEUR-WS.org, 2017. [16] D. Fahland, Describing behavior of processes with many-to-many interactions, in: PETRI NETS, Springer, 2019, pp. 3–24. doi:10.1007/978- 3- 030- 21571- 2\_1 . [17] M. Montali, A. Rivkin, Db-nets: On the marriage of colored Petri nets and relational databases, Trans. Petri Nets Other Model. Concurr. 12 (2017) 91–118. doi:10.1007/978- 3- 662- 55862- 1\_5 . [18] S. Ghilardi, A. Gianola, M. Montali, A. Rivkin, Petri nets with parameterised data - modelling and verification, in: Business Process Management - 18th International Conference, BPM 2020, Seville, Spain, September 13-18, 2020, Proceedings, volume 12168 of Lecture Notes in Computer Science, Springer, 2020, pp. 55–74. doi:10. 1007/978- 3- 030- 58666- 9\_4 . [19] J. M. E. M. van der Werf, A. Rivkin, A. Polyvyanyy, M. Montali, Data and process resonance - identifier soundness for models of information systems, in: PETRI NETS, Springer, 2022, pp. 369–392. doi:10.1007/978- 3- 031- 06653- 5\_19 . [20] E. H. J. Nooijen, B. F. van Dongen, D. Fahland, Automatic discovery of data-centric and artifact-centric processes, in: BPM Workshops, Springer, 2012, pp. 316–327. doi:10.1007/978- 3- 642- 36285- 9\_36 . [21] L. Moctar-M’Baba, N. Assy, M. Sellami, W. Gaaloul, M. F. Nanne, Extracting artifact-centric event logs from blockchain applications, in: SCC, IEEE, 2022, pp. 274–283. doi:10.1109/SCC55611.2022.00048 . [22] G. Li, E. G. L. de Murillas, R. M. de Carvalho, W. M. P. van der Aalst, Extracting object-centric event logs to support process mining on databases, in: CAiSE Forum, Springer, 2018, pp. 182–199. doi:10.1007/978- 3- 319- 92901- 9\_16 . [23] S. Esser, D. Fahland, Multi-dimensional event data in graph databases, J. Data Semant. 10 (2021) 109–141. doi:10.1007/s13740- 021- 00122- 1 . [24] G. Li, R. M. de Carvalho, W. M. P. van der Aalst, Automatic discovery of object- centric behavioral constraint models, in: BIS, Springer, 2017, pp. 43–58. doi:10. 1007/978- 3- 319- 59336- 4\_4 . [25] J. N. Adams, W. M. P. van der Aalst, Precision and fitness in object-centric process mining, in: ICPM, IEEE, 2021, pp. 128–135. doi:10.1109/ICPM53251.2021.9576886 . [26] G. Park, J. N. Adams, W. M. P. van der Aalst, Opera: Object-centric perfor- mance analysis, in: ER, volume 13607, Springer, 2022, pp. 281–292. doi:10.1007/ 978- 3- 031- 17995- 2\_20 . [27] A. Augusto, R. Conforti, M. Dumas, M. L. Rosa, A. Polyvyanyy, Split miner: automated discovery of accurate and simple business process models from event logs, Knowl. Inf. Syst. 59 (2019) 251–284. doi:10.1007/s10115- 018- 1214- x . [28] S. J. J. Leemans, D. Fahland, W. M. P. van der Aalst, Discovering block-structured process models from event logs - A constructive approach, in: PETRI NETS, Springer, 2013, pp. 311–329. doi:10.1007/978- 3- 642- 38697- 8\_17 . [29] A. Polyvyanyy, A. Moffat, L. García-Bañuelos, Bootstrapping generalization of process models discovered from event data, in: CAiSE, Springer, 2022, pp. 36–54. doi:10.1007/978- 3- 031- 07472- 1\_3 . [30] T. Brockhoff, M. S. Uysal, I. Terrier, H. Göhner, W. M. P. van der Aalst, Analyzing multi-level bom-structured event data, in: ICPM Workshops, Springer, 2021, pp. 47–59. doi:10.1007/978- 3- 030- 98581- 3\_4 .