Relating Process Models and Event Logs: 21 Conformance Propositions

Wil M.P. van der Aalst
Process and Data Science (Informatik 9), RWTH Aachen University, D-52056 Aachen, Germany
wvdaalst@rwth-aachen.de

Abstract. Process mining sheds new light on the relationship between process models and real-life processes. Process discovery can be used to learn process models from event logs. Conformance checking can be used to confront a descriptive or normative model with the actual behavior observed. Quantifying the relationship between process models and event logs is notoriously difficult. Existing approaches seem to make ad-hoc choices, leading to results that are not easy to interpret. To address these problems, we present 21 conformance propositions. These describe expected or desired properties of conformance measures, covering notions such as recall (also called fitness), precision, and generalization. The propositions are put into the broader frame of Log-to-Log (L2L), Model-to-Model (M2M), and Log-to-Model (L2M) comparisons. Moreover, it is argued that conformance measures need to consider the likelihood of traces. Real-life event logs are incomplete, showing only sample behavior. Moreover, observed traces tend to follow a Pareto distribution. Yet, mainstream process mining approaches stick to a "binary view" on process models (observed traces are fitting or not). This is hindering progress. The goal is to trigger a discussion by clearly formulating the challenges and requirements (rather than proposing new measures or techniques).

Keywords: Process mining · Conformance checking · Recall · Precision · Generalization.

1 Introduction

In process mining, there are two main types of artifacts: event logs and process models [1]. In the background, there is the "real process" that is responsible for the events in the event log. However, this process is only fully known when one is using artificial (also called synthetic) event logs. Therefore, we assume that we only have event logs and models. However, we assume that the models are "rich enough" to generate event data. This leads to the two scenarios shown in Figure 1. Scenario 1 starts from a model that is used to generate an event log. This log can be used as input for a process discovery technique that produces another model. One can compare the discovered model with the original model using a so-called M2M (Model-to-Model) assessment. We can also conduct two L2M (Log-to-Model) assessments by comparing the log with both the original model and the discovered model. Scenario 2 starts from an event log that is used to discover a process model that is subsequently used to produce another event log. The original and the generated event logs can be compared using an L2L (Log-to-Log) assessment.

Fig. 1. Two scenarios and three types of comparisons: L2M (Log-to-Model), M2M (Model-to-Model), and L2L (Log-to-Log).

Event logs only contain example behavior. This not only makes process discovery challenging; it also makes it difficult to assess the quality of a process model in relation to a log. Since process mining focuses on the relationship between observed behavior and modeled behavior (i.e., L2M assessments), it is important to quantify the relationship between an event log l ∈ L and a process model m ∈ M.
This is reflected by the different conformance measures that have been proposed [1, 3, 4, 7–10, 14, 16, 18, 21, 22]. When learning a process model from event data, there is typically a trade-off between the following four quality dimensions [1]: (1) recall: the discovered model should allow for the behavior seen in the event log (avoiding "non-fitting" behavior), (2) precision: the discovered model should not allow for behavior completely unrelated to what was seen in the event log (avoiding "underfitting"), (3) generalization: the discovered model should generalize the example behavior seen in the event log (avoiding "overfitting"), and (4) simplicity: the discovered model should be as simple as possible. The simplicity dimension refers to Occam's Razor: "one should not increase, beyond what is necessary, the number of entities required to explain anything". In the context of process mining, this is often operationalized by quantifying the complexity of the model (number of nodes, number of arcs, understandability, etc.). We do not consider the simplicity dimension in this paper, since we focus on behavior and abstract from the actual model representation. Recall is often referred to as fitness in process mining literature. Sometimes fitness refers to a combination of the four quality dimensions. To avoid later confusion, we use the term recall, which is commonly used in pattern recognition, information retrieval, and (binary) classification.

This paper does not aim to provide concrete conformance measures (although some examples are given). Instead, the focus is on characteristic/desired properties of such measures. We propose 21 conformance propositions, i.e., properties that recall, precision, and generalization measures could or should satisfy for any model and log combination.¹ Researchers and developers creating new measures should be clear on which conformance propositions they aim to support and ensure that the implemented measures support these. Users of existing conformance measures should be aware of seemingly obvious conformance propositions that are violated by existing approaches.

Next to defining 21 conformance propositions, this paper provides two additional contributions. First of all, we provide a more holistic perspective on the challenges related to comparing models and logs. Second, we show that probabilities play a key role in such definitions. The likelihood of traces in models and the frequency of traces in logs are key elements that should not be ignored.

It is important to note that this paper focuses on trace-based, control-flow-oriented conformance measures, i.e., we abstract from other event attributes (resources, products, time, costs, etc.) and consider full traces (not local patterns). This scope is clearly limited. However, it helps to focus the discussion.

The remainder is organized as follows. Section 2 presents some preliminaries: next to event logs, process models that define trace likelihoods are introduced. Sections 3, 4, and 5 discuss L2L (Log-to-Log), M2M (Model-to-Model), and L2M (Log-to-Model) comparisons and define example measures. Section 6 describes the 21 conformance propositions. Section 7 reflects on the conformance propositions using the simple measures defined in Sections 3, 4, and 5. Section 8 discusses the viewpoint used in the paper and possible alternative viewpoints. Section 9 concludes the paper.

¹ The Merriam-Webster dictionary defines the noun proposition as "an expression in language or signs of something that can be believed, doubted, or denied or is either true or false" [15]. We use the term in this manner. We do not state that all 21 conformance propositions should hold universally. We merely use them as potential requirements and discuss their implications.

2 Preliminaries

Process mining techniques focus on the relationship between observed behavior and modeled behavior.
Therefore, we first formalize event logs (i.e., observed behavior) and process models (i.e., modeled behavior). To do this, we consider a very simple setting where we only focus on the control-flow, i.e., sequences of activities. However, unlike most other papers, we attach probabilities to traces.

2.1 Event Logs

The starting point for process mining is an event log. Each event in such a log refers to an activity, possibly executed by a resource, at a particular time, and for a particular case. An event may have many more attributes, e.g., transactional information, costs, customer, location, and unit. Here, we focus on control-flow. Therefore, we only consider activity labels and the ordering of events within cases.

Definition 1 (Traces). A is the universe of activities. A trace t ∈ A* is a sequence of activities. T = A* is the universe of traces.

Trace t = ⟨a, b, c, d, c⟩ ∈ T refers to 5 events belonging to the same case (i.e., |t| = 5). An event log is a collection of cases, each represented by a trace.

Definition 2 (Event Log). L = B(T) is the universe of event logs. An event log l ∈ L is a finite multiset of observed traces. τ(l) = {t ∈ l} ⊆ T is the set of traces appearing in l ∈ L. τ̄(l) = {t ∈ T | t ∉ τ(l)} is the complement of l, i.e., the non-observed traces.

An event log is a multiset of traces. Event log l = [⟨a, b, c⟩⁵, ⟨b, a, d⟩³, ⟨a, b, d⟩²] refers to 10 cases (i.e., |l| = 10). Five cases are represented by the trace ⟨a, b, c⟩, three cases are represented by the trace ⟨b, a, d⟩, and two cases are represented by the trace ⟨a, b, d⟩. l(t) is the number of times t appears in l, e.g., l(⟨b, a, d⟩) = 3. We assume that the usual operators are defined for multisets. l1 ⊎ l2 is the union of two multisets, |l| is the number of elements, and l1 \ l2 is the difference. l1 ∩ l2 is the intersection of two multisets. Hence, (l1 ∩ l2)(t) = min(l1(t), l2(t)). [t ∈ l | b(t)] is the multiset of all elements in l that satisfy some condition b. l1 ≤ l2 means that l1 is contained in l2. For example, [a², b] ≤ [a², b², c], but [a², b³] ≰ [a², b², c²] and [a², b², c] ≰ [a³, b³]. τ̄(l) is the set of traces that were not observed in the event log. Note that τ(l) and τ̄(l) partition the universe of traces T.

2.2 Process Models

Often the behavior of a process model m is simply defined by the set of traces allowed by m. However, there may be vast differences in the likelihoods of traces. For many processes, the observed traces follow a Pareto distribution. Typically, there is a limited number of traces that are observed frequently, while there are often many more traces that are infrequent. Actually, many event logs follow the "80-20 rule", i.e., more than 80% of all observed traces are explained by less than 20% of all the observed trace variants. Formally, this means there is a set of traces X ⊆ τ(l) such that |[t ∈ l | t ∈ X]|/|l| > 0.8 and |X|/|τ(l)| < 0.2. Because of the distribution of traces in the log, it is often too simplistic to describe a model as a set of traces. Since "behavior is not binary", we use a trace likelihood function to describe the probability of each trace.
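The notation introduced so far is easy to make concrete. The following sketch (Python; the helper names are ours, and Counter serves as the multiset) implements the operators of Definition 2 and the 80-20 check just described.

```python
from collections import Counter

# An event log as a multiset of traces: l = [⟨a,b,c⟩⁵, ⟨b,a,d⟩³, ⟨a,b,d⟩²].
l = Counter({('a','b','c'): 5, ('b','a','d'): 3, ('a','b','d'): 2})

def tau(log):                  # τ(l): the set of trace variants occurring in l
    return set(log)

def size(log):                 # |l|: the total number of cases
    return sum(log.values())

def union(l1, l2):             # l1 ⊎ l2: multiset union (frequencies add up)
    return l1 + l2

def intersection(l1, l2):      # l1 ∩ l2: per-trace minimum of frequencies
    return l1 & l2

def contained(l1, l2):         # l1 ≤ l2: multiset containment
    return all(l1[t] <= l2[t] for t in l1)

def satisfies_80_20(log):
    # Is there an X ⊆ τ(l) with |[t ∈ l | t ∈ X]|/|l| > 0.8 and |X|/|τ(l)| < 0.2?
    variants = sorted(tau(log), key=log.get, reverse=True)
    covered = 0
    for i, t in enumerate(variants, start=1):
        covered += log[t]
        if covered / size(log) > 0.8 and i / len(variants) < 0.2:
            return True
    return False

assert l[('b','a','d')] == 3 and size(l) == 10
assert contained(Counter({('a','b','c'): 2}), l)
```

Sorting the variants by frequency suffices for the 80-20 check: for any candidate size |X|, the most frequent variants maximize the covered fraction of cases.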
Definition 3 (Trace Likelihood Function). Π = T → [0, 1] is the set of all trace likelihood functions. π ∈ Π assigns a likelihood π(t) to any trace t ∈ T such that Σ_{t∈T} π(t) = 1.

When describing a trace likelihood function, we assume that the number of traces is countable. Therefore, (Ω, F, P) with Ω = T, F = P(T), and P ∈ F → [0, 1] such that P(X) = Σ_{t∈X} π(t) defines a proper probability space.² Note that if A is countable, then T is also countable.

² A probability space (Ω, F, P) consists of three parts: a sample space Ω which defines the set of all possible outcomes, a set of "events" F where each event is a set containing zero or more outcomes, and an assignment of probabilities to the events P ∈ F → [0, 1].

Fig. 2. Two process models m1 and m2 having the same behavior: τ(m1) = τ(m2) = {⟨a, b, c⟩, ⟨a, b, d⟩, ⟨b, a, c⟩, ⟨b, a, d⟩}. m1 is a Petri net with a special start activity (occurring once at the beginning) and a special end activity signaling the end of the trace (a). m2 is a BPMN (Business Process Model and Notation) model (b). The likelihood of each trace π12 = π(m1) = π(m2) (c) and the frequency of each trace in an example log l12 (d) are also shown: π12 assigns 0.375 to ⟨a, b, c⟩ and ⟨b, a, c⟩ and 0.125 to ⟨a, b, d⟩ and ⟨b, a, d⟩; l12 contains ⟨a, b, c⟩ 44 times, ⟨b, a, c⟩ 36 times, ⟨a, b, d⟩ 14 times, and ⟨b, a, d⟩ 6 times.

For pragmatic reasons, we can also restrict A to be finite and T to traces of a maximal length (e.g., the maximal length seen in any log). For any set of traces X ⊆ T, π(X) = Σ_{t∈X} π(t) is the probability that a run of the process π yields a trace in X. We assume that traces are sampled independently. When considering time, this is not realistic (due to queueing), but for the control-flow (the routing of a case through the model) this is a common assumption.

Definition 4 (Process Model). M is the set of process models. A process model m ∈ M has a set of traces τ(m) ⊆ T and a trace likelihood function π(m) ∈ Π.

The relation between τ(m) and π(m) is not very strict. Sometimes we will require τ(m) ⊆ {t ∈ T | π(m)(t) > 0} (all modeled traces are possible), {t ∈ T | π(m)(t) > 0} ⊆ τ(m) (the model allows for all possible traces), or even τ(m) = {t ∈ T | π(m)(t) > 0}. Trace set τ(m) may be an abstraction of the real process and leave out unlikely behavior. Due to simplification, the trace set τ(m) may also allow for traces that cannot happen (e.g., particular interleavings or loops). We define τ̄(m) = {t ∈ T | t ∉ τ(m)} as the complement of the set of traces allowed by model m ∈ M (like we did for event logs).

We distinguish between the representation and the behavior of a model. Process model m ∈ M can be represented using a plethora of modeling languages, e.g., Petri nets, BPMN models, UML activity diagrams, automata, and process trees. Here, we abstract from the actual representation and focus on behavioral characteristics like τ(m) ⊆ T and π(m) ∈ Π.

Figure 2 shows two process models that have the same behavior: τ(m1) = τ(m2) = {⟨a, b, c⟩, ⟨a, b, d⟩, ⟨b, a, c⟩, ⟨b, a, d⟩} and π12 = π(m1) = π(m2) with π12(⟨a, b, c⟩) = 0.375, π12(⟨b, a, c⟩) = 0.375, π12(⟨a, b, d⟩) = 0.125, π12(⟨b, a, d⟩) = 0.125, and π12(t) = 0 for all other traces t. The two process models m1 and m2 are not graphically annotated with probabilities. However, there are extensions of these notations to do so, for example Stochastic Petri Nets (SPNs) or BPMN-based simulation models. For the models in Figure 2, we assume that initially there is a "race" between a and b where both have a 50% probability to win. After both a and b have occurred, c occurs with 75% probability and d occurs with 25% probability. This explains the likelihoods in Figure 2(c). Figure 2(d) shows a possible event log l12 composed of 100 traces generated by one of these models: l12(⟨a, b, c⟩) = 44, l12(⟨b, a, c⟩) = 36, l12(⟨a, b, d⟩) = 14, and l12(⟨b, a, d⟩) = 6.
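For a finite trace set, a trace likelihood function can be written down directly. The sketch below (illustrative; the names are ours) encodes π12 and checks the properties stated above against the example log l12.

```python
# π12 from Figure 2(c): a trace likelihood function over a finite trace set.
pi12 = {('a','b','c'): 0.375, ('b','a','c'): 0.375,
        ('a','b','d'): 0.125, ('b','a','d'): 0.125}
assert abs(sum(pi12.values()) - 1.0) < 1e-12        # Σ_{t∈T} π(t) = 1

def prob(pi, X):
    # π(X) = Σ_{t∈X} π(t): probability that a run yields a trace in X.
    return sum(pi.get(t, 0.0) for t in X)

# Traces ending in c have probability 0.75, matching the "race" described above.
assert prob(pi12, {('a','b','c'), ('b','a','c')}) == 0.75

# The sample frequencies in l12 (Figure 2(d)) approximate π12.
l12 = {('a','b','c'): 44, ('b','a','c'): 36, ('a','b','d'): 14, ('b','a','d'): 6}
n = sum(l12.values())
for t in l12:
    print(t, l12[t] / n, 'vs', pi12[t])
```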
Process models may allow for an infinite number of traces. We use Figure 3 to illustrate this. Again, both processes (m3 and m4) express the same behavior. Again, there is a "race" between a and b where both have a 50% probability to win. After both a and b have occurred, activity c occurs. Then there is a 75% probability that the process ends and a 25% probability that d occurs. Let t_{a,k} = ⟨a, b⟩ · (⟨c, d⟩)ᵏ · ⟨c⟩ be the trace that starts with a and where d is executed k times. t_{b,k} = ⟨b, a⟩ · (⟨c, d⟩)ᵏ · ⟨c⟩ is the trace that starts with b and where d is executed k times. τ(m3) = τ(m4) = ∪_{k≥0} {t_{a,k}, t_{b,k}}. π34 = π(m3) = π(m4) with π34(t_{a,k}) = π34(t_{b,k}) = 0.5 · (0.25)ᵏ · 0.75. Some example values are given in Figure 3(c). Figure 3(d) shows a possible event log l34 composed of 10 traces generated by one of the models (m3 or m4): l34(⟨a, b, c⟩) = 4, l34(⟨b, a, c⟩) = 3, l34(⟨a, b, c, d, c⟩) = 2, and l34(⟨b, a, c, d, c, d, c, d, c, d, c⟩) = 1. Since any log contains only a finite number of traces, one can never observe all traces possible in m3 or m4.

Fig. 3. Two process models m3 and m4 allowing for the same set of traces (τ(m3) = τ(m4)) and having identical trace likelihoods (π34 = π(m3) = π(m4)): a Petri net with start and end transitions (a) and a BPMN model allowing for the same behavior (b). The likelihoods (c) include π34(⟨a, b, c⟩) = π34(⟨b, a, c⟩) = 0.375, π34(⟨a, b, c, d, c⟩) = π34(⟨b, a, c, d, c⟩) = 0.09375, π34(⟨a, b, c, d, c, d, c⟩) = π34(⟨b, a, c, d, c, d, c⟩) = 0.0234375, π34(⟨a, b, c, d, c, d, c, d, c⟩) = 0.005859375, etc. The frequencies in the example log l34 (d) are 4 for ⟨a, b, c⟩, 3 for ⟨b, a, c⟩, 2 for ⟨a, b, c, d, c⟩, and 1 for ⟨b, a, c, d, c, d, c, d, c, d, c⟩.

2.3 Process Discovery

A discovery algorithm takes an event log as input and returns a process model. For example, m1 and m2 in Figure 2 could have been discovered based on event log l = [⟨a, b, c⟩⁵, ⟨b, a, d⟩³, ⟨a, b, d⟩²]. Ideally, the process model captures the (dominant) behavior observed, but also generalizes without becoming too imprecise. For example, m1 and m2 allow for trace t = ⟨b, a, c⟩ although this was never observed.

Definition 5 (Discovery Algorithm). A discovery algorithm can be described as a function disc ∈ L → M mapping event logs onto process models.

We abstract from concrete discovery algorithms. Over 100 discovery algorithms have been proposed in literature [1]. Most of the discovery algorithms do not include probabilities. However, by replaying the event log on the discovered process model, these can be added easily. Merely as a reference to explain basic notions, we define three simple, but extreme, algorithms: disc_ofit, disc_ufit, and disc_nfit. Let l ∈ L be a log. disc_ofit(l) = mo such that τ(mo) = τ(l) produces an overfitting model that allows only for the behavior seen in the log. disc_ufit(l) = mu such that τ(mu) = T produces an underfitting model that allows for any behavior. disc_nfit(l) = mn such that τ(mn) = τ̄(l) produces a non-fitting model that allows for all behavior not seen in the log.
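Since disc_ufit and disc_nfit allow for infinitely many traces, their trace sets cannot be materialized. Representing a model by a membership predicate for τ(m) avoids this; a minimal sketch (the function names are ours):

```python
def disc_ofit(log):
    # Overfitting model: τ(m) = τ(l), i.e., exactly the observed variants.
    observed = set(log)
    return lambda trace: trace in observed

def disc_ufit(log):
    # Underfitting model: τ(m) = T, i.e., any trace is allowed.
    return lambda trace: True

def disc_nfit(log):
    # Non-fitting model: τ(m) = τ̄(l), i.e., everything except the log.
    observed = set(log)
    return lambda trace: trace not in observed

log = {('a','b','c'): 5, ('b','a','d'): 3, ('a','b','d'): 2}
m_o, m_u, m_n = disc_ofit(log), disc_ufit(log), disc_nfit(log)
assert m_o(('a','b','c')) and not m_o(('b','a','c'))
assert m_u(('x','y'))                                  # allows anything
assert m_n(('b','a','c')) and not m_n(('a','b','c'))
```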
We do not define the probabilities explicitly here. Just assume some function that assigns positive values to all included traces.

3 L2L: Log-To-Log Comparison

As Figure 1 shows, we are interested in three types of conformance measurements. Although the focus will be on L2M (Log-to-Model) comparisons, we also briefly consider L2L (Log-to-Log) and M2M (Model-to-Model) comparisons. We start by providing precision and recall measures for L2L comparisons.

Definition 6 (Set-Based L2L Precision and Recall). Let l1, l2 ∈ L be two logs. Set-based L2L precision and recall are defined as follows:

  rec_L2L_SB(l1, l2) = |τ(l1) ∩ τ(l2)| / |τ(l1)|    prec_L2L_SB(l1, l2) = |τ(l1) ∩ τ(l2)| / |τ(l2)|    (1)

Definition 7 (Multiset-Based L2L Precision and Recall). Let l1, l2 ∈ L be two logs. Multiset-based L2L precision and recall are defined as follows:

  rec_L2L_MSB(l1, l2) = |l1 ∩ l2| / |l1|    prec_L2L_MSB(l1, l2) = |l1 ∩ l2| / |l2|    (2)

rec_L2L_MSB and prec_L2L_MSB take into account the frequencies of traces. In rec_L2L_SB and prec_L2L_SB it is not relevant how many times the same trace appears in the log. Consider l12 in Figure 2(d) and l34 in Figure 3(d). rec_L2L_SB(l12, l34) = 2/4 = 0.5, prec_L2L_SB(l12, l34) = 2/4 = 0.5, rec_L2L_MSB(l12, l34) = (4 + 3)/100 = 0.07, and prec_L2L_MSB(l12, l34) = (4 + 3)/10 = 0.7. This shows that the latter two measures only make sense if both logs have comparable sizes or if things are normalized.

4 M2M: Model-To-Model Comparison

We also define two M2M (Model-to-Model) recall and precision measures.

Definition 8 (Trace-Based M2M Precision and Recall). Let m1, m2 ∈ M be two process models. Trace-based M2M precision and recall are defined as follows:

  rec_M2M_TB(m1, m2) = |τ(m1) ∩ τ(m2)| / |τ(m1)|    (3)
  prec_M2M_TB(m1, m2) = |τ(m1) ∩ τ(m2)| / |τ(m2)|    (4)

Note that these recall and precision measures do not take into account the likelihood of traces. Moreover, they are undefined if the number of traces is unbounded, i.e., |τ(m1)| = ∞ or |τ(m2)| = ∞. These two issues can be addressed by taking into account trace likelihoods, resulting in the following measures.

Definition 9 (Likelihood-Based M2M Precision and Recall). Let m1, m2 ∈ M be two process models. Likelihood-based M2M precision and recall are defined as follows:

  rec_M2M_LB(m1, m2) = π(m1)(τ(m1) ∩ τ(m2)) / π(m1)(τ(m1))    (5)
  prec_M2M_LB(m1, m2) = π(m2)(τ(m1) ∩ τ(m2)) / π(m2)(τ(m2))    (6)

Note that π(m1)(τ(m1) ∩ τ(m2)) quantifies the likelihood of the traces in the set τ(m1) ∩ τ(m2) according to model m1 (recall that π(m1) ∈ Π = T → [0, 1] is a trace likelihood function). To compute recall, we compare the traces allowed by both models (τ(m1) ∩ τ(m2)) with the traces allowed by the first model (τ(m1)). π(m1) is used to quantify the fractions of traces. To compute precision, we compare the traces allowed by both models (τ(m1) ∩ τ(m2)) with the traces allowed by the second model (τ(m2)). Now, π(m2) is used to quantify the fractions of traces. Note that it does not make any sense to use π(m1) when computing precision, because the behavior in τ(m2) \ τ(m1) would be unlikely according to π(m1). Hence, precision based on π(m1) would always yield a high value, even when m2 allows for lots of additional behavior. Similarly, it does not make any sense to use π(m2) when computing recall.
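Both the L2L measures of Section 3 and the likelihood-based M2M measures above are directly computable for finite representations. The sketch below (the helper names are ours) reproduces the L2L values for l12 and l34 computed in Section 3; the M2M variants of Definition 9 follow the same pattern, with likelihood dictionaries in place of frequencies.

```python
def tau(log):
    return set(log)

def size(log):
    return sum(log.values())

def rec_l2l_sb(l1, l2):       # |τ(l1) ∩ τ(l2)| / |τ(l1)|
    return len(tau(l1) & tau(l2)) / len(tau(l1))

def prec_l2l_sb(l1, l2):      # |τ(l1) ∩ τ(l2)| / |τ(l2)|
    return len(tau(l1) & tau(l2)) / len(tau(l2))

def rec_l2l_msb(l1, l2):      # |l1 ∩ l2| / |l1|, (l1 ∩ l2)(t) = min(l1(t), l2(t))
    return sum(min(l1[t], l2[t]) for t in tau(l1) & tau(l2)) / size(l1)

def prec_l2l_msb(l1, l2):     # |l1 ∩ l2| / |l2|
    return sum(min(l1[t], l2[t]) for t in tau(l1) & tau(l2)) / size(l2)

l12 = {('a','b','c'): 44, ('b','a','c'): 36, ('a','b','d'): 14, ('b','a','d'): 6}
l34 = {('a','b','c'): 4, ('b','a','c'): 3, ('a','b','c','d','c'): 2,
       ('b','a','c','d','c','d','c','d','c','d','c'): 1}

assert rec_l2l_sb(l12, l34) == prec_l2l_sb(l12, l34) == 0.5
assert rec_l2l_msb(l12, l34) == 0.07 and prec_l2l_msb(l12, l34) == 0.7
```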
5 L2M: Log-To-Model Comparison

The recall and precision measures in Definition 9 seem quite natural for models, and there is no need for additional notions such as generalization. However, when considering event logs, recall and precision do not address the fact that there is a difference between (1) the behavior in the log and (2) possible future behavior. Overfitting models may have perfect recall, but may be unable to allow for future cases.

In the remainder, we assume the existence of three functions: rec(), prec(), and gen(). All three take a log and a model as input and return a value between 0 and 1. The higher the value, the better. In process mining literature one can find many proposals for such functions. Although we provide a few example functions for clarity, we focus on desired properties of conformance measures rather than concrete measures.

Definition 10 (Recall). A recall measure rec ∈ L × M → [0, 1] aims to quantify the fraction of observed behavior that is allowed by the model.

Definition 11 (Precision). A precision measure prec ∈ L × M → [0, 1] aims to quantify the fraction of behavior allowed by the model that was actually observed.

Definition 12 (Generalization). A generalization measure gen ∈ L × M → [0, 1] aims to quantify the likelihood that new unseen cases will fit the model.

If we ignore frequencies and likelihoods of traces, we can simply count fractions of traces, yielding the following two simple measures.

Definition 13 (Trace-Based L2M Precision and Recall). Let l ∈ L and m ∈ M be an event log and a process model. Trace-based L2M precision and recall are defined as follows:

  rec_L2M_TB(l, m) = |τ(l) ∩ τ(m)| / |τ(l)|    prec_L2M_TB(l, m) = |τ(l) ∩ τ(m)| / |τ(m)|    (7)

Since |τ(l)| is bounded by the size of the log, rec_L2M_TB(l, m) is well-defined. However, prec_L2M_TB(l, m) is undefined when τ(m) is unbounded (e.g., in case of loops). Another approach is to build a dummy process model ml based on the event log and then apply Definition 9.

Definition 14 (Estimation-Based L2M Precision and Recall). Let l ∈ L and m ∈ M be an event log and a process model. Let ml be a model created based on l such that τ(ml) = τ(l) and π(ml)(t) = l(t)/|l|. Estimation-based L2M precision and recall are defined as follows:

  rec_L2M_EB(l, m) = π(ml)(τ(ml) ∩ τ(m)) / π(ml)(τ(ml)) = |[t ∈ l | t ∈ τ(m)]| / |l|    (8)
  prec_L2M_EB(l, m) = π(m)(τ(ml) ∩ τ(m)) / π(m)(τ(m)) = π(m)(τ(l) ∩ τ(m)) / π(m)(τ(m))    (9)

The above measures do not capture generalization. Note that generalization is good when the process model is also able to explain future traces (i.e., unseen cases). Therefore, generalization takes into account the redundancy of fitting behavior. If the log contains many fitting traces that are also frequent, then generalization is likely to be good.

Definition 15 (L2M Generalization). Let l ∈ L and m ∈ M be an event log and a process model. Two example L2M generalization measures are defined as follows:

  gen_L2M_q(l, m) = |[t ∈ l | t ∈ τ(m) ∧ l(t) ≥ q]| / |l|  with q ≥ 1    (10)
  gen_L2M_HB(l, m) = (|[t ∈ l | t ∈ τ(m)]| − |τ(l) ∩ τ(m)|) / |l|    (11)

gen_L2M_q(l, m) uses a threshold q. The more fitting traces there are that have a frequency of at least q, the better. Non-fitting traces and traces with a frequency below q lower generalization. gen_L2M_HB(l, m) measures the fraction of traces that is fitting, but subtracts a penalty for each distinct fitting trace variant. Consider l′ = [⟨a, b, c⟩⁶⁰, ⟨b, a, d⟩³⁸, ⟨a, b, d⟩¹, ⟨c, b, a⟩¹] and m1 in Figure 2, and let q = 10. gen_L2M_q(l′, m1) = (60 + 38)/100 = 0.98 and gen_L2M_HB(l′, m1) = ((60 + 38 + 1) − 3)/100 = 0.96. Assume now that l″ = [⟨a, b, c⟩⁷, ⟨b, a, d⟩¹, ⟨a, b, d⟩¹, ⟨c, b, a⟩¹]. gen_L2M_q(l″, m1) = 0/10 = 0.0 and gen_L2M_HB(l″, m1) = ((7 + 1 + 1) − 3)/10 = 0.6. Note that, although the set of traces in the event log did not change (τ(l′) = τ(l″)), it is reasonable to assume that generalization is impacted.
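Both generalization measures are easy to implement. The following sketch (with fits_m1 standing in for membership of τ(m1)) reproduces the four values computed above.

```python
def gen_q(log, fits, q):
    # gen_L2M_q: fraction of cases whose trace fits and occurs at least q times.
    n = sum(log.values())
    return sum(f for t, f in log.items() if fits(t) and f >= q) / n

def gen_hb(log, fits):
    # gen_L2M_HB: fraction of fitting cases, minus a penalty per fitting variant.
    n = sum(log.values())
    fitting_cases = sum(f for t, f in log.items() if fits(t))
    fitting_variants = sum(1 for t in log if fits(t))
    return (fitting_cases - fitting_variants) / n

tau_m1 = {('a','b','c'), ('a','b','d'), ('b','a','c'), ('b','a','d')}
fits_m1 = lambda t: t in tau_m1

l_prime = {('a','b','c'): 60, ('b','a','d'): 38, ('a','b','d'): 1, ('c','b','a'): 1}
assert gen_q(l_prime, fits_m1, q=10) == 0.98
assert gen_hb(l_prime, fits_m1) == 0.96

l_second = {('a','b','c'): 7, ('b','a','d'): 1, ('a','b','d'): 1, ('c','b','a'): 1}
assert gen_q(l_second, fits_m1, q=10) == 0.0
assert gen_hb(l_second, fits_m1) == 0.6
```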
6 A Collection of Conformance Propositions

Many recall measures have been proposed in literature [1, 3, 4, 7–10, 14, 16, 18, 21, 22]. In recent years, also several precision measures have been proposed [5, 20]. Only few generalization measures have been proposed [3]. The goal of this paper is not to present new measures, but to define properties for such quality measures. Through this, we hope to facilitate a better understanding of process discovery and conformance checking. Moreover, these properties may help to choose an existing measure or to define new ones.

6.1 General Propositions

In the remainder, we provide 21 conformance propositions. The Merriam-Webster dictionary defines the noun proposition as "an expression in language or signs of something that can be believed, doubted, or denied or is either true or false" [15]. Most of the conformance propositions have broad support from the community, i.e., there is broad consensus that these propositions should hold. These are marked with a "+". For example, the first two propositions are commonly accepted; the computation of a quality measure should be deterministic (DetPro+) and only depend on behavioral aspects (BehPro+). The latter is a design choice. We deliberately exclude simplicity notions. More controversial propositions are marked with a "0" (rather than a "+").

Proposition 1 (DetPro+). rec(), prec(), gen() are deterministic functions, i.e., the measures rec(l, m), prec(l, m), gen(l, m) are fully determined by l ∈ L and m ∈ M.

Proposition 2 (BehPro+). For any l ∈ L and m1, m2 ∈ M such that τ(m1) = τ(m2) and π(m1) = π(m2): rec(l, m1) = rec(l, m2), prec(l, m1) = prec(l, m2), and gen(l, m1) = gen(l, m2), i.e., the measures are fully determined by the behavior observed and the behavior described by the model (representation does not matter).

6.2 Recall Propositions

First, we consider a few recall propositions. rec ∈ L × M → [0, 1] aims to quantify the fraction of observed behavior that is allowed by the model.

Proposition 3 (RecPro1+). For any l ∈ L and m1, m2 ∈ M such that τ(m1) ⊆ τ(m2): rec(l, m1) ≤ rec(l, m2).

Proposition RecPro1+ states that extending the model to allow for more behavior can never result in a lower recall. Similarly, it cannot be the case that adding fitting behavior to the event log lowers recall (RecPro2+). Adding non-fitting behavior to the log cannot improve recall (RecPro3+).

Proposition 4 (RecPro2+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3 and τ(l3) ⊆ τ(m): rec(l1, m) ≤ rec(l2, m).

Proposition 5 (RecPro3+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3 and τ(l3) ⊆ τ̄(m): rec(l1, m) ≥ rec(l2, m).

For any natural number k: lᵏ(t) = k · l(t), e.g., if l = [⟨a, b⟩³, ⟨c⟩²], then l⁴ = [⟨a, b⟩¹², ⟨c⟩⁸]. We use this notation to enlarge event logs without changing the distribution. One could argue that this should not influence recall (RecPro4⁰), e.g., rec([⟨a, b⟩³, ⟨c⟩²], m) = rec([⟨a, b⟩¹², ⟨c⟩⁸], m). However, unlike the previous propositions, this requirement is debatable, as is indicated by the "0" tag.
Proposition 6 (RecPro4⁰). For any l ∈ L, m ∈ M, and k ≥ 1: rec(lᵏ, m) = rec(l, m).

Finally, we provide a proposition stating that recall should be 1 if all traces in the log fit the model (RecPro5+). As a result, the empty log has recall 1 for any model.

Proposition 7 (RecPro5+). For any l ∈ L and m ∈ M such that τ(l) ⊆ τ(m): rec(l, m) = 1.

Based on this proposition, rec(l, disc_ofit(l)) = rec(l, disc_ufit(l)) = 1 for any log l.

6.3 Precision Propositions

Precision (prec ∈ L × M → [0, 1]) aims to quantify the fraction of behavior allowed by the model that was actually observed. In [20] several precision axioms were introduced. These partly overlap with the propositions below (but more are added and some have been strengthened). PrecPro1+ states that removing behavior from a model that does not happen in the event log cannot lead to a lower precision. Adding fitting traces to the event log can also not lower precision (PrecPro2+). However, adding non-fitting traces to the event log should not change precision (PrecPro3⁰).

Proposition 8 (PrecPro1+). For any l ∈ L and m1, m2 ∈ M such that τ(m1) ⊆ τ(m2) and τ(l) ∩ (τ(m2) \ τ(m1)) = ∅: prec(l, m1) ≥ prec(l, m2).

Proposition 9 (PrecPro2+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3 and τ(l3) ⊆ τ(m): prec(l1, m) ≤ prec(l2, m).

Proposition 10 (PrecPro3⁰). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3 and τ(l3) ⊆ τ̄(m): prec(l1, m) = prec(l2, m).

One could also argue that duplicating the event log should not influence precision, because the distribution remains the same (PrecPro4⁰), e.g., prec([⟨a, b⟩²⁰, ⟨c⟩²⁰], m) = prec([⟨a, b⟩⁴⁰, ⟨c⟩⁴⁰], m).

Proposition 11 (PrecPro4⁰). For any l ∈ L, m ∈ M, and k ≥ 1: prec(lᵏ, m) = prec(l, m).

If the model allows for the behavior observed and nothing more, precision should be maximal (PrecPro5+). One could also argue that if all modeled behavior was observed, precision should also be 1 (PrecPro6⁰). The latter proposition is debatable, because it implies that non-fitting behavior cannot influence perfect precision. Consider for example extreme cases where the model covers just a small fraction of all observed behavior (or even more extreme situations like τ(m) = ∅).

Proposition 12 (PrecPro5+). For any l ∈ L and m ∈ M such that τ(m) = τ(l): prec(l, m) = 1.

Proposition 13 (PrecPro6⁰). For any l ∈ L and m ∈ M such that τ(m) ⊆ τ(l): prec(l, m) = 1.

According to PrecPro5+ and PrecPro6⁰, prec(l, disc_ofit(l)) = 1 for any log l.
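Propositions such as RecPro4⁰ and PrecPro4⁰ can also be explored empirically. The sketch below (illustrative names) implements the log-scaling operator lᵏ and confirms that the frequency-based measure rec_L2M_EB from Definition 14 is invariant under scaling.

```python
def scale(log, k):
    # lᵏ: multiply every trace frequency by k (same distribution, larger log).
    return {t: k * f for t, f in log.items()}

def rec_eb(log, fits):
    # rec_L2M_EB: fraction of cases in the log whose trace fits the model.
    n = sum(log.values())
    return sum(f for t, f in log.items() if fits(t)) / n

tau_m = {('a','b')}
fits = lambda t: t in tau_m
log = {('a','b'): 3, ('c',): 2}

# rec_L2M_EB([⟨a,b⟩³, ⟨c⟩²], m) = 3/5, unchanged for any k (RecPro4⁰).
assert rec_eb(log, fits) == rec_eb(scale(log, 4), fits) == 0.6
```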
6.4 Generalization Propositions

Generalization (gen ∈ L × M → [0, 1]) aims to quantify the likelihood that new unseen cases will fit the model. This conformance dimension is a bit different from the other two dimensions, because it reasons about future unseen cases (i.e., cases not yet in the event log). If recall is good and the log is complete with lots of repeating behavior, then future cases will most likely fit the model. Analogous to recall, model extensions cannot lower generalization (GenPro1+), extending the log with fitting behavior cannot lower generalization (GenPro2+), and extending the log with non-fitting behavior cannot improve generalization (GenPro3+).

Proposition 14 (GenPro1+). For any l ∈ L and m1, m2 ∈ M such that τ(m1) ⊆ τ(m2): gen(l, m1) ≤ gen(l, m2).

Proposition 15 (GenPro2+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3 and τ(l3) ⊆ τ(m): gen(l1, m) ≤ gen(l2, m).

Proposition 16 (GenPro3+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3 and τ(l3) ⊆ τ̄(m): gen(l1, m) ≥ gen(l2, m).

Duplicating the event log does not necessarily influence recall and precision. According to propositions RecPro4⁰ and PrecPro4⁰, this should have no effect on recall and precision. However, making the event log more redundant should have an effect on generalization. For fitting logs, adding redundancy without changing the distribution can only improve generalization (GenPro4+). For non-fitting logs, adding redundancy without changing the distribution can only lower generalization (GenPro5+). Note that GenPro4+ and GenPro5+ are special cases of GenPro6⁰ and GenPro7⁰. GenPro6⁰ and GenPro7⁰ consider logs where some traces are fitting and others are not. For a log where more than half of the traces are fitting, duplication can only improve generalization (GenPro6⁰). For a log where more than half of the traces are non-fitting, duplication can only lower generalization (GenPro7⁰).

Proposition 17 (GenPro4+). For any l ∈ L, m ∈ M, and k ≥ 1 such that τ(l) ⊆ τ(m): gen(lᵏ, m) ≥ gen(l, m).

Proposition 18 (GenPro5+). For any l ∈ L, m ∈ M, and k ≥ 1 such that τ(l) ⊆ τ̄(m): gen(lᵏ, m) ≤ gen(l, m).

Proposition 19 (GenPro6⁰). For any l ∈ L, m ∈ M, and k ≥ 1 such that most traces are fitting (|[t ∈ l | t ∈ τ(m)]| ≥ |[t ∈ l | t ∉ τ(m)]|): gen(lᵏ, m) ≥ gen(l, m).

Proposition 20 (GenPro7⁰). For any l ∈ L, m ∈ M, and k ≥ 1 such that most traces are non-fitting (|[t ∈ l | t ∈ τ(m)]| ≤ |[t ∈ l | t ∉ τ(m)]|): gen(lᵏ, m) ≤ gen(l, m).

When the model allows for any behavior, clearly the next case will also be fitting (GenPro8⁰). Nevertheless, this proposition is marked as controversial because it would also need to hold for an empty event log.

Proposition 21 (GenPro8⁰). For any l ∈ L and m ∈ M such that τ(m) = T: gen(l, m) = 1.

This concludes our first inventory of conformance propositions. As mentioned, their purpose is twofold. First of all, they help to understand the goals and challenges of process discovery. Second, they provide a checklist for existing and future conformance measures. As demonstrated in [20] for precision, most of the existing approaches violate seemingly obvious requirements. This justifies the formulation of the above 21 conformance propositions.
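Propositions of this kind also lend themselves to mechanical checking: given an implemented measure, one can search for counterexamples by random testing. A minimal sketch (assuming a measure gen(log, fits) and user-supplied random log generators):

```python
def check_genpro2(gen, fits, random_log, random_fitting_log, trials=1000):
    # Search for a counterexample to GenPro2+: extending a log with fitting
    # traces (l2 = l1 ⊎ l3 with τ(l3) ⊆ τ(m)) should never lower generalization.
    for _ in range(trials):
        l1, l3 = random_log(), random_fitting_log()
        l2 = {t: l1.get(t, 0) + l3.get(t, 0) for t in set(l1) | set(l3)}  # l1 ⊎ l3
        if gen(l2, fits) < gen(l1, fits):
            return l1, l3          # counterexample: generalization dropped
    return None                    # no violation found in `trials` attempts
```

Applied to gen_L2M_q from Definition 15 with suitable generators, such a search quickly finds violations of GenPro2+, consistent with the table in the next section.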
7 Example Evaluation

Although we do not aim to propose new conformance measures, we defined six L2M measures in Section 5. We use these example measures to discuss the usefulness of the 21 conformance propositions. Table 1 shows for each of the six conformance measures defined in Section 5 which conformance propositions always hold (marked with √). The cells tagged with × are measure-proposition combinations that can be violated.

Table 1. Overview of the conformance propositions that hold for the six example measures (under the assumption that l ≠ [ ] and π(m)(τ(m)) > 0): √ means that the proposition holds for any log and model, × means that the proposition does not always hold, and - means that the proposition addresses a different type of measure.

 #  Proposition  rec_L2M_TB  rec_L2M_EB  prec_L2M_TB  prec_L2M_EB  gen_L2M_q  gen_L2M_HB
 1  DetPro+          √           √           √            √            √           √
 2  BehPro+          √           √           √            √            √           √
 3  RecPro1+         √           √           -            -            -           -
 4  RecPro2+         √           √           -            -            -           -
 5  RecPro3+         √           √           -            -            -           -
 6  RecPro4⁰         √           √           -            -            -           -
 7  RecPro5+         √           √           -            -            -           -
 8  PrecPro1+        -           -           √            ×            -           -
 9  PrecPro2+        -           -           √            √            -           -
10  PrecPro3⁰        -           -           √            √            -           -
11  PrecPro4⁰        -           -           √            √            -           -
12  PrecPro5+        -           -           √            √            -           -
13  PrecPro6⁰        -           -           √            √            -           -
14  GenPro1+         -           -           -            -            √           √
15  GenPro2+         -           -           -            -            ×           ×
16  GenPro3+         -           -           -            -            √           √
17  GenPro4+         -           -           -            -            √           √
18  GenPro5+         -           -           -            -            √           √
19  GenPro6⁰         -           -           -            -            √           √
20  GenPro7⁰         -           -           -            -            ×           ×
21  GenPro8⁰         -           -           -            -            ×           ×

Consider for example proposition RecPro1+ and measure rec_L2M_EB. Obviously, |[t ∈ l | t ∈ τ(m1)]|/|l| ≤ |[t ∈ l | t ∈ τ(m2)]|/|l| if τ(m1) ⊆ τ(m2). Next, we consider PrecPro1+ and prec_L2M_TB. |τ(l) ∩ τ(m1)|/|τ(m1)| ≥ |τ(l) ∩ τ(m2)|/|τ(m2)| if τ(m1) ⊆ τ(m2) and τ(l) ∩ (τ(m2) \ τ(m1)) = ∅ (because then τ(l) ∩ τ(m2) = τ(l) ∩ τ(m1)). Similarly, we can show the correctness of all √ entries in Table 1.

More interesting are the × entries in the table. prec_L2M_EB does not satisfy proposition PrecPro1+ because π(m1) may be very different from π(m2) even when τ(m1) = τ(m2). Both generalization measures do not satisfy GenPro2+: when adding fitting traces, these traces may be of low frequency, possibly lowering both generalization measures. They also do not satisfy GenPro7⁰: from the two definitions we can see that replication can only lead to better generalization values, even when the majority of traces is non-fitting. gen_L2M_q and gen_L2M_HB also do not satisfy GenPro8⁰, because the proposition does not depend on the event log as long as any trace can be generated by the model, whereas the two measures do depend on the event log. Actually, gen_L2M_HB can only approach the value 1. The × entries in the table illustrate that the properties are not as straightforward as they may seem at first glance.

8 Discussion

The goal of this paper is to trigger discussion and to address common misconceptions. When it comes to relating process models and event logs, different viewpoints are possible, and often authors tend to impose their own assumptions on others without fully understanding the different settings. Therefore, we put the definitions and propositions presented in this paper in the context of related work.

As mentioned in the introduction, a broad range of conformance notions has been proposed in literature [1, 3, 4, 7–10, 14, 16, 18, 21, 22]. Moreover, different evaluation frameworks have been proposed [11, 19, 21]. The four quality dimensions described in [1] (recall/fitness, precision, generalization, simplicity) are generally accepted. In [6] it was shown that removing one of these dimensions may lead to degenerate models that are obviously undesirable but score well on the three remaining dimensions.

8.1 Evaluating Models Versus Evaluating Algorithms

In this paper, we considered the setting where two artifacts (e.g., a process model and an event log) are given and need to be compared. We are not evaluating or comparing different discovery algorithms. Hence, one cannot use cross-validation. Cross-validation is a technique to evaluate algorithms by partitioning the original data set into a training set to discover a model, and a test set to evaluate it. In k-fold cross-validation, the original data set is randomly partitioned into k equal size subsets (cf. Figure 4). Of the k subsets, a single subset is retained as the validation data for testing the model, and the remaining k − 1 sets are used as training data. The cross-validation process is then repeated k times (the so-called folds), with each of the k subsets used exactly once as the validation data. The k results from the folds can then be combined to produce a single estimation. The whole data set is used for both training and validation, and each data element is used for validation exactly once. Moreover, by comparing the performance on the different folds, one gets insights into the robustness of the approach.

Fig. 4. The original data set is partitioned into 10 equal size subsets for 10-fold cross-validation. This allows for 10 experiments where 9 subsets are used for training and one subset is used for testing.
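A k-fold partition of an event log distributes cases (not variants) over the folds. A sketch (illustrative only, not tied to a specific process mining library):

```python
import random

def k_fold_logs(log, k=10, seed=42):
    # Partition the cases of an event log (a multiset of traces) into k
    # roughly equal sub-logs, yielding (training log, test log) pairs.
    cases = [t for t, f in log.items() for _ in range(f)]   # expand the multiset
    random.Random(seed).shuffle(cases)
    folds = [cases[i::k] for i in range(k)]
    for i in range(k):
        train = [t for j, fold in enumerate(folds) if j != i for t in fold]
        yield train, folds[i]   # discover on `train`, evaluate on the test fold
```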
Cross-validation approaches can be used to evaluate process discovery algorithms. One can distribute the cases in the original event log over 10 smaller event logs. The discovery algorithm can then be applied to the ten training logs and the resulting models tested on the corresponding test logs. As usual, there is the problem that event logs do not provide negative examples. Therefore, recall is easier to measure than precision. Moreover, one can only use (k-fold) cross-validation to evaluate an algorithm and not a process model. Therefore, it is unrelated to most of the things discussed in this paper, e.g., L2M comparison and the 21 propositions.

8.2 Log Quality Versus Process Variability

In this paper, we assumed that all events are recorded correctly. In [1] a classification is given of different quality issues. Different entities (cases, activity instances, or events) can be missing in the log, missing in reality (recordings of things that did not happen), or concealed in the log (recorded but not easy to locate and select). Event attributes may be missing, incorrect, or imprecise. For example, the timestamp of an event may be missing, an event may be related to another case, or the value of a timestamp may be too coarse-grained (just a date). A particular data quality problem (e.g., a missing attribute) may persist, repeat in an unpredictable manner, or return periodically. In total, 108 data quality problems were identified in [1].

Process variability is orthogonal to log quality. A process that exhibits many different behaviors, some of which are frequent and others infrequent, is difficult to analyze even when the data quality is high. See [13] for a detailed discussion of these two dimensions, referred to as individual trustworthiness (quality of event data from a data perspective) and global conclusiveness (quality of event data from an analysis perspective). In [17] a generalized conformance checking framework is proposed that caters for the common case where one neither fully trusts the log nor the model. Although data quality problems are important, the above papers are mostly unrelated to the propositions discussed in this paper.

8.3 System Model Versus Event Log

Scenario 1 shown in Figure 1 starts from a model that is used to generate an event log (e.g., using simulation), which is subsequently used as input for a process discovery technique that produces another model. One can compare the discovered model with the original model using an M2M (Model-to-Model) assessment. The idea is that there is a known "ground truth", i.e., the original process model. In [6] a similar approach is used by introducing the so-called system that represents the real process generating the behavior. Assume that S, L, and M are the behaviors allowed by the system (S), found in the event log (L), and allowed by the model (M), respectively. To understand the above, let us first assume that S, L, and M are sets of traces. Hence, intersection and union are well-defined (e.g., S ∩ L ∩ M and S ∪ L ∪ M are sets of traces). This allows us to use a Venn diagram to show the relationships between system S, log L, and model M (see Figure 5).

Fig. 5. Venn diagram showing the relationships between system S, log L, and model M.
Given the Venn diagram showing the relationships between system S, log L, and model M, we can identify the following seven types of behavior (the numbers refer to Figure 6):

– (1) Modeled and observed system behavior (S ∩ L ∩ M).
– (2) Not modeled but observed non-system behavior (L \ (M ∪ S)).
– (3) Modeled and observed non-system behavior ((L ∩ M) \ S).
– (4) Modeled but unobserved non-system behavior (M \ (S ∪ L)).
– (5) Modeled but unobserved system behavior ((M ∩ S) \ L).
– (6) Not modeled and unobserved system behavior (S \ (M ∪ L)).
– (7) Not modeled but observed system behavior ((S ∩ L) \ M).

Fig. 6. Given a system S, log L, and model M we can identify seven types of behavior (ignoring non-system behavior that is not modeled and not observed) [6].

Clearly, one would like S and M to coincide. System-model recall can be defined as |S ∩ M| / |S| and can be maximized by minimizing (6) and (7). Precision can be defined as |S ∩ M| / |M| and can be maximized by minimizing (3) and (4). One can also define log-model recall/precision and system-log recall/precision.

Although Figure 6 is useful and consistent with the earlier definitions, it is also too simplistic and partly misleading. First of all, in reality, we do not know the system S. It is only possible to control and know S in an experimental setting (e.g., using a simulation model). Second, the notion of S suggests that behavior can be classified into system behavior and non-system behavior (i.e., a binary classification). This ignores the likelihood of behavior and the purpose of the model (e.g., striving for the 80/20 model). "Murphy's Law of Process Mining" suggests that in principle anything can happen if one is prepared to wait long enough, but it is uninteresting to return a model that allows for any behavior. This paper clearly shows the need to include probabilities. Third, system S may be changing over time. If we partition the timeline into segments i ∈ {1, 2, 3, . . .}, then we will be confronted with different versions of the system Si and new event logs Li. How to relate model Mi to Si and Li while also using earlier observations (e.g., j < i)?