             Relating Process Models and Event Logs
                          21 Conformance Propositions

                                   Wil M.P. van der Aalst

                        Process and Data Science (Informatik 9),
                    RWTH Aachen University, D-52056 Aachen, Germany
                           wvdaalst@rwth-aachen.de



       Abstract. Process mining sheds new light on the relationship between process
       models and real-life processes. Process discovery can be used to learn process
       models from event logs. Conformance checking can be used to confront a de-
       scriptive or normative model with the actual behavior observed. Quantifying the
       relationship between process models and event logs is notoriously difficult. Exist-
       ing approaches seem to make ad-hoc choices, leading to results that are not easy
       to interpret. To address these problems, we present 21 conformance propositions.
       These describe expected or desired properties of conformance measures. These
       cover notions such as recall (also called fitness), precision, and generalization.
       These propositions are put into a broader frame of Log-to-Log (L2L), Model-to-
       Model (M2M), and Log-to-Model (L2M) comparisons. Moreover, it is argued that
       conformance measures need to consider the likelihood of traces. Real-life event
       logs are incomplete showing only sample behavior. Moreover, observed traces
       tend to follow a Pareto distribution. Yet, mainstream process mining approaches
       stick to a “binary view” on process models (observed traces are fitting or not).
       This is hindering progress. The goal is to trigger a discussion by clearly formu-
       lating the challenges and requirements (rather than proposing new measures or
       techniques).

       Keywords: Process mining · Conformance checking · Recall · Precision · Gen-
       eralization.


1 Introduction

In process mining, there are two main types of artifacts: event logs and process models
[1]. In the background, there is the “real process” that is responsible for the events in the
event log. However, this process is only fully known in case one is using artificial (also
called synthetic) event logs. Therefore, we assume that we only have event logs and
models. However, we assume that the models are “rich enough” to generate event data.
This leads to the two scenarios shown in Figure 1. Scenario 1 starts from a model that
is used to generate an event log. This log can be used as input for a process discovery
technique that produces another model. One can compare the discovered model with
the original model using a so-called M2M (Model-to-Model) assessment. We can also
conduct two L2M (Log-to-Model) assessments by comparing the log with both the orig-
inal model and the discovered model. Scenario 2 starts from an event log that is used

[Figure 1: (a) Scenario 1: a process model is used to play-out (simulate) an event log, from which play-in (discovery) produces another process model; the two models are compared (M2M) and the log is compared with each model (L2M). (b) Scenario 2: an event log is used to play-in (discover) a process model, which is used to play-out (simulate) another event log; the two logs are compared (L2L) and the model is compared with each log (L2M).]

Fig. 1. Two scenarios and three types of comparisons: L2M (Log-to-Model), M2M (Model-to-
Model), and L2L (Log-to-Log).



to discover a process model that is subsequently used to produce another event log.
The original and the generated event logs can be compared using an L2L (Log-to-Log)
assessment.
     Event logs only contain example behavior. This not only makes process discovery
challenging; it is also difficult to assess the quality of the process model in relation to
the log. Since process mining focuses on the relationship between observed behavior
and modeled behavior (i.e., L2M assessments), it is important to quantify the relation-
ship between an event log l ∈ L and a process model m ∈ M. This is reflected by
the different conformance measures that have been proposed [1, 3, 4, 7–10, 14, 16, 18,
21, 22]. When learning a process model from event data, there is typically a trade-off
between the following four quality dimensions [1]: (1) recall: the discovered model
should allow for the behavior seen in the event log (avoiding “non-fitting” behavior),
(2) precision: the discovered model should not allow for behavior completely unrelated
to what was seen in the event log (avoiding “underfitting”), (3) generalization: the dis-
covered model should generalize the example behavior seen in the event log (avoiding
“overfitting”), and (4) simplicity: the discovered model should be as simple as possible.
The simplicity dimension refers to Occam’s Razor: “one should not increase, beyond
what is necessary, the number of entities required to explain anything”. In the context
of process mining, this is often operationalized by quantifying the complexity of the
model (number of nodes, number of arcs, understandability, etc.). We do not consider
the simplicity dimension in this paper, since we focus on behavior and abstract from
the actual model representation. Recall is often referred to as fitness in process mining
literature. Sometimes fitness refers to a combination of the four quality dimensions.

To avoid later confusion, we use the term recall which is commonly used in pattern
recognition, information retrieval, and (binary) classification.
    This paper does not aim to provide concrete conformance measures (although some
examples are given). Instead, the focus is on characteristic/desired properties of such
measures. We propose 21 conformance propositions, i.e., properties that recall, pre-
cision, and generalization measures could or should satisfy for any model and log
combination.¹ Researchers and developers creating new measures should be clear on
what conformance propositions they aim to support and ensure that the implemented
measures support these. Users of existing conformance measures should be aware of
seemingly obvious conformance propositions that are violated by existing approaches.
    Next to defining 21 conformance propositions, this paper provides two additional
contributions. First of all, we provide a more holistic perspective on the challenges
related to comparing models and logs. Second, we show that probabilities play a key
role in such definition. The likelihood of traces in models and the frequency of traces in
logs are key elements that should not be ignored.
    It is important to note that this paper focuses on trace-based control-flow-oriented
conformance measures, i.e., we abstract from other event attributes (resources, prod-
ucts, time, costs, etc.) and consider full traces (not local patterns). This scope is clearly
limited. However, it helps to focus the discussion.
The remainder is organized as follows. Section 2 presents some preliminaries: besides
event logs, it also defines process models that include trace likelihoods. Sections 3,
4, and 5 discuss L2L (Log-to-Log), M2M (Model-to-Model), and L2M (Log-to-Model)
comparisons and define example measures. Section 6 describes the 21 conformance
propositions. Section 7 reflects on the conformance propositions using the simple mea-
sures defined in sections 3, 4, and 5. Section 8 discusses the viewpoint used in the paper
and possible alternative viewpoints. Section 9 concludes the paper.

2 Preliminaries
Process mining techniques focus on the relationship between observed behavior and
modeled behavior. Therefore, we first formalize event logs (i.e., observed behavior) and
process models (i.e., modeled behavior). To do this, we consider a very simple setting
where we only focus on the control-flow, i.e., sequences of activities. However, unlike
most other papers, we attach probabilities to traces.

2.1     Event Logs
The starting point for process mining is an event log. Each event in such a log refers to
an activity possibly executed by a resource at a particular time and for a particular case.
An event may have many more attributes, e.g., transactional information, costs, cus-
tomer, location, and unit. Here, we focus on control-flow. Therefore, we only consider
activity labels and the ordering of events within cases.
 ¹ The Merriam-Webster dictionary defines the noun proposition as “an expression in language
   or signs of something that can be believed, doubted, or denied or is either true or false” [15].
   We use the term in this manner. We do not state that all 21 conformance propositions should
   hold universally. We merely use them as potential requirements and discuss their implications.


Definition 1 (Traces). A is the universe of activities. A trace t ∈ A∗ is a sequence of
activities. T = A∗ is the universe of traces.
    Trace t = ⟨a, b, c, d, c⟩ ∈ T refers to 5 events belonging to the same case (i.e.,
|t| = 5). An event log is a collection of cases each represented by a trace.
Definition 2 (Event Log). L = B(T ) is the universe of event logs. An event log l ∈ L
is a finite multiset of observed traces. τ(l) = {t ∈ l} ⊆ T is the set of traces appearing
in l ∈ L. τ̄(l) = {t ∈ T | t ∉ τ(l)} is the complement of l, i.e., the non-observed traces.
     An event log is a multiset of traces. Event log l = [⟨a, b, c⟩⁵, ⟨b, a, d⟩³, ⟨a, b, d⟩²]
refers to 10 cases (i.e., |l| = 10). Five cases are represented by the trace ⟨a, b, c⟩, three
cases are represented by the trace ⟨b, a, d⟩, and two cases are represented by the trace
⟨a, b, d⟩. l(t) is the number of times t appears in l, e.g., l(⟨b, a, d⟩) = 3. We assume
that the usual operators are defined for multisets. l1 ⊎ l2 is the union of two multisets,
|l| is the number of elements, and l1 \ l2 is the difference. l1 ∩ l2 is the intersection of
two multisets. Hence, (l1 ∩ l2)(t) = min(l1(t), l2(t)). [t ∈ l | b(t)] is the multiset of all
elements in l that satisfy some condition b. l1 ≤ l2 means that l1 is contained in l2. For
example, [a², b] ≤ [a², b², c], but [a², b³] ≰ [a², b², c²] and [a², b², c] ≰ [a³, b³].
     τ̄(l) is the set of traces that were not observed in the event log. Note that τ(l) and τ̄(l)
partition the universe of traces T .
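
    To make the multiset notation above concrete, the following sketch shows how these
operators can be realized in Python using collections.Counter as a (hypothetical) event
log representation; all variable names are illustrative.

    from collections import Counter

    # An event log as a multiset: traces (tuples of activities) mapped to frequencies.
    l1 = Counter({("a", "b", "c"): 5, ("b", "a", "d"): 3, ("a", "b", "d"): 2})
    l2 = Counter({("a", "b", "c"): 2, ("c",): 1})

    tau_l1 = set(l1)                             # τ(l1): the distinct trace variants
    size = sum(l1.values())                      # |l1| = 10: the number of cases
    union = l1 + l2                              # l1 ⊎ l2: frequencies are added
    intersection = l1 & l2                       # l1 ∩ l2: per-trace minimum of frequencies
    difference = l1 - l2                         # l1 \ l2: per-trace difference (floored at 0)
    contained = all(l1[t] >= l2[t] for t in l2)  # l2 ≤ l1?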

2.2     Process Models
Often the behavior of a process model m is simply defined by the set of traces allowed
by m. However, there may be vast differences in the likelihoods of traces. For many
processes, the observed traces follow a Pareto distribution. Typically, a limited number
of traces can be observed frequently, while there are often many more traces that are
infrequent.
    Actually, many event logs follow the “80-20 rule”, i.e., more than 80% of all ob-
served traces are explained by less than 20% of all the observed trace variants. Formally
this means there is a set of traces X ⊆ τ (l) such that |[t ∈ l | t ∈ X]|/|l| > 0.8 and
|X|/|τ (l)| < 0.2.
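
    As an illustration, the following hypothetical helper (not taken from any process
mining library) checks this Pareto condition for a Counter-based event log by greedily
accumulating the most frequent trace variants.

    from collections import Counter

    def is_pareto(l: Counter, cases: float = 0.8, variants: float = 0.2) -> bool:
        # Is there a set X of trace variants covering more than `cases` of all
        # observed cases while containing less than `variants` of all variants?
        total_cases = sum(l.values())
        covered, picked = 0, 0
        for _, freq in l.most_common():  # most frequent variants first
            covered += freq
            picked += 1
            if covered / total_cases > cases:
                return picked / len(l) < variants
        return False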
    Because of the distribution of traces in the log, it is often too simplistic to describe
a model as a set of traces. Since “behavior is not binary”, we use a trace likelihood
function to describe the probability of each trace.
Definition 3 (Trace Likelihood Function). Π = T → [0, 1] is the set of all trace
likelihood functions. π ∈ Π assigns a likelihood π(t) to any trace t ∈ T such that
Σ_{t∈T} π(t) = 1.
    When describing a trace likelihood function, we assume that the number of traces
is countable. Therefore, (Ω, F, P) with Ω = T , F = P(T ), and P ∈ F → [0, 1]
such that P(X) = Σ_{t∈X} π(t), defines a proper probability space.² Note that if A is
countable, then also T is countable.
 ² A probability space (Ω, F, P) consists of three parts: a sample space Ω which defines the set
   of all possible outcomes, a set of “events” F where each event is a set containing zero or more
   outcomes, and an assignment of probabilities to the events P ∈ F → [0, 1].


[Figure 2: (a) a Petri net model m1 (with start and end transitions); (b) a BPMN model m2
allowing for the same behavior; (c) the likelihood of each trace; (d) the frequency of each trace:

    π12:  trace abc: likelihood 0.375    l12:  trace abc: frequency 44
          trace bac: likelihood 0.375          trace bac: frequency 36
          trace abd: likelihood 0.125          trace abd: frequency 14
          trace bad: likelihood 0.125          trace bad: frequency 6    ]

Fig. 2. Two process models m1 and m2 having the same behavior: τ(m1) = τ(m2) = {⟨a, b, c⟩,
⟨a, b, d⟩, ⟨b, a, c⟩, ⟨b, a, d⟩}. m1 is a Petri net with a special start activity (occurs once at the
beginning) and a special end activity to signal the end of the trace. m2 is a BPMN (Business
Process Model and Notation) model. The likelihood of each trace π12 = π(m1) = π(m2) (c)
and an example log l12 (d) are also shown.


    For pragmatic reasons, we can also restrict A to be finite and T to traces of a
maximal length (e.g., the maximal length seen in any log). For any set of traces X ⊆ T ,
π(X) = Σ_{t∈X} π(t) is the probability that a run of the process π yields a trace in X.
We assume that traces are sampled independently. When considering time, this is not
realistic (due to queueing), but for the control-flow (routing of a case through the model)
this is a common assumption.

Definition 4 (Process Model). M is the set of process models. A process model m ∈
M has a set of traces τ (m) ⊆ T and a trace likelihood function π(m) ∈ Π.

    The relation between τ(m) and π(m) is not very strict. Sometimes we will require
τ(m) ⊆ {t ∈ T | π(m)(t) > 0} (all modeled traces are possible), {t ∈ T | π(m)(t) >
0} ⊆ τ(m) (the model allows for all possible traces), or even τ(m) = {t ∈ T | π(m)(t) > 0}.
Trace set τ (m) may be an abstraction of the real process and leave out unlikely behavior.
Due to simplification, the trace set τ (m) may also allow for traces that cannot happen
(e.g., particular interleavings or loops).
    We define τ̄(m) = {t ∈ T | t ∉ τ(m)} as the complement of the set of traces
allowed by model m ∈ M (like we did for event logs).
    We distinguish between representation and behavior of a model. Process model
m ∈ M can be represented using a plethora of modeling languages, e.g., Petri nets,
BPMN models, UML activity diagrams, automata, and process trees. Here, we abstract
from the actual representation and focus on behavioral characteristics like τ (m) ⊆ T
and π(m) ∈ Π.
    Figure 2 shows two process models that have the same behavior: τ(m1) = τ(m2) =
{⟨a, b, c⟩, ⟨a, b, d⟩, ⟨b, a, c⟩, ⟨b, a, d⟩} and π12 = π(m1) = π(m2) with π12(⟨a, b, c⟩) =
0.375, π12(⟨b, a, c⟩) = 0.375, π12(⟨a, b, d⟩) = 0.125, π12(⟨b, a, d⟩) = 0.125, and
π12(t) = 0 for all other traces t. The two process models m1 and m2 are not graphically
annotated with probabilities. However, there are extensions of these notations to do
so. We can use for example Stochastic Petri Nets (SPN) or BPMN-based simulation
models. For the models in Figure 2, we assume that initially there is a “race” between a
and b where both have a 50% probability to win. After both a and b occurred, c occurs
with 75% probability and d occurs with 25% probability. This explains the likelihoods
in Figure 2(c).
    Figure 2(d) shows a possible event log l12 composed of 100 traces generated by
one of these models: l12(⟨a, b, c⟩) = 44, l12(⟨b, a, c⟩) = 36, l12(⟨a, b, d⟩) = 14, and
l12(⟨b, a, d⟩) = 6.

[Figure 3: (a) a Petri net model m3 (with start and end transitions); (b) a BPMN model m4
allowing for the same behavior; (c) the likelihood of each trace; (d) the frequency of each trace:

    π34:  trace abc: likelihood 0.375           l34:  trace abc: frequency 4
          trace bac: likelihood 0.375                 trace bac: frequency 3
          trace abcdc: likelihood 0.09375             trace abcdc: frequency 2
          trace bacdc: likelihood 0.09375             trace bacdcdcdcdc: frequency 1
          trace abcdcdc: likelihood 0.0234375
          trace bacdcdc: likelihood 0.0234375
          trace abcdcdcdc: likelihood 0.005859375
          ...]

Fig. 3. Two process models m3 and m4 allowing for the same set of traces (τ(m3) = τ(m4))
and having identical trace likelihoods (π34 = π(m3) = π(m4)) and an example log l34 (d).

    Process models may allow for an infinite number of traces. We use Figure 3 to
illustrate this. Again both processes (m3 and m4) express the same behavior. Again
there is a “race” between a and b where both have a 50% probability to win. After a
and b, activity c will occur. Then there is a 75% probability that the process ends and a
25% probability that d occurs. Let t_{a,k} = ⟨a, b⟩ · (⟨c, d⟩)ᵏ · ⟨c⟩ be the trace that starts
with a and where d is executed k times. t_{b,k} = ⟨b, a⟩ · (⟨c, d⟩)ᵏ · ⟨c⟩ is the trace that
starts with b and where d is executed k times. τ(m3) = τ(m4) = ⋃_{k≥0} {t_{a,k}, t_{b,k}}.
π34 = π(m3) = π(m4) with π34(t_{a,k}) = π34(t_{b,k}) = 0.5 · (0.25)ᵏ · 0.75. Some
example values are given in Figure 3(c).
    Figure 3(d) shows a possible event log l34 composed of 10 traces generated by one
of the models (m3 or m4): l34(⟨a, b, c⟩) = 4, l34(⟨b, a, c⟩) = 3, l34(⟨a, b, c, d, c⟩) = 2,
and l34(⟨b, a, c, d, c, d, c, d, c, d, c⟩) = 1. Since any log contains only a finite number of
traces, one can never observe all traces possible in m3 or m4.
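
    The geometric structure of these likelihoods can be checked numerically. The sketch
below (illustrative code; pi34 is a hypothetical helper mirroring π34) reproduces the
values in Figure 3(c) and confirms that the likelihoods of the infinitely many traces sum
to 1.

    def pi34(k: int) -> float:
        # start with a or b (0.5), loop over d exactly k times (0.25 each), then stop (0.75)
        return 0.5 * 0.25**k * 0.75

    print(pi34(0))  # 0.375      (abc and bac)
    print(pi34(1))  # 0.09375    (abcdc and bacdc)
    print(pi34(2))  # 0.0234375  (abcdcdc and bacdcdc)

    # 2 * sum_{k>=0} 0.5 * 0.25^k * 0.75 = 0.75 / (1 - 0.25) = 1
    print(sum(2 * pi34(k) for k in range(100)))  # ~1.0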

2.3   Process Discovery
A discovery algorithm takes an event log as input and returns a process model. For
example, m1 and m2 in Figure 2 could have been discovered based on event log l =
[⟨a, b, c⟩⁵, ⟨b, a, d⟩³, ⟨a, b, d⟩²]. Ideally, the process model captures the (dominant) be-
havior observed, but also generalizes without becoming too imprecise. For example,
m1 and m2 allow for trace t = ⟨b, a, c⟩ although this was never observed.

Definition 5 (Discovery Algorithm). A discovery algorithm can be described as a
function disc ∈ L → M mapping event logs onto process models.

    We abstract from concrete discovery algorithms. Over 100 discovery algorithms
have been proposed in literature [1]. Most of the discovery algorithms do not include
probabilities. However, by replaying the event log on the discovered process model,
these can be added easily. Merely as a reference to explain basic notions, we define
three simple, but extreme, algorithms: disc_ofit, disc_ufit, and disc_nfit. Let l ∈ L be
a log. disc_ofit(l) = m_o such that τ(m_o) = τ(l) produces an overfitting model that
allows only for the behavior seen in the log. disc_ufit(l) = m_u such that τ(m_u) = T
produces an underfitting model that allows for any behavior. disc_nfit(l) = m_n such that
τ(m_n) = τ̄(l) produces a non-fitting model that allows for all behavior not seen in the
log. We do not define the probabilities explicitly here. Just assume some function that
assigns positive values to all included traces.
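
    A minimal sketch of the three extreme algorithms is given below. Since τ(m) may
be infinite (disc_ufit and disc_nfit), a model is represented here by a membership
predicate for its trace set rather than an explicit set; apart from the function names taken
from the definitions above, everything is illustrative.

    def disc_ofit(log):
        tau_l = set(log)
        return lambda trace: trace in tau_l      # τ(m_o) = τ(l): only observed behavior

    def disc_ufit(log):
        return lambda trace: True                # τ(m_u) = T: any behavior

    def disc_nfit(log):
        tau_l = set(log)
        return lambda trace: trace not in tau_l  # τ(m_n) = complement of τ(l)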


3 L2L: Log-To-Log Comparison
As Figure 1 shows, we are interested in three types of conformance measurements. Al-
though the focus will be on L2M (Log-to-Model) comparisons, we also briefly consider
L2L (Log-to-Log) and M2M (Model-to-Model) comparisons. We start by providing
precision and recall measures for L2L comparisons.

Definition 6 (Set-Based L2L Precision and Recall). Let l1, l2 ∈ L be two logs. Set-
based L2L precision and recall are defined as follows:

    rec_L2L_SB(l1, l2) = |τ(l1) ∩ τ(l2)| / |τ(l1)|     prec_L2L_SB(l1, l2) = |τ(l1) ∩ τ(l2)| / |τ(l2)|     (1)

Definition 7 (Multiset-Based L2L Precision and Recall). Let l1, l2 ∈ L be two logs.
Multiset-based L2L precision and recall are defined as follows:

    rec_L2L_MSB(l1, l2) = |l1 ∩ l2| / |l1|     prec_L2L_MSB(l1, l2) = |l1 ∩ l2| / |l2|     (2)

    rec_L2L_MSB and prec_L2L_MSB take into account the frequencies of traces. In rec_L2L_SB
and prec_L2L_SB it is not relevant how many times the same trace appears in the log.
Consider l12 in Figure 2(d) and l34 in Figure 3(d). rec_L2L_SB(l12, l34) = 2/4 = 0.5,
prec_L2L_SB(l12, l34) = 2/4 = 0.5, rec_L2L_MSB(l12, l34) = (4 + 3)/100 = 0.07, and
prec_L2L_MSB(l12, l34) = (4 + 3)/10 = 0.7. This shows that the latter two measures only
make sense if both logs have comparable sizes or things are normalized.
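
    The following sketch computes the four L2L measures for Counter-based logs and
reproduces the numbers above (all names are illustrative).

    from collections import Counter

    def rec_l2l_sb(l1, l2):    # |τ(l1) ∩ τ(l2)| / |τ(l1)|
        return len(set(l1) & set(l2)) / len(set(l1))

    def prec_l2l_sb(l1, l2):   # |τ(l1) ∩ τ(l2)| / |τ(l2)|
        return len(set(l1) & set(l2)) / len(set(l2))

    def rec_l2l_msb(l1, l2):   # |l1 ∩ l2| / |l1|
        return sum((l1 & l2).values()) / sum(l1.values())

    def prec_l2l_msb(l1, l2):  # |l1 ∩ l2| / |l2|
        return sum((l1 & l2).values()) / sum(l2.values())

    l12 = Counter({("a","b","c"): 44, ("b","a","c"): 36, ("a","b","d"): 14, ("b","a","d"): 6})
    l34 = Counter({("a","b","c"): 4, ("b","a","c"): 3, ("a","b","c","d","c"): 2,
                   ("b","a","c","d","c","d","c","d","c","d","c"): 1})
    print(rec_l2l_sb(l12, l34), prec_l2l_sb(l12, l34))    # 0.5 0.5
    print(rec_l2l_msb(l12, l34), prec_l2l_msb(l12, l34))  # 0.07 0.7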

4 M2M: Model-To-Model Comparison

We also define two M2M (Model-to-Model) recall and precision measures.

Definition 8 (Trace-Based M2M Precision and Recall). Let m1, m2 ∈ M be two
process models. Trace-based M2M precision and recall are defined as follows:

    rec_M2M_TB(m1, m2) = |τ(m1) ∩ τ(m2)| / |τ(m1)|     (3)

    prec_M2M_TB(m1, m2) = |τ(m1) ∩ τ(m2)| / |τ(m2)|     (4)

    Note that these recall and precision measures do not take into account the likeli-
hood of traces. Moreover, they are undefined if the number of traces is unbounded, i.e.,
|τ (m1 )| = ∞ or |τ (m2 )| = ∞. These two issues can be addressed by taking into
account trace likelihoods, resulting in the following measures.

Definition 9 (Likelihood-Based M2M Precision and Recall). Let m1, m2 ∈ M be
two process models. Likelihood-based M2M precision and recall are defined as follows:

    rec_M2M_LB(m1, m2) = π(m1)(τ(m1) ∩ τ(m2)) / π(m1)(τ(m1))     (5)

    prec_M2M_LB(m1, m2) = π(m2)(τ(m1) ∩ τ(m2)) / π(m2)(τ(m2))     (6)

Note that π(m1 )(τ (m1 ) ∩ τ (m2 )) quantifies the likelihood of traces in the set τ (m1 ) ∩
τ (m2 ) according to model m1 (recall that π(m1 ) ∈ Π = T → [0, 1] is a trace like-
lihood function). To compute recall, we compare the traces allowed by both models
(τ (m1 ) ∩ τ (m2 )) with the traces allowed by the first model (τ (m1 )). π(m1 ) is used to
quantify the fractions of traces. To compute precision, we compare the traces allowed
by both models (τ (m1 ) ∩ τ (m2 )) with the traces allowed by the second model (τ (m2 )).
Now, π(m2 ) is used to quantify the fractions of traces. Note that it does not make any
sense to use π(m1 ) when computing precision, because the behavior in τ (m2 ) \ τ (m1 )
will be unlikely according to π(m1 ). Hence, precision based on π(m1 ) would always
yield a high value even when m2 allows for lots of additional behavior. Similarly, it
does not make any sense to use π(m2 ) when computing recall.
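
    A possible realization is sketched below, assuming each model is encoded as a pair
(tau, pi) consisting of a finite set of traces and a dict-based likelihood function (both
encodings are assumptions made for illustration; the definitions also cover infinite trace
sets, which would require a different representation).

    def pi_of(pi, traces):                 # π(m)(X) = sum of π(m)(t) over t in X
        return sum(pi.get(t, 0.0) for t in traces)

    def rec_m2m_lb(m1, m2):
        (tau1, pi1), (tau2, _) = m1, m2
        return pi_of(pi1, tau1 & tau2) / pi_of(pi1, tau1)

    def prec_m2m_lb(m1, m2):
        (tau1, _), (tau2, pi2) = m1, m2
        return pi_of(pi2, tau1 & tau2) / pi_of(pi2, tau2)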


5 L2M: Log-To-Model Comparison

The recall and precision measures in Definition 9 seem quite natural for models and
there is no need for additional notions such as generalization. However, when consider-
ing event logs, recall and precision are not addressing the fact that there is a difference
between (1) the behavior in the log and (2) possible future behavior. Overfitting models
may have perfect recall, but may be unable to allow for future cases.
    In the remainder, we assume the existence of three functions: rec(), prec(), gen().
All three take a log and model as input and return a value between 0 and 1. The higher

the value, the better. In process mining literature one can find many proposals for such
functions. Although we provide a few example functions for clarity, we focus on desired
properties of conformance measures rather than concrete measures.
Definition 10 (Recall). A recall measure rec ∈ L × M → [0, 1] aims to quantify the
fraction of observed behavior that is allowed by the model.
Definition 11 (Precision). A precision measure prec ∈ L × M → [0, 1] aims to quan-
tify the fraction of behavior allowed by the model that was actually observed.
Definition 12 (Generalization). A generalization measure gen ∈ L × M → [0, 1]
aims to quantify the likelihood that new unseen cases will fit the model.
    If we ignore frequencies and likelihoods of traces, we can simply count fractions of
traces yielding the following two simple measures.
Definition 13 (Trace-Based L2M Precision and Recall). Let l ∈ L and m ∈ M be
an event log and a process model. Trace-based L2M precision and recall are defined as
follows:
    rec_L2M_TB(l, m) = |τ(l) ∩ τ(m)| / |τ(l)|     prec_L2M_TB(l, m) = |τ(l) ∩ τ(m)| / |τ(m)|     (7)

    Since |τ(l)| is bounded by the size of the log, rec_L2M_TB(l, m) is well-defined. How-
ever, prec_L2M_TB(l, m) is undefined when τ(m) is unbounded (e.g., in case of loops).
    Another approach is to build a dummy process model ml based on the event log and
then apply Definition 9.
Definition 14 (Estimation-Based L2M Precision and Recall). Let l ∈ L and m ∈ M
be an event log and a process model. Let m_l be a model created based on l such that
τ(m_l) = τ(l) and π(m_l)(t) = l(t)/|l|. Estimation-based L2M precision and recall are
defined as follows:

    rec_L2M_EB(l, m) = π(m_l)(τ(m_l) ∩ τ(m)) / π(m_l)(τ(m_l)) = |[t ∈ l | t ∈ τ(m)]| / |l|     (8)

    prec_L2M_EB(l, m) = π(m)(τ(m_l) ∩ τ(m)) / π(m)(τ(m)) = π(m)(τ(l) ∩ τ(m)) / π(m)(τ(m))     (9)
    The above measures do not capture generalization. Note that generalization is good
when the process model is also able to explain future traces (i.e., unseen cases). There-
fore, generalization takes into account the redundancy of fitting behavior. If the log
contains many fitting traces that are also frequent, then generalization is likely to be
good.
Definition 15 (L2M Generalization). Let l ∈ L and m ∈ M be an event log and a
process model. Two example L2M generalization measures are defined as follows:

    gen_L2M_q(l, m) = |[t ∈ l | t ∈ τ(m) ∧ l(t) ≥ q]| / |l|     with q ≥ 1     (10)

    gen_L2M_HB(l, m) = (|[t ∈ l | t ∈ τ(m)]| − |τ(l) ∩ τ(m)|) / |l|     (11)

                                               64
    gen_L2M_q(l, m) uses a threshold q. The more fitting traces there are that have a fre-
quency of at least q, the better. Non-fitting traces or traces that are unique lower gener-
alization. gen_L2M_HB(l, m) measures the fraction of traces that is fitting, but subtracts
a penalty for each distinct fitting trace. Consider l′ = [⟨a, b, c⟩⁶⁰, ⟨b, a, d⟩³⁸, ⟨a, b, d⟩¹, ⟨c, b, a⟩¹]
and m1 in Figure 2 and let q = 10. gen_L2M_q(l′, m1) = (60 + 38)/100 = 0.98 and
gen_L2M_HB(l′, m1) = ((60 + 38 + 1) − 3)/100 = 0.96.
    Assume now that l″ = [⟨a, b, c⟩⁷, ⟨b, a, d⟩¹, ⟨a, b, d⟩¹, ⟨c, b, a⟩¹]. gen_L2M_q(l″, m1) =
0/10 = 0.0 and gen_L2M_HB(l″, m1) = ((7 + 1 + 1) − 3)/10 = 0.6. Note that, although
the set of traces in the event log did not change (τ(l′) = τ(l″)), it is reasonable to
assume that generalization is impacted.
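
    The two generalization measures can be sketched in the same style; the example
reproduces the values for l′ computed above (Counter-based logs, illustrative names).

    from collections import Counter

    def gen_l2m_q(log, tau_m, q=10):
        # fraction of cases whose trace fits the model and occurs at least q times
        return sum(f for t, f in log.items() if t in tau_m and f >= q) / sum(log.values())

    def gen_l2m_hb(log, tau_m):
        # fitting cases minus one penalty per distinct fitting trace, as a fraction
        fitting_cases = sum(f for t, f in log.items() if t in tau_m)
        distinct_fitting = len(set(log) & tau_m)
        return (fitting_cases - distinct_fitting) / sum(log.values())

    tau_m1 = {("a","b","c"), ("a","b","d"), ("b","a","c"), ("b","a","d")}
    l_prime = Counter({("a","b","c"): 60, ("b","a","d"): 38, ("a","b","d"): 1, ("c","b","a"): 1})
    print(gen_l2m_q(l_prime, tau_m1))   # 0.98
    print(gen_l2m_hb(l_prime, tau_m1))  # 0.96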


6 A Collection of Conformance Propositions

Many recall measures have been proposed in literature [1, 3, 4, 7–10, 14, 16, 18, 21, 22].
In recent years, several precision measures have also been proposed [5, 20]. Only a few
generalization measures have been proposed [3]. The goal of this paper is not to present
new measures, but to define properties for such quality measures. Through this, we hope
to facilitate a better understanding of process discovery and conformance checking.
Moreover, these properties may help to choose an existing measure or to define new
ones.


6.1   General Propositions

In the remainder, we provide 21 conformance propositions. The Merriam-Webster dic-
tionary defines the noun proposition as “an expression in language or signs of some-
thing that can be believed, doubted, or denied or is either true or false” [15]. Most of
the conformance propositions have broad support from the community, i.e., there is
broad consensus that these propositions should hold. These are marked with a “+”. For
example, the first two propositions are commonly accepted; the computation of a qual-
ity measure should be deterministic (DetPro+ ) and only depend on behavioral aspects
(BehPro+ ). The latter is a design choice. We deliberately exclude simplicity notions.
More controversial propositions are marked with a “0” (rather than a “+”).

Proposition 1 (DetPro+ ). rec(), prec(), gen() are deterministic functions, i.e., the
measures rec(l, m), prec(l, m), gen(l, m) are fully determined by l ∈ L and m ∈ M.

Proposition 2 (BehPro+ ). For any l ∈ L and m1 , m2 ∈ M such that τ (m1 ) =
τ (m2 ) and π(m1 ) = π(m2 ): rec(l, m1 ) = rec(l, m2 ), prec(l, m1 ) = prec(l, m2 ),
and gen(l, m1 ) = gen(l, m2 ), i.e., the measures are fully determined by the behavior
observed and the behavior described by the model (representation does not matter).


6.2   Recall Propositions

First, we consider a few recall propositions. rec ∈ L × M → [0, 1] aims to quantify
the fraction of observed behavior that is allowed by the model.

Proposition 3 (RecPro1+ ). For any l ∈ L and m1 , m2 ∈ M such that τ (m1 ) ⊆
τ (m2 ): rec(l, m1 ) ≤ rec(l, m2 ).

    Proposition RecPro1+ states that extending the model to allow for more behavior
can never result in a lower recall. Similarly, adding fitting behavior to the event log
cannot lower recall (RecPro2+). Adding non-fitting behavior to the log cannot improve
recall (RecPro3+).

Proposition 4 (RecPro2+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3
and τ(l3) ⊆ τ(m): rec(l1, m) ≤ rec(l2, m).

Proposition 5 (RecPro3+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3
and τ(l3) ⊆ τ̄(m): rec(l1, m) ≥ rec(l2, m).

    For any natural number k: lᵏ(t) = k · l(t), e.g., if l = [⟨a, b⟩³, ⟨c⟩²], then l⁴ =
[⟨a, b⟩¹², ⟨c⟩⁸]. We use this notation to enlarge event logs without changing the distribu-
tion. One could argue that this should not influence recall (RecPro4⁰), e.g., rec([⟨a, b⟩³,
⟨c⟩²], m) = rec([⟨a, b⟩¹², ⟨c⟩⁸], m). However, unlike the previous propositions, this re-
quirement is debatable as is indicated by the “0” tag.

Proposition 6 (RecPro4⁰). For any l ∈ L, m ∈ M, and k ≥ 1: rec(lᵏ, m) =
rec(l, m).

    Finally, we provide a proposition stating that recall should be 1 if all traces in the
log fit the model (RecPro5+ ). As a result, the empty log has recall 1 for any model.

Proposition 7 (RecPro5+ ). For any l ∈ L and m ∈ M such that τ (l) ⊆ τ (m):
rec(l, m) = 1.

    Based on this proposition, rec(l, disc_ofit(l)) = rec(l, disc_ufit(l)) = 1 for any log l.

6.3     Precision Propositions
Precision (prec ∈ L×M → [0, 1]) aims to quantify the fraction of behavior allowed by
the model that was actually observed. In [20] several precision axioms were introduced.
These partly overlap with the propositions below (but more are added and some have
been strengthened). PrecPro1+ states that removing behavior from a model that does
not happen in the event log cannot lead to a lower precision. Adding fitting traces to the
event log also cannot lower precision (PrecPro2+). However, adding non-fitting traces
to the event log should not change precision (PrecPro3⁰).

Proposition 8 (PrecPro1+ ). For any l ∈ L and m1 , m2 ∈ M such that τ (m1 ) ⊆
τ (m2 ) and τ (l) ∩ (τ (m2 ) \ τ (m1 )) = ∅: prec(l, m1 ) ≥ prec(l, m2 ).

Proposition 9 (PrecPro2+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3
and τ(l3) ⊆ τ(m): prec(l1, m) ≤ prec(l2, m).

Proposition 10 (PrecPro3⁰). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3
and τ(l3) ⊆ τ̄(m): prec(l1, m) = prec(l2, m).

                                                66
    One could also argue that duplicating the event log should not influence precision
because the distribution remains the same (PrecPro4⁰), e.g., prec([⟨a, b⟩²⁰, ⟨c⟩²⁰], m) =
prec([⟨a, b⟩⁴⁰, ⟨c⟩⁴⁰], m).

Proposition 11 (PrecPro4⁰). For any l ∈ L, m ∈ M, and k ≥ 1: prec(lᵏ, m) =
prec(l, m).

    If the model allows for the behavior observed and nothing more, precision should be
maximal (PrecPro5+). One could also argue that if all modeled behavior was observed,
precision should also be 1 (PrecPro6⁰). The latter proposition is debatable, because it
implies that non-fitting behavior cannot influence precision. Consider for
example extreme cases where the model covers just a small fraction of all observed
behavior (or even more extreme situations like τ(m) = ∅).

Proposition 12 (PrecPro5+ ). For any l ∈ L and m ∈ M such that τ (m) = τ (l):
prec(l, m) = 1.

Proposition 13 (PrecPro6⁰). For any l ∈ L and m ∈ M such that τ(m) ⊆ τ(l):
prec(l, m) = 1.

    According to PrecPro5+ and PrecPro6⁰, prec(l, disc_ofit(l)) = 1 for any log l.


6.4    Generalization Propositions

Generalization (gen ∈ L × M → [0, 1]) aims to quantify the likelihood that new un-
seen cases will fit the model. This conformance dimension is a bit different from the
other two dimensions, because it reasons about future unseen cases (i.e., not yet in the
event log). If the recall is good and the log is complete with lots of repeating behavior,
then future cases will most likely fit the model. Analogous to recall, model extensions
cannot lower generalization (GenPro1+ ), extending the log with fitting behavior can-
not lower generalization (GenPro2+ ), and extending the log with non-fitting behavior
cannot improve generalization (GenPro3+ ).

Proposition 14 (GenPro1+ ). For any l ∈ L and m1 , m2 ∈ M such that τ (m1 ) ⊆
τ (m2 ): gen(l, m1 ) ≤ gen(l, m2 ).

Proposition 15 (GenPro2+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3
and τ(l3) ⊆ τ(m): gen(l1, m) ≤ gen(l2, m).

Proposition 16 (GenPro3+). For any l1, l2, l3 ∈ L and m ∈ M such that l2 = l1 ⊎ l3
and τ(l3) ⊆ τ̄(m): gen(l1, m) ≥ gen(l2, m).

    Duplicating the event log does not change the distribution of traces; according to
propositions RecPro4⁰ and PrecPro4⁰, this should have no effect on recall and preci-
sion. However, making the event log more redundant should have an effect on gener-
alization. For fitting logs, adding redundancy without changing the distribution
can only improve generalization (GenPro4+). For non-fitting logs, adding redundancy
without changing the distribution can only lower generalization (GenPro5+). Note that
GenPro4+ and GenPro5+ are special cases of GenPro6⁰ and GenPro7⁰. GenPro6⁰
and GenPro7⁰ consider logs where some traces are fitting and others are not. For a log
where more than half of the traces are fitting, duplication can only improve generaliza-
tion (GenPro6⁰). For a log where more than half of the traces are non-fitting, duplication
can only lower generalization (GenPro7⁰).

Proposition 17 (GenPro4+). For any l ∈ L, m ∈ M, and k ≥ 1 such that τ(l) ⊆
τ(m): gen(lᵏ, m) ≥ gen(l, m).

Proposition 18 (GenPro5+). For any l ∈ L, m ∈ M, and k ≥ 1 such that τ(l) ⊆
τ̄(m): gen(lᵏ, m) ≤ gen(l, m).

Proposition 19 (GenPro6⁰). For any l ∈ L, m ∈ M, and k ≥ 1 such that most traces
are fitting (|[t ∈ l | t ∈ τ(m)]| ≥ |[t ∈ l | t ∉ τ(m)]|): gen(lᵏ, m) ≥ gen(l, m).

Proposition 20 (GenPro7⁰). For any l ∈ L, m ∈ M, and k ≥ 1 such that most traces
are non-fitting (|[t ∈ l | t ∈ τ(m)]| ≤ |[t ∈ l | t ∉ τ(m)]|): gen(lᵏ, m) ≤ gen(l, m).

    When the model allows for any behavior, clearly the next case will also be fitting
(GenPro8⁰). Nevertheless, it is marked as controversial because the proposition would
also need to hold for an empty event log.

Proposition 21 (GenPro8⁰). For any l ∈ L and m ∈ M such that τ(m) = T :
gen(l, m) = 1.

    This concludes this first inventory of conformance propositions. As mentioned, their
purpose is twofold. First of all, they help to understand the goals and challenges of pro-
cess discovery. Second, they provide a checklist for existing and future conformance
measures. As demonstrated in [20] for precision, most of the existing approaches vi-
olate seemingly obvious requirements. This justifies the formulation of the above 21
conformance propositions.


7 Example Evaluation

Although we do not aim to propose new conformance measures, we defined six L2M
measures in Section 5. We use these example measures to discuss the usefulness of the
21 conformance propositions. Table 1 shows for each of the six conformance measures
defined in Section 5 which conformance propositions always hold (marked with √).
The cells tagged with × are measure-proposition combinations that can be violated.
     Consider for example proposition RecPro1+ and measure rec_L2M_EB. Obviously,
|[t ∈ l | t ∈ τ(m1)]|/|l| ≤ |[t ∈ l | t ∈ τ(m2)]|/|l| if τ(m1) ⊆ τ(m2). Next, we con-
sider PrecPro1+ and prec_L2M_TB. |τ(l) ∩ τ(m1)|/|τ(m1)| ≥ |τ(l) ∩ τ(m2)|/|τ(m2)|
if τ(m1) ⊆ τ(m2) and τ(l) ∩ (τ(m2) \ τ(m1)) = ∅ (because τ(l) ∩ τ(m2) =
τ(l) ∩ τ(m1)). Similarly, we can show the correctness of all √ entries in Table 1.
Table 1. Overview of the conformance propositions that hold for the six example measures (under
the assumption that l ≠ [ ] and π(m)(τ(m)) > 0): √ means that the proposition holds for any
log and model and × means that the proposition does not always hold.

 Proposition  Name        rec_L2M_TB  rec_L2M_EB  prec_L2M_TB  prec_L2M_EB  gen_L2M_q  gen_L2M_HB
 1            DetPro+         √           √            √            √           √           √
 2            BehPro+         √           √            √            √           √           √
 3            RecPro1+        √           √
 4            RecPro2+        √           √
 5            RecPro3+        √           √
 6            RecPro4⁰        √           √
 7            RecPro5+        √           √
 8            PrecPro1+                                √            ×
 9            PrecPro2+                                √            √
 10           PrecPro3⁰                                √            √
 11           PrecPro4⁰                                √            √
 12           PrecPro5+                                √            √
 13           PrecPro6⁰                                √            √
 14           GenPro1+                                                          √           √
 15           GenPro2+                                                          ×           ×
 16           GenPro3+                                                          √           √
 17           GenPro4+                                                          √           √
 18           GenPro5+                                                          √           √
 19           GenPro6⁰                                                          √           √
 20           GenPro7⁰                                                          ×           ×
 21           GenPro8⁰                                                          ×           ×

    More interesting are the × entries in the table. prec_L2M_EB does not satisfy proposi-
tion PrecPro1+ because π(m1) may be very different from π(m2) even when τ(m1) =
τ(m2). Both generalization measures do not satisfy GenPro2+: when adding fitting
traces, these traces may be infrequent, possibly lowering both generalization measures.
They also do not satisfy GenPro7⁰: from the two definitions we can see that replication
can only lead to better generalization values, even when the majority of traces is non-
fitting. gen_L2M_q and gen_L2M_HB also do not satisfy GenPro8⁰ because the proposition
does not depend on the event log as long as any trace can be generated by the model,
whereas the two measures do depend on the event log. Actually, gen_L2M_HB can only
approach value 1.
    The × entries in the table illustrate that the properties are not as straightforward as
they may seem at first glance.


8 Discussion
The goal of this paper is to trigger discussion and to address common misconceptions.
When it comes to relating process models and event logs, different viewpoints are pos-
sible, and authors often tend to impose their own assumptions on others without fully
understanding the different settings. Therefore, we put the definitions and propositions
presented in this paper in the context of related work. As mentioned in the introduction,
a broad range of conformance notions has been proposed in literature [1, 3, 4, 7–10, 14,
16, 18, 21, 22]. Moreover, different evaluation frameworks have been proposed [11, 19,
21].

[Figure 4: (a) the whole data set divided into 10 subsets of 10% each; (b) a training set
consisting of 9 of the subsets; (c) a test set consisting of the remaining subset.]

Fig. 4. The original data set is partitioned into 10 equal size subsets for 10-fold cross-validation.
This allows for 10 experiments where 9 subsets are used for training and one subset is used for
testing.

    The four quality dimensions described in [1] (recall/fitness, precision, generaliza-
tion, simplicity) are generally accepted. In [6] it was shown that removing one of these
dimensions may lead to degenerate models that are obviously undesirable but scoring
well on the three remaining dimensions.

8.1   Evaluating Models Versus Evaluating Algorithms
In this paper, we considered the setting where two artifacts (e.g., a process model and an
event log) are given and need to be compared. We are not evaluating or comparing dif-
ferent discovery algorithms. Hence, one cannot use cross-validation. Cross-validation
is a technique to evaluate algorithms by partitioning the original data set into a training
set to discover a model, and a test set to evaluate it. In k-fold cross-validation, the orig-
inal data set is randomly partitioned into k equal size subsets (cf. Figure 4). Of the k
subsets, a single subset is retained as the validation data for testing the model, and the
remaining k − 1 sets are used as training data. The cross-validation process is then re-
peated k times (the so-called folds), with each of the k subsets used exactly once as the
validation data. The k results from the folds can then be combined to produce a single
estimation. The whole data set is used for both training and validation and each data
element is used for validation exactly once. Moreover, by comparing the performance
on the different folds one gets insights into the robustness of the approach.
     Cross-validation approaches can be used to evaluate process discovery algorithms.
One can distribute the cases in the original event log over 10 smaller event logs. The
discovery algorithm can be applied to the 10 folds and tested on the corresponding test
logs. As usual, there is the problem that event logs do not provide negative examples.
Therefore, recall is easier to measure than precision. Moreover, one can only use (k-
fold) cross-validation to evaluate an algorithm and not a process model. Therefore, it is
unrelated to most of the things discussed in this paper, e.g., L2M comparison and the
21 propositions.
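
    A sketch of such a k-fold split of an event log is shown below; disc and rec stand
for any discovery algorithm and recall measure, and all names are illustrative.

    import random

    def k_fold_logs(cases, k=10, seed=42):
        # split a log (a list of traces, one per case) into k folds and yield
        # (training_log, test_log) pairs
        cases = list(cases)
        random.Random(seed).shuffle(cases)
        folds = [cases[i::k] for i in range(k)]
        for i in range(k):
            training = [t for j, fold in enumerate(folds) if j != i for t in fold]
            yield training, folds[i]

    # for training_log, test_log in k_fold_logs(cases):
    #     model = disc(training_log)    # apply the discovery algorithm to 9 folds
    #     score = rec(test_log, model)  # recall is easier to measure than precision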

8.2   Log Quality Versus Process Variability
In this paper, we assumed that all events are recorded correctly. In [1] a classification is
given of different quality issues. Different entities (cases, activity instances, or events)
can be missing in the log, missing in reality (recording of things that did not happen),
or concealed in the log (recorded but not easy to locate and select).

[Figure 5: Venn diagram of the behaviors S (system), L (event log), and M (process model)
as three overlapping sets.]

Fig. 5. Venn diagram showing the relationships between system S, log L, and model M.

Event attributes may be
missing, incorrect, or imprecise. For example, the timestamp of an event may be miss-
ing, an event is related to another case, or the value of a timestamp is too coarse-grained
(just a date). A particular data quality problem (e.g., a missing attribute) may persist,
repeat in an unpredictable manner, or return periodically. In total 108 data quality prob-
lems were identified in [1].
     Process variability is orthogonal to log quality. A process that exhibits many differ-
ent behaviors, some of which are frequent and others infrequent, is difficult to analyze
even when the data quality is high. See [13] for a detailed discussion of these two di-
mensions referred to as individual trustworthiness (quality of event data from a data
perspective) and global conclusiveness (quality of event data from an analysis perspec-
tive). In [17] a generalized conformance checking framework is proposed that caters for
the common case, when one does neither fully trust the log nor the model. Although
data quality problems are important, the above papers are mostly unrelated to the propo-
sitions discussed in this paper.


8.3     System Model Versus Event Log

Scenario 1 shown in Figure 1 starts from a model that is used to generate an event
log (e.g. using simulation) which is subsequently used as input for a process discovery
technique that produces another model. One can compare the discovered model with
the original model using an M2M (Model-to-Model) assessment. The idea is that there
is a known “ground truth”, i.e., the original process model. In [6] a similar approach
is used by introducing the so-called system that represents the real process generating
the behavior. Assume that S, L, and M are the behaviors allowed by the system (S),
found in the event log (L), and allowed by the model (M ) respectively. To understand
the above let us first assume that S, L, and M are sets of traces. Hence, intersection and
union are well-defined (e.g., S ∩ L ∩ M and S ∪ L ∪ M are sets of traces). This allows
us to use a Venn diagram to show the relationships between system S, log L, and model
M (see Figure 5).
Given the Venn diagram showing the relationships between system S, log L, and
model M, we can identify the following seven types of behavior (numbers refer to
Figure 6):

 – (1) Modeled and observed system behavior (S ∩ L ∩ M ).
 – (2) Not modeled but observed non-system behavior (L \ (M ∪ S)).
 – (3) Modeled and observed non-system behavior ((L ∩ M ) \ S).
 – (4) Modeled but unobserved non-system behavior (M \ (S ∪ L)).
 – (5) Modeled but unobserved system behavior ((M ∩ S) \ L).
 – (6) Not modeled and unobserved system behavior (S \ (M ∪ L)).
 – (7) Not modeled but observed system behavior ((S ∩ L) \ M ).

[Figure 6: Venn diagram of S, L, and M with the seven numbered regions.]

Fig. 6. Given a system S, log L, and model M we can identify seven types of behavior (ignoring
non-system behavior that is not modeled and not observed) [6].

Clearly, one would like S and M to coincide. System-model recall can be defined
as |S ∩ M | / |S| and can be maximized by minimizing (6) and (7). Precision can be
defined as |S ∩ M | / |M | and can be maximized by minimizing (3) and (4). One can
also define log-model recall/precision and system-log recall/precision.
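
    Under the simplifying assumption that S, L, and M are given as finite sets of traces
(in reality S is unknown outside experimental settings), these notions reduce to direct
set computations, as the following sketch with made-up example sets shows.

    S = {("a","b","c"), ("b","a","c"), ("a","b","d")}  # hypothetical system behavior
    L = {("a","b","c"), ("c","b","a")}                 # observed behavior
    M = {("a","b","c"), ("b","a","c"), ("b","a","d")}  # modeled behavior

    region1 = S & L & M                  # (1) modeled and observed system behavior
    region7 = (S & L) - M                # (7) not modeled but observed system behavior

    sm_recall = len(S & M) / len(S)      # maximized by minimizing regions (6) and (7)
    sm_precision = len(S & M) / len(M)   # maximized by minimizing regions (3) and (4)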
    Although Figure 6 is useful and consistent with the earlier definitions, it is also
too simplistic and partly misleading. First of all, in reality, we do not know the system
S. It is only possible to control and know S in an experimental setting (e.g. using a
simulation model). Second, the notion of S suggests that behavior can be classified into
system behavior and non-system behavior (i.e., a binary classification). This ignores the
likelihood of behavior and the purpose of the model (e.g., striving for the 80/20 model).
“Murphy’s Law of Process Mining” suggests that in principle anything can happen if
one is prepared to wait long enough, but it is uninteresting to return a model that allows
for any behavior. This paper clearly shows the need to include probabilities. Third,
system S may be changing over time. If we partition the timeline into segments i ∈
{1, 2, 3, . . .}, then we will be confronted with different versions of the system Si and
new event logs Li. How to relate model Mi to Si and Li while also using earlier
observations (e.g., j