<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>History-based Construction of Log-Process Alignments for Conformance Checking: Discovering What Really Went Wrong?</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahdi Alizadeh</string-name>
          <email>m.alizadeh@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Massimiliano de Leoni</string-name>
          <email>m.d.leoni@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicola Zannone</string-name>
          <email>n.zannone@tue.nl</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Mathematics and Computer Science, Eindhoven University of Technology P.</institution>
          <addr-line>O. Box 513, 5600 MB Eindhoven</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Alignments provide a robust approach for conformance checking which has been largely applied in various contexts such as auditing and performance analysis. Alignment-based conformance checking techniques pinpoint the deviations causing nonconformity based on a cost function. However, such a cost function is often manually de ned on the basis of human judgment and thus error-prone, leading to alignments that do not provide the most probable explanations of nonconformity. This paper proposes an approach to automatically de ne the cost function based on information extracted from the past process executions. The cost function only relies on objective factors and thus enables the construction of the most probable alignments, i.e. alignments that provide the most probable explanations of nonconformity. Our approach has been implemented in ProM and assessed using both synthetic and real-life data.</p>
      </abstract>
      <kwd-group>
        <kwd>Conformance checking</kwd>
        <kwd>alignments</kwd>
        <kwd>cost functions</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>
        Modern organizations are centered around the processes needed to deliver
products and services in an e cient and e ective manner. Organizations that
operate at a higher process maturity level use formal/semiformal models (e.g., UML,
EPC, BPMN and YAWL models) to document their processes. In some case these
models are used to con gure process-aware information systems (e.g., WFM or
BPM systems). However, in most organizations process models are not used to
enforce a particular way of working. Instead, process models are used for
discussion, performance analysis (e.g., simulation), certi cation, process improvement,
etc. However, reality may deviate from such models. People tend to focus on
idealized process models that have little to do with reality. This illustrates the
importance of conformance checking [
        <xref ref-type="bibr" rid="ref1 ref2 ref9">1,2,9</xref>
        ].
? This work has been funded by the NWO CyberSecurity programme under the PriCE
project and the Dutch national program COMMIT under the THeCS project.
      </p>
      <p>
        Conformance checking aims to verify whether the observed behavior recorded
in an event log matches the intended behavior represented as a process model.
The notion of alignments [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] provides a robust approach to conformance
checking, which makes it possible to pinpoint the deviations causing nonconformity.
An alignment between a recorded process execution and a process model is a
pairwise matching between activities recorded in the log and activities allowed
by the model. Sometimes, activities as recorded in the event log (events) cannot
be matched to any of the activities allowed by the model (process activities).
For instance, an activity is executed when not allowed. In this case, we match
the event with a special null activity (hereafter, denoted as ), thus resulting
in so-called moves on log. Other times, an activity should have been executed
but is not observed in the event log. This results in a process activity that is
matched to a event, thus resulting in a so-called move on model.
      </p>
      <p>Alignments are powerful artifacts to detect nonconformity between the
observed behavior as recorded in the event log and the prescribed behavior as
represented by process models. In fact, when an alignment between a log trace
and process model contains at least one move on log or model, it means that
such a log trace does not conform the model. As a matter of fact, the moves
on log/model indicate where the execution is not conforming by pinpointing the
deviations that have caused this nonconformity.</p>
      <p>
        In general, a large number of possible alignments exist between a process
model and a log trace, since there may exist manifold explanations why a trace
is not conforming. It is clear that one is interested in nding the most probable
explanation. Adriansyah et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] have proposed an approach based on the
principle of the Occam's razor: the simplest and most parsimonious explanation is
preferable. Therefore, one should not aim to nd any alignment but, precisely,
one of the alignments with the least expensive deviations (one of the so-called
optimal alignments), according to some function assigning costs to deviations.
      </p>
      <p>
        Existing alignment-based conformance checking techniques (e.g. [
        <xref ref-type="bibr" rid="ref2 ref4">2,4</xref>
        ]) require
process analysts to manually de ne a cost function based on their background
knowledge and beliefs. The de nition of such a cost function is fully based on
human judgment and, thus, prone to imperfections. These imperfections ultimately
lead to alignments that are optimal, according to the provided cost function, but
that do not provide the most probable explanation of nonconformity.
      </p>
      <p>In this paper, we propose an alternative way to de ne a cost function, where
the human judgment is put aside and only objective factors are considered. The
cost function is automatically constructed by looking at the logging data and,
more speci cally, at the past process executions that are compliant with the
process model. The intuition behind is that one should look at the past history
of process executions and learn from it what is the most probable explanations of
nonconformity. We believe that the most probable explanation of nonconformity
of a certain process execution can be obtained by analyzing the behavior observed
for such a process execution in each and every state and the behavior observed
for other con rming traces when they were in the same state. Our approach
gives a potentially di erent cost for each move on model and log (depending on
the current state), leading to the de nition of a more sensitive cost function.</p>
      <p>The approach has been fully implemented as a software plug-in for the
opensource process-mining framework ProM. To assess the practical relevance of our
approach, we performed an evaluation using both synthetic and real event logs
and process models. In particular, we tested it on a real-life case study about the
management of road-tra c nes by an Italian town. The results show that our
approach signi cantly improves the accuracy in determining the most probable
explanation for nonconformity compared to existing techniques.</p>
      <p>The paper is organized as follows. Section 2 introduces preliminary concepts.
Section 3 presents our approach for constructing optimal alignments. Section 4
presents experiment results, which are discussed in Section 5. Finally, Section 6
discusses related work and concludes the paper with directions for future work.
2</p>
    </sec>
    <sec id="sec-2">
      <title>Preliminaries</title>
      <p>This section introduces the notation and preliminaries for our work.
2.1</p>
      <p>Labeled Petri Nets, Event Logs, and Alignments
Process models describe how processes should be carried out. Many languages
exist to model processes. Here, we use a very simple formalism, which however
allow one to de ne all the aspects to take into account for this paper:
De nition 1 (Labeled Petri Net). A Labeled Petri net is a tuple (P; T; F; A; `; mi; mf )
where
{ P is a set of places;
{ T is a set of transitions;
{ F (P T ) [ (T P ) is the ow relation between places and transitions
(and between transitions and places);
{ A is the set of labels for transitions;
{ ` : T ! A is a function that associates a label with every transition in T ;
{ mi is the initial marking;
{ mf is the nal marking.</p>
      <p>The label of a transition identi es the activity represented by such a transition.
Multiple transitions can be associated with the same activity label; this means
that the same activity is represented by multiple transitions. This is typically
done to make the model simpler. Some transitions can be invisible. Invisible
transitions do not correspond to actual activities but are necessary for routing
purposes and, as such, their execution is never recorded in event logs. Given
a Labeled Petri net N , InvN A indicates the set of labels associated with
invisible transitions. As a matter of fact, invisible transitions are also associated
with labels, though these labels do not represent activities. We assume that a
label associated with a visible transition cannot be also associated with invisible
ones and vice versa.</p>
      <p>In the remainder, the simpler term Petri net is used to refer to Labeled Petri
nets. The state of a Petri net is represented by a marking, i.e. a multiset of tokens
on the places of the net. A Petri net has an initial marking mi and a nal marking
mf . When a transition is executed (i.e., red ), a token is taken from each of its
input places and a token is added to each of its output places. A sequence of
transitions M leading from the initial to the nal marking is a complete process
trace. Given a Petri net N = (P; T; F; A; `; mi; mf ), N indicates the set of all
complete process traces.</p>
      <p>
        Example 1. Fig. 1 shows a process model for handling road tra c nes in Italy
[
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. A process execution starts with recording a tra c ne in the system and
sending it to Italian residents. Tra c nes might be paid before or after they are
sent out by police or received by the o enders. If the ne is not paid in 180 days,
a penalty is added. In addition, o enders may appeal against nes to prefecture
and/or judge. If an appeal is accepted, the ne management is closed. Finally, if
the ne is not paid by the o ender, eventually the process terminates by handing
over the case for credit collection.
      </p>
      <p>Given a Petri net N = (P; T; F; A; `; mi; mf ), a log trace L 2 A is a
sequence of events where each event records the ring of a transition. In
particular, each event records the label of the transition that has red. An event log
L 2 B(A) is a multiset of log traces, where B(X) is used to represent the set of
all multisets over X. Here we assume that no events exist for activities not in
A; in practice, this can happen: in such cases, such events are ltered out before
the event log is taken into consideration.</p>
      <p>Not all log traces can be reproduced by a Petri net, i.e. not all log traces
perfectly t the process description. If a log trace perfectly ts the net, each
c s n t
1 = c s n t l</p>
      <p>o
r o i3</p>
      <p>c s n t o
2 = c s n t
l i6</p>
      <p>c s n t
3 = c s n
o
d
\move" in the log trace, i.e. an event observed in the trace, can be mimicked by
a \move" in the model, i.e. a transition red in the net. After all events in the log
trace are mimicked, the net reaches its nal marking. In cases where deviations
occur, some moves in the log trace cannot be mimicked by the net or vice versa.
We explicitly denote \no move" by .</p>
      <p>De nition 2 (Legal move). Let N = (P; T; F; A; `; mi; mf ) be a Petri net. Let
SL = (A n InvN ) [ f g and SM = A [ f g. A legal move is a pair (mL; mM ) 2
(SL SM ) n ( ; ) such that
{ (mL; mM ) is a synchronous move if mL 2 SL, mM 2 SM and mL = mM ,
{ (mL; mM ) is a move on log if mL 2 SL and mM = ,
{ (mL; mM ) is a move on model if mL = and mM 2 SM .</p>
      <p>N denotes the set of legal moves for a Petri net N .</p>
      <p>In the remainder, we indicate that a sequence 0 is a pre x of a sequence
00, denoted with 0 2 pre x( 00), if there exists a sequence 000 such that 00 =
0 000, where denotes the concatenation operator.</p>
      <p>De nition 3 (Alignment). Let N be the set of legal moves. An alignment of
a log trace L and a Petri net N = (P; T; F; A; `; mi; mf ) is a sequence 2 N
such that, ignoring all occurrences of , the projection on the rst element yields</p>
      <p>L and the projection on the second element yields a sequence ha1; : : : ; ani such
that there exists a sequence P0 = ht1; : : : ; tni 2 pre x( P ) for some P 2 N
where, for each 1 i n, `(ti) = ai. If P0 2 N , is called a complete
alignment of L and N .</p>
      <p>Fig. 2 shows three possible complete alignments of a log trace 1 = hc; s; n; t; oi
and the net in Fig. 1. The top row of an alignment shows the sequence of events
in the log trace, and the bottom row shows the sequence of activities in the net
(both ignoring ). Hereafter, we denote jL the projection of an alignment over
the log trace and jP the projection over the net.</p>
      <p>As shown in Fig. 2, there can be multiple possible alignments for a given log
trace and process model. The quality of an alignment is measured based on a
provided cost function K : N ! R0+, which assigns a cost to each alignment
2 N . Typically, the cost of an alignment is de ned as the sum of the costs of
the individual moves in the alignment. An optimal alignment of a log trace
and a process trace is one of the alignments with the lowest cost according to
the provided cost function.</p>
      <p>As an example, consider a cost function that assigns to any alignment a cost
equal to the number of moves on log and model for visible transitions. If moves
on model for invisible transitions ik are ignored, 1 has two moves on model, 2
has one move on model and one move on log, and 3 has one move on model
and two moves on log. Thus, according to the cost function, 1 and 2 are two
optimal alignments of 1 and the process model in Fig. 1.
2.2</p>
      <p>
        State Representation
At any point in time, a sequence of execution of activities leads to some state, and
this state depends on which activities have been performed and in which order.
Accordingly, any process execution can be mapped onto a state. As discussed
in [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], a state representation function takes care of this mapping:
De nition 4 (State Representation). Let (P; T; F; A; `; mi; mf ) be a Petri
net. Let R be the set of possible state representations of sequences in A A state
representation function abst : A ! R produces a state representation abst( )
for each process trace 2 .
      </p>
      <p>Several state-representation functions can be de ned. Each function leads to a
di erent abstraction, meaning that multiple di erent traces can be mapped onto
the same state, thus abstracting out certain trace's characteristics. The following
are examples of state-representation functions:
Sequence abstraction. It is a trivial mapping where the abstraction preserves
the order of activities. Each trace is mapped onto a state that is the trace
itself, i.e. for each 2 A , abst( ) = .</p>
      <p>Multi-set abstraction. The abstraction preserves the number of times each
activity is executed. This means that, for each 2 A , abst( ) = M 2 B(A)
such that, for each a 2 A, M contains all instances of a in .</p>
      <p>Set abstraction. The abstraction preserves whether each activity has been
executed or not. This means that, for each 2 A , abst( ) = M A such
that, for each a 2 A, M contains a if it ever occurs in .</p>
      <p>Example 2. Table 1 shows the state representation of some process traces of the
net in Fig. 1 using di erent abstractions. For instance, trace hc; p; p; s; ni can be
represented as the trace itself using the sequence abstraction, as fc(1); p(2); s(1); n(1)g
using the multi-set abstraction (in parenthesis the number of occurrences of
activities in the trace), and as fc; p; s; ng using the set abstraction. Traces hc; p; s; ni
and hc; p; p; s; n; pi are also mapped to state fc; p; s; ng using the set abstraction.
3</p>
    </sec>
    <sec id="sec-3">
      <title>History-based Construction of the Most Probable</title>
    </sec>
    <sec id="sec-4">
      <title>Alignments</title>
      <p>
        This section presents our approach to construct alignments that give the most
probable explanations of deviations based on objective facts, i.e. the historical
logging data, rather than on subjective cost functions manually de ned by
process analysts. To construct an optimal alignment between a process model and
an event log, we use the A-star algorithm, analogously to what proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>Section 3.1 discusses how the cost of an alignment is computed, whereas
Section 3.2 brie y reports on the use of A-star to compute the most probable
alignment.
The computation of the most probable alignment relies on a cost function that
accounts for the probability of an activity to be executed in a certain state.
The de nition of such a cost function requires an analysis of the past history as
recorded in the event log to compute the probability of an activity to immediately
occur or to never eventually occur when the process execution is in a certain
state.</p>
      <p>
        The cost of moves depends on probabilities. For this purpose, we need to
introduce a functions' class F [0; 1] ! R+ such that f 2 F if and only if
f (0) = 1 and f is monotonously decreasing between 0 and 1 (with f (1) &gt; 0).
Hereafter, these functions are called cost pro le. It is easy to observe that if f (p)
is a cost-pro le function, then f (p)i is also a cost-pro le function for every i &gt; 0.
Examples of these functions are:
f (p) = p1
f (p) = p1p
f (p) = 1 + log 1
p
(1)
Similarly to what proposed in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], the cost of an alignment move depends on the
move type and the activity involved in the move but, di erently from [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], it also
depends on the position in which the move is inserted:
De nition 5 (Cost of an alignment move). Let N = (P; T; F; A; `; mi; mf )
be an Petri net. Let 2 N be a sequence of legal moves for N and f 2 F
a cost pro le. The cost of appending a legal move (mL; mM ) 2 N to with
state-representation function abst is:
abst((mL; mM ); ) =
8 0
&gt;
&gt;&lt; 0
&gt; f Pabst(mM occurs after jP )
&gt;: f Pabst(mL never eventually occurs after
mL = mM
mL =
mL =
jP ) mM =
and mM 2 InvN
and mM 62 InvN
(2)
Readers can observe that the cost of a move on log (mL; ) is not simply the
probability of not executing activity mL immediately after jP ; rather, it is
the probability of never having activity mM at the any moment in the future
for that execution. This is motivated by the fact that a move on log (mL; )
indicates that mL is not expected to ever occur in the future. Conversely, if it
was expected, a number of moves in model would be introduced until the process
model, modeled as a Petri net, reaches a marking that allows mL to occur (and,
thus, a move in both can be appended). Di erent cost pro les account for the
probabilities computed from historical logging data di erently. In Section 4, we
evaluate the cost pro les in Eq. 1 with di erent combinations of event logs and
process models. The purpose is to verify whether a cost pro le universally works
better than the others. The following two de nitions describe how to compute
the probabilities required by Def. 5. For reliability, we only consider the subset
of traces L t of the original event log L that comply with the process model.
De nition 6 (Probability that an activity occurs). Let L be an event log
and L t L be the subset of traces that comply with a given process model
represented by a Petri net N = (P; T; F; A; `; mi; mf ). The probability that an
activity a 2 A occurs after executing with state-representation function abst is
the ratio between number of traces in L t in which a is executed after reaching
state abst( ) and the total number of traces in L t that reach state abst( ):
De nition 7 (Probability that an activity never eventually occurs).
Let L be an event log and L t L be the subset of traces that comply with
a given process model represented by a Petri net N = (P; T; F; A; `; mi; mf ).
The probability that an activity a 2 A will never eventually occur in a process
execution after executing 2 A with state-representation function abst is the
ratio between the number of traces in L t in which a is never eventually executed
after reaching state abst( ) and the total number of trace in L t that reach state
abst( ):
      </p>
      <p>Pabst(a never eventually occurs after ) =
(3)
(4)
jf 02L t : 9 002pre x( 0): abst( 00)=abst( )^8 000 00 000 ha0i2pre x( 0)^a06=agj
jf 02L t : 9 002pre x( 0): abst( 00)=abst( )gj</p>
      <p>The cost of an alignment is the sum of the cost of all moves in the alignment,
which are computed as described in De nition 5:
De nition 8 (Cost of an alignment). The cost of alignment 2 N with
state-representation function abst is computed as follows:</p>
      <p>Kabst(
(mL; mM )) =
abst((mL; mM ); hi)
abst((mL; mM ); ) + Kabst( )</p>
      <p>= hi
otherwise
(5)
Hereafter, the term most-probable alignment is used to denote any of the
optimal alignments (i.e., with the lowest cost) according to the cost function
given in De nition 8.</p>
      <p>
        The use of the A-star algorithm to construct alignments
The A-star algorithm [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ] aims to nd a path in a graph V from a given source
node v0 to any node v 2 V in a target set. Every node v of graph V is associated
with a cost determined by an evaluation function f (v) = g(v) + h(v), where
{ g : V ! R0+ is a function that returns the smallest path cost from v0 to v;
{ h : V ! R0+ is an heuristic function that estimates the path cost from v to
its preferred target node.
      </p>
      <p>Function h is said to be admissible if it returns a value that underestimates the
distance of a path from a node v0 to its preferred target node v00, i.e. g(v0) +
h(v0) g(v00). If h is admissible, A-star nds a path that is guaranteed to have
the overall lowest cost.</p>
      <p>The A-star algorithm keeps a priority queue of nodes to be visited: higher
priority is given to nodes with lower costs. The algorithm works iteratively: at
each step, the node v with lowest cost is taken from the priority queue. If v
belongs to the target set, the algorithm ends returning node v. Otherwise, v is
expanded: every successors v0 is added to priority queue with a cost f (v0).</p>
      <p>We employ A-star to nd any of the optimal alignments between a log trace
L 2 L and a Petri net N . In order to be able to apply A-star, an opportune
search space needs to be de ned. Every node of the search space V is associated
to a di erent alignment that is a pre x of some complete alignment of L and N .
Since a di erent alignment is also associated to every search-space node and vice
versa, we use the alignment to refer to the associated state. The source node is
an empty alignment 0 = hi and the set of target nodes includes every complete
alignment of L and N .</p>
      <p>Let us denote the length of a sequence with k k. Given a node/alignment
2 V , the search-space successors of include all alignments 0 2 V obtained
from by concatenating exactly one move. Given an alignment 2 V , the cost
of path from the initial node to node 2 V is:</p>
      <p>g( ) = k jL k + K( ):
where K( ) is the cost of alignment according to De nition 8. It is easy
to check that, given two complete alignments C0 and C00 , K( C0 ) &lt; K( C00 ) i
g( C0 ) &lt; g( C00 ) and K( C0 ) = K( C00 ) i g( C0 ) = g( C00 ). Therefore, an optimal
solution returned by A-star coincides with an optimal alignment. To de ne a
more e cient and admissible heuristics, we consider term k Lk in h; this term
does not a ect optimality. Given an alignment 2 V , we employ the heuristics:
h( ) = k Lk
k jL k:
For alignment , the number of steps to add in order to reach a complete
alignment is lower bounded by the number of execution steps of trace L that have
not been included yet in the alignment, i.e. k Lk k jL k. Since the additional
cost to traverse a single node is at least 1, the cost to reach a target node is at
least h( ), corresponding to the case where the part of the log trace that still
needs to be included in the alignment perfectly ts.</p>
      <p>Example 3. Consider a log trace 2 = hc; s; n; l; oi and the net N in Fig. 1. An
analyst wants to determine the most probable explanations for nonconformity
by constructing the most probable alignment of 2 and N , based on historical
logging data. In particular, L t consists of the traces in Table 1 (the rst column
shows the traces, and the second the number of occurrences of a trace in the
history). Assume that the algorithm has constructed an optimal alignment of
trace hc; s; ni 2 pre x( 2) and N (left part of Fig. 3). The next event in the
log trace (i.e., l) cannot be replayed in the net. Therefore, the algorithm should
determine which move is the most likely to have occurred. Di erent moves are
possible; for instance, a move on log for l, a move on model for p, a move on
model for t, etc. The algorithm computes the cost for these moves using Eq. 5
(right part of Fig. 3). As move on model ( ; p) is the move with the least cost
(and no other alignments have lower cost), alignment 0 = ( ; p) is selected
for the next iteration. It is worth noting that activity d never occurs after hc; s; ni
in L t ; consequently, the cost of move ( ; d) is equal to 1.
4</p>
    </sec>
    <sec id="sec-5">
      <title>Implementation and Experiments</title>
      <p>We have implemented our approach for history-based construction of alignments
as a plugin of the open-source ProM framework (http://www.promtools.org).
The plug-in takes as inputs a process model and two event logs. It computes
the most probable alignments for each trace in the rst event log based on the
frequency of the traces in the second event log (historical logging data).</p>
      <p>To assess the practical feasibility and accuracy of the approach, we performed
a number of experiments using both synthetic and real-life logs. In the
experiments with synthetic logs, we assumed that the execution of an activity depends
on the activities that were performed in the past. In the experiments with
reallife logs, we tested if this assumption holds in real applications. Accordingly, the
real-life logs were used as historical logging data. To evaluate the approach, we
arti cially added noise to the traces used for the experiments. This was necessary
to assess the ability of the approach to reconstruct the original traces.
4.1</p>
      <p>
        Synthetic Data
For the experiments with synthetic data, we used the process for handling credit
requests in [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. Based on this model, we generated 10000 traces consisting of
69504 events using the CPN Tools (http://cpntools.org). To assess the
accuracy of the approach, we manipulated 20% of these traces by introducing di
erent percentages of noise. In particular, given a trace, we added and removed a
number of activities to/from the trace equal to the same percentage of the trace
length. The other traces were used as historical logging data. We computed the
most probable alignments of the manipulated traces and process model, and
evaluated the ability of the approach to reconstruct the original traces. To this
end, we measured the percentage of correct alignments (i.e., the cases where a
projection of an alignment over the process coincides with the original trace)
and compute the overall Levenshtein distance [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] between the original traces
and the projection of the computed alignments over the process. This string
metric measures the distance between two sequences, i.e. the minimal number
of changes required to transform one sequence into the other. In our setting, it
provides an indication of how much the projection of the computed alignments
over the process is close to the original traces.
      </p>
      <p>
        We tested our approach with di erent amounts of noise (i.e., 10%, 20%, 30%
and 40% of the trace length), with di erent cost pro les (i.e., 1=p, 1=pp, and 1 +
log(1=p)), and with di erent state-representation functions (i.e., sequence,
multiset, and set ). Moreover, we compared our approach with existing
alignmentbased conformance checking techniques. In particular, we used the standard
cost function introduced in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We repeated each experiment ve times. Table 2
shows the results where every entry reports the average over the ve runs.
      </p>
      <p>
        The results show that cost pro les 1=pp and 1 + log(1=p) in combination with
sequence and multi-set abstractions are able to better identify what really
happened, i.e. they align the manipulated traces with the corresponding original
traces in more cases (CA). In all cases, cost pro le 1 + log(1=p) with sequence
state-representation function provides more accurate diagnostics (LD): even if
log traces are not aligned to the original traces, the projection over the process
of alignments constructed using this cost pro le and abstraction are closer to the
original traces. Compared to the cost function used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], our approach
computed the correct alignment for 4.4% more traces when cost pro le 1 + log(1=p)
and sequence state-representation function are used. In particular, our approach
correctly reconstructed the original trace for 18.4% of the traces that were not
correctly reconstructed using the cost function used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Moreover, an analysis
of LD shows that, on average, the traces reconstructed using our approach have
0.37 deviations, while the traces reconstructed using the cost function used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
have 0.45 deviation. This corresponds to an improvement of LD of about 15.2%.
4.2
      </p>
      <p>
        Real-life Logs
To evaluate the applicability of our approach to real-life scenarios, we used an
event log obtained from a ne management system of the Italian police [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]. The
process model in form of Petri net is presented in Fig. 1. We extracted a log
consisting of 142408 traces and 527549 events, where all traces are conforming
with the net. To these traces, we applied the same methodology used for the
experiments reported in Section 4.1. We repeated the experiments ve times.
Table 3 shows the results where every entry reports the average over ve runs.
      </p>
      <p>
        The results con rm that cost pro les 1=pp and 1+log(1=p) in combination with
sequence and multi-set state-representation functions provide the more accurate
diagnostics (both CA and LD). Moreover, the results show that our approach
(regardless of the used cost pro le and state-representation function) performs
better than the cost function in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] on real-life logs. In particular, using sequence
state-representation function and cost pro le 1 + log(1=p), our approaches
computed the correct alignment for 1.8% more traces than what the cost function in
[
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] did. In particular, our approach correctly reconstructed the original trace for
19.3% of the traces that were not correctly reconstructed using the cost function
used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Moreover, our approach improves LD by 21.1% compared to the
cost function used in [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Such an improvement shows that when the original
trace is not reconstructed correctly, our approach returns an explanation that is
signi cantly closer to the actual explanation.
5
      </p>
    </sec>
    <sec id="sec-6">
      <title>Discussion</title>
      <p>The A-star algorithm requires a cost function to penalize nonconformity. In our
experiments, we have considered a number of cost pro les to compute the cost
of moves on log/model based on the probability of a given activity to occur in
historical logging data. The selection of the cost pro le has a signi cant impact
on the results as they penalize deviations di erently. For instance, cost pro le
1=p penalizes less probable moves much more than 1 + log(1=p). To illustrate
99% = a1 / / a2 / / / / a50A</p>
      <p>{ AA
/ x / {U{UUU1U%UUUUU* b eeeeeeeeeeeeeeee2 / y /
(a) Process model
(b) Alignments
this, consider a trace = hx; yi and the process model in Fig. 4a. Two possible
alignments, namely 1 and 2, are conceivable (Fig. 4b). 1 contains a large
number of deviations compared to 2 (50 moves on log vs. 1 move on log). The
use of cost pro le 1=p yields 1 as the most probable alignment, while the use of
cost pro le 1 + log(1=p) yields 2 as the most probable alignment. Tables 2 and 3
show that cost pro le 1 + log(1=p) usually provides more accurate results. Cost
pro le 1=p penalizes less probable moves excessively, and thus tends to construct
alignments with more frequent traces in the historical logging data even if those
alignments contain a signi cantly larger number of deviations. Our experiments
suggest that the construction of the most probable alignments requires a
tradeo between the frequency of the traces in historical logging data and the number
of deviations in alignments, which is better captured by cost pro le 1 + log(1=p).</p>
      <p>Di erent state-representation functions can be used to characterize the state
of a process execution. In this work, we have considered three state-representation
functions: sequence, multi-set, and set. The experiments show that in general the
sequence abstraction produces more accurate results compared to the other
abstractions. The set abstraction provides the least accurate results, especially
when applied to the process for handling credit requests (Table 2). The main
reason is that this abstraction is not able to accurately characterize the state,
especially in presence of loops: after each loop iteration the process execution
yields the same state. Therefore, the cost function constructed using the set
abstraction is not able to account for the fact that the probability of executing
certain activities can increase after every loop iteration, thus leading to
alignments in which loops are not captured properly.</p>
      <p>To conclude, the experiments show that our technique tends to build
alignments that give better explanations of deviations. It is easy to see that, when
nonconformity is injected in tting traces and alignments are subsequently built,
the resulting alignments yield perfect explanations if the respective process
projections coincide with the respective tting traces before the injections of
nonconformity. Tables 2 and 3 have shown that, basing the construction of the cost
function on the analysis of historical logging data our technique tends to build
alignments whose process projection is closer to the original tting traces and,
hence, the explanations of deviations are closer to the correct ones.</p>
    </sec>
    <sec id="sec-7">
      <title>Related Work and Conclusions</title>
      <p>
        In process mining, a number of approaches have been proposed to check
conformance of process models and the actual behavior recorded in event logs. Some
approaches [
        <xref ref-type="bibr" rid="ref13 ref15 ref16 ref7 ref8">7,8,13,15,16</xref>
        ] check conformance by verifying whether traces satis es
rules encoding properties expected from the process. Petkovic et al. [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] verify
whether a log trace is a valid trace of the transition system generated by the
process model. Token-based approaches [
        <xref ref-type="bibr" rid="ref18 ref6">6,18</xref>
        ] use the number of missing and added
tokens obtained by replaying traces over the process model to measure the
conformance between the log and the process. However, these approaches only give
a boolean answers diagnosing whether traces conform to a process model or not.
When they are able to provide diagnostic information, such information is often
imprecise. For instance, token-based approaches may allow behavior that is not
allowed by the model due to the used heuristics and thus may provide incorrect
diagnostic information.
      </p>
      <p>
        Recently, the construction of alignments has been proposed as a robust
approach for checking the conformance of event logs with a given process model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
Alignments have proven to be very powerful artifacts to perform conformance
checking. By constructing alignments, analysts can be provided with richer and
more accurate diagnostic information. In fact, alignments are also used as the
main enablers for a number of techniques for process analytics, auditing, and
process improvement, such as for performance analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], privacy compliance
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and process-model repairing [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>To our knowledge, the main problem of existing techniques for constructing
optimal alignments is related to the fact that process analysts need to provide
a function which associates a cost to every possible deviation. These cost
functions are only based on human judgment and, hence, prone to imperfections.
If these techniques are fed with imprecise cost functions, they create imperfect
alignments, which ultimately leads to unlikely or, even, incorrect diagnostics.</p>
      <p>In this paper, we have proposed a di erent approach where the cost function
is automatically computed based on real facts: historical logging data recorded in
event logs. In particular, the cost function is computed based on the probability
of activities to be executed or not in a certain state (representing which activities
have been executed and their order). Experiments have shown that, indeed, our
approach can provide more probable explanations of nonconformity of process
executions, if compared with existing techniques.</p>
      <p>We acknowledge that the evaluation is far from being completed. We aim
to perform more extensive experiments to verify whether certain cost-pro le
functions provide more probable alignments than others or, at least, to give
some guidelines to determine in which settings a given cost-pro le function is
preferable. We also aim to develop a technique that, given a model and log, allow
for (semi-)automatic tuning of the cost-pro le function and state abstraction.</p>
      <p>In this paper, we only considered the control- ow, i.e. the name of the
activities and their ordering, to construct the cost function and, hence, to compute
the most probable alignment. However, the choice in a process execution is
often driven by other aspects. For instance, when instances are running late, the
execution of certain fast activities are more probable; or, if a certain process
attribute takes on a given value, certain activities are more likely to be executed.
We expect that our approach can be signi cantly improved if the other business
process perspectives (i.e., data, time and resources) are taken into account.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          : Process Mining - Discovery, Conformance and Enhancement of Business Processes. Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adriansyah</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          :
          <article-title>Replaying history on process models for conformance checking and performance analysis</article-title>
          .
          <source>Data Min. Knowl. Discov</source>
          .
          <volume>2</volume>
          (
          <issue>2</issue>
          ),
          <volume>182</volume>
          {
          <fpage>192</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schonenberg</surname>
            ,
            <given-names>M.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Song</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Time prediction based on process mining</article-title>
          .
          <source>Information Systems</source>
          <volume>36</volume>
          (
          <issue>2</issue>
          ),
          <volume>450</volume>
          {
          <fpage>475</fpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Adriansyah</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Memory-e cient alignment of observed and modeled behavior</article-title>
          .
          <source>BPM Center Report 03-03</source>
          , BPMcenter.org (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Adriansyah</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van Dongen</surname>
            ,
            <given-names>B.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zannone</surname>
          </string-name>
          , N.:
          <article-title>Privacy analysis of user behavior using alignments</article-title>
          .
          <source>it{Information Technology</source>
          <volume>55</volume>
          (
          <issue>6</issue>
          ),
          <volume>255</volume>
          {
          <fpage>260</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Banescu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Petkovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zannone</surname>
          </string-name>
          , N.:
          <article-title>Measuring privacy compliance using tness metrics</article-title>
          .
          <source>In: Business Process Management</source>
          . pp.
          <volume>114</volume>
          {
          <fpage>119</fpage>
          . LNCS 7481, Springer (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Borrego</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Barba</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          :
          <article-title>Conformance checking and diagnosis for declarative business process models in data-aware scenarios</article-title>
          .
          <source>Expert Syst. Appl</source>
          .
          <volume>41</volume>
          (
          <issue>11</issue>
          ),
          <volume>5340</volume>
          {
          <fpage>5352</fpage>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Caron</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanthienen</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Baesens</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          :
          <article-title>Comprehensive rule-based compliance checking and risk management with process mining</article-title>
          .
          <source>Decision Support Systems</source>
          <volume>54</volume>
          (
          <issue>3</issue>
          ),
          <volume>1357</volume>
          {
          <fpage>1369</fpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Cook</surname>
            ,
            <given-names>J.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wolf</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          :
          <article-title>Software process validation: quantitatively measuring the correspondence of a process to a model</article-title>
          .
          <source>TOSEM</source>
          <volume>8</volume>
          (
          <issue>2</issue>
          ),
          <volume>147</volume>
          {
          <fpage>176</fpage>
          (
          <year>1999</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Dechter</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pearl</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Generalized best- rst search strategies and the optimality of A*</article-title>
          .
          <source>Journal of the ACM</source>
          <volume>32</volume>
          ,
          <issue>505</issue>
          {
          <fpage>536</fpage>
          (
          <year>1985</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Fahland</surname>
            , D., van der Aalst,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Model repair - aligning process models to reality</article-title>
          .
          <source>Information Systems</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Levenshtein</surname>
          </string-name>
          , V.:
          <article-title>Binary codes capable of correcting deletions, insertions, and reversals</article-title>
          .
          <source>Soviet Physics Doklady</source>
          <volume>10</volume>
          (
          <issue>8</issue>
          ),
          <volume>707</volume>
          {
          <fpage>710</fpage>
          (
          <year>1966</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Ly</surname>
            ,
            <given-names>L.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rinderle-Ma</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          , Goser,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Dadam</surname>
          </string-name>
          ,
          <string-name>
            <surname>P.</surname>
          </string-name>
          :
          <article-title>On enabling integrated process compliance with semantic constraints in process management systems - requirements, challenges, solutions</article-title>
          .
          <source>Information Systems Frontiers</source>
          <volume>14</volume>
          (
          <issue>2</issue>
          ),
          <volume>195</volume>
          {
          <fpage>219</fpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mannhardt</surname>
          </string-name>
          , F.,
          <string-name>
            <surname>de Leoni</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Balanced multi-perspective checking of process conformance</article-title>
          .
          <source>BPM Center Report 14-08</source>
          , BPMcenter.org (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>de Medeiros</surname>
            , A.K.A.,
            <given-names>van der Aalst. Wil M. P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pedrinaci</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Semantic Process Mining Tools: Core Building Blocks</article-title>
          .
          <source>In: Proc. ECIS</source>
          . pp.
          <year>1953</year>
          {
          <year>1964</year>
          . AIS (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Montali</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Speci cation and Veri cation of Declarative Open Interaction Models - A Logic-Based Approach</article-title>
          .
          <source>LNBIP 56</source>
          , Springer (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Petkovic</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prandi</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zannone</surname>
          </string-name>
          , N.:
          <article-title>Purpose control: Did you process the data for the intended purpose? In: Secure Data Management</article-title>
          . pp.
          <volume>145</volume>
          {
          <fpage>168</fpage>
          . LNCS 6933, Springer (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Rozinat</surname>
          </string-name>
          , A.,
          <string-name>
            <surname>van der Aalst</surname>
            ,
            <given-names>W.M.P.</given-names>
          </string-name>
          :
          <article-title>Conformance checking of processes based on monitoring real behavior</article-title>
          .
          <source>Information Systems</source>
          <volume>33</volume>
          (
          <issue>1</issue>
          ),
          <volume>64</volume>
          {
          <fpage>95</fpage>
          (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>