Enhancing Process Mining Results using
               Domain Knowledge

P.M. Dixit1,2 , J.C.A.M. Buijs2 , W.M.P. van der Aalst2 , B.F.A. Hompes1,2 , and
                                   J. Buurman1
                 1
                   Philips Research, Eindhoven, The Netherlands
               2
                 Department of Mathematics and Computer Science
         Eindhoven University of Technology, Eindhoven, The Netherlands
                 {prabhakar.dixit,hans.buurman}@philips.com
            {j.c.a.m.buijs,w.m.p.v.d.aalst,b.f.a.hompes}@tue.nl


      Abstract. Process discovery algorithms typically aim at discovering
      process models from event logs. Most discovery algorithms discover the
      model based on an event log, without allowing the domain expert to
      influence the discovery approach in any way. However, the user may
      have certain domain expertise which should be exploited to create a bet-
      ter process model. In this paper, we address this issue of incorporating
      domain knowledge to improve the discovered process model. We firstly
      present a modification algorithm to modify a discovered process model.
      Furthermore, we present a verification algorithm to verify the presence
      of user specified constraints in the model. The outcome of our approach
      is a Pareto front of process models based on the constraints specified by
      the domain expert and common quality dimensions of process mining.

      Keywords: user guided process discovery, declare templates, domain
      knowledge, algorithm post processing


1   Introduction

Process mining aims to bridge the gap between big data analytics and tradi-
tional business process management. This field can primarily be categorized into
(1)process discovery, (2)conformance checking and (3)enhancement [1]. Process
discovery techniques focus on using the event data in order to discover process
models. Conformance checking techniques focus on aligning the event data on
a process model to verify how well the model fits the data and vice versa [3].
Whereas enhancement techniques use event data and process models to repair
or enrich the process model.
    Most of the current process discovery approaches focus on discovering the
process model entirely based on the event log. Enhancement techniques use end-
to-end process models along with the event logs to repair and enrich the process
models with information eg., to highlight bottlenecks or to annotate decisions
with guards. In domains such as the healthcare sector, the underlying processes


                                      79
are complex and case specific, hence the domain expert might only be aware of
some conditions or constraints which should always hold in the process. However
the domain expert might not be aware of the complete end-to-end process which
is required as an input by the alignment based repair techniques. Nevertheless,
the domain expert might be aware of certain sub-processes or protocols constitut-
ing to the end-to-end process. However, traditional process discovery techniques
do not provide a way to incorporate the domain knowledge in order to discover
and repair a more accurate process model, based on both domain knowledge
and event logs. In this paper, we address the challenge of incorporating domain
knowledge in traditional process model discovery to overcome challenges such as
infrequent and/or incomplete data.
    Domain knowledge can be
introduced in the discovery
process at multiple stages
as shown in Figure 1. In
our approach, we post-process
an already discovered pro-
cess model to incorporate the
user specified domain knowl-
edge. Process models can be
represented by multiple mod-
eling notations, for example
BPMN, Petri nets, process
trees etc. State of the art dis-
covery algorithms such as the
Inductive Miner [8] and the
Evolutionary Tree Miner [4]
discover block-structured pro- Fig. 1: Places where domain knowledge can be
cess models represented by incorporated in the process discovery approach.
the notion of process trees. We Here we focus on 3.
use process trees in our ap-
proach as they are hierarchically structured and sound by construction. The
hierarchical nature of process trees allows for a structured way to incorporate
and validate the domain knowledge from the user. Our approach is generic and
scalable as it is independent of the inherent discovery algorithm, and can be
applied to any discovery approach which produces process trees.
   In order to provide a very handy and effective way to gather user input, we
make use of Declare templates [13]. Declare templates belong to the class of
declarative languages, which are used to construct constraint based declarative
process models. We do not use any aspect of declarative modeling in our ap-
proach. Only abstract Declare templates are used as a way for user to specify
domain knowledge effectively in terms of constraints.
   Figure 2 gives a high level overview of our approach. The ultimate goal is to
provide a generic way to post-process the process tree such that the user is pre-
sented with a balanced set of optimal variants of process trees. Post-processing


                                     80
Fig. 2: The general overview combining traditional process discovery with domain
knowledge specified using constraints. The original process tree is modified to
generate probable candidate process trees. A competitive set of process trees is
selected based on the verification of users constraints and quality dimensions.

the process tree with domain knowledge would enable us to overcome incom-
pleteness and exceptional behaviour which could be wrongly represented in the
originally discovered process tree. In order to achieve this goal, we first introduce
a modification approach to generate a list of candidate process trees. We then
introduce a novel verification algorithm to check whether a set of constraints
is satisfied by a process tree. The candidate process trees are used in combina-
tion with the verification algorithm, number of edit operations and four quality
dimensions to select a set of competitive process trees. The remainder of the
paper is structured as follows. In Section 2 and Section 3, we provide a litera-
ture review of related work and the preliminaries respectively. In Section 4 and
Section 5 we explain the modification and verification algorithms. In Section 6
we evaluate our approach based on synthetic and real life event logs. In Section 7
we conclude and discuss future research.


2   Related Work

Although the field of process discovery has matured in recent years, the aspect
of applying user knowledge for discovering better process models is still in its
nascent stages. Conformance techniques in process mining such as [2, 3, 5] re-
play event logs on the process model to check compliance, detect deviations and


                                      81
bottlenecks in the model. These techniques focus on verifying the conformance
of event logs with process model, but do not provide any way of incorporating
domain knowledge to repair/improve the process model. In [11], the authors pro-
vide a way to mine declarative rules and models based on event logs, but do not
allow users to introduce domain knowledge in rule discovery. The conformance
based repair technique suggested by [6] takes a process model and an event log
as input, and outputs a repaired process model based on the event log. However,
the input required for this approach is an end-to-end process model and a noise
free event log. Our approach requires only parts of process models or constraints
described using declarative templates. Genetic algorithms [12] in process mining
provide a possibility of using hand-made models as initial population. However
this information may fade over time and may eventually get lost.
    In [14], authors suggest an approach to discover a control flow model based
on event logs and prior knowledge specified in terms of augmented Information
Control Nets (ICN). Our approach mainly differs in the aspect of gathering
domain knowledge. Although declarative templates can also be used to construct
a network of related activities (similar to ICN), it can also be used to provide
a set of independent pairwise constraints or unary constraints. The authors of
[7] incorporate both positive and negative constraints during process discovery
to discover C-net models. Compared to this, our approach differs mainly in
two aspects. Firstly, we do not propose a new process discovery algorithm, but
provide a generic approach to post process an already discovered process tree.
Secondly, our approach provides the user with a balanced set of process models
which maximally satisfy user constraints and score high on quality dimensions.


3     Preliminaries

As mentioned in Section 1, we primarily use Declare templates as a means to
incorporate the domain knowledge and process trees to represent the process
models. This section provides a background and a brief description about process
trees and Declare templates.


3.1   Declare Templates

A declarative model is defined by using constraints specified by a set of templates
[13]. We use a subset of Declare templates as a way to input domain knowledge.
     Table 1 provides an overview and interpretation of the Declare constraints
that we consider [11, 13]. Binary templates provide ways to specify dependency
(positive and negative) between two activities. For example, response(A, B ) spec-
ifies that activity A has to be eventually followed by activity B somewhere in the
process. We use six binary constraints as shown in Table 1. We use one unary
constraint existence(n1 , n2 , A), as a way to specify the range of occurrence of
an activity.


                                      82
  Table 1: Declare constraints and their graphical and textual interpretations


                                  Graphical
Template Name                                          Interpretation
                                Representation


response(A, B )                                        Activity B should (always) eventu-
                                                       ally occur after activity A

precedence(A, B )                                      Activity B can occur only after the
                                                       occurrence of activity A

coexistence(A, B )                                     Activity A implies the presence of
                                                       activity B (and vice versa)

responded − existence(A, B )                           Activity B should (always) occur
                                                       before or after the occurrence of ac-
                                                       tivity A

not − coexistence(A, B )                               Activity A implies the absence of
                                                       activity B (and vice versa)

not − succession(A, B )                                Activity A should never be followed
                                                       by activity B


existence(n1 , n2 , A)                                 Activity A should occur:

                                                           • n1..n2 times


3.2   Process Trees
                                                                    →
Process trees provide a way to represent pro-
cess models in a hierarchically structured way
containing operators (parent nodes) and ac-                 A              z

tivities (leaf nodes). The operator nodes spec-
ify control flow constructs in the process tree.            ∧       D       ∨
Figure 3 shows an example process tree. A
process tree is traversed from left to right.          B        C       E       ×
    The order of child nodes is not important
for and (∧), exclusive-or (×) and inclusive-or                              F       G
(∨) operators, unlike sequence (→) and Xor-
loop () where the order is significant. In the    Fig. 3: Example Process tree
process tree from Figure 3, activities A and Z     showing sequence (→), and (∧),
are always the first and last activities respec-   exclusive-or (×), inclusive-or
tively. For the  operator the left most node      (∨) and Xor-loop () operators


                                      83
is the ‘do’ part of the loop and is executed at least once. In Figure 3, activity
D is the optional ‘re-do’ part of , execution of which activates the loop again.
Activities B and C occur in parallel and hence the order is not fixed. The right
node of the loop is the escape node and it is executed exactly once. For the ×
operator, only one of either F or G is chosen. For the ∨ operator both × and
activity C can occur, or only one of either two can occur.


4   Modification
Following our methodology depicted in Figure 2, we start with the modification
algorithm. As shown in Figure 2, the modification algorithm takes the discovered
process tree and generates a list of candidate trees. This is accomplished using
a “brute force” modification approach with the following steps:
 1. Starting with the original input process tree, variants are created based on
    three primary edit operations: Add node, Remove node and Modify node.
 2. Every node in the process tree is subject to each edit operation, resulting in
    a new variant of process tree.
 3. Each variant of process tree is further edited by iteratively calling all the
    edit operations exhaustively (in any order) using a “brute force” approach.
 4. Every variant of the process tree is added to a pool of candidate process
    trees.
 5. The process of creating process tree variants is repeated until a certain con-
    figurable threshold for number of edit operations w.r.t. each process tree is
    reached.

            →                          →


        A       ×                  A       ×            ×                        ×


            B       C                  B        C   B       C              A     B     C

(a) Process tree discovered (b) Remove node: Removed (c) Add node: Added activ-
by the discovery algorithm. activity A; also resulting in ity A as a child of × (num-
(number of edits: 0)        removal of → (number of ber of edits: 2)
                            edits: 1)
                ×                      →


            →       C              →       C            →                       →


        A       B              A       B        A       B       C         A      B     C

(d) Add node: Added oper-     (e) Modify node: Modified             (f) Remove node: Removed
ator → as a child of ×; and   operator × to →; resulting            operator →; resulting in an
parent of A and B (number     in a process tree with only           empty process tree (number
of edits: 3)                  one parent operator (num-             of edits: 5)
                              ber of edits: 4)

            Fig. 4: Example modification operations on process tree.


                                           84
     It is important to carefully set the threshold for maximum number of edit op-
erations, as a high threshold could result in many changes and a small threshold
would only explore a few changes in the resultant process tree as compared to
the original process tree. Hence there is no easy way to choose the threshold for
selecting an optimal value for number of edit operations. It should be chosen by
the user based on the original (starting) process tree, and the number of unver-
ified constraints in the original process tree. A very high threshold value would
result in more variants of process tree, however it would also be very compute
intensive and inefficient. In order to improve the efficiency of the modification
approach we can use techniques such as genetic algorithms or greedy algorithms
to reduce the search space, discussed in Section 7.
     Figure 4 shows different edit operations used in the modification algorithm.
The Modify node operation exhaustively modifies every node in the process tree
and can be classified into Modify activity and Modify operator depending on the
type of node. Similarly, Add node iteratively adds either an activity node or an
operator node (Figure 4c and Figure 4d). An operator can be added below the
parent node (Figure 4d) and above the parent node (not shown in Figure 4) by
exhaustively combining child nodes. Each edit operator results in a new process
tree, which can be further edited by other edit operations exhaustively until the
threshold for edit distance is reached. Every process tree arising after each edit
operation is added to the pool of candidate process trees. By executing all edit
operations in an iterative way, we can find an optimal sequence of operations to
deduce any process tree with minimal edit operations. As shown in Figure 4f,
we can reduce a process tree to an empty tree thereby ensuring the completeness
of our modification approach.
     Each process tree is evaluated against the four quality dimensions of process
mining (replay fitness, precision, generalization and simplicity) [1], the number of
user specified constraints verified, and the number of edit operations performed
on the tree. This results in six quality dimensions. In order to evaluate the
process trees based on these dimensions we use a Pareto front [4]. The general
idea of a Pareto front is that all models are mutually non-dominating: A model
is dominating with respect to other models, if for all measurement dimensions
it is at least equal or better and for one strictly better. Using the six dimensions
a Pareto front is presented to the user which contains the set of dominating
process trees. For details about quality dimensions and Pareto front evaluation
we refer to [4]. The verification of user constraints is covered in Section 5.


5   Verification

The verification algorithm takes a process tree and a set of constraints as input,
and returns the set of constraints satisfied by the process tree as output. Each
candidate process tree produced by the modification algorithm is verified. In this
section, we present a novel verification approach to assist the selection of the best
candidate process trees. The major advantage of our approach is that since we
utilize a posteriori approach, it is independent of underlying process discovery


                                      85
 Algorithm 1: Declare constraints verification in a process tree
   Input: process tree, set of constraints
   Output: constraints categorized as verified or unverified
 1 begin
 2    foreach constraint do
 3        if not existence constraint then
 4            compute collection of common sub-trees
 5        else
 6            consider full tree
 7        foreach sub-tree do
 8           if not existence constraint then
 9               verify common parent
10               verify position of activties
11               if common parent or position verification fails then
12                   set constraint verification unsuccessful

13            if relations constraint & occurs(STA ,B) is always then
14                set constraint verification successful
15            if negative relations constraint & occurs(STA ,B) is (never) then
16                set constraint verification successful
17            if existence constraint then
18               check range from occurs(PT,A) to occurs multiple times(PT,A)

19     return set of constraints - verified vs unverified


algorithm used to discover the initial process tree, and hence extremely generic
and scalable. In algorithm 1, we show the main sequence of steps used by the
verification approach. In the following sub-sections, we detail the algorithm.

5.1   Sub-tree Computation & Position Verification
Sub-trees are the sub-blocks containing the first common ancestor between the
two activities of the binary (relation or negative relation) constraints. The same
activity can be present at multiple locations in a process tree which could result
in multiple sub-trees for a single constraint, with the total number of sub-trees
equal to the number of occurrences of the activity in the process tree. Formally,
the computation of the collection of sub-trees for a binary constraint defined on
activities A,B can be given as:
    SB = {n0 | n0 ∈ N ∧ l(n) = B}
STcollectionA = {STn (SB , P T ) | n ∈ N ∧ l(n) = A ∧ STn (SB , P T ) 6= null}
where STcollectionA is the collection of sub-trees w.r.t each node A, N is the
collection of all nodes in process tree PT, SB is the set of nodes labelled B, and
STn (SB , P T ) is the sub-tree computed w.r.t. node n explained in detail below.
    Consider the constraint response(A, B ) that should be verified for the pro-
cess tree from Figure 5a. As described in Table 1, a response constraint states


                                        86
                                         →


                         B                              Z                                              


                         →               F               ∧                                     →        Ω       ∧


                 A               B               C           ×                             A       B        Ω       ×


                                                         B         E                                            B       Ω

(a) Activities with cyan background are                                           (b) Sub-tree for response(A, B ) with all
valid activities in process tree for con-                                         the irrelevant activities marked as Ω
straint response(A, B )

                 Fig. 5: Sub-tree computation for the constraint response(A, B )


that every occurrence of activity A should eventually be followed by activity B.
In order to verify that such constraint holds true in the process tree, we first
gather all the locations within the process tree where activity A occurs. For each
occurrence of A in the process tree, we find the first common ancestor containing
A and all the B’s which can be reached after executing activity A. As a process
tree is generally navigated from left to right, all the B’s eventually occurring
after A would naturally be on the right side of A. One caveat is that the order
of children for the operators ∨, ×,and ∧ is not fixed and the child nodes can
be executed in any order. Hence there is an additional check required to verify
the common parent, addressed in subsection 5.3. Figure 5b shows the sub-tree
for constraint response(A, B ). Since there is only one occurrence of activity A in
the process tree, there is only one sub-tree. The first occurrence of B from the

                     →


         1                                                                                                      
     B                              Z                                       null
                                                                 (b) Precedence(A, B ) -
                                                                                                        →       Ω       ∧
     →               F               ∧
                                                                 Null sub-tree with respect
                                                                 to activity B1
                                                                              →
                 2                                                                                  A       Ω       Ω           ×
 A           B               C               ×
                                                                                      2
                                                                                                                            3
                                         3                               A        B
                                                                                                                        B           Ω
                                     B               E
                                                                 (c) Precedence(A, B ) -           (d) Precedence(A, B ) -
(a) Precedence(A, B ) -                                          Sub-tree with respect to          Sub-tree with respect to
Active nodes for this                                            activity B2                       activity B3
constraint are highlighted
in cyan

Fig. 6: Sub-trees computation for the constraint precedence(A, B ). As Fig-
ure 6b results in a null sub-tree, the constraint verification for the constraint
precedence(A, B ) fails w.r.t. the entire process tree.


                                                                             87
                      ∧                                                                 ∧


                ×             →                                               ×                →


            1                                 2                           1
        A           D C       B           A                           A           Ω Ω          B     Ω

(a) Responded − existence(A, B ) and                          (b)       Sub-tree       for                   con-
coexistence(A, B ) - Active nodes for                         straints     coexistence(A, B )                 and
these constraints are highlighted in                          responded − existence(A, B )                   with
cyan                                                          respect to activity A1
                                                                                    ∧


                          →                                                   ×               →


                                      2                                   1                              2
                Ω         B       A                                   A           Ω Ω         B      A

(c)       Sub-tree       for                      con-        (d)     Sub-tree               for     constraint
straints     coexistence(A, B )                    and        coexistence(A, B )            with   respect to
responded − existence(A, B )                      with        activity B
respect to activity A2

Fig. 7: Sub-trees computation for constraints responded − existence(A, B ) and
coexistence(A, B )


original process tree is ignored as it is on the left side of A, and hence this B
cannot be guaranteed to be executed after executing activity A.

    For the precedence(A, B ) constraint; we are interested in finding all the com-
mon sub-trees with respect to B, containing all A’s on the left side of (executed
before) B. There are a total of 3 sub-trees corresponding to each B in the pro-
cess tree from Figure 6. The sub-trees for B2 and B3 are shown in Figure 6c and
Figure 6d respectively. However, for B1 there is no sub-tree containing activity
A prior to (i.e. on the left side of ) B. This results in a null sub-tree as shown
in Figure 6b, and therefore the verification fails.

    Relation constraints such as coexistence and responded-existence are inde-
pendent of the position of the other activity in the process tree. Figure 7 shows
the sub-trees for constraints responded − existence(A, B ) and coexistence(A, B ).
The sub-tree from Figure 7d is calculated with respect to activity B and is
only valid for the constraint coexistence(A, B ). The sub-trees for negative re-
lations constraint are calculated in a similar way to their respective relations
constraints counterpart. However, unlike relation constraints, for negative re-
lations constraints the absence of a sub-tree (null sub-tree) for each activity
from constraint implies satisfaction of the constraint in the process tree. Sub-
tree calculation is not necessary for unary constraints such as existence, wherein
we consider the entire process tree. The next step is to determine whether an
activity will occur, as discussed in subsection 5.2.


                                                         88
5.2   Activity Occurrence Verification

For binary constraints the next step after calculating the sub-trees is checking
the occurrence of the activity in the sub-tree. In order to achieve this, we use the
predicate occurs(STA , B), where A is the node with respect to which sub-tree
ST is computed and B is the second activity of the binary constraint. For every
ancestor of node A, we check the occurrence of activity B which can have the
following values: always, sometimes or never. Formally, this step can be defined
as follows:
    ∀STn ∈ST collectionA ∃ancestor(n) occurs(STA , B)
where acceptable values for occurs(STA ,B) are always and never for relation and
negative relation constraints resp.
    Figure 8b shows the occurrence of activity B, for the sub-tree from Figure 8a
which is computed with respect to activity A. For choice operators such as ×
and ∧, if activity B is present in all the child nodes, then activity B occurs
always w.r.t. the operator node. If only few or none of the children of the choice
operator have occurrence of activity B, then activity B occurs sometimes or
never resp. Similarly, if at least one child of → and ∧ is activity B, then activity
B occurs always w.r.t. this node. In case of  if activity B is present only in
the re-do part of the loop (which may or may not be executed), then activity
B occurs sometimes. If activity B is present in the loop or exit child of the 
operator, then activity B is guaranteed to occur always w.r.t. this node. We
check the occurrence of activity B, at every ancestor of activity A. For binary
relations constraint, if none of the ancestor(s) of activity A have the occurrence
of B as always, then the constraint is not satisfied. On the contrary for negative
relations constraints, if any of the ancestor(s) of activity A have the occurrence of
B as always or sometimes, then the constraint is not satisfied. For every parent
satisfying the constraint, we move on to validating the corresponding parent
verification discussed in subsection 5.3.
    In case of an unary constraint, the predicate occurs multiple times(PT,A) is
calculated with possible values yes or no, where PT is the entire process tree and
A is the activity from the unary constraint. If any of the ancestor(s) of activity
A are children of the loop part or the re-do of  operator, then the multiple
occurrence of activity A is set to yes. Otherwise, the multiple occurrence part of
                                                                         (A)


              →       Ω       ∧                              → (A)        Ω          ∧ (S)


          A       B       Ω       ×                      A           B           Ω        × (S)


                              B       ∨                                               B          ∨ (S)


                                  Ω       B                                                  Ω           B


(a) Sub-tree for constraint response(A, B )        (b) Blue, red and green colors indicate
                                                   the occurrence always(A), never(N ) and
                                                   sometimes(S) respectively.

      Fig. 8: Occurrence(STA ,B) verification for constraint response(A, B )


                                              89
         Table 2: Overview of possible ranges for existence constraint
      occurs(PT,A) at occurs multiple times(PT,A)
                                                        range of occurrence
      the root of PT  at the root of PT
      sometimes         no                              0..1
      sometimes         yes                             0..n
      always            yes                             1..n
      always            no                              exactly 1
      never             n.a.                            exactly 0


activity A is set to no. occurs multiple times(PT,A) gives us the upper bound
of the range, and we combine this with occurs(PT,A) to calculate the lower
bound of the range. We evaluate the unary constraints at the root of the tree
depending on the values of occurs(PT,A) and occurs multiple times(PT,A), as
shown in Table 2.

5.3   Parent Verification
If occurrence verification for each activity from the binary              ×
constraint is successful, then the next step is to verify
the common parent. There are a set of allowable common                 E     B
parent operators for each type of constraint. For example,
if we have to verify the coexistence(E , B ) constraint on the Fig. 9:      Sub-tree
process tree Figure 5a, then one of the sub-trees computed violating constraint
is Figure 9. As the common parent for this sub-tree is coexistence(E , B )
the choice operator ×, both E and B will never occur
together. Hence the common parent verification for this particular sub-tree fails
for constraint coexistence(E , B ). Table 3 summarizes the invalid common parents
for all the constraints from Table 1.
    For binary constraints, if either the sub-tree computation, position verifica-
tion, common parent verification or activity occurrence verification fails, then
that constraint is marked unsatisfied. If all these steps are successful for all the
corresponding sub-trees, then the constraint is marked satisfied. For unary con-
straints, if activity occurrence verification is successful (within the input range)
then the constraint is marked satisfied, otherwise, it is marked unsatisfied.


      Table 3: Invalid common parents for each of the declare constraints
         constraint                      invalid common parent operator
         response(A, B )                 × , ∨ , 1 , ∧
         precedence(A, B )               × , ∨ , 1 , ∧
         coexistence(A, B )              × , ∨ , 1
         responded − existence(A, B )    × , ∨ , 1
         not − succession(A, B )         →,∨,,∧
         not − coexistence(A, B )        →,∨,,∧
                                         1 is invalid only if node B (or A)
                                         is a child of the middle (redo) part


                                        90
6     Evaluation
Evaluation of the candidate process trees can be done in multiple ways. One
method could be to present the domain expert with a list of candidate process
trees (or process models) to choose from. However this approach is highly sub-
jective and would depend entirely on the preference of the domain expert, and
hence would be difficult to quantify. Another approach for evaluation is to dis-
cover an expected model based on user specified constraints. In this approach
there is a certain expected model, which isn’t discovered by the traditional pro-
cess discovery techniques due to reasons such as data inconsistencies, discovery
algorithm biases etc. We use the latter approach for evaluation as it provides a
quantifiable and controlled way to evaluate the results without depending on the
high subjectivity of domain expert. We evaluate our approach based on both a
synthetic log and a real life log.

6.1   Synthetic Event Log
We use a synthetic event log to demonstrate how our approach could improve
an incorrect model discovered due to algorithm bias and noisy event log. For
the event log L = [ hA,B,C,Di90 , hA,C,B,Di90 , hA,C,D,B i90 , hC,A,D,B i90 ,
hC,A,B,Di90 , hC,D,A,B i90 , hC,D,B,Ai6 , hC,B,A,Di6 , hD,A,C,B i6 ], the Induc-
tive Miner infrequent (IMi) [9] generates the process tree with all four activities
in parallel as shown in Figure 10a.
    From the high frequent traces of the log we can deduce simple rules such as
activity A is always eventually followed by activity B ; and activity B is always
preceded by activity A. Similar relationship holds for activities C and D. We
use this information and input the process tree discovered by IMi [9], event log
(L) and the following four constraints in our algorithm: response(A,B), prece-
dence(A,B), response(C,D), and precedence(C,D). Upon setting the maximum
edit distance to 3, the modification algorithm creates 554 unique process trees
resulting in a Pareto front of 7 process trees.
    Figure 10 shows the original process tree discovered by Inductive Miner (Fig-
ure 10a) and a modified process tree (Figure 10b) with highest replay fitness and
precision score from the Pareto front. Table 4 summarizes the dimension scores
of the process trees from Figure 10. The modified process tree from Figure 10b
satisfies all the four constraints. The number of edit operations required in order
to discover the modified process trees is 2. Figure 10b also has a higher precision
value of 1, and considerably high replay fitness score of almost 1. This process
tree is highly precise, thereby explaining the high frequent traces of the event log
                                                           ∧


                    ∧                              →               →


       A       B         C   D                 A       B       C       D

                   (a)                                 (b)

           Fig. 10: Original and modified process trees for event log L.


                                      91
Table 4: Quality dimensions of the Pareto front for process trees from Figure 10
               constraints   replay                                          number
tree                                   precision generalization simplicity
               satisfied     fitness                                         of edits
Figure 10a     0             1         0.833     0.957          1            0
Figure 10b     4             0.997     1         0.957          1            2

much better and ignoring the infrequent noisy traces. From this we can conclude
that by adding knowledge inferred from the event log to the discovered model it
becomes possible to improve it considerably. This way, it is possible to overcome
noise in the event log.

6.2    Real Life Event Log
Exceptional cases may dominate the normal cases, thereby leading to a process
model that is over-fitting the data or that is too general to be of any value. This
process model could however be improved by incorporating domain knowledge.
In order to evaluate such a scenario, we use the following steps on a real-life log
containing the road traffic fine management process with 11 events and 150,370
cases available at [10]:
 • Use the complete event log to mine a process tree using IMi resulting in a
   structured process tree. Figure 11a shows the Petri net representation of this
   process tree. Learn domain rules based on this tree.
 • Filter the event log to select 10% of the cases having exceptionally deviating
   behavior from the process model of Figure 11a.
 • Create a process tree based on the filtered log using IMi. We assume that this
   process tree is our starting point, and input it to the modification algorithm.
   The Petri net representation of this process tree is shown in Figure 11b.
 • Use the rules learnt from the original process model, in combination with
   the entire event log and modified trees to generate a Pareto front.
We deduce 2 coexistence, 2 responded-existence, 4 response and 1 not-succession
rules (9 in total) from the original process model. We use an edit distance of 3
in the modification algorithm and stop creating variants of process trees after
creating 500,000 unique process trees which results in a Pareto front of 54 process
trees. In Table 5 we compare the original process tree, filtered process tree and
the 5 modified process trees; out of which; trees 1,2, and 3 have the combined
highest values for replay fitness and precision in the Pareto front, and trees
4 and 5 have highest individual values in the Pareto front for replay fitness
and precision respectively. As we use the process model containing only 10%


(a) Process model mined with complete       (b) Process model with filtered log con-
event log.                                  taining infrequent traces only.

Fig. 11: Petri net models to show structural dissimilarities between models for
complete and filtered event logs.


                                       92
    Table 5: Dimensions statistics for process trees based on real life event log


                  constraints    replay                                               number
tree                                            precision generalization simplicity
                  satisfied      fitness                                              of edits
Complete log      9              0.970          0.872    0.983           1            0
Filtered log      2              0.957          0.740    0.845           1            0
Pareto Front 1    8              0.882          0.785    0.861           1            3
Pareto Front 2    8              0.817          0.812    0.839           1            3
Pareto Front 3    8              0.816          0.825    0.862           1            3
Pareto Front 4    8              1              0.576    0.009           1            3
Pareto Front 5    8              0.544          0.943    0.929           1            3

of the exceptional cases from the original log as our starting point, most of
the modified process trees score lower than the original process tree on replay
fitness and precision dimensions. Also, trees 4 and 5 from Table 5 demonstrate
that post modification higher replay fitness could result in lower precision and
vice versa. Although the filtered process tree has a higher score in terms of
fitness, it has the lowest precision score among all the other trees (except tree
4). The process tree discovered from the complete event log scores the highest
in all dimensions. However, the modified trees 1 to 3 from Pareto front (in
Table 5) have a nicely balanced score of all dimensions and in general, explain
the complete event log much better than the process tree from the incomplete
filtered log. From a users perspective, depending on the preference, the user can
select any process tree from the Pareto front. For example, if the user is looking
for a process tree satisfying maximum constraints as well as describing the log
very well, then then tree 4 from Table 5 seems to be the viable option. However,
the user can also see the tradeoff in the Pareto front and the fact that although
tree 4 satisfies users requirements, it scores very badly as compared to other trees
in some other dimensions (precision and generalization). Hence, while choosing
the models from Pareto front, the user can make an informed decision while
considering the requirements as well as evaluating different dimensions.


7      Conclusions and Future Work

In this paper we introduced two algorithms in order to incorporate and verify
domain knowledge in a discovered process model. The proposed verification al-
gorithm provides a comprehensive way of validating whether the constraints are
satisfied by the process tree. In the current approach we consider a subset of De-
clare templates. In the future this could be extended to include all the Declare
templates. The current modification algorithm uses a brute force approach and
exhaustively generates multiple process trees. However, currently the modifica-
tion algorithm does not consider the user constraints during the modification
process. In the future, we would like to improve upon the modification algo-
rithm by modifying the process tree in a smarter way (for eg. using genetic or
greedy algorithms), to optimise the modification approach and/or ensure cer-
tain guarantees in the modified process trees. Another future direction could be


                                           93
to incorporate domain knowledge at different stages, for example when logging
event data or during the discovery phase.

References
 [1] van der Aalst, W.M.P.: Process Mining: Discovery, Conformance and En-
     hancement of Business Processes. Springer, Berlin (2011)
 [2] Adriansyah, A., van Dongen, B.F., van der Aalst, W.M.P.: Conformance
     checking using cost-based fitness analysis. In: Enterprise Distributed Object
     Computing Conference (EDOC), 2011 15th IEEE International. pp. 55–64.
     IEEE (2011)
 [3] Adriansyah, A., van Dongen, B.F., van der Aalst, W.M.P.: Towards ro-
     bust conformance checking. In: Business Process Management Workshops,
     Lecture Notes in Business Information Processing, vol. 66, pp. 122–133.
     Springer Berlin Heidelberg (2011)
 [4] Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Quality dimen-
     sions in process discovery: The importance of fitness, precision, generaliza-
     tion and simplicity. Int. J. Cooperative Inf. Syst. 23(1) (2014)
 [5] De Leoni, M., Maggi, F.M., van der Aalst, W.M.P.: Aligning event logs and
     declarative process models for conformance checking. In: Business Process
     Management, pp. 82–97. Springer (2012)
 [6] Fahland, D., van der Aalst, W.M.P.: Repairing process models to reflect
     reality. In: Business process management, pp. 229–245. Springer (2012)
 [7] Greco, G., Guzzo, A., Lupa, F., Luigi, P.: Process discovery under prece-
     dence constraints
 [8] Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-
     structured process models from event logs containing infrequent behaviour.
     In: Business Process Management Workshops. pp. 66–78. Springer (2014)
 [9] Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-
     structured process models from event logs containing infrequent behaviour.
     In: Business Process Management Workshops. pp. 66–78. Springer (2014)
[10] de Leoni, M., Mannhardt, F.: Road traffic fine management process, http:
     //doi:10.4121/uuid:270fd440-1057-4fb9-89a9-b699b47990f5
[11] Maggi, F.M., Mooij, A.J., van der Aalst, W.M.P.: User-guided discovery of
     declarative process models. In: Computational Intelligence and Data Mining
     (CIDM), 2011 IEEE Symposium on. pp. 192–199. IEEE (2011)
[12] de Medeiros, A.K.A., Weijters, A.J.M.M., van der Aalst, W.M.P.: Genetic
     process mining: an experimental evaluation. Data Mining and Knowledge
     Discovery 14(2), 245–304 (2007)
[13] Pesic, M., Schonenberg, H., van der Aalst, W.M.P.: Declare: Full support for
     loosely-structured processes. In: Enterprise Distributed Object Computing
     Conference, 2007. EDOC 2007. 11th IEEE International. pp. 287–287. IEEE
     (2007)
[14] Rembert, A.J., Omokpo, A., Mazzoleni, P., Goodwin, R.T.: Process discov-
     ery using prior knowledge. In: Service-Oriented Computing, pp. 328–342.
     Springer (2013)


                                     94