=Paper= {{Paper |id=Vol-2478/paper5 |storemode=property |title=Efficient Comparison of Process Models using Tabu Search Algorithm |pdfUrl=https://ceur-ws.org/Vol-2478/paper5.pdf |volume=Vol-2478 |authors=Andrey Skobtsov,Anna Kalenkova }} ==Efficient Comparison of Process Models using Tabu Search Algorithm== https://ceur-ws.org/Vol-2478/paper5.pdf
     Efficient Comparison of Process Models using
                          Tabu Search Algorithm*

                          A.V. Skobtsov1 and A.A. Kalenkova1

      1
       National Research University Higher School of Economics, Moscow, Russia
                       {akalenkova@,avskobtsov@edu.}hse.ru



          Abstract. Companies from various domains record their operational
          behavior in a form of event logs. These event logs can be analyzed and
          relevant process models representing the real companies’ behavior can be
          discovered. One of the main advantages of the process discovery methods
          is that they commonly produce models in a form of graphs which can be
          easily visualized giving an intuitive view of the executed processes. More-
          over, the graph-based representation opens new challenging perspectives
          for the application of graph comparison methods to find and explicitly vi-
          sualize differences between discovered process models (representing real
          behavior) and reference process models (representing expected behav-
          ior). Another important area where graph comparison algorithms can be
          used is the recognition of process modeling patterns. Unfortunately, ex-
          act graph comparison algorithms are computationally expensive. In this
          paper, we adapt an inexact tabu search algorithm to find differences be-
          tween BPMN (Business Process Model and Notation) models. The tabu
          search and greedy algorithms were implemented within the BPMNDif-
          fViz tool and were tested on BPMN models discovered from synthetic
          and real-life event logs. It was experimentally shown that inexact tabu
          search algorithm allows to find a solution which is close to the optimal
          in most of the cases. At the same, its computational complexity is sig-
          nificantly lower than the complexity of the exact A∗ search algorithm
          investigated earlier.

          Keywords: graph edit distance, process mining, BPMN (Business Pro-
          cess Models and Notation), tabu search, conformance checking.


1     Introduction

    Information systems automating processes in various fields, including medical
care, education, e-government, banking, write history of their executions to event
logs. Process mining techniques allow us to discover process models generalizing
systems’ behavior presented in these event logs [3].
    To understand the meaning of the discovered process models, their structure
can be additionally analyzed. For example, process modeling patterns such as

* Copyright c 2019 for this paper by its authors. Use permitted under Creative Com-
    mons License Attribution 4.0 International (CC BY 4.0)
sequence, choice, parallel execution, loop, and other, more complicated structures
can be identified automatically. Besides that, discovered process models can be
compared to reference models explicitly showing deviations between real and
expected process behavior.
    In papers [1, 2], it was suggest to use a so-called graph edit distance to
compare process models discovered from event logs. The graph edit distance
shows the degree of model difference/similarity. More precisely, it is defined
as a minimal number of elementary steps (node insertion/deletion, edge inser-
tion/deletion, label editing) needed to transform one graph another. Methods
for finding minimal graph edit distance can be applied to process models rep-
resented in a form of labeled oriented graphs with node and edge types. The
exact A* search algorithm [4] was implemented in a tool [1] for the comparison
of BPMN (Business Process Model and Notation) models [5]. This tool was used
for the comparison of BPMN models discovered from event logs of e-government
services [2]. To reduce the computational complexity of the model comparison
technique, process-specific heuristics were used [2]. Nevertheless, it was still not
possible to compare large process models extracted from event data. This can be
explained by the fact that the problem of finding minimal graph edit distance is
know to be NP-complete [6].
    In this paper, we study the possibility of finding minimal-graph edit distance
by applying an inexact tabu search algorithm [7]. The advantage of this algo-
rithm is that the total number of iterations is bounded by a constant predefined
number. A tabu search method for finding minimal-graph edit distance between
graphs was proposed in [8]. This method was focused on the application in the
computer vision field. In contrast to [8], we will compare process-specific models
discovered from event data.
    To estimate time and accuracy characteristics of the proposed algorithm,
we will compare process models discovered from event logs using Alpha [9] and
Inductive mining algorithms [10]. For the conversion of discovered process models
to BPMN standard we will use transformation rules proposed in [11].
    The paper is organized as follows. Section 2 presents main notions. Section
3 describes an exact graph matching technique based on A* search algorithm.
An algorithm for finding graph edit distances between BPMN models using tabu
search technique is presented in Section 4. Section 5 contains experiment results.
And finally, Section 6 concludes the paper.


2   Preliminaries

    In this section, we will define flat BPMN models, business process graphs
and distances between them. These notions will be used through the paper.
    Although there exists a wide range of models’ types proposed by the Object
Management Group (OMG) [5], we will consider flat BPMN models only. These
models formalize the simple control flow. They can be discovered in two steps: (1)
a low-level model (such as Petri net, causal net, or process tree) is synthesized
from the event log; (2) this model is converted into a high-level model using
algorithms presented in [11].
    Flat BPMN models constructed from Petri nets, process trees or causal nets
are represented by the following basic set of elements: tasks, exclusive and parallel
gateways, start and end events, control flows (Fig. 1).
                           Start (Fig. 1 a.) and end events (Fig. 1 b.) denote
                       the beginning and the termination of the process respec-
   a.    b.       c.   tively. Tasks (Fig. 1 c.) model atomic process steps. Par-
                       allel (Fig. 1 d.) and exclusive gateways (Fig. 1 e.) are used
      d.       e.      to model process branches which are executed in parallel
                       or are mutually exclusive.
            f.
                           An example of a BPMN model which presents a simple
Fig. 1. Elements of
                       booking procedure is presented in Fig. 2. Firstly, the user
flat BPMN models.
                       registers, then books a flight or a hotel (these activities
are performed in parallel and each of them can be skipped), and finally, pays.

                                                 book
                                                 flight


                         register                                     pay


                                                 book
                                                 hotel



          Fig. 2. A BPMN model describing a simple booking procedure.


    Flat BPMN models can be considered as oriented graphs with node and
edge types. We will call them business process graphs. Let us give their formal
definition.
    Business process graph is a tuple G = (N, E, t, l ), where N – is a set of nodes,
E ⊆ (N ×N ) – a set of edges, t : N → T , where T = {start event, end event, task,
exclusive gateway, parallel gateway} – is a function which defines node types, l :
N → L – a function which defines labels, and L – is a set of labels.
    Let us consider two business process graphs: Gk = (Nk , Ek , tk , lk ), k = 1, 2.
The distance between these graphs is quantified by constructing an edit relation
R ⊆ (N1 ∪ E1 ∪ {, δ}) × (N2 ∪ E2 ∪ {, δ}), where , δ ∈ / N1 ∪ N2 ∪ E1 ∪ E2 , such
that the following conditions hold:

 1. for each element i1 ∈ N1 ∪ E1 (i2 ∈ N2 ∪ E2 ) there exists one and only one
    element i2 ∈ N2 ∪ E2 ∪ {, δ} (i1 ∈ N1 ∪ E1 ∪ {, δ}), such that (i1 , i2 ) ∈ R;
 2. if i1 ∈ {, δ} (i2 ∈ {, δ}) and (i1 , i2 ) ∈ R, then i2 ∈ / {, δ} (i1 ∈      / {, δ});
 3. if i1 ∈ N1 and (i1 , i2 ) ∈ R, then i2 ∈ N2 ∪ {, δ}, and if additionally i2 ∈ N2 ,
    then the node types coincide, i.e., t1 (i1 ) = t2 (i2 );
 4. if i1 ∈ E1 and (i1 , i2 ) ∈ R, then i2 ∈ E2 ∪ {, δ};
 5. for any (i1 , i01 ) ∈ E1 , (i2 , i02 ) ∈ E2 the condition ((i1 , i01 ), (i2 , i02 )) ∈ R holds
    iff (i1 , i2 ) ∈ R and (i01 , i02 ) ∈ R.
    If for an element i holds that (i, ) ∈ R ((, i) ∈ R), then we say that i is
deleted (inserted ). If (i, δ) ∈ R ((δ, i) ∈ R), then we say that there is no corre-
sponding element for i (this will be used to represent intermediate comparison
results).
    Let us compare two business process graphs Gk = (Nk , Ek , tk , lk ), k = 1, 2
by constructing an edit relation R. For each r = (i1 , i2 ) ∈ R the cost will be
calculated as follows:
                           
                           
                           
                            lev(l1 (i1 ), l(i2 )) · clev , i1 ∈ N1 , i2 ∈ N2 ,
                           0,                              i1 ∈ E1 , i2 ∈ E2 ,
                           
                           
                           
               cost(r) = cdelete ,                          i2 = ,
                           
                           cinsert ,
                           
                                                           i1 = ,
                           
                           
                           0,                              i1 = δ ∨ i2 = δ.

    Function lev calculates Levenshtein distance [12] between two labels, clev –
is a special coefficient; cdelete and cinsert denote costs of deletion and insertion
operations respectively.
    The total cost of the edit relation R P  is defined as a sum of costs of all pairs
which belong to this relation: cost(R) = r∈R cost(r).
    The minimal edit relation between two business process graphs is an edit
relation with minimal cost, such that it does not contain pairs with δ (cor-
respondences are defined for all elements). The cost of this relation is called
minimal graph edit distance.


3    An Exact Algorithm for Finding Minimal Graph Edit
     Distance

    A description of an exact algorithm for finding minimal graph edit distance
is presented in this section (Algorithm 1). This algorithm takes a minimal
graph edit relation from the queue at each step. Then from this edit relation
novel edit relations are generated by applying expand function matching a node
which has no correspondence yet (it corresponds to δ) to all possible nodes of
the other model which have the same type and have no correspondence either
(they correspond to δ) or to the  element. When calculating new costs incident
edges of this node are also considered. The algorithm stops when a minimal
cost relation does not contain pairs with δ, i.e, there are correspondences for all
elements. The cost of this edit relation is a minimal graph edit distance between
these graphs.
    Fig. 3 presents a results of comparison of business process graphs modeling
a booking procedure. The business process graph presented in Fig. 3 a. models
the booking procedure which was presented before (Fig. 2). The model shown
in Fig. 3 b. assumes that there can be a cancellation. To transform the first
business process graph (Fig. 3 a.) to the second (Fig. 3 b.) two edges are to
be deleted and a subprocess corresponding to the cancellation should be added.
Although the structure of models is different, they have the same underlying
booking procedure.
   Data: G1 = (N1 , E1 , t1 , l1 ) and G2 = (N2 , E2 , t2 , l2 ) – business process
          graphs;
   Result: minimal graph edit relation between G1 and G2 ;
   \\Rinit – start edit relation;
   Rinit ← {};
   for i1 ∈ N1 ∪ E1 do
      Rinit ← Rinit ∪ {(i1 , δ)};
   end
   for i2 ∈ N2 ∪ E2 do
      Rinit ← Rinit ∪ {(δ, i2 )};
   end
   \\Q - a sorted queue of edit relations;
   Q ← hRinit i;
   while true do
      \\select edit relation with a minimal cost;
      Rmin ← takeM inCostRelation(Q);
      if (Rmin contains pairs with δ) then
          i ← takeN odeRelatedtoDelta(Rmin );
          Q.remove(Rmin );
          \\add all possible pairs for i to Rmin ;
          Q.add(expand(Rmin , i));
      else
          return cost(Rmin );
      end
   end
 Algorithm 1: Finding minimal graph edit distance between business process
 graphs.

   In order to make the algorithm more efficient, a special heuristic function
can be used. It is defined on a set of possible relations as follows:
                                (
                        X        (|N1t | − |N2t |) · cdelete , |N1t | ≥ |N2t |,
              H(R) =
                            t∈T  (|N2t | − |N1t |) · cinsert , |N2t | > |N1t |.

     N1t ⊆ N1 and N2t ⊆ N2 denote sets of model nodes of type t with no cor-
respondences
P                  currently defined. Then the total cost is calculated as: cost(R) =
   r∈R    cost(r)  + H(R). Indeed, using a heuristic function, one can predict the
number of nodes for which no corresponding nodes will be found in another
graph, so they are guaranteed to be removed or added. Let us consider an edit
relation R0 obtained from R by matching unmatched elements, such that for all
(i1 , i2 ) ∈ R0 it holds that i1 P
                                 6= δ and i2 6= δ, i.e., all elements in R0 are matched.
Then, cost(R ) = cost(R) + r∈R cost(r), where R = {(i1 , i2 ) ∈ R0 |(i1 , δ) ∈ R ∨
                 0

(δ, i2 ) ∈ R}. Let us consider nodes of type t ∈ T . Since we construct one-to-one
relation, the number of elements which are to be deleted is at least |N1t |−|N2t | (if
|N1t | ≥ |N2t |), or the number                                            t     t
                            P of nodes which are to be added is at least |N2 |−|N1 |
        t         t
(if |N2 | > |N1 |). Thus, r∈R cost(r) ≥ H(R), and the heuristic function does
not overestimate the final result.
    The approach based on the use of a heuristic function is also known as A∗
search algorithm [4].
    Another heuristics that can be used is that we can first consider those nodes
that are connected with nodes for which correspondences are already defined.
This heuristics allows us to efficiently find a solution when comparing similar
process models.
    The proposed heuristics were implemented in [1] and were tested on busi-
ness processes graphs synthesized from event logs of real-life information sys-
tems [2]. Although these heuristics make the algorithm of comparing business
process graphs work faster, the complexity of the problem associated with its
NP-completeness remains. This necessitates the development of imprecise but
fast methods that will allow us finding near minimum graph edit distances be-
tween large process models.



                                      book
                                      flight


                    register                         pay


                                      book
                                      hotel

                                       а.



                                      book
                                      flight
                                                            pay

                    register

                                                           cancel
                                      book
                                      hotel

                                       b.

Fig. 3. The result of comparison of business process graphs modeling a booking
procedure.


4   Application of Tabu Search Algorithm for the
    Comparison of Business Process Graphs
    This section contains a description of a so-called tabu search algorithm
adapted to solve the problem of finding minimal edit distance between graphs
(Algorithm 2). The tabu search algorithm moves through a graph of solutions
choosing the best neighboring solution at each time. To avoid looping, a spe-
cial tabuList is used. This list contains solutions which were already consid-
ered. At the same time, it may have a limited length (maxT abuListSize) in
order to provide a better performance. Another parameter of the algorithm is
maxExpansions, it limits the overall number of steps. Note that all of the afore-
mentioned solutions are both belong to R and there are no unmatched pairs in
any of them, i.e. there is no pair with δ in any solution.

     Firstly, the current solution Rcur is initialized by applying a greedy algorithm.
Then, it is added to tabuList, and neighboring solutions are generated. After
that, the neighboring solutions which belong to tabuList are filtered out and
the one with a minimal cost is selected. If its cost is lower than the global
minimal cost, then the global minimal cost is updated. If tabuList is full, the
first (the oldest) solution is removed. Finally, after the fixed number of steps
(maxExpansions) is completed or if there are no neighboring solutions due to
filtering out, the current global minimal cost is returned. Note that the algorithm
can also return the solution corresponding to this minimal cost.

   Data: G1 = (N1 , E1 , t1 , l1 ) and G2 = (N2 , E2 , t2 , l2 ) – business process
           graphs; maxT abuListSize – the maximal length of the tabu list;
           maxExpansions – the maximal number of state expansions;
   Result: graph edit distance between G1 and G2 ;
   \\initialize Rcur – edit relation;
   Rcur ← Rgreedy ;
   minCost ← cost(Rcur );
   tabuList.add(Rcur );
   foreach expansion ∈ {1, · · · , maxExpansions} do
      oneStepV ariants ← generateOneStepV ariants(Rcur );
      foreach variant ∈ oneStepV ariants do
           if variant ∈ tabuList then
               oneStepV ariants.remove(variant);
           end
      end
      if oneStepV ariants = {} then
           break;
      end
      Rcur ← takeM inCostRelation(oneStepV ariants);
      if minCost > cost(Rcur ) then
           minCost ← cost(Rcur );
      end
      if tabuList.size() = maxT abuListSize then
           tabuList.removeF irst();
      end
      tabuList.add(Rcur );
   end
   return minCost;
 Algorithm 2: Finding a minimal edit distance between business process
 graphs by applying tabu search algorithm.
    Now let us describe the generateOneStepV ariants procedure for construct-
ing all possible neighboring solutions of the current edit relation R. The neigh-
boring solutions are generated by applying one of the following substitution rules
to one of the pairs belonging to R:
 – (n1 , n2 ) → (n1 , ), (, n2 ),
 – (n1 , ), (, n2 ) → (n1 , n2 ),
 – (n1 , ), (n01 , n2 ) → (n1 , n2 ), (n01 , ),
 – (n1 , n2 ), (, n02 ) → (n1 , n02 ), (, n2 ),
 – (n1 , n2 ), (n01 , n02 ) → (n1 , n02 ), (, n2 ), (n01 , ),
 – (n1 , n2 ), (n01 , n02 ) → (n1 , ), (n01 , n2 ), (, n02 ).
    Note that the pairs specifying relations between edges will be defined in
accordance with the definition of edit relation. Also note that we match elements
with the same type only.
    To initialize the current edit relation we will use greedy algorithm, which
works as A∗ (Algorithm 1) with the queue Q of size 1. As well as A∗ , it starts
with an empty relation and then adds pairs to this relation at each step selecting
a relation with a minimal cost. It stops when there are no pairs with δ. In
contrast to A∗ , it has a linear computational complexity. In the next section, we
will evaluate accuracy and performance characteristics of A∗ , tabu and greedy
algorithms by comparing BPMN models discovered from event logs.

5     Experimental results
     This section contains results of experiments for finding minimal edit distances
between BPMN models discovered by different algorithms [9, 10] from the same
sets of artificial 1 and real-life event logs from e-government services, and online
ticketing domains.
     All the deletion and insertion costs were set to 1 during the experiments,
i.e., cdelete = cinsert = 1, the Levenshtein distance coefficient was defined as
clev = 10. For the tabu search the parameters were set as: maxExpansions =
100, maxT abuListSize = 100. All experiments were carried out on Asus Zen-
Book Pro, with the following characteristics: Intel i7-8750H 2.20 GHz processor,
16GB RAM.
     Fig. 4 shows execution times (Fig. 4 a.) as well as minimal costs (Fig. 4 b.)
obtained by A∗ , tabu and greedy algorithms for BPMN modes discovered from
artificial logs using different (Inductive and Alpha miner) algorithms.
     Similarly, BPMN models discovered from real-life event logs of e-government
services and online ticketing systems using different process mining algorithms
were compared using A∗ , tabu and greedy techniques (Fig. 5). The size of models
was varied by filtering out different portions of infrequent behavior using the
Filter Log using Simple Heuristics ProM 6.8 plugin. After filtering out each of
the event logs was processed via Alpha Miner and Mine Petri net with Inductive
Miner plugins with default parameters resulting into Petri nets. Finally, these
Petri nets were converted to BPMN models and compared.
1
    http://www.processmining.org/event_logs_and_models_used_in_book.
Fig. 4. Characteristics of the algorithms comparing BPMN models discovered
from artificial event logs.




Fig. 5. Comparison of BPMN models discovered by Alpha [9] and Inductive [10]
mining algorithms.



   And finally, BPMN models discovered by the Inductive miner algorithm cor-
responding to different parts of event logs were compared. In this case, random
1000 traces were selected from the initial log using the Extract sample of traces
(Random) plugin twice, then processed using the Inductive miner algorithm only.
The results of this comparison are presented in Fig.6.




Fig. 6. Comparison of BPMN models constructed by Inductive [10] algorithm
from different random parts of real-life event logs.
    As it follows from the result presented in Fig. 4, 5, 6, the execution time
of the exact A∗ exceeds 30 minutes for large models. In such cases, it is more
feasible to use tabu search, which is rather fast (as shown by the experiments)
and produces optimal solutions in most of the experimental cases presented in
Fig. 4, 5, 6.


6   Conclusion

    Algorithms for the analysis and discovery of process models from information
systems’ event logs (process mining) are widely used. There are many commer-
cial tools for the discovery of process models based on the event logs analysis.
With the development of the theory of process mining the task of model to
model comparison becomes extremely important, as it not only helps to find
differences between expected and real behavior, but also to visualize the result
(this characteristic of process analysis algorithms is especially appreciated by
end users). Due to the fact that in general the problem of graph models com-
parison is NP-complete, heuristic methods for the comparison of process models
are particularly valuable. In this paper, we adapted a greedy and tabu search
algorithms for the problem of matching process models extracted from event
logs. These algorithms were implemented and tested on BPMN models. Both
the accuracy and the time characteristics of the algorithms were evaluated. It
was demonstrated that the tabu search algorithm helps to significantly improve
the time, while producing optimal results in most of the cases.



Acknowledgment. This work was funded by the President Grant MK-4188.2018.9.

References

 1. Ivanov, S.Y., Kalenkova, A.A., van der Aalst, W.M.P.: BPMNDiffViz: A Tool for
    BPMN Models Comparison. In: Proceedings of the BPM Demo Session 2015 Co-
    located with the 13th International Conference on Business Process Management
    (BPM 2015), Innsbruck, Austria, September 2, 2015. (2015) 35–39
 2. Kalenkova, A.A., Ageev, A.A., Lomazova, I.A., van der Aalst, W.M.P.: E-
    Government Services: Comparing Real and Expected User Behavior. In Teniente,
    E., Weidlich, M., eds.: Business Process Management Workshops, Cham, Springer
    International Publishing (2018) 484–496
 3. van der Aalst, W.: Process Mining: Data Science in Action. 2 edn. Springer (2016)
 4. Hart, P.E., Nilsson, N.J., Raphael, B.: A formal basis for the heuristic deter-
    mination of minimum cost paths. IEEE transactions on Systems Science and
    Cybernetics 4(2) (1968) 100–107
 5. OMG: Business Process Model and Notation (BPMN). Object Management
    Group, formal/2013-12-09 (2013)
 6. Garey, M.R., Johnson, D.S.: Computers and Intractability; A Guide to the Theory
    of NP-Completeness. W. H. Freeman & Co., New York, NY, USA (1990)
 7. Glover, F.: Future paths for integer programming and links to artificial intelligence.
    Comput. Oper. Res. 13(5) (May 1986) 533–549
 8. Adamczewski, K., Lee, Y.S.K.M.: Discrete tabu search for graph matching. In:
    International Conference on Computer Vision. (2015)
 9. van der Aalst, W.M.P., Weijters, A.J.M.M., Maruster, L.: Workflow mining: Dis-
    covering process models from event logs. IEEE Transactions on Knowledge and
    Data Engineering 16(9) (2004) 1128–1142
10. Leemans, S., Fahland, D., van der Aalst, W.: Discovering Block-Structured Process
    Models from Incomplete Event Logs. In: 35th International Conference, PETRI
    NETS 2014. Volume 8489 of LNCS. Springer, Cham (2014) 91–110
11. Kalenkova, A., van der Aalst, W., Lomazova, I., Rubin, V.: Process mining using
    bpmn: relating event logs and process models. Software and Systems Modeling
    (2015) 1–30
12. Levenshtein, V.: Binary Codes Capable of Correcting Deletions, Insertions and
    Reversals. Soviet Physics Doklady 10 (1966) 707