Mining Duplicate Tasks from Discovered
                     Processes

        Borja Vázquez-Barreiros, Manuel Mucientes, and Manuel Lama

         Centro de Investigación en Tecnoloxı́as da Información (CiTIUS)
      Universidade de Santiago de Compostela, Santiago de Compostela, Spain
            {borja.vazquez,manuel.mucientes,manuel.lama}@usc.es


      Abstract. Including duplicate tasks in the mining process is a challenge
      that hinders the process discovery as algorithms need an extra effort to
      find out which events of the log belong to which transitions. To face this
      problem, we propose an approach that uses the local information of the
      log to enhance an already mined model by performing a local search over
      the potential tasks to be duplicated. This proposal has been validated
      over 36 different solutions, improving the final model in 35 out of 36 of
      the cases.

      Keywords: Process mining, process discovery, duplicate tasks.


1   Introduction
The notion of duplicate tasks —or activities— refers to situations in which mul-
tiple tasks in the process have the same label. This kind of behavior is useful
when i) a particular task is used in different contexts in a process and ii) to en-
hance the comprehensibility of a model. Typically, duplicate tasks are recorded
with the same label in the log and, hence, they hinders the discovery of the
model that better fits the log, as algorithms need an extra effort to find out
which events of the log belong to which transitions. There are several techniques
allowing to mine duplicate tasks [2,3,4,5,6,7], however, or the heuristics rules
used to detect the duplicate tasks are not sufficiently general for all the logs [7],
or they have to deal with a large search space, increasing the time needed for
these algorithms [3,5,6].
    In this paper we present a novel proposal to tackle duplicate tasks. The
proposal starts from an already mined model without duplicate tasks, and uses
the local information of the log and the retrieved process to improve the model
through a local search over the potential duplicate tasks.


2   Local search algorithm
Algorithm 1 describes the proposed approach to tackle duplicate tasks. The first
step is the discovery of the potential duplicate activities. We used the heuristics
defined in [5] to reduce the search space by stating that two tasks with the same


                                       78
 Algorithm 1: Local search Algorithm.
     input: A log L
 1   ind0 ← initial solution(L)                    // Retrieved by a process discovery technique.
 2   potentialDuplicates ← ∅
 3   foreach activity t in the log L do
 4       if max(min(|t >L t0 |, |t0 >L t|), 1) > 1 then
 5           potentialDuplicates ← potentialDuplicates ∪ t

 6   ind0 ← localSearch (ind0 , L, potentialDuplicates, true)
 7   Function localSearch( ind0 , L, potentialDuplicates, f irstExecution)
 8      indbest ← ind0
 9      potentialDuplicatesL2L ← ∅
10      foreach activity t in potentialDuplicates do
11          combinations ← calculateCombinations (ind0 , L, t)
12          foreach combination c in combinations do
13              t0 ← activity t from ind0
14              t.inputs = (t.inputs \ c.inputs) ∪ c.sharedInputs
15              t0 .inputs = c.inputs
16              t.outputs = (t.outputs \ c.outputs) ∪ c.sharedOutputs
17              t0 .outputs = c.outputs
18              if (I(t0 ) 6= ∅ && O(t0 ) 6= ∅ && I(t) 6= ∅ && O(t) 6= ∅) then
19                    Add task t0 to individual ind0 and update t in ind0
20                    Repair ind0
21                    Post-prune unused arcs
22                    Evaluate ind0
23                    if ind0 < indbest then
24                          ind0 ← indbest
25                       else
26                              indbest ← ind0
27                              potentialDuplicatesL2L = potentialDuplicatesL2L ∪ t00 where
                                t00 ∈
                                    / potentialDuplicates and t >L t00

28                else
29                       ind0 ← indbest

30       if f irstExecution then
31             indbest ← localSearch (indbest , null, potentialDuplicatesL2L, f alse)
32       return indbest


label cannot share the same input and output dependencies. Within this context,
the duplicate tasks are locally identified based on the follows relation (>L ),
where the upper bound for an activity t is the minimum of the number of tasks
that directly precede t in the log and the number of tasks that directly follow t.
This definition can be formalized as [5]: max(min(|t >L t0 |, |t0 >L t|), 1). If for a
task t the upper bound is greater than 1, then t is considered as a potential task
for being duplicated and, hence, it is added to potentialDuplicates (Alg.1:3-5).
    After finding the potential duplicates, the algorithm splits the input and out-
put dependencies of the activities of the model into multiple tasks with the same
label through the function localSearch (Alg.1:7). In this step, the algorithm calcu-
lates the input and output combinations for each activity in potentialDuplicates
(Alg. 1:10-11) through the function CalculateCombinations (Alg.2). Within this
function, the algorithm first finds all the subsequences in the log L that match
the pattern t1 tt2 where t1 ∈ I(t) and t2 ∈ O(t) in the model (Alg.2:2) —being
I(t) and O(t) the inputs and outputs, respectively, of t. Then, based on these


                                                 79
 Algorithm 2: Algorithm to compute the combinations of a task.
 1   Function calculateCombinations(ind, L, t)
        /* If the input parameter L is null, retrieve the sequences from parsing ind        */
 2      Retrieve all the subsequences t1 tt2 where t1 ∈ I(t) and t2 ∈ O(t)
 3      combinations ← ∅
 4      forall the subsequences t1 tt2 do
 5          c←∅
 6          Create a set c.inputs with the combinations that share the same t1 and add in
            c.outputs their respective t2
 7          Add c to combinations
 8       foreach c in combinations do
 9           if c.outputs = c0 .outputs where c0 ∈ combinations then
10               c.inputs = c.inputs ∪ c0 .inputs and c.outputs = c.outputs ∪ c0 .outputs
11               combinations = combinations \ c0
12           if c.outputs shares an element e with another c0 .outputs then
13               c.sharedOutputs ← c.sharedOutputs ∪ e
14           if c.inputs shares an element e with another c0 .inputs then
15               c.sharedInputs ← c.sharedInputs ∪ e

16       return combinations


subsequences, the combinations are created following three rules (Alg.2:4-15).
First, given two subsequences t1 tt2 and t3 tt4 , if t1 = t3 , then we merge both
subsequences into a new combination (Alg.2:4-7). Later, given two different com-
binations c and c0 , if they share the same output, i.e., c.output = c0 .output, these
two combinations are merged (Alg.2:9-11). Finally, if the intersection between
two combinations is not the empty set, we have to record which elements are
shared by both combinations (Alg.2:12-15).
    After creating all the possible combinations, for each combination c (Alg.1:12),
the algorithm creates a new task t0 equal to the original activity t of the current
model (Alg.1:13). Then, it removes from I(t) all the tasks shared with c.inputs,
but keeping the tasks that are in c.sharedInputs (Alg.1:14). On the other hand,
for the new task t0 , it retains only the elements in I(t0 ) that are contained in
c.inputs (Alg.1:15). The same process is applied for the outputs of both t and
t0 but with c.outputs and c.sharedOutputs (Alg.1:16-1:17). If both the inputs
and outputs of these tasks are not empty (Alg.1:18), they are included in ind0
(Alg. 1:19). Otherwise the model goes back to its previous state and tries with
a new combination. If the new task is included, the model is repaired (Alg.1:20)
and the unused arcs are removed (Alg.1:21). In order to evaluate the models
(Alg.1:22), we based the quality of a solution on three criteria: fitness replay,
precision and simplicity. To measure these criteria we used the hierarchical met-
ric defined in [10]. If the new model is better, the best individual indbest is
replaced with ind0 (Alg.1:26). Otherwise the model goes back to its previous
state and repeats the process with a new combination.
    The main drawback of the heuristic followed to detect the possible dupli-
cate tasks of the log (Alg.1:3 [5]) is that it does not cover all the search space,
particularly with tasks involved in a length-two-loop situation, as it breaks the
rule of two tasks sharing the same input and output dependencies. To solve this,


                                             80
Table 1: Results for the 18 logs with the initial solutions of ProDiGen and HM.
                                                                        Logs
                                                       .                      D
                                                     pl                     N        R
                       a           d                25   im    ar  19   1A      1O     10   31   33   34   38   39   42   9       ro
                                                                                                                                     c
                      h         de        p       6p taS ghC g5p g5p g5p g6p g6p g6p g6p g6p g6p g6p g6p                        P
                   lp         l       o o     i g      e     i   i    i    i       i      i    i    i    i    i    i    i    el
                  A        Fo       L       F        b    fl   F    F    F       F      F    F    F    F    F    F    F    R

              C 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0
     ProDiGen P 0.8 0.75 0.79 0.75 0.86 0.81 0.9 0.76 0.73 0.78 0.61 0.65 0.73 0.93 0.93 0.7 0.79 0.89
              S 0.3 0.3 0.29 0.29 0.3 0.29 0.3 0.29 0.29 0.29 0.26 0.28 0.29 0.31 0.3 0.28 0.3 0.30

     ProDiGen C 1.0 1.0         1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0                                1.0 1.0
        +     P 0.85 1.0        0.99 0.97 0.94 1.0 1.0 1.0 1.0 0.96 1.0 1.0 0.93 1.0 0.94 1.0                          1.0 0.95
        LS    S 0.31 0.3        0.3 0.3 0.31 0.31 0.31 0.33 0.33 0.31 0.31 0.31 0.31 0.33 0.3 0.3                      0.32 0.32

               C 1.0 1.0 1.0 1.0 1.0 0.32 1.0 0.67 1.0 1.0 1.0 1.0 0.41 0.0 0.0 0.07 0.21 1.0
        HM     P 0.72 0.75 0.79 0.73 0.86 0.81 0.9 0.75 0.67 0.78 0.56 0.6 0.76 0.57 0.6 0.64 0.95 0.89
               S 0.3 0.3 0.29 0.28 0.31 0.29 0.31 0.33 0.31 0.29 0.26 0.27 0.29 0.28 0.29 0.28 0.32 0.30

        HM     C 1.0 1.0        1.0 1.0 1.0 0.32 1.0 0.67 1.0 1.0 1.0 1.0 0.72 1.0 0.53 0.36 0.21 1.0
         +     P 0.81 1.0       0.99 0.94 0.93 0.82 1.0 1.0 1.0 0.96 1.0 1.0 0.98 1.0 0.94 0.95 0.95 0.95
        LS     S 0.31 0.3       0.3 0.31 0.32 0.30 0.31 0.35 0.33 0.31 0.31 0.31 0.32 0.33 0.32 0.31 0.32 0.32


we have to make all the process iterative: when for a task t, max(min(|t >L
t0 |, |t0 >L t|), 1) is greater than 1, i.e, t is detected as a duplicate activity, the
upper bound for all the tasks t0 that directly follow t must be updated, because
these tasks will now have multiple tasks with the same label as input. Hence, if
a task t is correctly duplicated in the model (Alg.1:26), we add the tasks that
directly follow t —and that weren’t detected as possible duplicated tasks in the
first step— into potentialDuplicatesL2L (Alg.1:27). Therefore, the last step of
the algorithm (Alg.1:31) involves a new execution of the function localSearch
(Alg.1:7) but with potentialDuplicatesL2L instead of potentialDuplicates. In
this second and final execution, the subsequences are obtained from the process
model —note that in the first execution the subsequences were extracted from
the log. Therefore, the algorithm parses the solution, checking which one of the
activities with the same label t0 ∈ I(t) were executed just before t and which
activities t00 ∈ O(t) were executed after t. Finally, it creates the combinations
based on this information.


3    Experimentation

The validation of the presented approach has been done with several synthetic
logs from [5,7]. We used ProDiGen [10] and HM [11] over these set of logs to re-
trieve the initial solutions. On the other hand, the quality of the models was
measured taking into account three metrics: fitness replay (C) [8], precision
(P) [1] and simplicity (S) [9] . Table 1 shows the results retrieved before ap-
plying the presented approach —the raw solutions mined with ProDiGen and
HM— and after the local search. Moreover, they show information about which
algorithm retrieves better results for each metric —highlighted in grey— and
which solutions are equal to the original model —highlighted in italics.
    After applying our approach over the solutions, the proposed local search
was able to enhance the results in 35 out of 36 of the cases. More specifically,
the algorithm was able to i) significantly improve the precision, and ii) to reduce


                                                            81
the complexity of the different models by splitting the behavior of the overly
connected nodes. Furthermore, our approach was able to retrieve the original
model in 25 out of 36 cases.

4    Conclusions
We have presented an approach to tackle duplicate tasks in an already discovered
model. Our proposal takes as starting point a model without duplicate tasks and
its respective log, and based on the local information of the log and the causal
dependencies of the input mined model, it improves the comprehensibility of the
solution. The presented approach has been validated with 36 different models
with duplicate tasks. Results conclude that this local search is able to detect all
the potential duplicate tasks in the log, and enhance the comprehensibility of
the final model, by improving its fitness replay, precision and simplicity.

Acknowledgments
This work was supported by the Spanish Ministry of Economy and Compet-
itiveness under project TIN2014-56633-C3-1-R, and the Galician Ministry of
Education under the projects EM2014/012 and CN2012/151.

References
 1. Adriansyah, A., Munoz-Gama, J., Carmona, J., van Dongen, B.F., van der Aalst,
    W.M.P.: Alignment based precision checking. In: BPM. (2012) 137–149
 2. Broucke, S.K.V.: Advances in Process Mining. PhD thesis
 3. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Quality dimensions in
    process discovery: The importance of fitness, precision, generalization and simplic-
    ity. International Journal of Cooperative Information Systems 23(1) (2014)
 4. Carmona, J., Cortadella, J., Kishinevsky, M.: A region-based algorithm for dis-
    covering petri nets from event logs. In: BPM. Springer (2008) 358–373
 5. de Medeiros, A.: Genetic Process Mining. PhD thesis, TU/e (2006)
 6. Goedertier, S., Martens, D., Vanthienen, J., Baesens, B.: Robust process discovery
    with artificial negative events. The Journal of Machine Learning Research 10
    (2009) 1305–1340
 7. Li, J., Liu, D., Yang, B.: Process mining: Extending α-algorithm to mine dupli-
    cate tasks in process logs. In: Advances in Web and Network Technologies, and
    Information Management. (2007) 396–407
 8. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on
    monitoring real behavior. Information Systems 33(1) (2008) 64–95
 9. Sánchez-González, L., Garca, F., Mendling, J., Ruiz, F., M.Piattini: Prediction
    of business process model quality based on structural metrics. In: Conceptual
    Modeling ER 2010. Volume 6412. (2010) 458–463
10. Vázquez-Barreiros, B., Mucientes, M., Lama, M.: ProDiGen: Mining complete,
    precise and minimal structure process models with a genetic algorithm. Information
    Sciences 294 (2015) 315–333
11. Weijters, A., van der Aalst, W.M.P., de Medeiros, A.: Process mining with the
    heuristics miner-algorithm. Technische Universiteit Eindhoven 166 (2006)


                                        82