Mining Duplicate Tasks from Discovered Processes Borja Vázquez-Barreiros, Manuel Mucientes, and Manuel Lama Centro de Investigación en Tecnoloxı́as da Información (CiTIUS) Universidade de Santiago de Compostela, Santiago de Compostela, Spain {borja.vazquez,manuel.mucientes,manuel.lama}@usc.es Abstract. Including duplicate tasks in the mining process is a challenge that hinders the process discovery as algorithms need an extra effort to find out which events of the log belong to which transitions. To face this problem, we propose an approach that uses the local information of the log to enhance an already mined model by performing a local search over the potential tasks to be duplicated. This proposal has been validated over 36 different solutions, improving the final model in 35 out of 36 of the cases. Keywords: Process mining, process discovery, duplicate tasks. 1 Introduction The notion of duplicate tasks —or activities— refers to situations in which mul- tiple tasks in the process have the same label. This kind of behavior is useful when i) a particular task is used in different contexts in a process and ii) to en- hance the comprehensibility of a model. Typically, duplicate tasks are recorded with the same label in the log and, hence, they hinders the discovery of the model that better fits the log, as algorithms need an extra effort to find out which events of the log belong to which transitions. There are several techniques allowing to mine duplicate tasks [2,3,4,5,6,7], however, or the heuristics rules used to detect the duplicate tasks are not sufficiently general for all the logs [7], or they have to deal with a large search space, increasing the time needed for these algorithms [3,5,6]. In this paper we present a novel proposal to tackle duplicate tasks. The proposal starts from an already mined model without duplicate tasks, and uses the local information of the log and the retrieved process to improve the model through a local search over the potential duplicate tasks. 2 Local search algorithm Algorithm 1 describes the proposed approach to tackle duplicate tasks. The first step is the discovery of the potential duplicate activities. We used the heuristics defined in [5] to reduce the search space by stating that two tasks with the same 78 Algorithm 1: Local search Algorithm. input: A log L 1 ind0 ← initial solution(L) // Retrieved by a process discovery technique. 2 potentialDuplicates ← ∅ 3 foreach activity t in the log L do 4 if max(min(|t >L t0 |, |t0 >L t|), 1) > 1 then 5 potentialDuplicates ← potentialDuplicates ∪ t 6 ind0 ← localSearch (ind0 , L, potentialDuplicates, true) 7 Function localSearch( ind0 , L, potentialDuplicates, f irstExecution) 8 indbest ← ind0 9 potentialDuplicatesL2L ← ∅ 10 foreach activity t in potentialDuplicates do 11 combinations ← calculateCombinations (ind0 , L, t) 12 foreach combination c in combinations do 13 t0 ← activity t from ind0 14 t.inputs = (t.inputs \ c.inputs) ∪ c.sharedInputs 15 t0 .inputs = c.inputs 16 t.outputs = (t.outputs \ c.outputs) ∪ c.sharedOutputs 17 t0 .outputs = c.outputs 18 if (I(t0 ) 6= ∅ && O(t0 ) 6= ∅ && I(t) 6= ∅ && O(t) 6= ∅) then 19 Add task t0 to individual ind0 and update t in ind0 20 Repair ind0 21 Post-prune unused arcs 22 Evaluate ind0 23 if ind0 < indbest then 24 ind0 ← indbest 25 else 26 indbest ← ind0 27 potentialDuplicatesL2L = potentialDuplicatesL2L ∪ t00 where t00 ∈ / potentialDuplicates and t >L t00 28 else 29 ind0 ← indbest 30 if f irstExecution then 31 indbest ← localSearch (indbest , null, potentialDuplicatesL2L, f alse) 32 return indbest label cannot share the same input and output dependencies. Within this context, the duplicate tasks are locally identified based on the follows relation (>L ), where the upper bound for an activity t is the minimum of the number of tasks that directly precede t in the log and the number of tasks that directly follow t. This definition can be formalized as [5]: max(min(|t >L t0 |, |t0 >L t|), 1). If for a task t the upper bound is greater than 1, then t is considered as a potential task for being duplicated and, hence, it is added to potentialDuplicates (Alg.1:3-5). After finding the potential duplicates, the algorithm splits the input and out- put dependencies of the activities of the model into multiple tasks with the same label through the function localSearch (Alg.1:7). In this step, the algorithm calcu- lates the input and output combinations for each activity in potentialDuplicates (Alg. 1:10-11) through the function CalculateCombinations (Alg.2). Within this function, the algorithm first finds all the subsequences in the log L that match the pattern t1 tt2 where t1 ∈ I(t) and t2 ∈ O(t) in the model (Alg.2:2) —being I(t) and O(t) the inputs and outputs, respectively, of t. Then, based on these 79 Algorithm 2: Algorithm to compute the combinations of a task. 1 Function calculateCombinations(ind, L, t) /* If the input parameter L is null, retrieve the sequences from parsing ind */ 2 Retrieve all the subsequences t1 tt2 where t1 ∈ I(t) and t2 ∈ O(t) 3 combinations ← ∅ 4 forall the subsequences t1 tt2 do 5 c←∅ 6 Create a set c.inputs with the combinations that share the same t1 and add in c.outputs their respective t2 7 Add c to combinations 8 foreach c in combinations do 9 if c.outputs = c0 .outputs where c0 ∈ combinations then 10 c.inputs = c.inputs ∪ c0 .inputs and c.outputs = c.outputs ∪ c0 .outputs 11 combinations = combinations \ c0 12 if c.outputs shares an element e with another c0 .outputs then 13 c.sharedOutputs ← c.sharedOutputs ∪ e 14 if c.inputs shares an element e with another c0 .inputs then 15 c.sharedInputs ← c.sharedInputs ∪ e 16 return combinations subsequences, the combinations are created following three rules (Alg.2:4-15). First, given two subsequences t1 tt2 and t3 tt4 , if t1 = t3 , then we merge both subsequences into a new combination (Alg.2:4-7). Later, given two different com- binations c and c0 , if they share the same output, i.e., c.output = c0 .output, these two combinations are merged (Alg.2:9-11). Finally, if the intersection between two combinations is not the empty set, we have to record which elements are shared by both combinations (Alg.2:12-15). After creating all the possible combinations, for each combination c (Alg.1:12), the algorithm creates a new task t0 equal to the original activity t of the current model (Alg.1:13). Then, it removes from I(t) all the tasks shared with c.inputs, but keeping the tasks that are in c.sharedInputs (Alg.1:14). On the other hand, for the new task t0 , it retains only the elements in I(t0 ) that are contained in c.inputs (Alg.1:15). The same process is applied for the outputs of both t and t0 but with c.outputs and c.sharedOutputs (Alg.1:16-1:17). If both the inputs and outputs of these tasks are not empty (Alg.1:18), they are included in ind0 (Alg. 1:19). Otherwise the model goes back to its previous state and tries with a new combination. If the new task is included, the model is repaired (Alg.1:20) and the unused arcs are removed (Alg.1:21). In order to evaluate the models (Alg.1:22), we based the quality of a solution on three criteria: fitness replay, precision and simplicity. To measure these criteria we used the hierarchical met- ric defined in [10]. If the new model is better, the best individual indbest is replaced with ind0 (Alg.1:26). Otherwise the model goes back to its previous state and repeats the process with a new combination. The main drawback of the heuristic followed to detect the possible dupli- cate tasks of the log (Alg.1:3 [5]) is that it does not cover all the search space, particularly with tasks involved in a length-two-loop situation, as it breaks the rule of two tasks sharing the same input and output dependencies. To solve this, 80 Table 1: Results for the 18 logs with the initial solutions of ProDiGen and HM. Logs . D pl N R a d 25 im ar 19 1A 1O 10 31 33 34 38 39 42 9 ro c h de p 6p taS ghC g5p g5p g5p g6p g6p g6p g6p g6p g6p g6p g6p P lp l o o i g e i i i i i i i i i i i i el A Fo L F b fl F F F F F F F F F F F R C 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ProDiGen P 0.8 0.75 0.79 0.75 0.86 0.81 0.9 0.76 0.73 0.78 0.61 0.65 0.73 0.93 0.93 0.7 0.79 0.89 S 0.3 0.3 0.29 0.29 0.3 0.29 0.3 0.29 0.29 0.29 0.26 0.28 0.29 0.31 0.3 0.28 0.3 0.30 ProDiGen C 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 + P 0.85 1.0 0.99 0.97 0.94 1.0 1.0 1.0 1.0 0.96 1.0 1.0 0.93 1.0 0.94 1.0 1.0 0.95 LS S 0.31 0.3 0.3 0.3 0.31 0.31 0.31 0.33 0.33 0.31 0.31 0.31 0.31 0.33 0.3 0.3 0.32 0.32 C 1.0 1.0 1.0 1.0 1.0 0.32 1.0 0.67 1.0 1.0 1.0 1.0 0.41 0.0 0.0 0.07 0.21 1.0 HM P 0.72 0.75 0.79 0.73 0.86 0.81 0.9 0.75 0.67 0.78 0.56 0.6 0.76 0.57 0.6 0.64 0.95 0.89 S 0.3 0.3 0.29 0.28 0.31 0.29 0.31 0.33 0.31 0.29 0.26 0.27 0.29 0.28 0.29 0.28 0.32 0.30 HM C 1.0 1.0 1.0 1.0 1.0 0.32 1.0 0.67 1.0 1.0 1.0 1.0 0.72 1.0 0.53 0.36 0.21 1.0 + P 0.81 1.0 0.99 0.94 0.93 0.82 1.0 1.0 1.0 0.96 1.0 1.0 0.98 1.0 0.94 0.95 0.95 0.95 LS S 0.31 0.3 0.3 0.31 0.32 0.30 0.31 0.35 0.33 0.31 0.31 0.31 0.32 0.33 0.32 0.31 0.32 0.32 we have to make all the process iterative: when for a task t, max(min(|t >L t0 |, |t0 >L t|), 1) is greater than 1, i.e, t is detected as a duplicate activity, the upper bound for all the tasks t0 that directly follow t must be updated, because these tasks will now have multiple tasks with the same label as input. Hence, if a task t is correctly duplicated in the model (Alg.1:26), we add the tasks that directly follow t —and that weren’t detected as possible duplicated tasks in the first step— into potentialDuplicatesL2L (Alg.1:27). Therefore, the last step of the algorithm (Alg.1:31) involves a new execution of the function localSearch (Alg.1:7) but with potentialDuplicatesL2L instead of potentialDuplicates. In this second and final execution, the subsequences are obtained from the process model —note that in the first execution the subsequences were extracted from the log. Therefore, the algorithm parses the solution, checking which one of the activities with the same label t0 ∈ I(t) were executed just before t and which activities t00 ∈ O(t) were executed after t. Finally, it creates the combinations based on this information. 3 Experimentation The validation of the presented approach has been done with several synthetic logs from [5,7]. We used ProDiGen [10] and HM [11] over these set of logs to re- trieve the initial solutions. On the other hand, the quality of the models was measured taking into account three metrics: fitness replay (C) [8], precision (P) [1] and simplicity (S) [9] . Table 1 shows the results retrieved before ap- plying the presented approach —the raw solutions mined with ProDiGen and HM— and after the local search. Moreover, they show information about which algorithm retrieves better results for each metric —highlighted in grey— and which solutions are equal to the original model —highlighted in italics. After applying our approach over the solutions, the proposed local search was able to enhance the results in 35 out of 36 of the cases. More specifically, the algorithm was able to i) significantly improve the precision, and ii) to reduce 81 the complexity of the different models by splitting the behavior of the overly connected nodes. Furthermore, our approach was able to retrieve the original model in 25 out of 36 cases. 4 Conclusions We have presented an approach to tackle duplicate tasks in an already discovered model. Our proposal takes as starting point a model without duplicate tasks and its respective log, and based on the local information of the log and the causal dependencies of the input mined model, it improves the comprehensibility of the solution. The presented approach has been validated with 36 different models with duplicate tasks. Results conclude that this local search is able to detect all the potential duplicate tasks in the log, and enhance the comprehensibility of the final model, by improving its fitness replay, precision and simplicity. Acknowledgments This work was supported by the Spanish Ministry of Economy and Compet- itiveness under project TIN2014-56633-C3-1-R, and the Galician Ministry of Education under the projects EM2014/012 and CN2012/151. References 1. Adriansyah, A., Munoz-Gama, J., Carmona, J., van Dongen, B.F., van der Aalst, W.M.P.: Alignment based precision checking. In: BPM. (2012) 137–149 2. Broucke, S.K.V.: Advances in Process Mining. PhD thesis 3. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplic- ity. International Journal of Cooperative Information Systems 23(1) (2014) 4. Carmona, J., Cortadella, J., Kishinevsky, M.: A region-based algorithm for dis- covering petri nets from event logs. In: BPM. Springer (2008) 358–373 5. de Medeiros, A.: Genetic Process Mining. PhD thesis, TU/e (2006) 6. Goedertier, S., Martens, D., Vanthienen, J., Baesens, B.: Robust process discovery with artificial negative events. The Journal of Machine Learning Research 10 (2009) 1305–1340 7. Li, J., Liu, D., Yang, B.: Process mining: Extending α-algorithm to mine dupli- cate tasks in process logs. In: Advances in Web and Network Technologies, and Information Management. (2007) 396–407 8. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Information Systems 33(1) (2008) 64–95 9. Sánchez-González, L., Garca, F., Mendling, J., Ruiz, F., M.Piattini: Prediction of business process model quality based on structural metrics. In: Conceptual Modeling ER 2010. Volume 6412. (2010) 458–463 10. Vázquez-Barreiros, B., Mucientes, M., Lama, M.: ProDiGen: Mining complete, precise and minimal structure process models with a genetic algorithm. Information Sciences 294 (2015) 315–333 11. Weijters, A., van der Aalst, W.M.P., de Medeiros, A.: Process mining with the heuristics miner-algorithm. Technische Universiteit Eindhoven 166 (2006) 82