
Mining Duplicate Tasks from Discovered Processes

Borja Vazquez-Barreiros

borja.vazquez@usc.es

Manuel Mucientes

Manuel Lama

Centro de Investigación en Tecnoloxías da Información (CiTIUS), Universidade de Santiago de Compostela, Santiago de Compostela, Spain


Including duplicate tasks in the mining process is a challenge that hinders process discovery, as algorithms need an extra effort to find out which events of the log belong to which transitions. To face this problem, we propose an approach that uses the local information of the log to enhance an already mined model by performing a local search over the potential tasks to be duplicated. This proposal has been validated over 36 different solutions, improving the final model in 35 of the 36 cases.

Keywords: process mining, process discovery, duplicate tasks

1 Introduction

The notion of duplicate tasks (or activities) refers to situations in which multiple tasks in the process have the same label. This kind of behavior is useful i) when a particular task is used in different contexts in a process and ii) to enhance the comprehensibility of a model. Typically, duplicate tasks are recorded with the same label in the log and, hence, they hinder the discovery of the model that best fits the log, as algorithms need an extra effort to find out which events of the log belong to which transitions. Several techniques can mine duplicate tasks [2,3,4,5,6,7]; however, either the heuristic rules used to detect the duplicate tasks are not sufficiently general for all logs [7], or they have to deal with a large search space, which increases the time needed by these algorithms [3,5,6].

In this paper we present a novel proposal to tackle duplicate tasks. The proposal starts from an already mined model without duplicate tasks, and uses the local information of the log and the retrieved process to improve the model through a local search over the potential duplicate tasks.

2 Local search algorithm

Algorithm 1 describes the proposed approach to tackle duplicate tasks. The first step is the discovery of the potential duplicate activities. We used the heuristics defined in [5] to reduce the search space by stating that two tasks with the same label cannot share the same input and output dependencies. Within this context, the duplicate tasks are locally identified based on the follows relation ($>_L$), where the upper bound for an activity $t$ is the minimum of the number of tasks that directly precede $t$ in the log and the number of tasks that directly follow $t$. This definition can be formalized as [5]: $\max(\min(|t >_L t'|, |t' >_L t|), 1)$. If for a task $t$ the upper bound is greater than 1, then $t$ is considered a potential duplicate task and, hence, it is added to potentialDuplicates (Alg.1:3-5).

Algorithm 1: Local search algorithm.

input: A log $L$
1 $ind_0 \leftarrow$ initial_solution($L$) // Retrieved by a process discovery technique.
2 potentialDuplicates $\leftarrow \emptyset$
3 foreach activity $t$ in the log $L$ do
4     if $\max(\min(|t >_L t'|, |t' >_L t|), 1) > 1$ then
5         potentialDuplicates $\leftarrow$ potentialDuplicates $\cup\ t$
...
$ind_{best} \leftarrow ind_0$
...
potentialDuplicatesL2L = potentialDuplicatesL2L $\cup\ t''$ where $t'' \notin$ potentialDuplicates and $t >_L t''$
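The detection step (Alg.1:3-5) can be illustrated with a minimal Python sketch. It assumes the log is a list of traces, each trace being a list of activity labels; the function name `potential_duplicates` and the toy log are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def potential_duplicates(log):
    """Return {activity: upper_bound} for activities whose bound exceeds 1.

    The bound is max(min(#direct predecessors, #direct successors), 1),
    computed from the directly-follows relation of the log.
    """
    preds = defaultdict(set)  # activities that directly precede a given activity
    succs = defaultdict(set)  # activities that directly follow a given activity
    activities = set()
    for trace in log:
        activities.update(trace)
        for a, b in zip(trace, trace[1:]):
            succs[a].add(b)
            preds[b].add(a)
    bounds = {t: max(min(len(preds[t]), len(succs[t])), 1) for t in activities}
    return {t: bound for t, bound in bounds.items() if bound > 1}


# Toy log (illustrative only): B appears in two different contexts.
log = [
    ["A", "B", "C", "E"],
    ["A", "D", "B", "F", "E"],
]
print(potential_duplicates(log))  # {'B': 2}
```

In this toy log only B has more than one direct predecessor (A, D) and more than one direct successor (C, F), so it is the only candidate for duplication.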

After finding the potential duplicates, the algorithm splits the input and output dependencies of the activities of the model into multiple tasks with the same label through the function localSearch (Alg.1:7). In this step, the algorithm calculates the input and output combinations for each activity in potentialDuplicates (Alg.1:10-11) through the function CalculateCombinations (Alg.2). Within this function, the algorithm first finds all the subsequences in the log $L$ that match the pattern $t_1 t t_2$, where $t_1 \in I(t)$ and $t_2 \in O(t)$ in the model (Alg.2:2), $I(t)$ and $O(t)$ being the inputs and outputs, respectively, of $t$. Then, based on these subsequences, the combinations are created following three rules (Alg.2:4-15). First, given two subsequences $t_1 t t_2$ and $t_3 t t_4$, if $t_1 = t_3$, then both subsequences are merged into a new combination (Alg.2:4-7). Next, given two different combinations $c$ and $c'$, if they share the same output, i.e., $c.output = c'.output$, these two combinations are merged (Alg.2:9-11). Finally, if the intersection between two combinations is not the empty set, the elements shared by both combinations are recorded (Alg.2:12-15).

Algorithm 2: Algorithm to compute the combinations of a task.
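The three rules can be sketched in Python as follows. This is one possible reading of the description above, not the authors' implementation: "same output" is interpreted as identical output sets, and the names `Combination`, `calculate_combinations`, `model_inputs` and `model_outputs` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Combination:
    inputs: set
    outputs: set
    shared_inputs: set = field(default_factory=set)
    shared_outputs: set = field(default_factory=set)


def calculate_combinations(log, t, model_inputs, model_outputs):
    """Build input/output combinations for task t from subsequences t1 t t2."""
    # Collect (t1, t2) from every subsequence t1 t t2 with t1 in I(t), t2 in O(t).
    pairs = set()
    for trace in log:
        for t1, mid, t2 in zip(trace, trace[1:], trace[2:]):
            if mid == t and t1 in model_inputs and t2 in model_outputs:
                pairs.add((t1, t2))

    # Rule 1: subsequences with the same predecessor form one combination.
    by_input = {}
    for t1, t2 in sorted(pairs):
        by_input.setdefault(t1, set()).add(t2)

    # Rule 2 (assumed reading): combinations with identical outputs are merged.
    by_outputs = {}
    for t1, outs in by_input.items():
        by_outputs.setdefault(frozenset(outs), set()).add(t1)
    combos = [Combination(set(ins), set(outs)) for outs, ins in by_outputs.items()]

    # Rule 3: record the elements each combination shares with any other one.
    for i, c in enumerate(combos):
        for other in combos[:i] + combos[i + 1:]:
            c.shared_inputs |= c.inputs & other.inputs
            c.shared_outputs |= c.outputs & other.outputs
    return combos


# Toy example (illustrative only).
log = [["A", "B", "C"], ["D", "B", "C"], ["E", "B", "F"], ["E", "B", "C"]]
for c in calculate_combinations(log, "B", {"A", "D", "E"}, {"C", "F"}):
    print(sorted(c.inputs), sorted(c.outputs), sorted(c.shared_outputs))
```

For this toy log the sketch yields two combinations, one with inputs {A, D} and output {C}, and one with input {E} and outputs {C, F}; both record C as a shared output.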

After creating all the possible combinations, for each combination $c$ (Alg.1:12), the algorithm creates a new task $t'$ equal to the original activity $t$ of the current model (Alg.1:13). Then, it removes from $I(t)$ all the tasks shared with $c.inputs$, but keeps the tasks that are in $c.sharedInputs$ (Alg.1:14). For the new task $t'$, in contrast, it retains only the elements of $I(t')$ that are contained in $c.inputs$ (Alg.1:15). The same process is applied to the outputs of both $t$ and $t'$, using $c.outputs$ and $c.sharedOutputs$ (Alg.1:16-17). If both the inputs and outputs of these tasks are not empty (Alg.1:18), they are included in $ind_0$ (Alg.1:19). Otherwise, the model goes back to its previous state and the algorithm tries a new combination. If the new task is included, the model is repaired (Alg.1:20) and the unused arcs are removed (Alg.1:21). To evaluate the models (Alg.1:22), we based the quality of a solution on three criteria: fitness replay, precision and simplicity. To measure these criteria we used the hierarchical metric defined in [10]. If the new model is better, the best individual $ind_{best}$ is replaced with $ind_0$ (Alg.1:26). Otherwise, the model goes back to its previous state and the process is repeated with a new combination.
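The split of one task according to one combination (Alg.1:13-18) can be sketched as below. The sketch assumes the model is stored as two dictionaries mapping each task to its input and output sets; this representation and the names `Combination`, `split_task` and `new_id` are hypothetical. The repair step and the removal of unused arcs (Alg.1:20-21) are not covered.

```python
from typing import Dict, NamedTuple, Optional, Set, Tuple

Arcs = Dict[str, Set[str]]


class Combination(NamedTuple):
    inputs: Set[str]
    outputs: Set[str]
    shared_inputs: Set[str]
    shared_outputs: Set[str]


def split_task(inputs: Arcs, outputs: Arcs, t: str, c: Combination,
               new_id: str) -> Optional[Tuple[Arcs, Arcs]]:
    """Split t into t and a copy new_id; return None if the split is rejected."""
    new_in = {k: set(v) for k, v in inputs.items()}
    new_out = {k: set(v) for k, v in outputs.items()}
    # Alg.1:14: remove from I(t) the inputs covered by c, keeping the shared ones.
    new_in[t] = inputs[t] - (c.inputs - c.shared_inputs)
    # Alg.1:15: the copy keeps only the inputs contained in c.inputs.
    new_in[new_id] = inputs[t] & c.inputs
    # Alg.1:16-17: same treatment for the outputs.
    new_out[t] = outputs[t] - (c.outputs - c.shared_outputs)
    new_out[new_id] = outputs[t] & c.outputs
    # Alg.1:18: both tasks need non-empty inputs and outputs.
    if not all((new_in[t], new_in[new_id], new_out[t], new_out[new_id])):
        return None  # the caller keeps the previous model
    return new_in, new_out


# Toy example (illustrative only).
inputs = {"B": {"A", "D", "E"}}
outputs = {"B": {"C", "F"}}
combo = Combination({"E"}, {"C", "F"}, set(), {"C"})
print(split_task(inputs, outputs, "B", combo, "B'"))
```

Returning copies rather than mutating the model keeps the "go back to the previous state" step trivial: a rejected or worse candidate is simply discarded.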

The main drawback of the heuristic used to detect the possible duplicate tasks of the log (Alg.1:3 [5]) is that it does not cover the whole search space, particularly for tasks involved in a length-two loop, as such a situation breaks the rule that two tasks with the same label cannot share the same input and output dependencies. To solve this, the whole process has to be iterative: when, for a task $t$, $\max(\min(|t >_L t'|, |t' >_L t|), 1)$ is greater than 1, i.e., $t$ is detected as a duplicate activity, the upper bound for all the tasks $t'$ that directly follow $t$ must be updated, because these tasks will now have multiple tasks with the same label as input. Hence, if a task $t$ is correctly duplicated in the model (Alg.1:26), we add the tasks that directly follow $t$ (and that were not detected as possible duplicate tasks in the first step) to potentialDuplicatesL2L (Alg.1:27). Therefore, the last step of the algorithm (Alg.1:31) involves a new execution of the function localSearch (Alg.1:7), but with potentialDuplicatesL2L instead of potentialDuplicates. In this second and final execution, the subsequences are obtained from the process model, whereas in the first execution they were extracted from the log. To this end, the algorithm parses the solution, checking which of the activities with the same label $t' \in I(t)$ were executed just before $t$ and which activities $t'' \in O(t)$ were executed after $t$. Finally, it creates the combinations based on this information.
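The update of potentialDuplicatesL2L (Alg.1:27) can be sketched as follows, again assuming a log given as a list of traces. The function name `followers_to_revisit` is hypothetical, and the sketch covers only the collection of candidates, not the second localSearch run over the model.

```python
def followers_to_revisit(log, duplicated_task, potential_duplicates):
    """Tasks that directly follow a duplicated task and were not candidates before."""
    followers = set()
    for trace in log:
        for a, b in zip(trace, trace[1:]):
            if a == duplicated_task and b not in potential_duplicates:
                followers.add(b)
    return followers


# Toy example (illustrative only): B was duplicated in the first pass.
log = [
    ["A", "B", "C", "E"],
    ["A", "D", "B", "F", "E"],
]
potential_duplicates_l2l = set()
potential_duplicates_l2l |= followers_to_revisit(log, "B", {"B"})
print(sorted(potential_duplicates_l2l))  # ['C', 'F']
```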

3 Experimentation

The presented approach has been validated with several synthetic logs from [5,7]. We used ProDiGen [10] and HM [11] over this set of logs to retrieve the initial solutions. The quality of the models was measured with three metrics: fitness replay (C) [8], precision (P) [1] and simplicity (S) [9]. Table 1 shows the results obtained before applying the presented approach (the raw solutions mined with ProDiGen and HM) and after the local search. It also shows which algorithm retrieves better results for each metric (highlighted in grey) and which solutions are equal to the original model (highlighted in italics).

After applying our approach, the proposed local search was able to enhance the results in 35 of the 36 cases. More specifically, the algorithm was able to i) significantly improve the precision, and ii) reduce the complexity of the different models by splitting the behavior of the overly connected nodes. Furthermore, our approach was able to retrieve the original model in 25 out of 36 cases.

4 Conclusions

We have presented an approach to tackle duplicate tasks in an already discovered model. Our proposal takes as its starting point a model without duplicate tasks and its respective log and, based on the local information of the log and the causal dependencies of the input mined model, improves the comprehensibility of the solution. The presented approach has been validated with 36 different models with duplicate tasks. The results show that this local search is able to detect all the potential duplicate tasks in the log and to enhance the comprehensibility of the final model by improving its fitness replay, precision and simplicity.

Acknowledgments

This work was supported by the Spanish Ministry of Economy and Competitiveness under project TIN2014-56633-C3-1-R, and the Galician Ministry of Education under the projects EM2014/012 and CN2012/151.

References

1. Adriansyah, A., Muñoz-Gama, J., Carmona, J., van Dongen, B.F., van der Aalst, W.M.P.: Alignment based precision checking. In: BPM. (2012) 137-149

2. Broucke, S.K.V.: Advances in Process Mining. PhD thesis

3. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Quality dimensions in process discovery: The importance of fitness, precision, generalization and simplicity. International Journal of Cooperative Information Systems 23(1) (2014)

4. Carmona, J., Cortadella, J., Kishinevsky, M.: A region-based algorithm for discovering Petri nets from event logs. In: BPM. Springer (2008) 358-373

5. de Medeiros, A.: Genetic Process Mining. PhD thesis, TU/e (2006)

6. Goedertier, S., Martens, D., Vanthienen, J., Baesens, B.: Robust process discovery with artificial negative events. The Journal of Machine Learning Research 10 (2009) 1305-1340

7. Li, J., Liu, D., Yang, B.: Process mining: Extending α-algorithm to mine duplicate tasks in process logs. In: Advances in Web and Network Technologies, and Information Management. (2007) 396-407

8. Rozinat, A., van der Aalst, W.M.P.: Conformance checking of processes based on monitoring real behavior. Information Systems 33(1) (2008) 64-95

9. Sánchez-González, L., García, F., Mendling, J., Ruiz, F., Piattini, M.: Prediction of business process model quality based on structural metrics. In: Conceptual Modeling - ER 2010. Volume 6412. (2010) 458-463

10. Vázquez-Barreiros, B., Mucientes, M., Lama, M.: ProDiGen: Mining complete, precise and minimal structure process models with a genetic algorithm. Information Sciences 294 (2015) 315-333

11. Weijters, A., van der Aalst, W.M.P., de Medeiros, A.: Process mining with the heuristics miner-algorithm. Technische Universiteit Eindhoven 166 (2006)