A Novel Fitness Improvement Method for Mined Business Process Models Yaguang Sun and Bernhard Bauer Software Methodologies for Distributed Systems, University of Augsburg, Germany {yaguang.sun,bernhard.bauer}@informatik.uni-augsburg.de Abstract. Business process model discovery (BPMD) is a significant research topic in the business process mining area. The present BPMD techniques encounter great challenges while dealing with real-life event logs that contain complex process behaviors. As a result, non-fitting pro- cess models might be obtained. In this paper, we propose a new mecha- nism for locating and handling the process behaviors recorded in event logs which cannot be expressed by the utilised model discovery algo- rithms. Keywords: Business Process Mining, Business Process Model Discov- ery, Process Model Fitness Improvement 1 Introduction As one of the most important research directions in business process mining area, the present business process model discovery (BPMD) techniques meet great challenges while trying to mine process models from real-life event logs that usually stem from the business processes implemented in highly flexible environments, e.g., healthcare, customer relationship management (CRM) and product development [1, 3]. Such real-life logs often contain complex process behaviors and utilising existing BPMD techniques to mine these logs might gen- erate non-fitting process models. The mining algorithm enhancement-based strategy has been put forward in the literature for solving the problem of low-fitness process models mined from real-life event logs. Such a strategy aims at improving the expressive ability of existing BPMD algorithms or developing new algorithms that are able to model more complex workflow patterns. In this paper, we develop a new method which inherits the basic idea of mining algorithm enhancement-based strategy for assisting the present BPMD techniques in generating high-fitness process models from real-life event logs. In our method, the fitness improvement problem for the inaccurately mined process models is surveyed from a new perspective and redefined as an issue of locating the inexpressible process behaviors recorded in event logs and then transforming them into expressible behaviors for the utilised BPMD algorithms. The structure of the main contents of this paper is organised as: Copyright c by the paper’s authors. Copying permitted only for private and academic purposes. In: S. España, M. Ivanović, M. Savić (eds.): Proceedings of the CAiSE’16 Forum at the 28th International Conference on Advanced Information Systems Engineering, Ljubljana, Slovenia, 13-17.6.2016, published at http://ceur-ws.org CEUR ceur-ws.org Workshop ISSN 1613-0073 Proceedings 26 Yaguang Sun and Bernhard Bauer - In Section 2, a novel method named IDC is put forward which is able to help the employed BPMD techniques mine more fitting process models from real-life logs. - In Section 3, we carry out the proposed method IDC on an example event log to test the correctness and effectiveness of it. - In Section 4, the related work from the academia is reviewed. - In Section 5, a conclusion together with the future problems that we are going to solve are elaborated. 2 Approach Design We firstly introduce a method for detecting and structuring the process be- haviors from real-life event logs in Section 2.1. Afterwards, based on the method presented in Section 2.1 a heuristic technique named IDC for helping the utilised BPMD algorithms mine more fitting process models is put forward in Section 2.2. 2.1 Construct Process Behavior Space (PBS) Business process behaviors from real-life event logs cannot be analysed before they are extracted and structured in an appropriate form. We solve the PBS construction problem by providing the following two concepts: Definition 1. (behavior-related activity) Let SA be the set of activities for event log E. Let E represents a direct relation between any two activities from SA. For instance, for two activities a, b ∈ SA, a E b = true if activity a is directly followed by b at least once in E. Symbol ⇒E represents a behavior-based relation between any two activities from SA. For two activities c, d ∈ SA, c ⇒E d = true if c E d = true or d E c = true and d is also called a behavior-related activity of c. Definition 2. (behavior-related sub-trace) Let SA be the set of activities and ST be the multiset of traces for event log E. Let t ∈ ST be a trace from E, st v t be a sub-trace of t and SAst be the set of activities for st. Given an activity a ∈ SA, st is a behavior-related sub-trace of a if ∀b ∈ SAst ∧ b 6= a such that a ⇒E b and ∃c ∈ SAst such that c = a. Take a simple event log E1 = {< l1 , m1 , n1 , x1 , y1 , z1 >15 , < l1 , n1 , m1 , y1 , z1 >15 , < l1 , m1 , x1 >3 , < l1 , n1 , m1 , z1 >5 } (utilised for the rest of Section 2) as an exam- ple. According to Definition 1, the activity n1 from E1 has three behavior-related activities which are activity l1 , m1 and x1 . Then, according to Definition 2, a set of behavior-related sub-traces {< l1 , m1 , n1 , x1 >15 , < l1 , n1 , m1 >20 } for activity n1 can be extracted from E1 . Let Ξ : (ST + , SA+ , SA) → SBT be a function for finding all of the behavior-related sub-traces for a specific activity from a given trace, where ST + represents the set of traces, SA+ represents the set of all possible sets of behavior-related activities, SA represents the set of activities and SBT stands for the set of all possible multisets of behavior-related sub-traces. The PBS construction procedure is described in Algorithm 1. A Novel Fitness Improvement Method for Mined Business Process Models 27 Algorithm 1 Construct the PBS for a specific event log E (CPBS) Input: an event log E, the set of activities SA for E Let BA be a set of behavior-related activities. Let BT be a set of behavior-related sub-traces. Let PBS be a set of sets of behavior-related sub-traces. 1: BA ← null, BT ← null, PBS ← null 2: for each activity a ∈ SA do 3: for each trace t1 ∈ E do 4: for each activity b ∈ t1 do 5: if b ∈ / BA && a ⇒E b = true then 6: BA ← BA ∪ b 7: end if 8: end for 9: end for 10: for each trace t2 ∈ E do 11: BT ← BT ∪ Ξ(t2 , BA, a) 12: end for 13: PBS ← PBS ∪ BT 14: BA ← null 15: BT ← null 16: end for Output: a set of sets of behavior-related sub-traces PBS 2.2 Detection and Conversion of Inexpressible Process Behaviors In this subsection, we firstly define a new concept called activity environment item. Afterwards, a new algorithm named IDC that is able to assist in detect- ing and converting the inexpressible process behaviors related to one particular activity is presented. Definition 3. (activity environment item) Let SA be the set of activities from event log E, activity a, b and c are three activities from SA, the tuple (b, c) is an environment item of activity a if ∃t ∈ E such that < b, < a . . . >, c >v t, where t stands for a trace from E and < a . . . > represents a sub-trace that only consists of activity a. Take the event log E1 mentioned in Section 2.1 as an example. According to Definition 3, the activity n1 ∈ E1 has two types of environment items which are (m1 , x1 ) and (l1 , m1 ). Activity l1 ∈ E1 also has two types of environment items that are (null, m1 ) (the value null indicates that l1 appears as a starting activity) and (null, n1 ). Let Θ : (SA, SBT ) → SEI + be an environment item searching function for a specific activity, where SA represents the set of activities, SBT represents the set of all multisets of behavior-related sub-traces and SEI + stands for the set of all sets of environment items. Let Φ : (SA, SEI, SBT, SNA) → SBT be an activity conversion function, where SEI stands for the set of environment items and SNA represents the set of activity names. Given an activity, function Φ is capable of converting this activity into a new activity (with a new name) 28 Yaguang Sun and Bernhard Bauer Algorithm 2 Inexpressible behaviors detection and conversion (IDC) Input: an activity a from event log E, the PBS for log E, a model fitness threshold ε, a model fitness improvement threshold η Let BSTa and BSTt be two sets of behavior-related sub-traces. Let SEIa and SEIr be two sets of environment items. Let f1 and f2 be two variables of type Double. Let i be a variable of type Integer. Let eit be an environment item. 1: f1 ← 0, f2 ← 0, i ← 0 2: eit ← null, BSTa ← null, BSTt ← null, SEIa ← null, SEIr ← null 3: BSTa ← PBS[a] # collect the set of behavior-related sub-traces for a from PBS 4: SEIa ← Θ(a, BSTa ) # find the set of environment items for activity a 5: f1 ← Σ(Ω(BSTa ), BSTa ) # calculate the fitness of the model mined from BSTa 6: while (|SEIa | > 0 && f1 < ε) do 7: for each environment item eia ∈ SEIa do 8: BSTt ← Φ(a, eia , BSTa , ai ) # transform a into ai under environment eia 9: if (Σ(Ω(BSTt ), BSTt ) > f2 ) then 10: f2 ← Σ(Ω(BSTt ), BSTt ) 11: eit ← eia 12: end if 13: end for 14: if (f2 − f1 > η) then 15: f1 ← f2 16: BSTa ← Φ(a, eit , BSTa , ai ) 17: SEIr ← SEIr ∪ eit 18: remove eit from SEIa 19: i←i+1 20: else 21: break 22: end if 23: end while Output: a set of environment items SEIr for activity a. under a certain environment in all of the behavior-related sub-traces of the given activity. For the activity n1 ∈ E1 , given an environment item ei = (m1 , x1 ) of n1 , the set of behavior-related sub-traces BSTn1 = {< l1 , m1 , n1 , x1 >15 , < l1 , n1 , m1 >20 } of n1 and a new activity name n1 0 , Φ(n1 , ei, BSTn1 , n1 0 ) = {< l1 , m1 , n1 0 , x1 >15 , < l1 , n1 , m1 >20 }. Let SST be the set of all possible multisets of traces, Ω : SST → SM be a workflow discovery algorithm, where SM is the set of process models. Σ : (SM, SST ) → SV represents a process model fitness evaluation schema with an input of a process model and its relevant set of traces and an output of an assessed value from SV (the set of all possible values output by Σ). The details of IDC is described in Algorithm 2. For a specific activity a from log E, the algorithm IDC firstly obtains the set of behavior-related sub-traces BSTa for a through step 3, the set of environ- ment items SEIa for a by step 4 and the fitness of the behavior-related model (by mining the generated set of behavior-related sub-traces BSTa ) of a by step A Novel Fitness Improvement Method for Mined Business Process Models 29 5. Afterwards, steps 6 − 13 try to find the environment item of a under which converting a into ai in the set of behavior-related sub-traces of a will help gen- erate a behavior-related model (by mining BSTt ) with maximal fitness and the discovered environment item is put in eit and the value of the maximal fitness is stored by f2 . Then, it is checked if the fitness improvement acquired is larger than or equal to a threshold η by step 14. If this is the case, the environment item is put in set SEIr (step 17) which will be finally output by IDC. The orig- inal set of behavior-related sub-traces is updated (step 16) by transforming the relevant activity under the found environment which is removed from SEIa later by step 18 and IDC continues to search for the next environment item if SEIa is not empty and the fitness of the new behavior-related model is less than a target value ε. Otherwise, the technique IDC stops. 3 Evaluation In our experiment, we utilise the Heuristics Miner (HM) [7] from ProM 61 for mining process models from event logs. At present, several application instances [9–14] of the mining algorithm enhancement-based strategy (introduced in Section 1) have been put forward in the literature. These proposed methods are able to help mine high-fitness process models expressed by Petri net [15]. However, few efforts have been made to help the HM mine high-fitness models. According to [16], the HM which expresses process model by Heuristics net is one of the most popular BPMD tools in the ProM framework and this is why we adapt our technique to HM for the evaluation. Adapting our technique to other BPMD algorithms will be a future job. Furthermore, we use the ICS fitness [8] for assessing the accuracy of the mined models because it has a computationally efficient calculative process. We tested the correctness and effectiveness of our technique by utilising an example event log Ee . As shown in Figure 1, log Ee has nine activities and 523 traces and the process model generated by implementing the HM on Ee has a low ICS fitness value which is 0.7748. In the process of our experiment, we firstly utilise the algorithm CPBS (in- troduced in Section 2.1) to build the PBS for log Ee . Afterwards, the algorithm IDC (introduced in Section 2.2) is used to analyse the built PBS for discovering the activities together with their relevant environment items that are the main factors to turn out the inexpressible process behaviors for HM in log Ee . The model fitness threshold ε is set to 1 and the model fitness improvement threshold η is set to 0.3 for IDC. According to Figure 2, the method IDC finds that the activity F under environment (D, E) and (H, I) in log Ee is the key factor for generating the process behaviors that cannot be expressed by HM. With this discovery from IDC, we replace the activity F under environment (H, I) by a new activity named F0 and the activity F under environment (D, E) by using F1 in log Ee so 1 http://www.promtools.org. 30 Yaguang Sun and Bernhard Bauer An Example Event Log : Ee Process Model Generated by Heuristics Miner 1 64 2 D 127 A B C E H I G 1 F 78 247 ICS Fitness : 0.7748 3 Fig. 1. The example event log Ee and the business process model mined from Ee by HM. that a new log Ee∗ (in Figure 2) can be generated. As shown in Figure 2, a more fitting process model (with a fitness value 0.9987) is output by HM executed on the newly generated log Ee∗ . This positive evaluation results indicate that our technique IDC successfully locate the inexpressible process behaviors from Ee for HM. Newly Generated Log : Ee* Process Model Generated by Heuristics Miner 1 64 E H I 2 F0 127 G D F1 1 F A B C 78 247 ICS Fitness : 0.9987 3 Fig. 2. The newly generated event log Ee∗ and the business process model mined from Ee∗ by HM. 4 Related Work In the literature, several effective techniques stemming from the mining algo- rithm enhancement-based strategy have been devised. Most of these presented techniques are developed to mine fitting process models expressed by Petri net. A Novel Fitness Improvement Method for Mined Business Process Models 31 The authors in [10] put forward the State-based Region Miner which is able to generate different categories of place-irredundant Petri nets from the transition systems and assure the fitness of the generated Petri nets. In [9], the authors point out that the BPMD approaches partly based on heuristic methods cannot guarantee an accurate mining results. Then, they develop a series of methods for adapting the technique of Petri net synthesis to the BPMD area so that the advantage from the Petri net synthesis can be utilised to mine high fitness process models from real-life event logs. The authors in [11] propose the ILP Miner based on the idea that places can restrict the possible firing sequences of a Petri net. The ILP Miner assures to mine highly fitting Petri nets whose sizes are independent of the number of events in the event logs. The authors in [12, 13] put forward the Inductive Miner which guarantees to return a fitting model that is also sound. The process model repair technique is presented by the authors in [14] for generating a high-fitness process model. Such a technique tries to divide the original log into several sublogs with non-fitting subtraces. For every sublog, a sub-model is obtained and added to the original model at the appropriate place. 5 Conclusion In this paper, we mainly proposed a technique IDC for helping detect and con- vert the inexpressible process behaviors for the utilised BPMD techniques from particular real-life event logs. Through the evaluation results from Section 3, we demonstrated the effectiveness of our technique. A fitting model might still be complex. Our future job will be focused on designing a new trace clustering technique [2–6] that contains two steps where the first step focuses on building sub-models with lower complexity and the second step focuses on improving the accuracy of the mined sub-models by using the technique proposed in this paper. In the meantime, we will also validate our method on some other real-life cases. References 1. W.M.P. van der Aalst. Process Mining: Discovery, Conformance and Enhancement of Business Processes. Springer, Berlin, 2011. 2. M. Song, C.W. Gnther and W.M.P. van der Aalst. Trace Clustering in Process Mining. In Business Process Management Workshops 2008, volume 17 of Lecture Notes in Business Information Processing, pages 109-120. Springer, Berlin. 3. J.D. Weerdt, S. vanden Broucke, J. Vanthienen and B. Baesens. Active Trace Clustering for Improved Process Discovery. IEEE Transactions on Knowledge and Data Engineering, 25(12):2708-2720, 2013. 4. Y.G. Sun and B. Bauer. A Novel Top-Down Approach for Clustering Traces. In Advanced Information Systems Engineering (CAiSE 2015), volume 9097 of Lecture Notes in Computer Science, pages 331-345. 32 Yaguang Sun and Bernhard Bauer 5. C.C. Ekanayake, M. Dumas, L. Garcia-Banuelos and M. La Rosa. Slice, Mine and Dice: Complexity-Aware Automated Discovery of Business Process Models. In Business Process Management (BPM 2013), volume 8094 of Lecture Notes in Computer Science, pages 49-64. 6. L. Garcia, M. Dumas, M.L. Rosa, J.D. Weerdt and C.C. Ekanayake. Controlled Automated Discovery of Collections of Business Process Models. Inf. Syst., 46:85- 101, 2014. 7. A.J.M.M. Weijters, W.M.P. van der Aalst and A.K. Alves de Medeiros. Process Mining with the Heuristics Algorithm. TU Eindhoven, BETA Working Paper Series 166, 2006. 8. A.A. de Medeiros. Genetic Process Mining. PhD. thesis, Eindhoven University of Technology (2006). 9. R. Bergenthum, J. Desel, R. Lorenz and S. Mauser. Process Mining Based On Regions of Languages. In Business Process Management (BPM 2007), volume 4714 of Lecture Notes in Computer Science, pages 375-383. 10. J. Cortadella, M. Kishinevsky, L. Lavagno and A. Yakovlev. Deriving Petri Nets from Finite Transition Systems. IEEE Transactions on Computers, 47(8):859-882, 1998. 11. J. van der Werf, B. van Dongen, C. Hurkens and A. Serebrenik. Process Discovery Using Integer Linear Programming. Applications and Theory of Petri Nets, pages 368-387, 2008. 12. S.J.J. Leemans, D. Fahland and W.M.P. van der Aalst. Discovering Block- Structured Process Models from Event Logs-A Constructive Approach. In: Petri Nets, volume 7927 of Lecture Notes in Computer Science, pages 311-329, 2013. 13. S.J.J. Leemans, D. Fahland and W.M.P. van der Aalst. Discovering Block- Structured Process Models from Event Logs Containing Infrequent Behaviour. In Business Process Management Workshops 2013, volume 171 of Lecture Notes in Business Information Processing, pages 66-78. 14. D. Fahland and W.M.P. van der Aalst. Repairing Process Models to Reflect Reality. In Business Process Management (BPM 2012), volume 7481 of Lecture Notes in Computer Science, pages 229-245. 15. T. Murata. Petri Nets: Properties, Analysis and Applications. Proc. of the IEEE, 77(4):541-580, 1989. 16. J. Claes and G. Poels. Process Mining and the ProM Framework: An Exploratory Survey. In Business Process Management Workshops 2012, volume 132 of Lecture Notes in Business Information Processing, pages 187-198.