=Paper=
{{Paper
|id=Vol-2016/paper2
|storemode=property
|title=On the Use of Hierarchical Subtrace Mining for Efficient Local Process Model Mining
|pdfUrl=https://ceur-ws.org/Vol-2016/paper2.pdf
|volume=Vol-2016
|authors=Niek Tax,Laura Genga,Nicola Zannone
|dblpUrl=https://dblp.org/rec/conf/simpda/TaxGZ17
}}
==On the Use of Hierarchical Subtrace Mining for Efficient Local Process Model Mining==
On the Use of Hierarchical Subtrace Mining for Efficient Local Process Model Mining? Niek Tax, Laura Genga, and Nicola Zannone Eindhoven University of Technology {n.tax, l.genga, n.zannone}@tue.nl Abstract. Mining local patterns of process behavior is a vital tool for the analysis of event data that originates from flexible processes, for which it is generally not possible to describe the behavior of the process in a single process model without overgener- alizing the behavior allowed by the process. Several techniques for mining such local patterns have been developed throughout the years, including Local Process Model (LPM) mining and the hierarchical mining of frequent subtraces (i.e., subprocesses). These two techniques can be considered to be orthogonal, i.e., they provide different types of insights on the behavior observed in an event log. As a consequence, it is often useful to apply both techniques to the data. However, both techniques can be computa- tionally intensive, hindering data analysis. In this work, we explore how the output of a subtrace mining approach can be used to mine LPMs more efficiently. We show on a collection of real-life event logs that exploiting the ordering constraints extracted from subtraces lowers the computation time needed for LPM mining compared to state-of- the-art techniques, while at the same time mining higher quality LPMs. Additionally, by mining LPMs from subtraces, we can obtain a more structured and meaningful rep- resentation of subprocesses allowing for classic process-flow constructs such as paral- lel ordering, choices, and loops, besides the precedence relations shown by subtraces. 1 Introduction Process Mining [1] has emerged as a new research area that aims at business process improvement through the analysis of event logs recorded by information systems. Such event logs capture the different steps (events) that are recorded for each instance of the process (case), and record for each of those steps what was done, by whom, for whom, where, when, etc. Process discovery, one of the main area within the process mining field, is concerned with the discovery of an interpretable model from event logs able to accurately describe the process. The process models obtained give insight into what is happening in the process and can be used as a starting point for follow-up analysis, e.g., bottleneck analysis [23], and checking compliance with rules and regulations [24]. Many process discovery algorithms have been developed throughout the years, e.g., [3, 13, 17, 18, 20]. In recent years, the scope of process discovery has broadened to new application domains, including software analysis and human behavior analysis. In some of those new application domains, including human behavior analysis, logs recording observed behavior have a high degree of variability. Often, this log variability has a significant impact on the ? This work has been partially funded by the ITEA2 project M2MGrids (No. 13011). 8 generation of insightful models. In particular, traditional process discovery techniques tend to generate process models in which any sequence of events is allowed, often referred to as the flower model. Several techniques aim to address this challenge of analyzing highly variable event logs: Declarative process discovery (e.g., [20, 25]) mines binary relations between activities of the process. Local Process Model (LPM) mining (e.g., [27, 28]) mines a collection of process mod- els instead of a single model, where each process model aims to model only a subset of the behavior. Subtrace mining (e.g., [4, 10, 16]) mines subtraces that represent relevant sequential portions of process executions (i.e., subprocesses). Some techniques (e.g., [10]) have investigated the mining of hierarchical subtraces. Those techniques first infer the most relevant subtraces from the event log, and then they arrange subtraces in a hierarchical structure, according to inclusion relationships. Our primary focus is on subtrace mining and LPM mining. These techniques share similar goals, i.e., the mining of relevant process execution patterns. However, they provide different insights on the process and have their advantages and disadvantages. Subtrace mining techniques derive the set of activities that are frequently performed together, also describing in which order they are typically performed. Hierarchical subtrace mining enables the analysis of process behaviors at different levels of abstraction. Here, subtraces at the top of the hierarchy represent the most general process behaviors, i.e., the core subtraces starting from which all the other, more detailed, subtraces can be obtained by adding one or more activities. Most subtrace mining approaches, however, are only able to derive sequential subprocesses. A more structured output, involving classic process flow constructs, usually provides more meaningful insights of the process behaviors. An exception is represented by the approach in [10], which however requires either to have a priori knowledge on concurrency or to be able to infer concurrency from event logs, for instance by applying process discovery techniques. As discussed above, mining accurate process models, including concurrent relations, is particularly challenging for low-structured and highly variable processes. Moreover, even when concurrency can be derived, other process complex behaviors, like choice constructs or loops, are not considered by subtrace mining techniques. Local models obtained with LPM mining can model portions of process behaviors, including sequential behaviors, concurrency, loops and choice construct. However, mining LPM patterns is computationally expensive, or even infeasible, for event logs with many activities. In practice, computational problems can already arise at seventeen activities [27]. Therefore, a set of heuristics have been proposed in [27] to speed up the mining process. These heuristics aim to discover subsets of process activities (called projections) that are strongly related by applying the LPM miner on each projection individually and aggregating the results by taking the union of the obtained local models. The downside of these heuristics is the loss of the formal guarantee that all process models that meet a certain support threshold will be found. The quality of the results is determined by the quality of the projections. In this work, we explore the combined use of LPM mining and hierarchical subtrace mining to mine local processes. More precisely, we investigate the application of the subtraces inferred by the technique proposed in [10] as input for LPM mining, i.e., as 9 projections for the LPM mining. In addition, we can exploit the ordering defined by subtraces on the execution of activities to define ordering constraints on the local process models. We conjecture that using activities from these subtraces as projections and ordering constraints can speed up the LPM mining procedure. Moreover, applying LPM mining on the output of subtrace mining makes it possible to infer rich control-flow relations between activities in the mined subtraces. Hierarchical subtrace mining can mine hierarchical relations between patterns, which is computationally infeasible with LPMs [22]. Therefore, by combining subtrace mining and LPM mining, an analyst can explore process behaviors that are modeled in terms of classic process constructs at different levels of abstraction, which would be otherwise difficult to obtain by LPM and subtraces mining considered separately. To assess the effectiveness of our approach, we performed an evaluation using three real-life event logs. The results show how the use of subtraces to infer projections and ordering constraints for LPM mining actually leads to significant improvements in terms of both speed of the mining and quality of the obtained local processes. This paper is organized as follows. Section 2 introduces notation and basic concepts that are used throughout the paper. Section 3 presents LPM and hierarchical subtrace mining. Section 4 introduces a projection method and a constraint generation technique for local process model mining that is based on subtrace mining results. In Section 5 we evaluate the speedup obtained with the projections and constraints from subtraces and the quality of the obtained results on a collection of real-life event logs. Finally, Section 6 discusses related work, and Section 7 concludes the paper. 2 Event Data and Process Models Process models describe how processes should be carried out. A process model notation that is commonly used in process mining is process tree [6]. A process tree is a tree structure where leaf nodes represent process activities. Non-leaf nodes represent operators that specify the allowed behavior over the activity nodes. Allowed operator nodes are the sequence operator (→), which indicates that the first child is executed before the second, the exclusive choice operator (×), which indicates that exactly one of the children can be executed, the concurrency operator (∧), which indicates that every child will be executed but allows for any ordering, and the loop operator (), which has one child node and allows for repeated execution of this node. We formally define process trees recursively. Let Σ be the set of all process activities, OP = {→, ×, ∧, } a set of operators and symbol τ ∈ / Σ denote silent activities. We define a process tree pt as follows: – a ∈ Σ ∪ {τ } is a process tree M ; – let {M1 , M2 , . . . , Mn } be a set of process trees. Then ⊕(M1 , M2 , . . . , Mn ) with ⊕ ∈ OP is a process tree. Hereafter, L(M ) denotes the language of a process tree M , i.e., the set of activity ex- ecution paths allowed by the model. Fig. 1d shows an example process tree M4 , with L(M4 )={ha, b, ci, ha, c, bi, hd, b, ci, hd, c, bi}. Informally, it indicates that either activity a or d is executed first, followed by the execution of activities b and c in any order. Process discovery aims to mine a process model from past process executions. An event e is the actual recording of the occurrence of an activity in Σ. A trace σ is a 10 → → → a ∧ × ∧ a a b b c a d b c (a) M1 (b) M2 (c) M3 (d) M4 Fig. 1: An initial LPM (M1 ) and three LPMs built from successive expansions. ∗ sequence of events, i.e., σ = he1 , e2 , . . . , en i∈Σ ∗ . An event log L∈NΣ is a finite multiset of traces. For example, event log L = [ha, b, ci2 , hb, a, ci3 ] consists of two occurrences of trace ha, b, ci and three occurrences of trace hb, a, ci. LX represents the projection of log L on a subset of the activities X⊆Σ, e.g., L{b,c} = [hb, ci5 ]. For an event log L, L̃={σ∈Σ ∗ |L(σ)>0} is the trace set of L. For example, for event log L=[ha, b, ci2 , hb, a, ci3 ], L̃={ha, b, cihb, a, ci}. #(σ, L) denotes the frequency of sequence σ ∈ Σ ∗ as a subtrace within log L, e.g., #(ha, bi, [ha, b, ci2 , ha, b, di3 ]) = 5. 3 Subprocess Mining In this section, we provide an overview of LPM mining and hierarchial subtrace mining. 3.1 Local Process Models Local Process Models (LPM) [28] are process models that describe frequent but partial behaviors seen in the event log, i.e., they model subsets of activities of the process. In this work, we use the process tree notation to represent LPMs. A technique to generate a ranked collection of LPMs through iterative expansion of candidate process trees is proposed in [28]. This technique involves four steps: 1) the generation of an initial (candidate) set of process trees, where a process tree is created for each single activity of the projection; 2) the evaluation phase, where process tree quality is assessed by a set of tailored metrics; 3) the selection phase, where process trees that scored poorly at the previous step are removed from the candidate set; 4) the expansion phase, where candidates selected at the previous step are expanded by replacing an activity node by an operator node, whose children are the replaced activity and another activity of the process. These steps are repeated until no candidate in the candidate set has the required support. The expansion procedure consists of the replacement of a leaf activity node a in the process tree by an operator node (i.e., →, ×, ∧ or ), where one of the child nodes is the replaced activity node a and the other is a new activity node b ∈ Σ. We denote M the LPM universe, i.e., the set of all possible LPMs. An LPM M ∈M can be expanded in many ways, as it can be extended by replacing any one of its activity nodes, expanding it with any of the operator nodes, and with a new activity node that represents any of the activities in the event log. We define Exp(M ) as the set of expansions of M , and exp max the maximum number of expansions allowed from an initial LPM, i.e., an LPM containing only one activity. Fig. 1 provides an example of the expansion procedure, starting from the initial LPM M1 of Fig. 1a. The LPM of Fig. 1a is first expanded into a larger LPM by replacing a by 11 event id activity time 1 a 15-4-2016 12:23 2 d 16-4-2016 14:38 σ =〈a,d,b,c,d,c,a,c,b,d,a,b,a 3 b 16-4-2016 14:46 4 c 16-4-2016 15:46 5 6 d c 16-4-2016 16:53 16-4-2016 16:58 σ {a,b,c} =〈a,b,c,c,a,c,b,a,b,a 7 a 16-4-2016 17:11 8 c 16-4-2016 17:45 λ1 γ1 λ2 γ2 λ3 Гσ,LPM =〈a,b,c,a,c,b 9 b 16-4-2016 18:03 10 d 17-4-2016 12:09 11 a 17-4-2016 18:24 12 b 17-4-2016 18:36 13 a 17-4-2016 18:37 (a) A trace σ of an event log L (b) Segmentation of σ on M3 Fig. 2: Example of segmentation in LPM mining. operator node →, with activity a as its left child node and b its right child node, resulting in the LPM of Fig. 1b. Note that M1 can also be expanded using any other operator or any other activity from Σ, and LPM discovery recursively explores all possible process trees that meet a support threshold by iterative expansion. In a second expansion step, activity node b of the LPM of Fig. 1b is replaced by operator node ∧, with activity b as its left child and c its right child, resulting in the LPM of Fig. 1c. Finally, activity node a of the LPM of Fig. 1c is replaced by operator node × with as left child activity a and as right child activity d, forming the LPM of Fig. 1d. In traditional LPM discovery the expansion procedure of an LPM stops when the behavior described by the LPM is not observed frequently enough in an event log L (i.e., with regard to some support threshold). To evaluate a given LPM on a given event log L, its traces σ∈L are first projected on the set of activities X in the LPM, i.e. σ 0 = σX . The projected trace σ 0 is then segmented into γ-segments, i.e., segments that fit the behavior of the LPM, and λ-segments, i.e. seg- ments that do not fit the behavior of the LPM. Specifically, σ 0 =λ1 γ1 λ2 γ2 · · · λn γn λn+1 such that γi ∈L(LPM ) and λi 6∈L(LPM ). We define Γσ,LPM to be a function that projects trace σ on the LPM activities and obtains its subsequences that fit the LPM, i.e. Γσ,LPM = γ1 γ2 . . . γn . Let our LPM M3 under evaluation be the process tree of Fig. 1c and σ the example trace shown in Fig. 2a. Function Act(LPM ) gives the set of process activities in the LPM, e.g. Act(M3 ) = {a, b, c}. The projection on the activities of the LPM gives σAct(M3 ) = ha, b, c, c, a, c, b, a, b, ai. Fig. 2b shows the segmentation of the projected trace on the LPM, leading to Γσ,LP M = ha, b, c, a, c, bi. The segmentation starts with an empty non-fitting segment λ1 , followed by a fitting segment γ1 =ha, b, ci, which completes one run through the process tree. The second event c in σ cannot be replayed on LP M , since it only allows for one c and γ1 already contains a c. This results in a non-fitting segment λ2 =hci. Segment γ2 =ha, c, bi again represents a run through process tree; the segmentation ends with non-fitting segment λ3 =ha, b, ai. We lift segmentation function Γ to event logs, ΓL,LP M ={Γσ,LPM |σ∈L}. An alignment-based [2] implementation of Γ , as well as a method to rank and select LPMs based on their support, i.e., the number of events in ΓL,LPM , is described in [28]. Local Process Models (LPMs) only contain a subset of the activities of the log and each LPM can in principle be discovered on any projection of the log containing the activities used in this LPM. Since the computational complexity of LPM mining depends 12 combinatorially on the number of activities in the log, mining LPMs on projections of the log instead of on the full log can significantly lower the running time of LPM mining. However, this results in a partial exploration of the LPM search space, potentially leading to good LPMs not being found. In principle, when activities that often follow each other are in the same projection, the search space can be constrained almost without loss in quality of the mined LPMs. Such projection sets could potentially be overlapping, which is actually desired, since interesting patterns might exist within a set of activities {a, b, c, d}, as well as within a set of activities {a, b, c, e}, and discovering on both L {a,b,c,d} and L {a,b,c,e} and merging the results is faster than discovering on the full log with Σ = {a, b, c, d, e}. Markov clustering, a well-known graph clustering algorithm, is proposed as a method to generate projection sets in [27]. Markov clustering is applied to a graph where vertices represent activities and edges represent the connectedness of two activities, using directly- follows and directly-precedesr ratios. It defines connectedness between activities a and b in #(ha,bi,L) 2 2 log L as conn(a, b, L) = #(hai,L) + #(ha,bi,L) #(hbi,L) . 3.2 Hierarchical Subtrace Mining The approach presented in [10] proposes to apply frequent subgraph mining (FSM) algorithms to derive most relevant subprocesses. To this end, each trace σ ∈ L is transformed into a directed graph g = (V, E, φ), where V is the set of nodes, each corresponding to an event in σ, E is the set of the edges, showing ordering relations between the events, and φ is a labeling function associating each node with the activity of the corresponding event. A trivial transformation consists in creating a node for each event in the trace and linking each pair of subsequent events through an edge. By doing so, every log trace is transformed into a sequence of events ordered according to their order in the log trace. Once the set of graphs is obtained, an FSM algorithm is applied to derive frequent subgraphs (i.e., subtraces). FSM algorithms usually rely on a metric to evaluate the rel- evance of the subgraphs. The work in [10] uses the SUBDUE algorithm [15] that employs Description Length (DL) to assess the relevance of subgraphs. Given a graph set G and a subgraph s, SUBDUE uses an index based on DL, hereafter denoted by ν(s, G), which DL(G) is computed as ν(s, G) = DL(s)+DL(G|s) where DL(G) is the DL of G, DL(s) is the DL of s and DL(G|s) is the DL of G compressed using s (i.e., the graph set in which each occurrence of s is replaced with a single node). By doing so, SUBDUE relates the relevance of a subgraph with its compression capability. The higher the value of ν is, the lower the DL value of the graph dataset compressed by s is. SUBDUE works iteratively. At each step, it extracts the subgraph with the highest compression capability, i.e., the subgraph corresponding to the maximum value of the ν index. This subgraph is then used to compress the graph set. The compressed graphs are presented to SUBDUE again. These steps are repeated until no more compression is possible or until a user-defined number of iterations is reached. The outcome of SUBDUE is a hierarchical structure, where mined subgraphs are ordered according to their relevance, showing the existing inclusion relationships among them. Top-level subgraphs are defined only through elements that belong to input graphs (i.e., nodes and arcs), while lower-level subgraphs additionally contain upper-level subgraphs as nodes. Descending the hierarchy, 13 γ1 a b c d e f γ2 γ3 f γ1 g γ1 k m γ2 γ4 Fig. 3: Example of the SUBDUE hierarchy → → → × × × a b c b a c b c a a b b c a b (a) Subtrace (b) Ordering constraints Fig. 4: An example of subtrace and the corresponding ordering constraints. we pass from subgraphs that are very common in the input graph set to subgraphs occurring in a few input graphs. Therefore, we can reasonably expect to be able to capture most of the process behaviors by only considering top-level subgraphs. Fig. 3 shows a simple example of the hierarchical structure mined by SUBDUE. We have two subgraphs at the top level, i.e., γ1 and γ2 , together with two subgraphs at the lower level, i.e., γ3 and γ4 . We can see that γ3 is a child of γ1 , since it involves γ1 in its definition. Similarly, γ4 is a child of both γ1 and γ2 . 4 Approach In this section, we present an approach to mine local process models by exploiting subtrace mining. In particular, our approach extends an existing LPM mining algorithm [28] by converting the set of subgraphs mined using SUBDUE into a set of projections (i.e., set of activities) and a set of ordering constraints. Projections and constraints are both used in the expansion phase of the LPM mining algorithm. Projections are used to reduce the LPM min- ing search space by reducing the number of activities that can be used for expansion, while ordering constraints prohibit expansion into LPMs that do not meet certain ordering relations. Ordering constraints represent impossible (or, at least, very unlikely) execution orders between activities of the subgraphs. Two types of constraints can be derived from subtraces: constraints on exclusive choices and constraints on sequential executions. Given a subgraph s = (V, E, φ), an ordering constraint on s is defined as a process tree M = ⊕(a, b) such that a, b ∈ Vi and ⊕ ∈ {→, ×}. Fig. 4 provides an example of a set of process trees representing ordering constraints mined from a subtrace. These constraints allow us to remove from the space state those LPMs that we can reasonably expect to turn out to be poor quality models and, hence, would be discarded in any case. It is trivial to see that we can safely remove from the state space all process trees involving choices constructs among activities belonging to the same subgraph. Indeed, the fact that these activities belong to the same subgraph implies that they tend to occur together, which, in turn, excludes that they can be properly represented by a choice construct. Similarly, since subgraphs represent the most frequent ordering executions obtained for a 14 Algorithm 1: Subgraphs-based LPM Input :event log L, set of subgraphs S Output :set of local process models LPM 1 LP M = ∅; 2 CP = ∅; 3 foreach si = (Vi , Ei , φi ) ∈ S do 4 OCi = f indOrderingConstraints(si ) 5 CP = CP ∪ {(Vi , OCi )}; 6 Checked p = ∅ 0 7 CP = ∅ 8 foreach pi = (Vi , OCi ) ∈ CP do 9 p0i = pi 10 if pi ∈ / Checked p then 11 foreach pj = (Vj , OCj ) ∈ CP do 12 if pj ∈ / Checked p ∧ j 6= i then 13 if (Vi ⊆ Vj ∨ Vj ⊆ Vi ) then 14 p0i = (Vi ∪ Vj , OCi ∪ OCj ) 15 Checked p = Checked p ∪ {pj } 16 Checked p = Checked p ∪ {pi } 17 CP 0 = CP 0 ∪ {p0i } 0 18 foreach pi ∈ CP do 19 LPM = LPM ∪ MiningLPMwithConstraints(L, pi ) 20 return LPM given set of activities, it is reasonable to not investigate those paths that explicitly contradict them when mining LPMs. For instance, the subprocess in Fig. 4a shows that in most of the cases activity a occurred before b, which, in turn, mostly occurred before c (which means that a occurred before c as well). Therefore, from this subgraph, we obtain the three sequential ordering constraints shown in Fig. 4b. Algorithm 1 formalizes the approach. Given an event log L and a set of subgraphs S, ordering constraints are derived from each subgraph by invoking findOrderingConstraints (line 4), defined in Algorithm 2. Then, the algorithm generates the set of projections and constraints CP . Each element of CP represents a projection Vi together with the corresponding ordering constraints OCi mined from subgraph si . After an element of CP has been created for each subgraph, the algorithm merges those elements whose sets of activities are identical or included among each other. Namely, it generates a new element p0 whose projection corresponds to the union of their projections, and the set of ordering constraints corresponds to the union of their sets of ordering constraints. (lines 8-17). By merging them in a single projection, we ensure that the same set of activities are considered only once by LPM mining. Combining ordering constraints can help reduce the number of process trees to explore. For example, let us assume that SUBDUE finds two subgraphs s1 , s2 involving activities a, b but reported in opposite execution order such that CPs1 = ({a, b}, {→ (b, a), ×(a, b)}) and CPs2 = ({a, b}, {→ (a, b), ×(a, b)}). By merging these elements, we obtain CP 0 = ({a, b}, {→ (b, a), → (a, b) × (a, b)}). This means that in the expansion phase of LPM 15 Algorithm 2: FindOrderingConstraints Input :subgraph s = (V, E, φ) Output :set of ordering constraints OC 1 OC = ∅; 2 foreach vi ∈ V do 3 foreach vj ∈ V do 4 Mxor = ×(vi , vj ); 5 OC = OC ∪ {Mxor }; 6 ρ(vi , vj ) = extractP ath(vi , vj ) 7 if ρ(vi , vj ) 6= ∅ then 8 Mseq =→ (vj , vi ) 9 OC = OC ∪ {Mseq }; 10 return OC mining, whenever a node involving a (b) has to be expanded adding another node b (a), only the parallel operator would be considered for the expansion. In fact, the expansions involving both the sequential operator and the choice operator are forbidden by the ordering constraints. Finally, algorithm MiningLPMwithConstraints is invoked on the event log and the mined set of constraints (line 19). This algorithm is a refined version of the procedure described in Section 3.1, able to account for constraints during the expansion phase. More precisely, every time that a leaf activity node in the process tree M to expand is replaced by an operator node, the obtained process tree Mi ∈ Exp(M ) is checked against the set of ordering constraints OCi . If there exists a constraint oci ∈ OCi such that oci is a subtree of Mi , then Mi is removed from the set of process trees that are considered for the following expansions. Algorithm 2 describes the procedure to convert a subgraph into a set of ordering constraints. For each pair of activities in the subgraph, the algorithm generates a constraint representing an exclusive choice and updates the set of ordering constraints (lines 4 and 5). Then, the algorithm checks whether those activities are linked by a path (line 6). If a path exists, the algorithm generates a sequential constraint representing a sequential execution order that violates the detected path is added to the set of constraints (lines 7-9). The algorithm returns the set of inferred ordering constraints. The hierarchical structure returned by SUBDUE can be exploited to determine which subgraphs have to be considered during the LPM mining. As explained in Section 3.2, low-levels subgraphs in the SUBDUE hierarchy correspond to detailed and often infrequent process variants. In this work, we let process analysts determine the level of detail to be used for LPM mining, i.e., the level of the hierarchy from which subgraphs can be selected. 5 Experiments In this section, we explore the effect of exploiting subtrace mining to mine Local Process Models (LPM). To this end, we performed two sets of experiments. In the first, we investigated the effects of the only use of SUBDUE projections. Namely, we used the LPM mining procedure of [27] adopting the activities involved in SUBDUE subgraphs as 16 projections, thus neglecting their ordering. In the second set of experiments, we exploited both projections and ordering constraints, using the LPM mining procedure in Section 4. We evaluated the LPM mining on two dimensions. First, we considered the reduction in the search space size, indicating the potential gain in computation time of the mining procedure. Then, we considered the quality of the output by comparing the ranking of LPMs mined using SUBDUE projections to the ranking of LPMs mined by exploring the full search space. When projections are exploited, some LPMs that are mined when exploring the full LPM search space might not be found [27]. Hence, by comparing the ranking of the models mined with/without projections, we can evaluate to what extent the use of projections affected the obtained outcome. We used the Normalized Discounted Cumulative Gain (NDCG) [7, 14], which is a widely used metrics to compare an obtained ranking with a ground truth ranking in the field of information retrieval [26]. Generally, NDCG@k is used, which only considers the top k elements of the ranking. NDCG consists of two components, Discounted Cumulative Gain (DCG) and Ideal Discounted Cumulative Gain (IDCG). DCG aggregates the relevance scores (i.e., the score obtained with respect to the quality metrics) of individual LPMs in the ranking in such a way that the graded relevance is discounted with logarithmic proportion to their position in the ranking. This results in more weight being put on the top of the ranking compared to lower parts of the ranking. Formally, DCG is 2reli −1 Pk defined as: DCG@k = i=1 log 2 (i+1) , where reli is the relevance of the LPM at position i. Normalized Discounted Cumulative Gain (NDCG) is obtained by dividing the DCG value by the DCG on the ground truth ranking (called Ideal Discounted Cumulative Gain). DCG@k Normalized Discounted Cumulative Gain (NDCG) is defined as: NDCG@k = IDCG@k . As a baseline technique, we applied the Markov-based projection discovery technique presented in [27] iteratively until all projections contain at most seven activities. We compared the search space reduction and the NDCG obtained when mining with iterative Markov-based projections with the search space reduction and NDCG obtained when mining with projections and constraints extracted from SUBDUE subgraphs. Dataset We evaluated our technique using three real-life event logs. The first event log contains execution traces from a financial loan application process at a large Dutch financial institution, commonly referred to as the BPI 2012 log [11]. This log consists of 13087 traces (loan applications) for which a total of 164506 events have been executed, divided over 23 activities. The second event log contains traces from the receipt phase of an environmental permit application process at a Dutch municipality, to which we will refer as the receipt phase WABO log [5]. The receipt phase WABO log contains 1434 traces, 8577 events, and 27 activities. The third event log contains medical care pathways of sepsis patients from a medium size hospital, to which we will refer as the SEPSIS log [21]. The SEPSIS log contains 1050 traces, 15214 events, and 16 activities. Settings We use the implementation of the iterative Markov LPM mining algorithm available in the LocalProcessModelDiscovery package1 of the ProM framework [29]. For the LPM mining approach that uses SUBDUE projections and constraints, we made an implementation available in ProM package LocalProcessModelDiscoveryWithSubdueCon- 1 https://svn.win.tue.nl/repos/prom/Packages/LocalProcessModelDiscovery/ 17 straints2 . Both for Markov-based LPM mining and SUBDUE-based LPM mining we use the standard configuration in ProM. For SUBDUE we used the implementation provided in http://ailab.wsu.edu/subdue/, varying the number of iterations. Note that, roughly speaking, a high number of SUBDUE iterations is expected to have benefits in terms of quality of LPM mining. Higher numbers of iterations lead to a higher number of process behaviors captured by SUBDUE subgraphs and, thus, taken into account during LPM mining. However, this has a negative impact on the speedup of the LPM mining procedure. Moreover, it is worth noting that, by construction, SUBDUE extracts the largest among the more frequent subgraphs in the earlier iterations. Hence, in practice, we expect the subtraces obtained using SUBDUE even for a low number of iterations to be able to represent most of the process behaviors. Ac- cording to this observation, we used 1, 5 and 10 iterations for our experiments. Nonetheless, to verify our assumption, we also performed for each dataset an experiment with a high num- ber of iterations (i.e., 10000). Note that for these experiments we derived the LPMs only for the subgraphs of the top-level of SUBDUE hierarchy, since they represent the most relevant ones, as explained in Section 3. All the experiments were performed on an Intel i7 CPU. Results Table 1 shows the results obtained on the three event logs. The results obtained when exploring the full search space, i.e., without applying any projections or constraints, are considered to be the ground truth LPM ranking and therefore have NDCG@k values of 1.0 by definition. For the BPI 2012 log and the receipt phase log the full search space consists of approximate 1.5 million LPMs, while for the smaller SEPSIS log the search space consists of 315k LPMs. The results obtained by the best heuristic configuration(s) are reported in bold. When using projections mined with iterative Markov [27] projection mining, the search space of LPM mining is reduced by a factor between 43.50x (BPI 2012 log) and 120.21x (Receipt phase log), while the majority of the top 20 LPMs of the ground truth can still be found, resulting in high NDCG values. The table shows that using the constraints extracted from SUBDUE subgraphs after running the SUBDUE algorithm for 10k iterations together with the iterative Markov projections further increases the speedup of LPM mining on all three logs while resulting in identical LPM rankings. The number of iterations used in the SUBDUE algorithms when using projections extracted from SUBDUE results is indicated between parentheses. The search space size of LPM mining when using SUBDUE projections depends on the number of iterations used in the SUBDUE mining phase: when a larger number of iterations is used, then a larger number of unique sets of activities will be found to be used a projections, leading to a larger LPM search space. At the same time, increasing the number of iterations used in the SUBDUE mining phase increases the quality of the mined LPMs in terms of NDCG. Notice that the quality of the LPM mining results is not very stable when we run the SUBDUE algorithm for only one iteration: on the receipt phase log and the BPI 2012 log it leads to results comparable with or better than the results obtained with iterative Markov projections, but on the SEPSIS log it results in a very small search space and NDCG values. It is worth noting that running the SUBDUE algorithm for 10 iterations leads to a higher speedup than iterative Markov projections on all logs, even when no SUBDUE constraints are used, while at the same time the quality of the obtained rankings in terms of NDCG is higher. This 2 https://svn.win.tue.nl/repos/prom/Packages/LocalProcessModelDiscoveryWithSubdueConstraints 18 Table 1: Experimental results of LPM mining with and without SUBDUE projections and constraints in terms of search space size and quality of the ranking. Event Log Projections Constraints Search Space Speedup NDCG@5 NDCG@10 NDCG@20 (iterations) (iterations) Size BPI 2012 None None 1567250 - 1.0000 1.0000 1.0000 BPI 2012 Iterative Markov None 36032 43.50x 0.9993 0.9987 0.9865 BPI 2012 Iterative Markov SUBDUE (10k) 21084 74.33x 0.9993 0.9987 0.9865 BPI 2012 SUBDUE (1) None 10608 147.74x 0.9993 0.9987 0.9830 BPI 2012 SUBDUE (5) None 10740 145.93x 1.0000 0.9994 0.9870 BPI 2012 SUBDUE (10) None 10904 143.73x 1.0000 0.9994 0.9903 BPI 2012 SUBDUE (10k) None 12666 123.74x 1.0000 0.9994 0.9903 BPI 2012 SUBDUE (1) SUBDUE (1) 2718 576.62x 0.9993 0.9987 0.9830 BPI 2012 SUBDUE (5) SUBDUE (5) 2620 598.19x 1.0000 0.9994 0.9870 BPI 2012 SUBDUE (10) SUBDUE (10) 2874 545.32x 1.0000 0.9994 0.9903 BPI 2012 SUBDUE (10k) SUBDUE (10k) 4012 390.64x 1.0000 0.9994 0.9903 Receipt phase None None 1451450 - 1.0000 1.0000 1.0000 Receipt phase Iterative Markov None 12074 120.21x 0.9418 0.8986 0.8238 Receipt phase Iterative Markov SUBDUE (10k) 10610 136.80x 0.9418 0.8986 0.8238 Receipt phase SUBDUE (1) None 8176 177.53x 1.0000 0.9994 0.9903 Receipt phase SUBDUE (5) None 8256 175.81x 1.0000 0.9994 0.9903 Receipt phase SUBDUE (10) None 8264 175.64x 1.0000 0.9994 0.9958 Receipt phase SUBDUE (10k) None 8504 170.68x 1.0000 0.9994 0.9958 Receipt phase SUBDUE (1) SUBDUE (1) 4012 390.64x 1.0000 0.9994 0.9903 Receipt phase SUBDUE (5) SUBDUE (5) 1862 779.51x 1.0000 0.9994 0.9903 Receipt phase SUBDUE (10) SUBDUE (10) 2170 668.71x 1.0000 0.9994 0.9958 Receipt phase SUBDUE (10k) SUBDUE (10k) 2178 666.41x 1.0000 0.9994 0.9958 SEPSIS None None 315451 - 1.0000 1.0000 1.0000 SEPSIS Iterative Markov None 6304 50.04x 0.9332 0.9148 0.8613 SEPSIS Iterative Markov SUBDUE (10k) 3768 83.72x 0.9332 0.9148 0.8613 SEPSIS SUBDUE (1) None 12 26287.58x 0.5763 0.3771 0.2489 SEPSIS SUBDUE (5) None 334 994.46x 0.9916 0.9671 0.9472 SEPSIS SUBDUE (10) None 394 800.64x 0.9923 0.9692 0.9534 SEPSIS SUBDUE (10k) None 1034 305.08x 0.9923 0.9692 0.9534 SEPSIS SUBDUE (1) SUBDUE (1) 10 31545.10x 0.5763 0.3771 0.2489 SEPSIS SUBDUE (5) SUBDUE (5) 174 1812.94x 0.9916 0.9671 0.9472 SEPSIS SUBDUE (10) SUBDUE (10) 144 2190.63x 0.9923 0.9692 0.9534 SEPSIS SUBDUE (10k) SUBDUE (10k) 470 671.17x 0.9923 0.9692 0.9534 seems to indicate that using the activity sets of SUBDUE subgraphs is a more effective approach to mine clusters of activities for projection based LPM mining than the iterative Markov based approach. No increase in the quality of the mined LPMs has been found by increasing the number of SUBDUE iterations over 10 iterations; on all three event logs the quality when using 10k iterations is identical to 10 iterations. Using constraints extracted from SUBDUE subtrace mining in combination with SUBDUE-based projections has resulted in considerably higher speedup on all three logs without any loss in quality of the mined LPMs. On all event logs, the highest quality LPMs are mined when using projections and the constraints extracted from the subsequence mined with at least 10 iterations of the SUBDUE algorithm. Additionally, as expected, the speedup obtained using these projections and constraints is considerably higher than the speedup obtained when mining LPMs without constraints and with iterative Markov projections, which was the current state-of-the-art approach in LPM mining. Compared to LPM mining 19 → IV IV Admission → Liquid Antibiotics NC IV Liquid IV Antibiotics AdmissionNC (551/753) (551/823) (811/1182) (a) Subgraph mined with SUBDUE (b) Local Process Model Fig. 5: Subgraph mined with SUBDUE from the SEPSIS log and the Local Process Model showing the true relation between activities IV Liquid, IV Antibiotics, and Admission NC. with iterative Markov projections, mining with SUBDUE (10) projections and constraints results in an increase in speedup from 43.50x to 545.32x on the BPI 2012 log, from 120.21x to 668.71x on the receipt phase log, and from 50.04x to 6190.63x on the SEPSIS log. To put these results into perspective: this brings down the mining time on the BPI 2012 log from 24 minutes to less than two minutes. Before concluding this section, we provide an example of LPM mined from a SUBDUE subgraph, to provide some insights on the benefits of introducing a higher level of struc- ture on subtrace mining output. Fig. 5a shows the subtrace mined with SUBDUE from the SEPSIS log from which the projection on activity set {IV Liquid , IV Antibiotics, AdmissionNC } was extracted and Fig. 5b shows one of the LPM mined from that set of activities. While SUBDUE can only mine sequential relations, as in Fig. 5a, the LPM shows the true relation between the three activities: 551 out of 753 occurrences of IV Liquid are followed by IV Antibiotics and then one or more occurrences of Admission NC. However, 811 out of 1182 instances of Admission NC are preceded by the sequence hIV Liquid , IV Antibioticsi, showing that there are on average almost 1.5 admissions for each liquid and antibiotics. 6 Related Work Several approaches have been proposed to extract the most relevant subprocesses (intended as subgraphs) from a set of process execution traces. Some approaches propose to extract subprocesses from sequential traces. For instance, Bose et al. [4] mine subprocesses by identifying sequences of events that fit a priori defined templates. Leemans et al. [16] introduce an approach to derive episodes, i.e., directed graphs where nodes correspond to activities and edges to eventually-follow precedence relations that, given a pair of activities, indicate which one occurs later in the process. Compared to these approaches, our approach does not require any predefined template and extracts subprocesses that are the most relevant according to their description length, thus taking into account both frequency and size in determining the relevance of each subprocess. Several other techniques, like LPMs, focus on the mining of more complex patterns that allow for control-flow constructs other than sequential ordering. Chapela-Campa et al. [8] developed a technique called WoMine-i to mine subprocess patterns with multiple control- flow constructs that are infrequent, in contrast to most techniques that focus on the mining of frequent subprocess patterns. Lu et al. [19] recently proposed an interactive subprocess 20 exploration tool, which allows for the discovery of subprocess patterns and then allows the process analyst to modify these discovered subprocesses based on domain knowledge. For such a modified subprocess the tool provides the analyst with information regarding how frequent the modified pattern is, and where in the log it occurred. Dalmas et al. [9] developed a version of the Local Process Model miner, called the High-Utility LPM miner, that aims at mining subprocess patterns that optimize some optimization function that is provided by the process analyst, instead of mining simply frequent LPMs. Greco et al. [12] propose a Frequent Subgraph Mining (FSM) algorithm that exploits knowledge about relationships among activities (e.g., an AND/OR split) to drive subgraphs mining. Graphs are generated by replaying traces over the process model; however, this algorithm requires a model properly representing the event log, which may not be available for many real-world processes. 7 Conclusions and Future Work In this work, we investigated the use of subtrace mining in the generation of projections and constraints to aid the discovery of Local Process Models (LPMs). Specifically, we extended the LPM algorithm in [27] to account for ordering constraints mined using SUBDUE. We evaluated our approach on three real-world event logs. The results show that mining LPMs with SUBDUE projections and constraints outperforms the current state-of-the-art techniques for LPM mining both in quality as well as in computation time. In particular, on two out of the three event logs, the speedup in computation time compared to the current state-of-the-art technique is more than a factor 10. As future work, we plan to explore the use of other subtraces mining techniques to derive local process models. Moreover, we intend to investigate the mining of ordering relations between projections that are not involved in hierarchical inclusion relations, and their use in the combined set of LPMs, to obtain models describing possible correlations between behaviors occurring in different portions of the process. References 1. van der Aalst, W.M.P.: Process mining: data science in action. Springer (2016) 2. van der Aalst, W.M.P., Adriansyah, A., van Dongen, B.F.: Replaying history on process models for conformance checking and performance analysis. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 2(2), 182–192 (2012) 3. Bergenthum, R., Desel, J., Lorenz, R., Mauser, S.: Process mining based on regions of languages. In: Business Process Management. pp. 375–383. Springer (2007) 4. Bose, R.P.J.C., van der Aalst, W.M.P.: Abstractions in process mining: A taxonomy of patterns. In: Business Process Management, LNCS, vol. 5701, pp. 159–175. Springer (2009) 5. Buijs, J.C.A.M.: Receipt phase of an environmental permit application process (WABO), CoSeLoG project (2014), doi:10.4121/uuid:a07386a5-7be3-4367-9535-70bc9e77dbe6 6. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: A genetic algorithm for discovering process trees. In: Proc. of IEEE Congress on Evolutionary Computation. pp. 1–8. IEEE (2012) 7. Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of International Conference on Machine Learning. pp. 89–96 (2005) 21 8. Chapela-Campa, D., Mucientes, M., Lama, M.: Discovering infrequent behavioral patterns in process models. In: Business Process Management. pp. 324–340. Springer (2017) 9. Dalmas, B., Tax, N., Norre, S.: Heuristics for high-utility local process model mining. In: Proceedings of the International Workshop on Algorithms & Theories for the Analysis of Event Data. pp. 106–121. CEUR (2017) 10. Diamantini, C., Genga, L., Potena, D.: Behavioral process mining for unstructured processes. Journal of Intelligent Information Systems 47(1), 5–32 (2016) 11. van Dongen, B.F.: BPI challenge 2012 (2012), doi:10.4121/uuid: 3926db30-f712-4394-aebc-75976070e91f 12. Greco, G., Guzzo, A., Manco, G., Saccà, D.: Mining and reasoning on workflows. IEEE Transactions on Knowledge and Data Engineering 17(4), 519–534 (2005) 13. Günther, C.W., van der Aalst, W.M.P.: Fuzzy mining–adaptive process simplification based on multi-perspective metrics. In: Business Process Management. pp. 328–343. Springer (2007) 14. Järvelin, K., Kekäläinen, J.: Cumulated gain-based evaluation of ir techniques. ACM Transactions on Information Systems 20(4), 422–446 (2002) 15. Jonyer, I., Cook, D., Holder, L.: Graph-based Hierarchical Conceptual Clustering. Journal of Machine Learning Research 2, 19–43 (2002) 16. Leemans, M., van der Aalst, W.: Discovery of frequent episodes in event logs. In: Proc. of Int. Symposium on Data-driven Process Discovery and Analysis. pp. 1–31. CEUR-WS.org (2014) 17. Leemans, S.J.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-structured process models from event logs containing infrequent behaviour. In: BPM. pp. 66–78. Springer (2013) 18. Liesaputra, V., Yongchareon, S., Chaisiri, S.: Efficient process model discovery using maximal pattern mining. In: Business Process Management. pp. 441–456. Springer (2015) 19. Lu, X., Fahland, D., Andrews, R., Suriadi, S., Wynn, M.T., ter Hofstede, A.H.M., van der Aalst, W.M.P.: Semi-supervised log pattern detection and exploration using event concurrence and contextual information. In: Proceedings of International Conference on Cooperative Information Systems. Springer (2018) 20. Maggi, F.M., Mooij, A.J., van der Aalst, W.M.P.: User-guided discovery of declarative process models. In: Computational Intelligence and Data Mining. pp. 192–199. IEEE (2011) 21. Mannhardt, F.: SEPSIS Cases – Event Log (2016), doi:10.4121/uuid: 915d2bfb-7e84-49ad-a286-dc35f063a460 22. Mannhardt, F., Tax, N.: Unsupervised event abstraction using pattern abstraction and local process models. In: Proceedings of the International Working Conference on Business Process Modeling, Development and Support. pp. 89–96. CEUR-WS.org (2017) 23. Măruşter, L., van Beest, N.R.T.P.: Redesigning business processes: a methodology based on simulation and process mining techniques. Knowl. Inf. Syst. 21(3), 267 (2009) 24. Ramezani, E., Fahland, D., van der Aalst, W.M.P.: Where did I misbehave? diagnostic information in compliance checking. In: BPM. pp. 262–278. Springer (2012) 25. Schönig, S., Cabanillas, C., Jablonski, S., Mendling, J.: Mining the organisational perspective in agile business processes. In: International Conference on Enterprise, Business-Process and Information Systems Modeling. pp. 37–52. Springer (2015) 26. Tax, N., Bockting, S., Hiemstra, D.: A cross-benchmark comparison of 87 learning to rank methods. Information processing & management 51(6), 757–772 (2015) 27. Tax, N., Sidorova, N., van der Aalst, W.M.P., Haakma, R.: Heuristic approaches for generating local process models through log projections. In: Proceedings of IEEE Symposium Series on Computational Intelligence. pp. 1–8. IEEE (2016) 28. Tax, N., Sidorova, N., Haakma, R., van der Aalst, W.M.P.: Mining local process models. Journal of Innovation in Digital Ecosystems 3(2), 183–196 (2016) 29. Verbeek, H.M.W., Buijs, J.C.A., Van Dongen, B.F., van der Aalst, W.M.P.: ProM 6: The process mining toolkit. In: Proceedings of BPM Demo Track. vol. 615, pp. 34–39. CEUR-WS.org (2010) 22