Detecting Change in Processes Using Comparative Trace Clustering B.F.A. Hompes1,2 , J.C.A.M. Buijs1 , W.M.P. van der Aalst1 , P.M. Dixit1,2 , and J. Buurman2 1 Department of Mathematics and Computer Science Eindhoven University of Technology, Eindhoven, The Netherlands 2 Philips Research, Eindhoven, The Netherlands {b.f.a.hompes,h.m.w.verbeek,w.m.p.v.d.aalst}@tue.nl {prabhakar.dixit,hans.buurman}@philips.com Abstract. Real-life business processes are complex and show a high degree of variability. Additionally, due to changing conditions and circum- stances, these processes continuously evolve over time. For example, in the healthcare domain, advances in medicine trigger changes in diagnoses and treatment processes. Besides changes over time, case data (e.g. treating physician, patient age) also influence how processes are executed. Existing process mining techniques assume processes to be static and therefore are less suited for the analysis of contemporary flexible business processes. This paper presents a novel comparative trace clustering approach that is able to expose changes in behavior. Valuable insights can be gained and process improvements can be made by finding those points in time where behavior changed and the reasons why. Evaluation on real-life event data shows our technique can provide these insights. Keywords: process mining, trace clustering, concept drift, process com- parison 1 Introduction Business processes in flexible environments typically depend on many factors. Due to changing conditions and circumstances, these processes continuously evolve over time. For example, advances in medicine can change how patients are treated or how diagnoses are made in hospitals. Other changing circumstances could be new legislation, seasonal effects or even preferences of involved resources. As a result, in these flexible types of processes many cases follow a unique path through the process. This variability causes problems for existing process mining techniques since processes are assumed to be structured and in a steady state. Contemporary process mining techniques return spaghettilike processes and potentially misleading results when this is not the case [3, 7, 16]. The discovered process models capture behavior possible at any given point in time. Often, however, the process context changes and what was possible before is no longer possible or vice-versa. Figure 1 shows the relevance of the problem with an example process that has changed over time. 95 Fig. 1: Demonstration of concept drift in a simple process. Techniques that consider the entire event log can return misleading results. This so-called concept drift is one of the key challenges in process mining, as discussed in the Process Mining Manifesto [2]. Few techniques have been proposed to deal with concept drift in a business process setting [5, 6, 9, 11, 17]. Where existing techniques focus mainly on change in control-flow, our focus is on detecting changing behavior including data aspects as well. Finding changes in behavior can be of great value to process owners. Detecting unwanted changes, for example, can identify potential risks while positive change can lead to perdurable process improvement. Techniques that consider the entire event log at once cannot show the individual changes that occur throughout the lifetime of a process [5]. For example, specific types of behavior might occur for a limited time only, or can have a seasonal nature. Behavior can also merge with, or emerge from other behavior. Analysis of differences in behavior is supported by comparing clusterings created for two selected partitions of the event log. This way, we can not only compare differences in behavior over time, but behavior for different case groups (e.g. different age groups, customer types, cases handled by different resources) can be compared as well. As such, the analysis of differences in process behavior is generalized. Recently we proposed a novel technique for the detection of common and deviating behavior using trace clustering [8]. The process context is considered by taking both control-flow as well as case and event data into account. Hence, clustering of cases is not limited to finding similar execution paths. The trace clustering technique is based on the Markov cluster (MCL) algorithm, which is a fast and scalable cluster algorithm for graphs [15]. It is able to autonomously discover a (non-specified) number of clusters of different sizes and densities. Hence, the resulting clustering shows both mainstream and deviating behavior. The technique can be used to exploratively analyse the event data at hand in order to gain novel insights into the underlying process(es). 96 Similarity matrix Clustering Sublog Event log Sublog Similarity matrix Clustering Fig. 2: A graphical overview of the approach. Differences in behavior can be compared using comparative trace clustering. In this paper, we extend on the work in [8] by providing a new technique for change point detection and a means to compare clusterings. Figure 2 shows a high-level overview of our approach. By incorporating the time dimension we can discover temporal evolutions in process behavior. Changes in behavior causing or caused by differences in case data can therefor be detected as well. In order to detect change points, we look at similarities between cases. We consider the effect of new events on a clustering of active cases. The remainder of this paper is organized as follows. Section 2 briefly introduces necessary preliminary definitions. The detection of change points is described in Section 3. Section 4 describes the comparative trace clustering approach. An experimental evaluation on two real-life event logs is performed in Section 5. Section 6 discusses related work. The paper concludes with a summary and planned future work in Section 7. 2 Preliminaries Typically, the executed events of multiple cases of a process are recorded in an event log. An event is an execution of an activity potentially having additional data attributes such as a timestamp or the responsible resource. A trace is a finite sequence of events, and describes one specific instance (i.e. a case) of the process at hand in terms of the executed events. A case can also have additional (case-wide) attributes such as a patient birthdate or diagnosis result. Definitions in this section are based on those as defined in [1]. Definition 1 (Event, attribute). Let E be the event universe, i.e. the set of all possible event identifiers. Events may be characterized by various attributes, e.g. an event may have a timestamp, correspond to an activity, be executued by a particular person, etc. Let N be a set of attribute names. For any event e ∈ E and attribute name n ∈ N : n(e) is the value of attribute n for event e. If event e does not have an attribute named n, then n(e) =⊥ (null value). 97 Typically, the following attributes are present in all events: activity(e) is the activity associated to event e, time(e) is the timestamp of e, and resource(e) is the resource associated to e. Additional event attributes can be the cost associated with the event, the outcome of an activity (e.g. surgery result), etc. Definition 2 (Case, trace, event log). Let C be the case universe, i.e. the set of all possible case identifiers. Cases, like events, have attributes. For any case c ∈ C and attribute name n ∈ N : n(c) is the value of attribute n for case c (n(c) =⊥ if c has no attribute named n). Each case has a mandatory attribute ‘trace’: trace(c) ∈ E ∗ . ĉ = trace(c) is a shorthand notation for referring to the trace of a case. A trace is a finite sequence of events σ ∈ E ∗ , such that each event appears only once, i.e. for 1 ≤ i ≤ j ≤ |σ|: σ(i) 6= σ(j). For any sequence σ = hσ1 , σ2 , . . . , σn i, set(σ) = {σ1 , σ2 , . . . , σn } converts a sequence into a set, e.g. set(ha, b, c, b, c, di) = {a, b, c, d}. An event log is a set of cases L ⊆ C such that each event appears at most once in the entire log, i.e. for any c, c0 ∈ L such that c 6= c0 : set(ĉ) ∩ set(ĉ0 ) = ∅. In the example traces in Figures 1 and 3 a simplified form is used where events are represented solely by the activities they execute. In this form, a trace is a sequence of activities and an event log a multiset of traces (since in this simplified form cases can share their trace). Definition 3 (Trace Similarity Matrix). Let L ⊆ C be an event log. M(L) = (L × L) → [0.0, 1.0] denotes the set of all possible trace similarity matrices over L. For cases c, c0 ∈ L and a trace P Pmatrix M ∈ M(L), similarity M (c, c0 ) denotes the similarity between c and c0 . M = c,c0 M (c, c0 ) denotes the sum of all elements in M . Furthermore, standard matrix operations such as addition, subtraction, multiplication, etc. are supported as well. Definition 4 (Trace Clustering). Let L ⊆ C be an event log. A trace cluster over L is a subset of L. A trace clustering T C ⊆ P(L)1 is a set of trace clus- ters S over L. We assume every case to be part of at least one trace cluster, i.e. T C = L. Cases can be in multiple clusters, i.e. cluster overlap is allowed. 3 Detecting Change in Behavior Discovering a process model on an entire real-life event log will often lead to a spaghetti model since it has to represent all past traces [3, 7, 16]. Similarly, clustering the entire event log will show groups of behavior that were possible at any given point in time. The evolution of this behavior is not shown. Our technique is based on the technique proposed in [8] where trace clustering and outlier detection are combined in order to find mainstream and deviating behavior. It relies on the Markov cluster (MCL) algorithm [15] to find groups of cases (trace clusters) that share behavior on a set of selected perspectives. By incorporating both control-flow and case data, the process context is taken 1 P(L) denotes the powerset over event log L, i.e. all possible sublogs of L. 98 into account. Because MCL is neither biased towards globular or local clusters and is able to find clusters of different density, we can distinguish common and exceptional behavior based on cluster sizes. As the number of clusters is discovered rather than set beforehand, changes in behavior will be reflected in change in the clustering. MCL uses a stochastic similarity matrix between cases as its input. It al- ternates an expansion step that raises the matrix to a given power with the inflation step. Inflation raises each element to a given power and normalizes the matrix such that it is stochastic again. This alternation eventually results in the separation of the matrix into different components which is interpreted as a clustering. Similarities between cases are calculated by mapping cases to profile vectors using a set of selected perspectives and computing pair-wise vector similarity values. Possible perspectives are the occurrence or frequency of certain activities, the values for case or event data, etc. For more information on the calculation of this similarity matrix and the use of the MCL algorithm in trace clustering we refer the reader to [8]. We propose to utilize the similarity matrix between cases to detect changes in behavior. An overview of this approach can be seen in Figure 3, where five example (simple) traces are drawn over time. Over time, the occurrence of new events in a certain trace will change the similarity of that trace to other traces, which is reflected in the similarity matrix. As this matrix is the input for the clustering algorithm, the impact on the similarity matrix is a good indicator for how much the clustering will change. Thus, in order to detect potentially interesting change points, we compute the change in the values of the similarity matrix over time. Similar to the approaches that use statistical tests [5, 10, 11], a large difference indicates a significant change in behavior. For each event window, cases that have events in or before that window are considered in the calculation of the next similarity matrix. Events occurring after the current event window are not considered. As such, for each case, the prefix of its trace is taken. Different window sizes can be considered. Case attributes (e.g. patient age or gender) are considered to be known from the very first event in the trace of a case. As a result, when looking at case-wide data, change points will reflect on the first event of a deviating case. This projection of cases leads to a sublog containing all events up to a certain time. A similarity matrix is calculated for that sublog and compared with the previous similarity matrix. If change points have been identified, clusterings of cases occurring before and after these points can be created to compare behavior before and after the change point. For example, take the traces depicted in Figure 3, and consider cases to be similar when they share activities. At first, up to event window W3 , trace t4 seems to be quite similar to the other three traces seen so far. However, in window W4 this changes due to activity f which has replaced activity e. This change is reflected P in the similarity matrices M1 to M5 . Consequently the value for ∆3 = | (M4 − M3 )| will be significantPand indicates a possible change point in W4 . Since the range for the values for | (Mx − Mx−1 )| for future windows is unknown, we normalize all delta values after each measurement. 99 Window size t1 t2 Traces t 3 t4 t5 Time W1 W3 W5 W2 W4 M1 M2 M3 M4 M5 Δ1 Δ3 Δ2 Δ4 Δmin Δ Time W4 Fig. 3: A graphical overview of the change detection approach. In order to detect changes in behavior, the difference in similarity matrices over time is calculated. In this example, significant change occurred in window W4 . 100 4 Comparative Clustering In order to analyse the effect of changing behavior in a process, different clusterings can be created and compared both programmatically and visually. By comparing clusterings created for different sublogs (e.g. before and after an identified change point), we can see where behavior changed and why. As discussed, it is also possible to manually create selections of cases based on time or data attributes in order to compare behavior. This can be used to compare behavior for different age groups, for patients from different geographical locations, etc. Because of the properties of the MCL algorithm discussed in Section 3, change is reflected in the cluster structure. For example, cluster sizes indicate the frequency of the captured behavior. Behavior that used to be common but is becoming less and less frequent will still be clustered separately rather than being merged with (different) common behavior, as long as the behavior remains dissimilar enough. Clusters of cases can be annotated with descriptions about the cases that are present. Shared activities between traces and similar data values are a few of the possibilities. For example, a cluster might group cases of patients that all share a certain diagnosis. This diagnosis can be used to describe the cluster. Two example clusterings are shown in Figure 4. In the left clustering, one cluster is selected. By highlighting those clusters in the right clustering that share behavior with the selected cluster(s) on the left we can show how behavior has changed. It is possible to see how cases are clustered or what annotations are shared. For example, we might compare two years of patient data, clustered on diagnosis. Patients that are present in both years are highlighted as shared cases (in dark gray), while clusters that share some or all diagnoses (shared annotations) are highlighted (in light gray). Within a trace clustering, clusters that share behavior are connected. As such, when comparing two clusterings, clusters that are split into or have emerged from multiple clusters can be found. This can indicate that behavior has become more specific or more general. Selected cluster Similar annotation Shared traces Fig. 4: Two clusterings compared. Highlighted clusters on the right show how behavior on the left has changed. 101 5 Evaluation In order to evaluate the approach we use two real-life event logs. The first log comes from a Dutch academic hospital that contains cases pertaining to cancer treatment procedures. It was originally used in the first Business Process Intelligence Contest (BPIC 2011) [13]. The second event log contains cases of building permit applications provided by a Dutch municipality. This log is part of the 2015 edition of the BPI Challenge (BPIC 2015) [14]. These event logs were used so that the results can be reproduced. Our technique has been implemented in the process mining tool ProM1 , and is publicly available in the TraceClustering package. In the results shown here, we used parameters expansion=2, inflation=15 for the MCL algorithm. 5.1 Hospital event log The first event log contains cases of different stages of malignancy and of different parts of the body. Also, information is present about the diagnosis, treatment, specialism required, patient age, organisational group (hospital department), etc. This log contains 1,143 cases, 150,291 events and 624 distinct activities. There are 981 different executions paths (activity sequences). There are many attributes present on both the event and case level. All of these attributes can obtain several different values, leading to a large heterogeneity in the log. As cases are recorded between January 2005 and March 2008, the event log is likely to exhibit drifts. For each case, there are 16 attributes for ‘diagnosis code’, referring to the diagnoses the patient received for different parts of their body. By clustering on these attributes and comparing the years 2005 and 2006 we can see trends in common and exceptional diagnoses. Figure 5 shows how this diagnosis has evolved. The clustering on the left represents behavior in 2005 whereas the clustering on the right represents 2006. Patients that were in the selected cluster and have had activities in both years are highlighted in dark gray. Groups of patients that have had (partially) shared diagnoses are marked light gray. For example, out of the 539 cases in 2015, 78 (14.5%) were diagnosed with codes M13, 822, and 106, on different parts of the body. Out of the 765 cases in 2016, only 84 (11.0%) had a diagnosis using the same codes. What’s more, is that these diagnoses are present for more body parts as well. This could indicate a trend in diseases or be due to an improvement in diagnosis detail. As there are many smaller clusters in 2006 that have additional diagnoses (light gray), we can deduce that for the selected diagnosis, the related diagnoses have become more specific and diagnoses are also made on other parts of the body. In Figure 5 process maps and differences in activities are shown for two highlighted clusters. Besides diagnosis codes, every case has 16 possible attributes for ‘treatment code’, referring to the treatments the patient received on different parts of their body. This leads to many possible treatment combinations for different diagnoses. We compare the clustering created for the combined attributes diagnosis code 1 See http://promtools.org 102 and treatment code for the years 2005 and 2006. Using our technique, different clusters are discovered, each cluster contains specific combinations of diagnosis and treatment. By comparing the results for the two years, changes in treatments for specific diagnoses become visible. As we can see from Figure 6, treatment for some diagnoses have changed. Again cases in 2005 are shown on the left and cases in 2006 are shown on the right. We can see that there are two clusters that contain patients that were active in 2005, one of which is much smaller than the other. For the larger cluster, additional diagnoses were made and additional treatments were performed. As a result, more cases share this behavior. The call-outs in Figure 6 again show differences in activities between the two years. These differences indicate that, over time, treatments for certain diagnoses have changed. Considering the type of process, this could be due to specific patient needs, changes in protocols or advances in medicine. A likely reason is that diagnoses were made (or recorded) with greater detail in 2006 versus in 2005. Insights such as these can be gained easily and can be used to verify or specify protocols, check whether certain behavior is changing or for auditing purposes. vervolgconsult poliklinisch 198 117 88 23 aanname laboratoriumonderzoek 1151 41 44 60 bacteriologisch onderzoek met kweek -nie natrium - vlamfotometrisch - spoed vagina - toucher onder anesthesie 75 117 88 23 30 administratief tarief - eerste pol 100 bilirubine -geconjugeerd 52 77 1 6 10 calcium - spoed 76 60 57 62 bilirubine - totaal verlosk.-gynaec. korte kaart kosten-out coupe ter inzage telefonisch consult 92 45 16 47 bicarbonaat 61 teletherapie - megavolt fotonen bestrali 52 4 6 30 11 glucose 120 5 19 ligdagen - alle spec.beh.kinderg.-reval. 635 kalium vlamfotometrisch - spoed 11 69 91 192 ureum aanname laboratoriumonderzoek 133 437 802 o2-saturatie 99 14 43 52 57 hemoglobine foto-elektrisch creatinine - spoed bilirubine -geconjugeerd 283 76 53 urine onderzoek - kwalitatief 218 49 bacteriologisch onderzoek met kweek -nie 44 52 79 creatinine 317 natrium - vlamfotometrisch - spoed bilirubine - totaal Added activities: 8 103 57 histologisch onderzoek - biopten nno 94 25 urine onderzoek - kwalitatief 77 44 31 alkalische fosfatase -kinetisch- 109 kalium vlamfotometrisch - spoed glucose 98 106 73 36 50 14 natrium vlamfotometrisch 317 differentiele telling automatisch ureum 296 17 111 86 - ‘melkzuur enzymatisch’ kalium potentiometrisch 79 64 318 105 haemoglobine foto-electrisch - spoed hemoglobine foto-elektrisch 148 153 sgot - asat kinetisch 118 80 122 84 112 trombocyten tellen - spoed creatinine 95 182 sgpt - alat kinetisch 117 65 56 71 Selected clusters - ‘haptoglobine’ alkalische fosfatase -kinetisch- differentiele telling automatisch melkzuurdehydrogenase -ldh- kinetisch 70 185 81 91 40 leucocyten tellen -electronisch- spoed 62 100 haemoglobine foto-electrisch - spoed bloedgroep abo en rhesusfactor 155 134 natrium vlamfotometrisch 167 87 6 130 157 trombocyten tellen - spoed actuele ph - pco2 - stand.bicarbonaat 83 rhesusfactor d - centrifugeermethode - e 103 52 134 - ‘ct hals’ kalium potentiometrisch 165 16 69 12 18 6 10 20 73 31 kruisproef volledig -drie methoden- hematocriet mbv centrifuge 44 250 11 142 35 sgot - asat kinetisch 73 leucocyten tellen -electronisch- spoed 20 cde fenotypering 80 106 18 75 leukocyten tellen elektronisch 18 205 sgpt - alat kinetisch 185 bloedgroepantigenen andere dan abo-rhesu 9 80 19 47 55 Diagnoses: trombocyten tellen - elektronisch 198 melkzuurdehydrogenase -ldh- kinetisch 76 52 28 29 gammaglutamyltranspeptidase 111 22 bloedgroep abo en rhesusfactor 307 60 20 93 52 24 31 cea - tumormarker mbv meia 91 30 7 24 More bodyparts 15 8 rhesusfactor d - centrifugeermethode - e M13, 822, 106 93 60 squamous cell carcinoma mbv eia 69 11 27 38 leukocyten tellen elektronisch 112 calcium 127 55 kruisproef volledig -drie methoden- 96 74 101 albumine trombocyten tellen - elektronisch diagnosed with 108 103 45 48 screening antistoffen erytrocyten 129 gammaglutamyltranspeptidase 71 27 17 ca-125 mbv meia 91 30 cea - tumormarker mbv meia 20 26 M13, 822, 106 16 15 ordertarief 665 squamous cell carcinoma mbv eia 59 6 42 47 thorax 126 25 51 calcium 190021 klinische opname a002 behandeltijd - eenheid t3 - megavolt 29 82 186 35 59 21 1e consult poliklinisch vervolgconsult poliklinisch 124 323 albumine teletherapie - megavolt fotonen bestrali 94 40 80 51 31 administratief tarief - eerste pol 31 137 screening antistoffen erytrocyten 9 9 14 16 91 mri abdomen telefonisch consult 20 36 81 ca-125 mbv meia 139 coupe ter inzage 12 10 1 62 21 24 22 10 ligdagen - alle spec.beh.kinderg.-reval. verlosk.-gynaec. korte kaart kosten-out 859 21 ordertarief 23 464 anesthesie - bij behandeling met radium 26 42 29 23 10 190021 klinische opname a002 thorax 117 34 brachytherapie - interstitieel - intensi 387 30 65 17 16 21 41 190205 klasse 3b a205 1e consult poliklinisch 474 68 190205 klasse 3b a205 715 381 528 190101 bovenreg.toesl. a101 190101 bovenreg.toesl. a101 460 586 21 6 32 gefiltreerd erytrocytenconcentraat gefiltreerd erytrocytenconcentraat verlosk.-gynaec. aanv.kaart kosten-out 95 73 17 Fig. 5: Hospital log clustered on diagnosis code for cases active in 2005 (left) and 2006 (right). Changes in diagnoses are discovered. More bodyparts are diagnosed with codes M13, 822 and 106. 103 aanname laboratoriumonderzoek 5 Added activity: 102 6 bilirubine - totaal 10 5 glucose 8 7 11 3 bacteriologisch onderzoek met kweek -nie 10 - ‘bovenreg.toesl. a101’ ureum 10 10 30 14 hemoglobine foto-elektrisch 26 vervolgconsult poliklinisch glucose 284 14 21 165 93 14 11 creatinine 21 administratief tarief - eerste pol aanname laboratoriumonderzoek ureum 12 172 96 13 Selected cluster 23 3 18 13 alkalische fosfatase -kinetisch- Missing activity: 13 21 ct abdomen 15 hemoglobine foto-elektrisch 13 11 37 natrium vlamfotometrisch 5 30 18 18 telefonisch consult creatinine 60 30 - ‘gefiltreerd kalium potentiometrisch 1 17 18 8 3 verlosk.-gynaec. jaarkaart kosten-out alkalische fosfatase -kinetisch- 20 17 sgot - asat kinetisch 12 17 Diagnoses: 12 vitamine b12 mbv chemieluminescentie 17 7 natrium vlamfotometrisch erytrocytenconcentraat’ 23 sgpt - alat kinetisch 9 23 13 6 kalium potentiometrisch 23 M13, M14, M16, 106 melkzuurdehydrogenase -ldh- kinetisch 6 7 4 12 5 sgot - asat kinetisch 35 16 leukocyten tellen elektronisch 17 30 vitamine b12 mbv chemieluminescentie 16 16 11 trombocyten tellen - elektronisch 2 10 sgpt - alat kinetisch Additional diagnoses: 17 16 12 8 gammaglutamyltranspeptidase melkzuurdehydrogenase -ldh- kinetisch 13 12 Treatments: 6 8 M11, M12, M15 squamous cell carcinoma mbv eia 11 leukocyten tellen elektronisch 24 2 2 5 24 ca-125 mbv meia 5 9 trombocyten tellen - elektronisch 13, 61, 101, 103, 502, 503 24 calcium 8 4 18 ordertarief 64 gammaglutamyltranspeptidase 19 2 6 25 thorax 5 squamous cell carcinoma mbv eia Additional treatments: 12 3 5 vervolgconsult poliklinisch 200 ordertarief 71 118 57 4 administratief tarief - eerste pol 23, 62, 603, 803, 119 17 2 14 ct abdomen 12 5 2 20 3209, 9101 telefonisch consult 40 1 verlosk.-gynaec. jaarkaart kosten-out 17 Fig. 6: Hospital log clustered on diagnosis code and treatment code in 2005 (left) and 2006 (right). Treatments for specific diagnoses are changing over time. Additional diagnoses and treatments are found. 5.2 Municipality event log The second event log contains cases of building permit applications in a Dutch municipality. Information is present about the type of permit, the costs associated with the permit, the involved resources, etc. Again, each attribute can have several different values. This log contains 1,199 cases recorded between late 2010 and early 2015 with in total 52,217 events and 398 distinct activities. As there are 1170 different execution paths, almost all cases are unique from the control-flow perspective. Each case has an attribute ‘parts’ that refers to the different permit types that are involved in the case it describes. As a pre-processing step, values containing multiple types were split up over different case attributes, one for each type. Besides permit types, each case is also labeled with the attribute ‘term name’, describing which status has been assigned to the permit application. Possible values are ‘permit granted’, ‘additional information required’, ‘term objection and appeal’, etc. We cluster the cases in the log on both permit type and term name and compare cases in 2011 with cases in 2012. The results are shown in Figure 7. As we can see, few clusters are discovered, indicating only slight differences in behavior on these perspectives. A group of cases pertaining to mainly construction and environmental permits that are in the ‘objection and appeal’ term is selected in the 2011 clustering. Over a one-year period, most of this behavior has merged 104 Additional permit types: Fire safety, Commercial, Driveway, Selected cluster 28 01_HOOFD_010\\complete 29 Violation environmental planning 14 2 01_HOOFD_015\\complete 29 11 17 01_HOOFD_020\\complete 8 29 5 18 2 01_HOOFD_030_2\\complete 01_HOOFD_030_1\\complete 29 29 2 2 01_HOOFD_065_1\\complete 8 29 6 7 01_HOOFD_540\\complete 12 7 6 06_VD_010\\complete Permit types: 27 9 2 05_EIND_010\\complete 2 2 27 5 1 01_HOOFD_050\\complete 13 8 2 2 2 29 Many more terms 2 4 12 3 6 3 2 7 01_HOOFD_055\\complete 8 6 4 19 2 4 287 01_HOOFD_010\\complete 8 07_OPS_010\\complete 8 01_HOOFD_110\\complete 2 293 15 33 Construction, 252 28 2 01_HOOFD_011\\complete 31 253 184 10 01_HOOFD_120\\complete 01_HOOFD_015\\complete 33 293 7 262 5 2 29 01_HOOFD_020\\complete 293 2 01_HOOFD_130\\complete 01_HOOFD_180\\complete 106 4 17 33 169 01_HOOFD_030_2\\complete 1 283 2 16 3 127 108 111 01_HOOFD_030_1\\complete 13 01_HOOFD_100\\complete 4 01_HOOFD_190\\complete 287 24 16 133 18 4 13 12 02_DRZ_010\\complete 01_HOOFD_040\\complete 250 39 241 01_HOOFD_040\\complete 01_HOOFD_195\\complete 2 Monument, 04_BPT_005\\complete 27 18 3 248 3 125 26 16 4 14 01_HOOFD_061\\complete 51 129 3 01_HOOFD_060\\complete 3 01_HOOFD_200\\complete 68 56 40 27 31 68 51 01_HOOFD_065_2\\complete 256 3 11 21 7 33 102 115 01_HOOFD_065_1\\complete 9 01_HOOFD_065_2\\complete 2 01_HOOFD_205\\complete 11 2 255 29 22 69 7 3 6 12 01_HOOFD_065_0\\complete 104 01_HOOFD_099\\complete 34 36 79 23 34 6 01_HOOFD_270\\complete 01_HOOFD_101\\complete 14 17 4 263 3 7 7 5 17 3 13 2 10 14_VRIJ_010\\complete 01_HOOFD_110_0\\complete 3 3 14 30 35 01_HOOFD_250\\complete 2 Environment, 69 22 28 29 4 03_GBH_005\\complete 3 01_HOOFD_180\\complete 42 289 2 3 25 9 109 3 8 01_HOOFD_050\\complete 7 09_AWB45_005\\complete 2 01_HOOFD_260\\complete 4 151 123 29 3 3 20 2 2 24 05_EIND_010\\complete 08_AWB45_010\\complete 01_HOOFD_060\\complete 126 64 39 38 2 01_HOOFD_330\\complete 08_AWB45_020_1\\complete 29 15 64 34 28 3 08_AWB45_020_2\\complete 11 62 3 09_AH_I_010\\complete 12 11 6 132 29 18 08_AWB45_025\\complete 3 21 26 2 11 18 32 95 08_AWB45_030\\complete 8 3 Demolition 5 15_NGV_010\\complete 13 01_HOOFD_375\\complete 61 5 29 48 14 15 08_AWB45_040\\complete 54 8 2 15 01_HOOFD_370\\complete 01_HOOFD_193\\complete 29 8 42 28 13 08_AWB45_005\\complete 4 176 01_HOOFD_380\\complete 136 29 9 36 01_HOOFD_195\\complete 7 255 24 121 4 4 17 4 8 01_HOOFD_200\\complete 7 37 5 01_HOOFD_430\\complete 249 27 24 104 24 8 3 121 79 2 4 23 5 4 13 112 01_HOOFD_250_1\\complete 11 12 01_HOOFD_250\\complete 144 113 3 69 111 128 82 3 119 11_AH_II_010\\complete 01_HOOFD_470\\complete 01_HOOFD_250_2\\complete 01_HOOFD_260\\complete 5 24 4 5 4 144 7 113 30 16 106 80 4 23 4 01_HOOFD_196\\complete 17 01_HOOFD_330\\complete 142 230 01_HOOFD_480\\complete 2 24 207 27 5 09_AH_I_010\\complete 254 27 119 100 01_HOOFD_370\\complete 01_HOOFD_490_1\\complete 254 29 23 136 111 20 100 01_HOOFD_375\\complete 5 4 254 3 7 3 119 01_HOOFD_490_2\\complete 06_VD_010\\complete 01_HOOFD_380\\complete 29 7 16 70 253 47 214 4 Term: 06_VD_020\\complete 01_HOOFD_430\\complete 23 47 233 8 01_HOOFD_491\\complete 45 223 5 06_VD_030_1\\complete 11_AH_II_010\\complete 47 246 12 3 26 4 26 9 06_VD_030_2\\complete 184 13_CRD_010\\complete 01_HOOFD_495\\complete 8 49 26 17 3 25 22 1 9 01_HOOFD_110\\complete 06_VD_030_3\\complete 13 01_HOOFD_480\\complete 160 39 233 18 10 14 01_HOOFD_510_1\\complete 3 01_HOOFD_110_2\\complete 06_VD_060_1\\complete 29 51 4 119 41 3 121 22 15 3 01_HOOFD_505\\complete 22 5 1 7 88 01_HOOFD_055\\complete 103 06_VD_060_2\\complete 43 01_HOOFD_490_2\\complete 216 9 3 13 34 1 01_HOOFD_520\\complete 6 29 01_HOOFD_110_1\\complete 01_HOOFD_120\\complete 01_HOOFD_491\\complete 100 140 197 23 63 4 4 Objection and appeal 43 01_HOOFD_494a\\complete 31 01_HOOFD_101\\complete 01_HOOFD_490_3\\complete 15 7 16 24 01_HOOFD_495\\complete 250 3 13 9 20 17 01_HOOFD_510_2\\complete 3 14_VRIJ_010\\complete 4 01_HOOFD_510_2\\complete 3 252 15 29 7 3 11 8 01_HOOFD_515\\complete 17 01_HOOFD_490_5a\\complete 115 31 38 56 10 03_GBH_005\\complete 01_HOOFD_530\\complete 15 29 7 01_HOOFD_519\\complete 51 3 113 5 20 82 26 01_HOOFD_520\\complete 61 3 99 20 143 7 33 6 44 15 227 3 34 19 01_HOOFD_490_3\\complete 56 105 3 11 9 28 01_HOOFD_516\\complete 4 7 159 86 25 99 2 3 8 29 11 6 01_HOOFD_510_4\\complete 7 115 12 23 16 3 19 3 38 6 01_HOOFD_490_5\\complete 47 106 86 3 28 19 01_HOOFD_510_2a\\complete 31 29 5 15 4 16 27 17 01_HOOFD_490_4\\complete 155 3 01_HOOFD_500\\complete 75 117 32 4 49 01_HOOFD_510_0\\complete 32 23 20 01_HOOFD_510_1\\complete 253 17 33 01_HOOFD_505\\complete 01_HOOFD_510_3\\complete 66 115 10 3 01_HOOFD_530\\complete 01_HOOFD_490_1\\complete 104 255 96 30 01_HOOFD_490_1a\\complete 32 Fig. 7: Municipality log clustered on permit type and term description. In this case, the discovered change in behavior leads to limited insights. with the biggest group of cases, which now represents almost all behavior in the log in 2012. Though there clearly are differences in behavior between the two years, it is difficult to interpret these findings directly. More analysis is necessary in order to discover results that are meaningful in a business process management context. For example, the activities that are executed could be considered when clustering the cases in the log. As such, clusters would show differences and similarities in control-flow as well. This indicates that the choice for the clustering perspectives is essential in finding valuable insights about differences in behavior. 6 Related Work Although concept drift is a well-studied topic in the data mining and machine learning communities, little work has been done on detecting concept drift in business processes. Bose et al. were the first to consider concept drift and change detection in a process mining setting [5]. In their work, a classification of possible changes in business processes is given, and statistical hypothesis tests are used to detect regions of change. Even tough the authors consider the possibility of change in data attributes, the scope of their work is limited to the detection of control-flow changes in a process manifested as sudden drifts over a period of time. More recently, Martjushev et al. built on this work by looking at gradual and multi-order dynamics to detect concept drift in control-flow [11]. They extend the 105 work in [5] by providing solutions to detect gradual change as well. By considering multi-order dynamics through the use of an adaptive window technique, process change occurring at multiple levels of mixed time granularity can be detected. Maaradji et al. employ statistical tests over the distributions of runs observed in two consecutive time windows in order to detect concept drift [10]. As noted by the authors, in order to find differences in process behavior a notion of equivalence is necessary. In their paper, a notion of run-equivalence is used. It is shown that drift can be identified fast and accurately by using an adaptive sliding window technique. As a result, it can be used in an online (streaming) setting as an oracle as to when a discovered model should be updated. Weber et al. employ probabilistic deterministic finite automata (PDFA) to represent the probability distributions generated by process models [17]. Similar to [5] and [10], statistical hypothesis tests are used to detect whether or not a distribution has changed significantly from a ground truth. The aim of their technique is to identify process change as soon as possible, but with confidence that change is significant, in order to discover a model representing reality as good as possible. As such, only drift in control-flow is considered. In [6] a different technique is proposed to automatically detect and manage concept drift in an online setting. Here, concept drift is detected real-time using an estimation technique based on abstract interpretation of the process and sequential sampling of the log. The fitness of prefixes of new samples taken from the log is checked against that of prefixes of initial samples. A change point is identified when there is a significant difference between these two points. In the above-mentioned techniques however, data attributes are not considered. As such, only changes in control-flow behavior can be discovered. Trace clustering techniques are often used to find different process variants. Several trace clustering techniques have been proposed in the field of process mining, and an extensive comparative analysis of trace clustering techniques has recently been performed in [12]. Often, however, the temporal dimension is not considered. In [9], the starting time of each process instance is used as an additional feature in trace clustering. By combining control-flow and time features, the clusters formed share both a structural similarity and a temporal proximity. The technique is based on the technique proposed in [4] and considers different types of changes, including sudden, recurring, gradual, and incremental changes. In more complex evolving business processes however, including the temporal proximity of cases might lead to misleading results. For example when seasonal drifts are intertwined with gradual changes in the process. The technique proposed in this paper uses similar ideas and concepts as used in the papers mentioned above. However, trace clustering is used to find common and deviating process behavior. By looking at both control-flow as well as data attributes, the technique is made context-aware. We extend the technique in [8] by including change detection in behavioral similarities between cases. In this way, we can identify changes in common and deviating behavior, on both the control-flow and data perspectives. 106 7 Conclusions and Future Work Real-life business processes are often complex while exhibiting a high degree of variability. Due to changing conditions and circumstances, these processes continuously evolve over time. Existing process mining techniques assume the process to be static and therefore are less suited for the analysis of contemporary business processes. In this paper we presented a novel comparative trace clustering approach that is able to expose differences in behavior in a process. By using both control-flow and case data we take the process context into account. Insights can be gained into how and why behavior has changed by comparing changes in clusterings over different partitions of the log. This information can then be used for further analysis, e.g. to design protocols, for early detection of unwanted behavior or for auditing purposes. Besides the time dimension, different data and control-flow attributes can be utilized in order to distinguish groups of behavior. While our initial results show that indeed promising insights can be achieved, there is still quite some manual work involved. More work is needed to further automate the analysis process. In the future we would also like to look into how changes in process behavior can be analysed in an online setting. Different ways to visualize change in behavior should be explored as well. Bibliography [1] W.M.P. van der Aalst. Process Mining: Discovery, Conformance and En- hancement of Business Processes. Springer, Berlin, 2011. [2] W.M.P. van der Aalst, A. Adriansyah, A.K.A. de Medeiros, F. Arcieri, T. Baier, T. Blickle, R.P.J.C. Bose, P. van den Brand, R. Brandtjen, J.C.A.M. Buijs, et al. Process Mining Manifesto. In Business Process Management Workshops, pages 169–194. Springer, 2012. [3] R.P.J.C. Bose and W.M.P. van der Aalst. Context Aware Trace Clustering: Towards Improving Process Mining Results. In Proceedings of the SIAM In- ternational Conference on Data Mining, pages 401–412. Society for Industrial and Applied Mathematics, 2009. [4] R.P.J.C. Bose and W.M.P. van der Aalst. Trace Clustering Based on Conserved Patterns: Towards Achieving Better Process Models. In Business Process Management Workshops, pages 170–181. Springer, 2010. [5] R.P.J.C. Bose, W.M.P. van der Aalst, I. Žliobaitė, and M. Pechenizkiy. Handling Concept Drift in Process Mining. In Advanced Information Systems Engineering, pages 391–405. Springer, 2011. [6] J. Carmona and R. Gavalda. Online Techniques for Dealing with Concept Drift in Process Mining. In Advances in Intelligent Data Analysis XI, pages 90–102. Springer, 2012. [7] S. Goedertier, J. De Weerdt, D. Martens, J. Vanthienen, and B. Baesens. Process Discovery in Event Logs: An Application in the Telecom Industry. Applied Soft Computing, 11(2):1697–1710, 2011. 107 [8] B.F.A. Hompes, J.C.A.M. Buijs, W.M.P. van der Aalst, P.M. Dixit, and J. Buurman. Discovering Deviating Cases and Process Variants Using Trace Clustering. In Proceedings of the 27th Benelux Conference on Artificial Intelligence (BNAIC), November 5-6, Hasselt, Belgium, 11 2015. [9] D. Luengo and M. Sepúlveda. Applying Clustering in Process Mining to Find Different Versions of a Business Process that Changes over Time. In Business Process Management Workshops, pages 153–158. Springer, 2012. [10] A. Maaradji, M. Dumas, M. La Rosa, and A. Ostovar. Fast and Accurate Business Process Drift Detection. In Business Process Management, pages 406–422. Springer, 2015. [11] J. Martjushev, R.P.J.C. Bose, and W.M.P. van der Aalst. Change Point Detection and Dealing with Gradual and Multi-order Dynamics in Process Mining. In Perspectives in Business Informatics Research, pages 161–178. Springer, 2015. [12] T. Thaler, S.F. Ternis, P. Fettke, and P. Loos. A Comparative Analysis of Process Instance Cluster Techniques. In Proceedings of the 12th International Conference on Wirtschaftsinformatik. Internationale Tagung Wirtschaftsin- formatik (WI-15), March 3-5, Osnabrck, Germany. Universitt Osnabrck, 3 2015. [13] B.F. van Dongen. Real-life Event Logs - Hospital Log, 2011. URL: http://dx. doi.org/10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54, doi: 10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54. [14] B.F. van Dongen. BPI Challenge 2015, 2015. URL: http://dx.doi.org/ 10.4121/uuid:31a308ef-c844-48da-948c-305d167a0ec1, doi:10.4121/ uuid:31a308ef-c844-48da-948c-305d167a0ec1. [15] S. Van Dongen. A Cluster Algorithm for Graphs. Technical report, National Research Institute for Mathematics and Computer Science in the Netherlands, 2000. [16] G.M. Veiga and D.R. Ferreira. Understanding Spaghetti Models with Se- quence Clustering for ProM. In Business Process Management Workshops, pages 92–103. Springer, 2010. [17] P. Weber, B. Bordbar, and P. Tino. Real-Time Detection of Process Change using Process Mining. In Imperial College Computing Student Workshop, pages 108–114, 2011. 108