Detecting Change in Processes
Using Comparative Trace Clustering
B.F.A. Hompes1,2 , J.C.A.M. Buijs1 , W.M.P. van der Aalst1 ,
P.M. Dixit1,2 , and J. Buurman2
1
Department of Mathematics and Computer Science
Eindhoven University of Technology, Eindhoven, The Netherlands
2
Philips Research, Eindhoven, The Netherlands
{b.f.a.hompes,h.m.w.verbeek,w.m.p.v.d.aalst}@tue.nl
{prabhakar.dixit,hans.buurman}@philips.com
Abstract. Real-life business processes are complex and show a high
degree of variability. Additionally, due to changing conditions and circum-
stances, these processes continuously evolve over time. For example, in the
healthcare domain, advances in medicine trigger changes in diagnoses and
treatment processes. Besides changes over time, case data (e.g. treating
physician, patient age) also influence how processes are executed. Existing
process mining techniques assume processes to be static and therefore are
less suited for the analysis of contemporary flexible business processes.
This paper presents a novel comparative trace clustering approach that is
able to expose changes in behavior. Valuable insights can be gained and
process improvements can be made by finding those points in time where
behavior changed and the reasons why. Evaluation on real-life event data
shows our technique can provide these insights.
Keywords: process mining, trace clustering, concept drift, process com-
parison
1 Introduction
Business processes in flexible environments typically depend on many factors.
Due to changing conditions and circumstances, these processes continuously
evolve over time. For example, advances in medicine can change how patients are
treated or how diagnoses are made in hospitals. Other changing circumstances
could be new legislation, seasonal effects or even preferences of involved resources.
As a result, in these flexible types of processes many cases follow a unique path
through the process. This variability causes problems for existing process mining
techniques since processes are assumed to be structured and in a steady state.
Contemporary process mining techniques return spaghettilike processes and
potentially misleading results when this is not the case [3, 7, 16]. The discovered
process models capture behavior possible at any given point in time. Often,
however, the process context changes and what was possible before is no longer
possible or vice-versa. Figure 1 shows the relevance of the problem with an
example process that has changed over time.
95
Fig. 1: Demonstration of concept drift in a simple process. Techniques that
consider the entire event log can return misleading results.
This so-called concept drift is one of the key challenges in process mining,
as discussed in the Process Mining Manifesto [2]. Few techniques have been
proposed to deal with concept drift in a business process setting [5, 6, 9, 11, 17].
Where existing techniques focus mainly on change in control-flow, our focus is
on detecting changing behavior including data aspects as well.
Finding changes in behavior can be of great value to process owners. Detecting
unwanted changes, for example, can identify potential risks while positive change
can lead to perdurable process improvement. Techniques that consider the entire
event log at once cannot show the individual changes that occur throughout the
lifetime of a process [5]. For example, specific types of behavior might occur for
a limited time only, or can have a seasonal nature. Behavior can also merge with,
or emerge from other behavior.
Analysis of differences in behavior is supported by comparing clusterings
created for two selected partitions of the event log. This way, we can not only
compare differences in behavior over time, but behavior for different case groups
(e.g. different age groups, customer types, cases handled by different resources)
can be compared as well. As such, the analysis of differences in process behavior
is generalized.
Recently we proposed a novel technique for the detection of common and
deviating behavior using trace clustering [8]. The process context is considered
by taking both control-flow as well as case and event data into account. Hence,
clustering of cases is not limited to finding similar execution paths. The trace
clustering technique is based on the Markov cluster (MCL) algorithm, which is
a fast and scalable cluster algorithm for graphs [15]. It is able to autonomously
discover a (non-specified) number of clusters of different sizes and densities.
Hence, the resulting clustering shows both mainstream and deviating behavior.
The technique can be used to exploratively analyse the event data at hand in
order to gain novel insights into the underlying process(es).
96
Similarity matrix Clustering
Sublog
Event log
Sublog Similarity matrix Clustering
Fig. 2: A graphical overview of the approach. Differences in behavior can be
compared using comparative trace clustering.
In this paper, we extend on the work in [8] by providing a new technique for
change point detection and a means to compare clusterings. Figure 2 shows a
high-level overview of our approach. By incorporating the time dimension we can
discover temporal evolutions in process behavior. Changes in behavior causing
or caused by differences in case data can therefor be detected as well. In order
to detect change points, we look at similarities between cases. We consider the
effect of new events on a clustering of active cases.
The remainder of this paper is organized as follows. Section 2 briefly introduces
necessary preliminary definitions. The detection of change points is described
in Section 3. Section 4 describes the comparative trace clustering approach. An
experimental evaluation on two real-life event logs is performed in Section 5.
Section 6 discusses related work. The paper concludes with a summary and
planned future work in Section 7.
2 Preliminaries
Typically, the executed events of multiple cases of a process are recorded in an
event log. An event is an execution of an activity potentially having additional
data attributes such as a timestamp or the responsible resource. A trace is a
finite sequence of events, and describes one specific instance (i.e. a case) of the
process at hand in terms of the executed events. A case can also have additional
(case-wide) attributes such as a patient birthdate or diagnosis result. Definitions
in this section are based on those as defined in [1].
Definition 1 (Event, attribute). Let E be the event universe, i.e. the set of
all possible event identifiers. Events may be characterized by various attributes,
e.g. an event may have a timestamp, correspond to an activity, be executued by a
particular person, etc. Let N be a set of attribute names. For any event e ∈ E
and attribute name n ∈ N : n(e) is the value of attribute n for event e. If event e
does not have an attribute named n, then n(e) =⊥ (null value).
97
Typically, the following attributes are present in all events: activity(e) is the
activity associated to event e, time(e) is the timestamp of e, and resource(e) is
the resource associated to e. Additional event attributes can be the cost associated
with the event, the outcome of an activity (e.g. surgery result), etc.
Definition 2 (Case, trace, event log). Let C be the case universe, i.e. the
set of all possible case identifiers. Cases, like events, have attributes. For any
case c ∈ C and attribute name n ∈ N : n(c) is the value of attribute n for case c
(n(c) =⊥ if c has no attribute named n). Each case has a mandatory attribute
‘trace’: trace(c) ∈ E ∗ . ĉ = trace(c) is a shorthand notation for referring to the
trace of a case. A trace is a finite sequence of events σ ∈ E ∗ , such that each
event appears only once, i.e. for 1 ≤ i ≤ j ≤ |σ|: σ(i) 6= σ(j). For any sequence
σ = hσ1 , σ2 , . . . , σn i, set(σ) = {σ1 , σ2 , . . . , σn } converts a sequence into a set,
e.g. set(ha, b, c, b, c, di) = {a, b, c, d}. An event log is a set of cases L ⊆ C such
that each event appears at most once in the entire log, i.e. for any c, c0 ∈ L such
that c 6= c0 : set(ĉ) ∩ set(ĉ0 ) = ∅.
In the example traces in Figures 1 and 3 a simplified form is used where events
are represented solely by the activities they execute. In this form, a trace is a
sequence of activities and an event log a multiset of traces (since in this simplified
form cases can share their trace).
Definition 3 (Trace Similarity Matrix). Let L ⊆ C be an event log.
M(L) = (L × L) → [0.0, 1.0] denotes the set of all possible trace similarity
matrices over L. For cases c, c0 ∈ L and a trace P Pmatrix M ∈ M(L),
similarity
M (c, c0 ) denotes the similarity between c and c0 . M = c,c0 M (c, c0 ) denotes
the sum of all elements in M . Furthermore, standard matrix operations such as
addition, subtraction, multiplication, etc. are supported as well.
Definition 4 (Trace Clustering). Let L ⊆ C be an event log. A trace cluster
over L is a subset of L. A trace clustering T C ⊆ P(L)1 is a set of trace clus-
ters
S over L. We assume every case to be part of at least one trace cluster, i.e.
T C = L. Cases can be in multiple clusters, i.e. cluster overlap is allowed.
3 Detecting Change in Behavior
Discovering a process model on an entire real-life event log will often lead to
a spaghetti model since it has to represent all past traces [3, 7, 16]. Similarly,
clustering the entire event log will show groups of behavior that were possible at
any given point in time. The evolution of this behavior is not shown.
Our technique is based on the technique proposed in [8] where trace clustering
and outlier detection are combined in order to find mainstream and deviating
behavior. It relies on the Markov cluster (MCL) algorithm [15] to find groups
of cases (trace clusters) that share behavior on a set of selected perspectives.
By incorporating both control-flow and case data, the process context is taken
1
P(L) denotes the powerset over event log L, i.e. all possible sublogs of L.
98
into account. Because MCL is neither biased towards globular or local clusters
and is able to find clusters of different density, we can distinguish common
and exceptional behavior based on cluster sizes. As the number of clusters is
discovered rather than set beforehand, changes in behavior will be reflected in
change in the clustering.
MCL uses a stochastic similarity matrix between cases as its input. It al-
ternates an expansion step that raises the matrix to a given power with the
inflation step. Inflation raises each element to a given power and normalizes
the matrix such that it is stochastic again. This alternation eventually results
in the separation of the matrix into different components which is interpreted
as a clustering. Similarities between cases are calculated by mapping cases to
profile vectors using a set of selected perspectives and computing pair-wise vector
similarity values. Possible perspectives are the occurrence or frequency of certain
activities, the values for case or event data, etc. For more information on the
calculation of this similarity matrix and the use of the MCL algorithm in trace
clustering we refer the reader to [8].
We propose to utilize the similarity matrix between cases to detect changes
in behavior. An overview of this approach can be seen in Figure 3, where five
example (simple) traces are drawn over time. Over time, the occurrence of new
events in a certain trace will change the similarity of that trace to other traces,
which is reflected in the similarity matrix. As this matrix is the input for the
clustering algorithm, the impact on the similarity matrix is a good indicator
for how much the clustering will change. Thus, in order to detect potentially
interesting change points, we compute the change in the values of the similarity
matrix over time. Similar to the approaches that use statistical tests [5, 10, 11], a
large difference indicates a significant change in behavior. For each event window,
cases that have events in or before that window are considered in the calculation
of the next similarity matrix. Events occurring after the current event window are
not considered. As such, for each case, the prefix of its trace is taken. Different
window sizes can be considered. Case attributes (e.g. patient age or gender) are
considered to be known from the very first event in the trace of a case. As a
result, when looking at case-wide data, change points will reflect on the first
event of a deviating case. This projection of cases leads to a sublog containing
all events up to a certain time. A similarity matrix is calculated for that sublog
and compared with the previous similarity matrix. If change points have been
identified, clusterings of cases occurring before and after these points can be
created to compare behavior before and after the change point.
For example, take the traces depicted in Figure 3, and consider cases to
be similar when they share activities. At first, up to event window W3 , trace
t4 seems to be quite similar to the other three traces seen so far. However, in
window W4 this changes due to activity f which has replaced activity e. This
change is reflected
P in the similarity matrices M1 to M5 . Consequently the value
for ∆3 = | (M4 − M3 )| will be significantPand indicates a possible change point
in W4 . Since the range for the values for | (Mx − Mx−1 )| for future windows is
unknown, we normalize all delta values after each measurement.
99
Window size
t1
t2
Traces t
3
t4
t5
Time
W1 W3 W5
W2 W4
M1 M2 M3 M4 M5
Δ1 Δ3
Δ2 Δ4
Δmin
Δ
Time
W4
Fig. 3: A graphical overview of the change detection approach. In order to detect
changes in behavior, the difference in similarity matrices over time is calculated.
In this example, significant change occurred in window W4 .
100
4 Comparative Clustering
In order to analyse the effect of changing behavior in a process, different clusterings
can be created and compared both programmatically and visually. By comparing
clusterings created for different sublogs (e.g. before and after an identified change
point), we can see where behavior changed and why. As discussed, it is also
possible to manually create selections of cases based on time or data attributes
in order to compare behavior. This can be used to compare behavior for different
age groups, for patients from different geographical locations, etc.
Because of the properties of the MCL algorithm discussed in Section 3,
change is reflected in the cluster structure. For example, cluster sizes indicate
the frequency of the captured behavior. Behavior that used to be common but
is becoming less and less frequent will still be clustered separately rather than
being merged with (different) common behavior, as long as the behavior remains
dissimilar enough.
Clusters of cases can be annotated with descriptions about the cases that are
present. Shared activities between traces and similar data values are a few of the
possibilities. For example, a cluster might group cases of patients that all share a
certain diagnosis. This diagnosis can be used to describe the cluster.
Two example clusterings are shown in Figure 4. In the left clustering, one
cluster is selected. By highlighting those clusters in the right clustering that
share behavior with the selected cluster(s) on the left we can show how behavior
has changed. It is possible to see how cases are clustered or what annotations
are shared. For example, we might compare two years of patient data, clustered
on diagnosis. Patients that are present in both years are highlighted as shared
cases (in dark gray), while clusters that share some or all diagnoses (shared
annotations) are highlighted (in light gray). Within a trace clustering, clusters
that share behavior are connected. As such, when comparing two clusterings,
clusters that are split into or have emerged from multiple clusters can be found.
This can indicate that behavior has become more specific or more general.
Selected cluster
Similar annotation
Shared traces
Fig. 4: Two clusterings compared. Highlighted clusters on the right show how
behavior on the left has changed.
101
5 Evaluation
In order to evaluate the approach we use two real-life event logs. The first
log comes from a Dutch academic hospital that contains cases pertaining to
cancer treatment procedures. It was originally used in the first Business Process
Intelligence Contest (BPIC 2011) [13]. The second event log contains cases of
building permit applications provided by a Dutch municipality. This log is part
of the 2015 edition of the BPI Challenge (BPIC 2015) [14]. These event logs were
used so that the results can be reproduced. Our technique has been implemented
in the process mining tool ProM1 , and is publicly available in the TraceClustering
package. In the results shown here, we used parameters expansion=2, inflation=15
for the MCL algorithm.
5.1 Hospital event log
The first event log contains cases of different stages of malignancy and of different
parts of the body. Also, information is present about the diagnosis, treatment,
specialism required, patient age, organisational group (hospital department), etc.
This log contains 1,143 cases, 150,291 events and 624 distinct activities. There
are 981 different executions paths (activity sequences). There are many attributes
present on both the event and case level. All of these attributes can obtain several
different values, leading to a large heterogeneity in the log. As cases are recorded
between January 2005 and March 2008, the event log is likely to exhibit drifts.
For each case, there are 16 attributes for ‘diagnosis code’, referring to the
diagnoses the patient received for different parts of their body. By clustering on
these attributes and comparing the years 2005 and 2006 we can see trends in
common and exceptional diagnoses. Figure 5 shows how this diagnosis has evolved.
The clustering on the left represents behavior in 2005 whereas the clustering on
the right represents 2006. Patients that were in the selected cluster and have had
activities in both years are highlighted in dark gray. Groups of patients that have
had (partially) shared diagnoses are marked light gray. For example, out of the
539 cases in 2015, 78 (14.5%) were diagnosed with codes M13, 822, and 106, on
different parts of the body. Out of the 765 cases in 2016, only 84 (11.0%) had a
diagnosis using the same codes. What’s more, is that these diagnoses are present
for more body parts as well. This could indicate a trend in diseases or be due to
an improvement in diagnosis detail. As there are many smaller clusters in 2006
that have additional diagnoses (light gray), we can deduce that for the selected
diagnosis, the related diagnoses have become more specific and diagnoses are also
made on other parts of the body. In Figure 5 process maps and differences in
activities are shown for two highlighted clusters.
Besides diagnosis codes, every case has 16 possible attributes for ‘treatment
code’, referring to the treatments the patient received on different parts of their
body. This leads to many possible treatment combinations for different diagnoses.
We compare the clustering created for the combined attributes diagnosis code
1
See http://promtools.org
102
and treatment code for the years 2005 and 2006. Using our technique, different
clusters are discovered, each cluster contains specific combinations of diagnosis
and treatment. By comparing the results for the two years, changes in treatments
for specific diagnoses become visible. As we can see from Figure 6, treatment
for some diagnoses have changed. Again cases in 2005 are shown on the left and
cases in 2006 are shown on the right. We can see that there are two clusters that
contain patients that were active in 2005, one of which is much smaller than
the other. For the larger cluster, additional diagnoses were made and additional
treatments were performed. As a result, more cases share this behavior. The
call-outs in Figure 6 again show differences in activities between the two years.
These differences indicate that, over time, treatments for certain diagnoses
have changed. Considering the type of process, this could be due to specific
patient needs, changes in protocols or advances in medicine. A likely reason is
that diagnoses were made (or recorded) with greater detail in 2006 versus in 2005.
Insights such as these can be gained easily and can be used to verify or specify
protocols, check whether certain behavior is changing or for auditing purposes.
vervolgconsult poliklinisch
198 117 88 23 aanname laboratoriumonderzoek
1151
41 44 60
bacteriologisch onderzoek met kweek -nie natrium - vlamfotometrisch - spoed vagina - toucher onder anesthesie 75
117 88 23
30 administratief tarief - eerste pol
100 bilirubine -geconjugeerd
52
77
1 6 10 calcium - spoed
76
60
57 62 bilirubine - totaal
verlosk.-gynaec. korte kaart kosten-out coupe ter inzage telefonisch consult 92
45 16 47
bicarbonaat 61
teletherapie - megavolt fotonen bestrali 52
4 6
30
11 glucose
120
5 19 ligdagen - alle spec.beh.kinderg.-reval.
635
kalium vlamfotometrisch - spoed 11 69
91
192
ureum
aanname laboratoriumonderzoek 133
437
802
o2-saturatie 99
14 43 52 57
hemoglobine foto-elektrisch
creatinine - spoed bilirubine -geconjugeerd 283
76 53
urine onderzoek - kwalitatief 218
49
bacteriologisch onderzoek met kweek -nie 44 52
79
creatinine
317
natrium - vlamfotometrisch - spoed bilirubine - totaal
Added activities:
8
103 57
histologisch onderzoek - biopten nno 94
25
urine onderzoek - kwalitatief 77 44
31 alkalische fosfatase -kinetisch-
109
kalium vlamfotometrisch - spoed glucose
98
106 73
36 50 14 natrium vlamfotometrisch
317
differentiele telling automatisch ureum 296
17
111 86
- ‘melkzuur enzymatisch’
kalium potentiometrisch
79 64 318
105
haemoglobine foto-electrisch - spoed hemoglobine foto-elektrisch
148 153
sgot - asat kinetisch
118
80 122
84 112
trombocyten tellen - spoed creatinine
95 182 sgpt - alat kinetisch
117
65 56
71
Selected clusters - ‘haptoglobine’
alkalische fosfatase -kinetisch- differentiele telling automatisch melkzuurdehydrogenase -ldh- kinetisch
70 185 81
91 40
leucocyten tellen -electronisch- spoed 62
100
haemoglobine foto-electrisch - spoed bloedgroep abo en rhesusfactor
155 134
natrium vlamfotometrisch
167
87 6 130
157
trombocyten tellen - spoed actuele ph - pco2 - stand.bicarbonaat 83 rhesusfactor d - centrifugeermethode - e
103 52 134
- ‘ct hals’
kalium potentiometrisch
165 16 69 12 18 6 10 20
73 31 kruisproef volledig -drie methoden- hematocriet mbv centrifuge 44 250 11
142 35
sgot - asat kinetisch 73 leucocyten tellen -electronisch- spoed 20 cde fenotypering
80 106 18
75 leukocyten tellen elektronisch 18
205
sgpt - alat kinetisch
185 bloedgroepantigenen andere dan abo-rhesu 9
80
19
47 55
Diagnoses:
trombocyten tellen - elektronisch
198
melkzuurdehydrogenase -ldh- kinetisch
76
52
28 29 gammaglutamyltranspeptidase
111
22 bloedgroep abo en rhesusfactor 307 60 20
93
52 24 31 cea - tumormarker mbv meia
91 30 7 24
More bodyparts
15 8
rhesusfactor d - centrifugeermethode - e
M13, 822, 106
93
60 squamous cell carcinoma mbv eia
69
11 27
38
leukocyten tellen elektronisch
112 calcium
127
55 kruisproef volledig -drie methoden- 96 74
101
albumine
trombocyten tellen - elektronisch
diagnosed with
108
103
45
48
screening antistoffen erytrocyten
129
gammaglutamyltranspeptidase
71
27
17
ca-125 mbv meia 91
30
cea - tumormarker mbv meia
20 26
M13, 822, 106
16 15 ordertarief
665
squamous cell carcinoma mbv eia 59 6 42
47
thorax 126
25
51
calcium 190021 klinische opname a002 behandeltijd - eenheid t3 - megavolt 29
82 186 35
59 21 1e consult poliklinisch vervolgconsult poliklinisch
124 323
albumine teletherapie - megavolt fotonen bestrali 94 40
80 51
31 administratief tarief - eerste pol 31
137
screening antistoffen erytrocyten 9 9 14 16
91
mri abdomen telefonisch consult
20 36 81
ca-125 mbv meia 139 coupe ter inzage 12 10 1
62
21
24
22 10 ligdagen - alle spec.beh.kinderg.-reval. verlosk.-gynaec. korte kaart kosten-out
859 21
ordertarief 23
464
anesthesie - bij behandeling met radium 26
42 29 23
10
190021 klinische opname a002 thorax
117 34
brachytherapie - interstitieel - intensi 387
30
65 17
16 21 41
190205 klasse 3b a205 1e consult poliklinisch
474 68 190205 klasse 3b a205
715
381
528
190101 bovenreg.toesl. a101 190101 bovenreg.toesl. a101
460 586
21 6 32
gefiltreerd erytrocytenconcentraat
gefiltreerd erytrocytenconcentraat verlosk.-gynaec. aanv.kaart kosten-out
95
73 17
Fig. 5: Hospital log clustered on diagnosis code for cases active in 2005 (left) and
2006 (right). Changes in diagnoses are discovered. More bodyparts are diagnosed
with codes M13, 822 and 106.
103
aanname laboratoriumonderzoek 5
Added activity:
102
6 bilirubine - totaal 10
5
glucose
8
7 11 3 bacteriologisch onderzoek met kweek -nie
10
- ‘bovenreg.toesl. a101’
ureum
10
10
30 14
hemoglobine foto-elektrisch
26
vervolgconsult poliklinisch glucose
284 14
21
165 93 14 11
creatinine
21
administratief tarief - eerste pol aanname laboratoriumonderzoek ureum
12 172 96 13
Selected cluster
23 3 18 13
alkalische fosfatase -kinetisch-
Missing activity:
13
21 ct abdomen 15 hemoglobine foto-elektrisch
13
11 37
natrium vlamfotometrisch 5 30
18
18 telefonisch consult creatinine
60 30
- ‘gefiltreerd
kalium potentiometrisch
1 17
18
8 3
verlosk.-gynaec. jaarkaart kosten-out alkalische fosfatase -kinetisch-
20 17
sgot - asat kinetisch
12 17
Diagnoses:
12 vitamine b12 mbv chemieluminescentie 17
7
natrium vlamfotometrisch
erytrocytenconcentraat’
23
sgpt - alat kinetisch 9
23
13
6
kalium potentiometrisch
23
M13, M14, M16, 106
melkzuurdehydrogenase -ldh- kinetisch 6
7 4 12
5
sgot - asat kinetisch 35
16
leukocyten tellen elektronisch
17
30 vitamine b12 mbv chemieluminescentie 16
16 11
trombocyten tellen - elektronisch 2 10 sgpt - alat kinetisch
Additional diagnoses:
17 16
12 8
gammaglutamyltranspeptidase
melkzuurdehydrogenase -ldh- kinetisch
13
12
Treatments:
6
8
M11, M12, M15
squamous cell carcinoma mbv eia
11 leukocyten tellen elektronisch
24
2 2
5 24
ca-125 mbv meia 5
9
trombocyten tellen - elektronisch
13, 61, 101, 103, 502, 503
24
calcium 8
4
18
ordertarief
64 gammaglutamyltranspeptidase
19
2
6
25 thorax
5
squamous cell carcinoma mbv eia
Additional treatments:
12
3
5
vervolgconsult poliklinisch
200
ordertarief
71 118 57
4 administratief tarief - eerste pol
23, 62, 603, 803,
119
17 2
14 ct abdomen 12
5
2 20
3209, 9101
telefonisch consult
40
1
verlosk.-gynaec. jaarkaart kosten-out
17
Fig. 6: Hospital log clustered on diagnosis code and treatment code in 2005
(left) and 2006 (right). Treatments for specific diagnoses are changing over time.
Additional diagnoses and treatments are found.
5.2 Municipality event log
The second event log contains cases of building permit applications in a Dutch
municipality. Information is present about the type of permit, the costs associated
with the permit, the involved resources, etc. Again, each attribute can have several
different values. This log contains 1,199 cases recorded between late 2010 and
early 2015 with in total 52,217 events and 398 distinct activities. As there are
1170 different execution paths, almost all cases are unique from the control-flow
perspective.
Each case has an attribute ‘parts’ that refers to the different permit types that
are involved in the case it describes. As a pre-processing step, values containing
multiple types were split up over different case attributes, one for each type.
Besides permit types, each case is also labeled with the attribute ‘term name’,
describing which status has been assigned to the permit application. Possible
values are ‘permit granted’, ‘additional information required’, ‘term objection
and appeal’, etc.
We cluster the cases in the log on both permit type and term name and
compare cases in 2011 with cases in 2012. The results are shown in Figure 7. As we
can see, few clusters are discovered, indicating only slight differences in behavior
on these perspectives. A group of cases pertaining to mainly construction and
environmental permits that are in the ‘objection and appeal’ term is selected in
the 2011 clustering. Over a one-year period, most of this behavior has merged
104
Additional permit types:
Fire safety, Commercial, Driveway,
Selected cluster
28
01_HOOFD_010\\complete
29
Violation environmental planning
14
2 01_HOOFD_015\\complete
29
11 17
01_HOOFD_020\\complete 8
29
5 18 2
01_HOOFD_030_2\\complete 01_HOOFD_030_1\\complete
29 29
2 2
01_HOOFD_065_1\\complete 8
29
6 7
01_HOOFD_540\\complete
12
7
6 06_VD_010\\complete
Permit types:
27
9
2 05_EIND_010\\complete 2 2
27
5
1 01_HOOFD_050\\complete 13 8 2 2 2
29
Many more terms
2 4 12 3 6 3
2 7 01_HOOFD_055\\complete
8
6 4 19 2 4
287
01_HOOFD_010\\complete
8 07_OPS_010\\complete 8 01_HOOFD_110\\complete 2 293
15 33
Construction,
252
28 2 01_HOOFD_011\\complete 31
253
184
10 01_HOOFD_120\\complete
01_HOOFD_015\\complete
33 293
7
262
5 2 29
01_HOOFD_020\\complete
293
2 01_HOOFD_130\\complete 01_HOOFD_180\\complete 106 4
17 33
169 01_HOOFD_030_2\\complete 1
283
2 16
3 127 108
111 01_HOOFD_030_1\\complete 13
01_HOOFD_100\\complete 4 01_HOOFD_190\\complete 287
24 16
133 18
4 13 12 02_DRZ_010\\complete 01_HOOFD_040\\complete
250 39
241
01_HOOFD_040\\complete 01_HOOFD_195\\complete 2
Monument,
04_BPT_005\\complete
27 18 3
248
3
125
26 16
4 14 01_HOOFD_061\\complete 51
129
3 01_HOOFD_060\\complete 3 01_HOOFD_200\\complete 68 56 40
27 31
68 51 01_HOOFD_065_2\\complete
256
3 11
21 7 33 102 115
01_HOOFD_065_1\\complete 9
01_HOOFD_065_2\\complete 2 01_HOOFD_205\\complete 11 2 255
29 22
69 7 3
6 12 01_HOOFD_065_0\\complete 104 01_HOOFD_099\\complete 34
36 79
23 34 6
01_HOOFD_270\\complete
01_HOOFD_101\\complete
14 17 4
263
3 7
7 5 17 3 13
2 10
14_VRIJ_010\\complete 01_HOOFD_110_0\\complete 3 3 14
30 35
01_HOOFD_250\\complete 2
Environment,
69 22 28
29
4 03_GBH_005\\complete 3 01_HOOFD_180\\complete
42 289
2 3 25
9 109 3
8 01_HOOFD_050\\complete 7 09_AWB45_005\\complete
2 01_HOOFD_260\\complete 4 151 123
29
3 3 20
2 2 24 05_EIND_010\\complete 08_AWB45_010\\complete 01_HOOFD_060\\complete
126 64 39
38
2 01_HOOFD_330\\complete
08_AWB45_020_1\\complete
29 15
64
34
28
3 08_AWB45_020_2\\complete 11
62
3 09_AH_I_010\\complete 12 11 6 132
29
18 08_AWB45_025\\complete 3 21
26
2 11
18 32 95
08_AWB45_030\\complete 8 3
Demolition
5 15_NGV_010\\complete 13 01_HOOFD_375\\complete 61
5 29
48
14 15 08_AWB45_040\\complete
54
8
2 15 01_HOOFD_370\\complete
01_HOOFD_193\\complete
29 8
42
28
13
08_AWB45_005\\complete 4
176
01_HOOFD_380\\complete 136
29
9 36 01_HOOFD_195\\complete 7
255
24
121 4 4
17 4 8 01_HOOFD_200\\complete 7 37
5 01_HOOFD_430\\complete 249
27
24 104 24 8 3 121 79
2 4 23 5 4 13 112 01_HOOFD_250_1\\complete 11 12 01_HOOFD_250\\complete
144 113
3 69 111 128 82 3 119
11_AH_II_010\\complete 01_HOOFD_470\\complete
01_HOOFD_250_2\\complete 01_HOOFD_260\\complete
5 24 4 5 4
144
7
113
30
16 106 80
4 23
4 01_HOOFD_196\\complete 17 01_HOOFD_330\\complete
142 230
01_HOOFD_480\\complete 2 24 207
27
5 09_AH_I_010\\complete
254
27
119
100 01_HOOFD_370\\complete
01_HOOFD_490_1\\complete 254
29
23 136 111
20 100 01_HOOFD_375\\complete 5 4
254
3 7 3 119
01_HOOFD_490_2\\complete
06_VD_010\\complete 01_HOOFD_380\\complete
29 7 16
70 253
47 214
4
Term:
06_VD_020\\complete 01_HOOFD_430\\complete 23
47 233
8 01_HOOFD_491\\complete 45 223
5
06_VD_030_1\\complete 11_AH_II_010\\complete
47 246
12 3
26 4 26
9 06_VD_030_2\\complete 184 13_CRD_010\\complete
01_HOOFD_495\\complete 8 49 26
17
3 25 22
1 9 01_HOOFD_110\\complete 06_VD_030_3\\complete 13 01_HOOFD_480\\complete
160 39 233
18 10 14
01_HOOFD_510_1\\complete 3
01_HOOFD_110_2\\complete 06_VD_060_1\\complete
29 51 4
119 41
3 121 22 15 3
01_HOOFD_505\\complete 22 5 1
7 88 01_HOOFD_055\\complete
103
06_VD_060_2\\complete
43
01_HOOFD_490_2\\complete
216
9 3 13 34
1 01_HOOFD_520\\complete 6
29 01_HOOFD_110_1\\complete 01_HOOFD_120\\complete 01_HOOFD_491\\complete
100 140 197
23 63
4 4
Objection and appeal
43 01_HOOFD_494a\\complete
31
01_HOOFD_101\\complete 01_HOOFD_490_3\\complete 15 7
16 24
01_HOOFD_495\\complete
250
3 13 9
20
17 01_HOOFD_510_2\\complete 3
14_VRIJ_010\\complete 4 01_HOOFD_510_2\\complete 3 252
15 29
7 3
11 8 01_HOOFD_515\\complete 17 01_HOOFD_490_5a\\complete
115 31
38 56 10
03_GBH_005\\complete 01_HOOFD_530\\complete
15 29 7 01_HOOFD_519\\complete 51 3
113
5 20 82
26
01_HOOFD_520\\complete 61 3 99 20
143
7 33 6 44 15
227 3 34 19 01_HOOFD_490_3\\complete
56
105 3 11
9 28 01_HOOFD_516\\complete 4 7 159
86
25 99 2 3 8 29
11 6 01_HOOFD_510_4\\complete 7
115
12 23 16 3 19 3
38 6 01_HOOFD_490_5\\complete 47
106
86 3 28 19
01_HOOFD_510_2a\\complete
31
29 5 15 4
16 27 17 01_HOOFD_490_4\\complete
155
3
01_HOOFD_500\\complete 75
117
32 4
49 01_HOOFD_510_0\\complete
32
23 20
01_HOOFD_510_1\\complete
253
17 33
01_HOOFD_505\\complete 01_HOOFD_510_3\\complete
66 115
10 3
01_HOOFD_530\\complete 01_HOOFD_490_1\\complete
104 255
96 30
01_HOOFD_490_1a\\complete
32
Fig. 7: Municipality log clustered on permit type and term description. In this
case, the discovered change in behavior leads to limited insights.
with the biggest group of cases, which now represents almost all behavior in the
log in 2012.
Though there clearly are differences in behavior between the two years, it is
difficult to interpret these findings directly. More analysis is necessary in order to
discover results that are meaningful in a business process management context.
For example, the activities that are executed could be considered when clustering
the cases in the log. As such, clusters would show differences and similarities in
control-flow as well. This indicates that the choice for the clustering perspectives
is essential in finding valuable insights about differences in behavior.
6 Related Work
Although concept drift is a well-studied topic in the data mining and machine
learning communities, little work has been done on detecting concept drift in
business processes. Bose et al. were the first to consider concept drift and change
detection in a process mining setting [5]. In their work, a classification of possible
changes in business processes is given, and statistical hypothesis tests are used
to detect regions of change. Even tough the authors consider the possibility of
change in data attributes, the scope of their work is limited to the detection of
control-flow changes in a process manifested as sudden drifts over a period of time.
More recently, Martjushev et al. built on this work by looking at gradual and
multi-order dynamics to detect concept drift in control-flow [11]. They extend the
105
work in [5] by providing solutions to detect gradual change as well. By considering
multi-order dynamics through the use of an adaptive window technique, process
change occurring at multiple levels of mixed time granularity can be detected.
Maaradji et al. employ statistical tests over the distributions of runs observed in
two consecutive time windows in order to detect concept drift [10]. As noted by
the authors, in order to find differences in process behavior a notion of equivalence
is necessary. In their paper, a notion of run-equivalence is used. It is shown that
drift can be identified fast and accurately by using an adaptive sliding window
technique. As a result, it can be used in an online (streaming) setting as an oracle
as to when a discovered model should be updated.
Weber et al. employ probabilistic deterministic finite automata (PDFA) to
represent the probability distributions generated by process models [17]. Similar
to [5] and [10], statistical hypothesis tests are used to detect whether or not a
distribution has changed significantly from a ground truth. The aim of their
technique is to identify process change as soon as possible, but with confidence
that change is significant, in order to discover a model representing reality as
good as possible. As such, only drift in control-flow is considered. In [6] a different
technique is proposed to automatically detect and manage concept drift in an
online setting. Here, concept drift is detected real-time using an estimation
technique based on abstract interpretation of the process and sequential sampling
of the log. The fitness of prefixes of new samples taken from the log is checked
against that of prefixes of initial samples. A change point is identified when
there is a significant difference between these two points. In the above-mentioned
techniques however, data attributes are not considered. As such, only changes in
control-flow behavior can be discovered.
Trace clustering techniques are often used to find different process variants.
Several trace clustering techniques have been proposed in the field of process
mining, and an extensive comparative analysis of trace clustering techniques
has recently been performed in [12]. Often, however, the temporal dimension
is not considered. In [9], the starting time of each process instance is used as
an additional feature in trace clustering. By combining control-flow and time
features, the clusters formed share both a structural similarity and a temporal
proximity. The technique is based on the technique proposed in [4] and considers
different types of changes, including sudden, recurring, gradual, and incremental
changes. In more complex evolving business processes however, including the
temporal proximity of cases might lead to misleading results. For example when
seasonal drifts are intertwined with gradual changes in the process.
The technique proposed in this paper uses similar ideas and concepts as used
in the papers mentioned above. However, trace clustering is used to find common
and deviating process behavior. By looking at both control-flow as well as data
attributes, the technique is made context-aware. We extend the technique in [8]
by including change detection in behavioral similarities between cases. In this
way, we can identify changes in common and deviating behavior, on both the
control-flow and data perspectives.
106
7 Conclusions and Future Work
Real-life business processes are often complex while exhibiting a high degree
of variability. Due to changing conditions and circumstances, these processes
continuously evolve over time. Existing process mining techniques assume the
process to be static and therefore are less suited for the analysis of contemporary
business processes. In this paper we presented a novel comparative trace clustering
approach that is able to expose differences in behavior in a process. By using
both control-flow and case data we take the process context into account. Insights
can be gained into how and why behavior has changed by comparing changes
in clusterings over different partitions of the log. This information can then be
used for further analysis, e.g. to design protocols, for early detection of unwanted
behavior or for auditing purposes. Besides the time dimension, different data and
control-flow attributes can be utilized in order to distinguish groups of behavior.
While our initial results show that indeed promising insights can be achieved,
there is still quite some manual work involved. More work is needed to further
automate the analysis process. In the future we would also like to look into how
changes in process behavior can be analysed in an online setting. Different ways
to visualize change in behavior should be explored as well.
Bibliography
[1] W.M.P. van der Aalst. Process Mining: Discovery, Conformance and En-
hancement of Business Processes. Springer, Berlin, 2011.
[2] W.M.P. van der Aalst, A. Adriansyah, A.K.A. de Medeiros, F. Arcieri,
T. Baier, T. Blickle, R.P.J.C. Bose, P. van den Brand, R. Brandtjen, J.C.A.M.
Buijs, et al. Process Mining Manifesto. In Business Process Management
Workshops, pages 169–194. Springer, 2012.
[3] R.P.J.C. Bose and W.M.P. van der Aalst. Context Aware Trace Clustering:
Towards Improving Process Mining Results. In Proceedings of the SIAM In-
ternational Conference on Data Mining, pages 401–412. Society for Industrial
and Applied Mathematics, 2009.
[4] R.P.J.C. Bose and W.M.P. van der Aalst. Trace Clustering Based on
Conserved Patterns: Towards Achieving Better Process Models. In Business
Process Management Workshops, pages 170–181. Springer, 2010.
[5] R.P.J.C. Bose, W.M.P. van der Aalst, I. Žliobaitė, and M. Pechenizkiy.
Handling Concept Drift in Process Mining. In Advanced Information Systems
Engineering, pages 391–405. Springer, 2011.
[6] J. Carmona and R. Gavalda. Online Techniques for Dealing with Concept
Drift in Process Mining. In Advances in Intelligent Data Analysis XI, pages
90–102. Springer, 2012.
[7] S. Goedertier, J. De Weerdt, D. Martens, J. Vanthienen, and B. Baesens.
Process Discovery in Event Logs: An Application in the Telecom Industry.
Applied Soft Computing, 11(2):1697–1710, 2011.
107
[8] B.F.A. Hompes, J.C.A.M. Buijs, W.M.P. van der Aalst, P.M. Dixit, and
J. Buurman. Discovering Deviating Cases and Process Variants Using Trace
Clustering. In Proceedings of the 27th Benelux Conference on Artificial
Intelligence (BNAIC), November 5-6, Hasselt, Belgium, 11 2015.
[9] D. Luengo and M. Sepúlveda. Applying Clustering in Process Mining to
Find Different Versions of a Business Process that Changes over Time. In
Business Process Management Workshops, pages 153–158. Springer, 2012.
[10] A. Maaradji, M. Dumas, M. La Rosa, and A. Ostovar. Fast and Accurate
Business Process Drift Detection. In Business Process Management, pages
406–422. Springer, 2015.
[11] J. Martjushev, R.P.J.C. Bose, and W.M.P. van der Aalst. Change Point
Detection and Dealing with Gradual and Multi-order Dynamics in Process
Mining. In Perspectives in Business Informatics Research, pages 161–178.
Springer, 2015.
[12] T. Thaler, S.F. Ternis, P. Fettke, and P. Loos. A Comparative Analysis of
Process Instance Cluster Techniques. In Proceedings of the 12th International
Conference on Wirtschaftsinformatik. Internationale Tagung Wirtschaftsin-
formatik (WI-15), March 3-5, Osnabrck, Germany. Universitt Osnabrck, 3
2015.
[13] B.F. van Dongen. Real-life Event Logs - Hospital Log, 2011. URL: http://dx.
doi.org/10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54, doi:
10.4121/uuid:d9769f3d-0ab0-4fb8-803b-0d1120ffcf54.
[14] B.F. van Dongen. BPI Challenge 2015, 2015. URL: http://dx.doi.org/
10.4121/uuid:31a308ef-c844-48da-948c-305d167a0ec1, doi:10.4121/
uuid:31a308ef-c844-48da-948c-305d167a0ec1.
[15] S. Van Dongen. A Cluster Algorithm for Graphs. Technical report, National
Research Institute for Mathematics and Computer Science in the Netherlands,
2000.
[16] G.M. Veiga and D.R. Ferreira. Understanding Spaghetti Models with Se-
quence Clustering for ProM. In Business Process Management Workshops,
pages 92–103. Springer, 2010.
[17] P. Weber, B. Bordbar, and P. Tino. Real-Time Detection of Process Change
using Process Mining. In Imperial College Computing Student Workshop,
pages 108–114, 2011.
108