
Capturing the Sudden Concept Drift in Process Mining

Manoj Kumar M V

Likewin Thomas

likewinthomasg@nitk.ac.in

Annappa B

annappa@ieee.org

Department of Computer Science and Engineering, National Institute of Technology Karnataka, Surathkal, Mangalore - 575025, INDIA

132 143

Concept drift is the condition in which a process changes during the course of its execution. Current process mining methods and analysis techniques are not proficient at analyzing a process that has experienced concept drift. State-of-the-art process mining approaches treat the process as a static entity and assume that it remains the same from the beginning of its execution period to the end. This paper proposes a technique for localizing concept drift in the control-flow perspective by making use of the activity correlation strength feature extracted from the process log. Concept drift in the process is localized by applying statistical hypothesis testing methods. The proposed method is verified and validated on a few real-life and artificial process logs; the results obtained are promising in the direction of efficiently localizing sudden concept drifts in a process log.

Process mining is a fairly new research discipline that sits between process modeling and analysis on the one hand, and computational intelligence and data mining on the other. The idea of process mining is to discover, monitor and improve operational, electronic and embedded processes by using the data recorded in process logs [4].

Process mining comprises (automated) process discovery (i.e., mining process models), conformance checking (i.e., monitoring deviations by matching model and log), social network/organizational mining, automated creation of simulation models, model extension, model repair, case prediction, and history-based recommendations, as shown in Fig. 1.

There are two main reasons for the increasing attention to process mining. First, more and more events are being logged, providing detailed information about the history of processes. Second, there is a need to improve and support business processes in competitive and rapidly changing environments.

Process mining techniques offer a means to more rigorously check compliance and ascertain the validity and reliability of information about an organization's core processes.

The starting point for process mining is the availability of an appropriate event log. All process mining methods assume that it is possible to record events sequentially. Each event refers to an activity (i.e., a well-defined step in some process) and is related to a particular case (i.e., a process instance). Event logs may store extra information about events. In fact, whenever possible, process mining techniques use extra information such as the resource (i.e., person or device) executing or initiating the activity and the time-stamp of the event.

The remaining sections of this paper are structured as follows. Section 2 discusses concept drift with a brief example. Section 3 describes the terminology and notation used in this paper. Section 4 presents the methodology used to localize sudden concept drift. The results of our experiments are given in Section 5, the related literature is reviewed in Section 6, and the paper ends with some concluding remarks.

2 Concept drift

The process-centric analysis methods and techniques available in process mining are capable of generating excellent insight into the working of an operational process. If the process is not static in nature, however, the presently available process mining methods cannot be applied for the analysis. The main erroneous assumption made by the available process mining techniques is that "the process at the end of its execution is the same as the process at the beginning of its execution" [12]. This is often not the case, because the process may change during the period of execution. Currently available process mining algorithms fail to consider the changes that happen in the process during its execution.

The possibility of concept drift has unfortunately been neglected in the methods proposed in the area of process mining. Ignoring the changes in the process makes the end results of the analysis obsolete.

An end-to-end solution for the phenomenon of concept drift can only be achieved by taking into account the sub-problems involved, the perspectives of change, the change types, the change patterns and the duration of change, as shown in Fig. 2. Change detection and change localization are the two major sub-problems. Control-flow, data, case and organizational are the four main process perspectives. Sudden, recurring, incremental and gradual are the four change types that are normally observed. The most commonly observed change patterns in the control-flow perspective are shown in Fig. 2(c). Refer to [5, 6, 7] for more about the different control-flow, resource and data patterns that can be observed in an operational process.

Table 1. Traces of the repair process

Trace set-1                           Trace set-2
t1  {r, i, c, d, g, rp, s, rc}        t9   {r, i, u, c, d, g, rp, s, rc}
t2  {r, u, d, c, g, rp, s, rc}        t10  {r, u, i, c, d, g, rp, s, rc}
t3  {r, i, c, d, g, t, rc}            t11  {r, i, u, c, d, g, t, rc}
t4  {r, u, d, c, g, t, rc}            t12  {r, u, i, c, d, g, t, rc}
t5  {r, i, d, c, g, rp, s, rc}        t13  {r, i, c, u, d, g, rp, s, rc}
t6  {r, u, c, d, g, t, rc}            t14  {r, c, i, u, d, g, t, rc}
t7  {r, i, d, c, g, rp, s, rc}        t15  {r, c, i, u, d, g, rp, s, rc}
t8  {r, i, d, c, g, t, rc}            t16  {r, i, u, d, c, g, rp, s, rc}

For example, consider the process model shown in Fig. 3(a), which represents the repair process of electronic products in a company and is modeled in Petri net notation. A Petri net is a bipartite graph consisting of places (circles) and transitions (rectangles). A transition becomes enabled when each of its input places holds at least one token. Upon firing, a transition consumes a token from each of its input places and produces a token in each of its output places. The model shown in Fig. 3 is drawn using Colored Petri Net (footnote 1) Tools (CPN Tools, footnote 2). The process model in Fig. 3(a) has a set of 10 different activities. In Fig. 3(a), the transition sp, drawn as a double rectangle, represents a sub-process.
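The Petri net firing rule described above (a transition is enabled when every input place holds a token; firing consumes one token per input place and produces one per output place) can be sketched in a few lines of Python. This is only an illustration and not part of the CPN Tools model: the place names p0, p1 and the two-transition fragment are hypothetical, and a marking is represented simply as a token count per place.

```python
from collections import Counter


def is_enabled(marking, inputs):
    """A transition is enabled when every input place holds at least one token."""
    return all(marking[place] >= 1 for place in inputs)


def fire(marking, inputs, outputs):
    """Consume one token from each input place and produce one in each output place."""
    if not is_enabled(marking, inputs):
        raise ValueError("transition is not enabled")
    new_marking = Counter(marking)
    for place in inputs:
        new_marking[place] -= 1
    for place in outputs:
        new_marking[place] += 1
    return +new_marking  # unary plus drops places that hold no tokens


# Hypothetical fragment: p0 -[r]-> p1 -[i]-> p2
marking = Counter({"p0": 1})
marking = fire(marking, inputs=["p0"], outputs=["p1"])  # fire transition r
can_inspect = is_enabled(marking, ["p1"])               # transition i is now enabled
```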

Fig. 3. (a) Repair process modeled in Petri net notation. (b) Sub-process of the repair process before the occurrence of concept drift. (c) Sub-process of the repair process after the occurrence of concept drift.

The activities of the process in Fig. 3(a) are r = receive repair request, i = inspect item, u = update database, c = check warranty, d = decide the cost of repair, g = get the approval from customer, rp = repair product, s = send bill and collect charges, t = terminate the repair process and rc = return item and close case. Table 1 shows the traces of the repair process. According to the process log shown in Table 1, the process experiences concept drift after t8, i.e., the traces t1 to t8 represent the process before the change and t9 to t16 are the traces possible after the process change.

Footnote 1: Coloured Petri nets (CPN) are a backward compatible extension of the concept of Petri nets. CPN preserve useful properties of Petri nets and at the same time extend the initial formalism to allow the distinction between tokens.

Footnote 2: http://www.cpntools.org

Before concept drift (before t9), only one of the activities inspect item or update database can be observed in each trace of the log shown in Table 1. After the occurrence of concept drift (after t8), both inspect item and update database can be observed. This example illustrates the effect of concept drift on a process. If we employ the process discovery methods available in process mining to construct the process model from the process log shown in Table 1, the outcome will be the process model in Fig. 3 with the excerpt shown in Fig. 3(a) as the sub-process replacing the activity sp.
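The observation about Table 1 can be checked mechanically. The sketch below encodes the sixteen traces with the activity abbreviations used above and reports the first trace in which both inspect item (i) and update database (u) occur. It only illustrates the example and is not the localization method proposed in this paper.

```python
# Traces t1..t16 from Table 1, one string per trace.
log = [
    "r i c d g rp s rc", "r u d c g rp s rc", "r i c d g t rc", "r u d c g t rc",
    "r i d c g rp s rc", "r u c d g t rc", "r i d c g rp s rc", "r i d c g t rc",
    "r i u c d g rp s rc", "r u i c d g rp s rc", "r i u c d g t rc", "r u i c d g t rc",
    "r i c u d g rp s rc", "r c i u d g t rc", "r c i u d g rp s rc", "r i u d c g rp s rc",
]
traces = [line.split() for line in log]

# True for every trace that contains both activities.
has_both = [("i" in trace) and ("u" in trace) for trace in traces]

# 1-based number of the first trace showing the changed behaviour.
first_drifted = has_both.index(True) + 1
```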

3 Event class and Event class correlation

Let $A$ be a set of activity names. A trace is a sequence of activities, i.e., $\sigma \in A^{*}$. A simple event log $L$ is a multi-set of traces over $A$, i.e., $L \in \mathbb{B}(A^{*})$.

Definition (Event, log trace, log). Let $E$ be a set of unique log events. $l$ is a log trace over $E$ if and only if $l$ is a non-repeating sequence on $E$. A set of log traces $L$ is a log over $E$ if and only if all log traces $l \in L$ are log traces over $E$ and $\forall l_1, l_2 \in L : (set(l_1) \cap set(l_2) \neq \emptyset) \rightarrow (l_1 = l_2)$.

Using the definition of event, trace and log, the event class can be defined as follows.

Definition (Event class). $c \in E \rightarrow C$ maps each event to its event class, where $C$ is the set of event classes.

The set of event classes for a log trace $l$ is defined as follows:

$$C(l) = \{ c(e) \mid e \in l \}$$

The set of event classes for a log $L$ is defined as follows:

$$C(L) = \bigcup_{l \in L} C(l)$$
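The definitions above translate directly into code. In the following sketch the `Event` record, the choice of the activity name as the event class, and the helper `make_trace` are assumptions made only for illustration; they are not prescribed by the definitions.

```python
from typing import NamedTuple


class Event(NamedTuple):
    case_id: str    # trace the event belongs to
    position: int   # index within the trace, which makes every event unique
    activity: str   # activity name recorded for the event


def c(event):
    """Event class function c: here the class is the activity name (assumption)."""
    return event.activity


def classes_of_trace(trace):
    """C(l): the set of event classes occurring in a log trace l."""
    return {c(e) for e in trace}


def classes_of_log(log):
    """C(L): the union of C(l) over all traces l in the log L."""
    return set().union(*(classes_of_trace(trace) for trace in log))


def make_trace(case_id, activities):
    """Hypothetical helper that builds a trace of unique events."""
    return [Event(case_id, k, a) for k, a in enumerate(activities)]


log = [make_trace("t1", ["r", "i", "c"]), make_trace("t2", ["r", "u", "d", "c"])]
```

Because every event carries its case identifier and position, two different traces never share an event, which is the disjointness condition required of a log.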

Let $C$ be a set of event classes. The function $ecc \in C \times C \rightarrow \mathbb{R}_0^{+}$ assigns to each tuple of event classes a correlation value. The larger this value is, the more related the two respective event classes are.

In our method we define the correlation function among event classes by scanning the whole log. We begin with a matrix of size $|C| \times |C|$, initialized with zero values before the actual scanning pass. While traversing the log, this matrix is updated for every following relation that is found. The correlation matrix, as well as the correlation function itself, is symmetric, i.e., $ecc(X, Y) = ecc(Y, X)$. During the scanning pass, this symmetry needs to be preserved by the algorithm.

Consider Fig. 4, in which the scanner is presently examining an event of class $e_1$. We call the event presently under consideration the reference event. Looking at the directly preceding event of class $e_2$, the scanner can establish an observation of the co-occurrence between event classes $e_1$ and $e_2$, which means that their association is strengthened. Correspondingly, the correlation matrix value for $ecc(e_1, e_2)$ is incremented by $i$, the increment value (generally set to 1). In our method, the scanning pass uses a look-forward window for evaluating each event. This means that if the look-forward window size is seven, the scanner will consider the seven events that follow the reference event. When evaluating events in the look-forward window, the scanner attenuates its measurement exponentially, based on an attenuation factor $a$, where $0 \le a \le 1$.

For any event $y$ in the look-forward window, where $x$ is the reference event, the correlation matrix is updated as

$$ecc(c(x), c(y)) = ecc(c(x), c(y)) + (i \cdot a^{n}) \qquad (1)$$

where $n$ is the number of events located between $x$ and $y$ in the trace.
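As a worked illustration of this update, assume an increment $i = 1$ and an attenuation factor $a = 0.5$ (both values are chosen here only for the example). The amount $i \cdot a^{n}$ added to the correlation value then decays with the number $n$ of events located between $x$ and $y$:

```latex
\begin{align*}
n = 0 &: \quad i \cdot a^{0} = 1 \\
n = 1 &: \quad i \cdot a^{1} = 0.5 \\
n = 2 &: \quad i \cdot a^{2} = 0.25
\end{align*}
```

An event directly adjacent to the reference event therefore contributes the full increment, while more distant events in the window contribute exponentially less.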

After the scanning pass has evaluated all events in all traces of the log, a reliable correlation function between event classes is established, as expressed in the aggregated correlation matrix. Our correlation function thus treats two event classes as more closely related if events of these classes frequently occur close together in traces of the log.
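The scanning pass can be sketched as follows. This is a minimal illustration under stated assumptions rather than the exact implementation used in our experiments: a trace is represented as a plain list of event class names, the function name and the default parameter values are hypothetical, and pairs of events with the same class are skipped.

```python
from collections import defaultdict


def event_class_correlation(log, window=7, increment=1.0, attenuation=0.5):
    """Accumulate a symmetric correlation value for every pair of event classes.

    log: list of traces, each trace a list of event class names.
    """
    ecc = defaultdict(float)
    for trace in log:
        for pos, x in enumerate(trace):
            # Events that follow the reference event x, at most `window` of them.
            for n, y in enumerate(trace[pos + 1 : pos + 1 + window]):
                if x == y:
                    continue  # assumption: identical classes are not correlated
                value = increment * attenuation ** n  # n events lie between x and y
                ecc[(x, y)] += value
                ecc[(y, x)] += value  # keep the matrix symmetric
    return dict(ecc)


ecc = event_class_correlation([["a", "b", "c"]], window=2)
```

In the one-trace example, the adjacent pairs (a, b) and (b, c) each receive the full increment, while (a, c), separated by one event, receives the attenuated value 0.5.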

Concept drift is the condition where the process experiences change during the period under analysis. We believe that the characteristic manifestation of feature values changes before and after the occurrence of concept drift. By considering the sequential order of process instances in the log, we apply a windowing strategy for selecting the instances for processing and to localize the occurrence of concept drift. Statistical hypothesis tests (footnote 3) are used to examine differences between successive populations of feature values obtained using event class correlation.

4 Methodology

Footnote 3: Hypothesis testing is a systematic way to test claims or ideas about a group or population, using data measured in a sample.

Algorithm 1 Algorithm to detect concept drift using event class correlation
Require: Process log with concept drifts
    sub_logs ← 0                                  // set the initial value to 0
    sub_logs ← split_log(process_log, size)
    num_sub_logs ← sub_logs.size()
    while num_sub_logs ≠ 0 do
        i ← 0
        activities ← get_activities_of_sub_log(sub_log[i])   // get the number of activities in each sub log
        i ← i + 1
        cor[size(activities)][size(activities)] ← 0
        for all case_i ∈ sub_logs_i do
            subcase ← 0
            for all event_i ∈ case_i do
                look_back ← 0
                for all event_j ∈ case_i do
                    if name(event_i) ≠ name(event_j) then
                        if look_back ≤ size then
                            cor[event_i][event_j] ← cor[event_i][event_j] + (i · a^look_back)   // calculate the ecc of event classes i and j
                            look_back ← look_back + 1
                        end if
                    end if
                end for
            end for
        end for
        num_sub_logs ← num_sub_logs − 1
        level_of_significance ← 0.05              // set the level of significance (alpha value)
        test_statistic ← test_hypothesis(cor, hypothesis_test_name, window_size, num_of_populations)   // perform the hypothesis test
        P_value ← compute_P_value(test_statistic)
        if P_value ≤ level_of_significance then
            Reject H0 and declare concept drift   // deciding the validity of H0
        end if
    end while

The standard process of statistical hypothesis testing comprises four phases:

- S1: Formulate the null hypothesis (H0) and the alternative hypothesis (H1).
- S2: Identify a test statistic that can be used to assess the trustworthiness of H0.
- S3: Calculate the P-value (the probability of obtaining a sample outcome, given that H0 is true).
- S4: Compare the P-value to a statistical significance level $\alpha$. If $P \le \alpha$, the observed effect is statistically significant, H0 is rejected, and H1 is considered valid.

H0 can be stated as follows: there is no significant difference in the manifestation of consecutive populations of feature values.

The null hypothesis is considered true until it is proven false. When the null hypothesis is rejected, the alternative hypothesis (H1: there is a significant difference in the manifestation of feature values) is accepted and the occurrence of concept drift is declared.

The complete procedure for applying the hypothesis tests to consecutive populations of ecc values is shown in Algorithm 1. For detecting and localizing concept drift in the process, we choose statistical hypothesis tests that are two-sample (since we need to analyze two samples of the population at a given point in time), independent (since the two samples do not depend on each other), non-parametric (since we do not know the a priori distribution of the feature values in an event log), and univariate or multivariate (univariate tests deal with scalar data and multivariate tests deal with vector data).

Using the windowing strategy as the instance selection method, successive populations of feature values are compared and examined to discover any significant difference. A significant difference between feature values is only observed during a change in the process. Depending on the requirements of our problem and based on the characteristics of the tests described in the previous paragraph, we consider the Mann-Whitney U test and the Moses test for equal variability. The Mann-Whitney U test is used to answer the question "do two independent samples represent two populations with different median values (or different distributions with respect to the rank-orderings of the scores in the two underlying population distributions)?" The Moses test for equal variability is used to answer the question "do two independent samples represent two populations with different variances?"
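The windowed comparison can be sketched as below. This is an illustration under stated assumptions, not the implementation used for our experiments: the Mann-Whitney U test is computed with the two-sided normal approximation and without a tie correction of the variance, each instance is reduced to a single scalar feature value, and the function names and the synthetic series are hypothetical.

```python
import math
from statistics import NormalDist


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation). Returns (U, p_value)."""
    n1, n2 = len(x), len(y)
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    k = 0
    while k < len(pooled):
        j = k
        while j < len(pooled) and pooled[j][0] == pooled[k][0]:
            j += 1
        midrank = (k + 1 + j) / 2  # average of the 1-based ranks k+1 .. j
        rank_sum_x += midrank * sum(1 for _, group in pooled[k:j] if group == 0)
        k = j
    u1 = rank_sum_x - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mean = n1 * n2 / 2
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    p = min(1.0, 2 * NormalDist().cdf((u - mean) / sd))
    return u, p


def window_p_values(values, window):
    """Compare every pair of adjacent windows; return (boundary index, p-value) pairs."""
    results = []
    for start in range(len(values) - 2 * window + 1):
        left = values[start : start + window]
        right = values[start + window : start + 2 * window]
        results.append((start + window, mann_whitney_u(left, right)[1]))
    return results


# Synthetic feature series: the level shifts suddenly at index 20.
values = [1, 2, 3, 4, 5] * 4 + [11, 12, 13, 14, 15] * 4
p_values = window_p_values(values, window=10)
boundary, p = min(p_values, key=lambda item: item[1])
drift_declared = p <= 0.05  # reject H0 at the 0.05 level of significance
```

The boundary with the smallest P-value is taken here as the most likely change point; other selection rules, such as reporting every boundary at which H0 is rejected, are equally possible.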

5 Experiments and Results

Table 2. Process logs used in the experiments

Log  Process                                 Cases    Activities  Events     creo, crep, cins, cdel
L1   Loan application process                13,087   36          2,62,200   5,000  7,500  -
L2   Volvo IT incident management process    7,554    13          65,533     -  3,000  4,000
L3   Insurance claim process                 500      21          7,033      -  -  200  400

The process before the occurrence of concept drift represents a different version of the process than the one after the occurrence of concept drift. Concept drift can be observed in a process any number of times.

It is very hard to find a real-life operational process log that contains concept drift. Process mining does not have any standard data set or workbench for testing the credibility of algorithms that detect and localize concept drift. There are a few real-life standard datasets available (http://data.3tu.nl/repository/, http://www.processmining.org/logs/start), but they are not appropriate for testing algorithms dealing with concept drift. In our experiments, we have taken appropriate data sets from an open repository of process logs and artificially induced concept drift in the control-flow perspective of the process.

[Figure: p-value plotted against trace number, panels (a) and (b)]

Process logs from the open process log repository are used and modified to include concept drift. We used Colored Petri Net (CPN) Tools with the CPNXES library (footnote 5) for creating synthetic process logs. The approach proposed in this paper is tested on the 3 different logs shown in Table 2.
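To make the idea of artificially inducing a sudden control-flow change concrete, the sketch below applies four kinds of change (rearranging, replacing, inserting and deleting an activity) to every trace from a chosen position onward. It is a hypothetical illustration only: the logs used in this paper were produced with CPN Tools, the function names are invented for the example, and rearranging is interpreted here as swapping two activities.

```python
def reorder(trace, a, b):
    """Rearrange: swap the positions of activities a and b (one possible reading)."""
    out = list(trace)
    ia, ib = out.index(a), out.index(b)
    out[ia], out[ib] = out[ib], out[ia]
    return out


def replace(trace, old, new):
    """Replace one activity with another."""
    return [new if act == old else act for act in trace]


def insert(trace, new, after):
    """Insert a new activity directly after an existing one."""
    out = list(trace)
    out.insert(out.index(after) + 1, new)
    return out


def delete(trace, activity):
    """Delete an existing activity."""
    return [act for act in trace if act != activity]


def induce_drift(log, change, at):
    """Apply `change` to every trace from index `at` onward (sudden drift)."""
    return [list(t) if k < at else change(t) for k, t in enumerate(log)]


base = ["r", "i", "c", "d", "g", "t", "rc"]
drifted = induce_drift([base] * 4, lambda t: insert(t, "u", after="i"), at=2)
```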

- creo: Rearranging activities.
- crep: Replacing one activity with another.
- cins: Inserting a new activity.
- cdel: Deleting an existing activity.

Footnote 5: https://westergaard.eu/2011/07/prom-package-documentation-keyvalue/

The process logs and models used in this paper can be downloaded at http://www.cse.nitk.ac.in/researchscholars/manoj-kumar-m-v

6 Related Work

The term concept drift was coined by Schlimmer et al. in 1986 in the article "Incremental learning from noisy data" [8]. The phenomenon of concept drift is known by many names in other research disciplines (as covariate shift in machine learning, as load shedding in databases, as temporal evolution in information retrieval, etc.). Efficiently handling concept drift is an important concern in every data analysis discipline [2]; unfortunately, it has been largely neglected in process mining. According to [2], concept drift is a non-stationary learning problem over time, and [1] describes drift as the process of changing the process. The core difficulty when dealing with the concept drift problem is uncertainty about the future. It can be assumed, estimated or predicted, but there is no certainty.

Some efforts have been made to find different versions of the control-flow perspective of a process using clustering and classification techniques available in data mining [9, 10, 11]. Finding different versions of the process does not consider the type, pattern and perspective of concept drift. Hence, these techniques are not a suitable means for solving the phenomenon of concept drift.

ProM is an open-source process mining framework consisting of more than 1,200 (footnote 6) plug-ins and plug-in variants that can be used for solving different process mining problems, of which only one or two are capable of addressing the problem of concept drift. To our knowledge, the works in the literature that address concept drift in process mining are [12, 13, 14]. The technique proposed in [12] is tested in a real setting and the results are documented in [13]. Both [12, 13] propose extracting different global and local features from the process log and applying statistical hypothesis testing for detecting and localizing concept drift. The techniques shown in [12, 14] propose offline and online methods for detecting and localizing sudden concept drift in the control-flow perspective of a process. The idea of extracting the Event Class Correlation (ecc) feature is taken from [3]. An end-to-end solution for the problem of concept drift can only be accomplished if it is addressed by considering all perspectives, types and patterns of change shown in Fig. 2. This paper proposes a method for localizing sudden concept drift in the control-flow perspective of the process using the event class correlation feature and statistical hypothesis testing methods.

7 Conclusion

Handling the phenomenon of concept drift efficiently is a prime concern in all disciplines that deal with data analysis. Concept drift is the situation in which a process experiences changes in its associated perspectives during the period of its execution. The configuration of the process before the occurrence of concept drift is different from the process after the occurrence of concept drift. State-of-the-art process-centric analysis techniques available in process mining behave poorly when employed to analyze a process that has experienced concept drift, because they consider the process as a static entity. However, a process represents the dynamic aspect of the organization and can evolve in any perspective, showing any change pattern and exhibiting several different change types during its execution. This paper proposes the extraction of the event class correlation feature for localizing sudden concept drift in the control-flow perspective of the operational process. The results of the experimental study show that the proposed methods are capable of localizing concept drift efficiently. Our future work includes extending the proposed methods to work in an on-line setting for sudden and gradual drift detection and localization.

Footnote 6: http://www.promtools.org/doku.php?id=packdocs

1. Gama, João, et al. "A survey on concept drift adaptation." ACM Computing Surveys (CSUR) 46.4 (2014): 44.

2. Žliobaitė, Indrė. Learning under concept drift: an overview. Technical report, Vilnius University, 2009.

3. Günther, Christian W., Anne Rozinat, and Wil MP Van Der Aalst. "Activity mining by global trace segmentation." Business Process Management Workshops. Springer Berlin Heidelberg, 2010.

4. Van Der Aalst, Wil, et al. "Process mining manifesto." Business Process Management Workshops. Springer Berlin Heidelberg, 2012.

5. Russell, Nick, Arthur Ter Hofstede, and Nataliya Mulyar. Workflow control-flow patterns: A revised view. Citeseer, 2006.

6. Ter Hofstede, Arthur, David Edmond, and Wil MP van der Aalst. "Workflow resource patterns" (2005): 13-17.

7. Russell, Nick, Arthur HM Ter Hofstede, David Edmond, and Wil MP van der Aalst. Workflow data patterns. QUT Technical report, FIT-TR-2004-01, Queensland University of Technology, Brisbane, 2004.

8. Schlimmer, Jeffrey C., and Richard H. Granger Jr. "Incremental learning from noisy data." Machine Learning 1.3 (1986): 317-354.

9. Song, Minseok, Christian W. Günther, and Wil MP Van der Aalst. "Trace clustering in process mining." Business Process Management Workshops. Springer Berlin Heidelberg, 2009.

10. Luengo, Daniela, and Marcos Sepúlveda. "Applying clustering in process mining to find different versions of a business process that changes over time." Business Process Management Workshops. Springer Berlin Heidelberg, 2012.

11. Bose, RP Jagadeesh Chandra, and Wil MP van der Aalst. "Context Aware Trace Clustering: Towards Improving Process Mining Results." SDM. 2009.

12. Bose, RP Jagadeesh Chandra, et al. "Handling concept drift in process mining." Advanced Information Systems Engineering. Springer Berlin Heidelberg, 2011.

13. Bose, RP Jagadeesh Chandra, et al. "Dealing with concept drifts in process mining." Neural Networks and Learning Systems, IEEE Transactions on 25.1 (2014): 154-171.

14. Carmona, Josep, and Ricard Gavalda. "Online techniques for dealing with concept drift in process mining." Advances in Intelligent Data Analysis XI. Springer Berlin Heidelberg, 2012. 90-102.