Classification Techniques for Conformance and Performance Checking in Process Analysis

Hind Chfouka1, Andrea Corradini1, and Roberto Guanciale2

1 Department of Computer Science, University of Pisa, Italy
{chfouka,andrea}@di.unipi.it
2 School of Computer Science and Communication, KTH Royal Institute of Technology, Stockholm, Sweden
robertog@kth.se

Abstract. Standard process analysis techniques, like conformance checking or performance evaluation, are enabled by the existence of event logs that trace the process executions and by the presence of a model that formally represents the process. Such analysis techniques use only part of the huge amount of data recorded in event logs. In this paper our goal is to exploit this data to extract information useful for conformance checking and performance analysis. We present an approach that, using standard classification techniques, explores how data influence process behavior by affecting its conformance or performance.

1 Introduction

Today, many Information Systems that support the concept of business process record all events occurring during the process execution in event logs. An event log represents a trace of the process behavior that can be observed and analyzed in order to tune the process to the business objectives pursued. Process Mining provides a set of techniques for process discovery and analysis. Through the knowledge extracted from event logs, process discovery allows one to construct a process model. Process analysis instead assumes the existence of a model, and it consists of checking the conformance and performance of the process executions with respect to it.

The work presented in this paper is motivated by the observation that existing event logs are rich in data, but such data is used only in part by the existing process analysis techniques. Since data is a valuable source of knowledge, developing techniques based on data analysis can provide advantages for process discovery and analysis. This fact has already been observed several times in the literature, and it has led to the development of various approaches where machine learning techniques are applied to workflow or process discovery (see e.g. [4, 5, 8, 2]).

Here, instead, we consider more specifically process analysis. Our goal is to find a way to transform the data recorded during the process activities into knowledge useful for the analysis. In particular, we try to determine how data may affect the conformance and performance of process executions. Identifying this influence can provide qualitative information about the causes of possible anomalies in the process behavior, facilitating the task of taking corrective measures. In order to analyze the data contained in event logs, we propose an approach based on machine learning [10], where we exploit classification techniques to find patterns in the data in whose presence conformance errors occur. The same approach has been extended to evaluate how data influence process efficiency, by combining classification problems with performance evaluation: this is only summarized briefly in the paper due to space constraints. Classification is a well-known machine learning technique which consists of identifying to which category, among a given set, a new observation belongs. This is done on the basis of a training set of data containing observations whose category membership is known.

After the presentation of a case study in Section 2, Section 3 introduces some background concepts about process analysis.
First the log replay algorithm is briefly described, then conformance analysis is introduced using the business process of Section 2. Section 4 presents our approach to process analysis based on classification: first a few basic concepts about the classification technique are recalled, then the classification-based conformance analysis is presented in detail and illustrated with our case study, and a possible extension of the approach to performance analysis is briefly discussed. Finally, Section 5 describes the implementation of our approach as ProM6 [12] plug-ins.

2 The Case Study: a Sale Business Process

A business process can be seen as a collection of activities occurring within an organization that lead to a specific goal. There are several formalisms for representing business processes. In this paper we represent them using Petri nets [7, 11], since this makes it possible to exploit some existing process analysis techniques that are based on such nets. However, in a business management context Petri nets may not be expressive enough, and process models are usually presented as workflows, for example using BPMN (Business Process Model and Notation) [6], a standard notation for business process modeling. Transforming a BPMN model into a Petri net is possible thanks to a mapping technique presented in [1].

Figure 1 presents the Petri net model for a sale process.3 This business process abstractly represents a fragment of the procedure followed in a commercial organization for managing orders received from clients. In general this procedure includes several activities, each involving a specific department of the organization. We assume that the sale process begins with the notification of an order from a client, so the first activity is simply called Order. After the order is received, the sale process continues with some activities that can be performed in parallel, since there are no dependencies among them. The FinancialCheck activity represents the financial analysis of the order (i.e., verifying the financial situation of the client). Concurrently, the WarehouseCheck activity checks the availability of the merchandise requested by the order and, in case of a shortage, starts a supplying procedure; this is done through the activity ExternalWarehouse. After the completion of the parallel activities, a synchronization is needed before concluding the sale procedure with the final activity, called Notify, by which a result regarding the acceptance of the order is communicated to the client.

3 The Petri net of Figure 1 is obtained by transforming a BPMN model using the algorithm of [1], mapping each activity to a pair of start/end transitions. The resulting net has been simplified by removing the unnecessary invisible transitions used to encode the Join and Fork gateways [1].

Fig. 1. Petri net for a sale process

3 Process Analysis

Process analysis is performed using the Petri net representing the process and an event log recording the related executions, also called process instances. The basic building blocks of event logs are events, which record the execution of the process actions. An event e can therefore be seen as a tuple including the recorded action, a corresponding timestamp, and possibly other attributes describing relevant information about the state of the process. Events that belong to the same process instance are grouped in a trace, where events are ordered by timestamp.
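To fix a concrete representation, the following minimal sketch shows one possible encoding of events and traces. It is our own illustration: the type and field names are assumptions, not part of the formalization used in the literature.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// An event: the recorded action, its timestamp, and the remaining
// attributes (actors, client data, ...) stored by the information system.
record Event(String action, Instant timestamp, Map<String, String> attributes) {}

// A trace groups the events of a single process instance,
// ordered by timestamp.
record Trace(String instanceId, List<Event> events) {
    Trace { // normalize: keep the events sorted by timestamp
        events = events.stream()
                       .sorted(Comparator.comparing(Event::timestamp))
                       .toList();
    }
}
```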
An event log is a set of traces, recording the activities performed by a system during a finite number of process executions. In this paper the following assumptions are made:

– All traces are instances of the same process.
– For each action there exists a corresponding transition in the net, which for simplicity will be denoted by the same name as the action.

3.1 Log replay algorithm

The key algorithm exploited to analyze a Petri net model with respect to an event log is the log replay algorithm (see e.g. [9]). Given a Petri net model and an event log, for each trace in the log the algorithm starts by placing one token in the start place of the net. For each event in the trace the corresponding transition is fired assuming a non-blocking behavior, and the marking of the net is updated. Non-blocking replay means that whenever the firing of a disabled transition is needed, the algorithm enables the transition either by creating artificial tokens in its pre-set or, if possible, by firing some invisible transitions; the non-determinism in this procedure is resolved with a suitable cost function. For each trace, the result of the log replay is a map indicating which transitions of the model have been fired for each event in the trace, which tokens were artificially created, and which tokens remained upon completion of the trace. The output of the log replay algorithm can be used to check the conformance of the traces.

3.2 Conformance Analysis

The goal of conformance analysis is to check whether a trace complies with the Petri net modeling the business process. Conformance problems can be discovered by analyzing the tokens artificially created during the replay (the missing tokens) and the tokens not consumed (the remaining tokens). Figure 2 presents two different traces, T and T′, of the business process presented in Section 2, where only the actions and the timestamps are reported. The log replay of trace T, which is compliant with the Petri net, terminates with a single remaining token in the end place {p13 → 1} and without reporting missing tokens.

Fig. 2. Conforming (T) and non-conforming (T′) traces

The log replay of trace T′ terminates with remaining tokens {p5 → 1, p13 → 1} and one missing token {p8 → 1}. The missing token is created artificially, and this fact witnesses a wrong execution of the event Notify start: in T′ this event is executed before the termination of the activity FinancialCheck, which is interpreted as a non-conformance to the process model.
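As an illustration of the mechanism just described, the following minimal sketch replays one trace and collects the missing and remaining tokens. It is our own simplification, not the actual algorithm of [9]: invisible transitions and the cost function are omitted, and the trace is assumed to be already mapped to the sequence of transitions to fire, one per event.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A transition consumes one token from each place of its pre-set and
// produces one token in each place of its post-set.
record Transition(String name, Set<String> preSet, Set<String> postSet) {}

class NonBlockingReplay {
    final Map<String, Integer> marking = new HashMap<>();   // place -> tokens
    final Map<String, Integer> missing = new HashMap<>();   // artificially created
    final Map<String, Integer> remaining = new HashMap<>(); // left after replay

    void replay(List<Transition> firings, String startPlace) {
        marking.merge(startPlace, 1, Integer::sum);
        for (Transition t : firings) {
            for (String p : t.preSet()) {
                if (marking.getOrDefault(p, 0) > 0) {
                    marking.merge(p, -1, Integer::sum);
                } else {
                    // Non-blocking behavior: the needed token is created
                    // artificially and recorded as a missing token.
                    missing.merge(p, 1, Integer::sum);
                }
            }
            for (String p : t.postSet()) {
                marking.merge(p, 1, Integer::sum);
            }
        }
        // Tokens still in the net after the trace are the remaining tokens
        // (for the conforming trace T, a single token in the end place p13).
        marking.forEach((p, n) -> { if (n > 0) remaining.put(p, n); });
    }
}
```

Replaying T′ with this sketch would likewise report the missing token in p8 and the remaining tokens in p5 and p13 discussed above.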
4 Classification for Process Analysis

Exploiting machine learning techniques for process analysis is encouraged by the huge amount of data recorded as attributes in event logs. The information implicit in these data can be significant for analyzing the process behavior: it may contain an explanation for the deviations discovered during conformance analysis, and for the performance level provided by the process. For this reason, we are interested in discovering how the process data may influence the conformance and the performance of the process executions. This is done with an approach based on classification, a classical machine learning technique able to detect patterns in the data in correspondence to which the process assumes a specific behavior.

4.1 Classification: Basic Concepts

In a classification problem [10], data is represented by a collection of records (called instances), each of them characterized by a tuple (x, y), where x is a set of attributes and y is a special one called the target attribute. The value of the target attribute is a label identifying the class to which the record belongs. The goal is to learn a classification model that maps each attribute set x to one of the predefined class labels y. The general approach for solving a classification problem consists of providing a training set of records, whose class labels are already known, to the learning algorithm in order to build a classification model. This model is then applied to a test set, which consists of records with unknown class labels. This step allows one to estimate the model accuracy using one of the standard approaches: for example, the accuracy can be measured as the percentage of records correctly predicted with respect to the total number of records in the test set.

Many classification models can be applied to a classification problem. For our purposes we choose the decision tree model, since the rules detected by the classification are shown explicitly, providing an overview of how process data affect the process behavior. In a decision tree, a class label (a possible value of the target attribute) is assigned to each leaf node. The non-terminal nodes, which include the root and the internal nodes, contain attribute test conditions that separate records with different characteristics.

4.2 Classification for Conformance Checking

In order to identify the possible causes of non-conformance, a classification problem can be formulated. The classification dataset is extracted from the process data: for each process instance, the relevant information is extracted and included as a record in the dataset. For the process presented in Section 2, a record includes the following attributes: an instance identifier, a client identifier, the client typology (new or consolidated client), the sales manager responsible for the order, the financial officer who conducts the financial evaluation activities, the warehouseman responsible for the warehouse checking, the name of the person responsible for the supply in case of a provision, and finally the result communicated to the client for the order issued. Furthermore, as target attribute we associate with each record the conformance result for that instance, a boolean value extracted from the output of the log replay. All these attributes are discrete and contribute to build the dataset of the classification problem, presented in Table 1. In this preliminary work, in order to explore the applicability of existing machine learning techniques to process analysis, we used synthetic data generated to emulate process executions. Note that for this running example the attributes selected for the classification task all focus on the actors of the process activities, but in general attributes can characterize several other aspects of the activities.

In order to build the classification problem, the attributes characterizing the process are extracted from the log, while the conformance attribute is computed using the conformance checking algorithm. The Petri net of Figure 3 shows the output of the conformance analysis executed on the event log: edges are labeled with the number of times they were activated, and places are labeled with the number of remaining and missing tokens.

Fig. 3. Petri net: conformance results.
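The extraction step can be sketched with Weka's core API as follows. This is an illustrative fragment rather than the actual plug-in code: only a few of the columns of Table 1 are shown, and the lists of nominal values are placeholders (in practice they are collected from the event log itself).

```java
import java.util.ArrayList;
import java.util.List;
import weka.core.Attribute;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;

class ConformanceDataset {
    // Declares (a subset of) the nominal attributes of Table 1, with the
    // conformance result as the target attribute.
    static Instances newDataset() {
        ArrayList<Attribute> atts = new ArrayList<>();
        atts.add(new Attribute("CltType", List.of("new", "consolidated")));
        atts.add(new Attribute("SalMan", List.of("Marco", "Anna", "Maria", "Mary", "Johann")));
        atts.add(new Attribute("OrdResult", List.of("positive", "negative")));
        atts.add(new Attribute("Conf", List.of("yes", "no"))); // target attribute
        Instances data = new Instances("conformance", atts, 0);
        data.setClassIndex(data.numAttributes() - 1);
        return data;
    }

    // Adds one record, pairing the attributes of a process instance with the
    // boolean conformance result computed by the log replay.
    static void addRecord(Instances data, String cltType, String salMan,
                          String ordResult, boolean conformant) {
        Instance rec = new DenseInstance(data.numAttributes());
        rec.setDataset(data);
        rec.setValue(data.attribute("CltType"), cltType);
        rec.setValue(data.attribute("SalMan"), salMan);
        rec.setValue(data.attribute("OrdResult"), ordResult);
        rec.setValue(data.attribute("Conf"), conformant ? "yes" : "no");
        data.add(rec);
    }
}
```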
The resulting data highlight that 37 out of 250 process executions have conformance errors. The 37 missing tokens in p11 represent the number of times that the log replay algorithm forced the execution of transition Notify start, even if it was not enabled, in order to mimic a corresponding event in the trace. Moreover, the 37 remaining tokens in place p3 indicate the number of times that the financial check activities were not performed. Taking into account the application scenario, we can conclude that 37 instances of the process do not comply with the sale policy, since they did not execute the financial check activity.

Table 1. Dataset for the conformance analysis.

OrdIde  CltIde  CltType       SalMan  FinOff  WrhsMan  SupplyResp  OrdResult  Conf
1       20      consolidated  Marco   Mary    Alex     Gianni      positive   no
2       15      new           Anna    Johann  Roberto  Mario       positive   yes
3       10      consolidated  Maria   Mary    Alessio  Gianni      negative   no
10      18      consolidated  Johann  Mary    Roberto  Gianni      positive   yes
...     ...     ...           ...     ...     ...      ...         ...        ...

Fig. 4. Decision tree for the conformance analysis.

The results of the conformance checking enable the use of existing machine learning tools, allowing us to mine the decision tree of Figure 4 as the classification model for the sale process. The resulting decision tree describes a data pattern in correspondence to which a process instance may present conformance errors: orders managed by the sales manager Mary and received from consolidated clients of the organization may not respect the standard sale procedure.

From the analysis of the sale process based on both the classification and the log replay results, a new scenario of the sale procedure has emerged: orders issued by the consolidated clients of the organization are not checked from a financial point of view. This fact can be accepted as a valid behavior of the sale procedure, in which case an extension of the business process model is needed to include the new scenario (model extension activity). Alternatively, the misbehaving scenario can be considered an anomaly in the sale procedure. In this case, since the analysis provides accurate information about the conformance error, corrective measures can be taken in order to prevent such errors in the future.

The classifier constructed with this approach can also be used in a predictive way: given a trace recording a process execution, the decision tree can be used to predict the conformance result. This brings an advantage in terms of the time required by the analysis, since replaying a process instance takes more time than classifying it with a decision tree. Moreover, it is worth noting that whenever the set of attributes needed for the classification is known before the completion of a process execution, the non-conformance predicted by the decision tree can be used to promptly alert the stakeholders and activate proper countermeasures.

4.3 Extension of the Approach to Performance Analysis

Performance analysis of a process can be carried out by exploiting log replay. Since an event log contains timestamps, during the replay of a trace it is possible to compute, for each place of the Petri net model, performance measures such as the synchronization time (the interval between the arrival of a token in the place and the enabling of a transition in the post-set of the place), the sojourn time (the interval between the arrival and the departure of tokens), and the waiting time (the interval between the enabling of a transition in the post-set of the place and the departure of the token).
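These three measures are determined by three timestamps collected during the replay. A minimal sketch (our own illustration, with assumed names) makes the relationship explicit:

```java
import java.time.Duration;
import java.time.Instant;

// Timing data gathered for one token in one place during replay: when the
// token arrived, when a transition in the post-set of the place became
// enabled, and when the token departed.
record PlaceTimes(Instant arrival, Instant enabling, Instant departure) {

    // Interval between token arrival and enabling of a post-set transition.
    Duration synchronizationTime() { return Duration.between(arrival, enabling); }

    // Interval between enabling of a post-set transition and token departure.
    Duration waitingTime() { return Duration.between(enabling, departure); }

    // Total time spent in the place; by construction it equals the sum of
    // the synchronization and waiting times.
    Duration sojournTime() { return Duration.between(arrival, departure); }
}
```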
An analysis based only on the measures computed by log replay gives just quantitative information about the process performance. In order to understand the causes of performance anomalies that can affect the process behavior, one can explore the process data. Discovering how data attributes can influence process performance provides useful information for analyzing and optimizing the process services. For example, discovering data patterns in correspondence to which some activities need more time to complete than others helps in making decisions about the distribution of resources among the process activities, or in scheduling activities.

In addition to the completion time, one can analyze a process under more complex performance metrics, such as the synchronization time. A process with parallel branches and synchronizations can present bottleneck activities that increase the execution time of other activities and, consequently, the completion time of the entire process. To find out the possible influence of data on synchronization time, the approach presented in Section 4.2 for conformance analysis through classification can easily be extended to performance analysis. For example, a classification problem can be formulated for each synchronization point of the process: the classifier obtained for a synchronization point classifies the process instances, based on their attribute values, with respect to the synchronization time of the point under examination.

5 Implementation with ProM

In order to experiment with some business process prototypes, the approach presented in this paper has been implemented as a set of plug-ins that integrate ProM6 [12], an open-source framework implementing Process Mining tools, and Weka [3], a data mining framework providing machine learning tools. The developed plug-ins fall into three categories. The first one includes plug-ins for building the dataset needed for the classification:

– Generate Instances with Conformance: this plug-in takes as input an event log and its conformance results computed by log replay. It returns as output a training set, as shown in Table 1, in the format needed by the Weka classification tool. Note that the plug-in extracts all the attribute data recorded in the event log; if necessary, the resulting dataset can be subjected to feature selection techniques before being used as a training set for the classification task.
– Generate Instances to Classify: given an event log, the plug-in generates a dataset of instances that can be classified using a classification model.

The second category includes plug-ins for generating the classifier and for using it as a predictive model (see the sketch at the end of this section):

– Generate Classifier: given a training set, the plug-in generates a decision tree model using the J48 algorithm, the implementation of the C4.5 algorithm provided by Weka.
– Classify Instances: given a set of non-classified instances and a classifier model, this plug-in simply classifies the instances according to the classifier.

Finally, the last category includes a set of plug-ins for the serialization, deserialization and visualization of the resources used by the previous plug-ins, such as datasets and classifiers.

Figure 5 shows a data flow diagram that summarizes the data dependencies among the plug-ins discussed in this section.

Fig. 5. Data dependencies among the implemented plug-ins.
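The second category of plug-ins essentially wraps the standard Weka workflow. The following sketch shows what that workflow looks like when driven directly through the Weka API; the file names are placeholders, and the snippet is a simplified stand-in for the actual plug-in code.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class ConformanceClassifier {
    public static void main(String[] args) throws Exception {
        // Training set as produced by "Generate Instances with Conformance";
        // the last attribute (Conf) is the target.
        Instances train = DataSource.read("conformance-training.arff");
        train.setClassIndex(train.numAttributes() - 1);

        // J48 is Weka's implementation of C4.5 ("Generate Classifier").
        J48 tree = new J48();
        tree.buildClassifier(train);
        System.out.println(tree); // textual rendering of the decision tree

        // Estimate the model accuracy with 10-fold cross-validation.
        Evaluation eval = new Evaluation(train);
        eval.crossValidateModel(tree, train, 10, new Random(1));
        System.out.println(eval.toSummaryString());

        // Predict the conformance of unlabelled instances, as done by the
        // "Classify Instances" plug-in.
        Instances unlabelled = DataSource.read("instances-to-classify.arff");
        unlabelled.setClassIndex(unlabelled.numAttributes() - 1);
        for (int i = 0; i < unlabelled.numInstances(); i++) {
            double label = tree.classifyInstance(unlabelled.instance(i));
            System.out.println(unlabelled.classAttribute().value((int) label));
        }
    }
}
```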
6 Conclusions and Future Work

We presented some preliminary results of a research activity aimed at applying machine learning techniques, namely classification algorithms, in the realm of process analysis. The idea is to identify correlations between the values of certain data attributes in the process instances of event logs and the results of conformance checking. A successful identification of such correlations could allow one to predict the conformance result of an instance during its execution. The same technique was applied to performance checking as well, but it is only briefly summarized here because of space constraints. The approach is implemented with a combination of plug-ins of the Process Mining framework ProM and of the classification engine Weka. The approach has been successfully tested on synthetic event logs only. An obvious continuation of this work includes experimentation with real event logs: this will allow us to evaluate the scalability of the approach and its robustness in the presence of noise.

Classification techniques in the Process Mining field have already been explored in [8] and more recently in [2]. In those papers the idea is to explore how data influence the case routing of process flow execution, by associating a classification problem with each decision point in the model. In particular, [2] significantly improves the results of [8] by exploiting some recently developed alignment techniques between process instances and the process model. We intend to explore how far these techniques can be fruitfully applied in our framework as well.

References

1. R. Bruni, A. Corradini, G. L. Ferrari, T. Flagella, R. Guanciale, and G. Spagnolo. Applying process analysis to the Italian e-government enterprise architecture. In M. Carbone and J.-M. Petit, editors, WS-FM, volume 7176 of Lecture Notes in Computer Science, pages 111–127. Springer, 2011.
2. M. de Leoni and W. M. P. van der Aalst. Data-aware process mining: discovering decisions in processes using alignments. In S. Y. Shin and J. C. Maldonado, editors, SAC, pages 1454–1461. ACM, 2013.
3. M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten. The WEKA data mining software: an update. SIGKDD Explorations, 11(1):10–18, 2009.
4. J. Herbst. A machine learning approach to workflow management. In R. L. de Mántaras and E. Plaza, editors, ECML, volume 1810 of Lecture Notes in Computer Science, pages 183–194. Springer, 2000.
5. L. Maruster. A machine learning approach to understand business processes. PhD thesis, Eindhoven University of Technology, 2003.
6. OMG. Business Process Model and Notation. http://www.bpmn.org/, 2011.
7. C. A. Petri. Fundamentals of a theory of asynchronous information flow. In IFIP Congress, pages 386–390, 1962.
8. A. Rozinat and W. M. P. van der Aalst. Decision mining in ProM. In S. Dustdar, J. L. Fiadeiro, and A. P. Sheth, editors, Business Process Management, volume 4102 of Lecture Notes in Computer Science, pages 420–425. Springer, 2006.
9. A. Rozinat and W. M. P. van der Aalst. Conformance checking of processes based on monitoring real behavior. Inf. Syst., 33(1):64–95, 2008.
10. P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison Wesley, May 2005.
11. W. M. P. van der Aalst. The application of Petri nets to workflow management. Journal of Circuits, Systems, and Computers, 8(1):21–66, 1998.
12. W. M. P. van der Aalst, B. F. van Dongen, C. W. Günther, A. Rozinat, E. Verbeek, and T. Weijters. ProM: The process mining toolkit. In A. K. A. de Medeiros and B. Weber, editors, BPM (Demos), volume 489 of CEUR Workshop Proceedings. CEUR-WS.org, 2009.