The FeaturePrediction Package in ProM: Correlating Business Process Characteristics*

Massimiliano de Leoni and Wil M.P. van der Aalst

Eindhoven University of Technology, Eindhoven, The Netherlands
{m.d.leoni, w.m.p.v.d.aalst}@tue.nl

Abstract. In process mining, one is often interested not only in learning process models but also in answering questions such as “What do the cases that are late have in common?”, “What characterizes the workers that skip this check activity?” and “Do people work faster if they have more work?”. Such questions can be answered by combining process mining with classification (e.g., decision tree analysis). Several authors have proposed ad-hoc solutions for specific questions, e.g., there is work on predicting the remaining processing time and on recommending activities to minimize particular risks. This paper reports on a tool, implemented as a plug-in for ProM, that unifies these ideas and provides a general framework for deriving and correlating process characteristics. To demonstrate the maturity of the tool, we show the steps taken with the tool to answer one correlation question related to a health-care process. The answer to a second question is shown in the screencast accompanying this paper.

1 Introduction

Process mining is not only about automatically learning process models. It is also concerned with replaying event logs on the model to, e.g., check conformance or uncover bottlenecks in the process. However, such analyses are often only the starting point, providing initial insights. When discovering a bottleneck or a frequent deviation, one would like to understand why it exists. This requires the correlation of different process characteristics.
These characteristics can be based on the control-flow (e.g., the next activity going to be performed), the data-flow (e.g., the amount of money involved), the time perspective (e.g., the activity duration or the remaining time to the end of the process), the organization perspective (e.g., the resource going to perform a particular activity), or, in case a normative process model exists, the conformance perspective (e.g., the skipping of a mandatory activity).

The study of these characteristics and how they influence each other is of crucial importance when an organization aims to improve and redesign its own processes. Many authors have proposed techniques to relate specific characteristics in an ad-hoc manner, such as to predict the remaining processing time of a case or to analyze routing decisions in the process or possible risks (see [1] for a detailed literature analysis). These problems are specific instances of a more general problem, which is concerned with relating any process or event characteristic to other characteristics associated with single events or the entire process. This paper reports on a tool that solves the more general correlation problem. The tool unifies the ad-hoc approaches described in the literature by providing a generic way to relate any characteristic (dependent variable) to other characteristics (independent variables). Readers are referred to [1] for a thorough introduction to the framework.

* Dr. de Leoni conducted this work when also affiliated with University of Padua, Italy, and financially supported by the Eurostars - Eureka project PROMPT (E!6696).
Copyright © 2014 for this paper by its authors. Copying permitted for private and academic purposes.

The starting point is an event log. For each process instance (i.e., case), there is a trace, i.e., a sequence of events. Events are associated with different characteristics, represented as key-value pairs. Mandatory characteristics are activity and timestamp.
Other typical characteristics are the resource used to perform the activity, transactional information (start, complete, suspend, resume, etc.), and costs. However, many more characteristics can be associated with an activity (e.g., the age of a patient or the size of an order).

The tool builds a table where each row corresponds to a different event and each column is a different characteristic. One of the columns becomes the dependent characteristic and the others are the independent characteristics; the relation between dependent and independent characteristics is discovered using decision-tree learning techniques. Before discovering the tree, the tool also allows some rows to be filtered out. For instance, one may want to retain only those events that refer to certain activities.

If a certain characteristic is valuable for an analysis but not present, our tool also allows event logs to be extended with additional characteristics that are not readily available. For instance, events can be extended with the remaining flow time until the end of the process instance or with the elapsed time since the process instance started. Other characteristics that may be added relate to the resource that triggered an event, i.e., that executed the respective activity (e.g., the workload of the resource). We can also add the next activity as a characteristic of an event. One can even add conformance checking results and external context information, such as weather information, to events as characteristics. In many cases, the values of these characteristics can simply be derived from the event log itself; in other cases, they need to be harvested from information sources outside the event log (weather information, stock index, etc.).

Implementation. The tool is implemented as a plug-in of ProM, an open-source “pluggable” framework for the implementation of process mining tools in a standardised environment (see http://www.promtools.org).
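To make the event table and the log extensions described in the introduction more concrete, a minimal Python sketch follows. It is not the plug-in's Java code: the toy log, the field names (`elapsed_s`, `remaining_s`, `next_activity`) and the helper functions are hypothetical, and traces are assumed to be already ordered by timestamp.

```python
from datetime import datetime

# Hypothetical toy log: case id -> trace (list of events as key-value pairs).
# "activity" and "timestamp" are the mandatory characteristics.
LOG = {
    "case-1": [
        {"activity": "Register", "timestamp": "2014-01-01T09:00:00", "resource": "Ann"},
        {"activity": "Check", "timestamp": "2014-01-01T10:30:00", "resource": "Bob"},
        {"activity": "Close", "timestamp": "2014-01-02T09:00:00", "resource": "Ann"},
    ],
    "case-2": [
        {"activity": "Register", "timestamp": "2014-01-03T08:00:00", "resource": "Bob"},
        {"activity": "Close", "timestamp": "2014-01-03T12:00:00", "resource": "Ann"},
    ],
}


def build_table(log):
    """Return one row per event, extended with derived characteristics."""
    rows = []
    for case_id, trace in log.items():
        times = [datetime.fromisoformat(e["timestamp"]) for e in trace]
        start, end = times[0], times[-1]  # assumes the trace is time-ordered
        for i, (event, ts) in enumerate(zip(trace, times)):
            row = dict(event)  # copy the characteristics stored in the log
            row["case"] = case_id
            row["elapsed_s"] = (ts - start).total_seconds()
            row["remaining_s"] = (end - ts).total_seconds()
            has_next = i + 1 < len(trace)
            row["next_activity"] = trace[i + 1]["activity"] if has_next else None
            rows.append(row)
    return rows


def retain(rows, activities):
    """Keep only the rows whose activity is in the given set."""
    return [r for r in rows if r["activity"] in activities]


table = build_table(LOG)          # every event becomes a row
checks = retain(table, {"Check"})  # filtering happens after augmentation
```

In this sketch any column of `table` could serve as the dependent characteristic, with the remaining columns as independent ones; learning the decision tree itself is left out.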
The ProM framework is based on the concept of packages, each of which is an aggregation of several plug-ins that are conceptually related. Our plug-in is part of a new package named FeaturePrediction, which is available in ProM version 6.4.

A ProM plug-in requires a number of input objects and produces one or more output objects. The main input object of our plug-in is an event log, whereas the output is a decision tree. To build decision trees, the plug-in leverages the implementation of the C4.5 algorithm in Weka (http://weka.sourceforge.net/). As mentioned before, our framework envisions the possibility to augment/manipulate the event logs with additional features. In this respect, the tool is easily extensible: a new log manipulation can be plugged in by (1) implementing 3 methods in a Java class that inherits from an abstract class and (2) programmatically adding it to a given Java set of available log manipulations. To date, the implementation already includes an extensive number of manipulations, which cover different process perspectives (time, control-flow, data, resource and conformance) and are listed in Table 1 of [1].

The application of some log manipulations requires additional input objects, such as a process model or an LTL formula. The plug-in is organized so that one arbitrary additional object can be given as input and used as a source of information by the log manipulations that can exploit it.

2 Usage of the Tool to Perform a Correlation Analysis Use Case

In [1], we have reported on the application of our framework in collaboration with UWV, the Dutch institution that manages the provision of unemployment benefits for employees in the Netherlands who have lost their job. In particular, we developed four analysis use cases to answer as many questions for which the institution was seeking an answer. As reported, many insights were derived, which had significant business value for UWV. However, in this paper, we want to complement such an evaluation with another one in a different business context.

Figure 1. The starting screen of the tool.

This section shows how an analysis use case can be carried out through our tool implementation in ProM. It is concerned with the treatment process for eye-related pathologies in a hospital in the Netherlands. The analysis use case aims at correlating the duration of executing activity Afspraak (in Dutch, appointment) to other process characteristics. This activity is performed by physicians who periodically visit hospitalized patients.

After starting ProM, the user needs to choose plug-in Perform Prediction of Business Process Features. In addition to giving an event log as input, we also provide a second object that supplies the necessary information to augment/manipulate events with characteristics linked to the conformance of process instances against a prescribed process model (see [2] for details). The initial screen is shown in Figure 1: no decision tree is constructed yet since the events to retain need to be chosen along with the dependent and independent characteristics to consider. The border of the screen contains three labels, namely Activities, Attributes and Configuration, used, respectively, to select activities for the events to retain, to pick the characteristics to consider and to set the parameters to construct the decision tree. Hovering the mouse over a label shows the corresponding configuration panel (see Figure 2).

The first step is to choose the characteristics to consider: Figure 2(a) shows the panel where users select the characteristics to consider among those available. These characteristics are visualized in a tree and grouped by the process perspective to which they refer. By selecting a node in the tree, characteristics are added to those to consider.
The characteristics linked to conformance are displayed differently: by selecting Consider fitness as feature, each event is augmented with the level of fitness of the trace to which the event belongs. By clicking on Open the fitness frame, users can selectively decide (panel not shown here) whether the number of deviations for certain single activities should be considered as characteristics (see [2] for more details).

After choosing the characteristics to consider, the next step is to select the activities to retain. Since we aim to provide a correlation only for Afspraak, events referring to any other activity are filtered out. Figure 2(b) shows the corresponding panel: any activity different from Afspraak is removed from the list. The filtering of events happens in the phase that follows the manipulation with additional characteristics. This means that the choice of events to retain does not influence how events are augmented with additional characteristics, e.g., those referring to the number of executions of given activities or to the previous/next activity in the trace.

(a) Panel to select the process characteristics to consider.
(b) Panel to filter on the activities of the events to retain.
(c) Panel to select the dependent characteristics and the parameters for the decision-tree construction.
Figure 2. Configuration Panels to build a correlation analysis use case.

As a final step, the analyst needs to choose which characteristic is the dependent one. This is done through the panel Configuration, shown in Figure 2(c). For our analysis use case, we selected Activity Duration as the dependent characteristic. The dependent characteristic needs to be one among those selected through the panel in Figure 2(a). The other options in the panel are used to configure the application of the C4.5 algorithm when building a decision tree.
In particular, for this analysis, we decided to constrain the decision tree to be binary and allowed the decision tree to be pruned, with the constraint that no fewer than 167 events can be associated with a leaf so as to balance under- and over-fitting problems. C4.5 requires the dependent characteristic to be discrete. The activity duration is a continuous characteristic and, hence, needs to be discretized before being used. Different discretization techniques are accessible through the Discretization panel (not shown here). For this analysis, we opted for equal-frequency binning: intervals are of different sizes but (roughly) the same number of observed values falls into each one.

Figure 3. The resulting decision tree that provides a correlation with the duration of executions of activity Afspraak.

Figure 3 shows the resulting decision tree. Some correlation rules can be derived: if the previous activity is not Afspraak, the duration of an Afspraak execution is likely to be less than 214,748,364 milliseconds, nearly 2.5 days. Similar durations are also expected for the executions of Afspraak preceded by another Afspraak when the patient's treatment started less than 1,874,700,000 milliseconds (around 21.7 days) earlier. Conversely, the duration of Afspraak executions seems to be significantly longer, i.e., around 22.3 instead of 2.5 days, if the patient's treatment started longer ago. Since the event log only stored the timestamps of the completions of activities, this duration accounts for both the actual execution time and the waiting/idle time before Afspraak was actually started. If the event log also contained the timestamps at which activities were started, the duration would not include the idle time. No correlation is found with characteristics related to resources and deviations. This indicates that the duration of the Afspraak executions is not related to those process characteristics.
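As a rough illustration of equal-frequency binning and of the millisecond-to-day conversions quoted above, consider the following Python sketch. It is not the discretization code used by the plug-in or by Weka: the function names and the sample durations are hypothetical, and the way cut points are chosen (taking the value at every n/bins-th position of the sorted data) is only one simple variant of the technique.

```python
import bisect

MS_PER_DAY = 24 * 60 * 60 * 1000


def ms_to_days(ms):
    """Convert a duration in milliseconds to days."""
    return ms / MS_PER_DAY


def equal_frequency_cut_points(values, bins):
    """Return bins-1 cut points so that each interval holds roughly
    the same number of observed values."""
    ordered = sorted(values)
    n = len(ordered)
    return [ordered[(k * n) // bins] for k in range(1, bins)]


def assign_bin(value, cut_points):
    """Index of the interval a value falls into; a value equal to a
    cut point is placed in the upper interval."""
    return bisect.bisect_right(cut_points, value)


# Hypothetical, heavily skewed durations (in arbitrary units).
durations = [1, 2, 3, 4, 50, 60, 700, 800, 9000]
cuts = equal_frequency_cut_points(durations, bins=3)
labels = [assign_bin(d, cuts) for d in durations]
counts = [labels.count(b) for b in range(3)]

# Thresholds reported for the decision tree in Figure 3, expressed in days.
first_threshold_days = round(ms_to_days(214_748_364), 1)
second_threshold_days = round(ms_to_days(1_874_700_000), 1)
```

With these sample values the three intervals have very different widths yet hold three observations each, which is the property equal-frequency binning is chosen for when durations are skewed.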
At https://svn.win.tue.nl/repos/prom/Documentation/FeaturePrediction/screencast.avi, a screencast is available that, starting from the event log and the reference process model, shows the entire sequence of steps to obtain the decision tree in Figure 3. The screencast also reports on a different correlation analysis use case that is concerned with correlating several characteristics to the level of fitness of process instances with respect to a given reference process model.

References

1. de Leoni, M., van der Aalst, W.M.P., Dees, M.: A General Framework for Correlating Business Process Characteristics. In: Proceedings of the 12th International Conference on Business Process Management (BPM 2014). Volume 8659 of LNCS, Springer (2014) 250–266
2. de Leoni, M., van der Aalst, W.M.P.: Aligning event logs and process models for multi-perspective conformance checking: An approach based on integer linear programming. In: Proceedings of the 11th International Conference on Business Process Management (BPM’13). Volume 8094 of LNCS, Springer-Verlag (2013) 113–129