Unearthing the Real Process Behind the Event Data: The Case for Increased Process Realism (Extended Abstract)

Gert Janssenswillen

UHasselt - Hasselt University, Martelarenlaan 42, 3500 Hasselt, Belgium
gert.janssenswillen@uhasselt.be

Abstract. Companies in the 21st century possess a large amount of data about their products, customers and transactions. The increase in available event data gave rise to process mining, a discipline that focuses on extracting insights about processes from event logs. However, correctly displaying business processes is not a trivial task. This dissertation introduces the concept of process realism — stressing the need for reliable process analysis results for evidence-based decision making — which is approached from two angles. Firstly, quality dimensions and measures for process discovery are analysed on a large scale and compared with each other on the basis of empirical experiments. Secondly, by developing a transparent and extensible tool-set, a framework is offered to analyse process data from different perspectives. Exploratory and descriptive analysis of process data and the testing of hypotheses likewise lead to increased process realism. Based on both approaches, recommendations are made for future research, and a call is made to give the process realism mindset a central place within process mining analyses.

Keywords: Process mining · Event data · Conformance Checking

1 Introduction

In current times, organisations possess a tremendous amount of data concerning their customers, products and processes. Many of the activities taking place in their operational processes are recorded in event logs [2]. Techniques from the field of process mining, which has grown steadily over the last decades, can be applied to gain insights from these event data [1]. Over the past decade, a lot of attention has been given to the discovery of process models from logs [4, 5] and to the quality measurement of these models [3, 6, 7].
The results of process mining analyses, if acted upon, can have important ramifications for business operations in two ways. Firstly, improvements to the performance of processes: performance — or a lack thereof — can be expressed in many different ways, such as the time spent on the process or the operational costs incurred. Secondly, improvements in compliance with rules and regulations, whether imposed internally by organisations or by (inter)national laws, which are important to prevent fraud and other types of risk.

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Both of these aspects strongly rely on the ability to accurately delineate the process and all its relevant characteristics based on the process data extracted from the organisation's information systems.

This dissertation aims to contribute from several angles to this accurate representation of processes, introducing the concept of process realism. Process realism can be defined as the interest in or concern for the actual or real process, as distinguished from the abstract, speculative, etc., or as the tendency to view or represent processes as they really are. In order to optimise processes, evidence-based decision making is needed. Consequently, it is essential to map these processes in a realistic way. Blindly relying on partial or inconsistent data, and on algorithms, can lead to wrong actions being taken.

Process realism is approached from two perspectives. First, quality dimensions and measures for process discovery results are analysed on a large scale and compared with each other on the basis of empirical experiments (Part II of the thesis). Which measures are best suited to assess the quality of a discovered process model? What are their weaknesses and strengths?
Which challenges still need to be overcome in order to evolve towards reliable quality measurement? The results of these experiments are discussed in Section 2.

In addition to the focus on process models, process realism is also approached from a data point of view. By developing a transparent and extensible tool-set, a framework is offered to analyse process data from different perspectives (Part III of the thesis). Exploratory and descriptive analysis of process data and the testing of hypotheses likewise lead to increased process realism. This led to the creation of bupaR, an open-source software suite for process analysis in R. This part of the dissertation is further discussed in Section 3.

2 Process Model Quality

The result of a process discovery algorithm is often too easily (mis)taken for the absolute truth about the underlying process. However, the fact that it was discovered from a sample of event data, which probably also contains measurement errors, tells us that this is not necessarily the case. At the same time, it is not a reliable representation of the original event data either, because of the filters and other choices and assumptions imposed by the discovery algorithm used. As such, awareness of whether one is describing the event data, making assertions about the underlying process, or an ambiguous mix of both is currently lacking.

Being able to accurately quantify the quality of discovered process models, an important component of conformance checking, is critical for process discovery. Only through accurate quality measurement can the trustworthiness of discovered process models be assessed, to see whether the insights they deliver are reliable. It is crucial to know whether a discovered process model is a precise and fitting representation of the event data or of the underlying process.
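The distinction between describing the event data and estimating the underlying process can be made concrete with a deliberately simple sketch. Everything below is invented for illustration — the traces, their probabilities, and the trace-set "model" — and it is not one of the measures evaluated in the thesis. The "system" is a distribution over traces, a "log" is a finite sample drawn from it, and the "model" is simply the set of traces it accepts:

```python
import random

# Toy setup (illustrative only, not a measure from the thesis):
system = {            # trace -> probability of occurring in the real process
    ("a", "b", "c"): 0.5,
    ("a", "c", "b"): 0.3,
    ("a", "b", "b", "c"): 0.2,
}
model = {("a", "b", "c"), ("a", "c", "b")}   # hypothetical discovered model

def sample_log(n, seed=42):
    # draw a finite event log from the system's trace distribution
    rng = random.Random(seed)
    traces, probs = zip(*system.items())
    return rng.choices(traces, weights=probs, k=n)

def log_fitness(log):
    # share of observed traces the model can replay (trace-level view)
    return sum(t in model for t in log) / len(log)

def system_fitness():
    # probability mass of real behaviour the model can replay
    return sum(p for t, p in system.items() if t in model)

log = sample_log(20)
print(round(log_fitness(log), 2), system_fitness())
```

With a small log, `log_fitness` fluctuates around the true `system_fitness` of 0.8; averaging it over many sampled logs shows whether the log-based value is a biased estimate of the system-level one, which mirrors the kind of question studied empirically in Part II.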
Many quality measures — for fitness, precision and generalization — have been developed over the past years, but they have so far only been evaluated narrowly, on how they compare to each other. Moreover, it is not clear how to interpret or combine different dimensions such as precision and generalization. The research objective related to process model quality is therefore twofold:

1. examine the measures in terms of validity, sensitivity and feasibility, and
2. analyse their ability to quantify the quality of the model as a representation of the underlying process, i.e. the system.

A summary of the results of Part II is shown in Table 1. For fitness and precision measures, the unbiased estimator column refers to the ability of a measure, when applied between a log and a model, to act as an unbiased estimator of the correspondence between the process model and the underlying process (i.e. the system). We refer to these notions as system-fitness (system-precision) and log-fitness (log-precision), respectively. For generalization measures, unbiased estimator refers to their ability to unbiasedly estimate system-fitness, which is how generalization is most often defined.1

Table 1. Summary of results (✓ = yes, ✗ = no). F = Fitness, P = Precision, G = Generalization; ab = Alignment-Based, ne = Behavioural, tb = Token-Based, ba = Best Align, oa = One Align.

Measure | Feasibility | Validity | Sensitivity | Unbiased estimator
F  ab   |      ✗      |    ✓     |      ✓      |         "
   ne   |      ✓      |    ✓     |      ✗      |         "
   tb   |      "      |    ✓     |      ✓      |         "
P  ab   |      "      |    "     |      "      |         "
   ne   |      ✓      |    ✓     |      ✓      |         ✗
   ba   |      ✗      |    ✓     |      ✓      |         ✗
   oa   |      "      |    ✓     |      "      |         ✗
G  ab   |      ✓      |    ✗     |      ✗      |         ✗
   ne   |      ✓      |    ✗     |      ✓      |         ✗

The experiments of Part II and their conclusions indicate several important challenges to be tackled by future research: challenges related to process quality measurement itself — e.g. how to estimate system-quality in an unbiased manner?
— as well as to the experimental set-up for the empirical evaluation: which types of models to use, or how many systems to generate, are questions in this kind of experiment to which the current literature does not provide an answer.

1 Note that, in contrast to fitness and precision, there is no universally agreed-upon definition of generalization.

3 Process Analytics

A process model, however superior in quality it might be, will always abstract away certain information — such as information on resources, time, or other attributes — thereby partly sacrificing the realism with which one views the process. While a model can reveal certain surprising or interesting patterns in the process, the practitioner will want a means to investigate such a pattern further, to understand why and how it came about, before deciding whether — and which — corrective actions are required to improve the performance or compliance of the process.

As such, this dissertation is also motivated by the necessity of a tool-set to analyse process data in a flexible and powerful way, able to focus on very specific segments or perspectives of the processes. Important in this respect is the capability to use proven data analytics techniques — from statistics to contemporary data mining tools — in order to truly unravel these patterns and confirm their reality. While many developments with respect to process analysis tools have already been made, important limitations can still be found which prevent this type of flexible and transparent inquiry.

The contribution of the second part of this dissertation is therefore the development of a tool-set that answers to specific requirements, identified from an inventory of state-of-the-art tools, both open-source and commercial.
In particular, the following characteristics are considered:

– Flexibility — the ability of the tool to analyse multiple perspectives of the process besides the omnipresent focus on control-flow. Non-standard case and event attributes should also receive their place in the analysis of the process.
– Connectivity — the ability to use existing tools and techniques. Being connected with these existing functionalities prevents process analysis from ending up isolated from the advances in the broader data science field.
– Transparency — doing away with the often obscuring characteristics of process analytics tools, such as hidden assumptions and ambiguous, behind-the-scenes pre- or post-processing steps. In order to bring about process realism, the tool should clearly document the workings of all its functionalities and allow for reproducible work-flows.

Based on these aspects, the framework bupaR is introduced. bupaR is an extensible set of R packages for business process analysis, developed to support flexible, reproducible and extensible process analytics. As an evaluation of the framework, it is applied to two case studies. First, how can process data be used to better understand students' study trajectories and to better guide students? Secondly, how can process analysis be applied in a railway context, in order to achieve a smoother service for passengers? Both case studies show that the framework clearly adds value, and that the answers to the questions asked can help to improve the processes under consideration. At the same time, unresolved challenges within process mining are also highlighted, such as the analysis of processes at the right level of granularity, and the assumption that process instances are independent of each other.
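bupaR itself is an R framework; as a language-neutral illustration of the transparent, scriptable style of analysis described above, the following Python sketch computes two standard event-log summaries — activity frequencies and trace variants — from a hypothetical, hard-coded log. All data and function names here are invented for this example:

```python
from collections import Counter
from itertools import groupby
from operator import itemgetter

# Hypothetical event records: (case id, activity), already in execution order.
events = [
    ("case1", "register"), ("case1", "check"), ("case1", "decide"),
    ("case2", "register"), ("case2", "decide"),
    ("case3", "register"), ("case3", "check"), ("case3", "decide"),
]

def activity_frequency(events):
    # how often each activity occurs across the whole log
    return Counter(act for _, act in events)

def trace_variants(events):
    # group events per case (stable sort keeps event order within a case),
    # then count identical activity sequences
    traces = (tuple(act for _, act in evs)
              for _, evs in groupby(sorted(events, key=itemgetter(0)),
                                    key=itemgetter(0)))
    return Counter(traces)

print(activity_frequency(events))
print(trace_variants(events))
```

The point of the sketch is not the two summaries themselves — bupaR offers analogous functionality directly on event-log objects in R — but that every preprocessing step is explicit, inspectable and re-runnable, which is exactly what the transparency and reproducibility requirements above ask for.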
4 Conclusions

From both perspectives, process model and process data, recommendations are made for future research, and a call is made to give the process realism mindset a central place within process mining analyses.

The research objective of the first part, with its focus on process model quality, was to analyse quality measures in order to examine their usefulness in terms of validity, sensitivity and feasibility, as well as their ability to quantify the quality of the model as a representation of the underlying process. So far, little research has been done on the evaluation and comparison of quality measures; the empirical analyses in this dissertation shed light on this poorly understood area.

Secondly, reproducible analysis is an important requirement for present-day tools. It is relevant both for academics — allowing them to reproduce experiments — and for industry — to rerun analyses on new data, or under new assumptions. Reproducibility can be obtained using different approaches. One is to support the creation of graphical work-flows, as RapidMiner and other graphical data analysis environments do. Another approach is through scripting, which is the approach taken by bupaR. Through investments in documentation, tutorials, examples and a website, as well as a straightforward API design, many actions have been taken to make bupaR as accessible as possible, thereby helping to spread the adoption of process mining to a broader audience.

References

1. van der Aalst, W.M.P., Reijers, H.A., Weijters, A.J., van Dongen, B.F., De Medeiros, A.A., Song, M., Verbeek, H.M.W.: Business process mining: An industrial application. Information Systems 32(5), 713–732 (2007)
2. van der Aalst, W.M.P., Weijters, T., Maruster, L.: Workflow mining: Discovering process models from event logs. IEEE Transactions on Knowledge and Data Engineering 16(9), 1128–1142 (2004)
3.
Adriansyah, A., Munoz-Gama, J., Carmona, J., van Dongen, B.F., van der Aalst, W.M.P.: Alignment based precision checking. In: Business Process Management Workshops, pp. 137–149. Springer (2012)
4. Garcia-Banuelos, L., Dumas, M., La Rosa, M., De Weerdt, J., Ekanayake, C.C.: Controlled automated discovery of collections of business process models. Information Systems 46, 85–101 (2014)
5. Leemans, S.J., Fahland, D., van der Aalst, W.M.P.: Discovering block-structured process models from event logs: A constructive approach. In: International Conference on Applications and Theory of Petri Nets and Concurrency, pp. 311–329 (2013)
6. de Leoni, M., Maggi, F.M., van der Aalst, W.M.P.: An alignment-based framework to check the conformance of declarative process models and to preprocess event-log data. Information Systems 47, 258–277 (2015)
7. Senderovich, A., Weidlich, M., Yedidsion, L., Gal, A., Mandelbaum, A., Kadish, S., Bunnell, C.A.: Conformance checking and performance improvement in scheduled processes: A queueing-network perspective. Information Systems 62 (2016)