Stochastic Process Mining (Extended Abstract) Adam Burke Queensland University of Technology at.burke@qut.edu.au There are many information systems in the world, used by A stochastic process discovery framework was developed many organizations to build things and help people. In process in the investigation of this question [3]. In it, a control flow mining [1], models of organizations at work are automatically discovery algorithm is first used to discover a Petri net [1, constructed and analyzed. By describing large amounts of data p60], and the result is combined with a weight estimation quickly, process mining accelerates understanding of what an step to produce a GSPN [2]. Six estimators fitting the frame- organizations does, and how it may improve. For example, work were implemented, and evaluated experimentally against a medical doctor may note which common hospital intake stochastic conformance measures, using established discovery tasks are bottlenecks, or an auditor may see evidence that algorithms and real-life public event logs1 . The framework is a certain regulatory checks are carried out as expected. These generalization of existing stochastic discovery techniques that processes may be explicitly defined by the organization, as compose control-flow discovery with a pipelined stochastic in an insurance claim process, or implicit, as in a hospital estimation step [5], [6]. It is also a specialization, in that emergency room. it produces GSPNs with immediate transitions, rather than The successes of process mining to date have largely been GDT SPNs [5]. with control-flow models. In a control-flow model, causality is Stochastic quality measures of Entropy Recall and Preci- represented, but probability is not. Stochastic process models sion [7] and Earth-Movers’ Distance [8] were used for the are potentially more powerful tools for some types of organiza- evaluation, together with real-life logs from BPI challenges tional analysis and optimization where frequency and variation and a variety of discovery algorithms. Estimation techniques are key, because one way of understanding organizations is to found were of comparable quality, were applicable to a broader look at the patterns of what they repeatedly do, and what they range of event logs, and were generally faster than GDT SPN treat as exceptional. Stochastic models may also be used in discovery [5]. prediction and performance. This project then investigates: RQ1.2 How may stochastic models be discovered directly How can processes in organizations be understood using from event logs? stochastic models mined from organizational data? Techniques based on a computing pipeline of control-flow The relatively small body of existing work on this topic model discovery followed by some inference of stochastic is reviewed with research sub-questions and methods in Sec- data have some inherent design limitations. The control-flow- tions I-III. This project extends and applies this existing work only nature of the initially discovered model may introduce a while contributing novel techniques and analysis for the con- representational bias toward the structures of that output for- struction and use of these models on real-world event data. The malism. The multiple passes through the log is also awkward, process models notations used are Generalized Stochastic Petri and perhaps inefficient. Accordingly, this project introduced Nets (GSPNs) [2] and closely related extensions. Section I the novel Toothpaste miner framework, a set of direct dis- summarizes already conducted research into stochastic process covery algorithms where control-flow and stochastic aspects discovery, while Sections II and III relate to ongoing and are discovered in concert, through a process of reduction and planned research into quality dimensions and concept drift. abstraction, in polynomial time [4]. To do this, it introduces an intermediate abstraction targeted at stochastic process dis- I. D ISCOVERING S TOCHASTIC M ODELS (RQ1) covery, the Probabilistic Process Tree (PPT). The algorithms start with a trace model and reduce it to a target model using RQ1 How may stochastic process models be discovered formally defined rules, “squeezing” trace information into a automatically? usable form. A prototype was implemented2 and evaluated This project work investigated composition of existing empirically against existing techniques with promising results. control-flow discovery techniques with new weight estima- A PPT is an extension of Process Trees that includes tion [3] and direct stochastic process discovery [4] techniques. relative probabilities in the form of node weights, and is Direct Process Discovery algorithms output stochastic mod- related formally to GSPNs. Toothpaste miner is inspired by els without an intermediate control-flow discovery step. The region-based miners for control-flow models [9], [10] and the techniques for generating stochastic models are not themselves A LERGIA [11] algorithm. necessarily stochastic: they may also include analytic methods. RQ1.1 How may control-flow discovery techniques be lever- 1 Java code and data at https://github.com/adamburkegh/spd we aged for stochastic process model discovery? 2 Haskell code and data at https://github.com/adamburkegh/toothpaste Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). II. Q UALITY D IMENSIONS (RQ2) Building on this foundation, the evolution of organizational RQ2 What quality dimensions are empirically observable in processes can itself be analysed computationally, yielding stochastic process models and logs? models of process change. This long-run process drift can be The quality of process models is often quantified with described with second-order process models [14]. conformance measures, and those measures related to four This prospective research concerns novel algorithms, tech- standard control-flow quality dimensions [1, p188]. Recent niques and newly developed software for describing and un- research into stochastic process measures suggests both con- derstanding long-run process drift. The underlying formalisms nections and challenges to these quality dimensions. For are expected to be based on GSPNs and Reconfigurable Petri example, entropy measures have been related to both precision nets [16]. Models of long-run process drift will be constructed and recall (fitness) [7], but Earth-Movers Distance does not using a combination of stochastic process discovery and obviously align with an existing dimension [12]. Challenges adapted concept drift detection techniques. The techniques range from needing to translate the technical definition of e.g., will be evaluated experimentally against real-world event logs fitness, to a stochastic context, to differences in how adjacent and stochastic quality measures. Initial exploration for suitable fields theorize the role of simplicity and generalization. event log data is underway. Automatic techniques can make it At least one empirical and quantitative study compares pro- easier to understand the history of change in a process, what cess measures and dimensions for control-flow models [13]. is routine, and what is exceptional, and thereby make better Factor analysis showed Fitness and Precision as observable, organizations over time. orthogonal dimensions. Simplicity was excluded, and there R EFERENCES was insufficient evidence to support a Generalization dimen- [1] W. van der Aalst, Process Mining: Data Science in Action, 2nd ed. sion. For stochastic process models, this suggests experimental Berlin Heidelberg: Springer-Verlag, 2016. investigation of how models may be distinguished can inform [2] F. Bause and P. Kritzinger, Stochastic Petri Nets: An Introduction to the and foster new formalized measures, and understand relations Theory. Vieweg+Teubner Verlag, 2002. [3] A. Burke, S. J. J. Leemans, and M. T. Wynn, “Stochastic Process between them. Process mining data is ultimately derived from Discovery by Weight Estimation,” in Process Mining Workshops. Cham: real-world social activity, and that will narrow the vast space Springer, 2021, pp. 260–272. of candidate formalisms. [4] ——, “Discovering Stochastic Process Models By Reduction and Ab- straction,” in Application and Theory of Petri Nets and Concurrency, This research, which is ongoing, investigates empirically ser. Lecture Notes in Computer Science. Springer, 2021, pp. 312–336. identifiable orthogonal quality dimensions for stochastic pro- [5] A. Rogge-Solti, W. M. P. van der Aalst, and M. Weske, “Discovering cess models. The experiment design uses a dataset of stochas- Stochastic Petri Nets with Arbitrary Delay Distributions from Event Logs,” in BPM Workshops, ser. LNBP. Springer, 2014, pp. 15–27. tic process models for real-life processes, collected and evalu- [6] M. Camargo, M. Dumas, and O. Gonzlez-Rojas, “Automated discovery ated against a set of computationally cheap measures, termed of business process simulation models from event logs,” Decision exploration measures. Exploration measures are based on ex- Support Systems, vol. 134, p. 113284, Jul. 2020. [7] S. J. J. Leemans and A. Polyvyanyy, “Stochastic-Aware Conformance isting control-flow and stochastic model measures and around Checking: An Entropy-Based Approach,” in Advanced Information fifteen measures are being considered. Existing stochastic Systems Engineering, ser. LNCS. Springer, 2020, pp. 217–233. conformance measures [7], [8] are considered evaluation mea- [8] S. J. J. Leemans, W. M. P. van der Aalst, T. Brockhoff, and A. Polyvyanyy, “Stochastic process mining: Earth movers stochastic sures. The dataset for the experiment comprises thousands conformance,” Information Systems, p. 101724, Feb. 2021. of stochastic models, including randomly generated models [9] J. Carmona, J. Cortadella, and M. Kishinevsky, “New Region-Based and those from existing discovery algorithms. Evaluation mea- Algorithms for Deriving Bounded Petri Nets,” IEEE Transactions on Computers, vol. 59, no. 3, pp. 371–384, Mar. 2010, conference Name: sures will be used for a subset of discovered models where IEEE Transactions on Computers. conformance measure tools were expected to terminate. The [10] V. Liesaputra, S. Yongchareon, and S. Chaisiri, “Efficient Process subset excludes random models and discovered models which Model Discovery Using Maximal Pattern Mining,” in BPM, ser. LNCS. Springer, 2015, pp. 441–456. perform poorly using exploration measures. Undermeasured [11] R. C. Carrasco and J. Oncina, “Learning stochastic regular grammars phenomena and possible quality dimensions will be proposed by means of a state merging method,” in Grammatical Inference and based on a statistical analysis of the results. Applications, ser. LNCS. Berlin: Springer, 1994, pp. 139–152. [12] S. J. Leemans, A. F. Syring, and W. M. van der Aalst, “Earth movers III. L ONG -RUN P ROCESS D RIFT (RQ3) stochastic conformance checking,” in International Conference on Busi- ness Process Management. Springer, 2019, pp. 127–143. RQ3 How can we precisely describe the history of change [13] G. Janssenswillen, N. Donders, T. Jouck, and B. Depaire, “A com- in an organizations processes? parative study of existing quality measures for process discovery,” Information Systems, vol. 71, pp. 1–15, Nov. 2017. A model describes a system. When the system changes, and [14] R. P. J. C. Bose, W. M. P. van der Aalst, I. liobait, and M. Pechenizkiy, an automatically discovered model needs to detect that, this “Dealing With Concept Drifts in Process Mining,” IEEE Transactions problem is called concept drift [1, p320]. In process mining, on Neural Networks and Learning Systems, vol. 25, no. 1, pp. 154–171, Jan. 2014. the computational discovery and analysis of organizational [15] T. Brockhoff, M. S. Uysal, and W. M. P. v. d. Aalst, “Time-aware process models, this becomes process drift [14]. Recent work Concept Drift Detection Using the Earth Movers Distance,” in ICPM, on stochastic modelling for concept drift [15] suggests stochas- Oct. 2020, pp. 33–40. [16] M. Llorens and J. Oliver, “Structural and dynamic changes in concurrent tic models can be a productively describe these phenomena. systems: reconfigurable Petri nets,” IEEE Transactions on Computers, The significant existing literature on concept drift in pro- vol. 53, no. 9, pp. 1147–1158, Sep. 2004. cesses is focused on moments of drift, or anomaly detection. Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).