=Paper=
{{Paper
|id=Vol-2703/paperTD7
|storemode=property
|title=Extensions to the bupaR Ecosystem: An Overview
|pdfUrl=https://ceur-ws.org/Vol-2703/paperTD7.pdf
|volume=Vol-2703
|authors=Gert Janssenswillen,Felix Mannhardt,Mathijs Creemers,Benoît Depaire,Mieke Jans,Leen Jooken,Niels Martin,Greg Van Houdt
|dblpUrl=https://dblp.org/rec/conf/icpm/JanssenswillenM20
}}
==Extensions to the bupaR Ecosystem: An Overview==
Gert Janssenswillen∗, Felix Mannhardt†, Mathijs Creemers∗, Benoît Depaire∗, Mieke Jans∗, Leen Jooken∗, Niels Martin∗‡ and Greg Van Houdt∗

∗ UHasselt - Hasselt University, Agoralaan, 3590 Diepenbeek, Belgium, gert.janssenswillen@uhasselt.be

† Technische Universiteit Eindhoven, 5612 AZ Eindhoven, Netherlands, f.mannhardt@tue.nl

‡ Research Foundation Flanders (FWO), Egmontstraat 5, 1000 Brussel, Belgium

Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Abstract: Over the past few years, bupaR (the open-source R ecosystem for process analysis) has seen a considerable increase in functionalities and users. It has been one of the first successful tools for script-based process analytics, and can currently be seen as the state-of-the-art tool for process analysis in R and an important player in the open-source process mining tool landscape. With a user base consisting largely of professional process analysts, the ecosystem has helped to increase the adoption of process mining in a broad range of fields. In this demonstration, we highlight recent extensions to the ecosystem that will further increase its usefulness for practitioners during their process mining projects.

Index Terms: bupaR, R, process analytics, data quality, knowledge management.

===I. Introduction===
bupaR is an ecosystem of R packages geared towards the analysis of process data in R [1]. The ecosystem builds upon three key principles: (1) connectivity, (2) reproducibility and (3) extensibility. The latter indicates that the functionalities provided by bupaR are continuously evolving. Since the release of the core packages in 2017, both its usage and the range of provided functionalities have been steadily increasing. As shown in Table I, bupaR currently consists of 16 interconnected packages for process analysis, each targeting a specific problem or use case.

{| class="wikitable"
|+ Table I. Overview of the bupaR ecosystem.
! Package !! Purpose
|-
| bupaR* || Core event log functionalities
|-
| collaborateR || Create collaboration graphs
|-
| daqapo* || Identify data quality issues in process-oriented data
|-
| edeaR* || Exploratory and descriptive event data analysis
|-
| eventdataR* || Repository of event logs
|-
| heuristicsmineR* || Discover models using the Heuristics Miner
|-
| logbuildR || Facilitate event log construction
|-
| pm4py* || Bridge with the PM4Py Python library
|-
| processanimateR* || Animate process maps
|-
| processcheckR* || Rule-based conformance checking
|-
| processmapR* || Create process maps
|-
| processmonitR* || Create process monitoring dashboards
|-
| propro || Create probabilistic process models
|-
| petrinetR* || Support for Petri nets
|-
| understandBPMN* || Calculate understandability metrics for BPMN
|-
| xesreadR* || Read and write XES files
|}

Packages marked with * are published on CRAN (https://cran.r-project.org/).

While bupaR in itself is not new, this paper outlines a significant number of new functionalities that have recently been added to the ecosystem. Hence, the current paper extends earlier publications about the functionalities for business process analysis in R [1], [2].

This paper is organised as follows. Section II lists recently developed functionalities, Section III discusses the maturity and usage of bupaR, and Section IV concludes the paper. An accompanying tutorial and screencast can be found on GitHub (https://github.com/bupaverse/icpm-demo-tutorial).

===II. New Features===
====A. LogbuildR====
Getting event data into the right format before starting an analysis remains one of the important hurdles that process analysts face. Notwithstanding bupaR's functionality for reading event logs from XES files [3], practitioners typically have to start from raw data and make sure that it is correctly converted into an event log.

In order to guide this conversion, the package logbuildR has been developed. It provides a graphical interface that leads the user through the different steps of building an event log. The package gives the user intelligent suggestions and direct feedback in each step, which help the analyst to select appropriate identifiers (case, activity, etc.), make sure that each row represents a unique event in the process (versus multiple timestamps per row), convert timestamps to appropriate data formats, and ensure that life-cycle values adhere to the agreed-upon standard transactional life-cycle model [4]. A screenshot of the graphical interface is shown in Figure 1. The logbuildR package is available on GitHub (https://github.com/bupaverse/logbuildR).

Fig. 1. Example of logbuildR interface: selecting appropriate identifiers.

====B. DaQAPO====
Following the preparation of the event log, one of the first steps in process analysis is to assess the quality of the data. In order to support this step, daqapo was developed [5], [6]. Short for Data Quality Assessment for Process-oriented Data, daqapo provides a variety of methods to detect data quality issues in process-oriented data.

As the reliability of process analysis techniques largely depends on the quality of the event log, data quality is an important aspect to consider. Insufficient data quality, or an inadequate understanding of it, will inevitably lead to low-quality results ("garbage in, garbage out") or even misleading ones ("garbage in, gospel out").

In order to stress the importance of data quality, daqapo provides a large set of checks which enable users to identify a range of data quality issues in a systematic way. These issues include missing events, incorrect timestamps, and inaccurate resource information. An overview of the available functions is shown in Table II. The daqapo package is available for installation on the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/package=daqapo).

{| class="wikitable"
|+ Table II. Available assessment functions in daqapo.
! Function !! Description
|-
| detect_activity_frequency_violations || Detect case-wise anomalies in the number of occurrences of activities.
|-
| detect_activity_order_violations || Detect violations in the order of activities within cases.
|-
| detect_attribute_dependencies || Detect event-wise violations between attributes using logical conditions.
|-
| detect_case_id_sequence_gaps || Detect gaps in case identifiers, i.e. when the case identifier is a numerical id.
|-
| detect_conditional_activity_presence || Detect activity presence versus logical conditions.
|-
| detect_duration_outliers || Detect activity duration outliers.
|-
| detect_inactive_periods || Detect inactive periods, i.e. periods without new arriving cases, or periods without any activity instances.
|-
| detect_incomplete_cases || Detect incomplete cases, given a set of essential activities, or final activities in the process.
|-
| detect_incorrect_activity_names || Detect incorrect activity names.
|-
| detect_missing_values || Detect missing values.
|-
| detect_multiregistration || Detect multi-registration, i.e. events recorded at the same time which belong to the same case or the same resource.
|-
| detect_overlaps || Detect overlapping activity instances.
|-
| detect_related_activities || Detect missing related activities, i.e. when certain activities should co-exist.
|-
| detect_similar_labels || Detect spelling mistakes by searching for similar labels in a column.
|-
| detect_time_anomalies || Detect time anomalies, i.e. activities with a negative and/or zero duration.
|-
| detect_unique_values || Search for unique combinations of a given set of columns.
|-
| detect_value_range_violations || Detect invalid values, for categorical, numeric as well as time attributes.
|}

====C. HeuristicsmineR====
The package heuristicsmineR brings extensible support for variants of the Flexible Heuristics Miner [7] to bupaR. Two major variants are implemented: the original Flexible Heuristics Miner as described in [7] and a variant that uses time intervals derived from life-cycle transitions as described in [8]. Once a Causal net has been discovered, the dependencies and gateway information can be visualised or transformed into a Petri net for further processing, e.g., by computing alignments with the pm4py package, which bridges the bupaR ecosystem to the PM4Py Python library for process mining. An underlying design principle of the package is to separate the computation into several phases, each of which provides an intermediate result that can be inspected and visualised using the standard R print functionality. This makes the package well-suited for a teaching context in which the computations are followed in a step-wise fashion. It also makes it easy to compose new variants based on different heuristics.

====D. Propro====
The results of control-flow discovery algorithms are mainly deterministic process models, which do not convey a notion of probability or uncertainty. Using Bayesian inference and Markov Chain Monte Carlo, propro [9] can build a statistical model on top of a process model using event data, which is able to generate probability distributions for choices in a process' control-flow. propro is based on a generic algorithm to build a statistical model [10], which can then be used to test different kinds of hypotheses, such as non-deterministic dependencies between different choices in the model. This leads to valuable information about the process under consideration, which goes beyond the discovery of its static control-flow. Hence, propro supports the enhancement of discovered process models by exposing probabilistic dependencies, and allows the goodness-of-fit of different models with respect to the event data to be compared, each of which provides important advancements in the field of process mining. The propro package is available on GitHub (https://github.com/bupaverse/propro).

====E. ProcessanimateR====
Animation using moving tokens can be a powerful visualisation tool to help understand the general process behavior. The package processanimateR implements an animation library for bupaR that renders interactive process animations using the web standard SVG.

In processanimateR, each case is represented by a separate token that moves along the process map with a speed relative to the observed activity processing and waiting times. The visual appearance of tokens can be customised using any SVG shape and core properties, such as size and color, and can be dynamically adjusted based on event attributes. In a recent release, the package was extended with support for projecting discovered process maps onto an interactive geographical map in which each process activity has a fixed position, as shown in Figure 2. This enables new forms of animation and process visualisation in which the position of activities and the length of edges are assigned clear semantics. This contrasts with the often random placement of activities and edges in traditional process visualisation tools.

Fig. 2. Screenshot of a process animation where the process map has been projected on a geographical map.

====F. CollaborateR====
Whereas most functionalities of bupaR have been developed with no specific type of process in mind, this cannot be said about collaborateR [11]. The origin of this package lies in the area of software engineering. As its name implies, it focuses on the collaboration between different process participants. The underlying algorithm was published in recent work [12].

In the fast-changing and flexible software engineering environments of today, knowledge management is critical. A clear overview of how software developers collaborate can unearth valuable patterns such as the general structure of collaboration, crucial resources, and risks (e.g. losing certain knowledge when a programmer decides to leave the company). Version control system (VCS) logs, which keep track of which tasks team members work on and when, contain data to provide these insights. collaborateR provides an algorithm which extracts and visualises a collaboration graph from VCS log data. The algorithm is partly based on the principles that also underlie the Fuzzy Miner [13]. Its structure consists of consecutive phases, including (1) building the base graph, (2) calculating weights for nodes and edges, and (3) simplifying the graph using aggregation and abstraction. Each of these phases offers the user flexibility to decide which parameters and metrics to include. This makes it possible for the human expert to exploit her existing knowledge about the project and team to guide the algorithm in building the graph that best fits the specific use case, and hence will provide the most accurate insights.

An example of a collaboration graph is shown in Figure 3. In this graph, pink nodes are individual programmers, while blue nodes are clusters of programmers. When programmers have worked on the same files of the project, i.e. the same software code, an edge is drawn between them. The colouring of the edges indicates whether programmers worked separately (orange), together using pair programming (green), or a mix of both (blue). The size of both nodes and edges indicates the importance of the programmers and the strength of their relationships. The package is available on GitHub (https://github.com/bupaverse/collaborateR).

Fig. 3. Example of collaboration graph.

===III. Maturity and Usage===
The packages of the bupaR collection that have been published on the Comprehensive R Archive Network (13 at the moment of writing, cf. Table I) have gathered over 300k downloads, more than half of them during the past year. The tools have been downloaded in 140 different countries. The core packages bupaR, edeaR and processmapR receive on average about 7k, 5k and 4k downloads each month, respectively, and are amongst the 10% most downloaded R packages.

bupaR has been used in general process mining research [14]–[17], and has been applied in more specific areas such as process simulation [18], transportation [19], healthcare [20], learning analytics [21]–[24], predictive process monitoring [25], [26], and others [27], [28]. As the majority of users are practitioners, bupaR has a profound impact on the adoption of process mining in various fields such as healthcare, consulting, manufacturing, telecommunications, and governmental agencies. In more popular media, various case studies are available, for example, in the context of traditional business processes, such as purchase-to-pay processes (https://www.mmertens.eu/2020/06/process-mining-with-power-bi-and-r-visuals/), how to use it with Power BI (https://www.linkedin.com/pulse/how-analyze-business-process-powerbi-using-r-visuals-peter-pensotti/?articleId=6631215429794836480), or how to use it for web analytics (https://stuifbergen.com/2018/08/analyse-web-site-click-paths-as-processes/). The new functionalities described in this paper further enhance the usefulness of bupaR for both researchers and practitioners.

===IV. Conclusion and Future Work===
Since the introduction of bupaR, the ecosystem has steadily grown into a broad toolbase, and has become widely used for process analytics. The extensions described in this paper will further enhance the use of bupaR, and its role in the adoption of process mining by practitioners in various industries.

Future work will focus on the extension of the new functionalities described in this paper, as well as on adding new components to the ecosystem. While logbuildR is currently a graphical interface only, it will be extended so that the user also receives the R code needed to produce the event data at the end. This code can be used in scripts or reports, thereby making the log building step reproducible as well. Furthermore, the creation of collaboration graphs will be generalised so that it can be used for other process data, beyond version control systems. New functionalities in the areas of process discovery, process data visualisation and predictive process monitoring are currently being developed.

===Acknowledgements===
The authors would like to warmly thank all users who are actively contributing to the bupaR framework by submitting issues and pull requests on the GitHub repositories.

===References===
[1] G. Janssenswillen, B. Depaire, M. Swennen, M. Jans, and K. Vanhoof, "bupar: Enabling reproducible business process analysis," Knowledge-Based Systems, vol. 163, pp. 927–930, 2019.
[2] G. Janssenswillen and B. Depaire, "bupar: Business process analysis in r," in International Conference on Business Process Management - Demonstration track, 2017.
[3] G. Janssenswillen and B. Depaire, xesreadR: Read and Write XES Files, 2019, R package version 0.2.3. [Online]. Available: https://CRAN.R-project.org/package=xesreadR
[4] W. van der Aalst, Process mining: discovery, conformance and enhancement of business processes. Heidelberg: Springer, 2011.
[5] N. Martin, G. Van Houdt, and G. Janssenswillen, "Towards more structured data quality assessment in the process mining field: the daqapo package," in Proceedings of the European R Users Meeting 2020, 2020.
[6] N. Martin, G. Van Houdt, and G. Janssenswillen, daqapo: Data Quality Assessment for Process-Oriented Data, 2020, R package version 0.3.0.
[7] A. J. M. M. Weijters and J. T. S. Ribeiro, "Flexible heuristics miner (FHM)," in CIDM. IEEE, 2011, pp. 310–317.
[8] A. Burattin and A. Sperduti, "Heuristics miner for time intervals," in ESANN, 2010.
[9] G. Janssenswillen, propro: Build Probabilistic Process Models Using MCMC, https://github.com/bupaverse/propro.
[10] G. Janssenswillen, B. Depaire, and F. Christel, "Enhancing discovered process models using bayesian inference and mcmc," in Proceedings of the 2020 BPI Workshop, 2020.
[11] L. Jooken and G. Janssenswillen, collaborateR: Build Collaboration Graph Using Version Control System Logs, R package version 0.1.0.
[12] L. Jooken, M. Creemers, and M. Jans, "Extracting a collaboration model from vcs logs based on process mining techniques," in International Conference on Business Process Management. Springer, 2019, pp. 212–223.
[13] C. W. Günther and W. M. Van Der Aalst, "Fuzzy mining–adaptive process simplification based on multi-perspective metrics," in International conference on business process management. Springer, 2007, pp. 328–343.
[14] M. Jans, P. Soffer, and T. Jouck, "Building a valuable event log for process mining: an experimental exploration of a guided process," Enterprise Information Systems, vol. 13, no. 5, pp. 601–630, 2019.
[15] A. Burattin, "Integrated, ubiquitous and collaborative process mining with chat bots," in BPM (PhD/Demos), 2019, pp. 144–148.
[16] S. Shershakov, "Enhancing efficiency of process mining algorithms with a tailored library: Design principles and performance assessment."
[17] S. Kuehnel, S. T.-N. Trang, and S. Lindner, "Conceptualization, design, and implementation of econbpc–a software artifact for the economic analysis of business process compliance," in International Conference on Conceptual Modeling. Springer, 2019, pp. 378–386.
[18] M. Mesabbah and S. McKeever, "Presenting a hybrid processing mining framework for automated simulation model generation," in 2018 Winter Simulation Conference (WSC). IEEE, 2018, pp. 1370–1381.
[19] F. Mannhardt and A. D. Landmark, "Mining railway traffic control logs," Transportation research procedia, vol. 37, pp. 227–234, 2019.
[20] A. P. Kurniati, C. McInerney, K. Zucker, G. Hall, D. Hogg, and O. Johnson, "A multi-level approach for identifying process change in cancer pathways," in International Conference on Business Process Management. Springer, 2019, pp. 595–607.
[21] J. P. Salazar-Fernandez, M. Sepúlveda, and J. Munoz-Gama, "Influence of student diversity on educational trajectories in engineering high-failure rate courses that lead to late dropout," in 2019 IEEE Global Engineering Education Conference (EDUCON). IEEE, 2019, pp. 607–616.
[22] D. Etinger, T. Orehovački, and S. Babić, "Applying process mining techniques to learning management systems for educational process model discovery and analysis," in International Conference on Intelligent Human Systems Integration. Springer, 2018, pp. 420–425.
[23] J. Saint, D. Gašević, W. Matcha, N. A. Uzir, and A. Pardo, "Combining analytic methods to unlock sequential and temporal patterns of self-regulated learning," in Proceedings of the Tenth International Conference on Learning Analytics & Knowledge, 2020, pp. 402–411.
[24] D. Gašević, W. Matcha, J. Jovanović, A. Pardo, L.-A. Lim, S. Gentili et al., "Discovering time management strategies in learning processes using process mining techniques," in European Conference on Technology Enhanced Learning. Springer, 2019, pp. 555–569.
[25] M. Tipirishetty, "Predictive process monitoring for lead-to-contract process optimization," 2016.
[26] B. A. Tama and M. Comuzzi, "An empirical comparison of classification techniques for next event prediction using business process event logs," Expert Systems with Applications, vol. 129, pp. 233–245, 2019.
[27] P. Delias and I. Kazanidis, "Exploiting higher-order dependencies for process analytics. the case for political events' analysis," Kybernetes, 2019.
[28] W. Ma, "Bias assessment and reduction in kernel smoothing," 2018.
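
As a supplementary illustration of the heuristics-based discovery that heuristicsmineR builds on (Section II-C), the following minimal Python sketch computes directly-follows counts and the dependency measure commonly given in the Heuristics Miner literature, (|a>b| − |b>a|) / (|a>b| + |b>a| + 1). It is not the heuristicsmineR API: the function names and the toy event log are hypothetical, and the sketch covers only this first intermediate result, not the construction of a Causal net.

```python
from collections import Counter


def directly_follows_counts(traces):
    """Count how often activity a is directly followed by activity b."""
    counts = Counter()
    for trace in traces:
        for a, b in zip(trace, trace[1:]):
            counts[(a, b)] += 1
    return counts


def dependency(counts, a, b):
    """Dependency measure a => b; values close to 1 suggest a causal relation."""
    ab = counts[(a, b)]
    if a == b:
        # Loops of length one use the count of a directly followed by itself.
        return ab / (ab + 1)
    ba = counts[(b, a)]
    return (ab - ba) / (ab + ba + 1)


# Hypothetical toy log: each trace is the ordered list of activities of one case.
traces = [
    ["register", "check", "pay"],
    ["register", "check", "pay"],
    ["register", "pay", "check"],
]

counts = directly_follows_counts(traces)
for a, b in [("register", "check"), ("check", "pay"), ("pay", "check")]:
    print(f"{a} => {b}: {dependency(counts, a, b):.2f}")
```

Inspecting the counts first and the dependency values second mirrors, in a simplified way, the phase-wise design described in Section II-C, where each step yields an intermediate result that can be examined before moving on.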