=Paper=
{{Paper
|id=Vol-3299/Paper21
|storemode=property
|title=Discover Context-Rich Local Process Models (Extended Abstract)
|pdfUrl=https://ceur-ws.org/Vol-3299/Paper21.pdf
|volume=Vol-3299
|authors=Mitchel Brunings,Dirk Fahland,Eric Verbeek
|dblpUrl=https://dblp.org/rec/conf/icpm/BruningsF022
}}
==Discover Context-Rich Local Process Models (Extended Abstract)==
Mitchel Brunings, Dirk Fahland and Eric Verbeek
Eindhoven University of Technology, Mathematics and Computer Science, Eindhoven, the Netherlands
ICPM 2022 Doctoral Consortium and Tool Demonstration Track

Abstract: We introduce a new ProM plugin called Discover Context-Rich LPMs, which mines a log for large local process models (LPMs) based on supported words. The main advantage of this plugin is that it produces far fewer and much larger LPMs than other tools. The plugin is packaged with an additional plugin called Generate HTML coverage report, which calculates the coverage of LPMs along with several other quality measures. This extra plugin is useful for selecting and improving a set of LPMs.

Keywords: sets of LPMs, coverage, ProM, process mining, process analytics, interactive process discovery

1. Introduction

Discovering a single process model from a large and complex process often yields an undesirable model. Instead, one could discover a set of local process models (LPMs). By limiting the scope of individual LPMs to different smaller (local) sections of the process, they end up smaller and therefore more comprehensible. A complete set of LPMs can in turn give insights into the behavior of the entire process. Several tools exist to discover LPMs from event data [1, 2]. These tools generate thousands of LPMs of limited size: the largest LPMs produced by [1] have about 4 transitions, and the LPMs produced by [2] are limited to 10 transitions by their method's window size. In [3] we argue that we need fewer and larger LPMs for humans to make sense of a set of LPMs.

We present a ProM implementation of an alternative approach [4] to discovering LPMs, resulting in sets of fewer but larger LPMs, in line with [3]. We implemented two ProM plugins: Discover Context-Rich LPMs and Generate HTML coverage report, with the in- and outputs sketched in Fig. 1.

Figure 1: The process of generating LPMs and the coverage report from a log in ProM.

2. Discover Context-Rich LPMs

We first discover context-rich LPMs using the following 4 steps, described in detail in [4].

1) Mine supported words. This step is based on an algorithm developed in [5]. It efficiently finds all gapped subsequences (words) that occur in a log at least as often as a user-defined threshold 𝑡. For our implementation, we set 𝑡 to 10% of the number of occurrences of the most frequent event class.

2) Group by prefix of subsequence. Next, we group the supported words by the first 𝑛 events in each word. The default value 𝑛 = 1 groups words by their starting activity only; values 𝑛 > 1 distinguish words by a longer prefix. This allows grouping all frequent sub-behaviors of a process that begin in the same way.
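Before continuing with the remaining steps, the following is a minimal Python sketch of steps 1 and 2, not the plugin's actual ProM implementation: it counts support as the number of traces that contain a word as a gapped subsequence, which simplifies the occurrence-based counting of [5], and the function names, the max_len cap, and the toy log are illustrative assumptions only.

<syntaxhighlight lang="python">
from collections import Counter

def contains_word(trace, word):
    """True if `word` occurs in `trace` as a gapped subsequence."""
    remaining = iter(trace)
    return all(activity in remaining for activity in word)

def default_threshold(log):
    """Threshold t: 10% of the count of the most frequent event class."""
    counts = Counter(activity for trace in log for activity in trace)
    return 0.1 * max(counts.values())

def mine_supported_words(log, t, max_len=8):
    """Step 1 (simplified): enumerate all words with support >= t by
    apriori-style extension; support here is the number of traces that
    contain the word, a simplification of the counting in [5]."""
    support = lambda word: sum(contains_word(trace, word) for trace in log)
    activities = sorted({a for trace in log for a in trace})
    frequent = [(a,) for a in activities if support((a,)) >= t]
    words = list(frequent)
    while frequent and len(frequent[0]) < max_len:
        candidates = {w + (a,) for w in frequent for a in activities}
        frequent = sorted(w for w in candidates if support(w) >= t)
        words.extend(frequent)
    return words

def group_by_prefix(words, n=1):
    """Step 2: group the supported words by their first n events."""
    groups = {}
    for word in words:
        groups.setdefault(word[:n], []).append(word)
    return groups

# Toy example: words supported by at least 2 of the 3 traces,
# grouped by starting activity.
log = [["a", "b", "c", "d"], ["a", "c", "d"], ["a", "b", "d"]]
groups = group_by_prefix(mine_supported_words(log, t=2))
</syntaxhighlight>

The extension-based enumeration is complete under this trace-containment notion of support, because every prefix of a supported word is itself supported; the algorithm of [5] achieves the same result far more efficiently.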
3) Filter words. Then, we look at each group of words. There may be pairs of words in a group where the longer word contains the shorter one. A short word cannot occur less often than a long word that contains it, but it may occur more often. When such a long word and short word occur (almost) equally often, the short word adds little or no information. If the short word occurs much more often, then the part that only appears in the long word can apparently be skipped. We use a support delta 𝑑 (by default equal to 𝑡) to determine how much more often we need to see a short word to keep it. Thus, for each pair of words where one contains the other, we discard the shorter of the two if it does not occur at least 𝑑 more times than the longer one. What remains is a set of words that are each important to represent in an LPM for this (local) part of the process.

4) Discover models. We now treat each set of filtered supported words with the same prefix as a local log and apply a process discovery algorithm to obtain an LPM from this local log. We already did all selection and filtering in the previous steps, so we should use an algorithm that guarantees 100% fitness. Therefore, we choose to use the basic Inductive Miner [6].

Results. Using this plugin on the BPIC2012 log [7] returned a set of 26 LPMs (Accepting Petri net Array) in about 8 minutes on a laptop with an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz and with 4.0 GB RAM @ 2400 MHz allocated to ProM. The mean number of labeled transitions in these LPMs is 9.2, the median is 7.5, and the largest has 24. This shows that we can find a much more reasonable number of LPMs (tens instead of thousands) that are also much larger (more than twice as large), in roughly the same running time as the other tools [1, 2].

3. Selecting a subset of desired LPMs

In [3] we show that a desired property of a set of LPMs is that it describes all events in the data, i.e., all events are covered. We implemented Generate HTML coverage report to calculate the coverage of LPMs along with several other quality measures. It first evaluates each model individually and then combines the results into a single report. For each LPM, we split the original log into subsequences bounded by the first and last activity of the LPM and project them onto the activities present in the LPM. We treat this set of subsequences as a log, which we replay on the LPM. From the replay we extract measures such as fitness and precision. The plugin then overlays each discovered LPM with replay information, as shown in Fig. 3, and opens a local browser to show a coverage report, as shown in Fig. 2. This plugin took about 1 minute on the LPMs discovered from the BPIC2012 log with the technique of the previous section.

Figure 2: A fragment from the top of the coverage report for the LPMs discovered from BPIC2012 [7].

Figure 3: LPM 13 of 26 after replaying the set of trace fragments from BPIC2012 [7].

The coverage report shows one column for each LPM, in the same order as in ProM. The top rows provide statistics per LPM and a selection box to pick LPMs to combine into a set. Below that, the coverage of each LPM over each trace variant is shown by colored rectangles: each LPM (column) either covers (green) or does not cover (gray) the events (2 pixels high) in the trace variants (rows, with frequencies on the right). The right-most column shows the combined coverage, where gray means no LPM in the selected set covers that event, green means exactly one LPM covers it, and purple means more than one LPM covers it. The row right above the coverage visualization provides coverage statistics, both as absolute numbers and as a ratio of the total number of events in the log. Fig. 2 shows that selecting the LPMs with ticked boxes gives a coverage of 0.872 with 0.234 duplicate coverage. The analyst can (de-)select LPMs to maximize total coverage and minimize duplicate coverage; selecting all LPMs on BPIC2012 gives 0.951/0.815 total/duplicate coverage.
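To make the report's summary numbers concrete, here is a minimal Python sketch of how the combined coverage of a selection could be computed. It is not the plugin's actual implementation: it assumes replay has already produced, for each LPM, a boolean vector marking which events of the log it covers, and the function names are illustrative.

<syntaxhighlight lang="python">
def combined_coverage(coverage_vectors, selected):
    """Summarize a selection of LPMs like the report's statistics row.

    coverage_vectors[i][j] is True if LPM i covers event j of the log
    (as determined by replay); `selected` holds the indices of the
    ticked LPMs. Returns (total, duplicate) coverage as ratios of all
    events in the log, e.g. the 0.872 / 0.234 quoted for Fig. 2.
    """
    n_events = len(coverage_vectors[0])
    counts = [sum(coverage_vectors[i][j] for i in selected)
              for j in range(n_events)]
    total = sum(c >= 1 for c in counts) / n_events      # green + purple
    duplicate = sum(c >= 2 for c in counts) / n_events  # purple only
    return total, duplicate

def event_color(count):
    """Color of one event in the right-most (combined) column."""
    if count == 0:
        return "gray"    # no selected LPM covers this event
    if count == 1:
        return "green"   # exactly one selected LPM covers it
    return "purple"      # more than one selected LPM covers it
</syntaxhighlight>

Selecting a good subset of LPMs then amounts to choosing the columns that push the first ratio up while keeping the second one low, which is exactly the trade-off the analyst makes interactively in the report.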
4. Links

Downloads, installation instructions, a screencast, and the source code can be found at: https://svn.win.tue.nl/repos/prom/Documentation/LPMSupportedWords/index.html

5. Conclusion

The presented tool provides a first step towards interactively constructing meaningful sets of LPMs [3] by discovering far fewer but larger LPMs, from which an analyst can choose relevant LPMs based on various quality criteria, in particular coverage. This is similar in spirit to sub-model freezing [8], as the user stays in control of what the final model looks like. After using the tool, a user should still perform manual selection and cleanup of LPMs. Further developments should aid the user by suggesting "optimal" sets of LPMs. However, determining what is "optimal" requires further research on quality measures for LPMs (cf. [3]). We also argue the need for a notation describing how the LPMs in a set relate to each other globally.

References

[1] N. Tax, N. Sidorova, R. Haakma, W. M. van der Aalst, Mining local process models, Journal of Innovation in Digital Ecosystems 3 (2016) 183–196.

[2] V. Peeva, L. L. Mannel, W. M. van der Aalst, From place nets to local process models, in: Petri Nets, Springer, 2022, pp. 346–368.

[3] M. D. Brunings, D. Fahland, B. F. van Dongen, Defining meaningful local process models, in: ToPNoC XVI, Springer, 2022, pp. 24–48.

[4] M. D. Brunings, Discovering Context-Rich Local Process Models, Master's thesis, Eindhoven University of Technology, 2018.

[5] D. Fahland, D. Lo, S. Maoz, Mining branching-time scenarios, in: 28th IEEE/ACM International Conference on Automated Software Engineering (ASE), IEEE, 2013, pp. 443–453.

[6] S. J. Leemans, D. Fahland, W. M. van der Aalst, Discovering block-structured process models from event logs - a constructive approach, in: Petri Nets, Springer, 2013, pp. 311–329.

[7] B. F. van Dongen, BPI Challenge 2012, 2012.

[8] D. Schuster, S. J. van Zelst, W. M. van der Aalst, Sub-model freezing during incremental process discovery in Cortado, 2021.