POD - A Tool For Process Discovery Using Partial Orders and Independence Information Hernán Ponce-de-León1 , César Rodrı́guez2 , and Josep Carmona3 1 Helsinki Institute for Information Technology HIIT and Department of Computer Science and Engineering, School of Science, Aalto University, Finland 2 Université Paris 13, Sorbonne Paris Cité, LIPN, CNRS, France 3 Universitat Politècnica de Catalunya, Barcelona, Spain Abstract. Current process discovery techniques focus on the derivation of a process model on the basis of the activity relations extracted from an event log. However, there are situations where more knowledge can be provided to the discovery algorithm, thus alleviating the discovery challenge. In particular, the (partial) characterization of the indepen- dence or concurrency between a pair of activities may be well-known. In this paper we present POD, a tool for discovery of Petri nets that can incorporate this type of additional information. We believe requiring in- dependence/concurrency information is very natural in many scenarios, e.g., when there is some knowledge of the underlying process. We show with one example how this extra information can effectively deal with problems such as log incompleteness. 1 Introduction In the last decade several algorithms have been proposed to discover a Petri net from an event log [1]. Petri nets are indeed a formalism well-suited for capturing concurrency or distribution, and are often employed as sound mathematical rep- resentations of processes tailored for formal analysis. Remarkably, they can be translated to user-friendly formalisms such as BPMN. However Petri-net-based process analysis is not widespread in industrial practice, where automata-based formalism are often instead preferred. These approaches, however, suffer from the well-known state-explosion problem in the presence of concurrency or distri- bution. One of the reasons for this low-key industrial adoption of Petri nets, we be- lieve, is that many existing algorithms infer the concurrency relations between transitions exclusively from the information present in the log. This poses two problems. First, the quality of this inference deteriorates with log incomplete- ness, i.e., when the log does not contain enough information to extract the component’s distribution in the architecture of the process. Second, from an information-theoretical perspective, the log simply contains no information to distinguish non-deterministic actions from truly concurrent behaviour, which Copyright c 2015 for this paper by its authors. Copying permitted for private and academic purposes. Independence Negative Traces Event Labeled Event Occurrence Petri Log Merge Partial Structure Net Fold Net Orders Fig. 1. Unfolding-based process discovery. implies that most of the aforementioned algorithms can at best be based on probabilistic analysis of limited reliability. To fill this gap, we present POD, a tool that incorporates this crucial user in- put: a so-called independence relation, i.e., an explicit, user-provided description of the concurrency relations between the system activities. One example could be the different tests that a patient should undergo in order to have a diagnosis: blood test, allergy test, and radiology test. Specifically, this input is provided under the form of an irreflexive, symmetric binary relation on the set of system actions. One of the few works that considers the concurrency or distribution of the system as an input is [2]. They investigate the synthesis of a Petri net from a set of partial orders and rely on ad-hoc operators tailored to compose them (choice, sequentialization, parallel compositions and repetition). Since the operators may in practice introduce wrong generalizations, a domain expert is consulted for the legality of every extra run. Our tool automates the generalization step based on different properties that the final model should satisfy (replay, preservation of concurrency, avoidance of undesired behaviors). For a detailed description of the theory underlying the POD tool, the reader can refer to [3]. 2 Tool Description The approach of our tool is summarized in Figure 1. Starting from an event log and a concurrency relation on its set of activities, we construct a collection of labeled partial orders whose linearizations include both the sequences in the log as well those that could be obtained via successive permutations of concurrent activities. We then merge this collection into an event structure [4] which we next transform into an occurrence net (acyclic Petri net) representing the same be- havior. Finally, we perform a controlled generalization step by selectively folding the occurrence net into a Petri net. This step yields a net that (a) can execute all traces contained in the event log, and (b) generalizes the behavior of the log in a controlled manner. The folding process is driven by a folding equivalence rela- tion [3], which we synthesize using SMT (Satisfiability Modulo Theories, see [5]). Different folding equivalences guarantee different properties about the final net. In [3] we propose three different classes of equivalences to (a) preserve all se- quential executions of the log, (b) to additionally preserve the concurrency or distribution stated in the independence relation and (c) to avoid re-introducing negative (or forbidden) traces (this feature is still unimplemented in POD). 2.1 Architecture and Maturity POD is a command-line tool implemented in Python. It interacts with the Z3 SMT-solver [6] for computing the aforementioned folding equivalences. It is cur- rently under development and will evolve to a more stable version in the near future. Currently, it can be seen as a prototype testbed for process discovery based on partial orders, independence information, and generalization based on folding equivalences. It has been tested with only small to medium-size examples. The most prominent feature in POD is the ability to generate fully-fitting nets whose transitions exhibit exactly the same independence (concurrency) than the one stated by the independence relation given as input (POD can also relax this controlled generalization, i.e., generalizing more). While the underlying theory [3] allows for the discovery of more than one transition per log activity, POD merges all associated events into one single transition (to simplify the implementation). This ensures a minimum number of final transitions, but POD could sometimes be unable to find a suitable equivalence (unsatisfiable SMT encoding). Since the number of transitions in the folded net is fixed, it turns out that the quality (in terms of precision and generalization) of the mined model increases as one increases the number of places (which the user can conveniently control). 2.2 Download, Formats, Usage POD is available from http://github.com/cesaro/pod/releases/tag/v0.1. It has been tested in MAC and Linux and the installation requires simply to unpack the files into a directory. Once unpacked, cd to it and run it as ./src/pod.py. The tool can also perform various handy tasks related to process discovery, such as extracting a random log from a Petri net, dumping statistics of a net, a log, etc. POD reads and writes logs in XES format and Petri nets in PNML format. As for the independence relation, it expects, for the time being, a plain-text file where every line defines one pair of dependent transitions (i.e., the file contains the complement of an independence relation, usually smaller), cf. POD’s website. Assume that we have a log file log.xes and a dependency file dep.txt. We instruct POD to do process discovery: ./src/pod.py discover log.xes dep.txt --out result.pnml By default POD will output the occurrence net mentioned above, i.e., it will perform no generalization at all. In other words, it will use the identity relation as a folding equivalence, cf. [3]. Option --eq instructs POD to use a different folding equivalence (i.e., more interesting generalizations). The most important values are --eq=sp-smt and --eq=ip-smt. Under these two options, POD will in fact not use a predefined folding equivalence, but will rather use SMT solving to synthesize one from the log and the independence relation: – Option sp-smt (sequence, or fitness, preservation). The synthesized fold- ing equivalence will merge all events with equal label into one single tran- sition. It will also merge the presets of any two merged events. Options --smt-{min,max}-places restrict the number of generated places. – Option ip-smt (independence preservation). In addition to preserving all sequences of the log in the final net, the synthesized equivalence will preserve the concurrency expressed by the independence relation. Accepts roughly the same options as sp-smt. 3 Example Consider the net in Figure 2 (a), and consider the following log containing only two of its executions (which however fire all possible eight transitions): 1, 3, 5, 6, 2, 3, 4, 6, 7 1, 3, 5, 6, 8 While one could argue that this is a very incomplete log, we show that POD can in fact reconstruct Figure 2 (a) using this log plus some independence relation.4 We supply POD with the best independence relation that an expert would provide. It coincides with the independence relation of the original net. For instance, transitions 1 and 8 are independent, as one can never interfere (neither make possible nor disable) the firing of the other. Similarly, 3 and 4, or even 1 and 2 are independent. Transitions, 4 and 5, or 1 and 3 are however dependent. 1 1 2 1 2 3 1 3 5 3 4 5 3 4 5 4 5 2 6 2 6 6 6 3 4 8 6 7 8 7 8 7 8 7 (a) (b) (c) (d) 3 8 1 6 3 5 2 6 7 (e) 4 Fig. 2. (a) original Petri net; (b) net mined with POD; (c) net mined with ILP Miner; (d) and (e): event structure and occurrence net internally constructed by POD. When we run POD with these two inputs (log and independence relation), it will internally construct the event structure depicted in Figure 2 (d). This object tracks in a compact, concurrency-aware fashion the dependencies and 4 Go to http://lipn.fr/~rodriguez/exp/bpm15/ to reproduce this experiment. conflicts between the occurrences of transitions (events) of the net, as recorded in the log. POD next transforms it into Figure 2 (e), an occurrence net denoting exactly the same behaviour as the event structure. This net is a fitting and 100% precise description of the log (all executions in it but not in the log are provably executions of the original system). Observe that the two events labelled by 3 and 5 (immediately after 1) are concurrent, as requested by the input independence relation. Observe also that there is two occurrences of transition 6 and that this is somehow an unwinded version of the original net. Finally POD will fold this occurrence net and output the result, depicted in Figure 2 (b), which is identical to the original except for the output places of 7 and 8. We asked POD to use an independence-preserving (option --eq=ip-smt) folding, meaning that the result net should exhibit the same concurrency infor- mation expressed in the input independence relation. By contrast, ILP Miner [7], which only uses the log, produced Figure 2 (c). Observe that it was unable to infer the concurrency between 3 and 4 or 5 as in all log traces the latter transitions occur always after 3. Interestingly, for the same reason it was also unable to infer the direct causality between 3 and 6. This is an artifact of using the theory of regions on incomplete logs, that derive the most precise model which then fails at incorporating unseen behavior. 4 Significance for the BPM Field Process discovery is recognized as one of the key enablers of BPM. However, problems like log incompleteness, as exemplified above, or noise hamper signif- icantly the success of process discovery techniques in industrial scenarios. We believe that the development of tools like POD, which let the user incorporate knowledge that can be used to soundly cope with these challenges, can fill the gap between academic tools and industrial acceptance of process discovery. References 1. van der Aalst, W.M.P.: Process Mining - Discovery, Conformance and Enhancement of Business Processes. Springer (2011) 2. Bergenthum, R., Desel, J., Mauser, S., Lorenz, R.: Construction of process models from example runs. Tr. Petri Nets and Other Models of Conc. 2 (2009) 243–259 3. Ponce-de-León, H., Rodrı́guez, C., Carmona, J., Heljanko, K., Haar, S.: Unfolding- based process discovery. CoRR abs/1507.02744 (2015) 4. Nielsen, M., Plotkin, G.D., Winskel, G.: Petri nets, event structures and domains, part I. Theoretical Computer Science 13 (1981) 85–108 5. de Moura, L.M., Bjørner, N.: Satisfiability modulo theories: introduction and ap- plications. Commun. ACM 54(9) (2011) 69–77 6. de Moura, L.M., Bjørner, N.: Z3: an efficient SMT solver. In: Tools and Algo- rithms for the Construction and Analysis of Systems, 14th International Conference, TACAS. Volume 4963 of LNCS., Springer (2008) 337–340 7. van der Werf, J., van Dongen, B., Hurkens, C., Serebrenik, A.: Process Discovery Using Integer Linear Programming. In: Proceedings of the 29th International Con- ference on Applications and Theory of Petri Nets, Springer-Verlag (2008) 368–387