Detecting Semantic Business Process Model Clones Thomas S. Heinze1 , Wolfram Amme2 , and André Schäfer2 1 German Aerospace Center (DLR) thomas.heinze@dlr.de 2 Friedrich Schiller University Jena [wolfram.amme,andre.schaefer]@uni-jena.de Abstract. Process modeling with languages like BPMN allows process designers to create the same business process model in various ways. Detecting model clones, i.e., pairs of business process models sharing a certain degree of similarity can be difficult. In this paper, we propose an approach to process model clone detection based upon dominator trees and targeted at detecting semantically though not necessarily syntactically similar process models of business processes. 1 Introduction Duplicated process models is a common issue in business process management and modeling and in particular relevant when process models are organized in model repositories [11]. Matching business process models and estimating their similarity has thus been an important research topic and has many applications, ranging from analyzing conformance to reference models [3], tracing and identification of process variants [2] to process model search and clone detection [1]. In this paper, we address the latter problem of finding process models which share a certain degree of similarity, which is known as model clone detection in the literature [11,12]. Note that such model clones can origin from homologous development [14], where a process designer reuses and modifies an existing model to generate a new process model. As a result, the original and new model usually share a high lexical similarity. In contrast, in heterologous development [14], process models are created independently of each other but implementing similar functionalities. Thus, such models are not necessarily syntactically similar, i.e., can differ in the number of activities and gateways or in their structure. Finding such semantic clones is in particular hard for business process models due to the large number of modeling purposes and practices, e.g., block- and graph-oriented process modeling styles. While differing modeling styles can be found in different business process modeling languages, multiple paradigms can even exist in the very same language. A well-known representative for such a language is BPMN. As an example, consider the two BPMN process models shown on the left- hand side of Fig. 1. Apparently, both process models implement a similar business process. In particular, in terms of possible execution traces, the lower model’s J. Manner, S. Haarmann, S. Kolb, N. Herzberg, O. Kopp (Eds.): 13th ZEUS Workshop, ZEUS 2021, Bamberg, held virtually due to Covid-19 pandemic, Germany, 25-26 February 2021, published at http://ceur-ws.org Copyright © 2020 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). 26 Thomas S. Heinze et al. X H A D + B X X A + E + G + F B C C + D H B X X + + D E H G E F G A + + C F Fig. 1. Business process models (left) and dominator tree (right) behavior is a subset of the upper model’s behavior (activity G can precede activity F in the upper model in contrast to the lower model). Estimating the models’ similarity and therefore identifying them as model clones is though not straightforward due to the more graph-oriented modeling style of the upper model and the more block-oriented style of the lower model. Such a pair of process models is known as a semantic model clone, i.e., two process models which have similar behavior but do not necessarily share the same syntax and structure [12]. 2 Process Model Clone Detection In order to identify semantic model clones, we propose a model clone detection method based on dominator trees [10]. Dominator trees provide for an abstraction of process control flow and therefore encode the dominance relation of flow graphs. A node n in a flow graph is said to dominate another node m, iff n comes before m on every path from the flow graph’s start node to m. Note that this relation can be efficiently computed for flow graphs using data flow analysis, also for workflow graphs [4,8]. Each node and its immediate, i.e., closest dominating node then results in an edge in the dominator tree. The same dominator tree represents both the example models and is shown on the right-hand side of Fig. 1. As can be seen, the dominator tree comprises guaranteed happens-before relations between the models’ activities, while abstracting from the process models’ structure. In order to allow for an efficient comparison of process models, dominator trees are encoded into sets of integer sequences. In this way, standard metrics, e.g., Hamming or edit distance, can be used for their comparison. If the thus computed difference between two process models does not exceed a certain threshold value, the models are considered to be a model clone. For the encoding, each path of a dominator tree, starting at its root node and ending at a leaf node, is mapped to a sequence of integers, yielding a set of integer sequences. For the dominator tree in Fig. 1, this would mean to encode the five paths [A, B, D], [A, B, E, F ], [A, B, H], [A, B, G], and [A, C]. As the two example process models share the same dominator tree shown in Fig. 1, they are identified as model clone. Also Detecting Semantic Business Process Model Clones 27 considering modifications like adding, modifying, or removing a single or a small number of nodes in one of the models, as in case of homologous development, would only slightly affect their encoding and the result of their comparison. Control flow is an integral part of a business process model, but its other aspects have to be considered as well. While the encoding of dominator trees introduced above does not cover the analysis of node labels by itself, stemming, bag-of-words, word embeddings, and other related methods [6,11] can be inte- grated into our approach. In this way, the encoding becomes more permissive in the presence of differing node labels, as is common in business process models. Furthermore, the pre-processing of node labels by means of Static Single Assign- ment Form [4,8] helps in including aspects of process data in the encoding [4]. The thorough study of an optimal encoding is subject to our future work. 3 Related Work Business process model similarity estimation and matching has been a frequent topic in research. Due to space constraints, we refer to the survey in [11] for a comprehensive overview. Respective approaches can be categorized into: syntacti- cal, structural, behavioral methods and approaches based upon human judgement. Syntactical methods compare process models based on the therein used labels, e.g., as in [6]. Structural methods use process models as graphs for estimating their similarity, e.g., calculating the graph edit distance or isomorphisms [1]. Behavioral methods instead use process execution traces or logs to compare process models, e.g., measuring the longest common subseqence [3]. Note that our proposed method aligns between both, as it does not rely on the actual process modeling structure. In this way, we avoid the usually costly analysis of process behavior but are still able to abstract from modeling styles and practices. The comparison with behavioral methods and analysis of the tradeoff between precision and scalability to large process model repositories is subject to future work. In a way, our method reminds of causal footprints [2] or behavioral profiles and footprints [7,13]. Remember that dominator trees encode the guaranteed happens-before relation of activities. Behavioral profiles and footprints are based on execution traces and the direct or eventual successor relation of activities therein, respectively. Measuring similarity is then conducted using the relations’ Jaccard index for two process models, or an execution log and a process model. In general, the major part of research on clone detection addresses source code. However, some approaches have been considering clone detection for model-based languages besides business process modeling languages, e.g., UML [12]. 4 Next Steps We are interested in implementing the proposed method in a system for detecting semantic BPMN model clones, e.g., starting with the existing control flow analyses available in mojo [9]. The implementation would allow us for experimenting with various optimizations and parameters, including differing metrics and encodings. 28 Thomas S. Heinze et al. Furthermore, the thorough evaluation and comparison with state-of-the-art ap- proaches to process model clone detection and similarity estimation, in particular the trace-based approaches mentioned above, is another item for future work. Such an evaluation requires though a dataset with ground truth, i.e., known process model clones. As such datasets are unfortunately scarce, we plan to resort to human judges for generating missing labels in an existing dataset, using for example the set of mined process models in [5]. Alternatively, applying small mutations on seed models can be used to generate a dataset, similar to [7]. References 1. Dijkman, R.M., Dumas, M., van Dongen, B.F., Käärik, R., Mendling, J.: Similarity of business process models: Metrics and evaluation. Inf. Syst. 36(2), 498–516 (2011) 2. van Dongen, B., Dijkman, R., Mendling, J.: Measuring Similarity between Business Process Models. In: CAiSE 2008. LNCS, vol. 5074, pp. 450–464. Springer (2008) 3. Gerke, K., Cardoso, J.S., Claus, A.: Measuring the Compliance of Processes with Reference Models. In: OTM 2009. LNCS, vol. 5870, pp. 76–93. Springer (2009) 4. Heinze, T.S., Amme, W., Moser, S.: Static analysis and process model transformation for an advanced business process to Petri net mapping. Softw. Pract. Exp. 48(1), 161–195 (2018) 5. Heinze, T.S., Stefanko, V., Amme, W.: Mining BPMN Processes on GitHub for Tool Validation and Development. In: BPMDS/EMMSAD@CAiSE 2020. LNBIP, vol. 387, pp. 193–208. Springer (2020) 6. Klinkmüller, C., Weber, I., Mendling, J., Leopold, H., Ludwig, A.: Increasing Recall of Process Model Matching by Improved Activity Label Matching. In: BPM 2013. LNCS, vol. 8094, pp. 211–218. Springer (2013) 7. Kunze, M., Weidlich, M., Weske, M.: Behavioral Similarity – A Proper Metric. In: BPM 2011. LNCS, vol. 6896, pp. 166–181. Springer (2011) 8. Lee, J., Midkiff, S.P., Padua, D.A.: Concurrent Static Single Assignment Form and Constant Propagation for Explicitly Parallel Programs. In: LCPC 1997. LNCS, vol. 1366, pp. 114–130. Springer (1997) 9. Prinz, T.M., Spieß, N., Amme, W.: A First Step towards a Compiler for Business Processes. In: CC 2014. LNCS, vol. 8409, pp. 238–243. Springer (2014) 10. Schäfer, A., Amme, W., Heinze, T.S.: Detection of Similar Functions Through the Use of Dominator Information. In: ACSOS 2020 Comp. pp. 206–211. IEEE (2020) 11. Schoknecht, A., Thaler, T., Fettke, P., Oberweis, A., Laue, R.: Similarity of Business Process Models – A State-of-the-Art Analysis. ACM Comput. Surv. 50(4), 52:1– 52:33 (2017) 12. Störrle, H.: Towards Clone Detection in UML Domain Models. Softw. Syst. Model. 12(2), 307–329 (2013) 13. Weidlich, M., van der Werf, J.M.: On Profiles and Footprints – Relational Semantics for Petri Nets. In: Petri Nets 2012. LNCS, vol. 7347, pp. 148–167. Springer (2012) 14. Wu, M., Wang, P., Yin, K., Cheng, H., Xu, Y., Roy, C.K.: LVMapper: A Large- Variance Clone Detector Using Sequencing Alignment Approach. IEEE Access 8, 27986–27997 (2020)