Automatic Classification to Matching Patterns for Process Model Matching Evaluation

Elena Kuss and Heiner Stuckenschmidt
University of Mannheim, Artificial Intelligence Group

Abstract. Business process model matching is concerned with the detection of similarities in business process models. To support the progress of process model matching techniques, efficient evaluation strategies are required. State-of-the-art evaluation techniques provide a grading of the evaluated matching techniques. However, they offer only limited information about the strengths and weaknesses of an individual matching technique. To evaluate matching systems efficiently, it is necessary to automatically analyze the attributes of the matcher output. In this paper, we propose an evaluation based on the automatic classification of alignments into matching patterns, on the one hand to understand the strengths and weaknesses of a matching technique, and on the other hand to identify potential for further improvement. Consequently, optimal matching scenarios for a specific matcher can be derived. This further enables tuning a matcher to specific applications.

Keywords: business process model matching evaluation, matching performance assessment, matching patterns

Copyright © by the paper's authors. Copying permitted only for private and academic purposes. In: C. Cabanillas, S. España, S. Farshidi (eds.): Proceedings of the ER Forum 2017 and the ER 2017 Demo track, Valencia, Spain, November 6th-9th, 2017, published at http://ceur-ws.org

1 Introduction

Conceptual models, like business process models, are widely used to document a company's operations. Process model matching aims at identifying semantic similarities in business process models. Application scenarios vary widely, for example from process model search in large databases [9] and the automatic integration of models during company mergers [11] to clone detection [6]. There are many techniques for the automatic identification of similarities in process models [4,13,17,19]. Current process model matching techniques focus on finding similarities between process models by comparing the label strings of these process models.

Overall, it can be observed that most research effort is spent on advancing the process matching techniques themselves rather than on advancing their evaluation. The process model matching track (PMMT) at the Ontology Alignment Evaluation Initiative (OAEI) [1] and the Process Model Matching Contests (PMMCs) [2,3] apply different metrics to evaluate process model matching techniques. These metrics allow for evaluating the performance of matching techniques through a ranking. Consequently, they provide only limited information about the performance of matching techniques; in particular, they lack detailed information about the strengths and weaknesses of a matching technique. However, the progress of process model matching techniques is strongly influenced by the available evaluation techniques. It is important that the evaluation techniques are "fair" and that they can be computed efficiently, i.e., without intensive manual labor.

In this paper, we propose a conceptually different evaluation approach. The basis of our idea is to group the alignments of the matcher output as well as the reference alignment into different categories. In this work we use five such categories. For each category, the widely used measures precision, recall and F-measure are computed.
Computing these metrics for each category enables us to analyze a matcher's performance in greater detail. In this way, we gain insights into which matchers perform well on "trivial" correspondences and which can also identify "difficult" correspondences. Among the more complex correspondences, we learn for each individual matcher which correspondences it can find and which remain challenging.

In the first step of the proposed evaluation method, the correspondences in the reference alignment are assigned to the different categories. The same is done for the matcher output in the second step. Both steps are performed automatically – no user input is required. Then, each category is treated as its own matching problem to which standard metrics can be applied.

By automatic annotation with matching patterns, the matching problem itself is classified into categories. These categories divide the matching problem into different levels of complexity. On the one hand, this helps in gaining insights about the complexity of the matching task itself; on the other hand, it helps in understanding the strengths and weaknesses of a matcher. Equally important, the categorization helps to identify possibilities for improving a matcher's performance.

Furthermore, the categorization helps in understanding the performance of a matcher for specific matching tasks. A matcher sometimes needs to serve different tasks. For example, when searching for similar process models in a database, it may be required to identify a high fraction of correspondences which are semantically identical, whereas for other applications it is sufficient to compute a high fraction of identical labels. For an efficient evaluation it is necessary that the evaluation takes the specific application scenario into account.

The remainder of the paper is organized as follows. In Section 2, the basis of the process model evaluation is defined. Section 3 introduces the matching patterns which are automatically assigned and gives examples to illustrate these patterns. In Section 4, evaluation experiments are conducted and the results of the evaluation by matching patterns are discussed in detail. Section 5 gives an overview of related work, while Section 6 concludes the paper and discusses future work.

2 Problem Description

Definition 1 (Alignment, reference alignment, matcher output, correspondence). For two process models P1, P2 with their activity sets A1, A2, an alignment is a set consisting of activity pairs (a1, a2) with a1 ∈ A1 and a2 ∈ A2. A reference alignment G is a subset of all possible activity pairs, i.e., G ⊆ A1 × A2. Similarly, a matcher output O is a subset of all possible activity pairs, i.e., O ⊆ A1 × A2. A pair contained in the reference alignment is also called a correspondence.

The task of a matcher is to identify semantically similar activity pairs, i.e., to identify the correspondences of the reference alignment. Currently, the matching is mostly based on the label strings of the process models to be matched. Because it is rarely possible to reach a perfect matching, i.e., O = G, the matcher output needs to be evaluated.
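To make Definition 1 concrete, the following minimal Python sketch represents alignments simply as sets of label pairs. The labels and sets shown are illustrative and not taken from the evaluated data sets; the authors' actual evaluation framework is the Java-based PMMC-Evaluator referenced in Section 3.1.

```python
# Minimal sketch of Definition 1: an alignment as a set of activity pairs,
# where activities are represented by their labels. Illustrative only.
from typing import Set, Tuple

Alignment = Set[Tuple[str, str]]  # pairs (a1, a2) with a1 from A1 and a2 from A2

# Hypothetical reference alignment G and matcher output O for two small models
G: Alignment = {
    ("Evaluate", "Assessment of application"),
    ("Send acceptance", "Send commitment"),
}
O: Alignment = {
    ("Send acceptance", "Send commitment"),
    ("Check data", "Check documents"),
}
# Since O != G, the matcher output has to be evaluated against G (Section 3.2).
```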
There is a whole body of literature on process model matching techniques. What they all have in common is that the evaluation focuses on grading the evaluated matchers. Thus, the evaluation is designed to provide a matcher's rank within a group of matchers. However, most evaluation methods are not designed to provide a detailed analysis of an individual matcher output. To obtain such detailed information, it is currently necessary to manually process and interpret the matcher output in order to identify possibilities for improvement. In contrast, we propose a new evaluation technique which provides such information for an individual matcher output without the need for manual processing. Our proposed category-dependent evaluation has the following properties:

– it informs about the data set, e.g., the complexity of the matching task
– it indirectly assesses the reference alignment, e.g., the quality and quantity of manual annotations
– it identifies characteristics as well as strengths and weaknesses of a matcher
– it enables optimizing a matcher for specific application scenarios.

By identifying the strengths and weaknesses of a matcher, the proposed evaluation technique may aid the progress of matching techniques. This becomes clearer with the experiments in Section 4.

3 Automatic Classification to Matching Patterns

Figure 1 illustrates the conceptual structure of the automatically assigned categories. The matching problem is divided into "trivial" and "non-trivial" alignments. "Trivial" alignments are all alignments whose labels are identical after normalization.

Definition 2 (Normalization). For the classification, an activity is normalized if (1) all stop words are removed, (2) stemming has been applied and (3) case sensitivity is ignored.

Examples of stop words are "of", "for" and "the". Examples of stemming are "checking" and "checks", which are both transformed into "check". "Non-trivial" alignments are all other alignments. The "non-trivial" category consists of four sub-categories. In the following, we define these categories and give examples.

Fig. 1. Structural dependencies of the categories

3.1 The Categories

The core of the evaluation via matching patterns is to automatically assign the correspondences of the reference alignment as well as the computed alignments to groups with specific attributes. (The implementation of the matching patterns, including the automatic annotation, is available at https://github.com/kristiankolthoff/PMMC-Evaluator/tree/master/src/main/java/de/unima/ki/pmmc/evaluator/annotator.) After the automatic classification into one of the categories, the well-known metrics precision, recall and F-measure [14] are calculated for each of these categories separately. As a consequence, the matching task is divided into groups with specific attributes. In the following, we define the categories and illustrate them with examples from the applied data sets. The categories are chosen to provide deeper knowledge about specific attributes which are important features of a matching technique. Note that the numbering of the categories is not related to the complexity level of the corresponding category.

Definition 3 (Categorization, category). Let A1, A2 be the activity sets of two process models P1 and P2. A categorization is a division of all possible activity pairs into disjoint sets C(i), i.e., C(i) ⊆ A1 × A2 for all i with ∪_i C(i) = A1 × A2 and C(i) ∩ C(j) = ∅ for all i ≠ j. Any C(i) is a category.

In general, the categories should be chosen carefully. It is important that the classification can be assigned automatically. Moreover, too many categories lead to too few alignments in each category, while too few categories lack information.
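To make the normalization of Definition 2 concrete, the following is a minimal sketch. The stop-word list and the use of NLTK's Porter stemmer are assumptions for illustration only; Definition 2 does not prescribe specific resources, and the actual implementation is the Java PMMC-Evaluator linked above.

```python
# Sketch of Definition 2: lowercase the label, drop stop words, stem the rest.
# Stop-word list and stemmer are illustrative choices, not the authors' exact ones.
from typing import List
from nltk.stem import PorterStemmer

STOP_WORDS = {"of", "for", "the", "to", "a", "an", "and", "in", "by"}  # illustrative
_stemmer = PorterStemmer()

def normalize(label: str) -> List[str]:
    """Return the normalized token list of an activity label."""
    tokens = label.lower().split()
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return [_stemmer.stem(t) for t in tokens]

# normalize("Checking the documents") == normalize("check documents"),
# so such a pair of labels would count as "trivial".
```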
We propose the following categories (the examples are taken from the reference alignments of the data sets of the PMMC 2015):

Category "trivial": This category contains alignments whose labels are identical after normalization.

Category I "no word identical": Alignments which have no word in common after normalization are assigned to this category. Examples:
Ex. 1: Evaluate – Assessment of application
Ex. 2: Hand application over to examining board – Send documents to selection committee [The stop word "to" is ignored and not counted as an identical word.]
Ex. 3: Talk to applicant – Do the interview
Ex. 4: Shipping – Delivery and Transportation Preparation
Ex. 5: Shipment – Transportation Planning and Processing

Category II "one verb identical": Alignments assigned to this category have exactly one identical verb after normalization. No other words are identical. Examples:
Ex. 6: Send documents by post – Send all the requirements to the secretarial office for students
Ex. 7: Wait for results – Waiting for response [This example illustrates two specific characteristics: the verb is normalized (stemming) and the stop word (in this case "for") is ignored.]
Ex. 8: Send acceptance – Send commitment
Ex. 9: Check data – Check documents

Category III "one word identical": This category consists of alignments which have exactly one word (but not a verb) in common after normalization. Examples:
Ex. 10: Talk to applicant – Appoint applicant
Ex. 11: Hand application over to examining board – Send application to selection committee [In this example the stop word "to" is ignored.]
Ex. 12: Apply online – Fill in online form of application
Ex. 13: Invoice approval – Invoice Verification

Category "misc": All other alignments, which cannot be assigned to one of the above categories. Examples:
Ex. 14: Send application – Send application form and documents
Ex. 15: Send documents to selection committee – Send application to selection committee
Ex. 16: Receiving the written applications – Receive application
Ex. 17: Time Sheet Approval – Time Sheet Permit

The described categories vary strongly in their level of complexity. In the following, we discuss the categories in order of increasing complexity. Cat. trivial contains labels that are identical after normalization; only basic syntactical matching techniques are required to identify such correspondences. Cat. misc contains all alignments which cannot be assigned to one of the other categories and thus only alignments which share more than one identical word. Therefore, this category contains less complex alignments than Cat. I through Cat. III. Cat. II and Cat. III have a rather high complexity level, since the alignments in these categories have just one word or one verb in common. Both categories further indicate whether a matcher already produces a high fraction of alignments when only one word or only the verb of two labels is identical. Cat. I, however, is the most complex of the introduced categories, since its alignments have no word in common. They have no syntactical overlap. Consequently, these alignments have only a semantic connection, as is the case for synonyms. To identify alignments from this category correctly, a matcher requires advanced semantic knowledge. Note that each alignment is assigned to exactly one of these categories exclusively, i.e., an alignment cannot be assigned to several categories.
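A minimal sketch of this category assignment, building on the normalize() sketch above, might look as follows. How verbs are detected is not specified in this paper; the small verb-stem list below is a placeholder assumption (a real implementation would consult a POS tagger or a lexicon), and the function only approximates the categorization performed by the Java PMMC-Evaluator.

```python
# Sketch of the automatic assignment of an activity pair to one of the five
# categories (Section 3.1). The verb list is a placeholder assumption only.
ILLUSTRATIVE_VERB_STEMS = {"send", "check", "wait", "receiv", "talk"}

def is_verb(token: str) -> bool:
    # Placeholder: a real implementation would use a POS tagger or lexicon.
    return token in ILLUSTRATIVE_VERB_STEMS

def categorize(label1: str, label2: str) -> str:
    t1, t2 = normalize(label1), normalize(label2)
    if t1 == t2:
        return "trivial"                         # identical after normalization
    common = set(t1) & set(t2)
    if not common:
        return "I"                               # no word identical
    if len(common) == 1:
        word = next(iter(common))
        return "II" if is_verb(word) else "III"  # exactly one verb / one other word
    return "misc"                                # more than one word in common

# categorize("Send acceptance", "Send commitment")      -> "II"  (cf. Ex. 8)
# categorize("Evaluate", "Assessment of application")   -> "I"   (cf. Ex. 1)
```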
Fig. 2. Example of a categorized reference alignment

Figure 2 illustrates a simplified example of a reference alignment which has been assigned to the categories described above. The figure shows two example process models which represent the application process of Master students at two universities. A matcher's task is to identify correspondences of one process model in the other process model. The correspondences of this matching task are marked with different shades of gray for each of the categories introduced above. For each introduced category there is one example in the figure; Cat. misc is illustrated with two examples of the reference alignment. Note that the figure shows an example of a reference alignment only; for the evaluation procedure, the alignments computed by a matcher are classified into the matching patterns as well.

3.2 Metrics for the Categories

Definition 4 (Category-dependent precision, category-dependent recall, category-dependent F-measure). Let G be the reference alignment, O be the matcher output and C(i) be the categories. The set G(i) is the collection of reference alignments assigned to category C(i), i.e., G(i) = G ∩ C(i) for all i. Similarly, O(i) is the collection of correspondences computed by a matcher and assigned to category C(i), i.e., O(i) = O ∩ C(i) for all i. The category-dependent precision cP(i) is defined as

cP(i) = |G(i) ∩ O(i)| / |O(i)|,

the category-dependent recall cR(i) is given by

cR(i) = |G(i) ∩ O(i)| / |G(i)|,

and the category-dependent F-measure cFM(i) is

cFM(i) = 2 · (cP(i) · cR(i)) / (cP(i) + cR(i)).

Precision is the fraction of correctly computed alignments among all computed alignments. Recall is the fraction of correctly computed alignments among all correct correspondences (with respect to the reference alignment). Both precision and recall are values between 0.0 and 1.0. A precision of 1.0 means that all computed correspondences are contained in the reference alignment, i.e., O(i) ⊆ G(i). In contrast, a recall of 1.0 means that all correspondences of the reference alignment are computed, i.e., G(i) ⊆ O(i). The F-measure is the harmonic mean of precision and recall.

All alignments in the reference alignment as well as in the matcher output are assigned to exactly one category exclusively, i.e., there is no overlap between the categories. After this, precision, recall and F-measure are calculated for the alignments of each category. That means the categories are evaluated separately and independently.
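Continuing the sketches above, the category-dependent metrics of Definition 4 can be computed per category as follows. Function and variable names are illustrative; in Tables 1-3 the reported cP, cR and cFM are additionally macro-averaged over the test cases of a data set (see Section 4.2).

```python
# Sketch of Definition 4: split G and O by category, then compute cP, cR, cFM
# per category. Builds on the Alignment type and categorize() sketches above.
from collections import defaultdict
from typing import Dict, Tuple

def split_by_category(alignment: Alignment) -> Dict[str, Alignment]:
    """Group the pairs of an alignment by their automatically assigned category."""
    groups: Dict[str, Alignment] = defaultdict(set)
    for a1, a2 in alignment:
        groups[categorize(a1, a2)].add((a1, a2))
    return groups

def category_metrics(G: Alignment, O: Alignment) -> Dict[str, Tuple[float, float, float]]:
    """Return {category: (cP, cR, cFM)} as in Definition 4."""
    G_by, O_by = split_by_category(G), split_by_category(O)
    results = {}
    for cat in set(G_by) | set(O_by):
        g, o = G_by[cat], O_by[cat]
        tp = len(g & o)                              # |G(i) ∩ O(i)|
        cP = tp / len(o) if o else 0.0
        cR = tp / len(g) if g else 0.0
        cFM = 2 * cP * cR / (cP + cR) if cP + cR else 0.0
        results[cat] = (cP, cR, cFM)
    return results
```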
4 Experiments

To illustrate the insights which can be obtained by the proposed matching evaluation, we apply the introduced technique to the data sets and participating matchers of the Process Model Matching Contest 2015 and the Process Model Matching Track at the OAEI 2016.

4.1 Setting

In the following, the data sets are described. More details about the participating matchers and the data sets can be found in [1,2].

Birth Registration Data set (PMMC 2015). The birth registration data set contains 36 business process model pairs, which model the process of registering a newborn child in Russia, South Africa, Germany and the Netherlands. The models are represented as Petri nets.

Asset Management Data set (PMMC 2015). This data set consists of 36 model pairs from a SAP Reference Model collection which describe processes in the area of finance and accounting. These models are event-driven process chains (EPCs).

University Admission Data set (PMMC 2015 / PMMT 2016). This data set contains 36 process model pairs, derived from 9 models from German universities, which model the application process of Master students.

4.2 Results

Below, we present the results of the automatically generated matching patterns. The matching patterns are assigned automatically to the reference alignment as well as to the output of the matchers which participated in the PMMC 2015 and the PMMT at the OAEI 2016. Then, category-dependent precision, recall and F-measure are computed for each category separately, which yields the following results.

Computational results. Tables 1-3 illustrate the results for each data set. The first column lists all participating matchers in alphabetical order. In the second column, the F-measure (FM) over all matching patterns is given as the micro value, i.e., it is computed over all test cases. The remaining columns provide the category-dependent precision (cP), recall (cR) and F-measure (cFM) for each matcher in each category; cP, cR and cFM are macro values, independently computed for each of the matching patterns. For each category, the table headings further show the fraction of correspondences of the whole data set as well as the total number of correspondences of that category in the reference alignment. The best three matchers are highlighted in each category.

Approach | FM | Cat. trivial (44.3%, 103): cP cR cFM | Cat. I no word iden. (29.3%, 68): cP cR cFM | Cat. II one verb iden. (11.6%, 27): cP cR cFM | Cat. III one word iden. (7.3%, 17): cP cR cFM | Cat. misc (7.3%, 17): cP cR cFM
AML | .698 | .844 .959 .862 | .952 .595 .623 | .833 .300 .311 | .667 .372 .344 | .157 .576 .183
AML-PM | .385 | .844 .963 .864 | .458 .397 .334 | .187 .633 .217 | .045 .500 .069 | .112 .970 .151
BPLangM | .397 | .939 .816 .864 | – – – | .462 .344 .262 | .152 .526 .175 | .084 .348 .094
DKP | .538 | .844 .968 .867 | .267 .048 .070 | – – – | – – – | .136 .227 .099
DKP-lite | .534 | .844 .968 .867 | – – – | – – – | – – – | .136 .227 .099
KnoMa-Proc | .394 | .833 .931 .845 | – – – | .078 .133 .067 | .068 .346 .092 | .052 .409 .066
KMSSS | .544 | .846 1.0 .883 | .450 .172 .151 | .500 .289 .251 | .357 .205 .164 | .142 .636 .152
LogMap | .481 | .844 .978 .872 | – – – | .467 .167 .127 | .082 .372 .094 | .092 .530 .110
MSSS | .608 | .844 .968 .867 | .500 .069 .057 | .833 .500 .489 | – – – | .143 .091 .083
OPBOT | .601 | .978 .706 .774 | .713 .468 .433 | .562 .322 .290 | .432 .500 .333 | .128 .530 .164
pPalm-DS | .253 | .843 .986 .874 | – – – | .053 .344 .072 | .029 .410 .046 | .062 .939 .086
RMM-NHCM | .668 | .954 .930 .928 | .821 .374 .397 | .452 .456 .292 | .550 .372 .302 | .178 .439 .166
RMM-NLM | .636 | .843 1.0 .881 | .486 .324 .303 | – – – | – – – | 1.0 .091 .091
RMM-SMSL | .543 | .844 .912 .839 | .778 .423 .439 | .152 .311 .121 | – – – | .087 .121 .058
RMM-VM2 | .293 | .825 .767 .759 | – – – | .044 .367 .065 | .040 .372 .058 | .081 .742 .110
TripleS | .485 | .843 1.0 .881 | – – – | .077 .156 .072 | .625 .179 .185 | .025 .121 .029

Table 1. Results of the University Admission data set (PMMC 2015 and PMMT 2016); each category column lists cP, cR and cFM.
Approach | FM | Cat. trivial (45.9%, 102): cP cR cFM | Cat. I no word iden. (34.2%, 76): cP cR cFM | Cat. II one verb iden. (0.9%, 2): cP cR cFM | Cat. III one word iden. (8.1%, 18): cP cR cFM | Cat. misc (10.8%, 24): cP cR cFM
AML-PM | .677 | .996 1.0 .998 | 1.0 .059 .059 | .667 1.0 .667 | .231 .396 .219 | .571 .742 .565
BPLangM | .646 | .996 .951 .970 | .300 .176 .149 | 1.0 .500 .500 | .200 .354 .161 | .646 .706 .528
KnoMa-Proc | .355 | .251 .968 .367 | – – – | .083 1.0 .119 | .100 .062 .042 | .220 .570 .237
KMSSS | .579 | .996 1.0 .998 | – – – | – – – | .333 .062 .067 | .342 .552 .297
MSSS | .619 | .996 1.0 .998 | – – – | – – – | – – – | .417 .127 .119
OPBOT | .639 | .996 1.0 .998 | .250 .026 .033 | .500 1.0 .500 | .286 .469 .252 | .640 .891 .653
pPalm-DS | .474 | .996 1.0 .998 | – – – | 1.0 1.0 1.0 | .243 .312 .161 | .301 .909 .333
RMM-NHCM | .661 | .996 1.0 .998 | – – – | – – – | .500 .062 .083 | .667 .303 .290
RMM-NLM | .653 | .996 1.0 .998 | – – – | – – – | – – – | 1.0 .194 .217
RMM-SMSL | .354 | .990 .582 .659 | – – – | – – – | .333 .177 .144 | .333 .109 .089
RMM-VM2 | .603 | .996 .962 .976 | 1.0 .059 .059 | .667 1.0 .667 | .131 .417 .126 | .450 .612 .418
TripleS | .578 | .996 1.0 .998 | – – – | – – – | .111 .062 .048 | .372 .633 .324

Table 2. Results of the Asset Management data set (PMMC 2015); each category column lists cP, cR and cFM.

One central observation concerns the distribution of the correspondences in the reference alignments, which aids in understanding the complexity level of the applied data sets. Table 1 illustrates the results for the University Admission data set (including the participants of the PMMT at the OAEI 2016). As can be observed, the University Admission data set consists of a very high fraction of trivial correspondences: almost half of the correspondences (44.3%) in the reference alignment are trivial. It can further be observed that most matchers focus on identifying trivial correspondences; only a few matchers can identify a reasonable number of complex correspondences. Similar behavior can be observed for the Asset Management data set in Table 2 with 45.9% trivial correspondences. Again, most matchers focus on identifying these trivial correspondences. No matcher achieves good results for Cat. I, and for Cat. II and Cat. III the picture is similar. However, for the Asset Management data set, the number of correspondences in Cat. II is too low to draw meaningful conclusions.

Approach | FM | Cat. trivial (4.5%, 26): cP cR cFM | Cat. I no word iden. (75.0%, 437): cP cR cFM | Cat. II one verb iden. (1.5%, 9): cP cR cFM | Cat. III one word iden. (9.9%, 58): cP cR cFM | Cat. misc (9.1%, 53): cP cR cFM
AML-PM | .392 | .190 .792 .239 | .386 .329 .329 | .071 .048 .036 | .496 .336 .308 | .772 .366 .382
BPLangM | .418 | .891 .594 .562 | .517 .254 .314 | .500 .333 .250 | .554 .346 .340 | .742 .253 .295
KnoMa-Proc | .262 | .563 .698 .509 | .215 .279 .229 | .200 .190 .100 | .130 .130 .082 | .519 .342 .300
KMSSS | .385 | .908 .688 .701 | .791 .239 .308 | – – – | .450 .148 .143 | .773 .352 .355
MSSS | .332 | 1.0 .667 .696 | .973 .174 .243 | – – – | .583 .130 .119 | 1.0 .144 .158
OPBOT | .565 | .882 .854 .831 | .676 .422 .483 | .250 .286 .152 | .714 .485 .470 | .688 .451 .444
pPalm-DS | .459 | .894 .875 .871 | .454 .356 .354 | .100 .286 .111 | .452 .426 .335 | .706 .587 .504
RMM-NHCM | .456 | .923 .698 .717 | .717 .319 .389 | .400 .190 .167 | .552 .262 .261 | .623 .292 .283
RMM-NLM | .309 | 1.0 .443 .487 | .952 .163 .223 | – – – | .389 .130 .119 | 1.0 .154 .174
RMM-SMSL | .384 | .667 .292 .307 | .506 .329 .346 | .095 .286 .111 | .304 .194 .147 | .633 .390 .353
RMM-VM2 | .433 | .894 .854 .805 | .391 .339 .339 | .050 .190 .056 | .413 .432 .335 | .652 .470 .469
TripleS | .384 | .445 .719 .450 | .651 .268 .309 | – – – | .433 .142 .131 | .679 .367 .361

Table 3. Results of the Birth Registration data set (PMMC 2015); each category column lists cP, cR and cFM.

For the Birth Registration data set (Table 3), the observation is different: 75% of all correspondences belong to Cat. I, which shows that this data set is by far the most complex of the three data sets.
Similar to the Asset Management data set, only a small fraction of the correspondences in the reference alignment have only the verb in common (Cat. II). Furthermore, it can be observed that many matchers fail to identify the trivial correspondences of this data set. One explanation may be the reference alignment itself: we find that the reference alignment of the Birth Registration data set does not fully cover all trivial correspondences or contains incorrect trivial alignments. Another observation is that matchers need to take structural dependencies into account in order to differentiate between incorrect and correct trivial alignments.

Cat. trivial of the Asset Management and the Birth Registration data sets only contains correspondences which are exactly identical without any normalization. This is not the case for the University Admission data set. Therefore, we further distinguish between the kinds of trivial correspondences, i.e., whether these correspondences are "identical" or "identical after normalization"; see Figure 3. We find that in the University Admission data set, about 7% of Cat. trivial consists of correspondences which are only trivial after normalization, e.g., stemming. This is a very small fraction and illustrates that for the detection of most correspondences of this category not even normalization is required. However, the sub-division of Cat. trivial helps to understand whether matchers are able to detect trivial correspondences which require normalization.

Observations and Findings. With the evaluation through matching patterns it is possible to identify characteristics, strengths and weaknesses of a matcher. The results clearly show that most matchers focus on finding correspondences with low complexity, i.e., Cat. trivial and Cat. misc. The matchers clearly struggle to identify complex correspondences. This is especially evident for the Asset Management data set, which contains special technical terms; to detect non-trivial correspondences, a matching technique requires knowledge about these terms. It can be observed that the matcher BPLangMatch, in contrast to the other matchers, is able to identify difficult correspondences of this specific data set. The matcher AML achieves very good results for Cat. I (cFM of 0.623) on the University Admission data set. Given this high performance for this category, it could be expected that AML would also achieve a high performance on the Birth Registration data set. However, since this matcher did not participate in the PMMC 2015, results for this data set are not available. Moreover, the matcher OPBOT achieves considerably good results for Cat. I in the University Admission data set. It is therefore not surprising that this matcher reaches the best F-measure on the Birth Registration data set.
Approach | University Admission: trivial(103) I(68) II(27) III(17) misc(17) | Asset Management: trivial(102) I(76) II(2) III(18) misc(24) | Birth Registration: trivial(26) I(437) II(9) III(58) misc(53)
AML | 12/6 2/27 1/22 3/11 45/8 | – | –
AML-PM | 12/5 32/48 60/12 113/10 117/1 | 1/0 0/75 2/0 19/11 13/4 | 48/5 203/287 9/8 24/38 6/32
BPLangM | 6/24 37/68 8/20 55/9 70/10 | 1/5 15/71 0/1 16/13 8/7 | 2/8 72/311 3/6 16/37 6/41
DKP | 12/4 19/60 0/27 0/17 36/14 | – | –
DKP-lite | 12/4 0/68 0/27 0/17 36/14 | – | –
KnoMa-Proc | 12/10 1/68 24/23 41/12 134/9 | 209/7 0/76 13/0 15/17 69/8 | 12/7 463/306 10/7 35/53 15/37
KM-SSS | 12/0 16/61 7/18 9/13 83/6 | 1/0 0/76 0/2 3/17 61/10 | 2/7 23/326 1/9 7/52 4/41
LogMap | 12/2 0/68 4/23 39/11 92/8 | – | –
Match-SSS | 12/4 4/64 2/17 0/17 9/17 | 1/0 3/76 0/2 0/18 8/21 | 0/8 6/347 1/9 3/53 0/48
OPBOT | 3/21 11/32 8/21 12/10 60/8 | 1/0 18/74 3/0 27/9 21/2 | 2/4 74/248 5/5 17/29 12/24
pPalm-DS | 13/3 5/68 143/15 317/10 216/2 | 1/0 4/76 0/0 35/13 163/1 | 4/3 181/274 10/7 32/34 17/19
RMM-NHCM | 8/3 7/42 13/16 5/11 36/9 | 1/0 0/76 0/2 1/17 3/15 | 1/7 49/295 3/7 12/43 8/37
RMM-NLM | 13/0 25/45 0/27 0/17 0/17 | 1/0 0/76 0/2 0/18 0/18 | 0/13 10/352 1/9 7/53 0/46
RMM-SMSL | 11/17 6/34 59/15 8/17 43/15 | 1/56 0/76 0/2 14/14 5/22 | 4/16 130/291 13/7 19/50 8/39
RMM-VM2 | 8/20 2/68 114/20 169/11 104/5 | 1/7 0/75 1/0 39/10 16/9 | 3/4 190/280 21/7 32/31 13/28
TripleS | 13/0 0/68 52/22 2/14 51/16 | 1/0 0/76 0/2 19/16 56/7 | 25/6 60/313 1/9 10/52 7/40

Table 4. False-positive (FP) and false-negative (FN) alignments for the three data sets and all matchers, assigned to the categories. Each cell lists FP/FN; the bracketed numbers in the heading give the number of reference correspondences per category; a dash indicates that no results are available for the matcher on that data set.

However, comparing the three data sets, it seems that the matchers are optimized for specific data sets. For example, while the matchers focus on finding correspondences from Cat. trivial in the University Admission and Asset Management data sets, on the Birth Registration data set they aim at identifying correspondences from Cat. I. This can also be observed in the number of false-positive and false-negative alignments for each category (Table 4). The matchers compute a high number of false-positive alignments in Cat. I for the Birth Registration data set, i.e., the matchers aim at identifying correspondences from this category. For the Asset Management data set, however, most matchers do not compute alignments from Cat. I at all. This can be explained by the fact that Cat. I is, on the one hand, the most difficult category, while on the other hand, succeeding on the Birth Registration data set requires computing correspondences from this category, since about 75% of all correspondences of that data set belong to Cat. I. The classification of the false positives and false negatives into the categories allows a more fine-grained understanding of a matcher's performance and makes it possible to directly identify the sources of a matcher's errors. For example, the matcher KnoMa-Proc computes a very high number of false-positive alignments in Cat. trivial for the Asset Management data set, and the matcher RMM-SMSL misses many trivial correspondences (56) in the Asset Management data set.
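The per-category error counts of Table 4 follow directly from the categorized sets used in Definition 4. A minimal sketch continuing the examples above (names are illustrative, not the authors' implementation):

```python
# Sketch: per-category false positives (computed pairs not in the reference
# alignment) and false negatives (missed correspondences), as in Table 4.
def category_errors(G: Alignment, O: Alignment) -> Dict[str, Tuple[int, int]]:
    """Return {category: (FP, FN)} based on the categorized sets."""
    G_by, O_by = split_by_category(G), split_by_category(O)
    return {
        cat: (len(O_by[cat] - G_by[cat]),  # false positives in this category
              len(G_by[cat] - O_by[cat]))  # false negatives in this category
        for cat in set(G_by) | set(O_by)
    }
```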
5 Related work

State-of-the-art evaluation techniques mostly rely on precision, recall and F-measure for the evaluation of process model matching techniques. In [10] these metrics are extended to probabilistic versions based on a non-binary reference alignment. This non-binary reference alignment contains support values derived from the number of annotators who identified a correspondence. This extension allows a deeper understanding of a matcher's performance, since it helps to understand whether matchers focus on identifying correspondences with high support values (mostly obvious correspondences) or whether they can also identify arguable correspondences. Furthermore, [18] propose a prediction of the performance of process matching techniques.

In ontology matching, more research has been dedicated to evaluation strategies [5,7,15,16,20]. Euzenat, for example, introduced semantic precision and recall. This extension for ontology matching techniques makes it possible to determine whether computed correspondences are related, by taking the ontological structure into account. As a consequence, deducible alignments are also evaluated. Although it better reflects the quality of the computed alignments, it does not provide information about the structure of the correspondences a matcher can identify and thus does not provide information about the specific strengths and weaknesses of matching techniques.

In schema matching, [12] propose synthetic scenarios to tune a schema matcher to specific applications. The annual Ontology Alignment Evaluation Initiatives apply synthetic data sets which allow testing matching systems with respect to specific characteristics [1]. However, these synthetic data sets are artificially generated test cases which rely on the same resources as most matchers, e.g., WordNet. Therefore, experts generated synonyms manually, but manual synonym generation comes with high effort. In [8], the authors explain that it is not suitable to apply matchers to synthetic scenarios, since these scenarios are too artificial and it is not clear whether matchers show a similar performance on real-world data. The matching patterns proposed in this paper offer the same features as synthetic data sets, but with real-world data.

6 Conclusion

In this paper we propose a conceptually new approach that divides the matching task as well as the matcher output into patterns with specific attributes. The proposed evaluation via matching patterns provides an in-depth assessment of a matcher's performance, including its specific strengths and weaknesses. It therefore allows for an application-dependent evaluation, as the evaluation procedure can aid in improving matching techniques towards desired attributes. The evaluation procedure is furthermore an efficient way to automatically process the matcher output. It can help to understand what kind of false-positive and false-negative alignments matchers generate and therefore enables a quantitative as well as a qualitative analysis. The approach can be extended with different matching patterns; the only requirement is that they can be assigned automatically. Furthermore, all standard metrics can be applied. In future work we aim to use the matching patterns for a prediction of a matcher's performance on specific data sets. We further plan to apply the evaluation technique to the participating matchers of the next Process Model Matching Contest.

Acknowledgments. We thank Kristian Kolthoff for his work on implementing the automatic assignment of the problem categories.

References

1. Manel Achichi, Michelle Cheatham, Zlatan Dragisic, Jérôme Euzenat, Daniel Faria, Alfio Ferrara, Giorgos Flouris, Irini Fundulaki, Ian Harrow, Valentina Ivanova, et al. Results of the ontology alignment evaluation initiative 2016. In 11th ISWC Workshop on Ontology Matching (OM), pages 73–129, 2016.
2. Goncalo Antunes, Marzieh Bakhshandeh, Jose Borbinha, Joao Cardoso, Sharam Dadashnia, Chiara Di Francescomarino, Mauro Dragoni, Peter Fettke, Avigdor Gal, Chiara Ghidini, et al. The process model matching contest 2015. GI-Edition/Proceedings: Lecture Notes in Informatics, 248:127–155, 2015.
3. Ugur Cayoglu, Remco Dijkman, Marlon Dumas, Peter Fettke, Luciano García-Bañuelos, Philip Hake, Christopher Klinkmüller, Henrik Leopold, André Ludwig, Peter Loos, et al. Report: The process model matching contest 2013. In International Conference on Business Process Management, pages 442–463. Springer, 2013.
4. Remco Dijkman, Marlon Dumas, and Luciano García-Bañuelos. Graph matching algorithms for business process model similarity search. In International Conference on Business Process Management, pages 48–63. Springer, 2009.
5. Marc Ehrig and Jérôme Euzenat. Relaxed precision and recall for ontology matching. In Proc. K-Cap 2005 Workshop on Integrating Ontologies, pages 25–32, 2005.
6. Chathura C Ekanayake, Marlon Dumas, Luciano García-Bañuelos, Marcello La Rosa, and Arthur HM ter Hofstede. Approximate clone detection in repositories of business process models. In International Conference on Business Process Management, pages 302–318. Springer, 2012.
7. Jérôme Euzenat. Semantic precision and recall for ontology alignment evaluation. In IJCAI, pages 348–353, 2007.
8. Jérôme Euzenat, Christian Meilicke, Heiner Stuckenschmidt, Pavel Shvaiko, and Cássia Trojahn. Ontology alignment evaluation initiative: six years of experience. In Journal on Data Semantics XV, pages 158–192. Springer, 2011.
9. Tao Jin, Jianmin Wang, Marcello La Rosa, Arthur ter Hofstede, and Lijie Wen. Efficient querying of large process model repositories. Computers in Industry, 64(1):41–49, 2013.
10. Elena Kuss, Henrik Leopold, Han van der Aa, Heiner Stuckenschmidt, and Hajo A Reijers. Probabilistic evaluation of process model matching techniques. In Conceptual Modeling: 35th International Conference, ER 2016, Gifu, Japan, November 14-17, 2016, Proceedings, pages 279–292. Springer, 2016.
11. Marcello La Rosa, Marlon Dumas, Reina Uba, and Remco Dijkman. Business process model merging: An approach to business process consolidation. ACM Trans. Softw. Eng. Methodol., 22(2):11:1–11:42, March 2013.
12. Yoonkyong Lee, Mayssam Sayyadian, AnHai Doan, and Arnon S Rosenthal. eTuner: tuning schema matching software using synthetic scenarios. The VLDB Journal, 16(1):97–122, 2007.
13. Henrik Leopold, Mathias Niepert, Matthias Weidlich, Jan Mendling, Remco Dijkman, and Heiner Stuckenschmidt. Probabilistic optimization of semantic process model matching. In International Conference on Business Process Management, pages 319–334. Springer, 2012.
14. Christopher D Manning, Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Cambridge University Press, 2008.
15. Hela Sfar, Anja Habacha Chaibi, Amel Bouzeghoub, and Henda Ben Ghezala. Gold standard based evaluation of ontology learning techniques. In Proceedings of the 31st Annual ACM Symposium on Applied Computing, pages 339–346. ACM, 2016.
16. Pavel Shvaiko and Jérôme Euzenat. Ontology matching: state of the art and future challenges. IEEE Transactions on Knowledge and Data Engineering, 25(1):158–176, 2013.
17. Sergey Smirnov, Remco Dijkman, Jan Mendling, and Mathias Weske. Meronymy-based aggregation of activities in business process models. In Conceptual Modeling – ER 2010, pages 1–14, 2010.
18. Matthias Weidlich, Tomer Sagi, Henrik Leopold, Avigdor Gal, and Jan Mendling. Predicting the quality of process model matching. In Business Process Management, pages 203–210. Springer, 2013.
19. Matthias Weidlich, Eitam Sheetrit, Moisés C Branco, and Avigdor Gal. Matching business process models using positional passage-based language models. In International Conference on Conceptual Modeling, pages 130–137. Springer, 2013.
20. Elias Zavitsanos, George Paliouras, and George A Vouros. Gold standard evaluation of ontology learning methods through ontology transformation and alignment. IEEE Transactions on Knowledge and Data Engineering, 23(11):1635–1648, 2011.