Sample Evaluation of Ontology-Matching Systems

Willem Robert van Hage (1,2), Antoine Isaac (1), and Zharko Aleksovski (3)

1 Vrije Universiteit, Amsterdam
2 TNO Science & Industry, Delft
3 Philips Research, Eindhoven

{wrvhage,aisaac,zharko}@few.vu.nl

Abstract. Ontology matching exists to solve practical problems. Hence, methodologies to find and evaluate solutions for ontology matching should be centered on practical problems. In this paper we propose two statistically founded evaluation techniques to assess ontology-matching performance that are based on the application of the alignment. Both are based on sampling. One examines the behavior of an alignment in use, the other examines the alignment itself. We show the assumptions underlying these techniques and describe their limitations.

1 Introduction

The advent of the Semantic Web has led to the development of an overwhelming number of ontologies (http://swoogle.umbc.edu indexes over 10,000 ontologies as of 2007). Cross-referencing between these ontologies by means of ontology matching has therefore become necessary. Ontology matching has thus been acknowledged as one of the most urgent problems for the community, and also as one of the most scientifically challenging tasks in Semantic Web research.

Consequently, many matching tools have been proposed, which is a mixed blessing: comparative evaluation of these tools is now required to guide both ontology-matching research and application developers in search of a solution. One such effort, the Ontology Alignment Evaluation Initiative (OAEI, http://oaei.ontologymatching.org), provides a collaborative comparison of state-of-the-art mapping systems, which has greatly accelerated the development of high-quality techniques. The focus of the OAEI has mainly been on comparing mapping techniques for research purposes.

Good evaluation of ontology-matching systems takes into account the purpose of the alignment. (In this paper we use the definitions presented in [1]: an ontology-matching system produces a set of correspondences called an alignment.) Every application has different requirements for a matching system. Some applications use rich ontologies, others use simple taxonomies. Some require equivalence correspondences, others subsumption or even very specific correspondences such as artist-style or gene-enzyme. Also, the scope of concepts and relations is often determined by unwritten application-specific rules (cf. [2]). For example, consider the subclass correspondence between the concepts Gold and Jewelry. This correspondence holds if the scope of Gold is limited to the domain of jewelry; otherwise the two would just be related terms. In either case, the application determines relevance.

The best way to evaluate the quality of an alignment is through extensive practical use in real-world applications. This, however, is usually not feasible, mainly for lack of time (i.e. money). Benchmarks and experiments using synthesized ontologies can reveal the strengths and weaknesses of ontology-matching techniques, but they disregard application-specific requirements. Therefore, the second-best option is to perform an evaluation that mimics actual usage, either by performing a number of typical usage scenarios, or by specifying the requirements an application has for the alignment and then testing whether these requirements are met. The final measure for system performance in practice is user satisfaction.
For the evaluation of matching systems, this means that a set of correspondences is good if users are satisfied with the effect the correspondences have in an application. Most current matching-evaluation metrics simulate user satisfaction by looking at a set of assessed correspondences. For example, Recall expresses how many of the assessed correspondences are found by a system. This has two major problems. (i) Some correspondences have larger logical consequences than others. That is to say, some correspondences subsume many other correspondences, while some only subsume themselves. This problem is addressed quite extensively in [3] and [4]. (ii) Correct correspondences do not automatically imply happy users. The impact of a correspondence on system performance is determined not only by its logical consequences, but also by its relevance to the user's information need. A correspondence can be correct and have many logical implications, but be irrelevant to the reasoning that is required to satisfy the user. Also, some correspondences simply have more impact than others.

In the following sections we propose two alternative approaches to include relevance in matching evaluation, one based on end-to-end evaluation (Sec. 2) and one based on alignment sample evaluation (Sec. 3). Both approaches use sample evaluation, but they differ in what is sampled and in the sample selection criteria. The former method uses sample queries, disregarding the alignment itself, and hence provides objectivity. The latter uses sample sets of correspondences which are selected in such a way that they represent different requirements on the alignment. We investigate the limitations of these statistical techniques and the assumptions underlying them. Furthermore, we calculate upper bounds on the errors caused by the sampling. Finally, in Sec. 4 we demonstrate the workings of the latter evaluation method in the context of the OAEI 2006 food track.

2 End-to-end Evaluation

This approach is completely system-performance driven, based on a sample set of representative information needs. The performance is determined for each trial information need, using a measure of user satisfaction. For example, such an information need could be "I would like to read a good book about the history of steam engines.", and one could use the F-score or the Mean Reciprocal Rank of the best book in the result list (one over the rank of the best possible result, e.g. 1/4 if the best result is the fourth in the list), or the time users spent to find an answer. The set of trials is selected such that it fairly represents different kinds of usage, i.e. more common cases receive more trials. Real-life topics should get adequate representation in the set of trials. In practice the trials are best constructed from existing usage data, such as log files of a baseline system. Another option is to construct the trials in cooperation with domain experts. A concrete example of an end-to-end evaluation is described in [5]. In that paper, Voorhees and Tice explicitly describe the topic construction method and the measure of satisfaction they used for the end-to-end evaluation of the TREC-9 question-answering track. The size and construction methods of test sets for end-to-end retrieval have been investigated extensively in the context of information-retrieval evaluation initiatives such as TREC (http://trec.nist.gov), CLEF (http://www.clef-campaign.org), and INEX (http://inex.is.informatik.uni-duisburg.de).
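To make the per-trial scoring concrete, the following minimal Python sketch computes the reciprocal rank of the best relevant result for each trial query; the queries, result lists, and relevance judgments in it are purely hypothetical.

    # Minimal sketch: reciprocal rank of the best relevant result per trial.
    # The queries, result lists, and relevance sets below are hypothetical.

    def reciprocal_rank(ranked_results, relevant):
        """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
        for rank, doc in enumerate(ranked_results, start=1):
            if doc in relevant:
                return 1.0 / rank
        return 0.0

    trials = {
        "history of steam engines": (["b12", "b07", "b33", "b02"], {"b02", "b33"}),
        "diesel locomotives":       (["b05", "b18", "b41"],        {"b18"}),
    }

    scores = {q: reciprocal_rank(results, relevant)
              for q, (results, relevant) in trials.items()}
    print(scores)                              # per-trial scores, e.g. 0.333 and 0.5
    print(sum(scores.values()) / len(scores))  # mean reciprocal rank over the trials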
When all typical kinds of usage are fairly represented in the sample set, the total system performance can be obtained by averaging the scores. (A more reliable method for weighted combination of the scores, which uses the variance of each performance measurement, is described in [7].) To evaluate the effect of an ontology alignment, one usually compares it to a baseline alignment in the context of the same information system. By changing the alignment while keeping all other factors the same, the only thing that influences the results is the alignment. The baseline alignment can be any alignment, but a sensible choice is a trivial alignment based only on simple lexical matching.

Comparative End-to-end Evaluation

    n             number of test trials (e.g. information-system queries) in the evaluation sample
    A, B          two ontology-matching systems
    Ai            outcome of the evaluation metric (e.g. semantic precision [3]) for the i-th test trial for system A
    I[Ai > Bi]    interpretation function that tests outperformance: 1 if Ai > Bi, and 0 if Ai ≤ Bi
    S+            = Σ I[Ai > Bi], the number of trials for which system A outperforms system B

To compare end-to-end system performances we determine whether one system performs better over a significant number of trials. There are many tests for statistical significance that use pairwise comparisons. Each test can be used under different assumptions. A common assumption is the normal distribution of performance differences: small differences between the performance of two systems are more likely than large differences, and positive differences are as likely as negative differences. However, this is not very probable in the context of comparative evaluation of matching systems. The performance differences between techniques are usually of a much greater magnitude than the estimation errors. There are many techniques that improve performance on some queries while not hurting performance on other queries. This causes a skewed distribution of the performance differences. Therefore, the most reliable test is the Sign test [8, 9]. This significance test only assumes that two systems with equal performance are equally likely to outperform each other on any trial. It does not take into account how much better a system is, only in how many cases a system is better. The test gives reliable results for at least 25 trials. It needs relatively large differences to proclaim statistical significance, compared to other statistical tests. This means that statistical significance calculated in this way is very strong evidence.

To perform the Sign test on the results of systems A and B on a set of n trials, we compare their scores for each trial, A1, ..., An and B1, ..., Bn. Based on these outcomes we compute S+, the total number of times A has a better score than B, for example the number of search queries for which A retrieves better documents than B. The null hypothesis is that the performance of A is equal to that of B. This hypothesis can be rejected at a confidence level of 95% (about 95% of the cases fall within 1.96 standard deviations of the mean of the normal or binomial distribution; in the derivations below we use 2 instead of 1.96 for simplicity, which guarantees a confidence level of more than 95%) if

    (2 · S+ − n) / √n > 1.96

For example, in the case of 36 trials, system A performs significantly better than system B when it outperforms system B in at least 24 of the 36 trials.
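A minimal sketch of this comparison, with hypothetical per-trial scores (e.g. per-query F-scores) for systems A and B:

    import math

    # Minimal sketch of the Sign test for comparing two matching systems A and B.
    # The per-trial scores below are hypothetical (e.g. per-query F-scores).
    scores_a = [0.8, 0.6, 0.9, 0.7, 0.5, 0.8, 0.9, 0.6, 0.7, 0.8,
                0.9, 0.7, 0.6, 0.8, 0.9, 0.7, 0.8, 0.6, 0.9, 0.7,
                0.8, 0.9, 0.7, 0.6, 0.8]
    scores_b = [0.7, 0.6, 0.8, 0.6, 0.6, 0.7, 0.8, 0.5, 0.6, 0.7,
                0.8, 0.6, 0.5, 0.7, 0.8, 0.6, 0.7, 0.5, 0.8, 0.6,
                0.7, 0.8, 0.6, 0.5, 0.7]

    n = len(scores_a)
    s_plus = sum(1 for a, b in zip(scores_a, scores_b) if a > b)  # trials where A beats B

    z = (2 * s_plus - n) / math.sqrt(n)   # test statistic used above
    print(f"n = {n}, S+ = {s_plus}, z = {z:.2f}")
    print("A significantly outperforms B" if z > 1.96 else "no significant difference")

Ties count as "A does not outperform B", which matches the definition of I[Ai > Bi] above.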
3 Alignment Sample Evaluation

Another evaluation approach is to assess the alignment itself. However, in practice it is often too costly to manually assess all the correspondences. A solution to this problem is to take a small sample from the whole set of correspondences [10]. This sample is manually assessed and the results are generalized to estimate system performance on the whole set of correspondences. As opposed to the elegant abstract way of evaluating system behavior provided by end-to-end evaluation, alignment sample evaluation has many hidden pitfalls. In this section we only investigate the caveats that are inherent to sample evaluation. We do not consider errors based on non-sampling factors such as judgement biases, peculiarities of the ontology-matching systems or ontologies, and other unforeseen sources of evaluation bias.

Simple Random Sampling

    p    true proportion of the correspondences produced that is correct (unknown)
    n    number of sample correspondences used to approximate p
    P̂    approximation of p based on a sample of size n
    δ    margin of error of P̂ with 95% confidence

The most common way to deal with this problem is to take a small simple random sample from the whole set of correspondences. Assessing a set of correspondences can be seen as classifying the correspondences as Correct or Incorrect. We can see the output of a matching system as a Bernoulli random variable if we assign 1 to every Correct correspondence and 0 to every Incorrect correspondence it produces. The true Precision of a system is the probability p with which this random variable produces a 1. We can approximate this p by the proportion of 1's in a simple random sample of size n. With a confidence of 95% this approximation, P̂, lies in the interval:

    P̂ ∈ [p − δ, p + δ]   where   δ = 1/√n        (1)

Fig. 1. Venn diagram to illustrate sample evaluation. A ∪ B is a sample of the population of Correct correspondences. B ∪ C is a sample of the population of Found correspondences. (The diagram distinguishes the regions Correct and not found (A), Correct and found (B), Incorrect and found (C), and Incorrect and not found.)

Fig. 2. Concepts to consider when creating a sample for Recall evaluation based on a topic, for a mapping from ontology X to ontology Y and a mapping from ontology Y to ontology X. Black concepts are "on topic", white concepts "off topic". For example, the black concepts have something to do with steam engines and the white concepts do not. Concepts to consider for sample correspondences are marked by clouds. This avoids bias against cross-topic correspondences.

Both Precision and Recall can be estimated using samples. In the case of Precision we take a random sample from the output of the matching system, Found in Fig. 1. In this figure the sample for Precision is illustrated as B ∪ C. The results for this sample can be generalized to results for the set of all Found correspondences. In the case of Recall we take a random sample from the set of all correct correspondences, Correct in Fig. 1. The sample for Recall is illustrated as A ∪ B. The results for this sample can be generalized to results for the set of all Correct correspondences.
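As an illustration, the following sketch estimates Precision from a simple random sample as in equation (1); the alignment and the assessment oracle are hypothetical stand-ins for real system output and human judgement.

    import math
    import random

    # Minimal sketch of Precision estimation by simple random sampling (equation (1)).
    # 'found' stands for the alignment produced by a matching system; the assessment
    # function is a stand-in for a human judge and is entirely hypothetical.
    random.seed(42)
    found = [f"correspondence-{i}" for i in range(20000)]

    def assess(correspondence):
        """Hypothetical oracle: in a real evaluation a human judge returns 1 (Correct) or 0."""
        return random.random() < 0.7   # pretend the true precision is about 0.7

    n = 400
    sample = random.sample(found, n)            # simple random sample of the Found set
    p_hat = sum(assess(c) for c in sample) / n  # estimated Precision
    delta = 1 / math.sqrt(n)                    # 95% margin of error, worst case p = 0.5
    print(f"Precision ~ {p_hat:.3f} +/- {delta:.3f}")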
A problem with taking a random sample from all Correct correspondences is that it is unknown a priori which correspondences are correct and which are incorrect. A proper random sample can be taken by randomly selecting correspondences from all possible correspondences between concepts from the two aligned ontologies, i.e. a subset of the Cartesian product of the sets of concepts from both ontologies. Each correspondence then has to be judged to filter out all incorrect correspondences. This can be very time-consuming if there are relatively few valid correspondences in the Cartesian product. The construction time of the sample of correct correspondences can be reduced by only judging parts of the ontologies that have a high topical overlap. For example, one can consider only the correct mappings between concepts having to do with steam engines (cf. e.g. [11]). It is important to always match concepts about a certain topic in ontology X to all concepts in ontology Y, and all concepts about the same topic in ontology Y to all concepts in ontology X. This is illustrated in Fig. 2. It avoids a bias against correspondences to concepts outside the sample topic.

There are two caveats when applying this approximation method. (i) A sample of correct mappings constructed in this way is arbitrary, but not completely random. Correspondences in the semantic vicinity of other correspondences have a higher probability of being selected than "loners". This means that ontology-matching techniques that exploit structural aspects of the ontologies are slightly advantaged in the evaluation. (ii) The method works under the assumption that correspondences inside a topic are as hard to derive as correspondences across topics.

Stratified Random Sampling

    N     size of the entire population, e.g. the set of all correct correspondences
    h     one stratum of the entire population
    Nh    size of stratum h
    nh    number of sample correspondences used to approximate p in stratum h
    P̂h    approximation of p for the correspondences in stratum h

A better way to perform sample evaluation than simple random sampling is stratified random sampling. In stratified sampling, the population (i.e. the entire set of correspondences used in the evaluation) is first divided into subpopulations, called strata. These strata are selected in such a way that they represent parts of the population with a common property. Useful distinctions to make when stratifying a set of correspondences are: different alignment relations (e.g. equivalence, subsumption), correspondences in different domains (e.g. cats, automobiles), different expected performance of the matching system (e.g. hard and easy parts of the alignment), or different levels of importance to the use case (e.g. mission-critical versus nice-to-have). The strata form a partition of the entire population, so that every correspondence has a non-zero probability of ending up in a sample. Then a sample is drawn from each stratum by simple random sampling. These samples are assessed and used to score each stratum, treating the stratum as if it were an entire population. The approximated proportion and margin of error of each stratum can be calculated as for simple random sampling.
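A minimal sketch of drawing such a stratified sample is given below; the stratum-assignment rule is hypothetical, while the stratum and sample sizes anticipate the food-track example of Sec. 4.

    import random

    # Minimal sketch of drawing a stratified random sample from an alignment.
    # The stratum-assignment rule is hypothetical; in practice the strata follow
    # domain knowledge (e.g. taxonomical vs. biochemical vs. other concepts).
    random.seed(0)
    alignment = [f"correspondence-{i}" for i in range(31112)]

    def stratum_of(correspondence):
        """Hypothetical rule that assigns each correspondence to a stratum."""
        i = int(correspondence.split("-")[1])
        if i < 18399:
            return "taxonomical"
        elif i < 18399 + 2403:
            return "biological and chemical"
        return "miscellaneous"

    strata = {}
    for c in alignment:                          # partition the population into strata
        strata.setdefault(stratum_of(c), []).append(c)

    sample_sizes = {"taxonomical": 18399,        # fully assessed (semi-automatically)
                    "biological and chemical": 250,
                    "miscellaneous": 650}

    samples = {h: random.sample(members, min(sample_sizes[h], len(members)))
               for h, members in strata.items()}
    for h, s in samples.items():
        print(h, len(s), "of", len(strata[h]))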
Stratified random sampling for the evaluation of alignments has two major advantages over simple random sampling. (i) The separate evaluation of subpopulations makes it easier to investigate the conditions for the behavior of matching techniques. If the strata are chosen in such a way that they distinguish between different usages of the correspondences, we can draw conclusions about the behavior of the correspondences in a use case. For example, if a certain matching technique works very well on chemical concepts, but not on anatomical concepts, then this will only come up if this division is made through stratification. (ii) Evaluation results for the entire population acquired by combining the results from stratified random sampling are more precise than those of simple random sampling. With simple random sampling there is always a chance that the sample is coincidentally biased against an important property, whereas every property that is distinguished in the stratification process is guaranteed to be represented in the sample.

The results of all the strata can be combined into one result for the entire population by weighing the results by the relative sizes of the strata. Let N be the size of the entire population and N1, ..., NL the sizes of strata 1 to L, so that N1 + ... + NL = N. Then the weight of stratum h is Nh/N. Let nh be the size of the simple random sample in stratum h and P̂h the approximation of the proportion p in stratum h by the sample of size nh. We do not require the sample sizes n1, ..., nL to be equal, or proportional to the size of the stratum. The approximated proportion in the entire population, P̂, can be calculated from the approximated proportions of the strata, P̂h, as follows:

    P̂ = (1/N) · Σ_{h=1}^{L} Nh · P̂h

Because the variance of the binomial distribution is greatest at p = 0.5, we know that the greatest margin of error occurs when P̂ = 0.5. That means that with a confidence of 95% the approximation P̂ lies in the interval:

    P̂ ∈ [p − δ, p + δ]   where   δ = √( (1/N) · Σ_{h=1}^{L} Nh / (nh − 1) )        (2)

Comparative Alignment Sample Evaluation

    pA      true proportion of the correspondences produced by system A that is correct (unknown)
    P̂A      sample approximation of pA
    P̂A,h    P̂A in stratum h

To compare the performance of two systems, A and B, using sample evaluation, we calculate their respective P̂A and P̂B and check whether their margins of error overlap. If they do not, we can assume with a certain confidence that pA and pB are different, and hence that one system is significantly better than the other. For simple random sampling this can be calculated as follows:

    |P̂A − P̂B| > 2 √( P̂A(1 − P̂A)/n + P̂B(1 − P̂B)/n )        (3)

For stratified random sampling this can be calculated as follows:

    |P̂A − P̂B| > 2 √( Σ_{h=1}^{L} P̂A,h(1 − P̂A,h) · (Nh/N) · (1/nh)  +  Σ_{h=1}^{L} P̂B,h(1 − P̂B,h) · (Nh/N) · (1/nh) )        (4)

For both methods the maximum difference needed to distinguish pA from pB with a confidence of 95% is 2/√(2n). So if, depending on the type of sampling performed, equation (3) or (4) holds, there is a significant difference between the performance of systems A and B.
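The weighted combination and the comparative check can be sketched in a few lines of Python; the per-stratum precision estimates in the example are hypothetical, and the stratum sizes are those of the food-track example in Sec. 4. The comparative test follows the reading of equation (4) given above.

    import math

    # Minimal sketch of combining per-stratum results and comparing two systems.
    # Assumes both systems are evaluated over the same strata (same Nh and nh);
    # the per-stratum precision estimates p_h below are hypothetical.

    def combined_estimate(strata):
        """P-hat = sum over strata of (Nh / N) * P-hat_h."""
        n_total = sum(s["N_h"] for s in strata)
        return sum(s["N_h"] / n_total * s["p_h"] for s in strata)

    def significantly_different(strata_a, strata_b):
        """True if |P-hat_A - P-hat_B| exceeds the combined margin of error (cf. eq. (4))."""
        n_total = sum(s["N_h"] for s in strata_a)
        var = sum(s["p_h"] * (1 - s["p_h"]) * (s["N_h"] / n_total) / s["n_h"]
                  for s in strata_a + strata_b)
        p_a, p_b = combined_estimate(strata_a), combined_estimate(strata_b)
        return abs(p_a - p_b) > 2 * math.sqrt(var)

    system_a = [{"N_h": 18399, "n_h": 18399, "p_h": 0.95},
                {"N_h": 2403,  "n_h": 250,   "p_h": 0.70},
                {"N_h": 10310, "n_h": 650,   "p_h": 0.85}]
    system_b = [{"N_h": 18399, "n_h": 18399, "p_h": 0.90},
                {"N_h": 2403,  "n_h": 250,   "p_h": 0.65},
                {"N_h": 10310, "n_h": 650,   "p_h": 0.80}]

    print(round(combined_estimate(system_a), 3), round(combined_estimate(system_b), 3))
    print(significantly_different(system_a, system_b))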
4 Alignment Sample Evaluation in Practice

In this section we demonstrate the effects of alignment sample evaluation in practice by applying stratified random sampling to the results of the OAEI 2006 food track (http://www.few.vu.nl/~wrvhage/oaei2006) for the estimation of Precision, and we calculate the margin of error caused by the sampling process.

The OAEI 2006 food track is a thesaurus matching task between the Food and Agriculture Organisation of the United Nations (FAO) AGROVOC thesaurus and the thesaurus of the United States Department of Agriculture (USDA) National Agricultural Library (NAL). Both thesauri are supplied to participants in SKOS and OWL Lite (the conversion from SKOS to OWL Lite was provided by Wei Hu). The alignment had to be formulated in the SKOS Mapping Vocabulary (http://www.w3.org/2004/02/skos/mapping/spec) and submitted in the common format for alignments (http://oaei.ontologymatching.org/2006/align.html). A detailed description of the OAEI 2006 food track can be found in [12, 13].

Five teams submitted an alignment: Falcon-AO, COMA++, HMatch, PRIOR, and RiMOM. Each alignment consisted only of one-to-one semantic equivalence correspondences. The sizes of the five alignments are shown below.

    system     RiMOM    Falcon-AO   PRIOR    COMA++   HMatch   all systems
    # Found    13,975   13,009      11,511   15,496   20,001   31,112

The number of unique Found correspondences was 31,112. The number of Correct correspondences can be estimated to be of the same order of magnitude. In our experience, voluntary judges can only reliably assess a few hundred correspondences per day. That means assessing all the Found correspondences in the alignments would take many judges a few weeks of full-time work, which is only feasible with significant funding. Thus, we performed a sample evaluation.

During a preliminary analysis of the results we noticed that the performance of the different systems was quite consistent for most topics, except for correspondences between taxonomical concepts with Latin names (i.e. names of living organisms such as "Bos taurus"), where some systems performed noticeably worse than others. This was very surprising, given that there is a straightforward rule to decide the validity of a taxonomical correspondence, due to similar editorial guidelines for taxonomical concepts in the two thesauri: two concepts with the same preferred label and some ancestors with the same preferred label are equivalent. Also, when the preferred label of one concept is literally the same as the alternative label of the other and some of their ancestors have the same preferred label, they are equivalent. For example, the African elephant in AGROVOC has a preferred label "African elephant" and an alternative label "Loxodonta africana". In NALT (the NAL thesaurus) it is the other way around.

These rules allowed us to semi-automatically assess the taxonomical correspondences. This was not possible for the other correspondences, so we decided to separately evaluate correspondences from and to taxonomical concepts. We also noticed that most other correspondences were very easy to judge, except correspondences between biochemical concepts (e.g. "protein kinases") and substance names (e.g. "tryptophan 2,3-dioxygenase"). These required more than a layman's knowledge of biology or chemistry, so we decided to also evaluate biological and chemical concepts separately, with different judges. This led to three strata: taxonomical correspondences, biological and chemical correspondences, and the remaining correspondences. The sizes of the strata, along with the size of the evaluated part of each stratum and the corresponding stratum weights, are shown below.

    stratum topic              stratum size (Nh)   sample size (nh)   stratum weight (Nh/N)
    taxonomical                18,399              18,399             0.59
    biological and chemical     2,403                 250             0.08
    miscellaneous              10,310                 650             0.33
    all strata                 31,112              21,452

Precision estimates using these strata have a maximum margin of error of approximately 3.8% at a confidence level of 95%. That means that, under the assumption that there are no further biases in the experiment, a system with 82% Precision outperforms a system with 78% Precision with more than 95% confidence.

If, for example, we are interested in the performance of a system for the alignment of biological and chemical concepts and use the sample of 250 correspondences to derive the performance on the entire set of 2,403 correspondences, our margin of error would be 1/√250 ≈ 6.3%, as the short computation below also shows.
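The per-stratum figures quoted here follow directly from the worst-case forms of equations (1) and (3); a quick check for the biological and chemical stratum:

    import math

    n = 250                               # assessed biological and chemical correspondences
    single = 1 / math.sqrt(n)             # worst-case margin of error of one estimate, eq. (1)
    comparative = 2 / math.sqrt(2 * n)    # worst-case difference needed to separate two systems
    print(f"single estimate: +/- {single:.1%}")         # ~6.3%
    print(f"comparing two systems: {comparative:.1%}")  # ~8.9%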
Comparison of two systems based on only these 250 sampled biological and chemical correspondences gives results with a margin of error of 2/√(2 · 250) ≈ 8.9%. That means that with a confidence level of 95% we can distinguish a system with 50% Precision from a system with 59% Precision, but not from a system with 55% Precision.

5 Conclusion

We presented two alternative techniques for the evaluation of ontology-matching systems and showed the margin of error that comes with these techniques. We also showed how they can be applied and what the statistical results mean in practice in the context of the OAEI 2006. Both techniques allow a more application-centered evaluation approach than current practice.

Apart from the sampling errors we investigated in this paper, there are many other possible types of errors that can occur in an evaluation setting (some of which are discussed in [14]). Other sources of errors remain a subject for future work. This paper also leaves open the question of which technique to choose for a certain evaluation effort, for example when evaluation is applied to find the best ontology-matching system for a certain application. The right choice depends on which technique is more cost-effective. In practice, there is a trade-off between cheap and reliable evaluation: with limited resources there is no such thing as absolute reliability. Yet all the questions we have about the behavior of matching systems will have to be answered with the available evaluation results. The nature of the use case for which the evaluation is performed determines which of the two approaches is more cost-effective. Depending on the nature of the final application, evaluation of end-to-end performance will sometimes turn out to be more cost-effective than investigating the alignment, and sometimes the latter option will be the better choice.

We will apply the techniques presented in this paper to the food, environment, and library tasks of the forthcoming OAEI 2007 (http://oaei.ontologymatching.org/2007/). This should give us the opportunity to further study this subject.

Acknowledgments

We would like to thank Frank van Harmelen, Guus Schreiber, Lourens van der Meij, and Stefan Schlobach (VU), Hap Kolb, Erik Schoen, Jan Telman, and Giljam Derksen (TNO), Margherita Sini (FAO), and Lori Finch (NAL). Part of this work has been funded by NWO, the Netherlands Organisation for Scientific Research, in the context of the STITCH project and the Virtual Laboratories for e-Science (VL-e) project (http://www.vl-e.nl).

References

1. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer-Verlag, Heidelberg (DE) (2007)
2. Šváb, O., Svátek, V., Stuckenschmidt, H.: A study in empirical and casuistic analysis of ontology mapping results. In: Proc. of the European Semantic Web Conf. (ESWC). (2007)
3. Euzenat, J.: Semantic precision and recall for ontology alignment evaluation. In: Proc. of IJCAI 2007. (2007) 348–353
4. Ehrig, M., Euzenat, J.: Relaxed precision and recall for ontology matching. In: Proc. of the K-CAP 2005 Workshop on Integrating Ontologies. (2005) 25–32
5. Voorhees, E., Tice, D.: Building a question answering test collection. In: Proc. of SIGIR. (2000)
6. Voorhees, E.: Variations in relevance judgments and the measurement of retrieval effectiveness. In: Research and Development in Information Retrieval. (1998) 315–323
7. Meier, P.: Variance of a weighted mean. Biometrics 9(1) (1953) 59–73
8. Hull, D.: Evaluating evaluation measure stability. In: Proc. of SIGIR 2000. (2000)
9. van Rijsbergen, C.J.: Information Retrieval. Butterworths (1979)
10.
Cochran, W.G.: Sampling Techniques. John Wiley & Sons, Inc. (1977)
11. Wang, S., Isaac, A., van der Meij, L., Schlobach, S.: Multi-concept alignment and evaluation. In: Proc. of the Int. Workshop on Ontology Matching. (2007)
12. Euzenat, J., Mochol, M., Shvaiko, P., Stuckenschmidt, H., Šváb, O., Svátek, V., van Hage, W.R., Yatskevich, M.: Results of the Ontology Alignment Evaluation Initiative 2006. (2006)
13. Shvaiko, P., Euzenat, J., Stuckenschmidt, H., Mochol, M., Giunchiglia, F., Yatskevich, M., Avesani, P., van Hage, W.R., Šváb, O., Svátek, V.: Description of alignment evaluation and benchmarking results. KnowledgeWeb project deliverable D2.2.9 (2007)
14. Avesani, P., Giunchiglia, F., Yatskevich, M.: A large scale taxonomy mapping evaluation. In: Proc. of the Int. Semantic Web Conf. (ISWC). (2005)