Relevance-Based Evaluation of Alignment Approaches: the OAEI 2007 food task revisited

Willem Robert van Hage (1,2), Hap Kolb (1), and Guus Schreiber (2)

(1) TNO Science & Industry, Stieltjesweg 1, 2628 CK Delft, the Netherlands, hap.kolb@tno.nl
(2) Vrije Universiteit Amsterdam, de Boelelaan 1081a, 1081 HV Amsterdam, the Netherlands, wrvhage@few.vu.nl, schreiber@cs.vu.nl

Abstract. Current state-of-the-art ontology-alignment evaluation methods are based on the assumption that alignment relations come in two flavors: correct and incorrect. Some alignment systems find more correct mappings than others and hence, by this assumption, they perform better. In practical applications, however, it does not only matter how many correct mappings you find, but also which correct mappings you find. This means that, apart from correctness, relevance should also be included in the evaluation procedure. In this paper we expand the sample-based evaluation of the OAEI 2007 food task with a sample evaluation that uses relevance to prototypical search tasks as a selection criterion for the drawing of sample mappings.

1 Introduction

In recent years ontology alignment has become a major field of research [3, 5]. Especially in the field of digital libraries it has had a great impact. Good evaluation is essential for the deployment of ontology-alignment techniques in practice. The main contribution of this paper is to offer a simple method to capture the performance of alignment approaches in actual applications. We introduce relevance-based evaluation, which compensates for some of the shortcomings of existing methods by using the needs of users during sample selection. We apply this method to the data of the OAEI 2007 food task [2].

Nearly all existing evaluation measures used to determine the quality of alignment approaches are based on counting mappings [1, 2]. For instance, in the context of ontology alignment, Recall is defined as the number of correct mappings a system produces divided by the total number of correct mappings that can possibly be found (i.e. that are desired to be part of the result). Regardless of their differences, most of these measures have one thing in common: they do not favor one mapping over another, in order to give an objective impression of system performance. Any mapping could prove to be important to some application. Therefore, they can only tell us how many mappings are found on average by a system, but not which mappings are found and whether the mappings that are found are those that are useful for a certain application. Whenever someone wants to decide which alignment approach is best suited for his application (e.g. [7]), he will have to reinterpret average expected performance in the light of his own needs. This can be a serious obstacle for users.

A solution to this problem is to incorporate the importance of mappings (i.e. relevance) into the evaluation result. This solution immediately raises two new problems: (1) how to come up with suitable importance weights, and (2) how to define a simple and intuitive way to use these weights. With respect to problem 1, there are many sensible ways to weigh the importance of mappings, for example, based on the size of the logical consequence, cf. Semantic Precision and Semantic Recall [1], or on expected traversal frequency, cf. [4]. Relevance-based evaluation equates importance to relevance to prototypical application scenarios.
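Before turning to problem 2, it is useful to make the count-based notion of Recall used above explicit, together with the relevance-restricted variant that the rest of this paper computes. The notation below is ours and is introduced only for illustration:

```latex
% A      = set of mappings produced by an alignment system
% R      = set of all correct (desired) mappings, i.e. the reference alignment
% R_rel  = subset of R that is relevant to the prototypical search tasks
\[
  \mathrm{Recall}(A, R) = \frac{|A \cap R|}{|R|},
  \qquad
  \mathrm{Recall}_{\mathrm{rel}}(A, R_{\mathrm{rel}}) = \frac{|A \cap R_{\mathrm{rel}}|}{|R_{\mathrm{rel}}|}.
\]
```

In other words, relevance-based evaluation leaves the measure itself untouched and only changes which reference mappings it is computed over.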
Likewise, with respect to problem 2, there are many sensible ways to incorporate mapping importance into an evaluation method, for example, linear combination, cf. [6], or stratification, cf. [9]. Unlike existing methods that account for the relevance of mappings by including it as a variable in an evaluation measure, we use relevance to steer the sample-selection process. Instead of randomly selecting mappings for the evaluation of alignment approaches (cf. the food and environment tasks described in [2]) we select only those that are relevant to an application. This way we can use existing and well-understood evaluation metrics, like Precision and Recall, to measure performance on important tasks as opposed to fictive average-case performance.

2 Experimental Set-up

We demonstrate how relevance-based evaluation works by extending the existing results of the OAEI 2007 food task, which did not take relevance into account. We determine relevance for the mappings based on hot topics related to this task, like global warming and increasing food prices, which we obtain by means of query-log analysis, expert interviews, and news feeds. For the original OAEI 2007 food task, Recall was measured on samples that represent the frequency of topics in the vocabularies. In relevance-based evaluation the samples are drawn according to frequency of use in search tasks, specifically, finding documents about prototypical agricultural topics of current interest in one collection using the indexing vocabulary of the other.

The procedure we use is as follows: (1) Gather topics that represent important use cases. We gather "hot" topics in agriculture from the query log files of the FAO AGRIS/CARIS search engine, the FAO newsroom website, and interviews with two experts: Patricia Merrikin from the FAO's David Lubin Library and Fred van de Brug from the TNO Quality of Life food-safety group. We manually construct search-engine queries for each topic. (2) Gather documents that are highly relevant to these topics. We ascertain which documents would be sufficient for the hot topics by gathering suitable candidate documents from the part of the FAO AGRIS/CARIS and USDA AGRICOLA reference databases that overlaps. We use a free-text search engine (http://www.fao.org/agris/search) and manually filter out all irrelevant documents. (3) Collect the meta-data describing the subject of these documents and align the concepts that describe the subject of the documents to concepts in the other thesaurus. We collect values of the Dublin Core subject field from the AGRIS/CARIS and AGRICOLA reference databases. These values come from subject vocabularies, respectively AGROVOC and the NAL Agricultural Thesaurus. We manually align each concept to the most similar concept in the other vocabulary. The resulting mappings make up our sample set of relevant mappings. (4) Apply the mappings for evaluation by counting how many of these mappings have been found by ontology alignment systems and comparing system performance based on these counts. Specifically, we re-calculate Recall for the top-4 systems of the OAEI 2007 food task, following the same procedure as described in [2, 9], but using the new set of relevant mappings (see the sketch below).
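As an illustration of step 4, the following minimal sketch computes Recall of a system alignment against the sample of relevant mappings, optionally restricted to a given relation type. The representation of mappings as (source concept, relation, target concept) triples, the function name, and the example concept identifiers are assumptions made for this example only; they are not part of the actual OAEI evaluation tooling.

```python
# Minimal sketch of the counting step (step 4), assuming both the sample of relevant
# reference mappings and a system's alignment are available as
# (source_concept, relation, target_concept) triples. All identifiers are hypothetical.

def recall(system_mappings, reference_sample, relations=None):
    """Fraction of the (relevant) reference sample that the system found.

    system_mappings  -- set of triples produced by an alignment system
    reference_sample -- set of relevant reference triples drawn for the hot topics
    relations        -- optional set of relation types to restrict the sample to,
                        e.g. {"exactMatch"} or {"exactMatch", "broadMatch", "narrowMatch"}
    """
    sample = {m for m in reference_sample if relations is None or m[1] in relations}
    if not sample:
        return 0.0
    return len(sample & system_mappings) / len(sample)

# Hypothetical usage: compare a system on exact matches only vs. all relation types.
reference_sample = {
    ("agrovoc:biofuels", "exactMatch", "nalt:biofuels"),
    ("agrovoc:maize", "broadMatch", "nalt:corn products"),
}
system_output = {("agrovoc:biofuels", "exactMatch", "nalt:biofuels")}
print(recall(system_output, reference_sample, {"exactMatch"}))  # 1.0
print(recall(system_output, reference_sample))                  # 0.5
```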
3 Sample Construction

Topics. In order to get a broad overview of current affairs in the agricultural domain we gathered topics from three sources: analysis of the search log files of the AGRIS/CARIS search engine, topics in the "Focus on the issues" section of the FAO Newsroom, and expert interviews. Detailed descriptions of the topics can be found at http://www.few.vu.nl/wrvhage/om2008/topics.html.

Documents. For each topic we performed a full-text search on the AGRIS/CARIS search engine, limited to the set of documents that is shared between the AGRICOLA and AGRIS/CARIS collections, and fetched the top-100 of the results. From these 1500 documents we selected only the ones that are relevant to our topics, on average 31 per query, and that have been assigned Dublin Core subject terms in both collections. This left 52 documents in total, on average 3.8 per query. For four of the topics we found no documents that were both relevant and indexed in both collections. The reason for this is that these topics are all very new issues. The greatest overlap between the AGRIS/CARIS and AGRICOLA collections exists for documents published between 1985 and 1995. After the year 2000 no documents have been imported, and thus it is hard to find relevant documents for new issues. We assume that the 52 double-annotated relevant documents are representative of the set of all relevant documents with subject meta-data, i.e. also the documents with annotations in only one of the two collections. These are the documents for which alignment could make the biggest difference. This is a reasonable assumption, because the indexing process of both collections is regulated by a protocol to control continuity.

Mappings. Having established which documents are potentially important to find, we have to decide which mappings will be of most benefit to someone who wants to find them. We assume that the mappings that map the subject annotations as strictly as possible to the other vocabulary are the most beneficial for any search strategy that employs them. Given this assumption, we manually constructed the set of mappings that connect each concept used to index the 52 relevant documents with its most similar counterpart. The alignment of the 266 NALT concepts and 212 AGROVOC concepts was done by thesaurus experts at the FAO and USDA, Gudrun Johannsen and Lori Finch. This led to a sample reference alignment consisting of 347 mappings: 74 broadMatch / narrowMatch and 273 exactMatch (79%). 11 concepts had no exact, broader or narrower counterpart. This is a higher percentage of exactMatch mappings than we expected based on our experiences with the OAEI food task. For the food task, arbitrary subhierarchies of the AGROVOC and NAL thesaurus were drawn and manually aligned with the other thesaurus. Most of the resulting mappings were equivalence relations. In these sample sets, the percentage of equivalence mappings in the reference alignment (i.e. the desired equivalence relations) varied between 54% and 71%.
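As a rough illustration of the Documents and Mappings steps above, the following sketch shows how the document filtering and concept collection could be scripted, assuming the per-topic search results, the manual relevance judgements, and the Dublin Core subject metadata are already available as plain Python structures. All function names and record layouts here are hypothetical; the actual selection and alignment were done manually, as described above.

```python
# Sketch of the document-selection and concept-collection steps.
# per_topic_hits:    topic -> ranked list of document ids returned by the search engine
# relevant_ids:      set of document ids manually judged relevant to some topic
# agris_subjects:    document id -> list of AGROVOC subject concepts (AGRIS/CARIS record)
# agricola_subjects: document id -> list of NALT subject concepts (AGRICOLA record)

def select_documents(per_topic_hits, relevant_ids, agris_subjects, agricola_subjects):
    """Keep, per topic, the hits that were judged relevant AND that carry
    Dublin Core subject terms in both the AGRIS/CARIS and AGRICOLA records."""
    selected = {}
    for topic, hits in per_topic_hits.items():
        selected[topic] = [
            doc_id for doc_id in hits[:100]        # top-100 results per topic query
            if doc_id in relevant_ids              # manual relevance judgement
            and agris_subjects.get(doc_id)         # AGROVOC subject terms present
            and agricola_subjects.get(doc_id)      # NALT subject terms present
        ]
    return selected

def collect_concepts(selected, agris_subjects, agricola_subjects):
    """Collect the AGROVOC and NALT concepts used to index the selected documents;
    these are the concepts that the thesaurus experts align manually."""
    agrovoc, nalt = set(), set()
    for docs in selected.values():
        for doc_id in docs:
            agrovoc.update(agris_subjects[doc_id])
            nalt.update(agricola_subjects[doc_id])
    return agrovoc, nalt
```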
4 Sample Evaluation Results

Having constructed a new sample reference alignment we can use it to measure the performance of alignment approaches. The measurement of Recall under the open-world assumption is inherently hard, so we choose to reiterate the evaluation of Recall on the OAEI 2007 food task. This gives us a second opinion on the existing evaluation. For the sake of simplicity we calculate Recall scores of the top-4 systems that participated in the OAEI 2007 food task. The results are shown in Table 1.

                                                Falcon-AO   RiMOM    DSSim    X-SOM
OAEI 2007 food, only exactMatch (54% of total)    0.90       0.77     0.37     0.11
hot topics, only exactMatch (79% of total)        0.96 ↑     0.60 ↓   0.16 ↓   0.07 ↓
OAEI 2007 food, exact, broad, narrowMatch         0.49       0.42     0.20     0.06
hot topics, exact, broad, narrowMatch             0.75 ↑     0.47 ↑   0.12 ↓   0.05

Table 1. Recall of alignment approaches measured on sample mappings biased towards relevance to hot topics in agriculture and on impartial, non-relevance-based sample mappings from the OAEI 2007 food task. Arrows indicate significant differences (using the tests described in [9]).

There are a number of striking points to note about these results. For most systems there is a significant positive or negative difference. Overall, the difference with non-relevance-based evaluation is large. For exactMatch relations, performance in general is lower for relevance-based evaluation than for non-relevance-based evaluation, with the exception of Falcon-AO, although the relative difference there is small. However, the ranking of the alignment approaches is left unchanged. The results of relevance-based evaluation seem to exaggerate the differences between the performance of the approaches. This can be explained by the relatively high number of obvious matches (93%) in the set of mappings on hot topics. None of the approaches was able to find a substantial number of difficult mappings, but the best approaches were good at finding all obvious mappings before resorting to speculation about the harder mappings. The two best systems, Falcon-AO and RiMOM, performed relatively well when accounting for all relation types (the last row of Table 1), even though they found no broadMatch and narrowMatch relations. This is due to the kind of exactMatch relations they did find, which were mostly of the obvious kind (i.e. literal matches), exactly the kind that was needed most for the hot topics. The high percentage of exactMatch relations in the set on hot topics accentuates their behavior. The converse goes for DSSim, which found a relatively low number of obvious mappings.

Fewer broadMatch and narrowMatch mappings seem to be needed than one would expect from the non-relevance-based evaluation method. Compare the percentage of equivalence mappings in the OAEI 2007 Recall set, 54%, to the percentage based on hot topics, 78.6%. Although there is a large part of the AGROVOC and NALT vocabularies that does not have a counterpart in the other vocabulary, the portion that is actually used suffers less than one would expect from this mismatch. Apparently, indexers mainly pick their terms from a limited set, which shows a greater overlap. (After all, why needlessly complicate things?) It remains to be seen if this also applies to other vocabulary mappings. On one hand this means that approaches that can only find equivalence mappings perform better in practice than was expected. On the other hand it confirms the expectation that a large part (more than 20%) of the mappings that are needed for federated search over AGRIS/CARIS and AGRICOLA consists of other relations than equivalence relations. Also, one can conclude that systems that are incapable of finding a substantial number of equivalence relations can only play a marginal role.
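For reference, the relation-type proportions used in this comparison follow directly from the mapping counts reported in Section 3:

```latex
\[
  \frac{273}{347} \approx 0.79 \quad\text{(exactMatch)},
  \qquad
  \frac{74}{347} \approx 0.21 \quad\text{(broadMatch/narrowMatch)}.
\]
```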
5 Discussion

By using relevance as a sample criterion we avoid having to come up with an artificial approximation of importance. We can simply explore the performance difference on samples consisting of relevant mappings and samples consisting of irrelevant mappings. Under minimal assumptions we avoid having to choose a specific retrieval method while retaining the character of an end-to-end evaluation (cf. the End-to-end Evaluation method described in [9]). This saves us the effort of extensive user studies while not ignoring the behavior of alignment approaches in real-life situations.

Considering that AGROVOC and NALT are two of the most widely used agricultural ontologies, and that they are prototypical examples of domain thesauri in their design, we conclude the following. From the point of view of a developer of a federated search engine in the agricultural domain who needs an alignment, Falcon-AO is at the moment a good starting point. In the case described in this paper, Falcon-AO found three quarters of the mappings. This empirical study has shown that at least 20% of the mappings required to solve the typical federated-search problem described in this paper are hierarchical relations. Even though this is a smaller fraction than we initially expected, it is still a large part. An extended version of this paper can be found in [8].

References

1. Jérôme Euzenat. Semantic precision and recall for ontology alignment evaluation. In Proc. of IJCAI 2007, pages 348–353, 2007.
2. Jérôme Euzenat, Malgorzata Mochol, Pavel Shvaiko, Heiner Stuckenschmidt, Ondřej Šváb, Vojtěch Svátek, Willem Robert van Hage, and Mikalai Yatskevich. Results of the ontology alignment evaluation initiative, 2007.
3. Jérôme Euzenat and Pavel Shvaiko. Ontology Matching. Springer-Verlag, Heidelberg (DE), 2007. ISBN 978-3-540-49611-3.
4. Laura Hollink, Mark van Assem, Shenghui Wang, Antoine Isaac, and Guus Schreiber. Two variations on ontology alignment evaluation: Methodological issues. In Proc. of ESWC, 2008.
5. Yannis Kalfoglou and Marco Schorlemmer. Ontology mapping: the state of the art. The Knowledge Engineering Review, 18(1):1–31, March 2003.
6. Jaana Kekäläinen. Binary and graded relevance in IR evaluations – comparison of the effects on ranking of IR systems. Information Processing and Management, 41(5):1019–1033, 2005.
7. Malgorzata Mochol, Anja Jentzsch, and Jérôme Euzenat. Applying an analytic method for matching approach selection. In Proc. of OM-2006, pages 37–48, 2006.
8. Willem Robert van Hage. Evaluating Ontology-Alignment Techniques. PhD thesis, Vrije Universiteit Amsterdam, 2008. http://www.few.vu.nl/wrvhage/thesis.pdf.
9. Willem Robert van Hage, Antoine Isaac, and Zharko Aleksovski. Sample evaluation of ontology-matching systems. In Proc. of EON, 2007.