TaxoMap in the OAEI 2009 alignment contest Fayçal Hamdi1 , Brigitte Safar1 , Nobal B. Niraula2 , and Chantal Reynaud1 1 LRI CNRS UMR 8623, Université Paris-Sud 11, Bat. G, INRIA Saclay 2-4 rue Jacques Monod, F-91893 Orsay, France firstname.lastname@lri.fr 2 nobal.niraula@inria.fr Abstract. TaxoMap is an alignment tool which aims to discover rich correspon- dences between concepts. It performs an oriented alignment (from a source to a target ontology) and takes into account labels and sub-class descriptions. This new implementation of TaxoMap reduces significantly runtime and enables pa- rameterization by specifying the ontology language and different thresholds used to extract different mapping relations. It improves terminological techniques, with a better use of TreeTagger and introduces new structural techniques which take into account the structure of ontology. Special effort has been made to han- dle large-scale ontologies by partitioning input ontologies into modules to align. We conclude the paper by pointing out the necessary improvements that need to be made. 1 Introduction TaxoMap was designed to retrieve useful alignments for information integration be- tween different sources. The alignment process is then oriented from ontologies that describe external resources (named source ontology) to the ontology (named target on- tology) of a web portal. The target ontology is supposed to be well-structured whereas source ontology can be a flat list of concepts. TaxoMap makes the assumption that most semantic resources are based essentially on classification structures. This assumption is confirmed by large scale ontologies which contain rich lexical information and hierarchical specification without describing specific properties or instances. To find mappings in this context, we can only use the following available elements: labels of concepts and hierarchical structures. The new implementation of TaxoMap proposes a better morpho-syntactic analysis and new techniques. Moreover, the methods to partition large ontologies into modules which TaxoMap can handle easily were refined. We take part to five tests. We hope we perform better in terms of precision of map- pings generated and runtime. Tests on library data sets allow us to experiment our algo- rithm on large multilingual ontologies (English, French, and German). 2 Presentation of the System 2.1 State, Purpose and General Statement We consider an ontology as a pair (C, HC ) consisting of a set of concepts C arranged in a subsumption hierarchy HC . A concept c is defined by two elements: a set of labels and subclass relationships. The labels are terms that describe entities in natural language and which can be an expression composed of several words. A subclass relationship establishes links with other concepts. Our alignment process is oriented; from a source (OS ) to a target (OT ) ontology. It aims at finding one-to-many mappings between single concepts and establishing three types of relationships, equivalence, subclass and semantically related relationships de- fined as follows. Equivalence relationships An equivalence relationship, isEq, is a link between a con- cept in OS and a concept in OT with labels assumed to be similar. Subclass relationships Subclass relationships are usual isA class links. When a concept cS of OS is linked to a concept cT of OT with such a relationship, cT is considered as a super concept of cS . Semantically related relationships A semantically related relationship, isClose, is a link between concepts that are considered as related but without a specific typing of the relation. 2.2 Techniques Used The different techniques are based on the use of the moropho-syntactic analysis tool TreeTagger [1], and a similarity measure which compares the trigrams of the concept labels [2]. TreeTagger is a tool for tagging text with part-of-speech and lemma information, en- ables to take into account the language, lemma and an use word categories in an ef- ficient way. The words are classified as functional (verbs, adverbs or adjectives) and stop words (articles, pronouns). Once classified by TreeTagger, the words are divided into two classes, full words and complementary words, according to their category and their position in the label. In principle, all names are full words except if they are placed after a determiner, all other words are complementary words. This classification is then used to give more weight to the full words in the calculation of similarity between labels. The main methods used to extract mappings between a concept cs in OS and a concept ct in OT are: – Label equivalence: An equivalence relationship, isEq, is generated if the similarity between one label of cs and one label of ct is greater than a threshold (Equiv.threshold). – High lexical similarity: Let ctmax be the concept in OT with the highest similar- ity measure with cs . If the similarity measure is greater than a threshold (High- Sim.threshold) and if one of the labels of ctmax shares at least two full words in common with one of the labels of cs , the heuristic generates the relationship < cs isA ctM ax > if the label of ctmax is included in the cs one, otherwise it generates < cs isClose ctM ax >. – Label inclusion (and its inverse): If one of the labels of ctmax is included in one of the labels of cs , and if all words of included label are full words, we propose a subclass relationships < cs isA ctmax >. Inversely, if one of the labels of cs is in- cluded in one of the labels of ctmax , we propose a semantically related relationships < cs isClose ctmax >. – Reasoning on similarity values: Let ctM ax and ct2 be the two concepts in OT with the highest similarity measure with cs , the relative similarity is the ratio of ct2 similarity on similarity ctM ax . If the relative similarity is lower than a threshold (isA.threshold), one of the three following techniques can be used: • the relationship < cs isClose ctM ax > is generated if one of the labels of cs is included in one of the labels of ctM ax , and the words of the included label are complementary words. • the relationship < cs isClose ctM ax > is generated if the similarity of ctM ax is greater than a threshold (isClose.thresholdMax). • an isA relationship is generated between cs and the father of ctM ax if the similarity of ctM ax is greater than a second threshold (isA.thresholdMax). – Reasoning on structure: • an isA relationship < cs isA ct > is generated if the subclass relation < cs isSubClassOf X > appears in OS and if the equivalence mapping < X isEq ct > have been identified. • the relationship < cs isClose ct > is generated if ct is the concept in OT which have the most number of children in OT with the same label as the children of cs in OS . More details of this approach are given at the end of this sub-section. • an isA relationship < cs isA p > is generated if the three concepts in OT with the highest similarity measure with cs have similarity greater than a threshold (Struct.threshold), and have a common father p in OT . As we mentioned above, we use a structural heuristic based on the Semantic Cotopy measure of a concept, proposed by Maedche and Staab [3]. The Semantic Cotopy is based on the intentional semantics of a concept c in an ontology O, SC(c, O), defined as the set of all its super- and sub-concepts in O. When a concept c belongs to two ontologies, one can define the taxonomic overlap (T O) between O1 and O2 for this concept, denoted T O(C, O1 , O2 ) and defined as the ratio between the number of com- mon elements in the intentional semantics of c in O1 and in O2 and the total number of elements belonging to the union of these two sets. If a concept c is in O1 but not in O2 , an optimistic approximation of T O(c, O1 , O2 ) is defined as the maximum overlap ob- tained by comparing SC(c, O1 ) to the intentional semantics of all the concepts in O2 . Our heuristic uses SCD (c) which includes only the concept and its descendants instead of the original Semantic Cotopy. If a concept c is in O1 but not in O2 , we propose as candidate mapping for this concept c, the concept cM ax of O2 which maximizes the T O, if c and cM ax have at least two descendants in common. 2.3 Partitioning of large scale ontologies We propose a method of ontology partitioning [4], that relies on the implementation of PBM [5] algorithm. P BM partitions large ontologies into small blocks (or modules) and constructs mappings between the blocks, using predefined matched class pairs, called anchors to identify related blocks. We reuse the partitioning part and the idea of an- chors, but the originality of our method, called PAP (Partition, Anchor, Partition), is that it is alignment oriented, that means that the partitioning process is influenced by the mapping process. The PAP method consists of: – decompose the most structured ontology, that will be called the target, OT , into several blocks BT i , according to the P BM algorithm. – force the partitioning of the other ontology, called the source OS , to follow the pattern of OT . To achieve this, the method identifies for each block BT i constructed from OT all the anchors belonging to it. Each of these sets of anchors will constitute the kernel or center CBSi of a future block BSi which will be generated from the source OS . – reuse the P BM algorithm to partition the source OS around the centers CBSi . – align each block BSi built from a center CBSi with the corresponding block BT i . Fig. 1. The centers CBSi identified from BT i Fig. 2. Partition of OS around the centers CBSi The tests show that the maximum size of the blocks has to be fixed for the target ontology. If the themes covered by both ontologies are of the same importance, i.e. if the source ontology corresponds to a representation of the same importance than the representation of the target one, a maximum size for the blocks in the source ontology is not needed. Their size will become close to the size of the blocks of the target ontology. This phenomenon allows to avoid obtaining a lot of small isolated blocks which appear when the maximum size of the blocks of the source ontology is fixed. So, on the example of Fig2, the BS3 block remains isolated because the size of of the source blocks was fixed. Without limitation of the size, the BS3 block can be merged with BS2 . The only blocks which will remain isolated will be the blocks built when the source ontology will be partitioned, independently of the kernels identified in the decomposition of the target ontology, i.e. concepts with no relation with those of the target ontology. So, the fact that the concepts belonging to these isolated blocks are not aligned should not damage our results. 2.4 Adaptations made for the Evaluation Unlike in previous years, we have made some specific adaptations for the OAEI 2009 campaign. For Anatomy task, we did not use the techniques which generate isA relationship. All the alignments outputted by TaxoMap are uniformly based on the same parameters. We had, however, fixed confidence values depending on relation types. For library test, data sets consist of multilingual ontologies. In order to use lexical comparison, we translated non-English labels of all of the concepts of the vocabularies into English. The translation is done by using Googles translation APIs. 2.5 Link to the system and parameters file TaxoMap requires: – Mysql 3 – Java (Version 1.5 and above )4 – Google’s Java Client API for Translation 5 – TreeTagger with its language parameter files 6 The version of TaxoMap (with parameter files) used in 2009 contest can be down- loaded from: – http://www.lri.fr/˜hamdi/TaxoMap.jar: a parameter lg has to be specified it denotes the language of the ontology. For example TaxoMap.jar fr to perform alignment on ontologies in French. If no language is specified, it is supposed to be English. – http://www.lri.fr/˜hamdi/TaxoMap.properties: a parameter file which specifies: • The command to launch TreeTagger. • TreeTagger word categories that has to be considered as functional, stop words and prepositions. • The RDF output file. • Different thresholds of similarity, depending on the method used. – http://www.lri.fr/˜hamdi/dbproperties.properties: a parameter file which contains the user and password to access to MySql. 3 http://www.mysql.com 4 http://java.sun.com 5 http://code.google.com/p/google-api-translate-java 6 http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger 2.6 Link to the Set of Provided Alignments The alignments produced by TaxoMap are available at the following URLs: http://www.lri.fr/˜hamdi/benchmarks/ http://www.lri.fr/˜hamdi/anatomy/ http://www.lri.fr/˜hamdi/directory/ http://www.lri.fr/˜hamdi/library/ http://www.lri.fr/˜hamdi/benchmark-subs/ 3 Results 3.1 Benchmark Tests Since our algorithm only considers labels and hierarchical relations and only provides mapping for concepts, the recall would have been low even for the reference alignment. The overall results would have been similar -with no surprise- to those of last year. 3.2 Anatomy Test The anatomy real world case is to match the Adult Mouse Anatomy (denoted by Mouse) and the NCI Thesaurus describing the human anatomy (tagged as Human). Mouse has 2,744 classes, while Human has 3,304 classes. As last year, we considered Human as the target ontology as is it well structured and larger than Mouse. TaxoMap performs the alignment (with no need to partition) in about 8 minutes which is better than last year [6] where TaxoMap took about 25 minutes to align the two ontologies. As only equivalence relationships will be evaluated in the alignment contest, we did not use this year the techniques which generate isA relationship (except in the Task 3) and we change isClose mapping to equivalence. As a result, we found fewer mappings than last year but we hope that the precision will be better. – For the first task, TaxoMap discovers 1274 mappings, 973 Equivalence relations and 301 Proximity relations. – For the second task, we got only 1084 mappings, 973 Equivalence relations and 111 Proximity relations, using only the heuristic which identifies the relation < cs isClose ctM ax > when one of the labels of cs is included in one of the labels of ctM ax . – For the third task, we used, in addition of the techniques ot the first task, the heuris- tic which identifies subsumption links with ”High Lexical Similarity”. This allows to discover 1451 mappings and to slightly increase the recall, but reduce the preci- sion. In fact, many mappings like or are semantically correct but become false when the subsumption relation isA is automatically replaced by an Equivalence relation. – For the fourth task, we used the partial reference mapping in our partitioning method and we obtained 1131 mappings. This lower number of mapping is explained by two facts. The first one is that the structural heuristic based on the Semantic Cotopy is the only one of which the results can be improved by the use of the partial mapping. The second one is that the partitioning method increases the pre- cision but reduces the recall. 3.3 Directory Test The directory task consists of Web sites directories like Google, Yahoo! or Looksmart. To date, it includes 4,639 tests represented by pairs of OWL ontologies. TaxoMap takes about 40 minutes to complete all the tests. 3.4 Library Test In order to use lexical comparison in library data sets, which consist of multilingual ontologies, we used Google translation API [7] to translate non-English labels into En- glish. With our current configuration, we cannot partition the large sized library on- tologies. However, we used just a part of its data set to partition and then to find the mappings among concepts. As skos relations will be evaluated, we change different mapping types to skos ones with these confidence values: – (type1) isEq relations become skos:exactMatch with a confidence value set to 1. – (type2) isA relations become skos:narrowMatch with a confidence value set to 1 for label inclusion, 0.5 for relations generated by structural technique or by relative similarity method. – (type3) isGeneral relations become skos:broadMatch with a confidence value set to 1. – (type4) isClose relations become skos:relatedMatch with a confidence value set to 1. Generated mappings are as follows: – LCSH-RAMEAU: 5074 type1 relations, 48817 type2 relations, 116789 type3 rela- tions and 13205 type4 relations. – RAMEAU-SWD: 1265 type1 relations, 6690 type2 relations, 17220 type3 relations and 1317 type4 relations. – LCSH-SWD: 38 type1 relations. 3.5 Benchmark-Subs Test Benchmark-Subs tests aims to evaluate alignments which contain other mapping rela- tions than equivalence. Two tasks are available in this test: Gold-standard based evalu- ation concerning the evaluation of subsumption relations and open-ended task concern- ing the evaluation of equivalence and non-equivalence mappings. In our tool, for the first task, we use lexical methods to obtain subsumption relations. 4 General Comments 4.1 Results The new version of TaxoMap improves significantly the results on the previous version of TaxoMap in terms of runtime and precision of generated mappings. The new imple- mentation offers extensibility and modularity of code. TaxoMap can be parameterized by the language used in ontologies, the choice of used techniques and different thresh- olds. Our partitioning algorithms allow us to participate to tests with large ontologies. 4.2 Future Improvements The following improvements can be made to obtain better results: – To take into account all concepts properties instead of only the hierarchicals ones. – Use of WordNet as a dictionary of synonymy. The synsets can enrich the termino- logical alignment process if an a priori disambiguation is made. – To develop the remaining structural techniques which proved to be efficient in last experiments [8] [9]. 5 Conclusion This paper reports our participation to OAEI campaign with the new implementation of TaxoMap. Our algorithm proposes an oriented mapping between concepts. Due to partitioning, it is able to perform alignment on real-world ontologies. Our participation in the campaign allows us to test the robustness of TaxoMap, our partitioning algorithms and new structural techniques. References [1] Schmid H. Probabilistic Part-of-Speech Tagging Using Decision Trees, International Confer- ence on New Methods in Language Processing (1994) [2] Lin, D. : An Information-Theoretic Definition of Similarity. ICML. Madison. (1998) 296– 304 [3] Maedche, A. and Staab S. Measuring Similarity between Ontologies, EKAW (2002) [4] Hamdi, F., Safar, B., Reynaud, C. and Zargayouna, H. Alignment-based Partitioning of Large-scale Ontologies, in Advances in Knowledge Discovery and Management (AKDM09), to appear. [5] Hu, W., Zhao, Y., and Qu, Y. Partition-based block matching of large class hierarchies, Proc. of the 1st Asian Semantic Web Conference (ASWC06). pp.72-83, (2006) [6] Hamdi, F., Zargayouna, H., Safar, B., and Reynaud, C. TaxoMap in the OAEI 2008 alignment contest, Proceedings of the ISWC’08 Workshop on Ontology Matching OM-08 (2008) [7] http://code.google.com/p/google-api-translate-java/ [8] Reynaud, C. and Safar, B. When usual structural alignment techniques don’t apply, The ISWC’06 Workshop on Ontology matching (OM-06), (2006) [9] Reynaud, C. and Safar, B. Exploiting WordNet as Background Knowledge,The ISWC’07 Workshop on Ontology Matching (OM-07), (2007)