When usual structural alignment techniques don’t apply Chantal Reynaud1 and Brigitte Safar1 University of Paris Sud-CNRS (LRI), INRIA (Futurs), LRI, Building 490, 91405 Orsay Cedex, France {chantal.reynaud, brigitte.safar}@lri.fr http://www.lri.fr/~cr Abstract. This paper deals with taxonomy alignment. It presents struc- tural techniques of an alignment method suitable with a dissymmetry in the structure of the mapped taxonomies. The aim is to allow a uniform access to documents belonging to a same application domain, assuming retrieval of documents is supported by taxonomies. 1 Introduction Our work focuses on taxonomy alignment techniques. Indeed, we assume that the description of the content of most todays information systems is often not very much specified and is based on very simple ontologies reduced to classi- fication structures, i.e. taxonomies. Moreover, we suppose that the structures of the taxonomies that have to be aligned are heterogeneous and dissymmet- ric, one taxonomy being deep whereas the other one is flat. In this context, the approaches which relied on OWL data representations exploiting all the ontol- ogy language features don’t apply. Similarity of two entities cannot be identified based on their similar properties or on the status of their respective parents and siblings, because these data are not available. We can only use the following available data: labels of concepts in both taxonomies, the structure of the deeper taxonomy and external linguistic resources such as WordNet. The contribution of this paper is a mapping process composed of a sequence of various techniques designed to make best use of the characteristics of the tax- onomies: very specialized taxonomies with only sub-class links, concepts with labels which are expressions composed of a lot of words, words common to a lot of labels. We classify the found mappings into two groups according to their relevance: probable mappings and potential mappings to be confirmed. The map- ping process is generic, usable across application areas. It has been evaluated on real-world taxonomies and on test taxonomies extracted from a repository about ontology matching [6]. Experiments showed that the proposed techniques give very relevant mappings when the aligned taxonomies have the same character- istics as the taxonomies having motivated our approach. 2 The alignment approach The objective of our approach is to generate mappings between taxonomies. For us, a taxonomy is a pair (C, HC ) consisting of a set of concepts C arranged in a subsumption hierarchy HC . A concept is only defined by two elements: a label and subclass relations. The label is a string which can be an expression com- posed of several words. Subclass relations establish links with other concepts. It is the single semantic association used in the hierarchy. A taxonomy is gen- erally represented by an acyclic graph where concepts are nodes connected by directed edges corresponding to subclass links. Given two structurally dissym- metric taxonomies, the objective is to map the concepts of the less structured one, the source taxonomy TS , with concepts of the more structured one, the target taxonomy TT . The alignment process is oriented from TS to TT . The goal is to find one-to-one mappings. Relations can be of two kinds: equivalence (isEq) and subclass (isA). So, for each concept cS in TS , we try to find a corresponding concept, cT in TT , linked to cS with an equivalence or a subclass relation. Alignment is based on the Lin similarity measure [1] computed between each concept cS in TS and all the concepts in TT . This measure compares the tri-grams of the labels and has been adapted to take into account the importance of words inside expressions. From the measurements we compute M C, the set of mapping candidates of a concept cS in TS . M C includes concepts of TT which have a high similarity value with cS (only the three most similar concepts b1 , b2 , b3 are retained) and Inc, the set of concepts of TT with a label included in the label of cS . Various techniques (terminological and structural cf.Fig.1) are then applied in sequence to select the most relevant concept among all the mapping candidates [3]. We are going to show that the most relevant concept is not necessarily the one with the highest similarity measure. Terminological T axoM ap(TS , TT ) 1. For each cS ∈ TS do 2. For each cT ∈ TT do SimLinLike (cS , cT ) 3. M C ← MappingCandidates(cS ) 4. If TerminologicalTechniques(cS , M C) then stop 5. Else StructuralTechniques(cS , M C) Fig. 1. The Alignment process techniques are executed first. In default of place, they will not be detailed here. These techniques lead to mappings which are generally reliable but not always sufficiently numerous. Therefore, they are completed by the structural mappings described in the next section. These latter techniques define a mapping as a correspondence between close concepts. If the suggested mapping from cS to cT is wrong, then the right mapping will be a relation from cS to c′T , with c′T close to cT in the taxonomy. It is a guide for the user who will not have to browse the whole target taxonomy when studying the results of the system. 3 Exploiting structural features The two techniques presented in this section are structure based techniques leading to the discovery of subclass mappings. The first technique is performed on TT whose structure is supposed to be the deepest. Then we use W ordN et [2], exploiting its structure and its semantic relations. 3.1 Exploiting the structure of the target taxonomy: ST RT This technique, denoted ST RT , works on M C, the set of mapping candidates of a concept cS in TS . The idea is to exploit the location of the elements of M C in TT . Their proximity in the graph is considered to be a semantic proximity. We therefore try to identify the sub-graph rooted in a node associated to a concept which is not too general and such that this sub-graph groups a maximum of nodes of M C. It will represent a relevant context shared by most of the mapping candidates. We then consider that the involved concept cS may be mapped with a node of this sub-graph. ST RT relies on the computation of the Lowest Common Ancestor, LCA, of a set of nodes in a graph, which is the node of greatest depth which is an ancestor of all the nodes of the set. Our goal is to find a LCA of the elements of M C which is not too high in the taxonomy. However the LCA node of a set of elements is all quite high in the graph since the elements are very distant from each other. We propose a measure, the relative density (DR), to evaluate sub-graphs grouping nodes of a sub-set of M C. For each sub- graph rooted in Anc, the LCA node, and grouping M CAnc nodes, we compute DR(Anc). DR(Anc) relies on three criteria: (1) the number of elements in M CAnc , (2) SimLin Like , the similarity between the elements in M CAnc and cS ,(3) the distance as the number of edges on the paths from each element of M CAnc to Anc. The sub-graph rooted in the Anc with the highest DR is considered to be the most relevant. CMaxAnc , the node of this sub-graph with the highest similarity measure, will be the candidate selected for the mapping. If it belongs to Inc, the set of concepts with a label included into the label of cS , it is suggested as a possible parent of the involved concept cS . Otherwise, CMaxAnc is proposed as a possible sibling and its parent (not necessarily Anc) will be suggested as a possible parent of cS . As an example, Fig. 2 represents the sub-graph of TT grouping the elements of M C = {b1 , b2 , b3 } ∪ Inc = {beef } for cS = beef adipose tissue. The node Fresh meat is the LCA for all the elements of M C with a Fig. 2. Common ancestors and relative density distance of 7. However, beef is the LCA of three mapping candidates {beef , b2 , b3 } with a distance of only 2. DR(beef) is the highest (cf.Fig.2). beef connective tissue is the node of this sub-graph with the highest similarity value to cS . So beef adipose tissue will be a sibling of beef connective tissue and linked to beef with a subclass relation. Note that this technique avoids mappings with concepts with a little higher similarity measure but meaningful in a context different from the one common to most of the M C (as b1 in the example). 3.2 Exploiting the structure of W ordN et: ST RW The techniques seen up till now are not enough if the concepts are similar seman- tically but not syntactically. So, at that point, we propose to run ST RW . ST RW relies on the hyperonymy/hyponymy W ordN et structure to find the concept of TT semantically similar to each concept of TS not yet mapped. ST RW will be able to map, for example, cantaloupe with watermelon which are not synonyms but two specializations of melon. Running ST RW assumes that the application root node, denoted rootA , has already been identified. It is the most specialized concept in W ordN et which gen- eralizes all the concepts contained in the involved application domain. ST RW searches WordNet for the hypernyms of each term of TS not yet mapped and of each term of TT (according to all their senses) until rootA is reached. For exam- ple, the result of a search on cantaloupe is two sets of hypernyms corresponding to two different senses. Sense 1: cantaloupe →sweet melon →melon →gourd →plant → organism →Living thing Sense 2: cantaloupe →sweet melon →melon →edible fruit →green goods → food Fig. 3. A sub-graph of Twn where cantaloupe and watermelon are related Only the paths from the invoked terms to rootA will be selected because they represent the only senses which are accurate for the application (sense 2 in the example, the application root node being food ). So a sub-tree, denoted Twn , is obtained. It is composed of all the terms and the relations of the retained paths (cf.Fig.3). For each concept cS , ST RW selects in Twn the most similar concept belonging to TT using W u&P almer’s measure [5]. According to the simW &P measure, the concept that is the most similar to a node cS is its parent. Moreover, we showed in [3] that the similarity is higher between cS and any of its siblings or any of the descendants close to its siblings than between cS and its grandparent, until a depth p that can be computed for each node cS in function of its depth in the tree. In the same way, we can compute the depth p’ from which the similarity of the great-grandparent must be considered, and so on. Using these properties, we proposed an efficient strategy in [3] which does not require the computation of many similarity measurements. 4 Experiments and Discussion Two kinds of experiments have been performed. First, experiments have been made in the setting of the e.dot project1 , on two real-world taxonomies in the field of predictive microbiology. Second, we applied our techniques on test tax- onomies [6]. The latter are not structurally dissymmetric and cover a large do- main. The application conditions of the techniques are not achieved but our ob- jective is to make these tests in order to sketch some ideas to do improvements and to widen the scope of our approach. These experiments have shown where our specific strengths and weaknesses are. Whatever taxonomy we aligned, our approach was able to retrieve almost all the equivalence mappings given with the taxonomies. Furthermore, its strong point is to propose as a bonus a lot of other mappings (subclass mappings). Some mappings have a high precision and are then certain (likely mappings generated by the terminological techniques). Other ones (potential mappings generated by the structural techniques) are less certain (low precision) and have to be validated. This confirms the order in the applica- tion of our techniques. Concerning the structural techniques, ST RT proved to be very useful and leads to relevant mappings when concepts have labels com- posed of a lot of words and when some words are common to many labels. On the opposite, ST RW is all the more appropriate since the application domain is small. The real-world taxonomies which have motivated our approach gather all these characteristics, unlike the others. Better results are then obtained. 5 Conclusion We described two structural techniques to align structurally asymmetric tax- onomies. These techniques are original because different from a search of struc- tural similarity in models. They are executed to suggest additional mappings. These mappings are not certain but they can be a good complement, if human involvement is possible, as experiments showed. We will continue this work by adapting and extending our techniques according to the experiment results. Our first objective is to be able to align taxonomies relative to larger application domains. References 1. D. Lin. An Information-Theoretic Definition of Similarity, In Proc. of the Int. Conf. on Machine Learning (ICML), pp. 296-304, 1998. 2. Miller, G. A. WordNet: A lexical Database for English. Communications of the ACM. (1995) Vol. 38(11) 39-45 3. C. Reynaud, B. Safar. Structural Techniques for Alignment of Taxonomies: experi- ments and evaluation, In TR 1453, LRI, Univ. of Paris-Sud, June 2006. 4. P. Shvaiko, J. Euzenat. A Survey of Schema-based Matching Approaches, In Journal on Data Semantics, 2005. 5. Z. Wu and M. Palmer. Verb Semantics and Lexical Selection, In Proc. of 32nd Meeting of the Ass. for Computational Linguistics, 1994. 6. http://www.ontologymatching/evaluation.html 1 E.dot is a research project funded by national network on software technology (RNTL), 2003-2005.