YAM++ – Results for OAEI 2013

DuyHoa Ngo, Zohra Bellahsene

University Montpellier 2, LIRMM
{duyhoa.ngo, bella}@lirmm.fr

Abstract. In this paper, we briefly present the new YAM++ 2013 version and its results in the OAEI 2013 campaign. The most interesting aspect of the new version of YAM++ is that it produces high-quality matching results for large scale ontology matching tasks with good runtime performance on ordinary hardware such as a PC or laptop, without requiring a powerful server.

1 Presentation of the system

YAM++ - (not) Yet Another Matcher is a flexible and self-configuring ontology matching system for discovering semantic correspondences between entities (i.e., classes, object properties and data properties) of ontologies. The new version, YAM++ 2013, significantly improves on the previous versions [6, 5] in terms of both effectiveness and efficiency, especially for very large scale tasks. The general architecture of YAM++ has not changed much. However, most of its algorithms have been updated or replaced by more effective ones. For example, we have implemented a disk-based method for storing the temporary information of the input ontology during the indexing process in order to save main memory space. Consequently, YAM++ 2013 has improved both matching quality and runtime in large scale ontology matching tasks.

This year, YAM++ participates in six tracks: Benchmark, Conference, MultiFarm, Library, Anatomy and Large Biomedical Ontologies. However, due to limited time and manpower, we have not upgraded YAM++ to participate in the Interactive matching evaluation and Instance Matching tasks.

1.1 State, purpose, general statement

In the YAM++ approach, all useful information about entities, i.e., terminological, structural or contextual, semantic and extensional information, is exploited.
For each type of extracted information, a corresponding matching module has been implemented in order to discover as many candidate mappings as possible.

The major drawback of the previous version, YAM++ 2012 [5], despite the fact that it achieved good results and high rankings in almost all tracks, was its very poor runtime, especially in the Large Biomedical Ontologies track. After carefully studying this issue, we realized that our algorithms for pre-processing and indexing the input ontologies had a complexity of $O(n^2)$, where $n$ is the size of the ontology. Additionally, the semantic verification component did not work well for very large scale ontology matching tasks.

In the current version, YAM++ 2013, the flaws mentioned above have largely been fixed. Firstly, we have revised our algorithms for pre-processing and indexing the input ontologies, and they now have $O(\|n\| + \|v\|)$ complexity, where $\|n\|$ and $\|v\|$ denote the number of nodes and edges, respectively, of the Directed Acyclic Graph obtained from the ontology. Moreover, we have implemented a disk-based method for storing the temporary information of the input ontologies during the indexing process. This method saves a significant amount of main memory and makes it possible to run YAM++ on very large scale ontology matching tasks on a personal computer with an ordinary configuration (using a 1 GB JVM only). Secondly, we have introduced several inconsistent alignment patterns in order to detect as many conflict sets as possible. Then, a new and fast approximate algorithm has been implemented to find a near-optimal solution, which corresponds to the final consistent alignment.

1.2 Specific techniques used

In this section, we briefly describe the workflow of YAM++ and its main components, which are shown in Fig. 1.

[Figure: block diagram with the components Ontology Loader, Annotation Indexing, Structure Indexing, Context Indexing, Candidate Pre-Filtering, Similarity Computation, Similarity Propagation, Candidate Post-Filtering and Semantic Verification]

Fig. 1: Main components of YAM++ system

In the YAM++ approach, a generic workflow for a given ontology matching scenario is as follows.

1. Input ontologies are loaded and parsed by the Ontology Loader component;
2. Information about the entities in the ontologies is indexed by the Annotation Indexing, Structure Indexing and Context Indexing components;
3. The Candidates Pre-Filtering component selects all pairs of entities from the input ontologies whose descriptions are highly similar;
4. The candidate mappings are then passed to the Similarity Computation component, which includes: (i) the Terminological Matcher, which produces a set of mappings by comparing the annotations of entities; (ii) the Instance-based Matcher, which supplements new mappings through instances shared between the ontologies; and (iii) the Contextual Matcher, which computes the similarity value of a pair of entities by comparing their context profiles. In YAM++, the matching results of the Terminological Matcher, the Contextual Matcher and the Instance-based Matcher are combined into a unique set of mappings. We call this the element level matching result.
5. The Similarity Propagation component then enhances the element level matching result by exploiting the structural information of entities. We call the result of this phase the structure level matching result.
6. The Candidate Post-Filtering component combines and selects the potential candidate mappings from the element and structure level results.
7. Finally, the Semantic Verification component refines those mappings in order to eliminate the inconsistent ones.

Let us now present the specific features of each component.

Ontology Loader. To read and parse input ontologies, YAM++ uses the OWL API open-source library.
In addition, YAM++ makes use of (i) Pellet¹, an OWL 2 reasoner, in order to discover hidden relations between entities in small ontologies and (ii) the ELK² reasoner for large ontologies. In this phase, the whole ontology is loaded into main memory.

Annotation Indexing. In this component, all annotation information of entities, such as IDs, labels and comments, is extracted. The languages used for representing annotations are taken into account. When the input ontologies use different languages to describe the annotations of entities, a multilingual translator (Microsoft Bing) is used to translate those annotations into English. The annotations are then normalized by tokenizing them into sets of tokens, removing stop words, and stemming. Next, the resulting tokens are indexed in a table for future use.

Structure Indexing. In this component, the main structural information of the ontology, such as the IS-A and PART-OF hierarchies, is stored. In particular, YAM++ assigns a compressed bitset value to every entity of the ontology. Through the bitset value of an entity, YAM++ can quickly and easily obtain its ancestors, descendants, etc. The benefit of this method is easy access to the structural information of the ontology with minimal memory for storing it. After this step, the loaded ontology can be released to save main memory.

Context Indexing. In this component, we define the context profile of an entity as a set of three text corpora: (i) the Entity Description, which includes the annotations of the entity itself; (ii) the Ancestor Description, which comprises the descriptions of its ancestors; and (iii) the Descendant Description, which comprises the descriptions of its descendants. Indexing those corpora is performed by the Lucene indexing engine.

Candidates Pre-Filtering. The aim of this component is to reduce the computational space for a given scenario, especially for large scale ontology matching tasks. In YAM++, two filters have been designed so that the matching process can be performed efficiently.
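
The bitset encoding used by the Structure Indexing component described above can be illustrated with a minimal sketch. It is not the YAM++ implementation: it uses plain (uncompressed) Python integers as bitsets, it assumes the hierarchy is acyclic, and the class name `HierarchyIndex`, its methods and the toy hierarchy are hypothetical names introduced only for this example.

```python
from typing import Dict, Iterable, List


class HierarchyIndex:
    """Ancestor sets of an acyclic hierarchy stored as integer bitsets.

    Bit i of an entity's bitset is set when entity number i is one of its
    ancestors. Illustrative only; YAM++ uses compressed bitsets.
    """

    def __init__(self, parents: Dict[str, Iterable[str]]) -> None:
        # Collect every entity that appears as a child or as a parent.
        names = set(parents)
        for ps in parents.values():
            names.update(ps)
        self.names: List[str] = sorted(names)  # sorted for stable ids
        self.ids: Dict[str, int] = {n: i for i, n in enumerate(self.names)}
        self._parents = {n: list(parents.get(n, ())) for n in self.names}
        self._anc: Dict[str, int] = {}
        for n in self.names:
            self._ancestor_bits(n)

    def _ancestor_bits(self, name: str) -> int:
        # Memoised: ancestors(x) = parents(x) plus the ancestors of each parent.
        if name in self._anc:
            return self._anc[name]
        bits = 0
        for p in self._parents[name]:
            bits |= (1 << self.ids[p]) | self._ancestor_bits(p)
        self._anc[name] = bits
        return bits

    def is_ancestor(self, ancestor: str, entity: str) -> bool:
        # A single shift-and-mask answers the ancestor test.
        return bool((self._anc[entity] >> self.ids[ancestor]) & 1)

    def ancestors(self, entity: str) -> List[str]:
        bits = self._anc[entity]
        return [n for i, n in enumerate(self.names) if (bits >> i) & 1]

    def descendants(self, entity: str) -> List[str]:
        i = self.ids[entity]
        return [n for n in self.names if (self._anc[n] >> i) & 1]


# Toy IS-A hierarchy (child -> parents); the entity names are made up.
index = HierarchyIndex({
    "Finger": ["Hand"],
    "Hand": ["Limb"],
    "Limb": ["BodyPart"],
    "Organ": ["BodyPart"],
})
print(index.ancestors("Finger"))             # ['BodyPart', 'Hand', 'Limb']
print(index.descendants("Limb"))             # ['Finger', 'Hand']
print(index.is_ancestor("Organ", "Finger"))  # False
```

Once the bitsets are built, the original ontology object is no longer needed to answer ancestor and descendant queries, which is the property the component relies on when it releases the loaded ontology.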

– A Description Filter is a search-based filter, which selects the candidate mappings before the actual similarity values between the descriptions of entities are computed. Firstly, the descriptions of all entities in the larger ontology are indexed by the Lucene search engine. Then, for each entity in the smaller ontology, three multiple-term queries, corresponding to the three descriptions in its context profile, are performed. A top-K algorithm based on the ranking scores of those queries is used to select the most similar entities.
– A Label Filter is used to quickly detect candidate mappings in which the labels of the two entities are similar or differ in at most two tokens. The intuition is that if the labels of two entities differ by more than three tokens, any string-based method will produce a low similarity score, so these entities are very unlikely to match.

¹ http://clarkparsia.com/pellet/
² http://www.cs.ox.ac.uk/isg/tools/ELK/

Similarity Computation. The three matchers in this component are the same as in the YAM++ 2012 version. For more details, we refer readers to our papers: Terminological Matcher [7], Instance-based Matcher [6]. A slight modification to the Contextual Matcher is that we use the algorithm described in [8] for small ontology matching, whereas we use the Lucene ranking score for large scale ontology matching.

Similarity Propagation. This component is similar to the Structural Matcher component described in the YAM++ 2012 version. It contains two similarity propagation methods, namely Similarity Propagation and Confidence Propagation.

– The Similarity Propagation method is a graph matching method, which inherits the main features of the well-known Similarity Flooding algorithm [2]. The only difference lies in how an ontology is transformed into a directed labeled graph. This matcher has not changed from the first YAM++ version to the current version.
Therefore, to save space, we refer to the Similarity Flooding section of [6] for more details.
– The principle of the Confidence Propagation method is as follows. Assume $\langle a_1, b_1, \equiv, c_1 \rangle$ and $\langle a_2, b_2, \equiv, c_2 \rangle$ are two initial mappings, which may have been discovered by an element level matcher (i.e., the terminological matcher or the instance-based matcher). If $a_1$ and $b_1$ are ancestors of $a_2$ and $b_2$ respectively, then after running confidence propagation, we have $\langle a_1, b_1, \equiv, c_1 + c_2 \rangle$ and $\langle a_2, b_2, \equiv, c_2 + c_1 \rangle$. Note that confidence values are propagated only among the collection of initial mappings.

In YAM++, the aim of the Similarity Propagation method is to discover new mappings by exploiting as much of the structural information of entities as possible. This method is used for small scale ontology matching tasks, where the total number of entities in each ontology is smaller than 1000. In contrast, the Confidence Propagation method supports the Semantic Verification component in eliminating inconsistent mappings. This method is mainly used in large scale ontology matching scenarios.

Candidates Post-Filtering. The aim of the Mappings Combination and Selection component is to produce a unique set of mappings from the matching results obtained by the terminological matcher, the instance-based matcher and the structural matcher. In this component, a Dynamic Weighted Aggregation method has been implemented. Given an ontology matching scenario, it automatically computes a weight value for each matcher and establishes a threshold value for selecting the best candidate mappings. See [6] for more details on the main idea of this method.

Semantic Verification. After running similarity or confidence propagation on all candidate mappings, the final similarity values reach a certain stability. Based on those values, YAM++ is able to remove inconsistent mappings with more certainty.
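
The Confidence Propagation rule stated above can be sketched as follows. This is a minimal illustration under assumptions that the paper does not spell out: every pair of initial mappings is examined once, all increments are computed from the initial confidence values in a single pass, and the ancestor tests are passed in as plain functions. The names `propagate_confidence`, `src_anc`, `tgt_anc` and the toy entities are hypothetical.

```python
from typing import Callable, List, Tuple

# (source entity, target entity, confidence); the relation is always "equivalent".
Mapping = Tuple[str, str, float]


def propagate_confidence(
    mappings: List[Mapping],
    is_src_ancestor: Callable[[str, str], bool],
    is_tgt_ancestor: Callable[[str, str], bool],
) -> List[Mapping]:
    """One pass of confidence propagation over the initial mappings only.

    If a1 is an ancestor of a2 and b1 is an ancestor of b2, the mappings
    <a1, b1, c1> and <a2, b2, c2> both receive the other's initial confidence.
    """
    boosted = [c for _, _, c in mappings]
    for i, (a1, b1, c1) in enumerate(mappings):
        for j, (a2, b2, c2) in enumerate(mappings):
            if i != j and is_src_ancestor(a1, a2) and is_tgt_ancestor(b1, b2):
                boosted[i] += c2  # ancestor pair gains c2
                boosted[j] += c1  # descendant pair gains c1
    return [(a, b, c) for (a, b, _), c in zip(mappings, boosted)]


# Toy hierarchies: entity -> set of its ancestors (made-up names).
SRC_ANCESTORS = {"Hand": {"Limb"}}
TGT_ANCESTORS = {"Manus": {"Extremity"}}


def src_anc(ancestor: str, entity: str) -> bool:
    return ancestor in SRC_ANCESTORS.get(entity, set())


def tgt_anc(ancestor: str, entity: str) -> bool:
    return ancestor in TGT_ANCESTORS.get(entity, set())


initial = [
    ("Limb", "Extremity", 0.5),
    ("Hand", "Manus", 0.25),
    ("Hand", "Extremity", 0.125),  # not structurally supported by the others
]
result = propagate_confidence(initial, src_anc, tgt_anc)
print(result)
# [('Limb', 'Extremity', 0.75), ('Hand', 'Manus', 0.75), ('Hand', 'Extremity', 0.125)]
```

The two mappings that agree with both hierarchies reinforce each other, while the unsupported third mapping keeps its initial value and therefore becomes an easier target for the verification step. The double loop is quadratic in the number of mappings; a real implementation would restrict the pairs it examines.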
There are two main steps in the Semantic Verification component: (i) identifying inconsistent mappings, and (ii) eliminating inconsistent mappings.

In order to identify inconsistencies, several semantic conflict patterns have been designed in YAM++ as follows (see [4] for more details):

– Two mappings $\langle a_1, b_1 \rangle$ and $\langle a_2, b_2 \rangle$ are in crisscross conflict if $a_1$ is an ancestor of $a_2$ in ontology $O_1$ and $b_2$ is an ancestor of $b_1$ in ontology $O_2$.
– Two mappings $\langle a_1, b_1 \rangle$ and $\langle a_2, b_2 \rangle$ are in disjointness subsumption conflict if $a_1$ is an ancestor of $a_2$ in ontology $O_1$ and $b_2$ is disjoint with $b_1$ in ontology $O_2$, and vice versa.
– A property-property mapping $\langle p_1, p_2 \rangle$ is inconsistent with respect to alignment $A$ if $\{Doms(p_1) \times Doms(p_2)\} \cap A = \emptyset$ and $\{Rans(p_1) \times Rans(p_2)\} \cap A = \emptyset$, where $Doms(p)$ and $Rans(p)$ return the sets of domains and ranges of property $p$.
– Two mappings $\langle a, b_1 \rangle$ and $\langle a, b_2 \rangle$ are in duplicated conflict if the matching cardinality is 1:1 (for a small scale ontology matching scenario) or the semantic similarity $SemSim(b_1, b_2)$ is less than a threshold value $\theta$ (for large scale matching with cardinality 1:m).

Two methods, complete and approximate diagnosis, are used to eliminate inconsistent mappings. For small scale matching, we use the complete version of Alcomo [3]. For large scale matching, we use an approximate method that transforms the task into a Minimum Weighted Vertex Cover problem, which we solve with a modification of the Clarkson algorithm [1], a greedy approach. The idea of this method is that it iteratively removes the mapping with the smallest cost, which is computed as the ratio of its current confidence value to the number of its conflicts.

1.3 Adaptations made for the evaluation

Before running the matching process, YAM++ analyzes the input ontologies and adapts itself to the matching task. In particular, if the annotations of entities in the input ontologies are written in different languages, YAM++ automatically translates them into English.
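
As an illustration of the approximate diagnosis used by the Semantic Verification component (Section 1.2), the sketch below detects crisscross conflicts and then applies the greedy rule stated there: repeatedly remove the mapping with the smallest ratio of confidence to number of remaining conflicts. It is a simplified sketch, not a faithful reimplementation of the modified Clarkson algorithm [1]: only the crisscross pattern is checked, confidence values are not updated between iterations, and the function names and toy data are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Mapping = Tuple[str, str, float]  # (source entity, target entity, confidence)
Ancestors = Dict[str, Set[str]]   # entity -> set of its ancestors


def crisscross_conflicts(
    mappings: List[Mapping], src_anc: Ancestors, tgt_anc: Ancestors
) -> Dict[int, Set[int]]:
    """Conflict graph: mapping index -> indices of mappings it conflicts with."""
    conflicts: Dict[int, Set[int]] = {i: set() for i in range(len(mappings))}
    for i, (a1, b1, _) in enumerate(mappings):
        for j, (a2, b2, _) in enumerate(mappings):
            # a1 is an ancestor of a2 in O1, but b2 is an ancestor of b1 in O2.
            if a1 in src_anc.get(a2, set()) and b2 in tgt_anc.get(b1, set()):
                conflicts[i].add(j)
                conflicts[j].add(i)
    return conflicts


def greedy_repair(
    mappings: List[Mapping], conflicts: Dict[int, Set[int]]
) -> List[Mapping]:
    """Remove mappings until no conflict edge remains."""
    graph = {i: set(neigh) for i, neigh in conflicts.items()}  # working copy
    removed: Set[int] = set()
    while True:
        candidates = [i for i, neigh in graph.items() if neigh]
        if not candidates:
            break
        # Cost = confidence / number of remaining conflicts; smallest goes first.
        victim = min(candidates, key=lambda i: mappings[i][2] / len(graph[i]))
        for neighbour in graph.pop(victim):
            graph[neighbour].discard(victim)
        removed.add(victim)
    return [m for i, m in enumerate(mappings) if i not in removed]


# Toy data: A subsumes B1 and B2 in O1, while Y1 and Y2 both subsume X in O2.
src_ancestors = {"B1": {"A"}, "B2": {"A"}}
tgt_ancestors = {"X": {"Y1", "Y2"}}
candidates = [("A", "X", 0.6), ("B1", "Y1", 0.5), ("B2", "Y2", 0.5)]

graph = crisscross_conflicts(candidates, src_ancestors, tgt_ancestors)
consistent = greedy_repair(candidates, graph)
print(consistent)  # [('B1', 'Y1', 0.5), ('B2', 'Y2', 0.5)]
```

In this example the mapping with the highest confidence is the one removed, because it takes part in two conflicts (cost 0.6 / 2 = 0.3) while each of the others takes part in one (cost 0.5). Dividing by the number of conflicts is what lets the greedy rule prefer removing one widely conflicting mapping over several mutually compatible ones.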
If the number of entities in the input ontologies is smaller than 1000, YAM++ switches to the small scale matching regime; otherwise, it runs in the large scale matching regime. The main difference between the two regimes lies in the Similarity Propagation and Semantic Verification components, as discussed above.

1.4 Link to the system and parameters file

A SEALS client wrapper for the YAM++ system and the parameter files can be downloaded at: http://www2.lirmm.fr/~dngo/YAMplusplus2013.zip. See the instructions in the tutorial from the SEALS platform³ to test our system.

³ http://oaei.ontologymatching.org/2013/seals-eval.html

1.5 Link to the set of provided alignments (in align format)

The results of all tracks can be downloaded at: http://www2.lirmm.fr/~dngo/YAMplusplus2013Results.zip.

2 Results

In this section, we present the evaluation results obtained by running YAM++ with the SEALS client on the Benchmark, Conference, MultiFarm, Library, Anatomy and Large Biomedical Ontologies tracks. All experiments were executed by YAM++ with SEALS client version 4.1 beta and JDK 1.6 on a PC with an Intel Pentium 3.0 processor, 3 GB RAM and Windows XP SP3.

2.1 Benchmark

In OAEI 2013, Benchmark includes 5 blind tests for both organizers and participants. Those tests are regenerations of the bibliography test set. Table 1 shows the average results of YAM++ on the Benchmark dataset.

Test set    H-mean Precision    H-mean Recall    H-mean F-measure
Biblio      0.97                0.82             0.89

Table 1: YAM++ results on pre-test Benchmark track

2.2 Conference

The Conference track now contains 16 ontologies from the same domain (conference organization), and each ontology must be matched against every other ontology.
This track is open+blind, so in Table 2 we can only report our results with respect to the available reference alignments.

Test set          H-mean Precision    H-mean Recall    H-mean F-measure
Conference ra1    0.80                0.69             0.74
Conference ra2    0.78                0.65             0.71

Table 2: YAM++ results on Conference track

2.3 MultiFarm

The goal of the MultiFarm track is to evaluate the ability of matching systems to deal with multilingual ontologies. It is based on the OntoFarm dataset, where the annotations of entities are represented in different languages: English (en), Chinese (cn), Czech (cz), Dutch (nl), French (fr), German (de), Portuguese (pt), Russian (ru) and Spanish (es). YAM++'s results are shown in Fig. 2.

Fig. 2: YAM++ results on MultiFarm track

2.4 Anatomy

The Anatomy track consists of finding an alignment between the Adult Mouse Anatomy (2744 classes) and a part of the NCI Thesaurus (3304 classes) describing the human anatomy. Table 3 shows the evaluation result and runtime of YAM++ on this track.

Test set    Precision    Recall    F-measure    Run time
Anatomy     0.944        0.869     0.905        62 (s)

Table 3: YAM++ results on Anatomy track

2.5 Library

The Library track is a real-world task to match the STW (6575 classes) and TheSoz (8376 classes) thesauri. Table 4 shows the evaluation result and runtime of YAM++ against an existing reference alignment on this track.

Test set    Precision    Recall    F-measure    Run time
Library     0.692        0.808     0.745        411 (s)

Table 4: YAM++ results on Library track

2.6 Large Biomedical Ontologies

This track consists of finding alignments between the Foundational Model of Anatomy (FMA), SNOMED CT, and the National Cancer Institute Thesaurus (NCI). There are 9 sub tasks with different sizes of input ontologies, i.e., small fragments, large fragments and the whole ontologies. Table 5 shows the evaluation results and run times of YAM++ on those sub tasks.

3 General comments

This is the third time YAM++ participates in the OAEI campaign.
We found that the SEALS platform is a very valuable tool for comparing the performance of our system with that of others. We also found that the OAEI tracks cover a wide range of heterogeneity in ontology matching tasks. They are very useful in helping developers and researchers to develop their semantic matching systems.

3.1 Comments on the results

The current version of YAM++ shows a significant improvement in terms of both matching quality and runtime with respect to the previous version. In particular, the H-mean F-measure values on all the very large scale datasets (i.e., Library and Large Biomedical Ontologies) have improved.

Test set               Precision    Recall    F-measure    Run time
Small FMA - NCI        0.976        0.853     0.910        94 (s)
Whole FMA - NCI        0.899        0.846     0.872        366 (s)
Small FMA - SNOMED     0.982        0.729     0.836        100 (s)
Whole FMA - SNOMED     0.947        0.725     0.821        402 (s)
Small SNOMED - NCI     0.967        0.611     0.749        391 (s)
Whole SNOMED - NCI     0.881        0.601     0.714        713 (s)

Table 5: YAM++ results on Large Biomedical Ontologies track

4 Conclusion

In this paper, we have presented our ontology matching system, YAM++, and its evaluation results on different tracks of the OAEI 2013 campaign. The experimental results are promising and show that YAM++ is able to work effectively and efficiently on real-world ontology matching tasks.

In the near future, we will continue improving the matching quality and efficiency of YAM++. Furthermore, we also plan to address the Instance Matching track.

References

[1] Kenneth L. Clarkson. A modification of the greedy algorithm for vertex cover. Information Processing Letters, pages 23–25, 1983.
[2] Sergey Melnik et al. Similarity flooding: A versatile graph matching algorithm and its application to schema matching. In ICDE, pages 117–128, 2002.
[3] Christian Meilicke. Alignment incoherence in ontology matching. PhD thesis, University of Mannheim, Chair of Artificial Intelligence, 2011.
[4] DuyHoa Ngo. Enhancing ontology matching by using machine learning, graph matching and information retrieval techniques. PhD thesis, University Montpellier II, 2012.
[5] DuyHoa Ngo and Zohra Bellahsene. YAM++ results for OAEI 2012. In OM, 2012.
[6] DuyHoa Ngo, Zohra Bellahsene, and Remi Coletta. YAM++ results for OAEI 2011. In OM, 2011.
[7] DuyHoa Ngo, Zohra Bellahsene, and Konstantin Todorov. Opening the black box of ontology matching. In ESWC, pages 16–30, 2013.
[8] DuyHoa Ngo, Zohra Bellahsene, and Remi Coletta. A generic approach for combining linguistic and context profile metrics in ontology matching. In ODBASE Conference, 2011.