OAEI 2018 results of POMap++ Amir Laadhar1 , Faiza Ghozzi2 , Imen Megdiche1 , Franck Ravat1 , Olivier Teste1 , and Faiez Gargouri2 1 University of Toulouse, IRIT (CNRS/UMR 5505) 118 Route de Narbonne 31062 Toulouse, France {amir.laadhar,imen.megdiche,franck.ravat,olivier.teste}@irit.fr, 2 University of Sfax, MIRACL Sakiet Ezzit 3021, Tunisie {faiza.ghozzi,faiez.gargouri}@isims.usf.tn Abstract. Ontology matching is the process of finding a set of corre- spondences between the entities of two or more ontologies representing a similar domain. POMap++ is an ontology matching system associating ontology partitioning to the machine learning techniques. This associ- ation delivers a local matching learning. POMap++ provides an auto- mated local matching learning for the biomedical tracks. For the non- biomedical tracks we employ the version of POMap 2017. In this paper, we present POMap++ as well as the obtained results for the Ontology Alignment Evaluation Initiative of 2018. Keywords: Semantic web, Ontology Matching, Ontology partitioning, Machine learning 1 Presentation of the system Ontologies are the backbone of the semantic web. They enable sharing, reusing and accessing the knowledge resources [9]. Biomedical ontologies are domain- specific knowledge bases widely employed in biology and medicine. These ontolo- gies have been separately developed by different experts using different termi- nologies and modeling techniques. The integration of these data sources requires ontology matching tools. Ontology matching is the identification process corre- spondences between the entities of different ontologies. The alignment process is quite challenging in terms of the complexity of the existing biomedical ontolo- gies. POMap++ divide a biomedical ontology alignment to a set of sub-matching tasks called partitions. We align each sub-matching task using its local adequate settings. We automatically determine the local matching settings by generating a specific machine learning model for each sub-matching task. This automated tuning process of local matching parameters aims to improve the overall match- ing quality of a large ontology matching task. We employed POMap++ for the biomedical matching tasks and POMap [3] for the non-biomedical matching tasks. In the following section, we provide a detailed description of POMap++. 1.1 State, purpose, general statement 1.2 Specific techniques used The workflow of POMap++ for our second participation in the OAEI comprises four main steps, as flagged by the figure 1: Input ontologies indexing and load- ing, input ontologies partitioning, local matching learning and output alignment generation. The first and the last step are the same as in the last version of POMap [3]. In the second step, we define the pair of similar partitions between the two input ontologies. In the third step, we apply machine learning techniques in order to align every identified pair of similar partitions. In the following, we detail each of the four steps. Fig. 1. The architecture of POMap++. Step 1: Input ontologies indexing and loading The first step of the ontology indexation and loading is the pre-processing task. We pre-process the annotations of the two input ontologies by applying the Porter stemming [8] as well as the stop word removal process. We also remove the special characters. These indexes are stored along with the structure of the input ontologies. The structural indexing is responsible for representing the relationships between entities. Then, during the third task, the indexed data structures are loaded into the next step of POMap++. Step 2: Input ontologies partitioning We divide an ontology into a set of partitions using the hierarchical agglomer- ative clustering [5] approach. This approach does not take as input the required number of partitions. The hierarchical agglomerative clustering algorithm re- ceives as input structural similarity scores between all the entities of an input ontology. We compute the structural similarity between the entities of a single ontology according to the following Definition. The Definition 1 is inspired by Wu and Palmer [10] similarity measure. Definition 1 (Structural similarity between entities). We compute the structural similarity between all the entities in one ontology according to the Equation 1. For a given two entities ei,x and ei,y of an ontology Oi , lca is their lowest common ancestor. Dist(ei,x ,lca) represents the shortest distance between ei,x and lca in terms of number of edges. Dist(ei,y ,lca) denote the distance be- tween ei,y and lca. Dist(ri ,lca) is the distance between the root ri and lca. Dist(ri , lca) × 2 StrcSim(ei,x , ei,y ) = (1) Dist(ei,x , lca) + Dist(ei,y , lca) + Dist(ri , lca) × 2 Step 3: Local Matching learning Due to the high complexity of biomedical ontologies, no single syntactic simi- larity measure can effectively all the syntactic heterogeneity of a matching task. Therefore, for each local matching task, we construct its specific machine learn- ing model. The training set of every local learning model is not based on any reference alignments. We automatically construct a supervised training set for each local matching task of the set of local matchings. These training sets serve as the input for each local machine learning model. After identifying the par- titions for each ontology, we find the set of similar partitions between the two input ontologies using a set of anchors. The existing works retrieve labeled data either from the reference alignment or by creating it manually. However, the reference alignment commonly does not exist. We derive each local training set by cross-searching the entities of a local matching with the existing biomedical knowledge bases like Uberon. Since we are dealing with biomedical ontologies, anchors are extracted by cross-searching the input ontologies with the available external biomedical knowledge bases (KB) such as the Unified Medical Lan- guage System (UMLS) Metathesaurus [1], Medical Subject Headings (MeSH) [4], Uberon [6] and BioPortal [7]. For instance, UMLS integrates more than 160 biomedical ontologies. In our case, we cross-search the two input ontologies with the Uberon ontology to derive the set anchors. We employ the-state-of-the art syntactic similarity measures3 as features. The labeled data of the training set is usually hard to acquire. We apply the wrapper feature selection [2] method over the resulted local training sets. This technique selects the subset of the most effective and suitable features for each local training set. Therefore, each local matching task has its specific similarity measures. Then, we build a local machine learning model for each local matching task. The entities of each local matching task are classified using their specific machine learning model. This local learning model aligns the input entities based on the adequate matching parameters. Step 4: Output alignment generation 3 https://git.io/fNvqt The generated correspondences for every local matching task lmij,q are uni- fied to generate the final alignment file for the whole ontology matching task. The alignment file is compared to the reference alignment to evaluate the overall result accuracy. 2 Results 2.1 Anatomy The Anatomy track consists of finding the alignments between the Adult Mouse Anatomy and the NCI Thesaurus describing the human anatomy. The evaluation was run on a server coupled with 3.46 GHz (6 cores) and 8GB of RAM. Table 1 draws the performance of POMap++ compared to the five top matching systems. Our matching system achieved the third best result for this dataset with an F- measure of 89.7%, which is very close to the top results. The remaining challenge is to speed up the execution time by applying more optimizations. We also target the improvement of precision value for our next participation in the OAEI. 2.2 Disease and Phenotype This track is based on a real use case in order to find alignments between disease and phenotype ontologies. Specifically, the selected ontologies are the Human Phenotype Ontology (HPO), the Mammalian Phenotype Ontology (MP), the Human Disease Ontology (DOID) and the Orphanet and Rare Diseases Ontol- ogy(ORDO). The evaluation was run on an Ubuntu Laptop with an Intel Core i9-8950UK CPU @ 2.90GHz x 12 coupled with 25Gb RAM. POMap++ suc- ceeded to complete tow tasks HP-MP and DOID-ORDO. POMap produced 1502 mappings in the HP-MP task associated with 214 unique mappings. Among the eight matching systems, POMap++ achieved the fifth highest F-measure with an F-Measure of 69.9%. In the DOID-ORDO task, POMap generated 2563 map- pings with 174 unique ones. For this task, POMap++ obtained an F-Measure of 84.5% being the third best result for this track. 2.3 LargeBio This tracks aims to find the alignment between three large ontologies: Founda- tional Model of Anatomy (FMA), SNOMED CT, and the National Cancer Insti- tute Thesaurus (NCI). Among six matching tasks between these three ontologies, POMap++ succeeded to perform the matching between FMA-NCI (small frag- ments) and FMA-SNOMED (small fragments) with an F-Measure respectively of 88.9% and 40.4%. For the other tasks of the large biomedical track, POMap++ exceeded the defined timeout. 3 Conclusion The obtained results of POMap++ are promising especially for disease and phenotype as well as the anatomy track in which we ranked as the third top performing matching system. However, we did not opt to perform the local matching using structural-level features. Consequently, we are planning to add structural-level feature to the local matching process. References 1. Bodenreider, O.: The unified medical language system (umls): integrating biomed- ical terminology. vol. 32, pp. D267–D270. Oxford University Press (2004) 2. Kohavi, R., John, G.H.: Wrappers for feature subset selection. vol. 97, pp. 273–324. Elsevier (1997) 3. Laadhar, A., Ghozzi, F., Megdiche, I., Ravat, F., Teste, O., Gargouri, F.: Pomap results for oaei 2017. In: 12th International Workshop on Ontology Matching col- located with the 16th International Semantic Web Conference (OM@ ISWC’17). pp. pp–1 (2017) 4. Lipscomb, C.E.: Medical subject headings (mesh). vol. 88, p. 265. Medical Library Association (2000) 5. Müllner, D.: Modern hierarchical, agglomerative clustering algorithms (2011) 6. Mungall, C.J., Torniai, C., Gkoutos, G.V., Lewis, S.E., Haendel, M.A.: Uberon, an integrative multi-species anatomy ontology. vol. 13, p. R5. BioMed Central (2012) 7. Noy, N.F., Shah, N.H., Whetzel, P.L., Dai, B., Dorf, M., Griffith, N., Jonquet, C., Rubin, D.L., Storey, M.A., Chute, C.G., et al.: Bioportal: ontologies and integrated data resources at the click of a mouse. vol. 37, pp. W170–W173. Oxford University Press (2009) 8. Porter, M.F.: Snowball: A language for stemming algorithms (2001) 9. Shvaiko, P., Euzenat, J.: Ontology matching: state of the art and future challenges. IEEE Transactions on knowledge and data engineering 25(1), 158–176 (2013) 10. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd annual meeting on Association for Computational Linguistics. pp. 133–138. Association for Computational Linguistics (1994)