Partitioning and Matching Tuning of Large Biomedical Ontologies

1 Introduction

Large biomedical ontologies such as SNOMED CT, NCI, and FMA are extensively employed in the biomedical domain. These complex ontologies are based on diverse modelling views and vocabularies. We define an approach that breaks up a large ontology alignment problem into a set of smaller matching tasks. We couple this approach with an automated tuning process, which derives adequate thresholds for the available similarity measure for any biomedical matching task. Experiments demonstrate that coupling ontology partitioning with threshold tuning outperforms existing approaches.

2 Partitioning and Matching Tuning of Biomedical Ontologies

Architecture overview

Figure 1 depicts the stages of ontology partitioning and threshold tuning. These stages are detailed in the following sections.

Ontologies Partitioning

We employ hierarchical agglomerative clustering to divide each ontology into a set of partitions. The method uses Equation 1 to compute the structural similarity between the entities of each input ontology; this equation is inspired by the Wu and Palmer [4] similarity measure. Partitioning an ontology yields a dendrogram, which we cut automatically to obtain a set of partitions. We examine the possible cuts in turn and stop at the first cut that does not produce any isolated partition, that is, a partition containing only one entity. We then identify the similar partition-pairs through the set of exact matches between the input ontologies.
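The cut-selection rule can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes average linkage and assumes that cuts are examined bottom-up, in merge order, neither of which is stated in the text. The function name `first_cut_without_singletons` and the toy similarity values are hypothetical.

```python
from itertools import combinations

def first_cut_without_singletons(items, sim):
    """Agglomerate bottom-up and return the first clustering with no singleton.

    Assumptions (not from the paper): average linkage, and cuts are
    examined from the finest level of the dendrogram upward.
    """
    clusters = [[x] for x in items]

    def linkage(a, b):
        # Average pairwise similarity between two clusters.
        return sum(sim(x, y) for x in a for y in b) / (len(a) * len(b))

    while len(clusters) > 1:
        # Merge the two most similar clusters: one step up the dendrogram.
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        # Stop at the first cut that leaves no isolated (one-entity) partition.
        if all(len(c) > 1 for c in clusters):
            return clusters
    return clusters

# Hypothetical pairwise similarities between five entities.
toy = {frozenset("ab"): 0.9, frozenset("cd"): 0.8,
       frozenset("ce"): 0.6, frozenset("de"): 0.6}

def toy_sim(x, y):
    # Unlisted pairs get a low default similarity.
    return toy.get(frozenset((x, y)), 0.1)

partitions = first_cut_without_singletons(list("abcde"), toy_sim)
```

In this toy run the first two merges still leave a singleton, so the procedure stops only after the third merge, with two partitions. In practice `sim` would be the structural similarity of Equation 1.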

$$\mathrm{StrcSim}(e_{i,m}, e_{i,n}) = \frac{\mathrm{Dist}(r_i, lca) \times 2}{\mathrm{Dist}(e_{i,m}, lca) + \mathrm{Dist}(e_{i,n}, lca) + \mathrm{Dist}(r_i, lca) \times 2} \tag{1}$$
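A small sketch may help make Equation 1 concrete. It assumes that r_i is the root of ontology i, that lca is the lowest common ancestor of the two entities, and that Dist counts edges in the class hierarchy; the text does not spell these out. It also assumes a single-parent hierarchy, whereas real biomedical ontologies often allow multiple parents. The names `ancestors`, `strc_sim`, and the toy hierarchy are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from node up to the root, both inclusive."""
    path = [node]
    while node in parent:
        node = parent[node]
        path.append(node)
    return path

def strc_sim(a, b, parent):
    """Structural similarity of Equation 1 on a single-parent hierarchy."""
    path_a, path_b = ancestors(a, parent), ancestors(b, parent)
    on_path_b = set(path_b)
    # The first ancestor of a that is also an ancestor of b is the lca.
    lca = next(n for n in path_a if n in on_path_b)
    dist_a = path_a.index(lca)              # edges from a to lca
    dist_b = path_b.index(lca)              # edges from b to lca
    dist_root = len(path_a) - 1 - dist_a    # edges from root to lca
    denom = dist_a + dist_b + 2 * dist_root
    # Convention assumed here: the root compared with itself scores 1.0.
    return 2 * dist_root / denom if denom else 1.0

# Hypothetical toy hierarchy: child -> parent.
parent = {"organ": "anatomy", "heart": "organ",
          "lung": "organ", "cell": "anatomy"}

siblings = strc_sim("heart", "lung", parent)   # lca is "organ"
distant = strc_sim("heart", "cell", parent)    # lca is the root
same = strc_sim("heart", "heart", parent)
```

Entities whose only common ancestor is the root score 0, while entities that share a deep common ancestor score close to 1, which is the behaviour the clustering step relies on.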

Threshold tuning

The available external knowledge sources act as mediating biomedical ontologies between the two input ontologies. We cross-search the input ontologies and the mediating ontology in order to find synthetic reference alignments. We compute the similarity score Sim between all the annotations of the generated alignments. These similarity scores are represented by $simScore = \{sim_1, \ldots, sim_n\}$. The threshold value $Th$ is deduced from $simScore$ using Equation 2:

$$Th = \frac{\sum_{i=1}^{n} sim_i}{|simScore|} \tag{2}$$
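Read this way, Equation 2 is the arithmetic mean of the similarity scores. The sketch below illustrates that reading; the scores, the candidate mappings, and the use of the threshold as a lower bound for accepting a mapping are illustrative assumptions, and `derive_threshold` is a hypothetical name.

```python
def derive_threshold(sim_scores):
    """Return the arithmetic mean of the similarity scores (Equation 2)."""
    if not sim_scores:
        raise ValueError("at least one similarity score is required")
    return sum(sim_scores) / len(sim_scores)

# Hypothetical scores for the annotations of a synthetic reference alignment.
sim_scores = [0.8, 0.9, 1.0]
threshold = derive_threshold(sim_scores)

# Hypothetical candidate mappings with their similarity scores.
candidates = {("heart", "Heart"): 0.97, ("lung", "Lung lobe"): 0.72}

# Assumption: a mapping is kept when its score reaches the threshold.
accepted = [pair for pair, score in candidates.items() if score >= threshold]
```

With these made-up values the threshold is about 0.9, so only the first candidate mapping is kept.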

3 Experiments

In Table 1, we compare our proposed partitioning approach to the currently available partitioning strategies using two OAEI 2017 biomedical data sets: the Anatomy task and the LargeBio small segments tasks. We employed UBERON as an external biomedical knowledge source for deriving synthetic reference alignments, and we used the ISUB similarity measure to compute the similarity scores between the derived mappings. In Table 2, we report the accuracy of the partitioning approach with the deduced thresholds.

4 Conclusion and Future Work

As future work, we intend to automate the entire matching tuning process while focusing on the different types of heterogeneity found across the partition-pairs.

Fig. 1. Architecture Overview

Table 1. Anatomy track partitioning results

Approach              Precision   F-Measure   Recall   Number of partitions
Proposed approach     0.945       0.883       0.829    57/57
SeeCOnt [3]           0.951       0.863       0.789    ND
Falcon [2]            0.964       0.730       0.591    139/119
Alsayed et al. [1]    0.975       0.753       0.613    84/80

Table 2. Accuracy and derived thresholds for Anatomy and LargeBio tracks

Track         Precision   F-Measure   Recall   Derived Threshold
Anatomy       0.945       0.883       0.829    0.91
FMA-NCI       0.957       0.870       0.789    0.69
FMA-SNOMED    0.860       0.674       0.554    0.75
SNOMED-NCI    0.911       0.697       0.564    0.85

References

[1] Algergawy, A., Massmann, S., Rahm, E.: A clustering-based approach for large-scale ontology matching. In: East European Conference on Advances in Databases and Information Systems. Springer, Berlin, Heidelberg (2011)

[2] Hu, W., Qu, Y., Cheng, G.: Matching large ontologies: A divide-and-conquer approach. Data & Knowledge Engineering 67(1) (2008)

[3] Algergawy, A., Babalou, S., Kargar, M.J., Davarpanah, S.H.: SeeCOnt: A new seeding-based clustering approach for ontology matching. In: East European Conference on Advances in Databases and Information Systems. Springer (2015)

[4] Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the 32nd Annual Meeting on Association for Computational Linguistics (1994)