We Divide, You Conquer: From Large-scale Ontology Alignment to Manageable Subtasks with a Lexical Index and Neural Embeddings*

Ernesto Jiménez-Ruiz1,2, Asan Agibetov3, Matthias Samwald3, Valerie Cross4

1 The Alan Turing Institute, London, United Kingdom
2 Department of Informatics, University of Oslo, Norway
3 Section for Artificial Intelligence and Decision Support, Medical University of Vienna, Vienna, Austria
4 Miami University, Oxford, OH 45056, United States

1 Introduction

An ontology matching task MT is composed of a pair of ontologies O1 and O2 and possibly an associated reference alignment M^RA. The objective of a matching task is to discover an overlap between O1 and O2 in the form of an alignment M. The size or search space of a matching task is typically bounded by the size of the Cartesian product between the entities of the input ontologies. Large-scale ontology matching tasks still pose serious challenges to state-of-the-art ontology alignment systems [2]. In this paper we propose a novel method to effectively divide an input ontology matching task MT into several independent and more manageable (sub)tasks. This method relies on an efficient lexical index (as in LogMap [3]), a neural embedding model [4] and locality modules [5]. Unlike other state-of-the-art approaches, our method provides guarantees about the preservation of the coverage of the relevant ontology alignment.

2 Methods

The approach presented in this paper relies on an 'inverted' lexical index (we will refer to this index as LexI), commonly used in information retrieval applications and also used in ontology alignment systems such as LogMap [3]. LexI encodes the labels of all entities of the input ontologies O1 and O2, including their lexical variations (e.g., preferred labels, synonyms), in the form of key-value pairs, where the key is a set of words and the value is a set of entity identifiers(1) such that the set of words of the key appears in one of the entity labels.
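The construction of such an index can be sketched as follows. This is a simplified, hypothetical implementation: the function and variable names are ours, keys are restricted to word combinations of at most two words, and no lexical normalisation beyond lowercasing is applied, whereas the actual LexI applies further lexical processing.

```python
from collections import defaultdict
from itertools import combinations

def build_lexi(labels_by_entity, max_key_size=2):
    """Toy LexI sketch: map word combinations (index keys) to the
    identifiers of the entities whose labels contain those words.

    labels_by_entity maps an entity identifier to its labels
    (preferred labels, synonyms, ...).
    """
    lexi = defaultdict(set)
    for entity, labels in labels_by_entity.items():
        for label in labels:
            words = sorted(set(label.lower().split()))
            for size in range(1, max_key_size + 1):
                for key in combinations(words, size):
                    lexi[frozenset(key)].add(entity)
    return lexi

def candidate_mappings(lexi, in_o1):
    """Derive the candidate mappings: entries pointing to only one
    ontology are discarded; each remaining entry pairs every O1
    entity with every O2 entity sharing that entry."""
    mappings = set()
    for entities in lexi.values():
        src = {e for e in entities if in_o1(e)}
        tgt = entities - src
        if src and tgt:  # the entry must span both ontologies
            mappings.update((s, t) for s in src for t in tgt)
    return mappings
```

On the labels of Table 1, for example, the shared entry {acinus} would link the O1 "acinus" entities to O2:Liver acinus, reproducing the kind of candidate mapping discussed below.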
Table 1 shows a few example entries of LexI for two ontologies.

(*) An extended version of this paper is available on arXiv [1].
(1) The indexation module associates unique numerical identifiers to entity URIs.

Table 1: Inverted lexical index LexI (left) and entity index (right). Index values have been split into elements of O1 and O2. '-' indicates that the ontology does not contain entities for that index entry.

Index key               | Entities O1  | Entities O2
{acinus}                | 7661, 8171   | 118081
{mesothelial, pleural}  | 19987        | 117237
{hamate, lunate}        | 55518        | -
{feed, breast}          | -            | 113578, 111023

ID      | URI
7661    | O1:Serous acinus
8171    | O1:Hepatic acinus
19987   | O1:Mesothelial cell of pleura
55518   | O1:Lunate facet of hamate
118081  | O2:Liver acinus
117237  | O2:Pleural Mesothelial Cell
113578  | O2:Breast Feeding
111023  | O2:Inability To Breast Feed

2.1 Creating matching subtasks from LexI

Deriving mappings from LexI. Each entry in LexI, after discarding entries pointing to only one ontology, is a source of candidate mappings. For instance, the example in Table 1 suggests that there is a (potential) mapping between the entities O1:Serous acinus and O2:Liver acinus, since they are associated with the same LexI entry {acinus}. The mappings derived from LexI are not necessarily correct, but they will link lexically related entities. We refer to the set of all mappings suggested by LexI as M_LexI. Note that M_LexI represents a manageable subset of the Cartesian product between the entities of the input ontologies. Mappings outside M_LexI will rarely be discovered by standard matching systems, as these typically rely on lexical similarity measures [2].

Fig. 1: Pipeline to extract matching subtasks from LexI.

Context as matching task. Logic-based module extraction techniques compute ontology fragments that capture the meaning of an input signature (e.g., a set of entities) with respect to a given ontology.
In this paper we rely on bottom-locality modules [5], which will be referred to as locality modules or simply modules. Locality modules play an important role in ontology alignment tasks. For example, they provide the scope or context (i.e., sets of semantically related entities [5]) for the entities in a given mapping or set of mappings. The context of the mappings M_LexI derived from LexI leads to two ontology modules O1_LexI and O2_LexI from O1 and O2, respectively. MT_LexI = ⟨O1_LexI, O2_LexI⟩ is the (single) matching subtask derived from LexI.

The whole set of entries in LexI, however, may lead to a very large number of candidate mappings M_LexI and, as a consequence, to large modules O1_LexI and O2_LexI. These modules, although smaller than O1 and O2, can still be challenging for many ontology matching systems. A solution is to divide the entries in LexI into more than one cluster. Figure 1 shows an overview of the pipeline where LexI is split into n clusters, and these n clusters lead to n matching subtasks D^n_MT = {MT^1_LexI, ..., MT^n_LexI}.

Clustering strategies. We have implemented two clustering strategies, which we refer to as naive and neural embedding. The naive strategy implements a very simple algorithm that randomly splits the entries in LexI into a given number of clusters of the same size, while the neural embedding strategy aims at identifying more accurate clusters and relies on the StarSpace toolkit and its neural embedding model [4].
Applied to the lexical index LexI, the neural embedding model learns vector representations for the individual words in the index keys and for the individual entity identifiers in the index values. Since an index key is a set of words (see Table 1), we use the mean of the vector representations associated with each word. Based on these aggregated neural embeddings we then perform standard clustering with the K-means algorithm.

3 Evaluation

In this section we provide empirical evidence of the suitability of the presented approach to divide an ontology matching task.(2) We rely on the datasets of the Ontology Alignment Evaluation Initiative (OAEI), more specifically on the matching tasks provided in the anatomy (AMA-NCIA), largebio (FMA-NCI, FMA-SNOMED, SNOMED-NCI) and phenotype (HPO-MP, DOID-ORDO) tracks.

Adequacy of clustering strategies. We have evaluated the clustering strategies in terms of the coverage with respect to the OAEI reference alignments(3) and the size of the ontologies in the matching subtasks. We have compared the two strategies for different numbers of matching subtasks n ∈ {2, 5, 10, 20, 50, 100, 200}. Figure 2 shows the coverage of the different divisions D^n_MT for the naive (left) and neural embedding (right) strategies. The results are very good: even in the worst case, approx. 93% of the available reference mappings in SNOMED-NCI are covered by the matching subtasks in D^200_MT.

Fig. 2: Coverage ratio with respect to the number of matching subtasks in D^n_MT, for the AMA-NCIA, FMA-SNOMED, HPO-MP, FMA-NCI, SNOMED-NCI and DOID-ORDO tasks. (a) Naive strategy; (b) Neural embedding strategy. [Plot omitted; coverage stays between approx. 0.93 and 1.00.]
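For concreteness, the neural embedding clustering strategy described above can be sketched as follows. The helper names are ours; the word vectors would in practice come from the trained StarSpace model (here they are arbitrary inputs), and a production system would use an off-the-shelf K-means implementation rather than this minimal one.

```python
import random

def mean_vector(key, word_vectors):
    """Aggregate an index key (a set of words) into the mean of its
    word vectors, as done for the LexI keys."""
    vectors = [word_vectors[w] for w in key]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-means sketch returning one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                  for d in range(len(p))))
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(len(members[0]))]
    return labels
```

Each resulting cluster of LexI entries then yields one matching subtask via the mapping-derivation and module-extraction steps of Section 2.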
The scatter plots in Figure 3 visualize, for the FMA-NCI case, the size of the source modules against the size of the target modules for the matching subtasks in each division D^n_MT. For instance, the (orange) triangles represent the points (|Sig(O1_i)|, |Sig(O2_i)|), where O1_i and O2_i (i = 1, ..., 5) are the ontologies in the matching subtasks of D^5_MT. The naive strategy leads to rather balanced and similarly sized tasks for each division D^n_MT, while the neural embedding strategy shows more variability in task size within a given division D^n_MT. Nonetheless, on average, the matching subtasks produced by the neural embedding strategy are significantly smaller than those produced by the naive strategy.

Fig. 3: Source and target module sizes in the computed subtasks for FMA-NCI. (a) Naive strategy; (b) Neural embedding strategy. [Scatter plots omitted; source and target module sizes range from 0 to approx. 35,000 for n ∈ {2, 5, 10, 20, 50, 100, 200}.]

Evaluation of OAEI systems. We have selected the following four systems from the latest OAEI campaigns: Mamba, FCA-Map, KEPLER, and POMap. These systems were unable to complete some OAEI tasks, given some computational constraints. Table 2 shows the results obtained with the different divisions D^n_MT computed by the naive and neural embedding strategies, for the OAEI tasks on which the selected systems failed to compute results. For example, Mamba was able to complete the OAEI 2015 Anatomy track with the divisions D^20_MT and D^50_MT, involving 20 and 50 matching subtasks, respectively. The subtasks generated by the neural embedding strategy lead to much lower computation times.

The results are encouraging and suggest that the proposed method to divide an ontology matching task (i) leads to very limited information loss (i.e., high coverage), and (ii) enables new systems to complete large-scale OAEI tasks.

(2) Extended evaluation material is available in [1] and at https://doi.org/10.5281/zenodo.1214149.
(3) A mapping is covered if it can (potentially) be discovered in one or more matching subtasks.

Table 2: Evaluation of systems that failed to complete OAEI tasks in 2015-2017.

                                     |  Naive strategy        |  Neural embedding strategy
Tool     Task     Year  Subtasks    |  P     R     F    t (h) |  P     R     F    t (h)
Mamba    Anatomy  2015    20        | 0.88  0.63  0.73   2.3  | 0.89  0.62  0.73   1.0
                          50        | 0.88  0.62  0.73   2.4  | 0.89  0.62  0.73   1.0
FCA-Map  FMA-NCI  2016    20        | 0.56  0.90  0.72   4.4  | 0.62  0.90  0.73   3.1
                          50        | 0.58  0.90  0.70   4.1  | 0.60  0.90  0.72   3.0
KEPLER   FMA-NCI  2017    20        | 0.45  0.82  0.58   8.9  | 0.48  0.80  0.60   4.3
                          50        | 0.42  0.83  0.56   6.9  | 0.46  0.80  0.59   3.8
POMap    FMA-NCI  2017    20        | 0.54  0.83  0.66  11.9  | 0.56  0.79  0.66   5.7
                          50        | 0.55  0.83  0.66   8.8  | 0.57  0.79  0.66   4.1

References

1. Jiménez-Ruiz, E., Agibetov, A., Samwald, M., Cross, V.: Breaking-down the Ontology Alignment Task with a Lexical Index and Neural Embeddings. arXiv (2018). Available from: https://arxiv.org/abs/1805.12402
2. Shvaiko, P., Euzenat, J.: Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng. 25(1) (2013) 158-176
3. Jiménez-Ruiz, E., Cuenca Grau, B.: LogMap: Logic-Based and Scalable Ontology Matching. In: International Semantic Web Conference (2011) 273-288
4. Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: StarSpace: Embed All The Things! arXiv (2017)
5. Cuenca Grau, B., Horrocks, I., Kazakov, Y., Sattler, U.: Modular reuse of ontologies: Theory and practice. J. Artif. Intell. Res. 31 (2008) 273-318