We Divide, You Conquer: From Large-scale Ontology Alignment to Manageable Subtasks with a Lexical Index and Neural Embeddings*

Ernesto Jiménez-Ruiz1,2, Asan Agibetov3, Matthias Samwald3, Valerie Cross4

1 The Alan Turing Institute, London, United Kingdom
2 Department of Informatics, University of Oslo, Norway
3 Section for Artificial Intelligence and Decision Support, Medical University of Vienna, Vienna, Austria
4 Miami University, Oxford, OH 45056, United States

1 Introduction

An ontology matching task MT is composed of a pair of ontologies O1 and O2 and possibly an associated reference alignment M^RA. The objective of a matching task is to discover an overlap between O1 and O2 in the form of an alignment M. The size or search space of a matching task is typically bounded by the size of the Cartesian product between the entities of the input ontologies. Large-scale ontology matching tasks still pose serious challenges to state-of-the-art ontology alignment systems [2]. In this paper we propose a novel method to effectively divide an input ontology matching task MT into several independent and more manageable (sub)tasks. This method relies on an efficient lexical index (as in LogMap [3]), a neural embedding model [4] and locality modules [5]. Unlike other state-of-the-art approaches, our method provides guarantees about the preservation of the coverage of the relevant ontology alignment.

2 Methods

The approach presented in this paper relies on an 'inverted' lexical index (we will refer to this index as LexI), commonly used in information retrieval applications and also used in ontology alignment systems such as LogMap [3]. LexI encodes the labels of all entities of the input ontologies O1 and O2, including their lexical variations (e.g., preferred labels, synonyms), in the form of key-value pairs, where the key is a set of words and the value is a set of entity identifiers(1) such that the set of words of the key appears in one of the entity labels.
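The construction of such an index can be sketched as follows. This is a simplified, hypothetical implementation: the function and variable names are ours, keys are restricted to word combinations of at most two words, and no lexical normalisation beyond lowercasing is applied, whereas the actual LexI applies further lexical processing.

```python
from collections import defaultdict
from itertools import combinations

def build_lexi(labels_by_entity, max_key_size=2):
    """Toy LexI sketch: map word combinations (index keys) to the
    identifiers of the entities whose labels contain those words.

    labels_by_entity maps an entity identifier to its labels
    (preferred labels, synonyms, ...).
    """
    lexi = defaultdict(set)
    for entity, labels in labels_by_entity.items():
        for label in labels:
            words = sorted(set(label.lower().split()))
            for size in range(1, max_key_size + 1):
                for key in combinations(words, size):
                    lexi[frozenset(key)].add(entity)
    return lexi

def candidate_mappings(lexi, in_o1):
    """Derive the candidate mappings: entries pointing to only one
    ontology are discarded; each remaining entry pairs every O1
    entity with every O2 entity sharing that entry."""
    mappings = set()
    for entities in lexi.values():
        src = {e for e in entities if in_o1(e)}
        tgt = entities - src
        if src and tgt:  # the entry must span both ontologies
            mappings.update((s, t) for s in src for t in tgt)
    return mappings
```

On the labels of Table 1, for example, the shared entry {acinus} would link the O1 "acinus" entities to O2:Liver acinus, reproducing the kind of candidate mapping discussed below.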
Table 1 shows a few example entries of LexI for two ontologies.

(*) An extended version of this paper is available on arXiv [1].
(1) The indexation module associates unique numerical identifiers to entity URIs.

Table 1: Inverted lexical index LexI (left) and entity index (right). Index values have been split into elements of O1 and O2. '-' indicates that the ontology does not contain entities for that index entry.

Index key               | Entities O1  | Entities O2
{acinus}                | 7661, 8171   | 118081
{mesothelial, pleural}  | 19987        | 117237
{hamate, lunate}        | 55518        | -
{feed, breast}          | -            | 113578, 111023

ID      | URI
7661    | O1:Serous acinus
8171    | O1:Hepatic acinus
19987   | O1:Mesothelial cell of pleura
55518   | O1:Lunate facet of hamate
118081  | O2:Liver acinus
117237  | O2:Pleural Mesothelial Cell
113578  | O2:Breast Feeding
111023  | O2:Inability To Breast Feed

2.1 Creating matching subtasks from LexI

Deriving mappings from LexI. Each entry in LexI, after discarding entries pointing to only one ontology, is a source of candidate mappings. For instance, the example in Table 1 suggests that there is a (potential) mapping between the entities O1:Serous acinus and O2:Liver acinus, since they are associated with the same LexI entry {acinus}. The mappings derived from LexI are not necessarily correct, but they will link lexically related entities. We refer to the set of all mappings suggested by LexI as M_LexI. Note that M_LexI represents a manageable subset of the Cartesian product between the entities of the input ontologies. Mappings outside M_LexI will rarely be discovered by standard matching systems, as these typically rely on lexical similarity measures [2].

Fig. 1: Pipeline to extract matching subtasks from LexI.

Context as matching task. Logic-based module extraction techniques compute ontology fragments that capture the meaning of an input signature (e.g., a set of entities) with respect to a given ontology.
In this paper we rely on bottom-locality modules [5], which will be referred to as locality modules or simply modules. Locality modules play an important role in ontology alignment tasks. For example, they provide the scope or context (i.e., sets of semantically related entities [5]) for the entities in a given mapping or set of mappings. The context of the mappings M_LexI derived from LexI leads to two ontology modules O1_LexI and O2_LexI from O1 and O2, respectively. MT_LexI = ⟨O1_LexI, O2_LexI⟩ is the (single) matching subtask derived from LexI.

The whole set of entries in LexI, however, may lead to a very large number of candidate mappings M_LexI and, as a consequence, to large modules O1_LexI and O2_LexI. These modules, although smaller than O1 and O2, can still be challenging for many ontology matching systems. A solution is to divide the entries in LexI into more than one cluster. Figure 1 shows an overview of the pipeline where LexI is split into n clusters, and these n clusters lead to n matching subtasks D^n_MT = {MT^1_LexI, ..., MT^n_LexI}.

Clustering strategies. We have implemented two clustering strategies, which we refer to as naive and neural embedding. The naive strategy implements a very simple algorithm that randomly splits the entries in LexI into a given number of clusters of the same size, while the neural embedding strategy aims at identifying more accurate clusters and relies on the StarSpace toolkit and its neural embedding model [4].
Applied to the lexical index LexI, the neural embedding model learns vector representations for the individual words in the index keys and for the individual entity identifiers in the index values. Since an index key is a set of words (see Table 1), we use the mean of the vector representations associated with each word. Based on these aggregated neural embeddings we then perform standard clustering with the K-means algorithm.

3 Evaluation

In this section we provide empirical evidence of the suitability of the presented approach to divide an ontology matching task.(2) We rely on the datasets of the Ontology Alignment Evaluation Initiative (OAEI), more specifically on the matching tasks provided in the anatomy (AMA-NCIA), largebio (FMA-NCI, FMA-SNOMED, SNOMED-NCI) and phenotype (HPO-MP, DOID-ORDO) tracks.

Adequacy of clustering strategies. We have evaluated the clustering strategies in terms of the coverage with respect to the OAEI reference alignments(3) and the size of the ontologies in the matching subtasks. We have compared the two strategies for different numbers of matching subtasks n ∈ {2, 5, 10, 20, 50, 100, 200}. Figure 2 shows the coverage of the different divisions D^n_MT for the naive (left) and neural embedding (right) strategies. The results are very good: even in the worst case, approx. 93% of the available reference mappings in SNOMED-NCI are covered by the matching subtasks in D^200_MT.

Fig. 2: Coverage ratio with respect to the number of matching subtasks in D^n_MT, for the AMA-NCIA, FMA-SNOMED, HPO-MP, FMA-NCI, SNOMED-NCI and DOID-ORDO tasks. (a) Naive strategy; (b) Neural embedding strategy. [Plot omitted; coverage stays between approx. 0.93 and 1.00.]
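For concreteness, the neural embedding clustering strategy described above can be sketched as follows. The helper names are ours; the word vectors would in practice come from the trained StarSpace model (here they are arbitrary inputs), and a production system would use an off-the-shelf K-means implementation rather than this minimal one.

```python
import random

def mean_vector(key, word_vectors):
    """Aggregate an index key (a set of words) into the mean of its
    word vectors, as done for the LexI keys."""
    vectors = [word_vectors[w] for w in key]
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]

def kmeans(points, k, iterations=20, seed=0):
    """Minimal K-means sketch returning one cluster label per point."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((p[d] - centroids[c][d]) ** 2
                                  for d in range(len(p))))
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(m[d] for m in members) / len(members)
                                for d in range(len(members[0]))]
    return labels
```

Each resulting cluster of LexI entries then yields one matching subtask via the mapping-derivation and module-extraction steps of Section 2.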
The scatter plots in Figure 3 visualize, for the FMA-NCI case, the size of the source modules against the size of the target modules for the matching subtasks in each division D^n_MT. For instance, the (orange) triangles represent the points (|Sig(O1_i)|, |Sig(O2_i)|), where O1_i and O2_i (i = 1, ..., 5) are the ontologies in the matching subtasks of D^5_MT. The naive strategy leads to rather balanced and similarly sized tasks for each division D^n_MT, while the neural embedding strategy shows more variability in task size within a given division D^n_MT. Nonetheless, on average, the matching subtasks produced by the neural embedding strategy are significantly smaller than those produced by the naive strategy.

Fig. 3: Source and target module sizes in the computed subtasks for FMA-NCI. (a) Naive strategy; (b) Neural embedding strategy. [Scatter plots omitted; source and target module sizes range from 0 to approx. 35,000 for n ∈ {2, 5, 10, 20, 50, 100, 200}.]

Evaluation of OAEI systems. We have selected the following four systems from the latest OAEI campaigns: Mamba, FCA-Map, KEPLER, and POMap. These systems were unable to complete some OAEI tasks, given some computational constraints. Table 2 shows the results obtained with the different divisions D^n_MT computed by the naive and neural embedding strategies, for the OAEI tasks on which the selected systems failed to compute results. For example, Mamba was able to complete the OAEI 2015 Anatomy track with the divisions D^20_MT and D^50_MT, involving 20 and 50 matching subtasks, respectively. The subtasks generated by the neural embedding strategy lead to much lower computation times.

The results are encouraging and suggest that the proposed method to divide an ontology matching task (i) leads to very limited information loss (i.e., high coverage), and (ii) enables new systems to complete large-scale OAEI tasks.

(2) Extended evaluation material is available in [1] and at https://doi.org/10.5281/zenodo.1214149.
(3) A mapping is covered if it can (potentially) be discovered in one or more matching subtasks.

Table 2: Evaluation of systems that failed to complete OAEI tasks in 2015-2017.

                                     |  Naive strategy        |  Neural embedding strategy
Tool     Task     Year  Subtasks    |  P     R     F    t (h) |  P     R     F    t (h)
Mamba    Anatomy  2015    20        | 0.88  0.63  0.73   2.3  | 0.89  0.62  0.73   1.0
                          50        | 0.88  0.62  0.73   2.4  | 0.89  0.62  0.73   1.0
FCA-Map  FMA-NCI  2016    20        | 0.56  0.90  0.72   4.4  | 0.62  0.90  0.73   3.1
                          50        | 0.58  0.90  0.70   4.1  | 0.60  0.90  0.72   3.0
KEPLER   FMA-NCI  2017    20        | 0.45  0.82  0.58   8.9  | 0.48  0.80  0.60   4.3
                          50        | 0.42  0.83  0.56   6.9  | 0.46  0.80  0.59   3.8
POMap    FMA-NCI  2017    20        | 0.54  0.83  0.66  11.9  | 0.56  0.79  0.66   5.7
                          50        | 0.55  0.83  0.66   8.8  | 0.57  0.79  0.66   4.1

References

1. Jiménez-Ruiz, E., Agibetov, A., Samwald, M., Cross, V.: Breaking-down the Ontology Alignment Task with a Lexical Index and Neural Embeddings. arXiv (2018). Available from: https://arxiv.org/abs/1805.12402
2. Shvaiko, P., Euzenat, J.: Ontology matching: State of the art and future challenges. IEEE Trans. Knowl. Data Eng. 25(1) (2013) 158-176
3. Jiménez-Ruiz, E., Cuenca Grau, B.: LogMap: Logic-Based and Scalable Ontology Matching. In: International Semantic Web Conference (2011) 273-288
4. Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: StarSpace: Embed All The Things! arXiv (2017)
5. Cuenca Grau, B., Horrocks, I., Kazakov, Y., Sattler, U.: Modular reuse of ontologies: Theory and practice. J. Artif. Intell. Res. 31 (2008) 273-318