RDF2Vec-based Classification of Ontology
              Alignment Changes

                           Matthias Jurisch, Bodo Igler

                    RheinMain University of Applied Sciences
                Department of Design – Computer Science – Media
                              Unter den Eichen 5
                          65195 Wiesbaden, Germany
               matthias.jurisch@hs-rm.de, bodo.igler@hs-rm.de


      Abstract. When ontologies cover overlapping topics, the overlap can be
      represented using ontology alignments. These alignments need to be con-
      tinuously adapted to changing ontologies. Especially for large ontologies
      this is a costly task often consisting of manual work. Finding changes
      that do not lead to an adaption of the alignment can potentially make
      this process significantly easier. This work presents an approach to find-
      ing these changes based on RDF embeddings and common classification
      techniques. To examine the feasibility of this approach, an evaluation on
      a real-world dataset is presented. In this evaluation, the best classifiers
      reached a precision of 0.8.


Keywords: RDF Embedding, Change Classification, Ontology Alignment,
Ontology Mapping, Mapping Adaption


1   Introduction

Finding alignments between ontologies, also known as ontology matching, is a
non-trivial task and has been an active area of research over the last ten years.
Several approaches in this area are based on the structure of the ontologies,
logical axioms or lexical similarity [1]. However, once these alignments are found,
they will not necessarily stay untouched forever. Especially when alignments
connect large ontologies, adapting these alignments to changes is a work-intensive
task. In the area of biomedical ontologies, some alignments contain around 6500
correspondences that might be affected by a change in one of the ontologies they
connect. Given a change in an ontology, detecting which parts of the alignment
are affected by the change and need to be adapted is a non-trivial task that
usually requires manual labour. The effort required for this task can be
significantly reduced if some changes can be excluded from it. However, it is
usually not clear how to identify changes that do not affect the alignment.
    In this paper, we propose an approach to this problem based on RDF em-
beddings and well-known classification techniques. The central aspect of this
approach is to represent changed concepts by their RDF embedding and classify
whether an alignment statement nearby should be changed. To gain evidence on
whether this approach works, we evaluate it using a dataset from the area of biomedical
ontologies. On this dataset, our approach is able to identify changes affecting
alignment statements with a precision of 0.8.
    The remainder of this work is structured as follows: Section 2 discusses foun-
dations of our work and related approaches. The general approach is presented
in Section 3. Evaluation methodology, the dataset and results are shown in Sec-
tion 4. Section 5 discusses the results of our evaluation, and advantages and
disadvantages of our approach. A conclusion is given in Section 6.


2    Foundations and Related Work

An ontology alignment (sometimes also called ontology mapping) is a set of
correspondences between entities in different ontologies [1]. To make it easier to
reason about these alignments, we use the following formal definition in the style
of [2] for ontology mappings: An alignment between two ontologies $O_1$ and $O_2$
is defined as

     $$A_{O_1,O_2} = \{(c_1, c_2, \mathit{semType}) \mid c_1 \in O_1,\ c_2 \in O_2,\ \mathit{semType} \in \{\equiv, \leq, \geq\}\}$$

    $A_{O_1,O_2}$ is the set of all alignment statements. To denote a change of an
ontology over time, we use the prime symbol (e.g., a changed version of $O$ is
denoted as $O'$). The alignment adaption problem for two ontologies $O_1$ and $O_2$
connected by $A_{O_1,O_2}$ can then be stated as finding a new alignment
$A'_{O'_1,O'_2}$ when $O_1$ and $O_2$ evolve to $O'_1$ and $O'_2$.
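
    The definition above can be mirrored in a small data structure. The
following is only a sketch under our own assumptions: the class names, the enum
member names and the example entity identifiers are hypothetical and are not
taken from the dataset or from [2]. It represents an alignment as a set of
(c1, c2, semType) triples and shows how two versions of an alignment can be
compared.

```python
from dataclasses import dataclass
from enum import Enum


class SemType(Enum):
    """The three semantic types allowed in the definition (names are our own)."""
    EQUIV = "≡"
    LESS_EQ = "≤"
    GREATER_EQ = "≥"


@dataclass(frozen=True)
class Correspondence:
    """One alignment statement (c1, c2, semType)."""
    c1: str            # entity from the first ontology
    c2: str            # entity from the second ontology
    sem_type: SemType


def changed_statements(old, new):
    """Return the statements removed from and added to an alignment."""
    return old - new, new - old


# Hypothetical entity identifiers, for illustration only.
alignment = {
    Correspondence("nci:Heart", "fma:Heart", SemType.EQUIV),
    Correspondence("nci:Valve", "fma:Cardiac_valve", SemType.LESS_EQ),
}
alignment_new = {
    Correspondence("nci:Heart", "fma:Heart", SemType.EQUIV),
    Correspondence("nci:Valve", "fma:Cardiac_valve", SemType.EQUIV),
}

removed, added = changed_statements(alignment, alignment_new)
```

Because the statements are immutable and hashable, adapting an alignment
reduces to set operations on the two versions in this representation.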
    In the area of ontology alignment adaption, several approaches are based on
rules or rule-based dependency analysis. [5] is focussed on finding which changes
are relevant to parts of the alignment using a dependency analysis. [10] proposed
an incremental approach reacting to specific changes in database schemas based
on rules. For each change pattern a specific modification for the mapping is
defined. [12] proposed an approach that is based on a composition of alignments.
A new alignment $A'_{O'_1,O'_2}$ is created by a composition of the alignment
$A_{O_1,O_2}$ and $A^{+}_{O_2,O'_2}$, the alignment between $O_2$ and $O'_2$.
[2] have shown that these techniques
can also be applied to ontologies. However, all of these approaches require a set
of rules that need to be constructed by a domain expert and are not necessarily
reusable for other domains. Also, these approaches are not able to identify which
changes in the ontologies are prone to causing an alignment change.
    The task of knowledge base completion shares some properties with the prob-
lem we address in this work. In that area, classifiers are given a subject and a
predicate and try to predict an object [7]. Approaches like [11], [4] and [9] also use
vector representations for prediction. However, this task does not take changes
in the knowledge bases into account and is not applied to ontology alignments.
    To our knowledge, no approach exists that predicts whether a given change
has an impact on the alignment without using a detailed set of rules. This issue
is at the core of our research.

3     Approach

Our general approach is based on the representation of changed resources using
RDF embeddings, a representation of RDF nodes as vectors in a high-dimensional,
dense vector space. RDF embeddings are generated using RDF2Vec [8], an
approach based on random graph walks as input to Word2Vec [6]. The RDF2Vec
model is trained on an RDF graph consisting of the ontologies $O_1$ and $O_2$ as
well as the alignment $A_{O_1,O_2}$ as defined in Section 2. With these embeddings,
we train a classifier on whether a changed resource affected an alignment statement
and use this classifier to predict whether other changes will affect the alignment.
We define a changed resource as leading to an alignment change if a changed
alignment statement is within a distance of two in the RDF graph. This relatively
small distance is used to make it easier to exclude certain regions from the search
for affected statements. For the same reason, only changes that are close to an
alignment axiom are considered. The respective changes $c$ are extracted using an
extension of the Protégé plugin owl-diff (see footnote 1). By comparing the parts
of $A_{O_1,O_2}$ and $A'_{O'_1,O'_2}$ that are in the direct neighbourhood of $c$,
it is possible to separate all changes into two groups: (1) changes that caused an
alignment change in their neighbourhood and (2) changes that did not cause an
alignment change in their neighbourhood and therefore did not affect the alignment.
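
    The labelling rule described above can be sketched as a bounded
breadth-first search. This is an illustration under our own assumptions, not
the implementation used in the evaluation: the graph is modelled as an
undirected adjacency dictionary, a changed alignment statement is represented
by the nodes it touches, and the function names and the label encoding
(1 = alignment change nearby, 0 = none) are hypothetical.

```python
from collections import deque


def within_distance(graph, start, targets, max_dist=2):
    """Return True if any node in `targets` is at most `max_dist` edges from `start`."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node in targets:
            return True
        if dist == max_dist:
            continue  # do not expand beyond the distance limit
        for neighbour in graph.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return False


def label_changes(graph, changed_resources, changed_alignment_nodes):
    """Assign class 1 to changes with a changed alignment statement nearby, else 0."""
    return {c: int(within_distance(graph, c, changed_alignment_nodes))
            for c in changed_resources}


# Toy chain x - a - b - c; the changed alignment statement touches node "c".
graph = {"x": ["a"], "a": ["x", "b"], "b": ["a", "c"], "c": ["b"]}
labels = label_changes(graph, ["x", "a", "b"], {"c"})
```

In this toy graph, "a" and "b" fall into group (1) and "x", which is three
edges away from "c", falls into group (2).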
    Each changed resource is represented by the corresponding RDF2Vec vector.
Hence, the input to the training of the classifier is a pair (v(c), k) consisting
of a vector v(c) and a class k. k determines whether c caused a change in its
direct neighbourhood. The task at hand is to correctly classify new vectors. To
solve this problem, we use several common classification techniques: Regression,
Naive Bayes, Tree-Based Algorithms as well as Support Vector Machines and
Multilayer Perceptrons. Each algorithm is trained on one set of changes and
evaluated on a different set.
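
    The training and evaluation step on pairs (v(c), k) can be sketched with
scikit-learn as follows. This is a minimal sketch under stated assumptions:
the vectors are synthetic Gaussian blobs standing in for real RDF2Vec
embeddings, the dimensionality is reduced to 16 to keep the example fast, and
logistic regression is picked as just one of the listed techniques. The numbers
it produces say nothing about the real dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
DIM = 16  # assumption for the sketch; real embeddings would be larger


def synthetic_changes(n_per_class):
    """Stand-in for (v(c), k) pairs: two Gaussian blobs, one per class k."""
    X = np.vstack([
        rng.normal(-2.0, 1.0, (n_per_class, DIM)),  # class 0
        rng.normal(+2.0, 1.0, (n_per_class, DIM)),  # class 1
    ])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y


# One set of changes for training, a different set for evaluation.
X_train, y_train = synthetic_changes(100)
X_test, y_test = synthetic_changes(50)

clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
```

Other scikit-learn classifiers expose the same `fit` and `predict` methods, so
they can be swapped in without changing the rest of the sketch.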


4     Evaluation

The research questions behind our evaluation are the following:

 1. Can RDF embeddings be used for change classification with an acceptable
    performance? This question clarifies whether our approach is in general
    applicable to the problem at hand.
 2. Which classifiers can be used for this problem? This question is used to
    identify the best classifiers for our problem.


4.1    Dataset

The dataset used to answer these research questions in our experiments is a
real-world dataset from the domain of biomedical ontologies. It has been used
in several works that deal with alignment adaption, e.g., [3], [2]. The dataset
comprises three ontologies: SNOMED-CT, the NCI-Thesaurus and FMA. For
each ontology, yearly versions from 2009-2012 are available. Additionally, the
dataset contains alignments extracted from the UMLS metathesaurus between
the ontologies for each year. This dataset has been made publicly available
(see footnote 2) by the authors of [2].
    For simplicity of our presentation, we will only present the alignment between
the ontologies NCI and FMA in the version change from 2009 to 2010. In the
formal notation introduced in Section 2, $O_1$ and $O_2$ refer to the ontologies NCI
and FMA as of 2009 and $O'_1$ and $O'_2$ as of 2010, respectively. The alignment from
2009 is denoted by $A_{O_1,O_2}$ and the version from 2010 by $A'_{O'_1,O'_2}$.
    From 2009 to 2010, 924 changes are near alignment statements, of which
47% require an adaption. These changes are used as a training set. The test set
consists of the changes from 2010 to 2011. This set contains 785 changes near
alignment statements, of which 36% lead to an alignment adaption.

1  https://github.com/mhfj/owl-diff


4.2   Methodology

To generate RDF embeddings, the code from RDF2Vec [8] was used. The
embeddings were trained using the skip-gram model, with 500 dimensions used
for the embeddings and random walks of length 8, as this was identified as the
best-performing variant in [8]. An overview of the classification methods used
on these embeddings is given in Table 1. Standard scikit-learn (see footnote 3)
implementations are used for the classification process. The classifiers are
trained on changes from 2009-2010 and evaluated on changes from 2010-2011 of the
dataset described in Section 4.1. The performance of different classification
techniques is evaluated based on f1-measure, accuracy, precision and recall.
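
    To illustrate the random-walk step that precedes the skip-gram training,
the following sketch generates walks over a tiny labelled graph. It is not the
RDF2Vec code from [8]. The graph encoding (node mapped to a list of
(predicate, object) pairs), the function name and the reading of "length 8" as
the number of hops are assumptions made for this example. The resulting token
sequences would then be passed to a Word2Vec trainer, which is not shown here.

```python
import random


def random_walks(graph, start, num_walks, depth, rng):
    """Generate `num_walks` walks from `start`, each following at most `depth` edges.

    A walk is a token sequence: node, predicate, node, predicate, node, ...
    """
    walks = []
    for _ in range(num_walks):
        walk = [start]
        node = start
        for _ in range(depth):
            edges = graph.get(node, [])
            if not edges:
                break  # dead end: stop this walk early
            predicate, node = rng.choice(edges)
            walk.extend([predicate, node])
        walks.append(walk)
    return walks


# Toy graph with hypothetical node and predicate names.
graph = {
    "a": [("p", "b")],
    "b": [("q", "c"), ("r", "a")],
    "c": [],
}
walks = random_walks(graph, "a", num_walks=5, depth=8, rng=random.Random(42))
```

Passing an explicit `random.Random` instance keeps the walks reproducible,
which is convenient when embeddings have to be regenerated.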


                                Table 1. Classifiers

      Category                 Method
      Regression               Logistic Regression (LR)
      Naive Bayes              Gaussian Naive Bayes (NB)
      Nearest Neighbour        KNN
      Tree-Based Algorithms    CART, Random Forest
      Support Vector Machines  RBF-Kernel, Linear Kernel
      Multilayer Perceptron    MLP, hidden layer sizes: 250; 250,250; 500; 500,500


2  https://dbs.uni-leipzig.de/de/research/projects/evolution_of_ontologies_and_mappings/ontology_mapping_adaption
3  http://scikit-learn.org/stable/

4.3   Results

The results of the described process are displayed in Table 2. Only changes
close to the alignment were included in this evaluation, since it would otherwise
be very easy to achieve accuracy values above 95%. Results for MLP did not
vary based on the structure of the hidden layers, so one row represents all MLP
results. All algorithms show a very similar performance regarding the evaluated
metrics. The highest precision achieved is 0.81, reached by the MLP and linear SVM
classifiers. Together with Random Forest, these methods also reach the highest
f1-measure of 0.75. Accuracy of all classifiers is only marginally higher than what can be
achieved using random guessing, given the distribution of classes in the test set.


                          Table 2. Classification Results

                         f1-measure     accuracy     precision    recall
       LR                   0.74          0.67          0.80       0.69
       NB                   0.67          0.58          0.73       0.62
       KNN                  0.71          0.62          0.75       0.69
       CART                 0.73          0.65          0.80       0.68
       RandomForest         0.75          0.67          0.80       0.70
       SVM rbf              0.74          0.65          0.77       0.71
       SVM linear           0.75          0.67          0.81       0.70
       MLP                  0.75          0.68          0.81       0.70
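
    As a plausibility check on Table 2, the f1-measure can be recomputed as
the harmonic mean of the listed precision and recall values. The sketch below
does this and also computes the accuracy of always predicting the majority
class in the test set. Treating that majority-class accuracy as the guessing
baseline is our assumption. Small deviations from the reported f1 values are
expected, since the table entries are rounded to two decimals.

```python
# (precision, recall, reported f1-measure) as listed in Table 2
table2 = {
    "LR": (0.80, 0.69, 0.74),
    "NB": (0.73, 0.62, 0.67),
    "KNN": (0.75, 0.69, 0.71),
    "CART": (0.80, 0.68, 0.73),
    "RandomForest": (0.80, 0.70, 0.75),
    "SVM rbf": (0.77, 0.71, 0.74),
    "SVM linear": (0.81, 0.70, 0.75),
    "MLP": (0.81, 0.70, 0.75),
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


recomputed = {name: f1(p, r) for name, (p, r, _) in table2.items()}
max_gap = max(abs(recomputed[name] - table2[name][2]) for name in table2)

# Share of test-set changes that lead to an alignment adaption (Section 4.1).
adaption_share = 0.36
majority_baseline = max(adaption_share, 1 - adaption_share)
```

With these inputs, every recomputed f1 value lies within 0.01 of the reported
one, and the majority-class baseline comes out at 0.64.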


5     Discussion

The results presented in Section 4.3 give us some evidence on our first research
question: Using RDF embeddings to represent changes seems to be a promising
approach to the mapping adaption problem, as we can see a precision around
0.8. In general, several classification approaches show a similar performance. This
precision can be achieved although the approach uses no information regarding
the nature of changes, e.g., the algorithm cannot distinguish the correction of
typos from major structural changes.
    An important advantage of this approach is that no sophisticated change
model that is adapted to the domain is required. Approaches like [2] require a
rule-base that needs to be constructed from a detailed understanding of typical
changes in the domain the ontologies describe. Hence, the author of these rules
needs to be an expert in ontology engineering as well as the application domain.
Also, these rules need to be constantly adapted to evolving domains, whereas
an RDF-embedding based approach could learn new patterns autonomously.
However, to demonstrate these advantages, it is still required to show that this
approach is also applicable to other data sets and different application domains.

6    Conclusion and Outlook
In this work, we presented an approach to ontology alignment adaption based
on RDF embeddings and common classification techniques. An evaluation on a
dataset from the biomedical domain provided some evidence that the approach
is feasible. On this dataset, the best-performing classifiers had a precision of 0.8.
    As future work, several extensions are possible: Further evaluations could
be performed on different datasets. Also, a combination of this approach with
existing mapping adaption approaches could be examined. Change types could
be used as another input to the classification process to improve classification
accuracy. Another aspect for future work is to determine when embeddings need
to be updated, since embeddings will become outdated when ontologies change.


References
 1. Jérôme Euzenat and Pavel Shvaiko. Ontology matching. Springer, Heidelberg,
    2007.
 2. Anika Groß, Julio Cesar dos Reis, Michael Hartung, Cédric Pruski, and Erhard
    Rahm. Semi-automatic adaptation of mappings between life science ontologies.
    Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial
    Intelligence and Lecture Notes in Bioinformatics), 7970 LNBI:90–104, 2013.
 3. Ernesto Jiménez-Ruiz, Bernardo Cuenca Grau, Ian Horrocks, and Rafael Berlanga.
    Logic-based assessment of the compatibility of UMLS ontology sources. Journal
    of Biomedical Semantics, 2010.
 4. Armand Joulin, Edouard Grave, Piotr Bojanowski, Maximilian Nickel, and Tomas
    Mikolov. Fast Linear Model for Knowledge Graph Embeddings. 2017.
 5. Michel Klein and Heiner Stuckenschmidt. Evolution Management for Intercon-
    nected Ontologies. Workshop on Semantic Integration at ISWC 2003, 2003.
 6. Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation
    of word representations in vector space. CoRR, abs/1301.3781, 2013.
 7. Maximilian Nickel, Kevin Murphy, Volker Tresp, and Evgeniy Gabrilovich. A
    review of relational machine learning for knowledge graphs. Proceedings of the
    IEEE, 104:11–33, 2016.
 8. Petar Ristoski and Heiko Paulheim. RDF2Vec: RDF graph embeddings for data
    mining. In The Semantic Web - ISWC 2016, pages 498–514, 2016.
 9. Richard Socher, Danqi Chen, Christopher Manning, and Andrew Ng.
    Reasoning With Neural Tensor Networks for Knowledge Base Completion. Neural
    Information Processing Systems, pages 926–934, 2013.
10. Yannis Velegrakis, Renée J. Miller, and Lucian Popa. Mapping adaptation under
    evolving schemas. VLDB ’03 Proceedings of the 29th international conference on
    Very large data bases - Volume 29, pages 584–595, 2003.
11. Bishan Yang, Wen-tau Yih, Xiaodong He, Jianfeng Gao, and Li Deng. Embedding
    Entities and Relations for Learning and Inference in Knowledge Bases. 2014.
12. Cong Yu and Lucian Popa. Semantic Adaptation of Schema Mappings when
    Schemas Evolve. Very Large Data Bases, pages 1006 – 1017, 2005.