KGMatcher Results for OAEI 2021

            Omaima Fallatah1,2, Ziqi Zhang1, and Frank Hopfgartner1

        1 Information School, The University of Sheffield, Sheffield, UK
           {oafallatah1,ziqi.zhang,f.hopfgartner}@sheffield.ac.uk
        2 Department of Information Systems, Umm Al Qura University, Saudi Arabia
           oafallatah@uqu.edu.sa


        Abstract. KGMatcher is a scalable and domain-independent matching
        tool that matches the schema (classes) of large Knowledge Graphs by
        following a hybrid matching approach. KGMatcher is composed of an
        instance-based matcher, which uses only the annotated instances of
        knowledge graph classes to generate candidate class alignments, and a
        string-based matcher. This year is the first OAEI participation of
        KGMatcher, and it is the best performing system on the Common Knowledge
        Graphs track. Although KGMatcher's results are promising, further
        improvements to the matching techniques and the matcher combination can
        be introduced.


Keywords: Knowledge Graphs · Instance-based Ontology Matching · Machine
Learning · Schema Matching.


1     Presentation of the system

1.1    State, purpose, general statement

Combining different matching techniques is a common practice in ontology
matching tools. Matching techniques are divided into three main categories [10]:
(1) element-level techniques, which discover similar entities by processing the
textual annotations of ontology entities; (2) structural-level techniques, which
study the relations between ontology entities to generate candidate similar
pairs; and (3) extensional or instance-based techniques, which utilize populated
instances, i.e., ABox data, to generate alignments at the schema level (TBox).
    The matcher proposed here is a hybrid approach that combines an
element-level matcher with an instance-based one. The former uses entity labels
to generate candidate pairs, and the latter produces candidate class alignments
by using only annotated instance names. While conventional ontologies mainly
focus on modelling classes and properties, Knowledge Graphs (KGs), particularly
those available on the web, are large-scale and describe numerous instances [3].
Following a self-supervised, two-way classification approach, the presented
matcher trains a classifier for each KG using that KG's instances as training
data. Each trained KG classifier can then be used to classify any instance name
into one of the classes from that KG. This approach is domain independent and is
capable of coping with KGs with unbalanced populations (for details, see [4]).
This makes the matcher particularly useful for matching large KGs with numerous
populated and overlapping instances, such as DBpedia [1] and YAGO [12].

    Copyright © 2021 for this paper by its authors. Use permitted under Creative
    Commons License Attribution 4.0 International (CC BY 4.0).


1.2    Specific techniques used

Given two input knowledge graphs, $O$ and $O'$, $O$ contains a set of classes
$O = \{C_O^0, C_O^1, \ldots, C_O^i\}$, and each class contains a set of
instances $C_O^i = \{e_0^i, e_1^i, \ldots, e_n^i\}$. Similarly, $O'$ contains a
set of classes such that $O' = \{C_{O'}^0, C_{O'}^1, \ldots, C_{O'}^j\}$, where
$C_{O'}^j = \{e_0^j, e_1^j, \ldots, e_m^j\}$. KGMatcher has two main
components: an instance-based matcher and a name matcher. The workflow of
KGMatcher is illustrated in Figure 1.


                       Fig. 1: An overview of KGMatcher process


Preprocessing Given the two input KGs, the matcher starts by parsing and
indexing the lexical data of the two KGs separately. Following the standard free
text search/index approach, an index is created for each KG where each class is
treated as a document and the content of each 'document' is the concatenation
of the class's instance labels. In addition to the standard text cleaning
processes, a word segmentation method is applied in order to separate multi-word
entities, e.g., academicfield. Using a dictionary, the method infers the
boundaries between words and inserts a space at each boundary.
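The word segmentation step can be illustrated with a minimal sketch. It assumes a plain set of known words as the dictionary and prefers the segmentation with the fewest words; the function name `segment` and this tie-breaking rule are illustrative choices, not details taken from KGMatcher itself.

```python
def segment(text, vocabulary):
    """Split `text` into dictionary words, preferring the fewest words."""
    text = text.lower()
    n = len(text)
    # best[i] holds the shortest word list covering text[:i], or None.
    best = [None] * (n + 1)
    best[0] = []
    for end in range(1, n + 1):
        for start in range(end):
            piece = text[start:end]
            if best[start] is not None and piece in vocabulary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    # Fall back to the unsegmented token when no full cover exists.
    return best[n] if best[n] is not None else [text]


# Hypothetical toy dictionary; a real one would be far larger.
words = {"academic", "field", "a", "cad"}
print(" ".join(segment("academicfield", words)))  # academic field
```

Tokens that cannot be fully covered by dictionary words are returned unchanged, so unknown labels are never corrupted by a partial split.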

Instance-based Matcher The first component of KGMatcher belongs to the
extensional matcher category. It uses a self-supervised machine learning
approach to map KG classes based on their instance overlap. The matching
follows a two-way classification method in which a KG classifier is trained on
one KG's instance data. That classifier is then used to classify any instance
name into one of the classes from that KG.

    – Exact Name Filtering. The matcher starts by applying an exact name filter
      to exclude class labels that exist in both input KGs. Given the large number
      of classes in typical KGs, this step works as a blocking strategy which reduces
      the search space of the instance-based matcher.

    – Undersampling. As the instance distribution in KGs tends to be highly
      imbalanced, the goal of this step is to balance the number of populated
      instances in the input KGs to avoid biased classification results.
      KGMatcher follows a resampling strategy that undersamples majority
      classes, i.e., classes whose number of instances exceeds the average
      number of instances per class in that KG. The standard TF-IDF [11]
      method, which is often deployed to measure word relevance in a collection
      of documents, is used to undersample KG classes. Here, the TF-IDF of a
      word in each class represents the relevance of that word to a particular
      class in comparison to the other classes in the KG. For each majority
      class, the top-ranked words in terms of TF-IDF score are used to
      undersample its instance names: instance names that do not contain any
      of the words with high TF-IDF scores are discarded. As a result, a
      balanced and indicative set of KG instance names is obtained to be used
      as training data. For details about this step, see [4].

    – Training KG classifiers. Here, a KG classifier is trained separately for
      each input KG using the previously undersampled data. Pre-trained word
      embeddings are used as features to capture and represent the semantics of
      KG instance names. Compared to traditional feature representation
      methods, word embeddings and language models are recognized as effective
      ways to capture the semantic similarity of words. KGMatcher is able to
      train two types of classifiers: a Deep Neural Network (DNN) model (see
      footnotes 3 and 4), similar to those used in other successful NLP tasks
      such as [9] and [2], and a pre-trained BERT model [8]. KGMatcher
      automatically opts for the BERT model if a GPU is available at runtime.
      The output of this phase is two classifiers, $CLS_O$ and $CLS_{O'}$,
      trained using the instances from the two input KGs $O$ and $O'$,
      respectively.
3   The parameters selected for the DNN model: an input layer consisting of a
    pre-trained word embedding model, followed by four fully connected hidden
    layers with 128, 128, 64, and 32 rectified linear units. A dropout layer of
    0.2 is added between each pair of dense layers for regularization. Finally,
    a softmax layer is used for multi-class classification, with one output per
    class in the KG for which the classifier is trained.
4   The input layer is the Google News token-based model
    https://tfhub.dev/google/tf2-preview/nnlm-en-dim128/1

    – KG1 alignment elicitation is the process of generating candidate pairs
      using the classifier trained on the first input KG. Candidate pairs are
      generated by iteratively applying the classifier $CLS_O$ to instances in
      the other KG's classes. As a result, each instance name in $C_{O'}^j$ is
      classified into a class in $O$. The candidate pair $(C_O^i, C_{O'}^j)$ is
      added to the first candidate alignment set $A_{O \to O'}$ if the majority
      of the instances in $C_{O'}^j$ were classified as instances of $C_O^i$. A
      similarity score in [0,1] is obtained from the percentage of instances
      that voted for a particular class. For example, if 600 out of 1000
      instance names in $C_{O'}^j$ voted for $C_O^i$, the similarity score of
      that pair will be 0.6.


    – KG2 alignment elicitation is similar to the elicitation process
      described above. However, the roles of the two KGs are reversed:
      $CLS_{O'}$, i.e., the classifier trained on the second KG ($O'$), is
      applied to the instances of $O$ in order to obtain the second candidate
      alignment set $A_{O' \to O}$.

    – Similarity computing is where KGMatcher combines the two candidate
      alignment sets resulting from the two-way classification method. First,
      the matcher stores each directional alignment separately in an alignment
      matrix of dimension $|O| \times |O'|$. The two matrices are then
      aggregated into one matrix by taking the average similarity score of each
      pair. For example, if $(C_O^6, C_{O'}^3, 0.88)$ is in $A_{O \to O'}$ and
      $(C_{O'}^5, C_O^3, 0.64)$ is in $A_{O' \to O}$, their aggregated
      similarity value will be 0.76. The final alignments for this matcher are
      then chosen by following the state-of-the-art automatic final alignment
      selection approach introduced in [5]. Given an alignment matrix, this
      method iteratively selects the pair with the highest similarity score for
      each class in both KGs.
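The two-way elicitation and aggregation steps above can be sketched as follows. This is a simplified illustration under stated assumptions: `classify` stands in for a trained KG classifier, a pair missing from one direction is treated as having score 0 in that direction, and `select` is a plain greedy one-to-one selection rather than the exact procedure of [5]. All function names are hypothetical.

```python
from collections import Counter


def elicit(classify, target_classes):
    """Return {(predicted_class, target_class): vote share} for majority votes.

    `classify` maps an instance name to a class of the classifier's own KG;
    `target_classes` maps each class of the other KG to its instance names.
    """
    alignment = {}
    for target, names in target_classes.items():
        if not names:
            continue
        votes = Counter(classify(name) for name in names)
        winner, count = votes.most_common(1)[0]
        share = count / len(names)
        if share > 0.5:  # strict majority of the class's instances
            alignment[(winner, target)] = share
    return alignment


def aggregate(forward, backward):
    """Average both directional scores; a missing direction counts as 0.

    `forward` is keyed by (class in O, class in O'), `backward` by
    (class in O', class in O), mirroring the two elicitation runs.
    """
    flipped = {(c, c_prime): s for (c_prime, c), s in backward.items()}
    pairs = set(forward) | set(flipped)
    return {p: (forward.get(p, 0.0) + flipped.get(p, 0.0)) / 2 for p in pairs}


def select(scores):
    """Greedy one-to-one selection, highest aggregated score first."""
    chosen, used_left, used_right = [], set(), set()
    for (left, right), score in sorted(scores.items(), key=lambda kv: -kv[1]):
        if left not in used_left and right not in used_right:
            chosen.append((left, right, score))
            used_left.add(left)
            used_right.add(right)
    return chosen
```

For instance, `aggregate({("Film", "Movie"): 0.88}, {("Movie", "Film"): 0.64})` averages the two directional scores of the same class pair to approximately 0.76.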


Name Matcher The second component of KGMatcher is an element-level
matcher, which measures the similarity of KG class labels. First, the edit
distance of each class pair is measured, and then their semantic similarity is
measured by a word embedding method. For the edit distance, KGMatcher
calculates the Levenshtein distance for each class pair. For the word embedding
similarity, a pre-trained word2vec model is used to represent class labels
before measuring their cosine similarity. The semantic similarity is measured
in a Vector Space Model, where words with strong semantic relations are often
represented closer to each other. In the case of multi-word labels, the label
is represented by the element-wise average of the vectors of its composing
words. Finally, the maximum of the two similarity measures is chosen as the
name similarity of that pair. The threshold value of the name matcher is set to
0.8. To illustrate, if the word embedding similarity of
(RailwayStation, TrainStation) is 0.83 while their Levenshtein similarity is
0.56, the maximum similarity value, i.e., the word embedding similarity, is
chosen; since it is also higher than 0.8, the pair is kept. If, however, both
similarity scores are lower than the threshold, the pair is excluded from the
candidate alignment.
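A minimal sketch of this name matcher is given below. It makes several assumptions that are not stated in the text: the edit distance is normalised by the longer label's length to obtain a similarity in [0, 1], labels are split on CamelCase boundaries, a score equal to the threshold is accepted, and `vectors` is a plain dictionary standing in for a pre-trained word2vec model.

```python
import math
import re


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def edit_similarity(a, b):
    """Edit distance normalised into [0, 1] (assumed normalisation)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def label_vector(label, vectors):
    """Element-wise average of the vectors of the words in a label."""
    words = [w.lower() for w in re.findall(r"[A-Z]?[a-z]+", label)]
    known = [vectors[w] for w in words if w in vectors]
    if not known:
        return None
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def name_similarity(a, b, vectors, threshold=0.8):
    """Maximum of both measures, or None when it is below the threshold."""
    scores = [edit_similarity(a, b)]
    vec_a, vec_b = label_vector(a, vectors), label_vector(b, vectors)
    if vec_a is not None and vec_b is not None:
        scores.append(cosine(vec_a, vec_b))
    best = max(scores)
    return best if best >= threshold else None


# Toy two-dimensional vectors, purely illustrative.
toy = {"railway": [1.0, 0.0], "train": [0.9, 0.1], "station": [0.0, 1.0]}
print(name_similarity("RailwayStation", "TrainStation", toy))
```

Taking the maximum means a pair is kept as soon as either measure is confident, which favours recall for labels that are lexically different but semantically close.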

Post Processing KGMatcher combines the results generated by the two
component matchers by following the same method described earlier for
combining the two instance classification alignments.

Instance Matching For the OAEI participation, we have adapted KGMatcher
to also match the instances of KGs. The instance matching component is very
simple. First, standard text preprocessing techniques, such as lowercasing and
removing stopwords and non-alphanumeric characters, are applied. Then,
KGMatcher generates candidate instance pairs based on the existence of the
label in the opposite knowledge graph.
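As an illustration of this label-based instance matching, the following sketch normalises labels and pairs instances whose normalised labels are identical. The stopword list is a tiny hypothetical subset, the regular expression only keeps ASCII letters and digits, and the function names are illustrative rather than taken from KGMatcher.

```python
import re

STOPWORDS = {"a", "an", "and", "of", "the"}  # illustrative subset only


def normalise(label):
    """Lowercase, drop non-alphanumeric characters and stopwords."""
    cleaned = re.sub(r"[^a-z0-9\s]", " ", label.lower())
    return " ".join(t for t in cleaned.split() if t not in STOPWORDS)


def match_instances(labels_a, labels_b):
    """Pair instances whose normalised labels are identical.

    Both arguments map an instance identifier to its label.
    """
    index = {}
    for instance_id, label in labels_b.items():
        key = normalise(label)
        if key:  # ignore labels that normalise to nothing
            index.setdefault(key, []).append(instance_id)
    pairs = []
    for instance_id, label in labels_a.items():
        for other_id in index.get(normalise(label), []):
            pairs.append((instance_id, other_id))
    return pairs
```

Indexing one KG's labels first keeps the lookup linear in the number of instances instead of comparing every pair of labels.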

1.3   Adaptations made for the evaluation
KGMatcher is mainly developed in Python. To facilitate reusing and evaluating
KGMatcher, and for the OAEI submission, KGMatcher was packaged using a SEALS
client. The wrapping services from the Matching EvaLuation Toolkit (MELT) [7]
were used to wrap KGMatcher's Python process and to generate the SEALS
package.

2     Results
In this section, we present and discuss the results for each of the OAEI tracks
where KGMatcher was able to produce a non-empty alignment file. The results
include the following OAEI tracks: Conference, Knowledge Graph, and Common
Knowledge Graphs track.

2.1   Conference
In the Conference track, when following the rar2-M3 evaluation, KGMatcher's F1
score (0.52) is slightly lower than both baselines, i.e., StringEquiv (0.53)
and edna (0.56). This particular evaluation, i.e., M3, takes into consideration
both class and property matches. KGMatcher does not match properties, so the
undiscovered property alignments have a negative impact on the matcher's
performance on this task. Further, given that the Conference track datasets do
not include enough instances to apply the instance-based matcher, the name
matcher is the only matcher applied to map classes. On the new experimental
cross-domain test case of mapping DBpedia and OntoFarm, KGMatcher performs
better (0.55) than both baselines, StringEquiv and edna, which score 0.42 and
0.45, respectively.

2.2   Knowledge Graph
In the Knowledge Graph track, KGMatcher was able to generate results for all
five test cases, at the class and instance levels only. In terms of class
matching, the matcher yields satisfactory results, with an F1 score of 0.79.
The added instance matcher has positively impacted the overall matcher result
on this task, with a precision of 0.94, a recall of 0.66, and an F1 of 0.82.

2.3   Common Knowledge Graphs
Along with six other matchers, KGMatcher was able to complete the task of
matching the classes from two cross-domain and common KGs. KGMatcher obtained
a precision of 0.97 and a recall of 0.91. With an F1 score of 0.94, KGMatcher
is the best performing matcher on this track.
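As a quick consistency check, and assuming the F1 score is the usual harmonic mean of the reported precision $P$ and recall $R$ (the track's exact averaging procedure is not restated here), the reported values agree:

```latex
F_1 = \frac{2 \cdot P \cdot R}{P + R}
    = \frac{2 \cdot 0.97 \cdot 0.91}{0.97 + 0.91}
    = \frac{1.7654}{1.88}
    \approx 0.94
```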


3     General comments
The results of KGMatcher have been very encouraging. In the Common Knowledge
Graphs track, it achieves outstanding results. This indicates that our hybrid
approach, which utilizes instance data to map KG classes, is able to outperform
systems that use other matcher combinations. It is important to note that the
performance of KGMatcher's instance-based component depends on the nature of
the dataset. Since KGMatcher learns KG classifiers by using general pre-trained
word embedding models, the more representative the KG instances are of
real-world entities, the better the instance classification results. Figure 2
shows the difference in performance when classifying instances from common KGs,
e.g., NELL, compared to a single-domain KG from the Knowledge Graph track.
Note that the latter mainly annotates classes in the entertainment domain [6].


Fig. 2: The instance classification report of 20 randomly sampled classes from
the OAEI KG MemoryAlpha in (a) and NELL in (b). Note that the y-axis numbers
indicate class IDs.


    In future versions of KGMatcher, we would like to improve the combination
of techniques used within the name matcher. Currently, this component is rather
simple and unable to discover matching pairs with high lexical complexity. This
has also affected the matcher's performance on datasets where instance data do
not exist or are difficult to classify. Additionally, improving the
instance-based matcher by further studying other sampling approaches and
experimenting with other machine learning methods would likely improve the
overall performance of the matcher.

4    Conclusion
As part of OAEI 2021, this paper presents KGMatcher, a matching tool that
utilizes instance data annotated within large KGs to map their classes. The
process is done by learning KG classifiers, which are able to classify
instances into a particular KG class. The results suggest that a hybrid
approach that incorporates an instance-based technique can be highly effective
for matching large cross-domain KGs.

References
 1. Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes,
    P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., Bizer, C.: DBpedia – A
    Large-scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic
    Web pp. 1–5 (2012)
 2. Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., Kuksa, P.:
    Natural language processing (almost) from scratch. Journal of Machine Learning
    Research 12, 2493–2537 (2011)
 3. Fallatah, O., Zhang, Z., Hopfgartner, F.: A gold standard dataset for large
    knowledge graphs matching. In: OM@ISWC (2020)
 4. Fallatah, O., Zhang, Z., Hopfgartner, F.: A hybrid approach for large knowledge
    graphs matching. In: OM@ISWC (2021)
 5. Gulić, M., Vrdoljak, B., Banek, M.: CroMatcher: An ontology matching system
    based on automated weighted aggregation and iterative final alignment. Journal of
    Web Semantics 41, 50–71 (2016)
 6. Hertling, S., Paulheim, H.: The knowledge graph track at OAEI. In: European
    Semantic Web Conference. pp. 343–359. Springer (2020)
 7. Hertling, S., Portisch, J., Paulheim, H.: MELT - matching evaluation toolkit. In:
    Semantic Systems. The Power of AI and Knowledge Graphs - 15th International
    Conference. pp. 231–245 (2019)
 8. Maiya, A.S.: ktrain: A low-code library for augmented machine learning. arXiv
    preprint arXiv:2004.10703 (2020)
 9. Minaee, S., Kalchbrenner, N., Cambria, E., Nikzad, N., Chenaghlu, M., Gao, J.:
    Deep learning based text classification: A comprehensive review. arXiv preprint
    arXiv:2004.03705 (2020)
10. Otero-Cerdeira, L., Rodríguez-Martínez, F.J., Gómez-Rodríguez, A.: Ontology
    matching: A literature review. Expert Systems with Applications pp. 949–971
    (2015)
11. Schütze, H., Manning, C.D., Raghavan, P.: Introduction to information retrieval,
    vol. 39. Cambridge University Press Cambridge (2008)
12. Tanon, T.P., Weikum, G., Suchanek, F.: Yago 4: A reason-able knowledge base.
    In: European Semantic Web Conference. pp. 583–596. Springer (2020)