<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Hybrid Approach for Large Knowledge Graphs Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omaima Fallatah</string-name>
          <email>oafallatah@uqu.edu.sa</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziqi Zhang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hopfgartner</string-name>
          <email>f.hopfgartner@sheffield.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Information Systems, Umm Al Qura University</institution>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information School, The University of Sheffield</institution>
          ,
          <addr-line>Sheffield</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Matching large and heterogeneous Knowledge Graphs (KGs) has been a challenge in the Semantic Web research community. This work highlights a number of limitations of current matching methods: (1) they are highly dependent on string-based similarity measures, and (2) they are primarily built to handle well-formed ontologies. These features make them unsuitable for large, (semi-)automatically constructed KGs with hundreds of classes and millions of instances. Such KGs share a remarkable number of complementary facts, often described using different vocabulary. Inspired by the role of instances in large-scale KGs, we propose a hybrid matching approach. Our method combines a string-based matcher with an instance-based matcher that casts the schema matching process as a two-way text classification task by exploiting instances of KG classes. Our method is domain-independent and is able to handle KG classes with unbalanced populations. Our evaluation on a real-world KG dataset shows that our method obtains the highest recall and F1 among all OAEI 2020 participants.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Machine Learning</kwd>
        <kwd>Schema Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        In recent years, many public Knowledge Graphs (KGs) have been developed
and shared, e.g., DBpedia [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and NELL [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Common KGs are often domain-independent and semi-automatically constructed. They are also highly complementary; therefore, they are often integrated in several web applications such as reasoning and query answering.
      </p>
      <p>KGs have gained more attention in the Semantic Web, which facilitates sharing and reusing knowledge such as that annotated in ontologies. Similar to ontologies, KG entities are highly heterogeneous, since many real-world entities can be described using different vocabulary. Nevertheless, while ontologies primarily focus on modelling the schema of a specific domain, cross-domain KGs are known for describing numerous instances. Because they are largely generated in a semi-automated manner, KGs are less well-formed than manually created and well-designed ontologies.</p>
      <p>Copyright © 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>
        The problem of ontology matching has been well studied, and matching systems are evaluated annually through the Ontology Alignment Evaluation Initiative (OAEI<sup>3</sup>). A new track for matching KGs was introduced to OAEI in 2018, in which ontology matchers are evaluated on the tasks of matching classes, properties and instances. By design, KGs are known for their large number of instances (ABox). Therefore, the majority of current matchers focus on matching their instances. However, recent studies have shown that matching KG schemas (TBox) remains a challenging task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Moreover, many KG matchers exploit class matches to generate and refine instance matches [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
      </p>
      <p>
        With current matching solutions mainly focusing on well-formed ontologies, the problem of matching automatically curated and large KGs remains significant. While the majority of state-of-the-art methods are highly dependent on string/language-based and structure-based techniques [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ], KGs often lack some of the textual descriptions, e.g., comments, required by these methods. As for structure-based similarity measures, some KGs lack the schematic information that such measures require, and the measures themselves can be error-prone [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]. This explains why current matchers depend so heavily on the results of string-based matching.
      </p>
      <p>
        This work proposes a novel method for mapping classes in large KGs by combining string-based measures with an instance-based method. The latter uses only annotated instance names to generate similar class pairs. Our domain-independent method utilizes the large number of instances in KGs and is able to cope with the unbalanced population of KG classes. This is particularly useful for large KGs with richly populated instances, such as DBpedia, which is the central linking dataset in the current linked data cloud, and NELL, which builds a large-scale KG in a never-ending machine reading fashion. In addition to the OAEI KG benchmark, we conduct an experiment to evaluate the performance of our method on a real-world KG benchmark [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. We compare the results of our proposed approach against the systems that participated in the KG track of OAEI 2020 and show that our method obtains the highest recall and F1 measure in the task of matching common KG classes.
      </p>
      <p>The remainder of this paper is structured as follows. Section 2 provides an overview of related work; Section 3 describes the details of the proposed matcher; Section 4 describes our experiments; Section 5 discusses the results; and Section 6 concludes and outlines future work.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Ontology matching systems often combine different matching techniques [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]. Element-level matchers discover similar entities by utilizing the textual annotations defined in the ontology's entities, e.g., URIs, labels, and comments. Other methods leverage lexical databases, such as WordNet<fn><label>4</label><p>https://wordnet.princeton.edu/</p></fn>, as background knowledge to discover semantic similarity. However, recent studies, such as [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ], highlight that WordNet lacks sufficient coverage in comparison to word-embedding-based similarity measures. It is difficult for semantic-based techniques to outperform string-based ones; therefore, combining both measures is a common strategy [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ].
        <fn><label>3</label><p>http://oaei.ontologymatching.org/</p></fn>
      </p>
      <p>
        Matchers such as the well-known AML [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and LogMap [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] employ element-level techniques. Both matchers make use of background knowledge bases in order to match biomedical ontologies. Some recent OAEI KG track participants have been utilizing other resources. For example, the Wiktionary matcher [
        <xref ref-type="bibr" rid="ref24">24</xref>
        ] is an element-level matcher that uses the online lexical resource Wiktionary. Similarly, ALOD2Vec [
        <xref ref-type="bibr" rid="ref23">23</xref>
        ] utilizes WebIsALOD<sup>5</sup>, an automatically generated RDF dataset of hypernym relations, as background knowledge.
      </p>
      <p>
        Structural-level matchers exploit structural information available in well-formed ontologies, such as disjointness axioms, to refine element-level alignments, as in the AML and LogMap matcher families. ATBox [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] is an OAEI 2020 participant that uses similar techniques to filter mappings initially discovered by a string-based matcher. Such an approach requires a well-formed ontology, which is not the case for common KGs, as they lack schematic richness due to their automatically generated nature [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        The final matcher category comprises extensional, or instance-based, matchers, which use instance data to generate schema-level alignments. The intuition behind such a method is that similar classes or properties share a substantial overlap of their instances. However, according to [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], it is significantly challenging to measure the extent of such an overlap. Previous works that incorporate this method are predominantly domain-dependent [
        <xref ref-type="bibr" rid="ref26">26</xref>
        ]. Zhang et al. [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] introduced an instance-based approach that only matches the properties within single LOD datasets, which include some KGs such as DBpedia. Their results encourage applying instance-based methods in cross-dataset settings, particularly as LOD datasets and KGs share many similar characteristics.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Approach</title>
      <sec id="sec-3-1">
        <title>Overview</title>
        <p>Our matching approach can be formalized as follows. The input consists of two KGs, O and O′, where O contains a set of classes <inline-formula><tex-math>O = \{C_{O}^{0}, C_{O}^{1}, \ldots, C_{O}^{i}\}</tex-math></inline-formula>, and each class contains a set of instances such that <inline-formula><tex-math>C_{O}^{i} = \{e_{i}^{0}, e_{i}^{1}, \ldots, e_{i}^{n}\}</tex-math></inline-formula>. Similarly, O′ contains a set of classes such that <inline-formula><tex-math>O' = \{C_{O'}^{0}, C_{O'}^{1}, \ldots, C_{O'}^{j}\}</tex-math></inline-formula>, where <inline-formula><tex-math>C_{O'}^{j} = \{e_{j}^{0}, e_{j}^{1}, \ldots, e_{j}^{m}\}</tex-math></inline-formula>. Our method is composed of an instance-based matcher and a name matcher. The architecture of the proposed method is illustrated in Figure 1.</p>
        <p>The workflow starts with parsing the two input KGs and applying general text preprocessing (Section 3.2). The second component of the method is the matching process, which starts with an instance-based matcher (Section 3.3). This matcher is performed in two stages and results in two directional alignment sets: <inline-formula><tex-math>A_{O \rightarrow O'}</tex-math></inline-formula>, which is a set of correspondences between classes from O and O′ respectively, and <inline-formula><tex-math>A_{O' \rightarrow O}</tex-math></inline-formula>, which is a set of correspondences in the opposite direction. Section 3.3 explains the process of aggregating the two directional alignments in order to obtain one alignment set for this matcher, <inline-formula><tex-math>A_{instance}</tex-math></inline-formula>. The second matcher is based on string/semantic similarity and belongs to the element-level matcher category. This matcher uses class labels to generate equivalent class pairs, denoted as <inline-formula><tex-math>A_{name}</tex-math></inline-formula> (Section 3.3). The process of selecting the final alignments <inline-formula><tex-math>A</tex-math></inline-formula> is described in Section 3.4 (Post Processing in Fig. 1).
          <fn><label>5</label><p>http://webisa.webdatacommons.org/</p></fn>
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>Pre-processing</title>
        <p>The matcher starts by parsing the two input KGs in order to index their lexical data separately. Given a KG, we create an index of its classes by following the standard free-text indexing approach used by search engines. Here, each class is treated as a document, and the text content of that document is the concatenation of the labels of all the class's instances. In order to obtain a cleaner version of the datasets, standard text preprocessing techniques are applied. All entity labels are converted to lowercase, and all stopwords and non-alphanumeric characters are removed. We also replace all underscore characters, which are often used to separate multi-word entity labels, with a space character. KG class names can also consist of multiple words, either concatenated, such as placeofworship, or written in camel case (e.g., ReligiousBuilding). Therefore, a dictionary-based word segmentation process is applied to infer the spaces between words, and camel-case boundaries are likewise replaced with a space.</p>
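        <p>The normalization steps described above can be sketched in Python as follows. This is a minimal sketch rather than the authors' implementation: the stopword list, the segmentation dictionary, and all function names are hypothetical, and underscores are replaced before other non-alphanumeric characters are stripped so that word boundaries survive.</p>
        <preformat><![CDATA[
```python
import re

# Hypothetical resources; the paper does not name its stopword list or dictionary.
STOPWORDS = {"the", "of", "a", "an", "and", "in"}
DICTIONARY = {"place", "of", "worship", "religious", "building"}


def split_camel_case(name):
    """Insert a space at each lower-to-upper case boundary."""
    return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)


def segment(word, dictionary):
    """Split a concatenated word into dictionary words, or return None."""
    if not word:
        return []
    for end in range(1, len(word) + 1):
        head = word[:end]
        if head in dictionary:
            rest = segment(word[end:], dictionary)
            if rest is not None:
                return [head] + rest
    return None


def normalize_label(label, dictionary=DICTIONARY, stopwords=STOPWORDS):
    """Lowercase, split camel case and underscores, segment, drop stopwords."""
    text = split_camel_case(label).replace("_", " ").lower()
    text = re.sub(r"[^a-z0-9 ]", " ", text)
    tokens = []
    for token in text.split():
        parts = segment(token, dictionary)
        # Keep the token unchanged when the dictionary cannot segment it.
        tokens.extend(parts if parts else [token])
    return " ".join(t for t in tokens if t not in stopwords)
```
]]></preformat>
        <p>With these toy resources, ReligiousBuilding becomes "religious building" and placeofworship becomes "place worship", since the stopword "of" is dropped.</p>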
      </sec>
      <sec id="sec-3-3">
        <title>The matching process</title>
        <p><bold>Instance-based matcher.</bold> This matcher uses a self-supervised approach to map KG classes based on their shared instances. The matching process is divided into a two-way classification process in which a KG classifier is trained with one KG's instances and then used to classify a given instance into one of the classes from that KG. As illustrated in Figure 1, this matcher starts by applying an exact name filter and then undersamples the datasets from the two KGs in preparation for the training phase. After training, the classifier trained on O (i.e., <inline-formula><tex-math>CLS_{O}</tex-math></inline-formula>) is used to classify instances from O′. The classification results are then used to elicit the directional alignments <inline-formula><tex-math>A_{O \rightarrow O'}</tex-math></inline-formula>. The second alignment elicitation process is similar to the first one, except that the roles of the two KGs are reversed to generate <inline-formula><tex-math>A_{O' \rightarrow O}</tex-math></inline-formula>. Finally, the candidate class pairs for this matcher are generated based on the two directional alignments.</p>
        <p><bold>Exact name filtering.</bold> We start by filtering out classes whose names match exactly across the two KGs. That is, if a class label exists in both input KGs, both classes are excluded from the instance-based matching process. Our goal here is to use the instance-based matcher to enrich the final alignments with class pairs that are unlikely to be discovered by simple string matchers. This also serves as a blocking step that reduces the search space for the matcher, as large KGs can have hundreds of classes to be matched.</p>
        <p>
          <bold>Undersampling.</bold> Large KGs are typically very imbalanced. For example, Figure 2 shows the distribution of classes in two common KGs that we discuss later in Section 4, i.e., DBpedia<sup>6</sup> and NELL<sup>7</sup>. While some classes in NELL have over 20,000 instances, other classes have fewer than 10 instances. This imbalance can be detrimental to the learning process, and therefore a sampling process can be useful. Learning from imbalanced datasets is a thoroughly studied field in which different solutions have been developed and analyzed [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]. Solutions often target the majority classes, i.e., classes with a large number of data points, and the minority classes, i.e., classes with few data points [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>
          While random undersampling and oversampling [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] are common practices in machine learning applications, they carry some major limitations. Randomly undersampling the majority classes can lose relevant information contained in the eliminated samples. In contrast, random oversampling, which generates duplicate data points in minority classes, can ultimately result in model overfitting because the same data points appear multiple times. According to [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], datasets with severe class imbalance make it very challenging to train machine learning models and often require specialized resampling techniques as opposed to generalized solutions.
          <fn><label>6</label><p>https://wiki.dbpedia.org/develop/datasets/dbpedia-version-2016-10, visited on 14-2-2020</p></fn>
          <fn><label>7</label><p>http://rtw.ml.cmu.edu/rtw/resources, iteration number 1115, visited on 22-2-2020</p></fn>
        </p>
        <p>
          Our sampling strategy aims to balance the instance population of a KG by undersampling the majority classes. Here, we define a majority class as a class whose number of instances exceeds the average number of instances per class in that particular KG. Due to their automated generation process, large KGs often suffer from redundant information in their instances [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. Our goal is to obtain a smaller yet indicative sample of instances in order to limit the effect of the sparsity problem on the training process.
        </p>
        <p>
          To help identify a set of indicative instance names, we deploy a TF-IDF [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] based method to resample KG instances. TF-IDF is widely used to evaluate the relevance of words to a collection of documents by weighting their occurrences. In this task, we calculate the TF-IDF of tokens for each class. Hence, the weight of a token here represents how relevant the token is to a particular class in comparison to the other classes in the KG. Consequently, for each majority class, we use the top k words in terms of TF-IDF score to undersample its instance names. To illustrate, assume that <inline-formula><tex-math>C_{O}^{i}</tex-math></inline-formula> is a majority class in O, and <inline-formula><tex-math>W = \{w_{0}, w_{1}, w_{2}, \ldots, w_{k-1}\}</tex-math></inline-formula> is the list of tokens with the top k TF-IDF scores in <inline-formula><tex-math>C_{O}^{i}</tex-math></inline-formula>. Then, we discard instance names that do not contain at least one of the words in W. As a result, we have a set of indicative instance names that will be used to train <inline-formula><tex-math>CLS_{O}</tex-math></inline-formula>. The same process is applied to the classes in the other input KG, i.e., O′, to train <inline-formula><tex-math>CLS_{O'}</tex-math></inline-formula>.
        </p>
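        <p>The undersampling step above can be illustrated with a short Python sketch. It assumes that instance names are already preprocessed and whitespace-tokenized, and it uses one common TF-IDF variant (relative term frequency times the logarithm of inverse document frequency); the paper does not state which variant was used, and the function name and toy data are hypothetical.</p>
        <preformat><![CDATA[
```python
import math
from collections import Counter


def undersample(classes, k):
    """classes maps a class name to its list of preprocessed instance names."""
    # One "document" per class: the tokens of all its instance names.
    docs = {c: Counter(t for name in names for t in name.split())
            for c, names in classes.items()}
    n_docs = len(docs)
    # Document frequency: in how many classes each token occurs.
    df = Counter(t for counts in docs.values() for t in counts)
    mean_size = sum(len(names) for names in classes.values()) / n_docs

    sampled = {}
    for c, names in classes.items():
        if len(names) <= mean_size:
            # Not a majority class: keep every instance.
            sampled[c] = list(names)
            continue
        total = sum(docs[c].values())
        # Assumed TF-IDF variant: (count / total) * log(N / df).
        tfidf = {t: (n / total) * math.log(n_docs / df[t])
                 for t, n in docs[c].items()}
        top = set(sorted(tfidf, key=tfidf.get, reverse=True)[:k])
        # Keep only names containing at least one of the top-k tokens.
        sampled[c] = [name for name in names if top & set(name.split())]
    return sampled


# Toy data (hypothetical): "river" is the only class above the mean size.
classes = {
    "river": ["nile river", "amazon river", "river thames", "danube",
              "mississippi river", "yangtze river"],
    "city": ["new york city", "london"],
    "lake": ["lake geneva"],
}
sampled = undersample(classes, k=1)
```
]]></preformat>
        <p>In this toy example the token "river" has the highest weight in the majority class, so "danube" is discarded while the two smaller classes are left untouched.</p>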
        <p>
          <bold>Training KG classifiers.</bold> A KG classifier <inline-formula><tex-math>CLS_{O}</tex-math></inline-formula> is trained using the previously undersampled data. We utilize pre-trained word embeddings as features. Unlike traditional feature representation methods, word embeddings (WE) are considered one of the most effective approaches to capturing the semantic similarity of words. As the classification method, we experiment with two state-of-the-art methods: (1) a model that uses a pre-trained BERT model [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and (2) a Deep Neural Network (DNN). The architecture of this model is inspired by previous high-performing models in different NLP tasks, such as those in [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] and [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
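          <bold>KG1 alignment elicitation</bold> is the process of eliciting candidate directional alignments between a class in one KG (i.e., O) and classes in another KG (e.g., O′), based on the output of <inline-formula><tex-math>CLS_{O}</tex-math></inline-formula>. To perform the elicitation, given <inline-formula><tex-math>C_{O'}^{j} = \{e_{j}^{0}, e_{j}^{1}, \ldots, e_{j}^{m}\}</tex-math></inline-formula>, we apply <inline-formula><tex-math>CLS_{O}</tex-math></inline-formula> to classify each instance of <inline-formula><tex-math>C_{O'}^{j}</tex-math></inline-formula>. As a result, each instance name is classified into a class in O. A correspondence between <inline-formula><tex-math>C_{O'}^{j}</tex-math></inline-formula> and <inline-formula><tex-math>C_{O}^{i}</tex-math></inline-formula> is added to the candidate alignment set <inline-formula><tex-math>A_{O \rightarrow O'}</tex-math></inline-formula> if the majority of the instances of <inline-formula><tex-math>C_{O'}^{j}</tex-math></inline-formula> were classified as instances of <inline-formula><tex-math>C_{O}^{i}</tex-math></inline-formula>. To generate a similarity value in [0, 1], we use the percentage of instances that voted for the majority class. For example, if 700 out of 1000 instances in <inline-formula><tex-math>C_{O'}^{j}</tex-math></inline-formula> were classified as <inline-formula><tex-math>C_{O}^{i}</tex-math></inline-formula>, the similarity measure of that pair will be 0.7.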
        </p>
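        <p>The voting scheme described above can be sketched as follows. The sketch assumes that "majority" means the most frequently predicted class; the function name, the class names, and the toy classifier standing in for the trained KG classifier are all hypothetical.</p>
        <preformat><![CDATA[
```python
from collections import Counter


def elicit_alignment(class_name, instances, classify):
    """Return (source class, most voted target class, similarity).

    `classify` stands in for the trained KG classifier: any callable that
    maps an instance name to a class of the other KG.
    """
    votes = Counter(classify(name) for name in instances)
    winner, count = votes.most_common(1)[0]
    # Similarity is the share of instances that voted for the winner.
    return class_name, winner, count / len(instances)


def toy_classifier(name):
    # Hypothetical stand-in for a trained model.
    return "Athlete" if name.startswith("a") else "Person"


instances = ["a%d" % i for i in range(700)] + ["b%d" % i for i in range(300)]
result = elicit_alignment("SportsPerson", instances, toy_classifier)
```
]]></preformat>
        <p>Here 700 of the 1000 toy instances vote for the same class, which reproduces the similarity of 0.7 from the example in the text.</p>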
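        <p><bold>KG2 alignment elicitation.</bold> Similar to KG1 alignment elicitation, we reverse the roles of the two KGs to generate <inline-formula><tex-math>A_{O' \rightarrow O}</tex-math></inline-formula>. Candidate alignments are generated based on the output of the classifier trained on O′, i.e., <inline-formula><tex-math>CLS_{O'}</tex-math></inline-formula>.</p>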
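        <p><bold>Similarity computing.</bold> This phase of the matcher aims to (1) combine the two directional alignments resulting from the two alignment elicitations (i.e., <inline-formula><tex-math>A_{O \rightarrow O'}</tex-math></inline-formula> and <inline-formula><tex-math>A_{O' \rightarrow O}</tex-math></inline-formula>), and (2) select the final alignments for the instance-based matcher. To combine the two alignment sets, each directional alignment is first stored in an alignment matrix of dimension <inline-formula><tex-math>|O| \times |O'|</tex-math></inline-formula>. An alignment matrix contains the correspondences for all class pairs from the two input KGs. To aggregate the two matrices, we take the average of the two similarity values of each pair. For example, if <inline-formula><tex-math>(C_{O}^{4}, C_{O'}^{5}, 0.69)</tex-math></inline-formula> is in <inline-formula><tex-math>A_{O \rightarrow O'}</tex-math></inline-formula> and <inline-formula><tex-math>(C_{O'}^{5}, C_{O}^{4}, 0.73)</tex-math></inline-formula> is in <inline-formula><tex-math>A_{O' \rightarrow O}</tex-math></inline-formula>, their aggregated similarity value will be 0.71.</p>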
        <p>
          Consequently, the final alignment <inline-formula><tex-math>A_{instance}</tex-math></inline-formula> is generated by following the automated alignment selection approach introduced in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Given an alignment matrix and a threshold t, this method goes through one row at a time and selects the maximum correspondence in each row, provided that the similarity value of that correspondence exceeds t (e.g., the bold text in Fig. 3). When a class is involved in two correspondences (e.g., <inline-formula><tex-math>C_{O'}^{3}</tex-math></inline-formula> in rows 1 and 7), only the one with the higher similarity is retained (e.g., <inline-formula><tex-math>(C_{O}^{6}, C_{O'}^{3}, 0.92)</tex-math></inline-formula>), and the previously selected correspondence is deleted. This process is repeated iteratively until all classes have been assigned a correspondence and no further changes are to be made.
        </p>
        <p><bold>Name matcher.</bold> This matcher calculates the similarity of KG classes based on the string similarity and the semantic similarity of their names. Given the set of all possible correspondences between classes in O and O′, generated with an exclusive pairwise comparison, we measure the word embedding similarity and the edit-distance similarity of the two class names. We apply the two similarity measures only to KG class names, as not all KGs provide other, longer descriptions such as comments. For the edit-distance similarity, we calculate the normalized Levenshtein distance for each class pair. This method normalizes the edit distance value by the length of the longer string to obtain a value in [0.0, 1.0].</p>
        <p>For the word embedding similarity, a Google pre-trained word2vec model is used to represent class names and measure their cosine similarities in the vector space model, in which semantically similar words are represented closer to each other. Following the same approach as in Section 3.2, concatenated strings such as awardtrophytournament are segmented into multiple words. Thus, in the case of a multi-word class name, the matcher aggregates the vector representations of the words composing the class name by taking the element-wise average of their vectors.</p>
        <p>We then choose the maximum of the two similarity measures, provided that it is higher than a threshold <inline-formula><tex-math>t_{n}</tex-math></inline-formula>. To illustrate, consider the pair of classes RailwayStation and TrainStation, whose word embedding similarity is 0.83 and whose edit-distance similarity is 0.56. We select the maximum similarity value, i.e., the word embedding similarity, which is also higher than <inline-formula><tex-math>t_{n}</tex-math></inline-formula>. However, if both similarity scores of a pair are lower than <inline-formula><tex-math>t_{n}</tex-math></inline-formula>, that pair is not added to the candidate alignment set. The output of this matcher is a candidate alignment set <inline-formula><tex-math>A_{name}</tex-math></inline-formula>, which is combined with the instance matcher alignments (i.e., <inline-formula><tex-math>A_{instance}</tex-math></inline-formula>) as detailed in the next section.</p>
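        <p>The name matcher described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the edit-distance similarity is taken to be one minus the normalized Levenshtein distance, names are compared as space-joined word lists, and the two-dimensional toy vectors are hypothetical placeholders for real word2vec embeddings.</p>
        <preformat><![CDATA[
```python
import math


def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_similarity(a, b):
    """Assumed form: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def mean_vector(words, vectors):
    """Element-wise average of the vectors of the words in a class name."""
    rows = [vectors[w] for w in words]
    return [sum(col) / len(rows) for col in zip(*rows)]


def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def name_similarity(words_a, words_b, vectors, t_n=0.8):
    """Maximum of the two measures, or None when it does not exceed t_n."""
    embed = cosine(mean_vector(words_a, vectors), mean_vector(words_b, vectors))
    edit = edit_similarity(" ".join(words_a), " ".join(words_b))
    score = max(embed, edit)
    return score if score > t_n else None


# Hypothetical toy embeddings, for illustration only.
vectors = {"railway": [1.0, 0.0], "train": [0.9, 0.1], "station": [0.0, 1.0],
           "book": [1.0, 0.0], "lake": [0.0, 1.0]}
match = name_similarity(["railway", "station"], ["train", "station"], vectors)
no_match = name_similarity(["book"], ["lake"], vectors)
```
]]></preformat>
        <p>Taking the maximum lets a pair pass on either lexical or semantic grounds, while a pair that is weak on both measures is left out of the candidate set.</p>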
      </sec>
      <sec id="sec-3-4">
        <title>Final alignment selection</title>
        <p>Each of the two matchers explained above results in an alignment set. The goal of this final stage of the matching approach is to combine the results of the two matchers and to select the final alignments for the complete matcher. Given <inline-formula><tex-math>A_{name}</tex-math></inline-formula> and <inline-formula><tex-math>A_{instance}</tex-math></inline-formula>, we follow the same method explained in Section 3.3 for creating <inline-formula><tex-math>A_{instance}</tex-math></inline-formula> out of the two directional alignments.</p>
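        <p>A minimal sketch of the aggregation and selection steps is given below. It is one plausible reading of the procedure, not the method of [7] itself: alignments are assumed to be dictionaries keyed by (class in O, class in O′), a pair missing from one alignment is assumed to score 0.0 there, and selection is done greedily by descending similarity so that each class keeps only its strongest correspondence. All names and values other than 0.69, 0.73, 0.92 and the threshold 0.22 are hypothetical.</p>
        <preformat><![CDATA[
```python
def aggregate(a, b):
    """Average two alignments given as {(class_o, class_o2): similarity}.

    Assumption: a pair missing from one alignment counts as 0.0 there.
    """
    return {pair: (a.get(pair, 0.0) + b.get(pair, 0.0)) / 2
            for pair in set(a) | set(b)}


def select(alignment, t):
    """Greedy one-to-one selection of correspondences with similarity above t."""
    chosen, used_left, used_right = {}, set(), set()
    ranked = sorted(alignment.items(), key=lambda item: item[1], reverse=True)
    for (left, right), sim in ranked:
        if sim <= t:
            break  # remaining pairs are at or below the threshold
        if left not in used_left and right not in used_right:
            chosen[(left, right)] = sim
            used_left.add(left)
            used_right.add(right)
    return chosen


# Toy alignments; class identifiers are hypothetical.
first = {("C4", "D5"): 0.69, ("C6", "D3"): 0.92, ("C1", "D3"): 0.60}
second = {("C4", "D5"): 0.73, ("C6", "D3"): 0.92, ("C1", "D3"): 0.60}
combined = aggregate(first, second)
final = select(combined, t=0.22)
```
]]></preformat>
        <p>In this toy run the pair scored 0.69 and 0.73 is averaged to 0.71, and the weaker of the two correspondences competing for D3 is dropped.</p>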
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <p>
        The aim of this experiment is to test our matching approach on the task of matching large KGs and to compare it to OAEI participants. In addition to OAEI participants, we also tested the performance of the baseline matcher KGbaselineLabel<sup>8</sup>. Similar to OAEI, we use precision, recall, and F-measure to evaluate the accuracy of the matching methods against the gold standard. The Matching Evaluation Toolkit (MELT) [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] was used to perform this evaluation. Our matcher is implemented in Python and was wrapped using the MELT external matcher wrapping tool. The evaluation was executed on a VM with 128GB of RAM, 16 vCPUs (2.4 GHz), and a 12GB GPU.
      </p>
      <sec id="sec-4-1">
        <title>Datasets</title>
        <p>
          The datasets used for evaluation are: (1) The NELL-DBpedia dataset, which is created from the schemas of two large public KGs [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ]. The gold standard consists of 129 true positive class alignments between NELL and DBpedia. To the best of our knowledge, this dataset is the largest available benchmark for matching KG classes. It is domain-independent and offers a substantial number of instances, which allows for evaluating instance-based matchers. (2) The OAEI Knowledge Graphs Benchmark, which offers five test cases generated from eight different KGs. The largest test cases in terms of class matching have 15 and 14 positive class alignments [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. Similar to OAEI, in this work, we evaluate and share the average results of the five tasks.
          <fn><label>8</label><p>http://oaei.ontologymatching.org/2020/results/knowledgegraph/index.html</p></fn>
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>Method's parameters</title>
        <p>
          Our method has the following parameters: (1) k, the number of words, ranked by their TF-IDF scores in a class, to be used for the undersampling process. In the reported results, we set k to 10. However, we have created three different configurations with k = 5, 10, 20. We discuss the impact of the value of k in Section 5.1. (2) The threshold of the name matcher, <inline-formula><tex-math>t_{n}</tex-math></inline-formula>, which we set to 0.8, in line with previous element-level methods that combine multiple similarity measures, such as [
          <xref ref-type="bibr" rid="ref20 ref9">9,20</xref>
          ]. (3) In the final alignment process, we apply a threshold (t) of 0.22, the value recommended for this method in [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ].
        </p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Results and discussion</title>
      <p>As shown in Table 1, on the NELL-DBpedia dataset our matcher achieves a recall and F1 that are clearly higher than those of all OAEI 2020 participants. Here, we report the highest result out of the three k value configurations, i.e., k = 10. However, our matcher outperforms all OAEI matchers regardless of the k value used. AML is the best-performing matcher on the task of matching classes in the OAEI KG track. However, on the task of matching common KG classes, our matcher outperforms AML by 0.06 in F1 and 0.12 in recall. On the OAEI KG dataset, our matcher does not perform as well, i.e., the F1 is 0.77. We argue that the systems that outperform our matcher on this task implement a variety of matchers that target not only the labels of KG entities but also other entity metadata. For example, AML incorporates 9 different matchers, including structural matchers, as well as other filters that further improve the quality of the matching results.</p>
      <p>Other systems, such as ALOD2Vec and Wiktionary, also utilize external background knowledge resources. In contrast, our matcher uses only the labels of entities to produce class alignments. Moreover, the majority of OAEI systems incorporate multiple string-based techniques such as n-grams, prefixes and suffixes. For instance, one of the string processes implemented by the ATBox matcher aims at finding ontology-specific stopwords, which are words that appear often in a certain KG or ontology. It treats such words as stopwords to be removed prior to applying further string matching techniques. This allows it to discover the similarity of ⟨sidebar starship, starship⟩ and ⟨sidebar novel, novel⟩ if sidebar is a corpus-specific stopword. However, those classes will not be matched by word embedding models or by an edit-distance similarity with a high threshold such as the one we apply, i.e., 0.8.</p>
      <p>Another factor contributing to the different performance of our matcher across the two datasets is the nature of the datasets. While the first dataset is constructed from common KGs whose classes annotate complementary real-world entities, the OAEI datasets are very different in nature. They are restricted to a single domain (entertainment), in which distinguishing entities of book, music and movie can be a difficult task. This is due to the words used in naming such entities, which can be very heterogeneous and inconsistent. To illustrate, Fig. 4 shows the difference in the performance of our KG classifier when it is trained to classify NELL instances, as opposed to when it is trained to classify the MemoryAlpha KG instances in the OAEI benchmark dataset. Clearly, it was easier to classify instances from large multi-domain KGs like NELL.</p>
      <p>Fig. 4: The instance classification report of 20 randomly sampled classes from the OAEI KG MemoryAlpha in (a) and NELL in (b). Note that y-axis numbers indicate class IDs.</p>
      <sec id="sec-5-1">
        <title>The impact of components and parameters</title>
        <p>In this section, we discuss the impact of several components and parameters of the proposed method. First, in addition to the BERT-based classifier, we also tested our method with another model, based on a simple DNN with 4 densely connected layers. Although both models outperform all OAEI participants on the task of matching common KGs, we use the BERT model as it achieves a slightly better F1 (by 0.01). This shows that our approach generalizes to other machine learning algorithms for the KG classifier. Second, for the parameter k, we tested three configurations: 5, 10, and 20. We observed identical results with k=10 and k=20, and a slightly lower F1 with k=5, i.e., 0.94, which still significantly outperforms all OAEI participants. Further, setting k to 10 rather than 20 led to a significant reduction in runtime due to more aggressive undersampling (from 55 minutes on the common KGs dataset to 29 minutes). Finally, the exact name filtering had a positive effect on the overall performance of our method, increasing the F1 by 0.02 on the common KGs dataset.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion and future work</title>
      <p>This work reports an ongoing study of utilizing instances to match KG classes. We proposed a novel domain-independent approach for matching classes in large KGs. To the best of our knowledge, our matcher includes the first instance-based matcher for matching KG classes with the ability to handle unbalanced populations. Our findings suggest that a hybrid approach that includes an instance-based matcher can be very effective for matching common KG classes. In future versions of our matcher, we aim to further study the thresholds, focusing on the possibility of automating that decision, and to further improve our matcher combination so that it can address different matching tasks. We also aim to extend our matching method to match all KG entities.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lehmann</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Isele</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jakob</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jentzsch</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kontokostas</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mendes</surname>
            ,
            <given-names>P.N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hellmann</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Morsey</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>van Kleef</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Auer</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bizer</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>DBpedia – A Large-scale, Multilingual Knowledge Base Extracted from Wikipedia</article-title>
          . Semantic Web pp.
          <fpage>1</fpage>
          –
          <lpage>5</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Carlson</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Betteridge</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kisiel</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Settles</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
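          <string-name>
            <surname>Hruschka</surname>
            ,
            <given-names>E.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>T.M.</given-names>
          </string-name>
          :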
          <article-title>Toward an architecture for never-ending language learning</article-title>
          .
          <source>In: Twenty-Fourth AAAI Conference on AI</source>
          (
          <year>2010</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Collobert</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weston</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bottou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karlen</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavukcuoglu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kuksa</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Natural language processing (almost) from scratch</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>12</volume>
          ,
          <fpage>2493</fpage>
          –
          <lpage>2537</lpage>
          (
          <year>2011</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Fallatah</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hopfgartner</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>A gold standard dataset for large knowledge graphs matching</article-title>
          .
          <source>In: Ontology Matching 2020: Proceedings of the 15th International Workshop on Ontology Matching co-located with ISWC 2020</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Faria</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pesquita</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Santos</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmonari</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cruz</surname>
            ,
            <given-names>I.F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Couto</surname>
            ,
            <given-names>F.M.</given-names>
          </string-name>
          :
          <article-title>The AgreementMakerLight ontology matching system</article-title>
          .
          <source>In: OTM Confederated International Conferences</source>
          . pp.
          <fpage>527</fpage>
          –
          <lpage>541</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fernández</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>García</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Galar</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Prati</surname>
            ,
            <given-names>R.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Krawczyk</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Herrera</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Learning from imbalanced data sets</article-title>
          , vol.
          <volume>11</volume>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Gulić</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vrdoljak</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Banek</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>CroMatcher: An ontology matching system based on automated weighted aggregation and iterative final alignment</article-title>
          .
          <source>Journal of Web Semantics</source>
          <volume>41</volume>
          ,
          <fpage>50</fpage>
          –
          <lpage>71</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>DBkWik: A consolidated knowledge graph from thousands of wikis</article-title>
          .
          <source>In: 2018 IEEE International Conference on Big Knowledge (ICBK)</source>
          . pp.
          <fpage>17</fpage>
          –
          <lpage>24</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>ATBox results for OAEI 2020</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . vol.
          <volume>2788</volume>
          , pp.
          <fpage>168</fpage>
          –
          <lpage>175</lpage>
          .
          <publisher-name>RWTH</publisher-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>The knowledge graph track at OAEI</article-title>
          .
          <source>In: European Semantic Web Conference</source>
          . pp.
          <fpage>343</fpage>
          –
          <lpage>359</lpage>
          . Springer (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Hertling</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Portisch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>MELT - matching evaluation toolkit</article-title>
          .
          <source>In: Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference</source>
          . pp.
          <fpage>231</fpage>
          –
          <lpage>245</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Isaac</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Der Meij</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schlobach</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>An empirical study of instance-based ontology matching</article-title>
          .
          <source>In: The Semantic Web</source>
          , pp.
          <fpage>253</fpage>
          –
          <lpage>266</lpage>
          . Springer (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Jiménez-Ruiz</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>LogMap family participation in the OAEI 2020</article-title>
          .
          <source>In: Proceedings of (OM 2020)</source>
          .
          <publisher-name>CEUR-WS</publisher-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Ksieniewicz</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>Undersampled majority class ensemble for highly imbalanced binary classification</article-title>
          .
          <source>In: Proceedings of the Second International Workshop on Learning with Imbalanced Domains: Theory and Applications</source>
          . pp.
          <fpage>82</fpage>
          –
          <lpage>94</lpage>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>A.C.</given-names>
          </string-name>
          :
          <article-title>The effect of oversampling and undersampling on classifying imbalanced text datasets</article-title>
          . The University of Texas at Austin p.
          <volume>67</volume>
          (
          <year>2004</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Maiya</surname>
            ,
            <given-names>A.S.</given-names>
          </string-name>
          :
          <article-title>ktrain: A low-code library for augmented machine learning</article-title>
          .
          <source>arXiv preprint arXiv:2004.10703</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Minaee</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalchbrenner</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cambria</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nikzad</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chenaghlu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep learning based text classification: A comprehensive review</article-title>
          .
          <source>arXiv preprint arXiv:2004.03705</source>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Monych</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Portisch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hladik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>DESKMatcher</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . vol.
          <volume>2788</volume>
          , pp.
          <fpage>181</fpage>
          –
          <lpage>186</lpage>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Ngo</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bellahsene</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Todorov</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Opening the black box of ontology matching</article-title>
          .
          <source>In: Extended Semantic Web Conference</source>
          . pp.
          <fpage>16</fpage>
          –
          <lpage>30</lpage>
          . Springer (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Nkisi-Orji</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wiratunga</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Massie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hui</surname>
            ,
            <given-names>K.Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Heaven</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Ontology alignment based on word embedding and random forest classification</article-title>
          .
          <source>In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases</source>
          . pp.
          <fpage>557</fpage>
          –
          <lpage>572</lpage>
          . Springer (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Otero-Cerdeira</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez-Martínez</surname>
            ,
            <given-names>F.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gómez-Rodríguez</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ontology matching: A literature review</article-title>
          .
          <source>Expert Systems with Applications</source>
          pp.
          <fpage>949</fpage>
          –
          <lpage>971</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Padurariu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Breaban</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          :
          <article-title>Dealing with data imbalance in text classification</article-title>
          .
          <source>Procedia Computer Science</source>
          <volume>159</volume>
          ,
          <fpage>736</fpage>
          –
          <lpage>745</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Portisch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hladik</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>ALOD2Vec Matcher results for OAEI 2020</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . vol.
          <volume>2788</volume>
          , pp.
          <fpage>147</fpage>
          –
          <lpage>153</lpage>
          .
          <publisher-name>RWTH</publisher-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Portisch</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulheim</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Wiktionary Matcher results for OAEI 2020</article-title>
          .
          <source>In: CEUR Workshop Proceedings</source>
          . vol.
          <volume>2788</volume>
          , pp.
          <fpage>225</fpage>
          –
          <lpage>232</lpage>
          .
          <publisher-name>RWTH</publisher-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25. Schutze, H.,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Raghavan</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          : Introduction to information retrieval, vol.
          <volume>39</volume>
          . Cambridge University Press Cambridge (
          <year>2008</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Thor</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kirsten</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rahm</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          :
          <article-title>Instance-based matching of hierarchical ontologies</article-title>
          .
          <source>Datenbanksysteme in Business, Technologie and Web</source>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gentile</surname>
            ,
            <given-names>A.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Blomqvist</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Augenstein</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ciravegna</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>An unsupervised data-driven method to discover equivalent relations in large Linked Datasets</article-title>
          . Semantic Web pp.
          <fpage>197</fpage>
          –
          <lpage>223</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>