<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>International Workshop on Ontology Matching, October</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>The Impact of Imbalanced Class Distribution on Knowledge Graphs Matching</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Omaima Fallatah</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ziqi Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Frank Hopfgartner</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Information School, The University of Sheffield</institution>
          ,
          <addr-line>Regent Court, 211 Portobello, Sheffield S1 4DP</addr-line>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Information Systems, Umm Al Qura University</institution>
          ,
          <addr-line>Mecca 24382</addr-line>
          ,
          <country country="SA">Saudi Arabia</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Universität Koblenz-Landau</institution>
          ,
          <addr-line>Mainz 55118</addr-line>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <volume>23</volume>
      <issue>2022</issue>
      <fpage>0000</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Mapping large Knowledge Graphs (KGs) has been a fundamental problem in the semantic web community. Many state-of-the-art methods are not suitable for matching cross-domain, large, and automatically constructed KGs that often suffer from highly imbalanced class distribution. Therefore, recent studies have revisited instance-based matching techniques to address this task, because such large KGs often lack a well-defined structure and descriptive metadata about their classes, but contain numerous class instances. In this work, we study the problem of imbalanced class distribution in large KG schema matching using instance-based methods. Building on a state-of-the-art method reported in the 2021 OAEI common knowledge graphs track, we study different resampling techniques and propose a new method to address class imbalance in the matching task. We show that our method improves on the state of the art by up to 11% in recall on the NELL-DBpedia dataset and by 4% in recall on the YAGO-Wikidata dataset. In addition, this work produces a new public gold standard dataset for mapping large KG classes with over 300 class links, which is by far the largest domain-independent dataset for KG schema matching.</p>
      </abstract>
      <kwd-group>
        <kwd>Knowledge Graphs Matching</kwd>
        <kwd>Instance-based Matching</kwd>
        <kwd>Ontology Matching</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Over the last decade, there has been significant growth in the creation and application of
knowledge graphs. A Knowledge Graph (KG) is a data structure for representing real-world
entities in a structured and connected fashion [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. With their potential in a wide range of
downstream applications such as query answering, recommendation systems and semantic
search [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], they are utilized by large companies such as Google, Facebook and Microsoft.
Besides such proprietary KGs, there are a number of large common KGs available, including
YAGO and Wikidata. Such cross-domain KGs are known for sharing heterogeneous yet highly
complementary facts.
      </p>
      <p>
        Despite the growth in such large-scale KGs, one problem is the quality of the
data generated automatically. This has resulted in continuous efforts to refine their
entities by increasing their coverage (i.e., completion) and detecting errors (i.e., correctness) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
To achieve this, mapping and aligning KGs at both entity and schema level is crucial. Mapping
and aligning KG entities has been a significant challenge in the semantic web community. The
Ontology Alignment Evaluation Initiative (OAEI) has two tracks dedicated to KGs, including
one that specifically evaluates matching common KGs. However, most matchers participating
in this task are aimed at matching well-formed ontologies, which is not necessarily the case for
cross-domain and semi- or fully-automatically generated KGs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. Despite the large number
of matching tools, matching the schema of large-scale KGs is far from trivial. Current schema
matching systems struggle to balance efficiency and effectiveness when solving the
task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Recent studies have shown significant improvements over state-of-the-art systems
by exploiting an instance-based approach for matching the schema, i.e., classes and properties,
of KGs [
        <xref ref-type="bibr" rid="ref6 ref7">6, 7</xref>
        ]. However, the problem of imbalanced class distribution, and particularly of
underrepresented classes, remains challenging for instance-based approaches.
      </p>
      <p>
        In this work, we address this issue by introducing a new method for matching classes from
large KGs. Our previous work [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] introduced KGMatcher, an instance-based method that
achieved the best results in the recent OAEI 2021 common KG track [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. The method adopts
a data-driven approach where a two-way classification technique is followed to map classes
from two KGs based on the extent to which the instances of a class in one KG are classified
as instances of classes in the other KG. First, a multi-class classifier is trained on the instances
of the classes in each KG. Next, each classifier is applied to the other KG to classify its
instances. Mapping pairs of classes are then derived from the classification results of the
two classifiers.
      </p>
      <p>While our method achieved good results, it still suffered from imbalanced class distribution.
To address this issue, in this work, we look into various sampling techniques and propose a
combined approach of over- and under-sampling in classifier training. We call this improved version
of our method KGMatcher+. Specifically, we adapt the sampling component of KGMatcher
to better handle the imbalance problem. We introduce and compare six different resampling
strategies. We evaluate KGMatcher+ on two large datasets: the dataset from the OAEI
common KG track and a new gold standard that we make public as part of this work. We show that
KGMatcher+ achieves the best results compared to OAEI 2021 participants.</p>
      <p>Although many solutions for dataset imbalance have been introduced, we are not aware of
any research addressing the class imbalance issue in KG matching, or even in ontology matching
in general. Previous research in the context of classification tasks has shown that there is often
no one-size-fits-all method, and thus previous findings may not generalise to this task. Given
the increasing research in KG matching and the popularity of instance-based techniques, we
argue that it is imperative to further investigate and develop methods that address the issue
of imbalanced distribution in the KG matching task. Section 2 details KGMatcher+. Then,
Section 3 and Section 4 introduce the datasets and the experimental results. Finally, Section 5
concludes this work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Approach</title>
      <p>
        Here, we introduce KGMatcher+, specifically focusing on balancing class distribution for
matching KG classes. We first give an overview of it in Section 2.1 (readers interested in the details
may refer to [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], which is the basis of KGMatcher+). Then, in Section 2.2, we describe how our
method uses resampling techniques to address the data imbalance issue.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Overview of KGMatcher+</title>
        <sec id="sec-2-1-1">
          <title>2.1.1. Preliminaries</title>
          <p>
            Given two input knowledge graphs KG and KG′, we define the correspondence between two
classes C ∈ KG and C′ ∈ KG′ as the tuple &lt;C, C′, s&gt; where s ∈ [0, 1] is the similarity value
of C and C′. Each class in the two KGs has a set of instances, I = {i₀, i₁, i₂, ..., iₙ} for C and
I′ = {i′₀, i′₁, i′₂, ..., i′ₘ} for C′. The following sections describe the different modules of the matcher, as
illustrated in Figure 1.
          </p>
        </sec>
        <sec id="sec-2-1-2">
          <title>2.1.2. Input Knowledge Graph Indexing</title>
          <p>The first component of the matcher consists of three steps. It starts by parsing the two input KGs
in order to extract and then separately index their lexical annotations. This is followed by text
preprocessing to normalise entity labels (e.g., lowercasing, stopwords removal). Preprocessing
also separates multi-word entities such as creativeworkseries and dancegroup by using
a word segmentation algorithm based on a dictionary.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>2.1.3. Instance-based Matching</title>
          <p>
            In a two-way classification fashion, this matching component is split into two matching
processes. Each is based on training a multi-class classifier for one of the input KGs using its
instance names as training data and class names as classification labels. A classifier
trained on the instance data of KG will therefore be able to predict the class C ∈ KG to which a given instance
name may belong. In the following, we briefly describe the steps of the instance-based matcher.
• Exact name filter: working as a blocking strategy, this removes class pairs that share
exactly the same labels from KG and KG′. Further, classes with only one instance are
eliminated from this process, as our BERT-based classifier will be unable to learn from
them.
• Resampling KG instances: the resampling phase applies a data balancing
technique so that the matcher can cope with the imbalanced distribution of instances
across KG classes. In Section 2.2, we discuss the different data imbalance solutions that we
implement as part of KGMatcher+.
• Instance Classification: here we build a multi-class classifier for each of KG and KG′. Class
instances are split into 85% for training and 15% for evaluating the classifier,
for which we use a simple BERT-based sequence classifier [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ]. The two classifiers ℒ and
ℒ′ will be utilized in the two following steps of the matcher.
• Alignment Elicitation: here we derive class mapping candidates based on the classification
results. First, in the direction KG → KG′, to generate the mapping candidates denoted as
A(KG→KG′), KG is treated as the source KG and KG′ as the target. Subsequently, ℒ
is applied to all instance names in KG′. By taking the class with the highest probability
value, each instance in KG′ receives a predicted class in KG. To generate A(KG→KG′),
we pair each class C′ with the class C that receives the most votes when
ℒ is applied to the instances of C′. As an example, the pair &lt;C₄, C′₂, 0.57&gt; means that 57% of the
instances of C′₂ were predicted to be C₄ by ℒ. The second elicitation process
is done in the opposite direction, repeating the same procedure to create A(KG′→KG). The
two resulting alignment sets are combined in the final step of the instance-based
matcher.
• Alignment Selection: A(KG→KG′) and A(KG′→KG) are unified at this stage to generate one
alignment set for the instance-based matcher. Specifically, we use the approach in [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ],
which first creates an 𝒩 × ℳ table (where 𝒩 is the number of classes in KG and ℳ the
number of classes in KG′), and populates the table based on the two directional alignment
sets created before. The algorithm then selects the highest value in the table as
a final alignment, and then deletes the corresponding row and column from the
table. The process is repeated until all rows or columns are deleted. For further details of
this algorithm, please refer to [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ].
          </p>
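          <p>The elicitation and selection steps above can be sketched as follows. This is a minimal illustration, not the authors’ implementation: the function names are hypothetical, the vote share is used as the similarity value, and merging the two directional sets by taking the larger score per cell is an assumption (the text defers to [10] for how the table is populated).</p>
          <preformat preformat-type="code"><![CDATA[
```python
from collections import Counter


def elicit_alignments(predictions):
    """Derive directional mapping candidates from classifier votes.

    predictions maps each target class to the list of source classes
    predicted for its instances. Returns {(source, target): vote share}.
    """
    candidates = {}
    for target_class, predicted in predictions.items():
        if not predicted:
            continue
        # The source class that received the most votes wins.
        source_class, votes = Counter(predicted).most_common(1)[0]
        candidates[(source_class, target_class)] = votes / len(predicted)
    return candidates


def select_alignments(forward, backward):
    """Greedy selection over the merged score table.

    Both arguments are keyed as (class in KG, class in KG'). Cells that
    appear in both directions keep the larger score (an assumption).
    """
    table = dict(backward)
    for pair, score in forward.items():
        table[pair] = max(score, table.get(pair, 0.0))

    selected, used_rows, used_cols = [], set(), set()
    # Visiting cells by descending score and skipping used rows/columns
    # is equivalent to repeatedly taking the maximum and deleting its
    # row and column.
    for (c, c_prime), score in sorted(table.items(), key=lambda kv: (-kv[1], kv[0])):
        if c in used_rows or c_prime in used_cols:
            continue
        selected.append((c, c_prime, score))
        used_rows.add(c)
        used_cols.add(c_prime)
    return selected
```
]]></preformat>
          <p>Sorting once and skipping already-matched classes keeps the sketch short; a literal table implementation would give the same one-to-one alignment.</p>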
        </sec>
        <sec id="sec-2-1-4">
          <title>2.1.4. Name Matcher and Final Alignment Generation</title>
          <p>In this step, the terminological similarity of the KG class names is measured and combined with
their instance-based similarity. This matcher combines two similarity measures: one captures
string similarity, while the other captures semantic similarity. For string similarity, we use
the normalised Levenshtein distance. For semantic similarity, class names are represented using
a pretrained word2vec model, and then the cosine similarity is calculated. For each pair, the
higher value of the two similarity measures is chosen as that pair’s similarity value, provided that it is higher
than a threshold of 0.8. To generate the final alignments of KGMatcher+, the instance-based
alignments are combined with the name matcher alignments. This is done by following the
same alignment selection method used earlier, while treating each as a directional alignment.</p>
          <p>[Figure 2 panels: (a) YAGO; (b) Wikidata.]</p>
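          <p>A minimal sketch of the name matcher’s scoring rule follows. The function names are hypothetical, normalising the edit distance by the longer string is an assumption, and the word2vec-based similarity is passed in as a callable rather than implemented here.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def string_similarity(a, b):
    """Edit distance turned into a similarity in [0, 1].

    Dividing by the length of the longer string is an assumption.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def name_similarity(a, b, semantic_similarity, threshold=0.8):
    """Keep the higher of the two scores if it exceeds the threshold.

    semantic_similarity is any callable (a, b) -> float, e.g. a cosine
    similarity over embeddings. Returns None when the pair is rejected.
    """
    score = max(string_similarity(a, b), semantic_similarity(a, b))
    return score if score > threshold else None
```
]]></preformat>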
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Addressing Imbalanced Data Distribution</title>
        <p>In the instance-based matching stage, the training data, i.e., KG class instances, are typically
imbalanced. Figure 2 depicts the distribution of instances in two KGs related to our datasets in
Section 3. In the following, we introduce six different sampling strategies that form the key
components of KGMatcher+. We denote the version of KGMatcher+ that uses a given strategy
as KGMatcher+{SS}, where SS indicates the corresponding Sampling
Strategy. Since we are dealing with a classification problem, we can resort to popular methods
used for dealing with imbalanced training data in classification. As the goal of data balancing
techniques is to decrease the bias of classifiers towards majority classes at the expense of the
minority classes, we need to define both in the context of KGs. However, there is a lack of
consensus on how majority/minority classes are defined in multi-class classification settings. Here,
we adopt an approach where we first calculate the average number of instances per class
within a KG, and then classes with fewer instances than this number are treated as minority
classes and those above it as majority classes.</p>
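        <p>The majority/minority split described above can be sketched as follows. The function name is hypothetical, and treating a class whose size equals the mean as a majority class is an assumption, since the text leaves that case open.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def split_by_mean(instances_per_class):
    """Split classes into majority and minority sets around the mean size.

    instances_per_class maps a class name to the list of its instance names.
    """
    counts = {c: len(names) for c, names in instances_per_class.items()}
    mean = sum(counts.values()) / len(counts)
    # Fewer instances than the mean -> minority; everything else -> majority.
    minority = {c for c, n in counts.items() if n < mean}
    majority = set(counts) - minority
    return majority, minority, mean
```
]]></preformat>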
        <sec id="sec-2-2-1">
          <title>2.2.1. KGMatcher+ {Random Undersampling}</title>
          <p>
            In binary classification, this strategy excludes data samples from the majority class
to match the size of the minority class. This is a common strategy in the literature [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
Similarly, in multi-class classification tasks, this method is applied independently to each class
by randomly drawing a sample of equal size from every class. The sample size matches the size of
the class with the fewest data samples [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ].
          </p>
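          <p>As an illustration, multi-class random undersampling can be sketched in a few lines. The function name and the fixed seed are assumptions made for reproducibility of the example.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random


def random_undersample(instances_per_class, seed=0):
    """Cut every class down to the size of the smallest class."""
    rng = random.Random(seed)
    target = min(len(names) for names in instances_per_class.values())
    # Sample without replacement, independently for each class.
    return {c: rng.sample(list(names), target)
            for c, names in instances_per_class.items()}
```
]]></preformat>
          <p>With a smallest class of only a handful of instances, every other class is reduced to that same handful, which is why this strategy can discard most of the training data.</p>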
        </sec>
        <sec id="sec-2-2-2">
          <title>2.2.2. KGMatcher+ {TF-IDF Undersampling}</title>
          <p>
            As opposed to random undersampling, which randomly discards instances from the majority
classes and can result in losing potentially useful samples, this method uses TF-IDF [13] to
measure the ‘importance’ of samples and select them based on this score. This method was first
introduced in our earlier work [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] and used as the only sampling component in KGMatcher [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ].
TF-IDF undersampling is applied to majority classes. Briefly, we calculate the TF-IDF of the
words from KG instances. Similar to applying TF-IDF for information retrieval tasks, each class
here is treated as a ‘document’ and the concatenation of the labels of its instances is treated
as its content. Then, a word with high TF-IDF to a certain class indicates it is more specific
to that class. The highest ranked ten words per class are then used to undersample instances
in the majority classes by discarding instance names that do not contain any of these words.
Although this method is effective in downsampling the majority classes while maintaining the
integrity of the data, the problem of classes with few instances remains unresolved.
          </p>
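          <p>The following sketch shows one way to realise TF-IDF undersampling as described above. It is not the authors’ code: the function names are hypothetical, and whitespace tokenisation, length-normalised term frequency, and an unsmoothed logarithmic IDF are all assumptions.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math
from collections import Counter


def top_words_per_class(instances_per_class, k=10):
    """Rank words by TF-IDF, treating each class as one 'document'."""
    term_counts = {
        c: Counter(w for name in names for w in name.lower().split())
        for c, names in instances_per_class.items()
    }
    n_docs = len(term_counts)
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    top = {}
    for c, counts in term_counts.items():
        total = sum(counts.values())
        scores = {w: (n / total) * math.log(n_docs / doc_freq[w])
                  for w, n in counts.items()}
        # Highest score first; ties broken alphabetically.
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        top[c] = set(ranked[:k])
    return top


def tfidf_undersample(instances_per_class, majority, k=10):
    """Drop majority-class instances that contain none of the top-k words."""
    top = top_words_per_class(instances_per_class, k)
    result = {}
    for c, names in instances_per_class.items():
        if c in majority:
            result[c] = [n for n in names if top[c] & set(n.lower().split())]
        else:
            result[c] = list(names)  # minority classes are left untouched
    return result
```
]]></preformat>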
        </sec>
        <sec id="sec-2-2-3">
          <title>2.2.3. KGMatcher+ {SMOTE}</title>
          <p>
            The Synthetic Minority Over-sampling Technique (SMOTE) [14] is the most common oversampling method
applied in the literature to handle imbalanced data. It oversamples the minority
classes by generating synthetic data for each minority class. The algorithm uses the K-nearest
neighbours of existing instances in a minority class to create new synthetic samples from
neighbouring samples [15]. This technique is considered an alternative to random
oversampling, which is a non-heuristic approach that balances classes by duplicating the samples in
the minority classes to match the size of the largest majority class [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. However, it is known
that random oversampling often leads to model overfitting [16]. Another reason for excluding
random oversampling on its own is the severe class imbalance ratio, e.g., in YAGO the smallest class has
one instance while some classes have over 100,000 instances, as depicted in Figure 2. Random
oversampling would produce overwhelmingly redundant instances for small classes, making
overfitting much worse.
          </p>
        </sec>
        <sec id="sec-2-2-4">
          <title>2.2.4. KGMatcher+ {TF-IDF + Oversampling}</title>
          <p>Combining undersampling and oversampling strategies is another approach to handling
imbalanced datasets. Such a hybrid strategy has been shown to improve the results of several
classification tasks [16, 17]. While earlier work has already experimented with other variations of
this idea, here we propose a novel method that combines TF-IDF undersampling with random
oversampling. We aim to maintain a trade-off between handling the imbalance issue in both
majority and minority classes. After applying TF-IDF undersampling to the majority classes,
we apply random oversampling so that all classes have the same number of instances. This includes
creating repeated samples from minority classes.</p>
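          <p>The oversampling half of this hybrid strategy can be sketched as below, applied to class instances that have already been undersampled. The function name is hypothetical, and using the size of the largest remaining class as the common target is an assumption; the sketch also assumes that every class has at least one instance.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random


def random_oversample(instances_per_class, seed=0):
    """Pad every class with repeated instances up to the largest class size."""
    rng = random.Random(seed)
    target = max(len(names) for names in instances_per_class.values())
    balanced = {}
    for c, names in instances_per_class.items():
        names = list(names)
        # Draw duplicates with replacement until the class reaches the target.
        extra = [rng.choice(names) for _ in range(target - len(names))]
        balanced[c] = names + extra
    return balanced
```
]]></preformat>
          <p>Because the majority classes have already been reduced, the number of duplicates needed per minority class is far smaller than it would be with random oversampling alone.</p>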
        </sec>
        <sec id="sec-2-2-5">
          <title>2.2.5. KGMatcher+ {TF-IDF + SMOTE}</title>
          <p>This strategy is similar to the previous one. However, instead of random oversampling, here,
SMOTE is applied as an oversampling technique to handle the minority classes.</p>
        </sec>
        <sec id="sec-2-2-6">
          <title>2.2.6. KGMatcher+ {Cost-based Learning}</title>
          <p>All previous strategies belong to the category of data-level methods, which are applied to the datasets
prior to training a model. Another type of strategy (i.e., ‘algorithm level’) aims to modify existing
machine learning models in an effort to reduce their bias towards the majority classes [18].
A common algorithm-level approach is cost-sensitive learning [19], which modifies the class
weights by assigning larger weights to minority class(es) and smaller weights to majority
class(es) to be used during the model learning process. In this work, we evaluate a
state-of-the-art approach [20], which gives each class a weight equal to the total number of instances
divided by the product of the number of classes and the number of instances in that class, as depicted in Equation 1, where weight_dict
is a dictionary of classes and their assigned weights, n is the total number of instances, k is the number of classes, and count(c) is the number of instances in
class c.</p>
          <p><disp-formula><label>(1)</label><tex-math><![CDATA[\mathit{weight\_dict}(c) = \frac{n}{k \cdot \mathit{count}(c)}]]></tex-math></disp-formula></p>
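          <p>As a worked example, the class weights can be computed as below. This sketch reads Equation 1 as the total number of instances divided by the number of classes times the instance count of the class, which is an assumption about the garbled formula; the function name is hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
def balanced_class_weights(instance_counts):
    """Weight each class by n / (k * count(c)).

    instance_counts maps a class name to its (non-zero) number of instances;
    n is the total number of instances and k the number of classes.
    """
    n = sum(instance_counts.values())
    k = len(instance_counts)
    return {c: n / (k * count) for c, count in instance_counts.items()}
```
]]></preformat>
          <p>For two classes with 90 and 10 instances, the weights are roughly 0.56 and 5.0, so an error on the small class costs about nine times as much as an error on the large one.</p>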
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>Our first dataset is NELL-DBpedia, which is the OAEI common knowledge graphs dataset. This
dataset contains a gold standard mapping of 129 pairs of classes between NELL and DBpedia,
which contain an average of 8,000 and 4,000 instances per class, respectively. The second
dataset is YAGO-Wikidata, created based on [21]. It specifically maps the Schema.org classes to
Wikidata’s schema. The original dataset only includes the class alignment;
we refer to our dataset as YAGO-Wikidata because we retrieved the instances of Schema.org
classes from YAGO<sup>3</sup>. The original gold standard includes over 500 mappings; however, not all
of them are equivalence mappings. For the purpose of our task, and given that the majority of studies on
mapping KGs only consider equivalence matches, we only include mappings annotated with
the relationship equivClass. As a result, the new dataset contains 304 equivalent class pairs.
Further, since Wikidata’s entities are often represented by their Q indices, e.g., Q1234, we use
the Wikidata Python API to query their URIs in order to retrieve their labels. The same API
was then used to generate a subgraph of Wikidata that includes all the 304 classes and their
annotated instances. Similarly, we use YAGO’s SPARQL query endpoint to retrieve all schema
and instance metadata connected to the 304 classes included in the original dataset
alignments. On average, this dataset includes over 33,000 and 12,000 instances per class in
YAGO and Wikidata, respectively. We make the new dataset publicly available<sup>4</sup>. This includes
the two subgraphs in RDF/XML format and the alignments file according to OAEI’s standards.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Evaluation and Comparative Study</title>
      <p>In Section 4.1, we first compare the different sampling strategies in KGMatcher+; in Section 4.2,
we compare the results of KGMatcher+ against state-of-the-art methods for mapping KGs; in Section 4.3,
we present an ablation study. Following OAEI standards, we use precision, recall, and F-measure to evaluate the accuracy of
the resulting alignments. All systems are implemented using the Matching Evaluation Toolkit
(MELT)<sup>5</sup>, which is also used for recent OAEI campaigns. Our experiments were executed
on a VM with 128 GB of RAM, 16 vCPUs at 2.4 GHz, and a 12 GB GPU.</p>
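      <p>For reference, the three measures can be computed over sets of class pairs as sketched below. This is a plain-Python illustration with a hypothetical function name, not the MELT evaluator.</p>
      <preformat preformat-type="code"><![CDATA[
```python
def evaluate(system_alignment, reference_alignment):
    """Precision, recall and F1 of a system alignment against a gold standard.

    Both arguments are iterables of (class in KG, class in KG') pairs.
    """
    system, reference = set(system_alignment), set(reference_alignment)
    true_positives = len(system & reference)
    precision = true_positives / len(system) if system else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
]]></preformat>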
      <p><sup>3</sup> YAGO 4: https://yago-knowledge.org/downloads/yago-4
<sup>4</sup> https://github.com/OmaimaFallatah/YagoWikiData
<sup>5</sup> https://github.com/dwslab/melt</p>
      <sec id="sec-4-1">
        <title>4.1. Impact of sampling strategies on KGMatcher+</title>
        <p>
          Table 1 shows the precision, recall, and F1 of KGMatcher+ when using different sampling
strategies. As the table shows, in terms of F1, KGMatcher+{TF-IDF + Oversampling} outperforms
all other variations, with F1=0.91 on the YAGO-Wikidata dataset and F1=0.95 on the
NELL-DBpedia dataset. In terms of undersampling strategies, KGMatcher+{Random Undersampling}
fails to improve the overall results on both datasets compared to the results obtained when no
sampling is applied. On the other hand, KGMatcher+{TF-IDF undersampling} leaves
the matching results on both datasets unchanged compared to no sampling, but it
significantly decreases the matcher’s processing time (from 55 minutes
to 29 minutes on the NELL-DBpedia dataset and from 3 hours to 1.5 hours on the YAGO-Wikidata dataset).
The latter strategy represents our earlier work, KGMatcher [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ].
        </p>
        <p>There is a gap in the performance of the two undersampling strategies, as the first one
randomly eliminates samples from KG classes. Further, with this strategy, instance samples are
reduced to match the size of the class with the fewest samples, which can be fewer than 10 instances
in some common KGs. This rather aggressive reduction in training data could have badly affected
the classifier training. In contrast, using TF-IDF to downsample classes in KGMatcher+{TF-IDF
undersampling} does not negatively impact the results, as the elimination process retains
instances with indicative words.</p>
        <p>In terms of oversampling strategies, KGMatcher+{SMOTE} decreases the recall on both
datasets, which subsequently affects the F1 score as well. Further, KGMatcher+{TF-IDF +
SMOTE} shows results similar to the best performing strategy on the NELL-DBpedia dataset,
i.e., KGMatcher+{TF-IDF + Oversampling}. Nonetheless, this strategy does not perform as well
on the YAGO-Wikidata dataset, which is twice the size of NELL-DBpedia. This shows
that combining TF-IDF undersampling with random oversampling is the best strategy for this task. Even
though the class imbalance in the KGs used in the experiments was severe, undersampling the
majority classes with TF-IDF helps mitigate this issue. In contrast, generating synthetic data
from KG instances seems to introduce noisy samples to the dataset, as indicated by the result of
KGMatcher+{SMOTE}. This seems to be consistent with previously reported findings in text
classification tasks [15].</p>
        <p>The final data balancing strategy is KGMatcher+{Cost-based learning}, which adapts the
BERT model to handle class imbalance by using class weights. Although the main advantage of
this method is that it maintains the integrity of the datasets, it did not work well: it achieved
the worst precision, recall, and F-measure, which are even lower than those of the model using no data
balancing strategy at all. We attribute this to over-penalising the classifier for incorrectly classifying
instances from the minority classes.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Comparison to State-of-the-art</title>
        <p>Table 2 compares the performance of KGMatcher+{TF-IDF + Oversampling} against several of
OAEI’s best performing matchers in the two KG tracks. We can observe that the proposed
matcher outperforms all baselines on both datasets, recording the highest recall and F1. On
the NELL-DBpedia dataset, KGMatcher+ outperforms all matchers by a minimum of
6% in F1 score and by 11% in recall.</p>
        <p>On the YAGO-Wikidata dataset, which has worse class imbalance compared to NELL-DBpedia,
KGMatcher+ beats all matchers by 4% in terms of recall and 2% in F1. One can also notice that
all matchers score lower on YAGO-Wikidata. This is likely due to the size of YAGO-Wikidata,
which is twice the size of NELL-DBpedia. Readers should also note that while the other matchers were
able to generate class alignments when applied to the full version of YAGO-Wikidata, AML
and LogMap were only able to process a smaller version of the dataset, with a small subset of
instances per class. Moreover, neither of them utilizes instances during the matching process.</p>
        <p>In order to analyse the ability of different matching methods to discover alignments containing
very imbalanced classes, we conducted a quantitative study of imbalanced pairs in the two
datasets. We define a pair (C, C′) as an imbalanced class pair if one of the classes is a majority
class and the other is a minority class, or if both classes are minority classes. We
counted around 40 imbalanced pairs in YAGO-Wikidata and 22 in NELL-DBpedia. Figure 3
shows the number of imbalanced alignments discovered by the different methods compared. On
both datasets, KGMatcher+ was able to discover over 60% of such imbalanced pairs (64% in
NELL-DBpedia and 63% in YAGO-Wikidata). On YAGO-Wikidata, for instance, KGMatcher+
discovered 25 out of the 40 imbalanced pairs, while the next best systems (AML and LogMap)
discovered only 11 and 9 pairs, respectively. The results indicate that mapping the schema of
large and common KGs is not a trivial task and needs to be carefully handled.</p>
        <p>[Figure 3 panels: (a) YAGO-Wikidata; (b) NELL-DBpedia.]</p>
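        <p>The definition of an imbalanced pair reduces to “at least one of the two classes is a minority class”, which the sketch below makes explicit. The function names and the representation of minority classes as sets are assumptions for illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def is_imbalanced_pair(c, c_prime, minority, minority_prime):
    """True if either class of the pair is a minority class in its own KG.

    This covers both cases in the definition: majority/minority pairs
    and minority/minority pairs.
    """
    return c in minority or c_prime in minority_prime


def discovered_imbalanced_share(system, reference, minority, minority_prime):
    """Fraction of imbalanced gold-standard pairs found by a matcher."""
    imbalanced = {pair for pair in reference
                  if is_imbalanced_pair(pair[0], pair[1], minority, minority_prime)}
    if not imbalanced:
        return 0.0
    return len(imbalanced & set(system)) / len(imbalanced)
```
]]></preformat>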
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Ablation Study</title>
        <p>Table 3 presents an ablation study for KGMatcher+ on the two datasets. In particular, we
look at the effects of the sampling strategy and the name matcher. In terms of resampling,
adopting a resampling strategy not only improves the processing time, as discussed earlier,
but also positively affects precision, recall and F-measure on both datasets. This is an encouraging
result, as the majority of undersampling methods often have a negative effect on the
learning process [15]. Regarding the name matcher, combining it with the instance-based
method improves KGMatcher+ results by increasing the recall on both datasets while
maintaining a good balance between precision and recall. While the terminological method
utilized by KGMatcher+ is the basic edit distance combined with a word embedding based
similarity metric, it achieves performance similar to state-of-the-art methods that utilize more
complex terminological and structural techniques, as shown in Table 2. This further demonstrates
that large-scale and cross-domain KGs are very different from conventional ontologies, and
therefore require more tailored solutions.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we introduced KGMatcher+, which specifically addresses the problem of
imbalanced class distribution in the task of matching the classes of common large KGs. To the best
of our knowledge, there is a lack of studies in this direction, and our work is the first that
addresses this problem in the context of KG matching. We experimented with different sampling
strategies, including one that is newly proposed in this work. We show that combining TF-IDF
undersampling and oversampling techniques outperforms other strategies. Our work provides
an empirical reference for future research on large KG matching, which is an increasing challenge
due to the typical class imbalance issues. Our future work will expand KGMatcher+ to map
KG properties and utilize the results of schema matching to align KG instances.</p>
      <p>[13] J. Ramos, et al., Using TF-IDF to determine word relevance in document queries, in: Proceedings of the first instructional conference on machine learning, volume 242, Citeseer, 2003, pp. 29–48.</p>
      <p>[14] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, SMOTE: synthetic minority over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.</p>
      <p>[15] C. Padurariu, M. E. Breaban, Dealing with data imbalance in text classification, Procedia Computer Science 159 (2019) 736–745.</p>
      <p>[16] B. Krawczyk, Learning from imbalanced data: open challenges and future directions, Progress in Artificial Intelligence 5 (2016) 221–232.</p>
      <p>[17] H. Feng, W. Qin, H. Wang, Y. Li, G. Hu, A combination of resampling and ensemble method for text classification on imbalanced data, in: International Conference on Big Data, Springer, 2021, pp. 3–16.</p>
      <p>[18] Z. Liu, W. Cao, Z. Gao, J. Bian, H. Chen, Y. Chang, T.-Y. Liu, Self-paced ensemble for highly imbalanced massive data classification, in: 2020 IEEE 36th international conference on data engineering (ICDE), IEEE, 2020, pp. 841–852.</p>
      <p>[19] C. Elkan, The foundations of cost-sensitive learning, in: International joint conference on artificial intelligence, volume 17, 2001, pp. 973–978.</p>
      <p>[20] G. King, L. Zeng, Logistic regression in rare events data, Political analysis 9 (2001) 137–163.</p>
      <p>[21] P. Krauss, schemaorg-wikidata-map, https://github.com/okfn-brasil/schemaOrg-Wikidata-Map, 2017.</p>
      <p>[22] H. Sven, P. Heiko, ATBox results for OAEI 2021, in: Proceedings of the 16th International Workshop on Ontology Matching co-located with the 20th International Semantic Web Conference (ISWC), volume 3063, 2022.</p>
      <p>[23] J. Portisch, H. Paulheim, Wiktionary matcher results for OAEI 2021, in: Proceedings of the 16th International Workshop on Ontology Matching, volume 3063, 2022, pp. 199–206.</p>
      <p>[24] E. Jiménez-Ruiz, LogMap family participation in the OAEI 2020, in: Proceedings of the 15th International Workshop on Ontology Matching (OM 2020), volume 2788, CEUR-WS, 2020, pp. 201–203.</p>
      <p>[25] J. Portisch, H. Paulheim, ALOD2Vec matcher results for OAEI 2021, in: Proceedings of the 16th International Workshop on Ontology Matching, volume 3063, 2022.</p>
      <p>[26] D. Faria, C. Pesquita, T. Tervo, F. M. Couto, I. F. Cruz, AML and AMLC results for OAEI 2019, in: Proceedings of the 16th International Workshop on Ontology Matching co-located with the 20th International Semantic Web Conference (ISWC), volume 3063, 2022.</p>
    </sec>
  </body>
</article>