<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Peter Kardos</string-name>
          <email>kardos@inf.u-szeged.hu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zsolt Szántó</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Richárd Farkas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Ontology Matching</institution>
          ,
          <addr-line>Ontology Alignment, Language Models</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <fpage>23</fpage>
      <lpage>24</lpage>
      <abstract>
        <p>This paper presents the results of the WomboCombo matcher in the Ontology Alignment Evaluation Initiative (OAEI) 2022. WomboCombo is an ontology matching tool that finds node pairs through a sequence of steps, starting with simple exact string matching and moving on to more complex steps based on neural language models. We also train a classifier to differentiate between entities with the same meaning and entities with merely similar meaning.</p>
        <p>Word meaning based matcher over Combinations (WomboCombo) is a multi-stage ontology matching system that uses only textual information to find the same entities in two knowledge graphs. The first step is a simple pairing process based on exact string matching, followed by increasingly complex and resource-intensive steps. Later stages use pretrained language models to find entities with the same meaning but different lexical representations. Each stage has its own output, and these outputs are then combined into a final alignment (hence the name Combo). WomboCombo was built for, and tested only on, the Knowledge Graph track, and it focuses mainly on instance pairs. We made this choice because instance nodes carry the core information of a graph. In addition, a graph contains only a handful of classes and properties, which makes them easier to pair correctly, and most knowledge graphs do not represent classes or properties at all. WomboCombo is implemented in Python and is compatible with the Matching EvaLuation Toolkit (MELT) [<xref ref-type="bibr" rid="ref1">1</xref>] with SEALS packaging.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Matching</kwd>
        <kwd>Ontology Alignment</kwd>
        <kwd>Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.2.1. Exact matching module</title>
      <p>The exact matching module takes a property as input and, given the two graphs, searches
for nodes whose values for this property have the same string representation.</p>
      <p>We also experimented with fuzzy string matching algorithms under different parameters;
however, these introduced more noise than correct pairs, so we
discarded fuzzy string matching.</p>
      <p>WomboCombo’s first two steps both use the exact matching module, each with a different property.
The first step uses the nodes’ Label property and the second step uses the nodes’ AltLabel property;
both produce high-precision matches. Since precision is already high, the
following modules focus on increasing the recall of the final alignment.</p>
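      <p>The exact matching step can be illustrated with a minimal Python sketch. It is not the WomboCombo implementation: the graph representation (a dictionary from node id to property values), the function name exact_match, and the toy nodes are assumptions made only for this example.</p>
      <preformat>
```python
from collections import defaultdict


def exact_match(graph_a, graph_b, prop):
    """Pair nodes of two graphs whose `prop` values are identical strings.

    Assumption: each graph is a dict mapping a node id to a dict of
    properties, where a property value is a string or a list of strings.
    """
    def values(node_props):
        raw = node_props.get(prop, [])
        return [raw] if isinstance(raw, str) else list(raw)

    # Index graph B by the string value of the chosen property.
    index = defaultdict(set)
    for node_b, props in graph_b.items():
        for value in values(props):
            index[value].add(node_b)

    # Look up every property value of graph A in that index.
    pairs = set()
    for node_a, props in graph_a.items():
        for value in values(props):
            for node_b in index.get(value, ()):
                pairs.add((node_a, node_b))
    return pairs


# Toy graphs (hypothetical data, for illustration only).
graph_a = {
    "a1": {"label": "Iron Man", "altLabel": ["Tony Stark"]},
    "a2": {"label": "Thor"},
}
graph_b = {
    "b1": {"label": "Iron Man"},
    "b2": {"label": "Anthony Stark", "altLabel": ["Tony Stark"]},
}
label_pairs = exact_match(graph_a, graph_b, "label")
alt_label_pairs = exact_match(graph_a, graph_b, "altLabel")
```
      </preformat>
      <p>Calling the same function once with the Label property and once with the AltLabel property mirrors the first two steps of the pipeline.</p>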
    </sec>
    <sec id="sec-2">
      <title>1.2.2. SentenceBert module</title>
      <p>This module loads a pretrained SentenceBert model and outputs a vector embedding for each of
the nodes in the two graphs using their textual description (abstract). Each node of
graph A is paired with its most similar nodes from the other graph according to cosine similarity.
The main purpose of the module is to produce a pool of candidate pairs whose nodes have similar meaning.</p>
      <p>
        This is the third step in our pipeline. We load the all-MiniLM-L6-v2 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] model and embed the text of each node’s
abstract property, trimmed to its first two sentences, to get a vector representation.
At most the top 6 pairs are gathered for each node, and all pairs below a cosine
similarity threshold of 0.6 are discarded. If a graph has no abstract property, this module is inactive.
      </p>
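      <p>The candidate selection described above can be sketched as follows. This is a simplified illustration, assuming the sentence embeddings have already been computed and are available as plain lists of floats; the function names cosine and candidate_pairs and the toy vectors are hypothetical. Only the limit of 6 candidates and the 0.6 threshold come from the text.</p>
      <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def candidate_pairs(emb_a, emb_b, top_k=6, threshold=0.6):
    """For each node of graph A keep at most `top_k` nodes of graph B
    whose cosine similarity reaches `threshold`.

    emb_a, emb_b: dicts mapping a node id to its embedding vector
    (assumed to be precomputed from the trimmed abstracts).
    """
    pairs = []
    for node_a, vec_a in emb_a.items():
        scored = [(cosine(vec_a, vec_b), node_b) for node_b, vec_b in emb_b.items()]
        scored.sort(reverse=True)  # most similar first
        for score, node_b in scored[:top_k]:
            if score >= threshold:
                pairs.append((node_a, node_b, score))
    return pairs


# Toy two-dimensional embeddings (hypothetical, for illustration only).
emb_a = {"a1": [1.0, 0.0]}
emb_b = {"b1": [1.0, 0.0], "b2": [0.0, 1.0], "b3": [0.8, 0.6]}
pool = candidate_pairs(emb_a, emb_b)
```
      </preformat>
      <p>In this toy case the pool keeps the pairs of a1 with b1 and b3, while b2 falls below the threshold and is dropped.</p>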
    </sec>
    <sec id="sec-3">
      <title>1.2.3. Same vs Similar module</title>
      <p>The Same vs Similar module is the most resource-intensive one. Our goal with this module
is to train a classifier that can differentiate between two nodes that are merely similar and two
nodes that represent the same concept. We achieve this by automatically creating a training
dataset with two classes, {same, similar}, and training a language model based classifier. Given
a pool of candidate pairs, the trained model discards the pairs it predicts as similar and returns the
remaining ones.</p>
      <p>
        In our submission the selected language model was albert-base [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] pretrained on the MRPC
task, which is a sentence similarity task. The pairs we considered to have the same meaning (positives)
were the exact matching pairs on the Label property (the first step’s output). For each of these we generated one
similar pair (negative) using the SentenceBert module, replacing one
of the nodes of the pair with the highest-ranking node that was not the positive match.
These negative pairs might contain noise, but as the task only has 1:1 gold pairs the noise should
be minimal. We used the abstract property as the textual information for each node. For
training, we used a batch size of 1 and a learning rate of 10<sup>-5</sup>. Training
ran for up to 100 epochs, but early stopping with a patience of 5 could end it sooner,
which requires splitting the dataset into training and evaluation sets. To get the output alignment
of this module, we used the trained classifier to filter the SentenceBert module’s output.
      </p>
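      <p>The automatic construction of the training data can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name build_training_pairs, the input format, and the choice to always replace the graph-B node of a positive pair are hypothetical and not taken from the WomboCombo code.</p>
      <preformat>
```python
def build_training_pairs(positive_pairs, ranked_candidates):
    """Create labelled examples for the same-vs-similar classifier.

    positive_pairs: list of (node_a, node_b) exact Label matches.
    ranked_candidates: dict mapping node_a to graph-B nodes, ordered
        from most to least similar (e.g. by embedding similarity).
    Returns (node_a, node_b, label) triples, label in {"same", "similar"}.
    """
    examples = []
    for node_a, node_b in positive_pairs:
        examples.append((node_a, node_b, "same"))
        # Negative example: the highest-ranking candidate that is not
        # the positive match. At most one negative per positive.
        for candidate in ranked_candidates.get(node_a, []):
            if candidate != node_b:
                examples.append((node_a, candidate, "similar"))
                break
    return examples


# Toy inputs (hypothetical, for illustration only).
positives = [("a1", "b1"), ("a2", "b2")]
ranked = {"a1": ["b1", "b3", "b2"], "a2": ["b4", "b2"]}
training_set = build_training_pairs(positives, ranked)
```
      </preformat>
      <p>Each positive pair contributes at most one negative, so the two classes stay roughly balanced; a positive with no other candidate simply gets no negative.</p>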
    </sec>
    <sec id="sec-4">
      <title>1.2.4. Union and Filtering module</title>
      <p>This module forms the union of different alignments while taking the confidence of each candidate into account. As
the gold pairs are all 1:1 matches, we run a Top 1 filtering based on the confidence score of each
pair. In our pipeline the final alignment is calculated from three alignment pools, processed in the following
order: Exact Matching over Labels, Exact Matching over AltLabels, and the Same vs Similar
module’s output. Once a pair is selected into the final alignment, we do not add
any further connections to its nodes while merging the different pair sets.</p>
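      <p>A minimal Python sketch of this ordered union with 1:1 filtering is given below. It is one possible reading of the description, not the actual implementation: the function name union_top1, the triple format, and the greedy choice of the most confident pair within each pool are assumptions.</p>
      <preformat>
```python
def union_top1(alignment_pools):
    """Merge alignment pools in priority order into a 1:1 alignment.

    alignment_pools: list of pools, highest priority first; each pool is
        a list of (node_a, node_b, confidence) triples.
    """
    used_a, used_b, final = set(), set(), []
    for pool in alignment_pools:
        # Within a pool, consider the most confident pairs first.
        for node_a, node_b, conf in sorted(pool, key=lambda p: p[2], reverse=True):
            if node_a in used_a or node_b in used_b:
                continue  # one endpoint is already aligned
            used_a.add(node_a)
            used_b.add(node_b)
            final.append((node_a, node_b, conf))
    return final


# Toy pools (hypothetical, for illustration only).
label_pool = [("a1", "b1", 1.0)]
alt_label_pool = [("a2", "b2", 1.0), ("a1", "b3", 1.0)]
same_vs_similar_pool = [("a3", "b2", 0.9), ("a3", "b4", 0.7), ("a4", "b4", 0.8)]
alignment = union_top1([label_pool, alt_label_pool, same_vs_similar_pool])
```
      </preformat>
      <p>Because earlier pools take precedence, a lower-priority pair is dropped whenever either of its nodes has already been aligned, regardless of its confidence.</p>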
      <sec id="sec-4-1">
        <title>2. Results</title>
        <sec id="sec-4-1-1">
          <title>2.1. Knowledge Graph</title>
          <p>WomboCombo was not officially evaluated because the organizers reported a TypeError when running the
system. We therefore report the scores achieved on the Knowledge Graph V4 test cases, calculated
on our local machine using the official library and evaluation code. Our score on the
marvel dataset is much lower than on the other datasets because the abstract field is missing for most of
its nodes.</p>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>3. General comments</title>
        <p>We tested our system with different parameters, for example whether to trim the abstracts or
whether to use SentenceBert to vectorize labels/altlabels, and we also tried different pretrained language
models. We could not find one parameter set that works best on all 5 datasets
(marvelcinematicuniverse-marvel, memoryalpha-memorybeta, memoryalpha-stexpanded,
starwars-swg, starwars-swtor); mostly,
different parameters worked better on certain datasets. For submission we selected the
parameters with the best mean scores over the datasets.</p>
        <p>WomboCombo is not the best choice for matching properties or classes, as these nodes do
not have abstract fields most of the time. For them, only the exact match pairs can be found
in the resulting alignment. This was a big issue on the Marvel datasets, where even among the instances
90% of the nodes had no abstract property.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4. Conclusions</title>
        <p>In this paper, we presented the WomboCombo matching system and its results in the OAEI
2022 campaign. The system participated only in the Knowledge Graph track. Our solution
considers only the textual information of a node and creates pairs using a multi-step process that
includes exact string matching as well as more complex steps based on neural language models.
The results show that these complex steps can successfully find non-trivial pairs, improving on the
most basic matchers.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Portisch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>MELT - Matching EvaLuation Toolkit</article-title>
          , in:
          <string-name>
            <given-names>M.</given-names>
            <surname>Acosta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cudré-Mauroux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Maleshkova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pellegrini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sack</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sure-Vetter</surname>
          </string-name>
          (Eds.),
          <source>Semantic Systems. The Power of AI and Knowledge Graphs</source>
          , Springer International Publishing, Cham,
          <year>2019</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>245</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>N.</given-names>
            <surname>Reimers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Gurevych</surname>
          </string-name>
          ,
          <article-title>Sentence-BERT: Sentence embeddings using siamese BERT-networks</article-title>
          , CoRR abs/1908.10084 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1908.10084. arXiv:1908.10084.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Goodman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Gimpel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Soricut</surname>
          </string-name>
          ,
          <article-title>ALBERT: A lite BERT for self-supervised learning of language representations</article-title>
          , CoRR abs/1909.11942 (
          <year>2019</year>
          ). URL: http://arxiv.org/abs/1909.11942. arXiv:1909.11942.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>