<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>SORBETMatcher Results for OAEI 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francis Gosselin</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amal Zouaq</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>LAMA-WeST Lab, Department of Computer Engineering and Software Engineering</institution>
          ,
          <addr-line>Polytechnique Montreal, 2500 Chem. de Polytechnique, Montréal, QC H3T 1J4</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper presents the results of SORBETMatcher in the OAEI 2023 competition. SORBETMatcher is a schema matching system for both equivalence matching and subsumption matching. SORBETMatcher is largely based on SORBET Embeddings, a novel ontology embedding method that leverages large language models, random walks, and a regression loss to construct a latent space that encapsulates ontology structures. Despite recognizing certain limitations inherent in SORBET Embeddings, SORBETMatcher performed well in the OAEI competition. It emerged as the leading system in three out of the five subsumption matching challenges within the Bio-ML track, as well as in the equivalence matching problem involving ORDO-DOID.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology alignment</kwd>
        <kwd>Schema matching</kwd>
        <kwd>Representation Learning</kwd>
        <kwd>ISWC-2023</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.2. Specific techniques used</title>
      <sec id="sec-1-1">
        <title>1.2.1. Candidate Selection</title>
        <p>The first step of the matching process is to determine which concepts are likely to be matched.
Since fetching SORBET Embeddings can be a long process for large ontologies, reducing the
number of candidate concepts can greatly improve the runtime. There are three strategies to
obtain a smaller set of candidate classes. Firstly, we employ a string matcher that identifies pairs
of concepts with matching labels or synonyms as alignments. Concepts originating from these
high-precision alignments are pruned from the set of considered classes. Secondly, some classes
in the Bio-ML track have the use_in_alignment tag indicating whether they should be used or
not. Finally, in the local ranking of the Bio-ML track, candidate mappings are suggested in
a test.cands file. We identify each unique class in the candidates, and consider them as the sole
relevant classes.</p>
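        <p>The pruning step can be illustrated with a minimal Python sketch. It is not the SORBETMatcher implementation: the function names, the label normalization rule, and the <monospace>usable</monospace> dictionary (standing in for the use_in_alignment tag) are assumptions made for illustration only.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def normalize(label):
    """Lowercase and collapse whitespace, underscores and hyphens (assumed rule)."""
    return " ".join(label.lower().replace("_", " ").replace("-", " ").split())


def string_match(source_labels, target_labels):
    """Return (source_id, target_id) pairs sharing a normalized label or synonym.

    Both arguments map a class id to the list of its labels and synonyms.
    """
    index = {}
    for tgt_id, labels in target_labels.items():
        for label in labels:
            index.setdefault(normalize(label), set()).add(tgt_id)

    matches = set()
    for src_id, labels in source_labels.items():
        for label in labels:
            for tgt_id in index.get(normalize(label), ()):
                matches.add((src_id, tgt_id))
    return matches


def prune_candidates(source_labels, target_labels, usable=None):
    """Drop string-matched classes and classes flagged as not usable."""
    matches = string_match(source_labels, target_labels)
    matched_src = {s for s, _ in matches}
    matched_tgt = {t for _, t in matches}
    usable = usable or {}  # hypothetical stand-in for use_in_alignment
    keep_src = [c for c in source_labels
                if c not in matched_src and usable.get(c, True)]
    keep_tgt = [c for c in target_labels
                if c not in matched_tgt and usable.get(c, True)]
    return matches, keep_src, keep_tgt
```
]]></preformat>
        <p>Only the classes returned in <monospace>keep_src</monospace> and <monospace>keep_tgt</monospace> would then need an embedding, which is where the runtime saving comes from.</p>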
      </sec>
      <sec id="sec-1-2">
        <title>1.2.2. SORBET Embeddings</title>
        <p>SORBET is an ontology embedding method that aims to obtain rich BERT embeddings
while rearranging the latent space based on the ontology’s structure. To achieve this, SORBET
fine-tunes SentenceBERT, a pre-trained siamese BERT model, with a regression loss based on
the distance between classes:</p>
        <p>where M is a training dataset containing pairs of classes, the model output for a pair is a predicted similarity,
and A is a hyperparameter representing the distance between two classes. Intuitively, the parameter
A controls the sparsity of the ontology’s classes in the latent space: the larger the value of
A, the larger the distance between neighboring classes. The distance d is defined as the number of
subClassOf relationships between the two classes of a pair.</p>
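        <p>As an illustration of the distance d, the following Python sketch counts subClassOf edges on the shortest path between two classes. It assumes that edges may be traversed in both directions, which is one possible reading of the definition above. The function name and the <monospace>parents</monospace> data structure are hypothetical.</p>
        <preformat preformat-type="code"><![CDATA[
```python
from collections import deque


def subclass_distance(a, b, parents):
    """Number of subClassOf edges on the shortest path between a and b.

    parents maps each class to the list of its direct superclasses.
    Edges are followed in both directions (an assumption of this sketch).
    Returns None when the two classes are not connected.
    """
    graph = {}
    for child, supers in parents.items():
        for parent in supers:
            graph.setdefault(child, set()).add(parent)
            graph.setdefault(parent, set()).add(child)

    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None
```
]]></preformat>
        <p>With this reading, two sibling classes sharing a direct parent are at distance 2, and a class is at distance 1 from its direct parent.</p>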
        <p>To obtain SORBET embeddings representing classes, the input of the SentenceBERT model is a
random walk describing each class, providing context to classes given their neighbor subclasses,
parent classes, and classes related by object properties. Both at training and inference time, a
new random walk is created to describe a concept. The fine-tuning of SentenceBERT is achieved
with pairs of concepts composed of positive samples, semi-negative samples and negative
samples.</p>
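        <p>A minimal sketch of how a random walk could be turned into a textual model input is given below. The walk length, the comma-separated verbalization, and the <monospace>neighbors</monospace> dictionary are assumptions for illustration; the actual SORBET walk format may differ.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import random


def random_walk(start, neighbors, length=4, rng=None):
    """Return the list of class labels visited by a random walk from start.

    neighbors maps a class label to the labels of its subclasses, parent
    classes and classes related by object properties.
    """
    rng = rng or random.Random()
    walk = [start]
    current = start
    for _ in range(length):
        options = neighbors.get(current, [])
        if not options:
            break  # dead end: no neighbor to move to
        current = rng.choice(options)
        walk.append(current)
    return walk


def verbalize(walk):
    """Join the visited labels into one sentence-like input (assumed format)."""
    return ", ".join(walk)
```
]]></preformat>
        <p>Because a new walk is drawn at each call, the same class can be described by different contexts at training and inference time, as stated above.</p>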
        <p>By sampling a class and its neighbors, then computing a similarity score relative to their
distance, SORBET Embeddings attempt to replicate the structure of the ontology in the latent
space. Therefore, similar classes from different ontologies are confined to the same region of
the latent space, making the embeddings well-suited for the alignment or matching task.</p>
      </sec>
      <sec id="sec-1-3">
        <title>1.2.3. Compute cosine similarity matrix</title>
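        <p>Using the embeddings of all relevant classes, a similarity matrix is constructed using the cosine
similarity measure, as shown in equation 2:</p>
        <p>
          <disp-formula id="eq-2"><label>(2)</label><tex-math>S_{ij} = \cos\bigl(\Omega(s_i), \Omega(t_j)\bigr) = \frac{\Omega(s_i) \cdot \Omega(t_j)}{\lVert \Omega(s_i) \rVert \, \lVert \Omega(t_j) \rVert}</tex-math></disp-formula>
where Ω is a function that transforms a concept into its SORBET Embedding, and s<sub>i</sub> and t<sub>j</sub>
represent the source and target concept respectively.</p>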
        <p>The i-th row represents the i-th concept from the source ontology and the j-th column
represents the j-th concept from the target ontology. Some cells of this matrix are initialized with fixed
values. The similarity of alignments output by the string matcher (described in the candidate
selection in Section 1.2.1) is set to 1.0, while the columns and rows of the concepts whose
use_in_alignment property is False are set to 0. For local rankings, the similarity between pairs
of classes that are not in the candidates is set to 0. All the remaining cells are filled with the
cosine similarity values.</p>
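        <p>The construction of the matrix can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function and parameter names are hypothetical, embeddings are passed as lists of floats, and the precedence between overrides (exclusion first, then string matches, then the candidate filter) is an assumption, since the text does not specify how conflicts are resolved.</p>
        <preformat preformat-type="code"><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def similarity_matrix(src_emb, tgt_emb, string_matches=(),
                      unusable_src=(), unusable_tgt=(), allowed_pairs=None):
    """Build the source-by-target matrix with the fixed values described above.

    string_matches: (i, j) index pairs forced to 1.0.
    unusable_src / unusable_tgt: row / column indices forced to 0.
    allowed_pairs: optional (i, j) candidate pairs for local ranking;
    every pair outside this set is forced to 0.
    """
    string_matches = set(string_matches)
    unusable_src, unusable_tgt = set(unusable_src), set(unusable_tgt)
    allowed = None if allowed_pairs is None else set(allowed_pairs)

    sim = []
    for i, u in enumerate(src_emb):
        row = []
        for j, v in enumerate(tgt_emb):
            if i in unusable_src or j in unusable_tgt:
                value = 0.0  # class excluded through use_in_alignment
            elif (i, j) in string_matches:
                value = 1.0  # high-precision string match
            elif allowed is not None and (i, j) not in allowed:
                value = 0.0  # pair absent from the candidate list
            else:
                value = cosine(u, v)
            row.append(value)
        sim.append(row)
    return sim
```
]]></preformat>
        <p>The result is a list of rows, one per source concept, which is the form assumed by a matcher or ranker operating on the matrix.</p>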
      </sec>
      <sec id="sec-1-4">
        <title>1.2.4. Greedy Matcher with threshold</title>
        <p>
          To determine which mappings from the similarity matrix will be retained, we use a
straightforward greedy algorithm, akin to approaches in related works such as [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] and [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. This simple
algorithm sorts the similarity values, then iterates through each element of the matrix in descending
order of score, selecting a mapping provided that neither its source nor its target concept
has already been chosen. The algorithm continues until the similarity value goes below the threshold
value of 0.75. Because the neighbors of equivalent classes also tend to have high similarities,
the goal of the greedy matching algorithm is to reduce these false positive alignments and to
produce 1:1 alignments.
        </p>
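        <p>The greedy selection described above can be sketched in a few lines of Python. The function name and the list-of-rows matrix representation are assumptions of this sketch; the threshold default of 0.75 is the value reported in the text.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def greedy_match(sim, threshold=0.75):
    """Select 1:1 mappings from a similarity matrix, best scores first.

    sim is a list of rows (source concepts) over columns (target concepts).
    Returns a list of (source_index, target_index, score) tuples.
    """
    cells = [(score, i, j)
             for i, row in enumerate(sim)
             for j, score in enumerate(row)]
    cells.sort(key=lambda cell: cell[0], reverse=True)

    used_src, used_tgt, mappings = set(), set(), []
    for score, i, j in cells:
        if score < threshold:
            break  # every remaining score is lower still
        if i in used_src or j in used_tgt:
            continue  # keep the alignment 1:1
        used_src.add(i)
        used_tgt.add(j)
        mappings.append((i, j, score))
    return mappings
```
]]></preformat>
        <p>In this sketch a high-scoring neighbor of an already matched class is skipped as soon as its row or column is taken, which is how the 1:1 constraint removes such false positives.</p>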
      </sec>
      <sec id="sec-1-5">
        <title>1.2.5. Local Ranking</title>
        <p>The local ranking evaluation method only requires the target candidates of each source concept to be
sorted in descending order of similarity. The algorithm therefore iterates through each non-null row i and applies
an index sort that orders the target concepts (the columns j) from the most to the least likely match
for the source concept.</p>
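        <p>A possible Python sketch of this ranking step, together with a Hits@K computation over the resulting rankings, is shown below. The function names and the <monospace>gold</monospace> dictionary (mapping a source index to its reference target index) are hypothetical and only serve the illustration.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def rank_targets(sim):
    """For each non-null row, list target indices by descending similarity."""
    rankings = {}
    for i, row in enumerate(sim):
        if not any(row):
            continue  # null row: the source concept has no candidates
        rankings[i] = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
    return rankings


def hits_at_k(rankings, gold, k):
    """Fraction of ranked source concepts whose gold target is in the top k."""
    hits = sum(1 for i, order in rankings.items() if gold[i] in order[:k])
    return hits / len(rankings)
```
]]></preformat>
        <p>Cells that were set to 0 because the pair is not in the candidates end up at the bottom of each ranking, so they do not affect the top of the list.</p>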
      </sec>
    </sec>
    <sec id="sec-2">
      <title>1.3. Specific settings and Hyperparameters</title>
      <p>For the OAEI competition, two models with different hyperparameters were used: the MELT [<xref ref-type="bibr" rid="ref6">6</xref>]
submission model and the semi-supervised Bio-ML model. For both models, SORBET
Embeddings were trained starting from the pre-trained
SentenceBERT model sentence-transformers/all-MiniLM-L6-v2. The MELT model was trained simultaneously on the conference and anatomy
tracks with a value of A equal to 5, while no other changes were made to the original
hyperparameters used in SORBET [<xref ref-type="bibr" rid="ref4">4</xref>]. This model was also used for the evaluation of the unsupervised
equivalence matching of the Bio-ML track. This was done to show how SORBETMatcher
performs in a zero-shot learning context.</p>
      <p>For the remaining results of the Bio-ML track, SORBET was individually fine-tuned on each of the
sub-tracks, using the training reference alignments as positive samples in the SORBET training.
The hyperparameters of SORBET’s semi-supervised version on Bio-ML were the following.
For the A value, our experiments hinted that shallow ontologies are better embedded with a
low A value. Therefore, the OMIM-ORDO and NCIT-DOID sub-tracks kept a value of A of 4, while
for the rest of the sub-tracks A was reduced to 3. Other experiments also showed that
generating negative samples during training led to worse results, so we
removed them completely. This may be because negative samples are normally used
to increase precision at the cost of reducing recall. However, since precision is already high
in most sub-tracks, the trade-off can be counter-productive.</p>
      <sec id="sec-2-1">
        <title>2. Results</title>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>2.1. Anatomy</title>
      <p>Full results for all SORBETMatcher’s alignments are shown in Table 1.
The anatomy track involves aligning the Adult Mouse Anatomy (MA) with the NCI Thesaurus,
which describes Human Anatomy (NCI). SORBETMatcher achieved an F1-score of 0.909, with a
precision of 0.923 and a recall of 0.895. In comparison to other systems in this year’s competition,
SORBETMatcher obtained the 3rd position out of 9 based on the F1-score. However, it is worth noting
that SORBETMatcher’s performance lagged in terms of runtime, with a total time of 4032 seconds.</p>
    </sec>
    <sec id="sec-4">
      <title>2.2. Conference</title>
      <p>The conference track involves aligning a set of ontologies that describe the domain of conference
organization. This track encompasses multiple reference alignment sets, with M1 alignments
focusing solely on classes, M2 on properties, and M3 containing both classes and properties.
Given that SORBET is presently able to embed classes only, its performance is less
robust when applied to the M3 reference alignments.</p>
      <sec id="sec-4-0">
        <title>2.3. Bio-ML</title>
        <p>The Bio-ML track consists of 5 different reference alignments across multiple ontologies. It is separated
into equivalence matching and subsumption matching. SORBETMatcher participated in both
sub-tasks. The equivalence matching is further divided into 2 categories, one with the
unsupervised test set (100% of reference alignments) and one with the semi-supervised test set
(70% of reference alignments).</p>
      </sec>
      <sec id="sec-4-1">
        <title>3. General comments and Conclusion</title>
        <p>Overall, SORBETMatcher achieved top performance in some of the tracks, while leaving
room for improvement in others.</p>
        <p>The Bio-ML subsumption track is the task where SORBETMatcher scored the strongest,
with three first places and one second place. However, SORBETMatcher scored last in the
OMIM-ORDO sub-track by a large margin, especially for higher Hits@K. This may indicate a
flaw in the SORBET Embeddings obtained on the OMIM or ORDO ontologies. The nature of this
problem remains to be investigated, but our initial hypothesis is that it might be due to
the restriction axioms, which are numerous in these ontologies and are not considered
by SORBET in its semi-negative sampling.</p>
        <p>The results of the Bio-ML equivalence matching track are mixed. SORBETMatcher scored
best in the NCIT-DOID sub-track, where it achieved first place in both the unsupervised and
semi-supervised test sets. Considering the subsumption results for the NCIT-DOID sub-track, where
SORBETMatcher largely outperformed other systems, we hypothesize that SORBET
Embeddings are much more representative of ontologies with higher depths such as DOID. Another
conclusion we can draw from these results is the capability of SORBET Embeddings to work in
zero-shot learning tasks. Indeed, the unsupervised results all come from the MELT packaging
of the system, in which SORBET is frozen after being trained on the conference and anatomy
tracks. Therefore, at inference time, the BERT model has never seen the concepts to embed,
hence our conclusion about its zero-shot capability. On datasets like Pharm and Neoplas, however,
SORBETMatcher yielded disappointing results. The problem may be of the same nature as
for the OMIM-ORDO dataset in subsumption matching, but it could also stem from the lack of
hyperparameter tuning, to which the results can be very sensitive.</p>
        <p>As for the conference and anatomy tracks,
SORBETMatcher obtained good results, reaching second and third place respectively.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4. Acknowledgements</title>
        <p>This research has been funded by Canada’s NSERC Discovery Research Program.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Bento</surname>
          </string-name>
          , Amal Zouaq, and Michel Gagnon. “
          <article-title>Ontology Matching Using Convolutional Neural Networks”</article-title>
          . English.
          <source>In: Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . Marseille, France: European Language Resources Association, May
          <year>2020</year>
          , pp.
          <fpage>5648</fpage>
          -
          <lpage>5653</lpage>
          . isbn: 979-10-95546-34-4. url: https://aclanthology.org/2020.lrec-1.693.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Jiaoyan</given-names>
            <surname>Chen</surname>
          </string-name>
          et al. “
          <article-title>Contextual semantic embeddings for ontology subsumption prediction”</article-title>
          .
          <source>In: World Wide Web 26.5</source>
          (
          <issue>Sept</issue>
          .
          <year>2023</year>
          ), pp.
          <fpage>2569</fpage>
          -
          <lpage>2591</lpage>
          . issn: 1573-1413. doi: 10.1007/s11280-023-01169-9. url: https://doi.org/10.1007/s11280-023-01169-9.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Francis</given-names>
            <surname>Gosselin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amal</given-names>
            <surname>Zouaq</surname>
          </string-name>
          .
          <article-title>“SEBMatcher Results for OAEI 2022”</article-title>
          .
          <source>In: Ontology Matching 2022: Proceedings of the 17th International Workshop on Ontology Matching (OM 2022) co-located with the 21th International Semantic Web Conference (ISWC 2022), Hangzhou, China, virtual conference, October 23, 2022</source>
          . Vol.
          <volume>3324</volume>
          . CEUR Workshop Proceedings. CEUR-WS.org,
          <year>2022</year>
          , pp.
          <fpage>202</fpage>
          -
          <lpage>209</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Francis</given-names>
            <surname>Gosselin</surname>
          </string-name>
          and
          <string-name>
            <given-names>Amal</given-names>
            <surname>Zouaq</surname>
          </string-name>
          . “
          <article-title>SORBET: A Siamese Network for Ontology Embeddings Using a Distance-Based Regression Loss and BERT”</article-title>
          .
          <source>In: International Semantic Web Conference</source>
          . Springer.
          <year>2023</year>
          , pp.
          <fpage>561</fpage>
          -
          <lpage>578</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Yuan</given-names>
            <surname>He</surname>
          </string-name>
          et al. “
          <article-title>Bertmap: A bert-based ontology alignment system”</article-title>
          .
          <source>In: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . Vol.
          <volume>36</volume>
          . 5.
          <year>2022</year>
          , pp.
          <fpage>5684</fpage>
          -
          <lpage>5691</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Sven</given-names>
            <surname>Hertling</surname>
          </string-name>
          , Jan Portisch, and Heiko Paulheim. “
          <article-title>MELT - Matching EvaLuation Toolkit”</article-title>
          .
          <source>In: Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings</source>
          .
          <year>2019</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>245</lpage>
          . doi: 10.1007/978-3-030-33220-4_17. url: https://doi.org/10.1007/978-3-030-33220-4_17.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Vivek</given-names>
            <surname>Iyer</surname>
          </string-name>
          , Arvind Agarwal, and Harshit Kumar. “
          <article-title>VeeAlign: Multifaceted Context Representation Using Dual Attention for Ontology Alignment”</article-title>
          .
          <source>In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</source>
          . Ed. by Marie-Francine Moens et al. Online and Punta Cana, Dominican Republic: Association for Computational Linguistics, Nov.
          <year>2021</year>
          , pp.
          <fpage>10780</fpage>
          -
          <lpage>10792</lpage>
          . doi: 10.18653/v1/2021.emnlp-main.842. url: https://aclanthology.org/2021.emnlp-main.842.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Nils</given-names>
            <surname>Reimers</surname>
          </string-name>
          and
          <string-name>
            <given-names>Iryna</given-names>
            <surname>Gurevych</surname>
          </string-name>
          . “
          <article-title>Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks”</article-title>
          .
          <source>In: Conference on Empirical Methods in Natural Language Processing</source>
          .
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jifang</given-names>
            <surname>Wu</surname>
          </string-name>
          et al. “
          <article-title>Daeom: A deep attentional embedding approach for biomedical ontology matching”</article-title>
          .
          <source>In: Applied Sciences 10.21</source>
          (
          <year>2020</year>
          ), p.
          <fpage>7909</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>