<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francis Gosselin</string-name>
          <email>francis.gosselin@polymtl.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Amal Zouaq</string-name>
          <email>amal.zouaq@polymtl.ca</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Chem. de Polytechnique</institution>
          ,
          <addr-line>Montréal, QC H3T 1J4</addr-line>
          ,
          <country country="CA">Canada</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>LAMA-WeST Lab, Department of Computer Engineering and Software Engineering</institution>
          ,
          <addr-line>Polytechnique Montreal, 2500</addr-line>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2022</year>
      </pub-date>
      <abstract>
        <p>This paper presents the results of the Structural Embeddings with BERT Matcher (SEBMatcher) in the OAEI 2022 competition. SEBMatcher is a novel schema matching system that employs a two-step approach: an unsupervised pre-training of a BERT Masked Language Modeling model fed with random walks, followed by a supervised training of a BERT for sequence classification fed with positive and negative mappings. This is SEBMatcher's first year of participation in the OAEI, and it obtained promising results in the tracks in which it participated.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1.2. Specific techniques used</title>
      <sec id="sec-1-1">
        <title>1.2.1. Preprocessing</title>
        <p>During the preprocessing of the ontologies, we transform classes and properties into tokens that
can be processed by BERT. We first apply a basic spelling correction to incorrect labels using
the Python library autocorrect, which relies on an edit-distance algorithm. We then resolve niche
acronyms and add synonyms from WordNet, since not all ontologies define synonyms for their
classes (with oboInOwl:hasRelatedSynonym, for example). We also apply
basic preprocessing such as case folding, tokenization, and punctuation removal. Finally, we
remove tokens that are not part of the BERT vocabulary.</p>
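        <p>The basic steps above (tokenization, punctuation removal, case folding, vocabulary filtering) can be sketched as follows. This is a minimal illustration, not the system's code: the function name preprocess_label and the toy vocabulary are hypothetical, and spelling correction, acronym resolution, and WordNet synonym lookup are left out.</p>
        <preformat><![CDATA[
```python
import re

def preprocess_label(label, vocabulary):
    """Turn a raw class label into lower-case tokens found in `vocabulary`."""
    # Split CamelCase boundaries: "ProgramCommitteeChair" -> "Program Committee Chair".
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)
    # Punctuation removal: anything that is not a letter or digit becomes a space.
    spaced = re.sub(r"[^A-Za-z0-9]+", " ", spaced)
    # Case folding and whitespace tokenization.
    tokens = spaced.casefold().split()
    # Drop tokens that are absent from the (hypothetical) vocabulary.
    return [t for t in tokens if t in vocabulary]

toy_vocabulary = {"program", "committee", "chair", "member", "person"}
print(preprocess_label("ProgramCommitteeChair", toy_vocabulary))
print(preprocess_label("program_committee-member!", toy_vocabulary))
```
]]></preformat>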
      </sec>
      <sec id="sec-1-2">
        <title>1.2.2. Step 1: Unsupervised Model for Synonym and Context Embedding</title>
        <p>The purpose of this pre-training is, first, to introduce BERT to ontologies represented as sentences.
More importantly, it provides a first refinement of the concept (class) embeddings by
bringing a concept and its synonyms close to each other in the vector space. This is done with
BERT’s Masked Language Modeling (MLM) training objective. The pre-training is crucial
because the model needs it to capture information about the context and structure of
an ontology. Our experiments have shown that without this pre-training, the model struggles to
achieve a high score.</p>
        <p>Random walks. Random walks [<xref ref-type="bibr" rid="ref6">6</xref>] are how we have chosen to represent a concept and its
context. In this step, a concept is replaced by its tree walk before being fed into BERT. The
tree walk is essentially a set of random walks, defined in Algorithm 1, over the taxonomical
structure and the object properties related to the concept. This algorithm takes as input a root
concept from an ontology and outputs multiple random walks starting from the root concept.
The resulting tree walk does not contain the same concept twice (aside from the root concept),
which adds more diversity to the context of the root concept. If the ontology contains
synonyms or multiple labels for the same concept, we randomly select one label every time we
append this concept to a walk. To represent the relation SUBCLASS OF, the symbol &lt; or &gt; is
used. If a child class is to the right of its parent class, we denote the relation as &gt;, e.g. “Person &gt;
Teacher”; otherwise we use &lt;.</p>
        <sec id="sec-1-2-2">
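          <p>Algorithm 1 Tree walk</p>
          <preformat>Input: root concept, max_branches, max_path_length
Output: tree walk
1: Initialize set of visited nodes visited := {}
2: Initialize walk := [ ]
3: n_branches ← random(1..max_branches)
4: for i := 0 to n_branches do
5:     current ← root concept
6:     path_length ← random(1..max_path_length)
7:     for j := 0 to path_length do
8:         neighbours ← get_neighbours(current) \ visited
9:         next, relation ← random_choice(neighbours)
10:        Append relation, next to walk
11:        visited ← visited ∪ {next}
12:        current ← next</preformat>
          <table-wrap>
            <table>
              <thead>
                <tr><th>Preprocessing step</th><th>Sample of data</th></tr>
              </thead>
              <tbody>
                <tr><td>Concept</td><td>ProgramCommitteeChair</td></tr>
                <tr><td>Raw Tree Walk</td><td>ProgramCommitteeChair &lt; ProgramCommitteeMember &lt; Person</td></tr>
                <tr><td>Tokenized Tree Walk</td><td>Program Committee Chair &lt; Program Committee Member &lt; Person</td></tr>
                <tr><td>Masked Tree Walk</td><td>Program Committee Chair &lt; [MASK] [MASK] [MASK] &lt; Person</td></tr>
              </tbody>
            </table>
          </table-wrap>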
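          <p>The tree walk procedure described in this section can be sketched in Python as below. This is an illustrative reading of the description, not the authors' implementation: the function name tree_walk, the dictionary-based toy ontology, and the choice to stop a branch early at a dead end are all assumptions.</p>
          <preformat><![CDATA[
```python
import random

def tree_walk(root, neighbours, max_branches=5, max_path_length=5, rng=None):
    """Return a list of random walks (branches) that all start at `root`.

    `neighbours` maps a concept to a list of (relation, concept) pairs.
    """
    rng = rng or random.Random()
    visited = {root}            # the root starts every branch, nothing else repeats
    walks = []
    for _ in range(rng.randint(1, max_branches)):
        current, walk = root, [root]
        for _ in range(rng.randint(1, max_path_length)):
            options = [(rel, c) for rel, c in neighbours.get(current, []) if c not in visited]
            if not options:     # dead end: stop this branch early (assumption)
                break
            relation, nxt = rng.choice(options)
            walk += [relation, nxt]
            visited.add(nxt)
            current = nxt
        if len(walk) > 1:       # keep only branches that left the root
            walks.append(" ".join(walk))
    return walks

# Hypothetical toy ontology: "<" points to a parent, ">" to a child.
toy_ontology = {
    "ProgramCommitteeChair": [("<", "ProgramCommitteeMember")],
    "ProgramCommitteeMember": [("<", "Person"), (">", "ProgramCommitteeChair")],
    "Person": [(">", "ProgramCommitteeMember")],
}
print("; ".join(tree_walk("ProgramCommitteeChair", toy_ontology, rng=random.Random(0))))
```
]]></preformat>
          <p>Because visited concepts are shared across branches, later branches explore different neighbours than earlier ones, which is what diversifies the context of the root concept.</p>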
          <p>
            BERT Masked Language Modeling (MLM). The Masked Language Modeling learning
technique consists of masking a portion of the input tokens (in our case, tokens of
concepts) and training the model to predict the masked tokens. At inference, we omit the
language modeling head since it is not needed. To preserve the lexical integrity of concepts,
masks are applied to all the subtokens of a masked concept. This rule does not apply to concepts
with many tokens (more than 5), since the task would be too hard. 15% of concepts are masked. The pre-trained
model ClinicalBERT [
            <xref ref-type="bibr" rid="ref1">1</xref>
            ] provides the initial weights of the model; it was chosen for the Anatomy
track. Given that the same model must be used across tracks in the OAEI competition, we kept
it for all tracks. However, more domain-specific models could enhance the performance in
specialized tracks.
          </p>
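          <p>A minimal sketch of concept-level masking is given below. It assumes that concepts longer than 5 tokens are simply never masked, which is one possible reading of the rule above; the function name mask_concepts and the list-of-token-lists representation are hypothetical, and relation symbols are omitted for brevity.</p>
          <preformat><![CDATA[
```python
import random

def mask_concepts(concepts, mask_rate=0.15, max_tokens=5, rng=None):
    """Mask every sub-token of the randomly selected concepts.

    `concepts` is a tree walk given as a list of token lists, one per concept.
    Concepts with more than `max_tokens` tokens are never masked (assumption).
    """
    rng = rng or random.Random()
    masked = []
    for tokens in concepts:
        if len(tokens) <= max_tokens and rng.random() < mask_rate:
            masked.append(["[MASK]"] * len(tokens))
        else:
            masked.append(list(tokens))
    return masked

walk = [["program", "committee", "chair"], ["program", "committee", "member"], ["person"]]
# mask_rate=1.0 forces masking here so the effect is visible on a tiny walk.
print(mask_concepts(walk, mask_rate=1.0))
```
]]></preformat>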
          <p>Pooling layer. To retrieve the contextual embedding of a concept, its tree walk is first
passed through the trained model. The output of BERT’s last hidden layer is then filtered to keep only
the token embeddings corresponding to the root node. Finally, MEAN pooling is applied to
the filtered token embeddings. For example, given a root node “Science” and the tree walk
“Science &lt; Subject; Science &gt; Computer Science”, we obtain the embedding of Science by averaging
the embeddings of the two root “Science” tokens. This filtering keeps the embedding from
drifting too far from the concept: we hypothesize that the other concepts of the walk are
not needed in the pooling layer, since they already contextualize the concept through the attention
layers. As a result, the tree walk embedding of the root node stays close to its
single concept embedding.</p>
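          <p>The filtering and MEAN pooling can be illustrated with plain lists standing in for BERT's last hidden states. The vectors and the function name pool_root_embedding are made up for the example; only the two tokens that belong to the root node contribute to the average.</p>
          <preformat><![CDATA[
```python
def pool_root_embedding(token_vectors, root_mask):
    """Average the vectors whose mask entry is True (the root-node tokens)."""
    selected = [vec for vec, keep in zip(token_vectors, root_mask) if keep]
    if not selected:
        raise ValueError("the walk contains no root token")
    dim = len(selected[0])
    return [sum(vec[d] for vec in selected) / len(selected) for d in range(dim)]

# Toy 2-dimensional "hidden states" for: Science < Subject ; Science > Computer Science
tokens = ["science", "<", "subject", ";", "science", ">", "computer", "science"]
hidden = [[1.0, 0.0], [0.0, 0.0], [0.5, 0.5], [0.0, 0.0],
          [3.0, 2.0], [0.0, 0.0], [0.2, 0.9], [9.0, 9.0]]
# The last "science" belongs to "Computer Science", so it is not a root token.
root_mask = [True, False, False, False, True, False, False, False]
print(pool_root_embedding(hidden, root_mask))  # [2.0, 1.0]
```
]]></preformat>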
          <p>Similarity Matrix. To determine which candidate mappings (pairs of concepts) will be
considered during inference, we perform a first computation of similarity based on the concept
embeddings obtained after the pooling layer. We compute a matrix whose rows represent
the concepts of the source ontology and whose columns represent the concepts of the target
ontology; each entry is the cosine similarity between the embeddings of the corresponding
pair of concepts, as given by formula 1.</p>
          <p>
            <disp-formula>
              <label>(1)</label>
              <tex-math><![CDATA[S_{c,c'} = \mathrm{cosine\_similarity}(\Omega(c), \Omega(c'))]]></tex-math>
            </disp-formula>
            where Ω is the function that transforms a concept into its tree walk embedding, and
c and c′ are the source and target concepts respectively.</p>
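          <p>As a small worked example of this matrix, the sketch below computes pairwise cosine similarities between two lists of toy embeddings using only the standard library. The helper names and the vectors are illustrative assumptions; a real system would typically use a vectorized library.</p>
          <preformat><![CDATA[
```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def similarity_matrix(source_embeddings, target_embeddings):
    """Rows follow the source concepts, columns the target concepts."""
    return [[cosine_similarity(s, t) for t in target_embeddings] for s in source_embeddings]

source = [[1.0, 0.0], [0.0, 2.0]]
target = [[2.0, 0.0], [1.0, 1.0], [0.0, -1.0]]
for row in similarity_matrix(source, target):
    print([round(x, 3) for x in row])
```
]]></preformat>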
        </sec>
      </sec>
      <sec id="sec-1-3">
        <title>1.2.3. Step 2: Supervised Alignment Scoring</title>
        <p>The second step of the SEBMatcher system is a supervised classification task whose objective is
to distinguish between positive and negative mappings or alignments. During this step, a BERT
for sequence classification takes pairs of tree walks (representing our concepts) as input and
returns whether they are valid mappings or not. At inference time, candidate mappings are
evaluated and only a subset is retained as valid candidates in the final alignments, including
highly similar lexical-based alignments (string alignments).</p>
        <p>
          Candidate Selection. Since the BERT classifier takes tree walk pairs and outputs a mapping
score, it does not produce tree walk embeddings. Thus we cannot directly compute a cosine
similarity matrix from a dot product. In the supervised model, each mapping must be passed to
the BERT classifier to obtain a similarity score. Given a source ontology O and a target ontology
O′, computing a similarity matrix over all |O||O′| possible alignments would require a vast
amount of time. A conventional solution to this problem [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] is candidate selection, a module
that restricts the computation to highly probable matches. When comparing
methods of obtaining candidate mappings, the best method should first have the highest
Hits@K, where K is the number of most probable target concepts retained for each source
concept, and second have a low computation time.
        </p>
        <p>Our system’s candidate selection is a mix of lexical matching and concept embedding
similarities. First, SEBMatcher computes string alignments, which are mappings where the source
and target concepts have at least one completely matching label or synonym. For optimization
purposes, the string alignments are directly treated as part of the final alignments and are not
passed to the BERT alignment classifier. Then, for each source concept, we take the subset of the
|O′| possible mappings that consists of the K = 10 mappings with the highest cosine similarity, effectively
reducing the number of considered candidates to K|O|.</p>
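        <p>The two parts of candidate selection can be sketched as follows. The function name select_candidates and the index-pair representation are hypothetical, and removing string alignments from the top-K candidates is an assumption made so that those pairs are never sent to the classifier.</p>
        <preformat><![CDATA[
```python
def select_candidates(similarity, source_labels, target_labels, k=10):
    """Return (string_alignments, candidates) as sets of (source index, target index)."""
    # String alignments: at least one label or synonym matches exactly.
    string_alignments = {
        (i, j)
        for i, s_labels in enumerate(source_labels)
        for j, t_labels in enumerate(target_labels)
        if set(s_labels) & set(t_labels)
    }
    candidates = set()
    for i, row in enumerate(similarity):
        # Indices of the k most similar target concepts for source concept i.
        top_k = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        candidates.update((i, j) for j in top_k)
    # String alignments bypass the classifier, so they are not candidates (assumption).
    return string_alignments, candidates - string_alignments

similarity = [[0.9, 0.2, 0.4], [0.1, 0.8, 0.7]]
source_labels = [["person"], ["paper", "article"]]
target_labels = [["person", "human"], ["document"], ["article"]]
strings, candidates = select_candidates(similarity, source_labels, target_labels, k=2)
print(sorted(strings), sorted(candidates))
```
]]></preformat>
        <p>With K fixed, the number of pairs scored by the classifier grows linearly with the size of the source ontology instead of with the product of both ontology sizes.</p>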
        <p>Alignments Training Dataset. The supervised model requires a dataset of negative and
positive alignments during training. This dataset consists of reference alignments, string
alignments, and generated positive and negative alignments, as described below:</p>
        <list list-type="bullet">
          <list-item>
            <p>Negative alignments consist of a mix of hard and soft samples. Hard negative samples
are negative alignments where the source and target concepts appear to be closely related.
These samples are useful because they are the cases the model must predict correctly at inference time.
Soft negative samples are alignments that the model should have no trouble ruling out,
since they are randomly chosen. The methods to generate these alignments are:</p>
            <list list-type="order">
              <list-item>
                <p>hard intra-negative sampling: a negative alignment consisting of a random concept
and one of its neighbours.</p>
              </list-item>
              <list-item>
                <p>hard inter-negative sampling: a negative sample made by taking a reference or string
alignment and replacing either the source or target concept with one of its neighbouring
concepts.</p>
              </list-item>
              <list-item>
                <p>soft inter-negative sampling: a set of randomly chosen pairs of concepts from 2
different ontologies that are assured not to be positive alignments.</p>
              </list-item>
            </list>
          </list-item>
          <list-item>
            <p>Positive alignments are a mix of reference alignments and generated alignments. The
goal here is to have the most diverse set of alignments possible in order to regularize our
model, the same way data augmentation is used in Machine Learning.</p>
            <list list-type="order">
              <list-item>
                <p>reference mappings: 20% of the reference alignments. As per the OAEI rules, the
system must be general and cannot be fine-tuned for a single task. Therefore,
alignments from all tracks (Anatomy and Conference) where SEBMatcher participated
were used.</p>
              </list-item>
              <list-item>
                <p>string alignments: mappings computed in the candidate selection step that are
considered positive alignments.</p>
              </list-item>
              <list-item>
                <p>intra-positive sampling: a random concept paired with a copy of itself.</p>
              </list-item>
            </list>
          </list-item>
        </list>
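        <p>Two of the negative sampling strategies listed above can be sketched as below. The function names, the neighbour dictionaries, and the 50/50 choice of which side to replace are assumptions made for illustration; hard intra-negative sampling is omitted for brevity.</p>
        <preformat><![CDATA[
```python
import random

def hard_inter_negative(alignment, source_neighbours, target_neighbours, positives, rng):
    """Swap one side of a positive alignment for one of its neighbours."""
    source, target = alignment
    if rng.random() < 0.5:      # which side to replace is chosen at random (assumption)
        options = [(n, target) for n in source_neighbours.get(source, [])]
    else:
        options = [(source, n) for n in target_neighbours.get(target, [])]
    options = [pair for pair in options if pair not in positives]
    return rng.choice(options) if options else None

def soft_inter_negative(source_concepts, target_concepts, positives, rng, max_tries=100):
    """Draw a random cross-ontology pair that is not a known positive."""
    for _ in range(max_tries):
        pair = (rng.choice(source_concepts), rng.choice(target_concepts))
        if pair not in positives:
            return pair
    return None

positives = {("Author", "Writer")}
src_nb = {"Author": ["Person", "Reviewer"]}
tgt_nb = {"Writer": ["Human"]}
rng = random.Random(0)
print(hard_inter_negative(("Author", "Writer"), src_nb, tgt_nb, positives, rng))
print(soft_inter_negative(["Author", "Person"], ["Writer", "Human"], positives, rng))
```
]]></preformat>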
        <p>BERT Alignment Classifier. The scoring of a source and target tree walk pair is done
with a BERT for sequence classification model that is fine-tuned from the weights of
the BERT MLM model. The intuition is that the classifier should learn more easily
if it benefits from the representations and attention weights learned during
the MLM training. To produce a mapping score, the tree walks of the source and target
concepts are concatenated into one string with the [SEP] token between them. This
string is then passed to the BERT classifier, which outputs the probability of the pair being a positive
alignment. At the start of step 2 (supervised training), the weights of the BERT
MLM are copied into the BERT classifier. Note that this cannot be done for the heads of the
models, since this is where the two architectures differ. The weights of the BERT classifier are
updated during training, but the BERT MLM model does not inherit these new weights.</p>
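        <p>The input construction and the final probability can be illustrated without a model. The sketch below assumes a classification head that emits two logits (negative, positive) turned into a probability with a softmax; this head layout and both function names are assumptions, not details given in the text.</p>
        <preformat><![CDATA[
```python
import math

def build_classifier_input(source_walk, target_walk, sep_token="[SEP]"):
    """Concatenate two tree walks into one string separated by [SEP]."""
    return f"{source_walk} {sep_token} {target_walk}"

def positive_probability(logits):
    """Softmax over two logits (negative, positive); returns P(positive)."""
    neg, pos = logits
    shift = max(neg, pos)       # subtract the max for numerical stability
    e_neg, e_pos = math.exp(neg - shift), math.exp(pos - shift)
    return e_pos / (e_neg + e_pos)

print(build_classifier_input("Science < Subject", "Science > Computer Science"))
print(positive_probability((0.0, 0.0)))  # 0.5
```
]]></preformat>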
        <p>Similarity Matrix Refinement. Now that each candidate has been given a mapping score
by the BERT classifier, we are able to create a more accurate similarity matrix. The new similarity
matrix contains the scores of all candidate mappings, while the score of each string alignment
that was pruned during candidate selection is set to 1. All other possible mappings are set to
zero.</p>
        <p>
          Greedy Matcher. To select which mappings from the refined similarity matrix are
kept, we employ a greedy algorithm similar to other works such as [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ,
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. This very simple
algorithm iterates through candidate mappings from highest to lowest score and selects every
mapping whose source and target concepts are not part of an already selected mapping. The
initial set of mappings is the set of string alignments computed during candidate selection.
Furthermore, only mappings with scores higher than 0.85 are considered.
        </p>
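        <p>The greedy selection described above can be sketched as follows. The function name greedy_match and the dictionary of scored pairs are hypothetical; the 0.85 threshold and the use of string alignments as the initial set come from the text.</p>
        <preformat><![CDATA[
```python
def greedy_match(scored_candidates, string_alignments, threshold=0.85):
    """Greedy one-to-one selection over (source, target) -> score entries."""
    selected = set(string_alignments)          # initial set of mappings
    used_sources = {s for s, _ in selected}
    used_targets = {t for _, t in selected}
    ranked = sorted(scored_candidates.items(), key=lambda item: item[1], reverse=True)
    for (source, target), score in ranked:
        if score <= threshold:
            break                              # remaining scores are lower still
        if source not in used_sources and target not in used_targets:
            selected.add((source, target))
            used_sources.add(source)
            used_targets.add(target)
    return selected

scores = {("a", "x"): 0.99, ("a", "y"): 0.97, ("b", "y"): 0.90, ("c", "z"): 0.60}
print(sorted(greedy_match(scores, string_alignments={("d", "w")})))
# [('a', 'x'), ('b', 'y'), ('d', 'w')]
```
]]></preformat>
        <p>In this toy run, the pair (a, y) is skipped because a is already matched to x, and (c, z) is dropped because its score is below the threshold.</p>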
      </sec>
    </sec>
    <sec id="sec-2">
      <title>1.3. Parameters</title>
      <p>For the tree walks, the number of branches is randomly chosen between 3 and 5. The path
length for each branch is randomly chosen between 2 and 5. The BERT MLM model was trained
for 40 epochs with a learning rate of 1e-5 with the Adam optimizer, while the BERT alignment
classifier was trained for 50 epochs with the same configuration. Both models were trained
on an RTX A6000 48 GB graphics card.</p>
    </sec>
    <sec>
      <title>2. Results</title>
      <p>Full results for all reference alignments are shown in Table 2.</p>
    </sec>
    <sec id="sec-3">
      <title>2.1. Anatomy</title>
      <p>The anatomy track consists of matching the Adult Mouse Anatomy (MA) and the NCI Thesaurus
describing the Human Anatomy (NCI). SEBMatcher reached an F1-score of 0.908, with a precision
of 0.945 and a recall of 0.874. Compared to this year’s other systems, SEBMatcher ranked second out
of 10 in terms of F1-score. Runtime, however, is where SEBMatcher performed the worst,
with a total time of 35602 seconds. This runtime includes training time (35350 seconds) and
inference (252 seconds).</p>
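      <p>As a quick consistency check of the reported figures, the F1-score is the harmonic mean of precision and recall, and the total runtime is the sum of training and inference time. The helper name f1_score is ours.</p>
      <preformat><![CDATA[
```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.945, 0.874), 3))  # 0.908
print(35350 + 252)                        # 35602
```
]]></preformat>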
    </sec>
    <sec id="sec-4">
      <title>2.2. Conference</title>
      <p>The conference track consists of matching a collection of ontologies describing the domain
of organising conferences. This track contains different sets of reference alignments: the M1
alignments contain only classes, the M2 alignments contain only properties, and the M3 alignments
contain both classes and properties. Since SEBMatcher currently matches classes exclusively, it does not perform well
on the M3 reference alignments.</p>
    </sec>
    <sec>
      <title>3. General comments and Conclusion</title>
      <p>Overall, SEBMatcher obtained a high performance on the anatomy track but less interesting
results on the conference track. Unfortunately, we found a bug in our greedy algorithm at the
very end of the execution phase. After fixing it, our F1-score reaches 0.75 for
the ra1-M1 alignments and remains similar for the anatomy track. These results are, however, not
part of the official competition results. SEBMatcher is a system that fully exploits the benefits
of rich ontologies, since it can more easily organize elements in the embedding space when
there is a well-defined ontological structure. Thus flat ontologies would not be usable with our
architecture.</p>
      <p>There is still a lot of room for improvement. First, performance could be improved by
generating more semantically complex positive alignments: string
alignments are easy mappings to find, and the system benefits a lot more from learning from
difficult positive alignments. Second, transformer architectures other than BERT could be
explored. SEBMatcher produces mapping probabilities that are often close to either 1 or 0 and
are rarely far away from both values. This is undesirable for filtering valid candidates.</p>
      <p>It should also be noted that SEBMatcher withdrew from the BioML track since we did not have
enough time to adapt the system to large ontologies. This points to a main downside
of SEBMatcher, which is its runtime, due to our BERT pre-training and fine-tuning. At
inference, however, SEBMatcher performs reasonably. In future work, we plan to condense
these two training steps into one process. Another way to improve runtime would be to
prioritize concepts that are likely to be matched and ignore those that are not. We plan to
continue the exploration of concept embeddings and the generalization of our approach to other
ontologies and tracks.</p>
    </sec>
    <sec>
      <title>4. Acknowledgements</title>
      <p>This research has been funded by Canada’s NSERC Discovery Research Program.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Emily</given-names>
            <surname>Alsentzer</surname>
          </string-name>
          et al.
          <source>Publicly Available Clinical BERT Embeddings</source>
          .
          <year>2019</year>
          . doi: 10.48550/ARXIV.1904.03323. url: https://arxiv.org/abs/1904.03323.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Alexandre</given-names>
            <surname>Bento</surname>
          </string-name>
          , Amal Zouaq, and Michel Gagnon. “
          <article-title>Ontology Matching Using Convolutional Neural Networks</article-title>
          ”. In:
          <source>Proceedings of the 12th Language Resources and Evaluation Conference</source>
          . Marseille, France: European Language Resources Association, May
          <year>2020</year>
          , pp.
          <fpage>5648</fpage>
          -
          <lpage>5653</lpage>
          . isbn: 979-10-95546-34-4. url: https://aclanthology.org/2020.lrec-1.693.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Jacob</given-names>
            <surname>Devlin</surname>
          </string-name>
          et al. “
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ”. In: arXiv preprint arXiv:1810.04805 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Yuan</given-names>
            <surname>He</surname>
          </string-name>
          et al. “
          <article-title>BERTMap: A BERT-based ontology alignment system</article-title>
          ”. In:
          <source>Proceedings of the AAAI Conference on Artificial Intelligence</source>
          . Vol.
          <volume>36</volume>
          .
          <issue>5</issue>
          .
          <year>2022</year>
          , pp.
          <fpage>5684</fpage>
          -
          <lpage>5691</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Sven</given-names>
            <surname>Hertling</surname>
          </string-name>
          , Jan Portisch, and Heiko Paulheim. “
          <article-title>MELT - Matching EvaLuation Toolkit</article-title>
          ”. In:
          <source>Semantic Systems. The Power of AI and Knowledge Graphs - 15th International Conference, SEMANTiCS 2019, Karlsruhe, Germany, September 9-12, 2019, Proceedings</source>
          .
          <year>2019</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>245</lpage>
          . doi: 10.1007/978-3-030-33220-4_17. url: https://doi.org/10.1007/978-3-030-33220-4_17.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Ole Magnus</given-names>
            <surname>Holter</surname>
          </string-name>
          et al. “
          <article-title>Embedding OWL ontologies with OWL2Vec</article-title>
          ”. In:
          <source>CEUR Workshop Proceedings</source>
          . Vol.
          <volume>2456</volume>
          . Technical University of Aachen.
          <year>2019</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Vivek</given-names>
            <surname>Iyer</surname>
          </string-name>
          , Arvind Agarwal, and Harshit Kumar. “
          <article-title>Multifaceted Context Representation using Dual Attention for Ontology Alignment</article-title>
          ”. In: CoRR abs/2010.11721 (
          <year>2020</year>
          ). arXiv: 2010.11721. url: https://arxiv.org/abs/2010.11721.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Leon</given-names>
            <surname>Knorr</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jan</given-names>
            <surname>Portisch</surname>
          </string-name>
          . “
          <article-title>Fine-TOM matcher results for OAEI 2021</article-title>
          ”. In:
          <source>CEUR Workshop Proceedings</source>
          . Vol.
          <volume>3063</volume>
          . RWTH.
          <year>2022</year>
          , pp.
          <fpage>144</fpage>
          -
          <lpage>151</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Jifang</given-names>
            <surname>Wu</surname>
          </string-name>
          et al. “
          <article-title>DAEOM: A deep attentional embedding approach for biomedical ontology matching</article-title>
          ”. In:
          <source>Applied Sciences</source>
          <volume>10</volume>
          .
          <issue>21</issue>
          (
          <year>2020</year>
          ), p.
          <fpage>7909</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>