<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sven Hertling</string-name>
          <email>sven.hertling@uni-mannheim.de</email>
          <uri>https://www.uni-mannheim.de/dws/people/researchers/phd-students/sven-hertling/</uri>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Heiko Paulheim</string-name>
          <email>heiko.paulheim@uni-mannheim.de</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Data and Web Science Group, University of Mannheim</institution>
          ,
          <country country="DE">Germany</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <abstract>
        <p>This paper presents the results of the OLaLa matching system in the OAEI 2023. The system is based on sentence-transformers as well as large language models. The former are used to generate correspondence candidates independently of any overlapping tokens, because the comparison is based only on embeddings. To select the final mappings, a large language model decides whether the textual representations of the source and target concept are equal or not. A confidence is extracted from the positive and negative words that the LLM predicts. Still, many decisions heavily influence the final result, such as (1) how each concept is verbalized into text, (2) which prompt to use, and (3) which language model to choose. Many combinations were evaluated, and the most useful one is submitted and packaged as a matching system.</p>
      </abstract>
      <kwd-group>
        <kwd>Ontology Matching</kwd>
        <kwd>Knowledge Graphs</kwd>
        <kwd>Large Language Model</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>[Figure 1: Overview of the OLaLa architecture. Candidates from the input ontologies O1 and O2 are generated via TextExtraction and an SBERT model (top k), combined with the output of a high-precision matcher, verbalized via TextExtraction into a prompt for the large language model (LLM application), and finally passed through a cardinality filter and a confidence filter.]</p>
      <p>Still, many decisions need to be made when designing a full matching system, such as (1) how to present each concept/correspondence to the model, (2) which prompt to use, and (3) which model to choose.</p>
      <p>The system has already been described in general in an earlier publication [3]. The following description is
taken from that paper.</p>
      <p>Figure 1 shows an overview of the simple architecture of the OLaLa system. All components
are implemented in MELT [4], a framework for matcher development and evaluation. MELT
is also used by the OAEI to package, submit, and evaluate the systems. Thus the
ontology matching community can use each component in their own matching pipeline and
customize it to their liking. The implementation of OLaLa is publicly available, and we
provide a console application<sup>2</sup> that runs the system and exposes the most important
parameters.</p>
      <p>At the very beginning, matching candidates need to be extracted from the two given
input ontologies O1 and O2. Afterwards, those candidates are inserted into the user-defined
prompt and presented to the LLM. Two options are possible: (1) each correspondence is analyzed
independently of the others, or (2) given a source entity, all possible target entities are presented and
the LLM needs to decide which one is correct (or none of them). The output of the high-precision
matcher is added to ensure that the simple matches are included as well. Finally, some filters
are applied to fulfill the usual requirements for an alignment, such as a 1:1 mapping (cardinality
filter). The confidence filter at the end ensures that only correspondences with reasonably high
confidence are returned. The following sections describe each step in more detail.</p>
    </sec>
    <sec id="sec-2">
      <title>1.1. Candidate Generation</title>
      <p>Because LLMs usually cannot analyze the input ontologies as a whole
(except for small ontologies like those in the OAEI conference track, see [5]), some correspondence
candidates need to be generated. In this stage, only recall matters: the higher
the recall, the better. Some of the related approaches apply an inverted index to find possibly
similar entities. This requires some textual overlap of those concepts. In OLaLa, the well-known
Sentence BERT (SBERT) models are used to generate those candidates. This allows a higher
recall because it can also find similar entities without any textual overlap. The trained SBERT
models are siamese BERT models fine-tuned on a huge set of paraphrases [6]. SBERT as well
as all LLMs only process text, but the input is an ontology. Thus it is necessary to verbalize
the concepts into natural language text. In MELT, the components that do this are called TextExtractors (see
section 1.3).</p>
      <sec id="sec-2-1">
        <title><sup>2</sup>https://github.com/dwslab/melt/tree/master/examples/llm-transformers</title>
        <p>For the candidate generation step, we use the so-called TextExtractorSet. It extracts all
texts of a resource that are either labels (e.g. rdfs:label, skos:prefLabel, schema:name)
or descriptions (e.g. rdfs:comment, dc:description, schema:comment). In addition,
the URI fragment is extracted if no more than 50% of it consists of numbers. As a last step, all
annotation properties are followed recursively, and all labels of the resources reached this way are added as
well.</p>
        <p>All those extracted texts for each resource are embedded and a semantic search is executed.
It computes the cosine similarity between a list of query embeddings and a list of corpus
embeddings and returns the top-k neighbors for each text. From those, we select the top-k best
neighbors per resource. This procedure is then repeated but the source and target ontologies
are swapped such that both act once as the query and once as the corpus embedding.</p>
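        <p>The bidirectional top-k search described above can be illustrated with a minimal Python sketch. It is not the MELT implementation: the function names cosine, top_k_candidates, and generate_candidates as well as the toy two-dimensional vectors are hypothetical, the embeddings are assumed to be precomputed (in OLaLa they come from an SBERT model), every resource is assumed to have at least one embedding, and the two-stage selection (top-k per text, then top-k per resource) is simplified to the best similarity per resource pair.</p>
        <preformat><![CDATA[
```python
import math


def cosine(u, v):
    """Cosine similarity of two equally long vectors (0.0 if one is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_candidates(query, corpus, k):
    """Return {(query_resource, corpus_resource): similarity} for the k most
    similar corpus resources of every query resource.

    query and corpus map a resource to a non-empty list of embeddings,
    one embedding per extracted text.
    """
    result = {}
    for q_res, q_vecs in query.items():
        best = {}  # corpus resource -> best similarity over all text pairs
        for c_res, c_vecs in corpus.items():
            best[c_res] = max(cosine(q, c) for q in q_vecs for c in c_vecs)
        ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
        for c_res, sim in ranked[:k]:
            result[(q_res, c_res)] = sim
    return result


def generate_candidates(source, target, k=5):
    """Search in both directions and merge the results into (source, target) pairs."""
    candidates = top_k_candidates(source, target, k)
    for (t_res, s_res), sim in top_k_candidates(target, source, k).items():
        pair = (s_res, t_res)
        candidates[pair] = max(sim, candidates.get(pair, sim))
    return candidates


# Toy embeddings (hypothetical); real ones would have several hundred dimensions.
source = {"s1": [[1.0, 0.0]], "s2": [[0.0, 1.0]]}
target = {"t1": [[0.9, 0.1]], "t2": [[0.1, 0.9]], "t3": [[-1.0, 0.0]]}
candidates = generate_candidates(source, target, k=1)
```
]]></preformat>
        <p>Swapping the roles of query and corpus matters in this toy example: t3 is not the nearest neighbor of any source resource, so the pair (s2, t3) only enters the candidate set through the reverse search.</p>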
      </sec>
    </sec>
    <sec id="sec-3">
      <title>1.2. LLM Application</title>
      <p>There are two principal approaches to presenting the candidates to the LLM. The first one
is binary decisions, i.e., deciding whether one candidate is correct or not; the second is multiple-choice
decisions, i.e., selecting the most likely correspondence for one concept from a set of
possible targets.</p>
      <p><bold>1.2.1. Binary Decisions</bold></p>
      <p>Binary decisions are implemented in the class LLMBinaryFilter. For each candidate
correspondence, the source and target entity are verbalized into text and inserted into the prompt given
by the user. The output of generative models, such as the ones applied in this work, is always
natural language text. To convert this into a binary decision, the following technique is applied:
we search for target tokens/words that indicate the result (e.g. true/yes or false/no). If such a
token is found, the generation process is stopped immediately. Because generation is computationally expensive,
this early stopping is useful when processing a large number of candidates. So far, only
the decision is extracted; if the model generates other text, such as “This is a correct
match”, the detection fails.</p>
      <p>To overcome this issue and also extract a specific confidence, we do the following. If any of
the target tokens is detected, we retrieve the scores of the complete vocabulary and apply
the softmax function to them. The result corresponds to the probability that each word is generated at this
position. For all words in a class (e.g. yes, true), we check the probability and take the maximum
value, which is normalized by the negative class. Thereby, we get a confidence between zero
and one, and every confidence above 0.5 corresponds to a predicted positive token.</p>
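      <p>The confidence extraction can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the OLaLa code: the scores are given as a plain dictionary from token to raw score (a real model returns a tensor over the whole vocabulary), the word lists and function names are hypothetical, and "normalized by the negative class" is interpreted here as dividing the best positive probability by the sum of the best positive and best negative probability.</p>
      <preformat><![CDATA[
```python
import math

POSITIVE_WORDS = ("yes", "true")   # assumed positive class
NEGATIVE_WORDS = ("no", "false")   # assumed negative class


def softmax(scores):
    """Turn raw scores {token: score} into probabilities that sum to one."""
    highest = max(scores.values())  # subtract the maximum for numerical stability
    exps = {tok: math.exp(s - highest) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}


def confidence(scores):
    """Confidence in [0, 1] that the model answered positively.

    Assumed normalization: p_pos / (p_pos + p_neg), where p_pos and p_neg are
    the maximum probabilities over the positive and negative words.
    """
    probs = softmax(scores)
    p_pos = max(probs.get(word, 0.0) for word in POSITIVE_WORDS)
    p_neg = max(probs.get(word, 0.0) for word in NEGATIVE_WORDS)
    if p_pos + p_neg == 0.0:
        raise ValueError("none of the class words is part of the vocabulary")
    return p_pos / (p_pos + p_neg)


# Toy scores at the position where a target token was detected.
example = confidence({"yes": 2.0, "no": 0.0, "the": 1.0})
```
]]></preformat>
      <p>With this normalization, a value above 0.5 means that the best positive word is more probable than the best negative word, which matches the interpretation of the threshold given above.</p>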
      <p>If no positive or negative token is generated, the probabilities at the first generated
token are used. None of these computations would be possible with a model accessed through an
API, such as ChatGPT. Prompt engineering would be another way to obtain confidences (e.g.
by adding a request such as “and also provide a confidence score with your answer”), but usually the
chatbot responds that it is not able to provide a specific confidence value, and even if it does,
the value is not easy to extract from the generated text.</p>
      <p>The default generation strategy<sup>3</sup> is greedy, i.e., the token with the highest probability is
chosen at each step and the generation process continues with the extended text. The implementation also allows
switching to, e.g., contrastive search [7], but because the answers are usually short, this is neither necessary
nor helpful.</p>
    </sec>
    <sec id="sec-4">
      <title>1.3. TextExtractors / Verbalizers</title>
      <p>In addition to combining all texts from the TextExtractorSet explained before, an even simpler
extractor called TextExtractorOnlyLabels is implemented. It extracts only one text, which can
originate from the following properties (in decreasing importance): skos:prefLabel,
rdfs:label, URI fragment, skos:altLabel, skos:hiddenLabel. This means that if a skos:prefLabel
is detected, only this label is used.</p>
      <p>More context is included by the TextExtractorVerbalizedRDF.
It selects all RDF triples from the corresponding KG where the resource is in the subject position.
Those triples are verbalized, meaning that each subject, predicate, and object is replaced by
the text produced by the OnlyLabels extractor. All triples with a label-like property are skipped because their
information is already included. As an example, the statement “:MA_0000002 rdfs:subClassOf
:MA_0001112” is converted to “spinal cord grey matter sub class of grey matter”.</p>
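      <p>The verbalization step can be illustrated with a small Python sketch that reproduces the example above. It is a simplification and not the MELT code: triples are plain tuples of prefixed names, literals are assumed to be wrapped in double quotes, only two of the label properties are considered, and the helper names fragment_text, only_label, and verbalize are hypothetical.</p>
      <preformat><![CDATA[
```python
import re

# Subset of the label priority list, most important first (assumption).
LABEL_PROPERTIES = ("skos:prefLabel", "rdfs:label")


def fragment_text(term):
    """Fallback text from the local name, e.g. "rdfs:subClassOf" -> "sub class of"."""
    local_name = term.split(":")[-1]
    spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", local_name)  # split camel case
    return spaced.replace("_", " ").lower()


def only_label(term, triples):
    """Return a single text for a term: a literal as is, else a label, else the fragment."""
    if term.startswith('"'):
        return term.strip('"')
    for prop in LABEL_PROPERTIES:
        for s, p, o in triples:
            if s == term and p == prop:
                return o.strip('"')
    return fragment_text(term)


def verbalize(resource, triples):
    """Verbalize all triples with the resource as subject, skipping label triples."""
    sentences = []
    for s, p, o in triples:
        if s != resource or p in LABEL_PROPERTIES:
            continue
        sentences.append(" ".join(only_label(term, triples) for term in (s, p, o)))
    return sentences


triples = [
    (":MA_0000002", "rdfs:label", '"spinal cord grey matter"'),
    (":MA_0000002", "rdfs:subClassOf", ":MA_0001112"),
    (":MA_0001112", "rdfs:label", '"grey matter"'),
]
sentences = verbalize(":MA_0000002", triples)
```
]]></preformat>
      <p>The label triple of the resource itself is skipped because the label already appears as the subject text of every verbalized statement.</p>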
      <p>As a variation of the previous extractor, we also tried providing the triples directly as
serialized RDF. By default, the ResourceDescriptionInRDF extractor serializes to Turtle
format, where prefixes are used but the prefix definitions are excluded from the generated text
to make it shorter (other serializations can also be configured). If there are resources in the
object position of the triples, they are also replaced by a literal containing the corresponding
label.</p>
    </sec>
    <sec id="sec-5">
      <title>1.4. High-Precision Matcher</title>
      <p>The high-precision matcher is a simple matcher in MELT that efficiently searches for concepts
with the exact same normalized label (or URI fragment if a label is not available).<sup>4</sup> The
normalization includes lowercasing, camel-case normalization, and deletion of non-alphanumeric characters. If there
is only one such candidate for a concept, then it is matched.</p>
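      <p>A minimal Python sketch of such a matcher is shown below. It is not the MELT implementation: the names normalize and high_precision_match are hypothetical, camel-case normalization is assumed to mean splitting at lowercase-to-uppercase boundaries, non-alphanumeric characters are treated as token separators, and "only one such candidate" is interpreted as the normalized label being unique in both ontologies.</p>
      <preformat><![CDATA[
```python
import re
from collections import defaultdict


def normalize(label):
    """Split camel case, drop non-alphanumeric characters, and lowercase."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", label)  # "spinalCord" -> "spinal Cord"
    tokens = re.sub(r"[\W_]+", " ", spaced).lower().split()
    return " ".join(tokens)


def high_precision_match(source_labels, target_labels):
    """Match resources whose normalized label is unique on both sides.

    Both arguments map a resource to its label (or URI fragment).
    """
    src_index, tgt_index = defaultdict(list), defaultdict(list)
    for resource, label in source_labels.items():
        src_index[normalize(label)].append(resource)
    for resource, label in target_labels.items():
        tgt_index[normalize(label)].append(resource)

    matches = []
    for key, sources in src_index.items():
        targets = tgt_index.get(key, [])
        if key and len(sources) == 1 and len(targets) == 1:
            matches.append((sources[0], targets[0]))
    return sorted(matches)


source_labels = {"s1": "spinalCord", "s2": "Heart", "s3": "Lung"}
target_labels = {"t1": "Spinal_cord", "t2": "heart", "t3": "HEART", "t4": "liver"}
matches = high_precision_match(source_labels, target_labels)
```
]]></preformat>
      <p>In this toy input, "Heart" is left unmatched because two target resources share the same normalized label, which keeps the matcher precise at the cost of recall.</p>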
    </sec>
    <sec id="sec-6">
      <title>1.5. Postprocessing</title>
      <p>After the application of the LLM, the resulting alignment is further post-processed by filters. To
keep the matching pipeline simple, only two additional filters are applied. The cardinality filter
ensures a one-to-one mapping which is usually required.</p>
      <sec id="sec-6-1">
        <title><sup>3</sup>https://huggingface.co/docs/transformers/main/en/generation_strategies <sup>4</sup>https://dwslab.github.io/melt/matcher-components/full-matcher-list</title>
        <p>To further improve the alignment and remove correspondences that are likely to be incorrect,
the confidence filter is applied. All correspondences whose confidence is below
a predefined threshold value are excluded.</p>
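        <p>Both post-processing filters can be sketched in Python as follows. The sketch makes assumptions that the text does not fix: the one-to-one constraint is enforced greedily in order of decreasing confidence (other strategies, such as an optimal assignment, are possible), correspondences are plain (source, target, confidence) tuples, and the function names are hypothetical.</p>
        <preformat><![CDATA[
```python
def cardinality_filter(alignment):
    """Greedy one-to-one filter: keep the most confident correspondence first
    and drop every later one that reuses a source or target entity."""
    used_sources, used_targets = set(), set()
    kept = []
    for source, target, conf in sorted(alignment, key=lambda c: c[2], reverse=True):
        if source not in used_sources and target not in used_targets:
            kept.append((source, target, conf))
            used_sources.add(source)
            used_targets.add(target)
    return kept


def confidence_filter(alignment, threshold=0.5):
    """Keep correspondences whose confidence is at least the threshold."""
    return [c for c in alignment if c[2] >= threshold]


alignment = [
    ("a", "x", 0.9),
    ("a", "y", 0.8),
    ("b", "x", 0.7),
    ("b", "y", 0.4),
    ("c", "z", 0.5),
]
one_to_one = cardinality_filter(alignment)
final = confidence_filter(one_to_one, threshold=0.5)
```
]]></preformat>
        <p>Applying the cardinality filter first and the confidence filter last follows the order of the pipeline; in the toy input, ("b", "y") survives the first filter but is removed by the second.</p>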
      </sec>
    </sec>
    <sec id="sec-7">
      <title>1.6. Final Configuration</title>
      <p>For the final configuration, many parameters need to be fixed. The SBERT model for the
candidate generation step is set to multi-qa-mpnet-base-dot-v1,<sup>5</sup> and the value k of the
top-k neighbors search is set to five. This balances the number of generated
correspondences against the achieved recall. The TextExtractorSet is used to generate
multiple text representations of each resource for the search in the embedding space.</p>
      <p>The LLM is set to upstage/Llama-2-70b-instruct-v2.<sup>6</sup> To generate the text inserted into
prompt 7 (see [3]), i.e., a few-shot prompt with three positive and three negative examples,<sup>7</sup>
TextExtractorOnlyLabels is used. With this prompt, the binary decision approach is automatically
selected. For the text generation, the maximum number of tokens (max_new_tokens<sup>8</sup>) is set
to 10, but this number is usually not reached because a positive or negative word is
detected earlier. The next parameter to be fixed is the temperature. The lower the value,
the more deterministic the results are (the token with the highest probability is chosen as the
predicted token). With increased temperature, the outputs are more randomized (resulting
in more creative texts). We set the temperature to zero such that the results are reproducible.
Other generation parameters are set to their default values.</p>
      <p>The cardinality filter does not require any parameters, and the threshold of the confidence filter is
set to 0.5. With this setting, we filter out all correspondences where the LLM predicts a negative
word (such as “no” or “false”). Thus we do not need to tune the confidence threshold and do not
require any training alignment for it.</p>
    </sec>
    <sec id="sec-8">
      <title>1.7. Adaptations made for the evaluation</title>
      <p>OLaLa is available as a Docker image provided on figshare. When the image is started, an HTTP
endpoint is started within the container on port 8080. The web server fulfills the requirements
of the REST interface described in the MELT user guide.<sup>9</sup></p>
    </sec>
    <sec id="sec-9">
      <title>1.8. Link to the system and parameters file</title>
      <p>OLaLa can be downloaded from
https://doi.org/10.6084/m9.figshare.24150846.v2 [8].</p>
      <p><sup>5</sup>https://huggingface.co/sentence-transformers/multi-qa-mpnet-base-dot-v1</p>
      <p><sup>6</sup>https://huggingface.co/upstage/Llama-2-70b-instruct-v2</p>
      <p><sup>7</sup>The positive and negative examples are taken from the anatomy track and used across all tracks.</p>
      <p><sup>8</sup>https://huggingface.co/docs/transformers/main/en/main_classes/text_generation#transformers.GenerationConfig</p>
      <p><sup>9</sup>https://dwslab.github.io/melt/matcher-packaging/web</p>
      <sec id="sec-9-1">
        <title>2. Results</title>
        <p>This section discusses the results of OLaLa for each track of OAEI 2023. The system is not
able to produce meaningful results on the multifarm track because it is not yet designed for
multilingual input. However, there are open-source LLMs that support multilingual input well, and
testing them is left for future work.</p>
        <p>This year, it is also possible to submit final alignment files in case the system requires
substantial hardware resources. OLaLa requires two GPUs with at least 40 GB RAM. Thus we
also submitted the produced alignment files.</p>
      </sec>
    </sec>
    <sec id="sec-10">
      <title>2.1. Anatomy</title>
      <p>In terms of the F1 measure, the system ranks second in the anatomy track with a value of 0.91.
This is a rather good value for this track given that no explicit background knowledge is used. Nevertheless,
one needs to keep in mind that the pretrained model is also trained on a huge amount of text that
may cover similar topics.</p>
      <p>To improve the results, the recall needs to be higher. Furthermore, the alignment could
be made coherent with respect to the given ontologies by explicitly checking and filtering
correspondences.</p>
    </sec>
    <sec id="sec-11">
      <title>2.2. Conference</title>
      <p>In the conference track, OLaLa (F1 of 0.6) is considerably better than the string equivalence approach
(F1 of 0.53), but four matchers still achieve a higher value. In this track, the
precision is quite low at 0.59. Because each correspondence has an associated
confidence, a higher threshold would make sense here.</p>
      <p><bold>2.3. Bio ML</bold></p>
      <p>In this track, OLaLa performs quite well. Only SORBETMtch and LogMapBio are better in most
test cases. We still need to look into the SNOMED-FMA (Body) test case, where OLaLa is the worst
system in terms of the unsupervised F1 measure.</p>
    </sec>
    <sec id="sec-12">
      <title>2.4. Biodiv</title>
      <p>In the biodiv track, the system ranks first for the following test cases:</p>
      <list list-type="bullet">
        <list-item><p>NCBITAXON-TAXREFLD Plantae</p></list-item>
        <list-item><p>NCBITAXON-TAXREFLD Fungi</p></list-item>
        <list-item><p>NCBITAXON-TAXREFLD Chromista</p></list-item>
        <list-item><p>NCBITAXON-TAXREFLD</p></list-item>
      </list>
      <p>For NCBITAXON-TAXREFLD Animalia, LogMapLt is a bit better.</p>
    </sec>
    <sec id="sec-13">
      <title>2.5. Common Knowledge Graphs</title>
      <p>For the test case Nell-DBpedia, the proposed system performs best even though it does not use instance
information, because class matching does not require it.</p>
      <p>For Yago-Wikidata, most other systems (except LSMatch) are better in terms of F1.</p>
    </sec>
    <sec id="sec-14">
      <title>2.6. Knowledge Graph</title>
      <p>OLaLa is very good at property matching, but for classes some systems like SORBETMtch or
LSMatch are better. In the future, we also plan to include instance matching, which
currently takes too much time to execute.</p>
      <sec id="sec-14-1">
        <title>3. General comments</title>
      </sec>
    </sec>
    <sec id="sec-15">
      <title>3.1. Discussions on the way to improve the proposed system</title>
      <sec id="sec-15-1">
        <p>The system can be optimized in multiple ways.</p>
        <p>The main issue is scalability because the prediction of new tokens by the LLM takes a lot of
time (including the initial processing of the prompt). To reduce this time, the generation process
can be sped up (e.g. by batch processing). Additionally, we could first process only the correspondences
with the highest confidence and process the other candidates only if the LLM predicts
a negative token.</p>
        <p>Another extension would be to differentiate between class, property, and instance matches.
Up to now, the prompt and the text extraction are the same for all types of resources. It may be
helpful to change the prompt, the text extractor, or both for each type.</p>
        <sec id="sec-15-1-1">
          <title>4. Conclusions</title>
          <p>In this paper, we have analyzed the results of the OLaLa system. In many tracks, it can achieve
very good results and shows that the textual information is very helpful when generating
correspondences.</p>
          <p>Most of the components that are used in this system are included in the MELT framework [4],
which allows other researchers to reuse and compose components in their systems.</p>
          <p>[4] S. Hertling, J. Portisch, H. Paulheim, MELT - matching evaluation toolkit, in: International
conference on semantic systems (SEMANTICS), 2019, pp. 231–245.</p>
          <p>[5] S. S. Norouzi, M. S. Mahdavinejad, P. Hitzler, Conversational ontology alignment with
chatgpt, arXiv preprint arXiv:2308.09217 (2023).</p>
          <p>[6] N. Reimers, I. Gurevych, Sentence-bert: Sentence embeddings using siamese bert-networks,
in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language
Processing, Association for Computational Linguistics, 2019. URL: http://arxiv.org/abs/1908.10084.</p>
          <p>[7] Y. Su, T. Lan, Y. Wang, D. Yogatama, L. Kong, N. Collier, A contrastive framework for neural
text generation, Advances in Neural Information Processing Systems 35 (2022) 21548–21561.</p>
          <p>[8] S. Hertling, OLaLa for OAEI (2023). URL: https://figshare.com/articles/software/OLaLa_for_OAEI/24150846. doi:10.6084/m9.figshare.24150846.v2.</p>
        </sec>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
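          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>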
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Stone</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Albert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Almahairi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Babaei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Bashlykov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bhargava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bhosale</surname>
          </string-name>
          , et al.,
          <article-title>Llama 2: Open foundation and fine-tuned chat models</article-title>
          ,
          <source>arXiv preprint arXiv:2307.09288</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Hertling</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Paulheim</surname>
          </string-name>
          ,
          <article-title>Olala: Ontology matching with large language models</article-title>
          ,
          <source>in: Knowledge Capture Conference 2023 (K-CAP '23)</source>
          , December 5-7, 2023, Pensacola, FL, USA,
          <year>2023</year>
          . doi:10.1145/3587259.3627571.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>