<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Disambiguation for Semantic Annotations: Fusing Knowledge Graphs, Lexical Resources, and Large Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Robert David</string-name>
          <email>robert.david@semantic-web.com</email>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Kernerman</string-name>
          <email>anna@lexicala.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ilan Kernerman</string-name>
          <email>ilan@lexicala.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nicolas Ferranti</string-name>
          <email>nicolas.ferranti@wu.ac.at</email>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Assaf Siani</string-name>
          <email>assaf@lexicala.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lexicala by K Dictionaries</institution>
          ,
          <country country="IL">Israel</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>NOVA University Lisbon</institution>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Semantic Web Company</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>Vienna University of Economics and Business</institution>
          ,
          <country country="AT">Austria</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>1987</year>
      </pub-date>
      <volume>000</volume>
      <fpage>0</fpage>
      <lpage>0002</lpage>
      <abstract>
        <p>Knowledge models, constructed from vocabularies and ontologies, establish a formal basis for semantic annotations, which can support retrieval use cases in the context of Retrieval Augmented Generation (RAG) systems. In such a scenario, the retrieval process faces the challenges of word sense disambiguation (WSD), multiword expressions (MWE), and multilinguality (of models and content). For WSD and MWE, contextual knowledge is needed to differentiate the word senses of expressions in the content. For multilinguality, we aim for systems that support content in a mix of languages, as well as querying across languages. To support both goals, we propose a combination of knowledge models, multilingual linguistic data (including lexicographic resources), and large language models (LLMs). Via dictionaries with additional lexical information for multiple languages, we implement cross-language queries, and by integrating LLMs we use these high-quality language resources to drive multilingual disambiguation for Graph RAG systems. In this paper, we present research carried out jointly by Semantic Web Company and Lexicala by K Dictionaries, including our approach and methodology along with preliminary results of our experiments on converging language resources, knowledge graphs, and large language models.</p>
      </abstract>
      <kwd-group>
        <kwd>large language models</kwd>
        <kwd>word sense disambiguation</kwd>
        <kwd>Graph RAG</kwd>
        <kwd>multilingual</kwd>
        <kwd>knowledge graphs</kwd>
        <kwd>semantic annotation</kwd>
        <kwd>language resources</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Motivation</title>
      <p>https://www.poolparty.biz/ (R. David); https://lexicala.com/ (A. Kernerman); https://lexicala.com/ (I. Kernerman); https://lexicala.com/ (A. Siani). CEUR Workshop Proceedings (ceur-ws.org).</p>
      <p>
        In this paper, we describe our approach to enable WSD in the context of RAG [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], specifically
Graph RAG systems, where the basis is the retrieval of documents annotated with entities from a
knowledge graph (KG), specifically concepts from a taxonomy represented using the Simple Knowledge
Organisation System (SKOS) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. These concept annotations can be multilingual; however, translation can
introduce ambiguities when the retrieval process faces multilingual content.
In the following, we describe the retrieval architecture using components of the PoolParty Semantic
Suite product. In our scenario, retrieval based on concept annotations can face the situation where
a concept in English has translation equivalents in several concepts in Hebrew, and vice versa. In such
a situation, the multilinguality of a SKOS concept is not sufficient to represent the different senses. Our
approach is to use multilingual lexical data for the representation of polysemous words and to include
such information in the LLM-based WSD process to disambiguate concept annotations. While
there has been work on using LLMs for disambiguation tasks [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], the specific challenge for Graph RAG
presented in this paper has not been addressed so far to the best of our knowledge.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Methodology</title>
      <p>To implement our approach for multilingual WSD, we design the following workflow for data processing
in our Graph RAG architecture. The workflow is based on the PoolParty approach for implementing
a semantic retrieval system, here specifically a Graph RAG system, extended with a new step in the
workflow for disambiguation of concept annotations based on linguistic data and an LLM. In the
following, we first describe the workflow for creating semantic annotations and then explain the
disambiguation step used in the retrieval.</p>
      <p>Workflow for Semantic Annotations:</p>
      <list list-type="order">
        <list-item>
          <p>Model a SKOS thesaurus representing the knowledge domain for which we want to implement
the Graph RAG. The thesaurus contains SKOS concepts (entities + multilingual labels) to annotate
the documents. Optionally, additional information from an ontology can be used to extend the
thesaurus (Taxonomy &amp; Ontology Server).</p>
        </list-item>
        <list-item>
          <p>The documents in a corpus are annotated with the concepts from the thesaurus (Entity Extractor).</p>
        </list-item>
        <list-item>
          <p>The results of the annotation process are stored, potentially linked to further resources, and are
used for retrieval in the RAG process (Data Integration &amp; Linking).</p>
        </list-item>
      </list>
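      <p>As a rough illustration of steps 1 to 3, the following Python sketch annotates a tiny corpus with concept labels, stores the annotations in an inverted index, and retrieves documents for a question. The thesaurus content, the example.org URIs, the corpus, and the function names annotate, build_index and retrieve are assumptions made for this illustration; the naive substring matching is a stand-in and does not reflect how the PoolParty components work.</p>
      <preformat><![CDATA[
```python
# Invented thesaurus: concept URI -> {language tag: label}. URIs are placeholders.
THESAURUS = {
    "http://example.org/concept/fabric": {"en": "fabric", "he": "בד"},
    "http://example.org/concept/branch": {"en": "branch", "he": "בד"},
}

# Invented two-document corpus.
CORPUS = {
    "doc1": "The fabric was dyed blue.",
    "doc2": "A branch fell from the tree.",
}


def annotate(text, thesaurus):
    """Return the URIs of all concepts that have a label occurring in the text."""
    lowered = text.lower()
    # Naive substring match; a real extractor would tokenize and lemmatize.
    return {
        uri
        for uri, labels in thesaurus.items()
        if any(label.lower() in lowered for label in labels.values())
    }


def build_index(corpus, thesaurus):
    """Store annotations as an inverted index: concept URI -> document ids."""
    index = {}
    for doc_id, text in corpus.items():
        for uri in annotate(text, thesaurus):
            index.setdefault(uri, set()).add(doc_id)
    return index


def retrieve(question, index, thesaurus):
    """Return the documents that share at least one concept with the question."""
    doc_ids = set()
    for uri in annotate(question, thesaurus):
        doc_ids |= index.get(uri, set())
    return sorted(doc_ids)


INDEX = build_index(CORPUS, THESAURUS)
english_result = retrieve("Which fabric is blue?", INDEX, THESAURUS)
hebrew_result = retrieve("בד", INDEX, THESAURUS)
```
]]></preformat>
      <p>In this toy setup, the English question matches one concept, whereas the Hebrew label matches both concepts and therefore pulls in both documents, which is the kind of ambiguity the text goes on to address.</p>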
      <p>The system can now be queried using a user input (question) and retrieves documents based on the
semantic annotations. However, the concept annotations can be ambiguous and potentially misinterpreted,
especially if the question and (some) documents don’t share the same language.</p>
      <p>Word sense disambiguation during retrieval. Our approach combines three components: (i)
concepts from the KG, (ii) language resources, and (iii) an LLM (specifically, ChatGPT 3.5 is used for our
implementation).</p>
      <list list-type="bullet">
        <list-item>
          <p>The KGs provide multilingual concepts as a basis for the semantic annotations. However, different
concepts can match the same term because they share the same label, and are therefore
ambiguous.</p>
        </list-item>
        <list-item>
          <p>Language resources provide detailed knowledge about concepts, including translation equivalents
between different languages, a representation of the different word senses across these languages,
and example sentences of their usage.</p>
        </list-item>
        <list-item>
          <p>The LLM performs the WSD for the concept annotations, augmented by the language resources,
which provide the disambiguation options and context information to improve the disambiguation
result.</p>
        </list-item>
      </list>
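      <p>The first point, that several concepts can share one label, can be checked mechanically. The sketch below groups concepts by language and label and reports every label that points to more than one concept. The label triples, the example.org URIs and the function name ambiguous_labels are hypothetical and only serve this illustration.</p>
      <preformat><![CDATA[
```python
from collections import defaultdict

# (language tag, label, concept URI) triples; all values are invented.
LABELS = [
    ("he", "בד", "http://example.org/concept/fabric"),
    ("he", "בד", "http://example.org/concept/branch"),
    ("en", "fabric", "http://example.org/concept/fabric"),
    ("en", "branch", "http://example.org/concept/branch"),
]


def ambiguous_labels(labels):
    """Map each (language, label) shared by several concepts to those concepts."""
    by_label = defaultdict(set)
    for lang, label, uri in labels:
        by_label[(lang, label)].add(uri)
    # Keep only labels that are attached to more than one concept.
    return {key: sorted(uris) for key, uris in by_label.items() if len(uris) > 1}


result = ambiguous_labels(LABELS)
```
]]></preformat>
      <p>Only labels flagged in this way would need to be passed on to the LLM, so unambiguous annotations incur no additional cost.</p>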
      <p>The WSD step in the workflow is shown below, starting from the user asking the system a question,
to deciding on the correct word sense. The diagram represents only the WSD step within the whole Graph RAG
architecture.</p>
      <list list-type="order">
        <list-item>
          <p>A user formulates an information request by asking the system a question; the answer is
generated from documents retrieved via semantic annotations.</p>
        </list-item>
        <list-item>
          <p>The question is annotated with concepts from the thesaurus. However, it might contain
polysemous words. Even if this is not the case in the language used to formulate the question, it
might still be true for the (target) language of a document.</p>
        </list-item>
        <list-item>
          <p>In the case of such ambiguous concepts, the system asks the LLM to disambiguate the word
sense. A prompt is constructed that includes the specific information about the word senses
from the language resources, including the usage examples, and thereby augments the
LLM so that it can perform the disambiguation with higher precision. Because the language resources also
contain multilingual representations across languages, the system can perform cross-lingual
disambiguation.</p>
        </list-item>
        <list-item>
          <p>Finally, the disambiguation step returns only the correct word sense equivalent, which is then
used in the retrieval process, thereby increasing precision.</p>
        </list-item>
      </list>
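      <p>Steps 3 and 4 can be sketched as follows, assuming that each candidate sense is available as a small record with a gloss and a usage example. The Sense class, the prompt wording, the English usage examples, the example.org URIs and the ask_llm callable are assumptions for this illustration; the paper does not specify the actual prompt, and the LLM is replaced here by a stub so that the sketch runs without network access.</p>
      <preformat><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sense:
    """Hypothetical record for one word sense taken from a language resource."""
    concept_uri: str
    gloss: str
    example: str


def build_prompt(text, term, senses):
    """Assemble a disambiguation prompt that lists the candidate senses."""
    lines = [f"Text: {text}", f"Ambiguous term: {term}", "Candidate senses:"]
    for number, sense in enumerate(senses, start=1):
        lines.append(f"{number}. {sense.gloss} (example: {sense.example})")
    lines.append("Answer with the number of the sense used in the text.")
    return "\n".join(lines)


def disambiguate(text, term, senses, ask_llm):
    """Return the concept URI of the chosen sense, or None for an unusable reply."""
    reply = ask_llm(build_prompt(text, term, senses)).strip()
    try:
        choice = int(reply)
    except ValueError:
        return None
    if 1 <= choice <= len(senses):
        return senses[choice - 1].concept_uri
    return None


# Invented sense inventory for the Hebrew word discussed in Example 2.
SENSES = [
    Sense("http://example.org/concept/bar", "bar, pub", "We met at the bar after work."),
    Sense("http://example.org/concept/wilderness", "wilderness", "Few people live in the wilderness."),
]
TEXT = "שמירה על הבר עמדה במוקד עיסוקיו."

# Stub standing in for a real LLM call; it always answers "2".
chosen = disambiguate(TEXT, "בר", SENSES, lambda prompt: "2")
```
]]></preformat>
      <p>Returning None for replies that cannot be parsed lets the caller fall back to the unfiltered annotations instead of silently picking a sense.</p>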
    </sec>
    <sec id="sec-3">
      <title>3. Experiments</title>
      <p>In our experiment, we tested ChatGPT’s translation abilities between Hebrew and English. While it
demonstrates near-perfect translation capabilities in high-resource language pairs, this is not always the
case for language pairs involving a low-resource language, such as Hebrew. This challenge is particularly
evident in out-of-context sentences containing polysemous words, where the correct interpretation
may be apparent only to native speakers. The experiment aimed to determine whether ChatGPT’s
ability to correctly interpret ambiguous words could be improved by providing the relevant dictionary
definitions. We found that, in several instances, ChatGPT successfully identified the correct meaning,
showcasing the potential of lexicographic content to enhance the performance of LLMs. To illustrate
this, we present three examples.</p>
      <p>Example 1. The Hebrew word בד [bad] is polysemous, meaning ‘fabric’ (common use) and
‘tree branch’ (literary register, not widely common). The following prompt was given: “Translate
into English הפירות העמיסו על הבדים”. The correct translation would be “The fruits weighed down
the branches.”, yet ChatGPT 3.5 provided the following result: “The fruits weighted down the fabric.”
In a new ChatGPT conversation, the LLM was provided with the dictionary definition of the word
“בד” before being asked to translate the sentence again. It managed to correctly identify the intended
meaning out of the eight possible meanings and translate the sentence correctly despite having no
further conversational context.</p>
      <p>Example 2. The Hebrew word בר [baʁ] is polysemous, meaning ‘bar’ (i.e., a pub) and ‘wild’
(adjective) or ‘wilderness’ (noun); both senses are equally common. The following prompt was given: “Translate
into English שמירה על הבר עמדה במוקד עיסוקיו.” The correct translation would be “Guarding the wilderness
stood at the center of his activities.”, yet ChatGPT 3.5 had the following result: “Guarding the bar stood
at the center of his activities.” Once again, when the full dictionary entry of “בר” was included in the prompt,
with no additional context, ChatGPT managed to translate the sentence correctly.</p>
      <p>Example 3. The Hebrew word למתג [le’ma.teg] is polysemous, meaning ‘to brand’ (commonly
used) and ‘to restrain’ (rarely used). The following prompt was given: “Translate into English הוא
מיתג את הדחף לצפות בחדשות.” The correct translation would be “He restrained the urge to watch the
news.”, yet ChatGPT 3.5 had the following result: “He branded the urge to watch the news.” Once again,
we provided ChatGPT with the Hebrew dictionary entry for the word “מיתג” and it then managed to
translate the sentence correctly.</p>
      <p>In all three examples, providing ChatGPT with the dictionary entry containing the various meanings of
the ambiguous word was sufficient to enhance its translation accuracy, allowing it to correctly interpret
the polysemous word.</p>
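      <p>The two conditions compared in these examples, a plain translation request and the same request preceded by dictionary senses, can be sketched as follows. The baseline wording follows the prompts quoted above. The layout of the augmented prompt and the function names are assumptions, since the exact form in which the dictionary entry was given to ChatGPT is not reproduced in this paper, and only the two senses named in Example 1 are listed rather than the full eight-sense entry.</p>
      <preformat><![CDATA[
```python
def baseline_prompt(sentence):
    """Plain translation request, worded like the prompts quoted in the examples."""
    return f"Translate into English {sentence}"


def augmented_prompt(sentence, headword, senses):
    """Prepend numbered dictionary senses to the baseline prompt (assumed layout)."""
    sense_lines = [f"{i}. {sense}" for i, sense in enumerate(senses, start=1)]
    header = f"Dictionary entry for {headword}:"
    return "\n".join([header, *sense_lines, baseline_prompt(sentence)])


SENTENCE = "הפירות העמיסו על הבדים"
# Only the two senses named in Example 1; the full entry is not reproduced here.
SENSES = ["fabric (common)", "tree branch (literary)"]

plain = baseline_prompt(SENTENCE)
augmented = augmented_prompt(SENTENCE, "בד", SENSES)
```
]]></preformat>
      <p>Keeping the translation request identical in both conditions means that any difference in the output can be attributed to the added dictionary senses.</p>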
      <p>However, we also encountered failures. One case in which ChatGPT failed to
produce a good translation involves the Hebrew word גבר [ge.veʁ]; this word is widely used in both spoken
and written language, mostly with the meaning ‘man’. However, it also has the meaning ‘rooster’,
which is rarely used and is unknown to many native speakers, since it originated in the rabbinic
literature of the Talmud (dated to approximately 200-500 AD, many centuries before the revival of
modern Israeli Hebrew). When given the prompt “Translate into English הגבר קרא בקול עם הזריחה.” (The
rooster called aloud at sunrise.), ChatGPT 3.5 chose the common meaning ‘man’, which would
require a very specific context to make sense and be understood by native speakers. Even with the full
dictionary entry including the ‘rooster’ meaning, it failed to provide a good translation, again preferring
the ‘man’ meaning.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Conclusions</title>
      <p>The experiments showed that for low-resource languages, Graph RAG can be enhanced in cases of
retrieval requiring WSD when provided with additional context from language resources. Our approach
helps with cross-language retrieval by resolving ambiguities introduced by translation and thereby
avoiding misinterpretations in the generated answers. Also, we can represent the word senses as
structured data in the KG and thereby provide a basis to make the retrieval explainable. While our
approach showed value in the experiments, it leaves several open questions to be explored in future
work.</p>
      <p>First, the qualitative experiments presented in this paper do not show the value of our approach on
a large scale. Future work needs to expand them to a quantitative study, where we can measure the
impact on the quality of the retrieval.</p>
      <p>Second, we will look into other low-resource language pairs besides English-Hebrew to determine if
our approach is sufficiently generic. Closely associated with this question are experiments on content
using more than two languages.</p>
      <p>Third, we also face the problem of metaphorical terms, which are more challenging to translate
because it is not only necessary to understand the context, but also background knowledge is required
for a particular metaphor to be interpreted correctly. Metaphors are distinguished roughly into two
types. The first is borrowing, expanding the original meaning of the term in a metaphoric use. For
example, the term ‘grasp’, initially meaning a physical grasp (of an object), has undergone semantic
expansion and now means both a physical grasp and a mental grasp (of an idea or a concept). The second is
figurative phrases and terms, which when interpreted literally have no reasonable meaning and can
only be understood metaphorically. Metaphors of the first type are ambiguous and, when provided
with no additional discursive context, can be interpreted by native speakers in both the literal and the
metaphoric meanings. However, in cases of figurative ‘fixed’ phrases, native speakers would never
assign a literal meaning. For example, the Hebrew phrase כרסה בין שיניה, literally meaning ‘her belly
between her teeth’, describes a pregnant woman. Whereas native speakers do not need any additional
context and would always interpret it in its metaphoric meaning, LLMs need further context and/or a
lexical definition to decipher the metaphor. Beyond experiments on low-resource language pairs, we
can also investigate whether metaphor resolution provides an advantage for high-resource language pairs.</p>
      <p>Fourth, we aim to investigate whether our approach reduces hallucinations, since it prevents
some kinds of misinterpretation.</p>
      <p>With our work, we contribute to building high-quality Graph RAG systems by providing multilingual
WSD based on language resources, KGs, and LLMs.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Perez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Piktus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Petroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karpukhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Küttler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-t.</given-names>
            <surname>Yih</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rocktäschel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Riedel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <article-title>Retrieval-augmented generation for knowledge-intensive NLP tasks</article-title>
          ,
          <source>in: Proceedings of the 34th International Conference on Neural Information Processing Systems</source>
          , NIPS '20, Curran Associates Inc., Red Hook, NY, USA,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>D.</given-names>
            <surname>Allemang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hendler</surname>
          </string-name>
          ,
          <article-title>Semantic Web for the Working Ontologist: Effective Modeling in RDFS and OWL</article-title>
          , 2 ed., Morgan Kaufmann Publishers Inc., San Francisco, CA, USA,
          <year>2011</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>S.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <article-title>Use Large Language Models for Named Entity Disambiguation in Academic Knowledge Graphs</article-title>
          ,
          <source>in: 2023 3rd International Conference on Education, Information Management and Service Science (EIMSS 2023)</source>
          , Atlantis Press,
          <year>2023</year>
          , pp.
          <fpage>681</fpage>
          -
          <lpage>691</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>