<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Semantic Doppelgängers: How LLMs Replicate Lexical Knowledge</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luigi Di Caro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Laura Ventrice</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rachele Mignone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stefano Locci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Torino, Department of Computer Science</institution>
          ,
          <addr-line>Corso Svizzera 185 - 10149 Torino</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates how a single large language model, such as ChatGPT, can be used to mimic lexical resources and generate ad hoc lexical knowledge in real time by incorporating contextual information. We study ChatGPT's ability to capture various aspects of lexical semantics, such as synonyms, antonyms, hypernyms, and hyponyms, and compare it with well-known resources such as WordNet. We also evaluate ChatGPT's performance on tasks that require knowledge of lexical semantics, such as semantic similarity. Our results show that ChatGPT captures a substantial amount of lexical semantic information, and that its performance on lexical semantic tasks depends heavily on the quality and relevance of the contextual information. We also observe that ChatGPT's ability to generate ad hoc lexical knowledge in real time is a major advantage over traditional lexical resources, which may not be able to keep up with the constantly evolving nature of language. Overall, our study sheds light on the potential of large language models such as ChatGPT to mimic and even surpass traditional lexical resources in capturing and generating lexical semantic knowledge. This has important implications for natural language processing applications that require real-time access to up-to-date lexical information.</p>
      </abstract>
      <kwd-group>
        <kwd>Lexical Semantics</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Lexical semantics plays a crucial role in natural language understanding and is an essential
component of many natural language processing (NLP) tasks. Traditional lexical resources, such
as WordNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], BabelNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], and ConceptNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], have been widely used to provide structured
knowledge about words and their relationships. However, these resources are often static and
require constant manual updates to remain relevant in the face of the rapidly evolving nature
of language.
      </p>
      <p>Recently, large language models like ChatGPT have shown a remarkable ability to generate
coherent and contextually relevant responses in a conversational setting. This paper investigates
the potential of ChatGPT as an alternative to traditional lexical resources by evaluating its
ability to generate ad hoc lexical knowledge in real time, incorporating contextual information,
and comparing its performance to established resources on various lexical semantic tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        WordNet [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a widely used lexical resource that provides structured information about
synonyms, antonyms, hypernyms, and hyponyms. BabelNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] is a multilingual semantic
network that integrates lexical information from various sources, including WordNet, Wikipedia,
and other resources. ConceptNet [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] is a knowledge graph that combines information from
multiple sources to capture common sense and lexical knowledge.
      </p>
      <p>
        Several studies have explored the potential of neural networks in capturing lexical semantics.
For instance, [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] have shown that word embeddings can capture semantic relationships
to some extent. More recently, large-scale language models such as BERT [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] and GPT-3 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]
have demonstrated impressive performance on various NLP tasks, including those that require
lexical semantic knowledge.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>We conduct a comprehensive study on ChatGPT’s ability to capture various aspects of lexical
semantics, including synonyms, antonyms, hypernyms, and hyponyms. We compare its
performance with well-known resources such as WordNet on a range of tasks that require knowledge
of lexical semantics, such as word sense disambiguation and semantic similarity.</p>
      <p>To provide contextual information, we design a set of carefully crafted prompts that elicit
specific aspects of lexical semantics from ChatGPT. We also evaluate the effect of varying the
amount and relevance of contextual information provided to the model on its performance in
capturing lexical semantic knowledge.</p>
      <sec id="sec-3-1">
        <title>3.1. Experimental settings</title>
        <p>To investigate ChatGPT’s ability to capture various aspects of lexical semantics and compare its performance with well-known resources like WordNet, we designed the following experimental settings.</p>
        <sec>
          <title>3.1.1. Dataset Construction</title>
          <p>• Collect a dataset consisting of word pairs annotated with their semantic relationships, including synonyms, antonyms, hypernyms, and hyponyms.</p>
          <p>• Include a diverse range of word pairs to cover different semantic domains and levels of complexity.</p>
          <p>• Ensure a sufficient number of instances for each semantic relationship to provide reliable evaluation.</p>
        </sec>
        <sec>
          <title>3.1.2. Experimental Tasks</title>
          <p>• Perform a word sense disambiguation task: provide ambiguous word instances from the dataset and ask ChatGPT to select the correct sense based on the given context.</p>
          <p>• Conduct a semantic similarity task: present word pairs and assess the degree of similarity based on ChatGPT’s responses.</p>
          <p>• Compare ChatGPT’s performance on these tasks with the performance of WordNet as a baseline.</p>
        </sec>
        <sec id="sec-3-1-1">
          <title>3.1.3. Contextual Information Design</title>
          <p>• Craft carefully designed prompts that elicit specific aspects of lexical semantics from ChatGPT.</p>
          <p>• Create prompts that target synonyms, antonyms, hypernyms, and hyponyms to evaluate ChatGPT’s ability to capture each semantic relationship accurately.</p>
          <p>• Vary the prompts to cover different levels of complexity and variations in context.</p>
        </sec>
        <sec>
          <title>3.1.4. Contextual Information Variation</title>
          <p>• Explore the effect of varying the amount of contextual information provided to ChatGPT on its performance in capturing lexical semantic knowledge.</p>
          <p>• Design experiments with different lengths of prompts to assess the impact on ChatGPT’s ability to generate accurate lexical information.</p>
          <p>• Evaluate ChatGPT’s performance with prompts containing varying degrees of relevance to the target word pairs.</p>
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.5. Evaluation Metrics</title>
          <p>• Use established evaluation metrics for word sense disambiguation, such as accuracy or F1 score, to assess the model’s performance.</p>
          <p>• Measure semantic similarity using metrics like cosine similarity or Spearman’s rank correlation coefficient to quantify ChatGPT’s ability to capture semantic relationships.</p>
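          <p>As a minimal sketch of the rank-correlation metric named above, Spearman’s coefficient can be computed as the Pearson correlation of the two rank vectors. The helper names and the sample scores are illustrative assumptions and are not taken from the experimental code or data of this study.</p>
          <preformat><![CDATA[
```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for position in range(start, end + 1):
            ranks[order[position]] = mean_rank
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Made-up scores for illustration only (not taken from the study).
human_scores = [9.0, 7.5, 2.0, 0.5, 5.0]
model_scores = [8.0, 9.0, 1.0, 2.0, 6.0]
rho = spearman(human_scores, model_scores)
print(f"Spearman rho = {rho:.3f}")
```
]]></preformat>
          <p>Computing the coefficient on average ranks, rather than with the rank-difference shortcut formula, keeps the result well defined when several word pairs receive the same score.</p>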
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.6. Baseline Comparison</title>
          <p>• Compare ChatGPT’s performance on the experimental tasks with the performance of WordNet, a widely used lexical resource, to determine the effectiveness of ChatGPT in capturing lexical semantics.</p>
          <p>• Calculate and report performance metrics for both ChatGPT and WordNet, enabling a direct comparison between the two approaches.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.7. Statistical Analysis</title>
          <p>• Conduct statistical analysis (e.g., t-tests or ANOVA) to determine whether there are significant differences between the performance of ChatGPT and WordNet on the experimental tasks.</p>
          <p>• Perform post-hoc analysis if necessary to investigate specific pairwise comparisons between different conditions or semantic relationships.</p>
          <p>By implementing these experimental settings, we can comprehensively evaluate ChatGPT’s ability to capture various aspects of lexical semantics and compare its performance with WordNet, while exploring the impact of different contextual information settings on its performance.</p>
          <p>In the next section, we report some preliminary results obtained on the basis of two basic
settings: i) semantic similarity and ii) semantic relation extraction.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Our results show that ChatGPT is able to capture a significant amount of lexical semantic
information, with its performance on lexical semantic tasks being highly dependent on the
quality and relevance of the contextual information. In some cases, ChatGPT even surpasses
traditional lexical resources in capturing and generating lexical semantic knowledge.</p>
      <p>We also observe that ChatGPT’s ability to generate ad hoc lexical knowledge in real time is
a major advantage over traditional lexical resources, which may struggle to keep up with the
constantly evolving nature of language.</p>
      <sec id="sec-4-1">
        <title>4.1. Semantic Similarity</title>
        <p>
          As an initial experiment, we utilized the widely recognized SimLex-999 dataset [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]. This dataset
consists of word pairs accompanied by similarity scores. For our experiment, we randomly
selected 20 word pairs for each part-of-speech category (nouns, adjectives, and verbs), as shown
in Table 1. We specifically asked ChatGPT to evaluate the similarity between the words in each
pair using a binary decision approach (yes or no). Subsequently, we discretized the SimLex
dataset based on ChatGPT’s assessments.
        </p>
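        <p>The binary yes/no protocol can be sketched as follows. The prompt wording and the function names are hypothetical, since the exact prompt is not reported here, and the call to the language model itself is deliberately left out.</p>
        <preformat><![CDATA[
```python
def build_similarity_prompt(word_a, word_b):
    """Build a yes/no similarity question (hypothetical wording)."""
    return (
        f'Are the words "{word_a}" and "{word_b}" similar in meaning? '
        "Answer with yes or no only."
    )


def parse_binary_answer(reply):
    """Map a free-text reply to 1 (yes), 0 (no), or None if it is unclear."""
    tokens = reply.strip().lower().replace(",", " ").replace(".", " ").split()
    first = tokens[0] if tokens else ""
    if first == "yes":
        return 1
    if first == "no":
        return 0
    return None


# Example with a hand-written reply standing in for the model's output.
prompt = build_similarity_prompt("smart", "intelligent")
label = parse_binary_answer("Yes, they are similar.")
print(prompt)
print(label)
```
]]></preformat>
        <p>Returning None for replies that start with neither word makes unclear answers visible, so they can be re-queried or excluded instead of being silently counted as a negative judgment.</p>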
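        <table-wrap>
          <label>Table 1</label>
          <caption>
            <p>Word pairs randomly selected from SimLex-999, 20 for each part-of-speech category.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Verbs</th><th>Adjectives</th><th>Nouns</th></tr>
            </thead>
            <tbody>
              <tr><td>go / come</td><td>old / new</td><td>wife / husband</td></tr>
              <tr><td>take / steal</td><td>smart / intelligent</td><td>book / text</td></tr>
              <tr><td>listen / hear</td><td>hard / difficult</td><td>groom / bride</td></tr>
              <tr><td>think / rationalize</td><td>happy / cheerful</td><td>night / day</td></tr>
              <tr><td>occur / happen</td><td>hard / easy</td><td>south / north</td></tr>
              <tr><td>vanish / disappear</td><td>fast / rapid</td><td>plane / airport</td></tr>
              <tr><td>multiply / divide</td><td>happy / glad</td><td>uncle / aunt</td></tr>
              <tr><td>plead / beg</td><td>short / long</td><td>horse / mare</td></tr>
              <tr><td>begin / originate</td><td>stupid / dumb</td><td>bottom / top</td></tr>
              <tr><td>protect / defend</td><td>weird / strange</td><td>friend / buddy</td></tr>
              <tr><td>kill / destroy</td><td>wide / narrow</td><td>student / pupil</td></tr>
              <tr><td>create / make</td><td>bad / awful</td><td>world / globe</td></tr>
              <tr><td>accept / reject</td><td>easy / difficult</td><td>leg / arm</td></tr>
              <tr><td>ignore / avoid</td><td>bad / terrible</td><td>plane / jet</td></tr>
              <tr><td>carry / bring</td><td>hard / simple</td><td>woman / man</td></tr>
              <tr><td>leave / enter</td><td>smart / dumb</td><td>horse / colt</td></tr>
              <tr><td>choose / elect</td><td>insane / crazy</td><td>actress / actor</td></tr>
              <tr><td>lose / fail</td><td>happy / mad</td><td>teacher / instructor</td></tr>
              <tr><td>encourage / discourage</td><td>large / huge</td><td>movie / film</td></tr>
              <tr><td>achieve / accomplish</td><td>hard / tough</td><td>bird / hawk</td></tr>
            </tbody>
          </table>
        </table-wrap>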
        <p>We use the Pearson correlation coefficient to measure the strength and direction of the relationship between the model’s similarity judgments and the human-annotated similarity judgments in the SimLex dataset.</p>
        <p>To assess the correlation, we compared the similarity scores generated by the language model with the SimLex dataset. We extracted the relevant word pairs for adjectives, verbs, and nouns and calculated the Pearson correlation coefficient for each category separately. The coefficient ranges from -1 to 1, with 1 indicating a perfect positive correlation, 0 indicating no correlation, and -1 indicating a perfect negative correlation.</p>
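        <p>A minimal sketch of this computation is given below. It assumes that the model’s answers are encoded as 1 (yes) and 0 (no) and that the gold scores, which SimLex-999 reports on a 0 to 10 scale, are discretized at a threshold of 5.0. The threshold, the function names, and the sample values are assumptions made for illustration and do not reproduce the procedure or data of this study.</p>
        <preformat><![CDATA[
```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def binarize(scores, threshold=5.0):
    """Discretize gold scores; the threshold is an assumed midpoint."""
    return [1 if score >= threshold else 0 for score in scores]


# Made-up values for illustration only (not taken from SimLex-999).
gold_scores = [9.1, 1.2, 8.4, 0.9, 6.7, 3.0]
model_labels = [1, 0, 1, 0, 0, 0]

r_continuous = pearson(model_labels, gold_scores)
r_binary = pearson(model_labels, binarize(gold_scores))
print(f"against raw scores: {r_continuous:.3f}")
print(f"against discretized scores: {r_binary:.3f}")
```
]]></preformat>
        <p>When both variables are binary, the Pearson coefficient reduces to the phi coefficient, so a value of 1.0 can only arise if the model’s labels agree with the discretized gold labels on every pair.</p>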
        <p>The experiment showed a positive correlation with the SimLex dataset in every category, although its strength varied considerably across categories. The average Pearson coefficient across all three categories was 0.604. Specifically, the average Pearson coefficient for adjectives was 1.0, indicating a perfect positive correlation. For verbs, the average Pearson coefficient was 0.419, indicating a moderate positive correlation. Lastly, for nouns, the average Pearson coefficient was 0.392, also indicating a moderate positive correlation.</p>
        <p>The high correlation coefficient for adjectives suggests that the language model performs exceptionally well in capturing the semantic similarity of adjectival word pairs, and in particular that it can effectively distinguish between synonyms and antonyms in this category.</p>
        <p>The correlation coefficients for verbs and nouns are considerably lower, but they still indicate a positive relationship between the language model’s judgments and the SimLex dataset. This suggests that the model captures the semantic similarities between verbs and between nouns, although to a lesser extent than for adjectives.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Semantic Relation Extraction</title>
        <p>As a second experiment, we tried to reconstruct and label word pairs and their semantic relationships as encoded in a lexical semantic resource, i.e., WordNet.</p>
        <p>In this experiment, we aimed to assess the language model’s ability to recognize and label
semantic relationships in word pairs based on the information encoded in a lexical semantic
resource, specifically WordNet. We hypothesized that the language model would exhibit a high
level of accuracy in identifying and labeling various semantic relationships.</p>
        <p>To conduct the experiment, we selected a diverse set of word pairs representing different semantic relationships available in WordNet. These relationships included synonyms, antonyms, hyponym-hypernym pairs, meronym-holonym pairs, attributes, entailments, and cause-effect relationships. We ensured that the chosen word pairs covered a wide range of semantic nuances and complexities.</p>
        <p>The language model was presented with each word pair and asked to label the specific
semantic relationship between them. The labels were then compared against the corresponding
relationships as defined in WordNet. The evaluation metric for the experiment was the accuracy
of the language model’s labeling.</p>
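        <p>The accuracy computation described above can be sketched as follows, with an additional per-relation breakdown. The function name, the label strings, and the example pairs are illustrative assumptions and are not taken from the word pairs used in this experiment.</p>
        <preformat><![CDATA[
```python
from collections import defaultdict


def labeling_accuracy(gold, predicted):
    """Compare predicted relation labels with gold labels.

    Both arguments map a (word_a, word_b) tuple to a relation label.
    Returns the overall accuracy and a per-relation accuracy dict.
    """
    per_relation = defaultdict(lambda: [0, 0])  # relation -> [correct, total]
    for pair, gold_label in gold.items():
        counts = per_relation[gold_label]
        counts[1] += 1
        # A missing prediction counts as an error.
        if predicted.get(pair, "").strip().lower() == gold_label.lower():
            counts[0] += 1
    total = sum(t for _, t in per_relation.values())
    correct = sum(c for c, _ in per_relation.values())
    overall = correct / total if total else 0.0
    breakdown = {rel: c / t for rel, (c, t) in per_relation.items()}
    return overall, breakdown


# Illustrative gold labels and model outputs (hypothetical examples).
gold = {
    ("happy", "glad"): "synonym",
    ("old", "new"): "antonym",
    ("wheel", "car"): "meronym-holonym",
}
predicted = {
    ("happy", "glad"): "Synonym",
    ("old", "new"): "antonym",
    ("wheel", "car"): "synonym",
}
overall, breakdown = labeling_accuracy(gold, predicted)
print(overall, breakdown)
```
]]></preformat>
        <p>Reporting accuracy per relation alongside the overall figure shows whether errors concentrate in particular relation types, which a single aggregate number would hide.</p>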
        <p>The results of the experiment, shown in Table 2, revealed that the language model
demonstrated an exceptional ability to recognize and label semantic relationships with a high level
of accuracy. In fact, the model achieved a perfect accuracy rate in identifying and labeling the
semantic relationships encoded in WordNet.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>This study provides evidence that large language models such as ChatGPT have the potential
to mimic and even surpass traditional lexical resources in capturing and generating lexical
semantic knowledge. The ability to generate ad hoc lexical knowledge in real time, incorporating
contextual information, offers a significant advantage over static resources like WordNet.</p>
      <p>Our findings have important implications for NLP applications that require real-time access
to up-to-date lexical information, pointing towards a shift from relying on traditional lexical
resources to incorporating large language models as a source of lexical semantic knowledge.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>G. A.</given-names>
            <surname>Miller</surname>
          </string-name>
          ,
          <article-title>WordNet: A lexical database for English</article-title>
          ,
          <source>Communications of the ACM</source>
          <volume>38</volume>
          (
          <year>1995</year>
          )
          <fpage>39</fpage>
          -
          <lpage>41</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. P.</given-names>
            <surname>Ponzetto</surname>
          </string-name>
          ,
          <article-title>BabelNet: The automatic construction, evaluation and application of a wide-coverage multilingual semantic network</article-title>
          ,
          <source>Artificial Intelligence</source>
          <volume>193</volume>
          (
          <year>2012</year>
          )
          <fpage>217</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Speer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Havasi</surname>
          </string-name>
          ,
          <article-title>ConceptNet 5.5: An open multilingual graph of general knowledge</article-title>
          ,
          <source>in: Proceedings of AAAI</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>T.</given-names>
            <surname>Mikolov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G. S.</given-names>
            <surname>Corrado</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          ,
          <source>in: Advances in neural information processing systems</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>3111</fpage>
          -
          <lpage>3119</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pennington</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Manning</surname>
          </string-name>
          ,
          <article-title>GloVe: Global vectors for word representation</article-title>
          ,
          <source>in: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          ,
          <year>2014</year>
          , pp.
          <fpage>1532</fpage>
          -
          <lpage>1543</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</source>
          , Volume
          <volume>1</volume>
          (Long and Short Papers),
          <year>2019</year>
          , pp.
          <fpage>4171</fpage>
          -
          <lpage>4186</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name><given-names>T. B.</given-names> <surname>Brown</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Mann</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Ryder</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Subbiah</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Kaplan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Dhariwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Neelakantan</surname></string-name>,
          <string-name><given-names>P.</given-names> <surname>Shyam</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Sastry</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Askell</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Agarwal</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Herbert-Voss</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Krueger</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Henighan</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Child</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Ramesh</surname></string-name>,
          <string-name><given-names>D. M.</given-names> <surname>Ziegler</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Wu</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Winter</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Hesse</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Chen</surname></string-name>,
          <string-name><given-names>E.</given-names> <surname>Sigler</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Litwin</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>Gray</surname></string-name>,
          <string-name><given-names>B.</given-names> <surname>Chess</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Clark</surname></string-name>,
          <string-name><given-names>C.</given-names> <surname>Berner</surname></string-name>,
          <string-name><given-names>S.</given-names> <surname>McCandlish</surname></string-name>,
          <string-name><given-names>A.</given-names> <surname>Radford</surname></string-name>,
          <string-name><given-names>I.</given-names> <surname>Sutskever</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Amodei</surname></string-name>
          ,
          <article-title>Language models are few-shot learners</article-title>
          , arXiv preprint arXiv:2005.14165 (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>F.</given-names>
            <surname>Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Reichart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Korhonen</surname>
          </string-name>
          ,
          <article-title>SimLex-999: Evaluating semantic models with (genuine) similarity estimation</article-title>
          ,
          <source>Computational Linguistics</source>
          <volume>41</volume>
          (
          <year>2015</year>
          )
          <fpage>665</fpage>
          -
          <lpage>695</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>