<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Lost in Disambiguation: How Instruction-Tuned LLMs Master Lexical Ambiguity</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Capone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Serena Auriemma</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martina Miliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Bondielli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessandro Lenci</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CoLing Lab, Dipartimento di Filologia, Letteratura e Linguistica, Università di Pisa</institution>
          ,
          <addr-line>Via Santa Maria, Pisa, 56126</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Dipartimento di Informatica, Università di Pisa</institution>
          ,
          <addr-line>Largo B. Pontecorvo, 3, Pisa, 56127</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper investigates how decoder-only instruction-tuned LLMs handle lexical ambiguity. Two distinct methodologies are employed: eliciting rating scores from the model via prompting and analysing the cosine similarity between pairs of polysemous words in context. Ratings and embeddings are obtained by providing pairs of sentences from Haber and Poesio [1] to the model. These ratings and cosine similarity scores are compared with each other and with the human similarity judgments in the dataset. Surprisingly, the model scores show only a moderate correlation with the subjects' similarity judgments and no correlation with the target word embedding similarities. A vector space anisotropy inspection has also been performed, as a potential source of the experimental results. The analysis reveals that the embedding spaces of two out of the three analyzed models exhibit low anisotropy, while the third model shows relatively moderate anisotropy compared to previous findings for models with similar architecture [2]. These findings offer new insights into the relationship between generation quality and vector representations in decoder-only LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Lexical ambiguity</kwd>
        <kwd>Decoder models</kwd>
        <kwd>Transformer</kwd>
        <kwd>LLM</kwd>
        <kwd>Cosine similarity</kwd>
        <kwd>Human rating</kwd>
        <kwd>Anisotropy</kwd>
        <kwd>Model generation</kwd>
        <kwd>Model ratings</kwd>
        <kwd>Polysemy</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <title>In this paper, we aim to investigate how LLMs han</title>
        <p>dle LA. Specifically, we challenged three decoder-only
Lexical ambiguity (LA) is a peculiar characteristics of instruction-tuned models to generate lexical similarity
human language communication. Words often carry mul- ratings for word pairs used in two diferent contexts,
tiple meanings, and discerning the intended sense re- with various degrees of sense similarity. To achieve this,
quires nuanced comprehension of contextual cues. LA is we employed a chain-of-thought approach, prompting
a broad concept subsuming several semantic phenomena, the models to produce a step-by-step reasoning process
such as regular and irregular polysemy, homonymy, and before assigning their ratings, allowing them to better
the coinage of new senses. Humans handle such ambigu- distinguish between diferent senses of the same term.
ity efortlessly, leveraging contextual information, prior For this task, we used the dataset released by Haber and
knowledge, and pragmatic inference. However, for Large Poesio [1], which includes human similarity judgments.
Language Models (LLMs), which rely on statistical pat- The models’ generated ratings were correlated with
huterns in text data, accurately resolving lexical ambiguity man similarity judgments to determine whether their
remains a challenging task. lexical disambiguation competence aligns with that of</p>
        <p>Despite their remarkable capability of using words ap- humans. Additionally, we computed the cosine similarity
propriately in context, one critical aspect that requires between the models’ internal representation of the
amdeeper investigation is whether such models possess biguous target words. Our research question is twofold:
human-like lexical competence, enabling them to gener- i.) to assess if the models’ generated ratings are
conalize from multiple instances of the same phenomenon, sistent with their internal representations of the
or if they are simply mimicking these instances. target words; ii.) to determine whether the internal
representations have a more similar distribution to
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, human ratings than the generated responses.
Dec 04 — 06, 2024, Pisa, Italy We are aware that context-sensitive word embeddings,
† For the specific purposes of Italian Academy, Luca Capone is re- like those of LLMs, can sufer from a representation
degen1spaonndsi4b.l1e, fMorarSteinctaioMnsili2a,n3i.4foarnsde3ct.5io,nSser3e.n1a,3A.3urainedm3m.6a, fAolressescatniodnros eration problem (see Section ?? for further details), which
Bondielli for sections 3.2 and 4 and Alessandro Lenci for section 5 limits their semantic representational power. Hence, we
$ luca.capone@fileli.unipi.it (L. Capone) included in our analysis a brief overview of how this
0000-0002-1872-6956 (L. Capone); 0009-0006-6846-5826 phenomenon afects the internal representational space
(S. Auriemma); 0000-0003-1124-9955 (M. Miliani); of the models under our investigation.
0000-0003©-3204224C6o-p6y6ri4gh3t fo(Arth.isBpaopnerdbyieitlslaiu)t;h0or0s.0U0se-p0e0rm0i1tte-d5u7n9de0r-C4re3a0tiv8e C(Aom.mLonesnLiccein)se To the best of our knowledge, this is the first study in</p>
        <p>Attribution 4.0 International (CC BY 4.0).
which diferent decoder-only models were tested on their
metalinguistic competence regarding LA. Understanding
how LLMs manage this type of complex semantic
phenomenon, based on the interplay of multiple contextual
factors, can guide new improvements in training
methodologies for the development of more sophisticated and
robust models that better mimic human-like language
understanding.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Related works</title>
      <p>One of the main reasons for the success of
Transformer-based LMs is their ability to represent context-dependent
meaning. The specific meaning a token assumes in a
given context is encoded within the internal layers of
these models and is reflected in the spatial distribution of
the produced embeddings, where unique context vectors
for each token occurrence are placed distinctly [2].</p>
      <p>Yenicelik et al. [3], extending Ethayarajh [2]’s study,
sought to obtain a general overview of BERT’s [4]
embedding space concerning polysemous words. They
confirmed that BERT does indeed form contextual clusters,
which nevertheless obey semantic regularities in a broad
sense. These clusters may fulfill denotative, connotative,
or syntactic criteria, with converging groups consistent
with the idea of polysemy as a gradual continuum.
However, the embedding space of such models shows
regularities influenced not only by linguistic factors but also
by one of the model’s training objectives, i.e., Next
Sentence Prediction [5]. This confirms the flexibility and
richness of contextual representations but raises
questions about their representativeness of proper linguistic
features. Several studies compared the contextual vectors
of encoder models like BERT and ELMo with human
similarity judgments, demonstrating that human judgments
usually correlate with the cosine similarity of
polysemous word pairs [1, 6], and even more so with homonym
pairs [7].</p>
      <p>Recently, the correlation between human similarity
judgments and model competence regarding LA was also
explored for larger decoder models, such as GPT-4 [8].
However, this analysis only considers GPT’s generated
ratings, without examining the internal representations
of polysemous words. Hu and Levy [9] pointed out that
prompting might not be the most reliable way to evaluate
models, as the generated responses are not always
consistent with the model’s probability distribution. Their work
primarily addresses two tasks: token prediction and
sentence pair selection. In their evaluations, token prediction
is determined by identifying the token with the highest
probability from the entire vocabulary, while sentence
pair selection is based on the perplexity of two
competing propositions. While their methodology yields strong
results, it is not directly applicable to our study due to the
non-deterministic nature of model outputs in response to
the task we propose. Specifically, presenting the model
with two alternative sentences is not feasible in our
experiment, as the objective is to have the model generate a
chain-of-thought output that differentiates between the
distinct senses of an ambiguous term and subsequently
produces a rating. One alternative would be to have the
model directly predict the rating and check which
vocabulary token (among the numbers in the rating scale) has
the highest probability. However, this approach would
not generate the contextual embeddings for the target
term necessary for our comparisons. Furthermore, as
discussed in Section 3.3, ratings produced without the
chain-of-thought approach were inconsistent.</p>
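      <p>The sketch below illustrates this alternative readout, i.e., scoring candidate rating tokens by their next-token probability instead of generating a chain-of-thought answer. It is only an illustration of the option discussed above, not the procedure adopted in our experiments; the model name, the prompt, and the reduced 1-5 scale are placeholders.</p>
      <preformat>
# Illustrative only: read the probability of candidate rating tokens
# directly from the next-token distribution of a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "mistralai/Mistral-7B-Instruct-v0.2"   # placeholder choice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Simplified to a 1-5 scale so that each candidate rating is a single token.
prompt = "How similar is the use of 'paper' in the two sentences? Rating (1-5):"
inputs = tokenizer(prompt, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits[0, -1]          # logits for the next token
probs = torch.softmax(logits, dim=-1)

for rating in ["1", "2", "3", "4", "5"]:
    token_id = tokenizer.encode(rating, add_special_tokens=False)[0]
    print(rating, probs[token_id].item())
</preformat>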
      <p>Since we are dealing with word similarities, the most
straightforward way to measure a model’s internal
knowledge about polysemic words is by using cosine
similarities. However, given the contextual nature of
these models, embeddings might not transparently reflect
semantic properties, as they can be influenced by other
superficial contextual factors. This makes it challenging
to discern whether a high value of cosine similarity is due
to word sense similarity or to a general closeness of the
word embeddings in the space, the so-called anisotropy.</p>
      <p>Anisotropy can indeed negatively affect the
representational power of embeddings, and several methods have
been proposed to mitigate its effect [10, 11, 12].
Nevertheless, it has been demonstrated that anisotropy does
not have a negative impact on model performance [12].</p>
      <p>Given these complexities, we decided to further
investigate LA with large decoder-only models to highlight
differences with results obtained from smaller encoders
and to determine whether their behaviour aligns with
the human competence on LA. We compared the
performance of different instruction-tuned decoders to obtain
a more comprehensive overview of how these models
handle this phenomenon. To ensure a thorough
evaluation, we consider both the models’ generated ratings
for polysemous words and their cosine similarities.
Additionally, in our analysis, we took into account the level
of anisotropy exhibited by these models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Experimental settings</title>
      <sec id="sec-3-1">
        <title>3.1. Dataset</title>
        <sec id="sec-3-1-1">
          <title>We use the dataset introduced in Haber and Poesio [1],</title>
          <p>which includes a set of target words in various contexts.
Human judgments were collected on sentence pairs with
the same word, by asking participants to rate the
similarity of the target word meaning in the diferent contexts.
We chose to focus only on in-vocabulary tokens, as we
aimed to compare models’ performances on their
generated embeddings, without employing additional
operations (e.g., mean pooling of subword embeddings). Thus,
we retain about 79% of the dataset sentence pairs (i.e.,
236 out of the original 297).</p>
          <p>We further categorized sentence pairs according to
the distribution of the human ratings, dividing them
into four similarity classes depending on their
interquartile ranges (see the Appendix for the interquartile range values and a visual representation). We also included the two manually
identified groups from Haber and Poesio [1]. One consists
of sentence pairs with homonyms, and the other
consists of words having the same sense in highly similar
contexts. As these groups did not have human ratings,
we assigned ten ratings to each data point, randomly
selected around 0.01 for homonyms (indicating completely
different meanings) and around 1.00 for the other group.
The human ratings serve as the ground truth for the
posthoc analysis in Section 4. The final dataset counts 35
target word types (see Figure 1 for their list and token
distribution), with a set of similarity judgments for each
pair.</p>
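          <p>As an illustration of this preparation step, the sketch below bins sentence pairs into four similarity classes and generates the synthetic judgments for the two manually identified groups. The column names, example values, and the use of rating quartiles as class boundaries are assumptions made for illustration; the actual boundaries are reported in the Appendix.</p>
          <preformat>
# Hypothetical illustration of the dataset preparation described above.
import numpy as np
import pandas as pd

pairs = pd.DataFrame({
    "target": ["paper", "paper", "glass", "glass"],            # placeholder targets
    "ratings": [[0.2, 0.3, 0.25], [0.6, 0.5, 0.55],
                [0.7, 0.8, 0.75], [0.9, 0.95, 1.0]],            # human judgments per pair
})
pairs["mean_rating"] = pairs["ratings"].apply(np.mean)

# Four similarity classes from the quartiles of the rating distribution
# (one plausible reading of the interquartile-based grouping).
q1, q2, q3 = pairs["mean_rating"].quantile([0.25, 0.5, 0.75])
pairs["sim_class"] = pd.cut(pairs["mean_rating"],
                            bins=[0.0, q1, q2, q3, 1.0],
                            labels=["low", "mid-low", "mid-high", "high"],
                            include_lowest=True)

# Synthetic judgments for the two manually identified groups: ten ratings
# around 0.01 for homonym pairs and around 1.00 for same-sense pairs.
rng = np.random.default_rng(0)
homonym_ratings = np.clip(rng.normal(0.01, 0.005, 10), 0.0, 1.0)
same_sense_ratings = np.clip(rng.normal(1.00, 0.005, 10), 0.0, 1.0)
</preformat>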
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Models</title>
        <p>To assess the capability of LLMs to capture varying degrees of LA, we selected three decoder-only open models of comparable size. We chose instruction-tuned models exclusively, as this configuration is more suitable for conditional text generation: Meta-Llama-3-8B-Instruct [13], hereafter referred to as LLaMA; Gemma-1.1-7B (https://huggingface.co/google/gemma-1.1-7b-it), hereafter referred to as Gemma; and Mistral-7B-Instruct-v0.2 (https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2), hereafter referred to as Mistral. All models are instruction-tuned autoregressive LLMs with around 7 billion parameters. We chose these models as they are representative of popular and widely used open-weights LLMs. We used the Huggingface implementation of the models for our experiments.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Prompting</title>
        <p>We report experimental results using a single prompt (the full prompt is available in Appendix A).
The prompt was designed to closely follow the
methodology used by Haber and Poesio [1] for modeling the LA
task to collect crowdsourced data, ensuring a fair
comparison between LLMs’ ratings and human judgments.
In our setup, we provided the models with two sentences,
each containing the same target word. We then prompted
the models to return a rating score indicating how similar
the word’s usage was in the two occurrences. The rating
score ranged from 1 to 100, where 1 indicated that the
word was used with completely different senses in the
two sentences, and 100 indicated that the word was used
with the same sense across sentences. We formulated
the instructions following common rules of thumb for
prompting LLMs [14].</p>
        <p>In preliminary experiments, we asked the model to
return the similarity rating first and then to return the
motivation for such a rating. We observed that i.) the rating
was quite inconsistent with the underlying motivations
given by the models, ii.) the motivations were usually
more appropriate than the ratings, and that iii.) the
models tended to return the same rating for all the sentence
pairs. Thus, we chose to ask the model to provide the
motivation first, followed by the rating. This allowed
the models to provide more accurate ratings. Such a
behavior is in line with the literature on “chain-of-thought”
prompting [15]. Additionally, we chose beam search as a
generation strategy, with 2 beams. The models sampled
the next generated token among the 50 most probable
words. We combined this strategy with nucleus sampling,
by setting a probability threshold of 0.95.</p>
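        <p>A minimal sketch of this generation set-up with the Huggingface transformers API is shown below, using one of the three models. The prompt placeholder stands for the full template in Appendix A, and the maximum generation length is an assumption not specified above.</p>
        <preformat>
# Sketch of the decoding configuration described in this section.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16, device_map="auto")

prompt = "..."  # full chain-of-thought template reported in Appendix A
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output_ids = model.generate(
    **inputs,
    max_new_tokens=256,   # assumption: not specified in the text
    num_beams=2,          # beam search with 2 beams
    do_sample=True,       # combined with sampling
    top_k=50,             # sample among the 50 most probable tokens
    top_p=0.95,           # nucleus sampling threshold
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
</preformat>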
      </sec>
      <sec id="sec-3-3">
        <title>3.4. Embedding Extraction and</title>
      </sec>
      <sec id="sec-3-4">
        <title>Cosine-similarity</title>
        <p>To assess the capability of LLMs to capture varying Building on the experiments in Haber and Poesio [1]
degrees of LA, we selected three decoder-only open and Loureiro and Jorge [16], we used the embeddings
models of comparable size. We chose instruction- generated from the last layer and the average of the
emtuned models exclusively, as this configuration beddings from the last four layers as contextual
embedis more suitable for conditional text generation: dings for the generated tokens. The idea behind this
Meta-Llama-3-8B-Instruct [13], hereafter referred approach is that the last layer embeddings represent the
to as LLaMA; Gemma-1.1-7B2, hereafter referred to as most contextual and generation-focused features, while
Gemma; and Mistral-7B-Instruct-v0.23, hereafter the preceding layers capture more general aspects of the
referred to as Mistral. All models are instruction-tuned processed sequence. This method allowed us to obtain
autoregressive LLMs with around 7 Billion parameters. two sets of contextual embeddings for each generation.
We chose these models as they are representative of Due to the unidirectional design of the decoder
architecpopular and widely used open-weights LLMs. We used tures, the repetition of the input sentences across
generathe Huggingface implementation of the models for our tions was necessary. The model had to process all tokens
experiments. in both sentences before providing suficient contextual
embeddings, making the input vectors unsuitable for the
1See Appendix 4 for the interquartile ranges values and a visual task. Once the vectors for each generated token were
representation. obtained, we isolated the embeddings corresponding to
2https://huggingface.co/google/gemma-1.1-7b-it
3https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2 4The full prompt is available in Appendix A.
the tokens of the target words contained in the stimulus
sentences (repeated by the model at the beginning of the
generation). Afterwards, cosine similarity values were
calculated between the target word vectors extracted
from the last layer and the last four layers.</p>
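        <p>The extraction step can be sketched as follows. For brevity, the sketch runs each sentence through the model directly and assumes that the position of the target token is already known, whereas in our setting the target vectors are taken from the sentences repeated by the model at the beginning of its generation; model and tokenizer are as in the previous sketch.</p>
        <preformat>
# Sketch: last-layer and mean-of-last-four-layers vectors for a target token,
# and the cosine similarity between two occurrences of the same word.
import torch
import torch.nn.functional as F

def target_vectors(model, tokenizer, text, target_position):
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**inputs, output_hidden_states=True)
    hidden = out.hidden_states                     # embeddings + one tensor per layer
    last = hidden[-1][0, target_position]          # last layer
    last4 = torch.stack(hidden[-4:]).mean(dim=0)[0, target_position]
    return last, last4

v1_last, v1_last4 = target_vectors(model, tokenizer, "first sentence ...", target_position=5)
v2_last, v2_last4 = target_vectors(model, tokenizer, "second sentence ...", target_position=7)
css_last = F.cosine_similarity(v1_last, v2_last, dim=0).item()
css_last4 = F.cosine_similarity(v1_last4, v2_last4, dim=0).item()
</preformat>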
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Investigating anisotropy in decoder-only models</title>
        <p>The so-called representation degeneration problem [17] is
a well-known phenomenon observed in several
Transformer architectures, even in those trained on data other
than text [18]. This issue causes most of the model’s
learned word embeddings to drift to a narrow region of
the vector space [2], making them very close to each
other in terms of cosine similarity, and consequently
limiting their semantic representational power. Since our
work primarily focuses on analyzing LLMs’ ability to
capture subtle semantic properties such as polysemic
relations and relies in part on the computation of cosine
similarity between token pair embeddings, we decided
to further investigate this phenomenon.</p>
        <p>We conducted an analysis of the distribution of the
models’ generated tokens in the vector space to
understand the extent of representation degeneration and its
implications for the semantic representation of our
target tokens. For each model, we sampled 1,000 pairs of
random tokens from all generations of the model across
the entire dataset. We extracted the representations of
these tokens from both the last layer and the average
of the last four layers. We then computed the average
cosine similarity of the sampled embedding pairs for the
last and last four layers separately.</p>
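        <p>The estimate can be computed as in the sketch below, where token_vectors is assumed to be a matrix stacking the representations (from the last layer, or the average of the last four layers) of the tokens generated by one model across the dataset.</p>
        <preformat>
# Sketch: average cosine similarity of randomly sampled token-embedding pairs,
# used as a rough estimate of anisotropy.
import torch
import torch.nn.functional as F

def anisotropy(token_vectors: torch.Tensor, n_pairs: int = 1000, seed: int = 0) -> float:
    g = torch.Generator().manual_seed(seed)
    idx_a = torch.randint(0, token_vectors.size(0), (n_pairs,), generator=g)
    idx_b = torch.randint(0, token_vectors.size(0), (n_pairs,), generator=g)
    sims = F.cosine_similarity(token_vectors[idx_a], token_vectors[idx_b], dim=-1)
    return sims.mean().item()
</preformat>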
      </sec>
      <sec id="sec-3-6">
        <title>3.6. Evaluation</title>
        <sec id="sec-3-6-1">
          <title>We compared the Model Rating Scores (MRSs), the Cosine</title>
          <p>Similarity Scores (CSSs), and the Human Rating Scores
(HRSs) collected by Haber and Poesio [1] by means of
Spearman Correlation. The correlation between MRSs
and CSSs should shed light on the internal coherence of
each model and aims at answering the following
question: Is the metalinguistic knowledge of the model
consistent with its internal representations? By
comparing HRSs with MRSs and HRSs with CSSs, we aim
to explore a diferent issue: Do the human ratings
have a more similar distribution to what a model
generates rather than its internal representation
or vice-versa? Before computing the correlation, we
rescaled the CSSs in the range 0.01 − 1.00. We also
rescaled the MRSs from the range 1 − 100, to the range
0.01 − 1.00. As for the HRSs, we used the average of the
reflects the similarity distribution indicated by the human
subjects far less accurately than the MRS.</p>
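        <p>A minimal sketch of this comparison is given below. The score values are illustrative, and min-max rescaling is an assumption, as the exact rescaling procedure is not detailed above; note that any monotonic rescaling leaves the Spearman correlation unchanged.</p>
        <preformat>
# Sketch: rescale MRS and CSS to the 0.01-1.00 range and compute Spearman correlations.
import numpy as np
from scipy.stats import spearmanr

def rescale(x, lo=0.01, hi=1.00):
    x = np.asarray(x, dtype=float)
    return lo + (x - x.min()) * (hi - lo) / (x.max() - x.min())

mrs = rescale([20, 60, 80, 100])          # model ratings, originally on a 1-100 scale
css = rescale([0.31, 0.44, 0.52, 0.61])   # cosine similarities of the target word pairs
hrs = np.array([0.25, 0.55, 0.80, 0.95])  # average human rating per sentence pair

rho_mrs_hrs, _ = spearmanr(mrs, hrs)
rho_css_hrs, _ = spearmanr(css, hrs)
rho_mrs_css, _ = spearmanr(mrs, css)
</preformat>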
      </sec>
    </sec>
    <sec id="sec-results">
      <title>4. Results and analyses</title>
      <p>Table 2 reports the correlations among human ratings, model ratings, and cosine similarities. First, we consider the correlation between cosine similarities and human ratings. The three models exhibit a near-zero correlation between CSS and HRS, which is always negative for Mistral (−0.020) and positive for LLaMA (0.016, 0.110). Second, we compare model ratings to human ones. We observe that there is a moderate-to-high correlation for LLaMA (0.616), and a low-to-moderate correlation for Mistral (0.404) and Gemma (0.446). Thus, despite being more correlated than cosine similarities, the models’ ratings often differ from human ones. We observed some recurrent patterns in the score assignments by each model (Figure 3 in the appendix enables a detailed examination of the ratings generated by the models; an interactive version of these plots will be available on GitHub). LLaMA frequently assigns similarity ratings of 20, 60, and 80. Gemma shows a preference for very low or very high scores, leaving the middle range sparsely populated. Mistral appears the most balanced in its evaluations, yet it still favors round values (100, 90, 80, etc.) and shows a strong preference for values close to 1. However, these rating preferences do not seem to correspond to lexical preferences. Although the MRS appears to correlate better with the HRS than the CSS, the unstable nature of prompt results and their sensitivity to biases from the data or prior training make them less suitable for inspecting the model’s competence regarding complex semantic features like polysemy.</p>
      <p>In addition to this, we observe that in the comparison between CSS and HRS, the cosine similarity distributions of Mistral and LLaMA appear similar, while Gemma’s distribution is shifted towards higher values. We can surmise that this may be attributed to a greater anisotropy in the embedding space characterizing the Gemma model (see Section 4.1 for a thorough analysis). Overall, the CSS reflects the similarity distribution indicated by the human subjects far less accurately than the MRS.</p>
      <p>Finally, to evaluate the internal coherence of the models in terms of the agreement between the generated similarity scores and hidden representations, we also compared the cosine similarities and model ratings of each model. In this case, the highest correlation is obtained by LLaMA, which nonetheless exhibits a very weak correlation (0.118 on the last layer), meaning that one cannot reliably predict the MRS based on the CSS. We speculate that a complex phenomenon like polysemy is only sub-optimally represented at the token embedding level.</p>
      <sec id="sec-results-anisotropy">
        <title>4.1. Anisotropy</title>
        <p>As shown in Table 3, the degree of anisotropy varies quite significantly among the three decoder-only models, especially between Gemma and the other two models, Mistral and LLaMA. Gemma exhibited the highest cosine similarity scores, approximately 0.67 for the last four layers and slightly higher for the last layer (0.75), corroborating the findings of [2] regarding anisotropy in decoder models such as GPT-2, which peaks in the last layer. Conversely, Mistral showed the lowest scores (0.137 for both the last and last four layers), followed by LLaMA (0.24 for the last four layers and 0.228 for the last layer), indicating a much more isotropic space than one would expect for models with similar architecture and comparable size. This suggests that anisotropy might not be the same in all Transformer-based models. Rather, it appears to be a property that is present at varying degrees in models, with some exhibiting greater anisotropy than others. This may be due to specific differences in how the models were trained, in terms of the data used and of the pre-training, fine-tuning, and post-training techniques. We aim to further investigate this aspect in the future.</p>
        <p>Due to these differences, we decided not to apply any post-processing method [12, 10] to mitigate the anisotropy of our target vectors. However, looking in detail at the relationship between the models’ anisotropy and their respective cosine similarities, it seems that the relatively low degree of anisotropy in both Mistral and LLaMA does not result in a better correlation between their CSS and HRS. On the contrary, despite the generally moderate level of anisotropy found in these decoder-only models, the CSS of the target tokens correlate less with the HRS than the MRS. This finding suggests that the low correlations of cosine similarities cannot be (entirely) due to the embedding anisotropy and that, conversely, the latter does not affect the model generation abilities significantly. This appears to confirm recent trends suggesting that cosine similarity is a suboptimal measure to explore Transformers’ geometries [19].</p>
      </sec>
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion and future work</title>
      <p>Our study investigates how LLMs handle LA, using two distinct methodologies: eliciting rating scores from the model and analyzing the cosine similarity between pairs of polysemous words. We calculated the Spearman correlation between HRS vs. MRS, HRS vs. CSS, and MRS vs. CSS. The aim was to determine whether the model’s metalinguistic knowledge aligns with its internal representations and to assess if human ratings more closely match the outputs generated by the model than its internal representations.</p>
      <p>The lack of correlation between CSS and MRS provides intriguing insights into the relationship between the internal representations of LLMs and the responses they generate in metalinguistic tasks, like explicitly assigning similarity ratings. Specifically, the argument presented by Hu and Levy [9] appears to be validated: generated responses do not always reflect the model’s internal processing. Hu and Levy [9] compared model generations with their probability distributions and found the latter method to be more accurate. In contrast, in our study, using the internal representations of the model (i.e., the contextual embeddings, as motivated in Section 2) proved to be a less reliable method. The most straightforward conclusion is that generative LLMs might be suboptimal for estimating word sense similarity. The superior performance of probability estimation reported by Hu and Levy [9] might be due to its direct link to the prediction training objectives of LLMs. To further investigate the relationship between CSS and MRS, we inspected the anisotropy of the embeddings. The average cosine similarity among a sample of generated tokens was relatively low, indicating that anisotropy did not affect our cosine similarity measures and is not characteristic of all decoder-only models under investigation. The low anisotropy observed in some of the analyzed decoder-only models is at odds with the conclusions of Ethayarajh [2], who reported a highly anisotropic space for GPT-2.</p>
      <p>Only the MRS yielded a moderate correlation with the HRS, indicating that LA is not fully captured by the analyzed models, either in text generation or in vector representations. In conclusion, the relationship between human judgments, model generations, and internal representations appears unclear and calls for further research. Despite the low anisotropy of the examined models, cosine similarity did not reveal a correlation between the generations and the internal representations of the models, indicating a need for deeper investigation. We plan to repeat the experiments by leveraging recent results with sparse autoencoders [20] to decompose the meanings of lexically ambiguous words. This could provide a deeper understanding of the models’ ability to handle and represent polysemy.</p>
      <p>We could not extract embeddings from commercial models, such as those provided by OpenAI, which are accessible only through APIs. However, it would be valuable in future research, if and when this functionality becomes available, to analyze and compare the internal representations and the generated outputs of these state-of-the-art models.</p>
      <p>Another promising avenue for future research is to examine the differences between vector representations and generated tokens with respect to linguistic phenomena beyond polysemy and lexical ambiguity. For instance, incorporating out-of-vocabulary words could allow for an exploration of semantic shifts caused by the addition of prefixes or suffixes (e.g., “order” vs. “dis-order”), offering valuable insights. This analysis would benefit from using a tokenization strategy that treats morphemes as subtokens, alongside an investigation into the degree of anisotropy in these models.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We acknowledge financial support under the PRIN 2022 Project “Computational and linguistic benchmarks for the study of verb argument structure” – CUP I53D23004050006 – Grant Assignment Decree No. 1016 adopted on 07/07/2023 by the Italian Ministry of University and Research (MUR). This work was also supported under the PNRR – M4C2 – Investimento 1.3, Partenariato Esteso PE00000013 – “FAIR – Future Artificial Intelligence Research” – Spoke 1 “Human-centered AI”, funded by the European Commission under the NextGeneration EU programme, and partially supported by the Italian Ministry of University and Research (MUR) in the framework of the PON 2014-2021 “Research and Innovation” resources – Innovation Action – DM MUR 1062/2021 – Title of the Research: “Modelli semantici multimodali per l’industria 4.0 e le digital humanities”.</p>
    </sec>
    <sec id="sec-references">
      <title>References</title>
      <p>[1] J. Haber, M. Poesio, Patterns of polysemy and homonymy in contextualised language models, in: Findings of the Association for Computational Linguistics: EMNLP 2021, 2021, pp. 2663–2676.</p>
      <p>[2] K. Ethayarajh, How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT-2 embeddings, arXiv preprint arXiv:1909.00512 (2019).</p>
      <p>[3] D. Yenicelik, F. Schmidt, Y. Kilcher, How does BERT capture semantics? A closer look at polysemous words, in: Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, 2020, pp. 156–162.</p>
      <p>[4] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805 (2018).</p>
      <p>[5] T. Mickus, D. Paperno, M. Constant, K. van Deemter, What do you mean, BERT?, in: Proceedings of the Society for Computation in Linguistics 2020, 2020, pp. 279–290.</p>
      <p>[6] S. Trott, B. Bergen, RAW-C: Relatedness of ambiguous words–in context (a new lexical resource for English), arXiv preprint arXiv:2105.13266 (2021).</p>
      <p>[7] S. Nair, M. Srinivasan, S. Meylan, Contextualized word embeddings encode aspects of human-like word sense knowledge, arXiv preprint arXiv:2010.13057 (2020).</p>
      <p>[8] S. Trott, Can large language models help augment English psycholinguistic datasets?, Behavior Research Methods (2024) 1–19.</p>
      <p>[9] J. Hu, R. Levy, Prompting is not a substitute for probability measurements in large language models, in: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 5040–5060.</p>
      <p>[10] J. Mu, S. Bhat, P. Viswanath, All-but-the-top: Simple and effective postprocessing for word representations, arXiv preprint arXiv:1702.01417 (2017).</p>
      <p>[11] V. Zhelezniak, A. Savkov, A. Shen, N. Y. Hammerla, Correlation coefficients and semantic textual similarity, arXiv preprint arXiv:1905.07790 (2019).</p>
      <p>[12] W. Timkey, M. Van Schijndel, All bark and no bite: Rogue dimensions in transformer language models obscure representational quality, arXiv preprint arXiv:2109.04404 (2021).</p>
      <p>[13] AI@Meta, Llama 3 model card (2024). URL: https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md.</p>
      <p>[14] J. Phoenix, M. Taylor, Prompt Engineering for Generative AI, O’Reilly Media, Inc., 2024.</p>
      <p>[15] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.</p>
      <p>[16] D. Loureiro, A. Jorge, Language modelling makes sense: Propagating representations through WordNet for full-coverage word sense disambiguation, arXiv preprint arXiv:1906.10007 (2019).</p>
      <p>[17] J. Gao, D. He, X. Tan, T. Qin, L. Wang, T.-Y. Liu, Representation degeneration problem in training natural language generation models, 2019. arXiv:1907.12009.</p>
      <p>[18] N. Godey, É. de la Clergerie, B. Sagot, Anisotropy is inherent to self-attention in transformers, arXiv preprint arXiv:2401.12143 (2024).</p>
      <p>[19] H. Steck, C. Ekanadham, N. Kallus, Is cosine-similarity of embeddings really about similarity?, in: Companion Proceedings of the ACM on Web Conference 2024, 2024, pp. 887–890.</p>
      <p>[20] T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, et al., Towards monosemanticity: Decomposing language models with dictionary learning, Transformer Circuits Thread 2 (2023).</p>
    </sec>
    <sec id="sec-4">
      <title>A. The prompt</title>
      <sec id="sec-4-1">
        <title>The following text box shows the prompt used to test LLMs in our lexical ambiguity experiment. The underlined text was replaced by sentences and word targets from the dataset shared by Haber and Poesio [1].</title>
      </sec>
      <sec id="sec-4-2">
        <title>You will receive two sentences. Your task is to rate how similar is the use of the word ‘word’ in the two sentences. • Sentence 1: s1</title>
        <p>• Sentence 2: s2
You must follow the following principles:
• Assign a rating on a scale of 1-100, where
1 means that the word is used with
completely diferent senses in the two
sentences and 100 means that the word is
used in the same sense across the two
sentences.
• Return your answer in this way:
– Rewrite the two sentences
following this template:
∗ Sentence1: &lt;text&gt;
∗ Sentence2: &lt;text&gt;
– Motivation: &lt;a concise
motivation for your rating&gt;
– Rating score: &lt;only a float
number on a scale of 1-100 and nothing
else&gt;.
• Interrupt generation after the rating
score.</p>
      </sec>
      <sec id="sec-4-3">
        <title>Question: how similar is the use of the word</title>
        <p>word in the following two sentences?
s1
s2</p>
        <p>Answer:</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>B. More on human-rated pairs</title>
      <p>C. Additional Figures</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>