<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>The Time-Embedding Travelers at WiC-ITA</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Francesco Periti</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Haim Dubossarsky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Queen Mary University of London</institution>
          ,
          <addr-line>England</addr-line>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milan</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>The WiC-ITA shared task aims to determine whether a word appearing in two distinct sentences carries the same meaning. The task consists of two subtasks: binary classification (Subtask 1) and ranking (Subtask 2). Each subtask is designed in both a monolingual (Italian) and a multilingual (Italian-English) setting. In this report, we present the results of our participation in WiC-ITA. In our experiments, we leverage the condition number of the cosine similarity matrix between XLM-R embeddings and demonstrate competitive performance, ranking among the top positions in both the monolingual and cross-lingual settings. Our results indicate that semantic information is present not only in the last layers of XLM-R but throughout the entire architecture, including the middle layers. This suggests potential avenues for future research to explore the use of the complete set of embeddings, rather than solely relying on the embeddings extracted from the last layer(s).</p>
      </abstract>
      <kwd-group>
<kwd>Word-in-Context</kwd>
        <kwd>Contextualized Embeddings</kwd>
        <kwd>Condition Number</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In the last decade, the use of Word Embedding techniques has improved the modeling of lexical semantics. Initially, static embedding models were employed to encode the dominant semantics of a word into a single vector representation, i.e., a word embedding (Mikolov et al., 2013 [1]). However, understanding the meaning of words in their specific contexts is crucial for modeling language effectively. This motivated the recent efforts to create contextualized models capable of generating different vector representations according to the context in which the words occur (Devlin et al., 2019 [2]).</p>
      <p>Despite the growing popularity of contextualized embeddings in research fields such as Word Sense Disambiguation or Lexical Semantic Shift Detection (Scarlini et al., 2020 [3]; Montanelli and Periti, 2023 [4]), Word-in-Context (WiC) benchmarks that specifically focus on the dynamics of word semantics are relatively recent. The first WiC benchmarks were limited to English (Pilehvar et al., 2019 [5]; Loureiro et al., 2022 [6]). Their success prompted the development of new WiC benchmarks to cover a wider range of languages (Raganato et al., 2020 [7]; Liu et al., 2021 [8]), to test transfer learning in cross-lingual settings (Martelli et al., 2021 [9]), and to evaluate graded word similarity in context (Armendariz et al., 2020 [10]).</p>
      <p>The WiC-ITA shared task at EVALITA 2023 provides a novel benchmark for evaluating WiC in both a monolingual (L) setting in Italian and a cross-lingual (XL) setting from Italian to English (Cassotti et al., 2023 [11]; Lai et al., 2023 [12]). Inspired by the previous work, WiC-ITA challenges its participants with two sub-tasks: (1) Binary Classification: to establish whether a target word w occurring in a pair of sentences ⟨s1, s2⟩ has the same meaning or not (Subtask 1); (2) Ranking: to rank the pair of sentences ⟨s1, s2⟩ by the degree of similarity of the target word's meaning (Subtask 2).</p>
    </sec>
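    <p>To make the two sub-tasks concrete, the following is a purely hypothetical sketch: the record layout, sentences, score, and threshold below are our own illustrative choices, not the official WiC-ITA data format.</p>
    <preformat>
```python
# Purely illustrative sketch of the two WiC-ITA sub-tasks; the record
# layout, sentences, score, and threshold are hypothetical, not the
# official data format.

example = {
    "target": "pianta",  # Italian: "plant" (organism) or "plan, map"
    "sentence_1": "La pianta del primo piano è appesa al muro.",  # "floor plan"
    "sentence_2": "Ho annaffiato la pianta sul balcone.",         # "potted plant"
}

# Subtask 2 (Ranking): output a graded similarity score for the target
# word's meaning across the two sentences (gold labels range from 1,
# unrelated, to 4, identical).
graded_score = 1.3

# Subtask 1 (Binary Classification): threshold the same score to decide
# whether the meaning is the same (1) or not (0).
threshold = 2.5
binary_label = 1 if graded_score >= threshold else 0

print(binary_label)  # prints 0: the two usages are unrelated
```
    </preformat>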
    <sec id="sec-2">
      <title>2. Background and motivation</title>
      <p>BERT is a powerful contextualized model that leverages the Transformer encoder to capture the contextual semantics of words (Devlin et al., 2019 [2]; Vaswani et al., 2017 [13]). Typically, the success of BERT is attributed to its multi-layer (e.g., 12) and multi-head (e.g., 12) self-attention blocks. However, most state-of-the-art work only uses the outputs of the final layer(s) (i.e., word embeddings) as input for solving NLP tasks, while ignoring the output of the earlier layers. As a result, the role of different embedding layers in representing the semantics of word occurrences is still unclear. Recently, a limited number of studies have explored the nature and characteristics of BERT embeddings. In particular, Jawahar et al. (2019) [14] indicate that BERT's lower layers capture surface features pertaining to phrase-level information, middle layers capture syntactic features, and higher layers capture semantic features. Devlin et al. (2019) [2] report that combining the last four hidden layers can be beneficial for mainstream tasks such as Named Entity Recognition. Ethayarajh (2019) [15] demonstrates that the geometry of the embedding space exhibits anisotropy, meaning that the embeddings of all layers occupy a narrow cone within the vector space. Other work involves probing tasks, as proposed in Hewitt et al. (2019) [16]. These tasks consist of training an auxiliary classifier on top of a model, where the contextualized embeddings serve as features to predict syntactic (e.g., part-of-speech tags) and semantic (e.g., word relations) properties of words. The idea is that if the auxiliary classifier accurately predicts a linguistic property, we can assume that the property is encoded in the tested model. In line with this work, Coenen et al. (2019) [17] investigate the capability of word sense prediction and indicate that earlier-layer embeddings contain significantly more semantic information than conventionally believed.</p>
      <p>Thus, our experiments are motivated by the latter finding and inspired by linguistic research that highlights the influential role of morphology and syntax in shaping word meanings (Wysocki and Jenkins, 1987 [18]). In this paper, we test the hypothesis that word meanings should be investigated by considering the full output of pre-trained models, encompassing not only the semantic features of the last layers but also the intricate interplay of semantic, surface, and syntactic features present in the middle and lower layers of contextualized models.</p>
    </sec>
    <sec id="sec-3">
      <title>3. System overview</title>
      <p>Our system is a simple threshold-based classifier based on the similarity of two sets of word vectors. In particular, given a pair of sentences ⟨s1, s2⟩ and a target word w, we use the output embeddings of a contextualized embedding model to compute a continuous similarity score. This score indicates the extent to which the target w carries the same meaning in the sentences s1 and s2.</p>
      <p>More precisely, consider a sentence s that contains the word w. Given a contextualized model m, a vector representation of w is extracted from every layer of the model m. This way, the word w in the sentence s is associated with a set of contextualized embeddings denoted by W. It is worth noting that W ∈ R^(l×d), where l is the number of encoders of the model m (e.g., 12) and d is the dimension of the embeddings (e.g., 768). As a result, we denote as W1 and W2 the contextualized embeddings of w extracted from the sentences s1 and s2, respectively.</p>
      <p>In order to evaluate the similarity of the word w in the contexts s1 and s2, we collect the pairwise cosine similarities between W1 and W2. We denote as M the similarity matrix between W1 and W2 (see Figure 3 as an example). Our hypothesis is that taking into account information from all layers at once will provide a richer and more comprehensive picture of the nature of usage similarities of a word between the two sentences. We hypothesize that, because many layers are known to capture relevant semantic information, we should consider as many of them as possible together, as they may contain more comprehensive information than a single-layer comparison approach.</p>
      <p>In order to tap into this pool of similarity scores encoded within M (which contains 144 times more information than a single layer), we use a measure called the condition number. The condition number of a matrix, which has already been successfully applied in other NLP domains (Dubossarsky et al., 2020 [19]), provides us with a unified measure that takes into account the many similarity scores between the representations of w in the pair s1 and s2 throughout the different layers.</p>
      <p>Originally, the condition number of a matrix was used to measure its sensitivity to perturbations, or small changes, in its input. A large condition number indicates that the matrix is ill-conditioned, meaning it is sensitive to small perturbations. On the other hand, a small condition number indicates that the matrix is well-conditioned, meaning that small changes will not affect it much.</p>
      <p>In the setting of the WiC task, we interpret the condition number of a similarity matrix as associated with the stability of meaning between the two sentences. Overall higher similarity scores in M indicate two similar word usages and are expected to produce a lower (and better) condition number. On the other hand, less similar and more varied similarity scores indicate more unrelated usages, resulting in a higher (and worse) condition number.</p>
      <p>The condition number of a matrix is defined as the product of the matrix's norm and the norm of its reciprocal (i.e., the inverse of the matrix). The norm can be the Euclidean norm, the Max norm, the Frobenius norm, etc. In our experiments, we calculate the condition number (COND) of the similarity matrix M using the Frobenius norm as follows:</p>
      <p>COND(M) = ‖M‖ · ‖M⁻¹‖</p>
      <p>When we compute the condition number from the similarity matrix M, we assess the degree of semantic similarity of a word w in each pair ⟨s1, s2⟩ as COND(M). For ease of interpretation, in our experiments we utilize the negated metric −COND, associating smaller numbers with unrelated usages (annotated as 1) and larger numbers with identical usages (annotated as 4).</p>
      <p>Furthermore, we also investigate the similarity by considering only a subset of M. We test COND computed on the first, middle, and last four layers of the model m, respectively.</p>
      <p>For the sake of comparison, we set as reference baselines the cosine similarity (CS) of the w embeddings extracted from each layer of the model m individually, meaning that we compute l different CS scores as</p>
      <p>CS(W1[i], W2[i]) = (W1[i] · W2[i]) / (‖W1[i]‖ ‖W2[i]‖), with i ∈ {1, ..., l}.</p>
      <p>Additionally, we compute the cosine similarity CS between the word embeddings obtained by averaging the last four embeddings of W1 and W2, respectively.</p>
      <p>In line with the WiC-ITA guidelines, we compute the Spearman correlation between the estimated similarity scores and the gold answers. This serves as the evaluation metric for Subtask 2. In Subtask 1, our binary predictions are derived from the similarity scores obtained in Subtask 2. We employ a threshold-based classifier, selecting the threshold value that optimizes the F1 score on the set of sentence pairs used as training set.</p>
    </sec>
    <sec id="sec-3a">
      <title>4. Experimental setup</title>
      <p>In this task, we compared two different contextualized multilingual models, namely mBERT (Devlin et al., 2019 [2]) and XLM-R (Conneau et al., 2020 [20]). We use the Transformers library by HuggingFace to extract contextual word embeddings from the mBERT and XLM-R models without performing any fine-tuning (Wolf et al., 2020 [21]). We use the base versions, with 12 layers and 768 hidden dimensions: bert-base-multilingual-cased and xlm-roberta-base, respectively.</p>
      <p>Given a target word w and a pair ⟨s1, s2⟩, the acquisition of contextual embeddings is done by feeding the models with the sentences s1 and s2 individually. For every sentence, we extract the token embedding for the target word w from each layer of the model. Due to the byte-pair input encoding scheme employed by BERT-like models, some tokens may not correspond to complete words but rather to word pieces. In such cases, when a word is split into multiple tokens, we build a single word embedding by averaging the embeddings of its constituent word pieces.</p>
      <p>Finally, to assess the graded word similarity in the context of a pair of sentences, we calculate similarity scores between the contextualized embeddings of the target word under consideration (see Section 3).</p>
    </sec>
    <sec id="sec-3b">
      <title>5. Experimental results</title>
      <p>In our submissions, we rely on XLM-R as it proved to be more effective than mBERT. To maximize the performance of our system, we leverage the available train and dev sets as a whole. In particular, we randomly generate 100 different train-test splits, with sizes of 2000 and 1305 respectively (equivalent to 60% and 40% of the full dataset). We conduct cross-validation on these 100 splits to validate the use of COND for Subtask 2. Additionally, we leverage cross-validation to determine the optimal threshold for Subtask 1, meaning that we rely on the average of the 100 best thresholds obtained during cross-validation. The average scores of Spearman correlation, Precision, Recall, and F1 are presented in Table 1 for each tested measure. For Subtask 1 and Subtask 2, and for both the L and XL settings, our three submissions correspond to the top three measures based on the F1 score and the Spearman correlation, respectively (i.e., COND and two of its layer-subset variants).</p>
      <p>For the sake of comparison, Table 2 presents the preliminary performance achieved during the development phase with both XLM-R and mBERT over the Dev and Train sets (we report in bold the best result for each metric, model, and data set). Motivated by the superior results achieved during the development phase, we relied on XLM-R for our final submissions. It is worth noting that, in Table 2, COND also emerged as the leading measure for the mBERT model, proving its consistency. Moreover, we note that for the WiC-ITA task, the embeddings from the last layer of both XLM-R and mBERT, as well as the embeddings derived by aggregating the last four layers, are not as effective as those from other layers. For instance, it is interesting to observe that layer 8 seems to be effective for Subtask 1.</p>
      <p>In the final evaluation leaderboard for the WiC-ITA task, we ranked 2nd for L-Subtask1, 1st for XL-Subtask1, 2nd for L-Subtask2, and 1st for XL-Subtask2. The leaderboard, which includes the teams BERT 4EVER, LG, extremITA, the organizers' Baseline, and our team The Time-Embedding Travelers, is reported in Table 3.</p>
      <p>Our final results at WiC-ITA demonstrate that COND effectively captures semantic features of word meanings and can be successfully applied to tasks like WiC. Based on our development results, we assert that COND consistently outperforms the CS measure computed over individual contextualized embeddings, for Subtasks 1 and 2 in both the L and XL settings. This is particularly interesting considering that CS is commonly utilized in NLP tasks to capture contextual semantics in contextualized embeddings.</p>
      <p>Finally, COND consistently achieves good results by considering the middle layers alone. These results are in line with the findings of Coenen et al. (2019) [17], and suggest that the middle layers of BERT-like models contain valuable information for effectively representing meaning. Therefore, future work should explore the application of COND to WiC and other related NLP tasks such as Lexical Semantic Change Detection (Montanelli and Periti, 2023 [4]).</p>
    </sec>
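    <p>As a minimal, self-contained sketch of the COND measure, the following NumPy snippet uses random vectors in place of real XLM-R embeddings; the noise level, random seed, and the choice of layers 5-8 for the middle-layer variant are our own illustrative assumptions, not the authors' actual code.</p>
    <preformat>
```python
import numpy as np

def cosine_similarity_matrix(W1, W2):
    """Pairwise cosine similarities between two sets of per-layer word
    embeddings of shape (l, d); returns the (l, l) similarity matrix M."""
    n1 = W1 / np.linalg.norm(W1, axis=1, keepdims=True)
    n2 = W2 / np.linalg.norm(W2, axis=1, keepdims=True)
    return n1 @ n2.T

def cond(M):
    """Frobenius-norm condition number: COND(M) = ||M||_F * ||M^-1||_F."""
    return np.linalg.norm(M, "fro") * np.linalg.norm(np.linalg.inv(M), "fro")

rng = np.random.default_rng(0)
l, d = 12, 768  # number of layers and hidden size of a base-sized model

# Target word in two very similar contexts: near-identical embeddings.
W1 = rng.normal(size=(l, d))
W2_same = W1 + 0.01 * rng.normal(size=(l, d))

# Target word in two unrelated contexts: independent embeddings.
W2_diff = rng.normal(size=(l, d))

cond_same = cond(cosine_similarity_matrix(W1, W2_same))
cond_diff = cond(cosine_similarity_matrix(W1, W2_diff))

# A layer-subset variant, e.g. restricted to the middle four layers.
cond_mid = cond(cosine_similarity_matrix(W1[4:8], W2_same[4:8]))

# Similar usages are expected to yield a lower (better) condition number
# than unrelated usages.
print(cond_same, cond_mid, cond_diff)
```
    </preformat>
    <p>Under this sketch, cond_same stays close to the theoretical minimum of the Frobenius condition number (the matrix dimension), while cond_diff is substantially larger, mirroring the intended reading of COND as an inverse similarity measure.</p>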
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>Our experiments for the WiC-ITA shared task ranked 2nd for L-Subtask1, 1st for XL-Subtask1, 2nd for L-Subtask2, and 1st for XL-Subtask2. In our submissions, we use the condition number of the cosine similarity matrix between XLM-R embeddings extracted from different layers. Our results support our initial hypothesis that leveraging all the information provided by the pre-trained model can be beneficial.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has in part been funded by the project Towards Computational Lexical Semantic Change Detection supported by the Swedish Research Council (2019–2022; contract 2018-01184), and in part by the research program Change is Key! supported by Riksbankens Jubileumsfond (under reference number M21-0021).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Mikolov, K. Chen, G. Corrado, J. Dean, Efficient Estimation of Word Representations in Vector Space, in: Proc. of the ICLR Workshop, Scottsdale, Arizona, 2013.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, in: Proc. of NAACL-HLT, ACL, Minneapolis, Minnesota, 2019, pp. 4171-4186.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] B. Scarlini, T. Pasini, R. Navigli, With More Contexts Comes Better Performance: Contextualized Sense Embeddings for All-Round Word Sense Disambiguation, in: Proc. of EMNLP, ACL, Online, 2020, pp. 3528-3539.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] S. Montanelli, F. Periti, A Survey on Contextualised Semantic Shift Detection, 2023. arXiv:2304.01666.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] M. T. Pilehvar, J. Camacho-Collados, WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations, in: Proc. of NAACL-HLT, ACL, Minneapolis, Minnesota, 2019, pp. 1267-1273.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] D. Loureiro, A. D'Souza, A. N. Muhajab, I. A. White, G. Wong, L. Espinosa-Anke, L. Neves, F. Barbieri, J. Camacho-Collados, TempoWiC: An Evaluation Benchmark for Detecting Meaning Shift in Social Media, in: Proc. of COLING, Gyeongju, Republic of Korea, 2022, pp. 3353-3359.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Raganato, T. Pasini, J. Camacho-Collados, M. T. Pilehvar, XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization, in: Proc. of EMNLP, ACL, Online, 2020, pp. 7193-7206.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Q. Liu, E. M. Ponti, D. McCarthy, I. Vulić, A. Korhonen, AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples, in: Proc. of EMNLP, ACL, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is All You Need, in: Advances in Neural Information Processing Systems, volume 30, Curran Associates, Inc., 2017.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] G. Jawahar, B. Sagot, D. Seddah, What Does BERT Learn about the Structure of Language?, in: Proc. of ACL, ACL, Florence, Italy, 2019, pp. 3651-3657.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] K. Ethayarajh, How Contextual are Contextualized Word Representations? Comparing the Geometry of BERT, ELMo, and GPT-2 Embeddings, in: Proc. of EMNLP-IJCNLP, ACL, Hong Kong, China, 2019, pp. 55-65.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] J. Hewitt, P. Liang, Designing and Interpreting Probes with Control Tasks, in: Proc. of EMNLP-IJCNLP, ACL, Hong Kong, China, 2019, pp. 2733-2743.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] A. Coenen, E. Reif, A. Yuan, B. Kim, A. Pearce, F. Viégas, M. Wattenberg, Visualizing and Measuring the Geometry of BERT, in: Advances in Neural Information Processing Systems, Curran Associates, Red Hook, NY, USA, 2019.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[18] K. Wysocki, J. R. Jenkins, Deriving word meanings through morphological generalization, Reading Research Quarterly (1987) 66-81.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[19] H. Dubossarsky, I. Vulić, R. Reichart, A. Korhonen, The Secret is in the Spectra: Predicting Cross-lingual Task Performance with Spectral Similarity Measures, in: Proc. of EMNLP, ACL, Online, 2020, pp. 2377-2390.</mixed-citation>
      </ref>
      <ref id="ref19b">
        <mixed-citation>[20] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: Proc. of ACL, Online, 2020.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <article-title>Context across Low-Resource Languages with Ad-</article-title>
          arXiv:
          <year>1911</year>
          .02116.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          versarial Examples,
          <source>in: Proc. of EMNLP</source>
          , ACL, Punta [21]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          , C. De-
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Cana</surname>
            ,
            <given-names>Dominican</given-names>
          </string-name>
          <string-name>
            <surname>Republic</surname>
          </string-name>
          ,
          <year>2021</year>
          , pp.
          <fpage>7151</fpage>
          -
          <lpage>7162</lpage>
          . langue, A. Moi,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          , M. Fun[9]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Kalach</surname>
          </string-name>
          , G. Tola, R. Navigli, SemEval- towicz, J. Davison,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
          , P. von Platen, C. Ma,
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <source>2021 Task</source>
          <volume>2</volume>
          : Multilingual and
          <string-name>
            <surname>Cross-lingual Word- Y. Jernite</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Plu</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>T. Le</given-names>
          </string-name>
          <string-name>
            <surname>Scao</surname>
          </string-name>
          , S. Gugger,
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <string-name>
            <surname>of</surname>
            <given-names>SemEval</given-names>
          </string-name>
          , ACL, Online,
          <year>2021</year>
          , pp.
          <fpage>24</fpage>
          -
          <lpage>36</lpage>
          .
          <article-title>of-the-Art Natural Language Processing</article-title>
          , in: Proc. [10]
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Armendariz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Purver</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ulčar</surname>
          </string-name>
          , S. Pollak,
          <string-name>
            <surname>of</surname>
            <given-names>EMNLP</given-names>
          </string-name>
          , ACL, Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          <string-name>
            <given-names>N.</given-names>
            <surname>Ljubešić</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Granroth-Wilding</surname>
          </string-name>
          , CoSimLex: A [22]
          <string-name>
            <given-names>N.</given-names>
            <surname>Tahmasebi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Dubossarsky</surname>
          </string-name>
          , Computa-
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          <article-title>Resource for Evaluating Graded Word Similarity in tional modeling of semantic change</article-title>
          ,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <string-name>
            <surname>Context</surname>
          </string-name>
          , in
          <source>: Proc. of LREC</source>
          , ELRA, Marseille, France, arXiv:
          <fpage>2304</fpage>
          .
          <fpage>06337</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <year>2020</year>
          , pp.
          <fpage>5878</fpage>
          -
          <lpage>5886</lpage>
          . [23]
          <string-name>
            <given-names>F.</given-names>
            <surname>Periti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ferrara</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Montanelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ruskov</surname>
          </string-name>
          , What [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cassotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Gatto, is Done is Done: an Incremental Approach to Se-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <article-title>WiC-ITA at EVALITA2023: Overview mantic Shift Detection</article-title>
          ,
          <source>in: Proceedings of the 3rd</source>
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          <article-title>of the EVALITA2023 Word-in-Context for</article-title>
          ITAlian Workshop on Computational Approaches to His-
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          2023. tational Linguistics, Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>33</fpage>
          -
          <lpage>[</lpage>
          12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Russo</surname>
          </string-name>
          , R. Sprug- 43. URL: https://aclanthology.org/
          <year>2022</year>
          .lchange-
          <volume>1</volume>
          .4.
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          <string-name>
            <surname>noli</surname>
            , G. Venturi,
            <given-names>EVALITA</given-names>
          </string-name>
          <year>2023</year>
          :
          <article-title>Overview of the doi</article-title>
          :
          <volume>10</volume>
          .18653/v1/
          <year>2022</year>
          .lchange-
          <volume>1</volume>
          .4.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <article-title>8th Evaluation Campaign of Natural Language Pro-</article-title>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>