<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WiC-ITA at EVALITA2023: Overview of the EVALITA2023 Word-in-Context for ITAlian Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierluigi Cassotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia C. Passaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maristella Gatto</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dipartimento di Ricerca e Innovazione Umanistica, University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>WiC-ita is a shared task proposed at the EVALITA 2023 campaign. The task focuses on the meaning of words in specific contexts and has been modelled as both a binary classification and a ranking problem. Overall, 4 groups took part in both subtasks, with 9 different runs. In this report, we describe how the task was set up, we report the system results, and we discuss them.</p>
      </abstract>
      <kwd-group>
        <kwd>Word in Context</kwd>
        <kwd>Lexical Semantics</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and motivation</title>
      <p>Word Sense Disambiguation [1] is a Natural Language
Processing task with a long history and is extremely
interesting for the Computational Linguistics community. In
Word Sense Disambiguation (WSD), the goal is to
disambiguate each word occurrence, assigning to it the correct
sense from a predefined sense inventory, such as WordNet
[2]. The introduction of contextualized models such as
BERT, which allow the representation of a word in different
contexts, steered the research focus to new tasks, such as
the Word in Context (WiC) task [3].</p>
      <p>WSD and the WiC task are highly related: while the
former models in an explicit way the relationship between
the target word and its sense (taken from a predefined
sense inventory), the latter reduces it to a binary task. The
WiC task requires determining whether a word occurring in
two different sentences has the same meaning or not. In
recent years, there has been a growing interest in the WiC
task, demonstrated by the creation of several different
resources and shared tasks covering more than 20 languages.</p>
      <p>In general, the WiC task is of broad-scope interest,
as it is not limited to specific domains and can be
useful for several NLP tasks. Furthermore, the training and
the evaluation on a monolingual (Italian) or cross-lingual
(English-Italian) dataset is advantageous not only for
models for the Italian language. In fact, the transfer
learning ability of WiC models across different languages is
proven in previous works [4], where models improve their
performance by training in other languages.</p>
      <p>Several initiatives have been proposed throughout the
years: the first one [3] being the proposal of the WiC task,
which also came along with a dataset but was limited to
English. For this reason, it was followed by the XL-WiC [5]
dataset, which tried to tackle this issue by taking into
account a total of 15 languages. Next, MCL-WiC [4] was
the first WiC dataset to introduce the cross-lingual task.
The main motivation behind this particular choice was to
cover scenarios where systems have to deal with different
languages simultaneously, further highlighting the
importance of this task in real-world applications. With AM2iCo
[6], the main aim was to focus on low-resource languages
and to ensure that participating models must consider both
the target word and the context to achieve good
performance. Finally, in CoSimLex [7], the task is extended to
pairs of words that appear in a shared context, and the goal
is to determine to which degree they refer to the same
concept. This is done to capture word polysemy as well as
the context-dependency of words.</p>
      <p>Shared tasks regarding WiC usually preserve its
binary design, where the two possible outcomes for each
entry are: true if the target word has the same meaning in
the two sentences/contexts and false if it does not.
However, there can be some cases where it is not so simple
to determine the lack or presence of semantic similarity
in a discrete way. For this reason, we exploit the 4-point
relatedness scale introduced by [8, 9] in the annotation
process. The scale consists of 4 values, namely 4: Identical;
3: Closely Related; 2: Distantly Related; 1: Unrelated.</p>
      <p>Unfortunately, as often happens in the Natural
Language Processing research area, some languages are more
represented than others, and the WiC task is no
exception in this sense. With the WiC-ITA task at EVALITA
2023 [10], we aim to fill this gap in the literature, making
openly available a resource that can undoubtedly foster
novel research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The general goal of the WiC-ITA task is to establish
whether a word w occurring in two different sentences s1
and s2 has the same meaning or not. The task is modelled
with two different subtasks, namely a binary classification
one (Subtask 1) and a ranking one (Subtask 2). Participants
were allowed to participate in one or both of the subtasks.
Details and examples of annotation are available on the
task website (http://wic-ita.github.io/).</p>
      <sec id="sec-2-0">
        <title>2.1. Subtask 1: Binary Classification</title>
        <p>Subtask 1 is structured as follows: given a word w
occurring in two different sentences s1 and s2, the goal is
to provide the sentence pair with a score determining
whether w maintains the same meaning or not. Possible
outcomes for this subtask are:
• 0: the word w does not have the same meaning in the
two sentences s1 and s2;
• 1: the word w has the same meaning in the two
sentences s1 and s2.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Subtask 2: Ranking</title>
        <p>Subtask 2 is structured as follows: given a word w
occurring in two different sentences s1 and s2, the goal is
to provide the sentence pair with a score indicating to
which extent, on a 1-4 scale, w has the same meaning in
the two sentences. The score for this subtask is a
continuous value s in the interval [1, 4]. A higher score
corresponds to a higher degree of semantic similarity.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The creation of datasets for the WiC task usually relies
on sense inventories, such as WordNet or BabelNet
[11]. More specifically, sense inventories are often
exploited for selecting target words, which should exhibit
polysemy, and for generating sentence pairs from the
sense examples provided, i.e. sentences in which the
target word occurs with the respective sense. After the
selection of target words and the generation of sentence
pairs, only a small part of these are manually
annotated/validated by human experts.</p>
      <p>Differently from previous datasets, for the WiC-ITA
task we relied on sense inventories only for the target
word selection stage, while we extracted the list of
sentence pairs from large unlabelled corpora. Moreover,
human annotation is carried out for all the sentence pairs,
thus making WiC-ITA the largest manually annotated
resource for the WiC task.</p>
      <p>In addition to this, the WiC-ITA dataset includes both
monolingual (Italian) and cross-lingual (English-Italian)
data. The dataset is split into training, development, and
test portions. In particular:
• the training and development sets consist of
annotated pairs of monolingual (Italian) sentences;
• the test set consists of annotated pairs of
monolingual (Italian) sentences and annotated pairs of
cross-lingual (English-Italian) sentences.</p>
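      <p>The gold annotations (described later in this section) combine two independent 4-point judgements per sentence pair: the Subtask 2 score is their average, and the Subtask 1 binary label is assigned only when both judgements fall in {1, 2} (label 0) or in {3, 4} (label 1). The following sketch illustrates that rule; the function name is ours, not part of the official task tooling.</p>
      <preformat>
```python
def gold_labels(score_a, score_b):
    """Combine two annotators' 4-point relatedness judgements.

    Returns (subtask2_score, subtask1_label). The binary label is
    None when the annotators straddle the {1, 2} / {3, 4} boundary;
    such pairs carry no Subtask 1 gold label.
    """
    mean = (score_a + score_b) / 2  # Subtask 2 gold score
    low, high = {1, 2}, {3, 4}
    if score_a in low and score_b in low:
        label = 0  # different meaning
    elif score_a in high and score_b in high:
        label = 1  # same meaning
    else:
        label = None  # annotators disagree across the boundary
    return mean, label
```
      </preformat>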
      <p>We created the monolingual datasets by selecting target
words based on the number of synsets in WordNet and
senses reported in Wiktionary. To achieve this, we
generated a list of candidate target words for each part of
speech (PoS) using lemmas from both WordNet and
Wiktionary. For each lemma, we calculated the count of its
WordNet synsets and of its senses reported in Wiktionary,
took the minimum of the two counts, and ordered all the
target words by this value in descending order.</p>
      <p>To construct the cross-lingual dataset, we used
MultiSemCor [12], which is based on SemCor [13], the most
extensive and widely used dataset for Word Sense
Disambiguation. Specifically, we extracted word pairs
(Italian-English) that are frequently translated in SemCor. For
these word pairs, we computed the frequency of specific
synsets. Then, we took the union of synsets for each
word pair and computed the probability distribution over
the synsets for both the Italian and English words. The
Jensen-Shannon Divergence (JSD) is computed for each
pair, and the pairs are sorted accordingly in decreasing
order.</p>
      <p>We sampled the top-k words for the monolingual setting
and the top-k pairs of words for the cross-lingual setting
according to the minimum synset/sense count and the JSD,
respectively. The number of sampled words per PoS tag is
reported in Table 1.</p>
      <p>The monolingual and the cross-lingual sentence pairs
are extracted from the itWaC and ukWaC corpora, both
part of the WaCKy project [14, 15]. ukWaC is a corpus
obtained by crawling the web pages under the .uk
domain. It consists of more than 2 billion words, annotated
with PoS tags and lemmatized using the TreeTagger tool
[16]. itWaC, differently from ukWaC, is lemmatized using
Morph-it! and is obtained by crawling web pages under
the .it domain.</p>
      <p>Each sentence pair extracted from the aforementioned
resources has been attributed with the average score
assigned by two annotators according to the 4-point
relatedness scale, i.e. from 4 (Identical meaning) to 1
(Unrelated), the offsets of the target word in the respective
sentences, and the lemma of the target word. Note that we
only considered the Italian lemma for the cross-lingual
examples, albeit providing the offsets for both languages.</p>
      <p>The annotation process is carried out using Doccano [17].
Each data point (i.e., sentence pair) is annotated by two
independent annotators; the annotator groups for the two
settings are independent. Tables 2 and 3 show the statistics
in terms of number of annotated examples and agreement
(computed as the Spearman correlation) for each pair of
annotators. In the monolingual setting, the Spearman
correlation for the annotations consistently exceeds 0.6,
with the exception of two cases. On the other hand, in
the cross-lingual setting, the average correlation is lower
compared to the correlation obtained in the monolingual
setting. However, the correlation between annotators in
the cross-lingual scenario is also computed on smaller
samples, which can impact the reliability of the computed
correlation.</p>
      <p>The data points for which at least one of the annotators
voted 0 (Cannot decide) were discarded from the official
dataset for the sake of simplicity. The score for Subtask 2
is obtained by averaging the scores assigned by the two
annotators. The ground truth labels for Subtask 1 (binary)
were derived from the labels of Subtask 2. Specifically, we
considered the data points for which the two annotators
agreed, namely the case in which both annotators provided
a score in the set {1, 2} and the case in which both
annotators provided a score in the set {3, 4}. In the former
case, the example was labelled with 0, while in the latter,
it was labelled with 1.</p>
      <p>The dataset is available for download on the website of
the task (https://wic-ita.github.io/data/). The dataset has
been constructed using available corpora; we refer to
[14, 15] for the details about copyright and usage. Below,
we further describe the details of the two sub-tasks.</p>
      <sec id="sec-3-4">
        <title>3.1. Subtask 1: Binary Classification</title>
        <p>We provide two datasets for model development,
along with two test datasets:
• The training dataset, which consists of 2,805
training examples. This dataset should be employed to
train the model;
• The development dataset, which consists of 500
examples. This dataset should be employed to
evaluate the model in the training phase, e.g., to tune
hyper-parameters;
• The monolingual test dataset, which consists of
500 examples;
• The cross-lingual test dataset, which consists of
500 examples.</p>
        <p>The training dataset is highly unbalanced, consisting of
71.27% positive and 28.73% negative examples. At the
same time, we provide balanced development and test
datasets consisting of 50% positive and 50% negative
examples. For each In-Vocabulary target word of the
development and test datasets, at least one positive and
one negative example are provided in the training set.</p>
        <p>Overall statistics are reported in Table 4.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.2. Subtask 2: Ranking</title>
        <p>We provide four datasets for model development:
• The training dataset, which consists of 2,805
training examples for which the two annotators agree
(this dataset contains the same examples provided
for training in Subtask 1);
• A training dataset, which consists of 1,015 training
examples for which the two annotators disagree;
• An overall training dataset, which consists of 3,820
training examples. This dataset includes both
instances where annotators have reached a consensus
and those in disagreement;
• The development dataset, which consists of 500
examples (this dataset contains the same examples
provided for development in Subtask 1);
• The monolingual test dataset, which consists of
500 examples (this dataset contains the same
examples provided for test in Subtask 1);
• The cross-lingual test dataset, which consists of
500 examples (this dataset contains the same
examples provided for test in Subtask 1).</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>4. Evaluation</title>
      <p>The ranking of the participating systems is provided
according to each subtask and test set. In other words, for
each subtask, we provide the evaluation in both the
monolingual and the cross-lingual setting.</p>
      <p>The baseline model for the task has been constructed
according to [<xref ref-type="bibr" rid="ref5">5</xref>]. It exploits the BERT architecture [18]
for encoding the target sub-words. To deal with cases in
which the target word is split into multiple sub-tokens, the
first sub-token is considered. Differently from [<xref ref-type="bibr" rid="ref5">5</xref>], we use
XLM-RoBERTa [19] as the pre-trained model and train the
baseline to minimise the difference between the model
prediction and the gold score using the mean squared
error. We set the learning rate to 1e-5 and the weight
decay to 0. The best checkpoint over the ten epochs is
selected using the development data. The binary baseline
for Subtask 1 applies a threshold of 2 to the model
predictions to obtain discrete labels. To ensure fair
reproducibility and comparisons, the evaluation scripts are
available for download
(https://github.com/wic-ita/data/blob/main/evaluation.py).</p>
      <sec id="sec-3b-1">
        <title>4.1. Subtask 1 (Binary Classification)</title>
        <p>Systems' predictions are evaluated against the ground
truth using the macro F1-score, i.e. we compute the
F1-score for each class and we take the average of these
scores to obtain the macro F1-score.</p>
      </sec>
      <sec id="sec-3b-2">
        <title>4.2. Subtask 2 (Ranking)</title>
        <p>Systems' predictions are evaluated against the ground
truth using Spearman's rank correlation. It measures the
rank correlation of two variables X and Y:
ρ = 1 − (6 Σ dᵢ²) / (n(n² − 1))   (1)
where dᵢ = rg(Xᵢ) − rg(Yᵢ) is the difference between
the ranks of each observation and n is the number of
observations.</p>
      </sec>
    </sec>
    <sec id="sec-3c">
      <title>5. Participants</title>
      <p>Overall, four teams participated in the task with 9
distinct runs. We highlight below the main strategies
adopted by the teams to deal with the WiC-ITA tasks.</p>
      <p>The BERT 4EVER team (which did not submit a final
report) proposed three variants of a system based on
BERT. The strategy behind the first model involves using
the LaBSE pre-trained model to perform matching
judgment tasks. It applies four different strategies for
encoding and matching the spliced sentences, including the
addition of [CLS] vectors and siamese vectors. The output
probabilities of the four models are fused, with task 2
treated as a six-class classification task. The second model
for task 1 uses the bert-base-italian-cased pre-trained
model and follows the same encoding and matching
strategies as the first model. Again, the output probabilities
of the four models are fused. For task 2, the LaBSE
pre-trained model is used, and the strategies are identical
to those in Model 1, but the predicted classification results
are averaged. Finally, a third variant combines both the
bert-base-italian-cased and LaBSE pre-trained models. It
applies the same encoding and matching strategies as the
previous models, but this time, the output probabilities of
all eight models (four from each pre-trained model) are
fused.</p>
      <p>The ExtremITA team proposed two models fine-tuned
on the EVALITA 2023 training data. The first system
is based on the Large Language Model from Meta AI
(LLaMA), i.e., the Italian version called Camoscio [<xref ref-type="bibr" rid="ref6">20</xref>].
The model is pre-trained to generate text based on user
instructions and fine-tuned on task-specific triples of &lt;task,
input, output&gt; derived from the training data of the
EVALITA 2023 challenges. The LoRA technique for training
was applied, and the model is further fine-tuned on the
EVALITA 2023 training data. The second system is based
on the Italian version of T5 (IT5) [<xref ref-type="bibr" rid="ref7">21</xref>]. It underwent
fine-tuning on task-specific input-output pairs derived from
the training data of the EVALITA 2023 challenges. The
phrasal forms from the training data were used to train the
model. The details of the models developed by the team
are reported in [<xref ref-type="bibr" rid="ref8">22</xref>].</p>
      <p>The LG team proposed a single system based on the
automatic translation of target words into different
languages. Opus-MT models have been used for the
translation of data into 21 languages. The words are lemmatized
and aligned, and the feature vectors are created from the
equivalence of the target lemma in translation. Then SVMs
are used for solving the tasks. PoS-tagging and
lemmatization of the Italian sentences have been performed with
TreeTagger (https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/),
and lemmatization in the 21 languages has been roughly
performed with Simplemma (https://pypi.org/project/simplemma/).
The details of the models developed by the team are
reported in [<xref ref-type="bibr" rid="ref9">23</xref>].</p>
      <p>The models developed by The Time-Embedding
Travelers team (hereafter TTET) are all based on the
XLM-RoBERTa-base architecture. Each model is a
straightforward threshold-based classifier that utilises the
condition number of the cosine similarity or distance
matrix to make predictions. The embeddings of the target
word are extracted from both sentences, and pairwise
similarities or distances are calculated. The threshold for
classification is tuned by selecting the value that
maximizes accuracy on a combined train and dev set. The
final threshold for prediction is determined as the average
of the threshold values obtained from multiple iterations.
Model 1 and Model 2 use the last 4 layers of embeddings,
while Model 3 uses embeddings from all 12 layers. The
details of the models developed by the team are reported
in [<xref ref-type="bibr" rid="ref10">24</xref>].</p>
    </sec>
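    <p>The Spearman formula used to score Subtask 2 (Equation (1)) can be reproduced in a few lines. This is an illustrative, tie-free re-implementation, not the task's official evaluation script.</p>
    <preformat>
```python
def ranks(values):
    # Rank 1 = smallest value; assumes no ties, as in Eq. (1).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    # where d_i = rg(x_i) - rg(y_i).
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```
    </preformat>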
    <sec id="sec-4">
      <title>6. Results</title>
      <p>The WiC-ITA task was approached by four different
teams, and the evaluation of their systems revealed
interesting trends. While three of the systems were based on
the Transformer architecture, one team developed an SVM
classifier based on the output of a Machine Translation
system (itself using the Transformer model).</p>
      <p>With respect to the second subtask, where participants
were asked to provide a ranking, none of the proposed
systems outperformed the baseline for the Italian test set,
while the TTET team was ranked first for the
Italian-English test set. The results are reported in Table 5.</p>
      <p>Table 7 presents detailed results for each system,
including the classification of in-vocabulary (IV) and
out-of-vocabulary (OOV) words. The aim is to evaluate the
systems' capability to classify words that were not part
of the training data. In this regard, the LG system
exhibits the highest performance in Subtask 1 for both IV
and OOV words. However, in Subtask 2, only the TTET
system surpasses the baseline for OOV words.</p>
      <p>Interestingly, the performance on OOV targets shows
an overall improvement. We propose that the models may
have become overly specialized to the specific distribution
of IV word classes during training, resulting in overfitting.</p>
    </sec>
    <sec id="sec-5">
      <title>7. Conclusions</title>
      <p>In the binary classification task, the best-performing
systems demonstrated a significant improvement over the
baseline by 14 percentage points on the Italian test set and
17 percentage points on the English test set. However, in
the ranking task, the baseline system outperformed all the
proposed systems for the Italian test set, whereas the
proposed systems achieved a notable enhancement of 14
percentage points over the baseline for the Italian-English
test set.</p>
      <p>For the Italian test set, the best result was achieved
by the system based on SVM and Machine Translation;
this team submitted results only for the monolingual task.
In the English test set, the best result is obtained by the
system based on the XLM-RoBERTa-base architecture.
It is interesting to underline that the worst performances
were obtained by the system that adopts instruction-based
fine-tuning of a specific LLM for Italian. On the one hand,
these results highlight the effectiveness and potential of
the different systems in addressing the classification and
ranking tasks for the meaning of words in context. On the
other hand, the results of the competition highlight that
there is still room for improvement and that results are
still far from those obtained by similar campaigns in
English.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007), under the NRRP MUR program funded by NextGenerationEU.</p>
      <p>[6] Q. Liu, E. M. Ponti, D. McCarthy, I. Vulic, A. Korhonen, AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples, in: M. Moens, X. Huang, L. Specia, S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics, 2021, pp. 7151-7162. URL: https://doi.org/10.18653/v1/2021.emnlp-main.571. doi:10.18653/v1/2021.emnlp-main.571.</p>
      <p>[7] C. S. Armendariz, M. Purver, M. Ulcar, S. Pollak, N. Ljubesic, M. Granroth-Wilding, CoSimLex: A Resource for Evaluating Graded Word Similarity in Context, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, European Language Resources Association, 2020, pp. 5878-5886. URL: https://aclanthology.org/2020.lrec-1.720/.</p>
      <p>[8] D. Schlechtweg, S. S. im Walde, S. Eckmann, Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change, in: M. A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 169-174. URL: https://doi.org/10.18653/v1/n18-2027. doi:10.18653/v1/n18-2027.</p>
      <p>[9] S. W. Brown, Choosing sense distinctions for WSD: psycholinguistic evidence, in: ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, Short Papers, The Association for Computer Linguistics, 2008, pp. 249-252. URL: https://aclanthology.org/P08-2063/.</p>
      <p>[10] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[11] R. Navigli, S. P. Ponzetto, BabelNet: Building a Very Large Multilingual Semantic Network, in: J. Hajic, S. Carberry, S. Clark (Eds.), ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, The Association for Computer Linguistics, 2010, pp. 216-225. URL: https://aclanthology.org/P10-1023/.</p>
      <p>[12] L. Bentivogli, E. Pianta, Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus, Natural Language Engineering 11 (2005) 247-261. doi:10.1017/S1351324905003839.</p>
      <p>[13] G. A. Miller, C. Leacock, R. Tengi, R. Bunker, A Semantic Concordance, in: Human Language Technology: Proc. of a Workshop Held at Plainsboro, New Jersey, USA, March 21-24, 1993, Morgan Kaufmann, 1993. URL: https://aclanthology.org/H93-1061/.</p>
      <p>[14] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The wacky wide web: a collection of very large linguistically processed web-crawled corpora, Language resources and evaluation 43 (2009) 209-226.</p>
      <p>[15] M. Baroni, A. Kilgarriff, Large linguistically-processed web corpora for multiple languages, in: EACL'06: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters &amp; Demonstrations; 2006 Apr 5-6; Trento, Italy. Association for Computational Linguistics, 2006, pp. 87-90.</p>
      <p>[16] H. Schmid, Probabilistic part-of-speech tagging using decision trees, in: New methods in language processing, 2013, p. 154.</p>
      <p>[17] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, X. Liang, doccano: Text annotation tool for human, 2018. Software available from https://github.com/doccano/doccano.</p>
      <p>[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[19] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.),</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Bevilacqua, T. Pasini, A. Raganato, R. Navigli, Recent Trends in Word Sense Disambiguation: A Survey, in: Z. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, ijcai.org, 2021, pp. 4330-4338. URL: https://doi.org/10.24963/ijcai.2021/593. doi:10.24963/ijcai.2021/593.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. A. Miller, WORDNET: a lexical database for English, in: Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, USA, February 23-26, 1992, Morgan Kaufmann, 1992. URL: https://aclanthology.org/H92-1116/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>M. T.</given-names> <surname>Pilehvar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <article-title>WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations</article-title>,
          in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers)</source>,
          Association for Computational Linguistics,
          <year>2019</year>, pp.
          <fpage>1267</fpage>-<lpage>1273</lpage>.
          URL: https://doi.org/10.18653/v1/n19-1128. doi:10.18653/v1/n19-1128.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>F.</given-names> <surname>Martelli</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Kalach</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Tola</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>,
          <article-title>SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC)</article-title>,
          in: A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, X. Zhu (Eds.),
          <source>Proceedings of the 15th International Workshop on Semantic Evaluation, SemEval@ACL/IJCNLP 2021, Virtual Event / Bangkok, Thailand, August 5-6, 2021</source>,
          Association for Computational Linguistics,
          <year>2021</year>, pp.
          <fpage>24</fpage>-<lpage>36</lpage>.
          URL: https://doi.org/10.18653/v1/2021.semeval-1.3. doi:10.18653/v1/2021.semeval-1.3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>A.</given-names> <surname>Raganato</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Pasini</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <string-name><given-names>M. T.</given-names> <surname>Pilehvar</surname></string-name>,
          <article-title>XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization</article-title>,
          in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020</source>,
          Association for Computational Linguistics,
          <year>2020</year>, pp.
          <fpage>7193</fpage>-<lpage>7206</lpage>.
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>,
          Association for Computational Linguistics, 2020, pp. 8440-8451.
          URL: https://doi.org/10.18653/v1/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [20]
          <string-name><given-names>A.</given-names> <surname>Santilli</surname></string-name>,
          <article-title>Camoscio: An Italian instruction-tuned LLaMA</article-title>,
          https://github.com/teelinsan/camoscio,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [21]
          <string-name><given-names>G.</given-names> <surname>Sarti</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Nissim</surname></string-name>,
          <article-title>IT5: Large-scale text-to-text pretraining for Italian language understanding and generation</article-title>,
          <source>ArXiv preprint 2203.03759</source>
          (<year>2022</year>).
          URL: https://arxiv.org/abs/2203.03759.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [22]
          <string-name><given-names>C. D.</given-names> <surname>Hromei</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Croce</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Basili</surname></string-name>,
          <article-title>ExtremITA@EVALITA2023: Multi-task sustainable scaling to large language models at its extreme</article-title>,
          in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>,
          CEUR.org, Parma, Italy,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [23]
          <string-name><given-names>L.</given-names> <surname>Gregori</surname></string-name>,
          <article-title>LG at WiC-ITA: Exploring the relation between semantic shifts and equivalences in translation</article-title>,
          in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>,
          CEUR.org, Parma, Italy,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [24]
          <string-name><given-names>F.</given-names> <surname>Periti</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Dubossarsky</surname></string-name>,
          <article-title>The time-embedding travelers@WiC-ITA</article-title>,
          in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>,
          CEUR.org, Parma, Italy,
          <year>2023</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>