<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>WiC-ITA at EVALITA2023: Overview of the EVALITA2023 Word-in-Context for ITAlian Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Pierluigi Cassotti</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia C. Passaro</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Maristella Gatto</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Computer Science, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Dipartimento di Ricerca e Innovazione Umanistica, University of Bari</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>WiC-ita is a shared task proposed at the EVALITA 2023 campaign. The task focuses on the meaning of words in specific contexts and has been modelled as both a binary classification and a ranking problem. Overall, 4 groups took part in both subtasks, with 9 different runs. In this report, we describe how the task was set up, we report the system results, and we discuss them.</p>
      </abstract>
      <kwd-group>
        <kwd>Word in Context</kwd>
        <kwd>Lexical Semantics</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Dataset</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction and motivation</title>
      <p>Word Sense Disambiguation [1] is a Natural Language
Processing task with a long history and is extremely
interesting for the Computational Linguistics community. In
Word Sense Disambiguation (WSD), the goal is to
disambiguate each word occurrence, assigning to it the correct
sense from a predefined sense inventory, such as WordNet
[2]. The introduction of contextualized models such as
BERT, which allow the representation of a word in different
contexts, steered the research focus to new tasks, such as
the Word in Context (WiC) task [3].</p>
      <p>WSD and the WiC task are highly related: while the
former models in an explicit way the relationship between
the target word and its sense (taken from a predefined
sense inventory), the latter reduces it to a binary task. The
WiC task requires determining whether a word occurring in
two different sentences has the same meaning or not. In
recent years, there has been a growing interest in the WiC
task, demonstrated by the creation of several different
resources and shared tasks covering more than 20 languages.</p>
      <p>In general, the WiC task is of broad-scope interest,
as it is not limited to specific domains and can be
useful for several NLP tasks. Furthermore, the training and
the evaluation on a monolingual (Italian) or cross-lingual
(English-Italian) dataset is advantageous not only for
models for the Italian language. In fact, the transfer
learning ability of WiC models across different languages is
proven in previous works [4], where models improve their
performance by training in other languages.</p>
      <p>Several initiatives have been proposed throughout the
years: the first one [3] being the proposal of the WiC task,
which also came along with a dataset but was limited to
English. For this reason, it was followed by the XL-WiC [5]
dataset, which tried to tackle this issue by taking into
account a total of 15 languages. Next, MCL-WiC [4] was
the first WiC dataset to introduce the cross-lingual task.
The main motivation behind this particular choice was to
cover scenarios where systems have to deal with different
languages simultaneously, further highlighting the
importance of this task in real-world applications. With AM2iCo
[6], the main aim was to focus on low-resource languages
and to ensure that participating models must consider both
the target word and the context to achieve good
performance. Finally, in CoSimLex [7], the task is extended to
pairs of words that appear in a shared context, and the goal
is to determine to which degree they refer to the same
concept. This is done to capture word polysemy as well as
the context-dependency of words.</p>
      <p>Shared tasks regarding WiC usually preserve its
binary design, where the two possible outcomes for each
entry are: true if the target word has the same meaning in
the two sentences/contexts and false if it does not.
However, there can be some cases where it is not so simple
to determine the lack or presence of semantic similarity
in a discrete way. For this reason, we exploit the 4-point
relatedness scale introduced by [8, 9] in the annotation
process. The scale consists of 4 values, namely 4: Identical;
3: Closely Related; 2: Distantly Related; 1: Unrelated.</p>
      <p>Unfortunately, as often happens in the Natural
Language Processing research area, some languages are more
represented than others, and the WiC task is no
exception in this sense. With the WiC-ITA task at EVALITA
2023 [10], we aim to fill this gap in the literature, making
openly available a resource that can undoubtedly foster
novel research.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Task Description</title>
      <p>The general goal of the WiC-ITA task is to establish
whether a word w occurring in two different sentences s1
and s2 has the same meaning or not. The task is modelled
with two different subtasks, namely a binary classification
one (Subtask 1) and a ranking one (Subtask 2). Participants
were allowed to participate in one or both of the subtasks.
Details and examples of annotation are available on the
task website (http://wic-ita.github.io/).</p>
      <sec id="sec-2-0">
        <title>2.1. Subtask 1: Binary Classification</title>
        <p>Subtask 1 is structured as follows: given a word w
occurring in two different sentences s1 and s2, the goal is
to provide the sentence pair with a score determining
whether w maintains the same meaning or not. Possible
outcomes for this subtask are:
• 0: the word w does not have the same meaning in the
two sentences s1 and s2;
• 1: the word w has the same meaning in the two
sentences s1 and s2.</p>
      </sec>
      <sec id="sec-2-1">
        <title>2.2. Subtask 2: Ranking</title>
        <p>Subtask 2 is structured as follows: given a word w
occurring in two different sentences s1 and s2, the goal is
to provide the sentence pair with a score indicating to
which extent, on a 1-4 scale, w has the same meaning in
the two sentences. The score for this subtask is a
continuous value s in the interval [1, 4]. A higher score
corresponds to a higher degree of semantic similarity.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The creation of datasets for the WiC task usually relies
on sense inventories, such as WordNet or BabelNet
[11]. More specifically, sense inventories are often
exploited for selecting target words, which should exhibit
polysemy, and for generating sentence pairs from the
sense examples provided, i.e. sentences in which the
target word occurs with the respective sense. After the
selection of target words and the generation of sentence
pairs, only a small part of these are manually
annotated/validated by human experts.</p>
      <p>Differently from previous datasets, for the WiC-ITA
task we relied on sense inventories only for the target
word selection stage, while we extracted the list of
sentence pairs from large unlabelled corpora. Moreover,
human annotation is carried out for all the sentence pairs,
thus making WiC-ITA the largest manually annotated
resource for the WiC task.</p>
      <p>In addition to this, the WiC-ITA dataset includes both
monolingual (Italian) and cross-lingual (English-Italian)
data. The dataset is split into training, development, and
test portions. In particular:
• the training and development sets consist of
annotated pairs of monolingual (Italian) sentences;
• the test set consists of annotated pairs of
monolingual (Italian) sentences and annotated pairs of
cross-lingual (English-Italian) sentences.</p>
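      <p>The gold annotations (described later in this section) combine two independent 4-point judgements per sentence pair: the Subtask 2 score is their average, and the Subtask 1 binary label is assigned only when both judgements fall in {1, 2} (label 0) or in {3, 4} (label 1). The following sketch illustrates that rule; the function name is ours, not part of the official task tooling.</p>
      <preformat>
```python
def gold_labels(score_a, score_b):
    """Combine two annotators' 4-point relatedness judgements.

    Returns (subtask2_score, subtask1_label). The binary label is
    None when the annotators straddle the {1, 2} / {3, 4} boundary;
    such pairs carry no Subtask 1 gold label.
    """
    mean = (score_a + score_b) / 2  # Subtask 2 gold score
    low, high = {1, 2}, {3, 4}
    if score_a in low and score_b in low:
        label = 0  # different meaning
    elif score_a in high and score_b in high:
        label = 1  # same meaning
    else:
        label = None  # annotators disagree across the boundary
    return mean, label
```
      </preformat>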
      <p>We created the monolingual datasets by selecting target
words based on the number of synsets in WordNet and
senses reported in Wiktionary. To achieve this, we
generated a list of candidate target words for each part of
speech (PoS) using lemmas from both WordNet and
Wiktionary. For each lemma, we calculated the count of its
WordNet synsets and of its senses reported in Wiktionary,
took the minimum of the two counts, and ordered all the
target words by this value in descending order.</p>
      <p>To construct the cross-lingual dataset, we used
MultiSemCor [12], which is based on SemCor [13], the most
extensive and widely used dataset for Word Sense
Disambiguation. Specifically, we extracted word pairs
(Italian-English) that are frequently translated in SemCor. For
these word pairs, we computed the frequency of specific
synsets. Then, we took the union of synsets for each
word pair and computed the probability distribution over
the synsets for both the Italian and English words. The
Jensen-Shannon Divergence (JSD) is computed for each
pair, and the pairs are sorted accordingly in decreasing
order.</p>
      <p>We sampled the top-k words for the monolingual setting
and the top-k pairs of words for the cross-lingual setting
according to the minimum synset/sense count and the JSD,
respectively. The number of sampled words per PoS tag is
reported in Table 1.</p>
      <p>The monolingual and the cross-lingual sentence pairs
are extracted from the itWaC and ukWaC corpora, both
part of the WaCKy project [14, 15]. ukWaC is a corpus
obtained by crawling the web pages under the .uk
domain. It consists of more than 2 billion words, annotated
with PoS tags and lemmatized using the TreeTagger tool
[16]. itWaC, differently from ukWaC, is lemmatized using
Morph-it! and is obtained by crawling web pages under
the .it domain.</p>
      <p>Each sentence pair extracted from the aforementioned
resources has been attributed with the average score
assigned by two annotators according to the 4-point
relatedness scale, i.e. from 4 (Identical meaning) to 1
(Unrelated), the offsets of the target word in the respective
sentences, and the lemma of the target word. Note that we
only considered the Italian lemma for the cross-lingual
examples, albeit providing the offsets for both languages.</p>
      <p>The annotation process is carried out using Doccano [17].
Each data point (i.e., sentence pair) is annotated by two
independent annotators; the annotator groups for the two
settings are independent. Tables 2 and 3 show the statistics
in terms of number of annotated examples and agreement
(computed as the Spearman correlation) for each pair of
annotators. In the monolingual setting, the Spearman
correlation for the annotations consistently exceeds 0.6,
with the exception of two cases. On the other hand, in
the cross-lingual setting, the average correlation is lower
compared to the correlation obtained in the monolingual
setting. However, the correlation between annotators in
the cross-lingual scenario is also computed on smaller
samples, which can impact the reliability of the computed
correlation.</p>
      <p>The data points for which at least one of the annotators
voted 0 (Cannot decide) were discarded from the official
dataset for the sake of simplicity. The score for Subtask 2
is obtained by averaging the scores assigned by the two
annotators. The ground truth labels for Subtask 1 (binary)
were derived from the labels of Subtask 2. Specifically, we
considered the data points for which the two annotators
agreed, namely the case in which both annotators provided
a score in the set {1, 2} and the case in which both
annotators provided a score in the set {3, 4}. In the former
case, the example was labelled with 0, while in the latter,
it was labelled with 1.</p>
      <p>The dataset is available for download on the website of
the task (https://wic-ita.github.io/data/). The dataset has
been constructed using available corpora; we refer to
[14, 15] for the details about copyright and usage. Below,
we further describe the details of the two sub-tasks.</p>
      <sec id="sec-3-4">
        <title>3.1. Subtask 1: Binary Classification</title>
        <p>We provide two datasets for model development,
along with two test datasets:
• The training dataset, which consists of 2,805
training examples. This dataset should be employed to
train the model;
• The development dataset, which consists of 500
examples. This dataset should be employed to
evaluate the model in the training phase, e.g., to tune
hyper-parameters;
• The monolingual test dataset, which consists of
500 examples;
• The cross-lingual test dataset, which consists of
500 examples.</p>
        <p>The training dataset is highly unbalanced, consisting of
71.27% positive and 28.73% negative examples. At the
same time, we provide balanced development and test
datasets consisting of 50% positive and 50% negative
examples. For each In-Vocabulary target word of the
development and test datasets, at least one positive and
one negative example are provided in the training set.</p>
        <p>Overall statistics are reported in Table 4.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.2. Subtask 2: Ranking</title>
        <p>We provide four datasets for model development:
• The training dataset, which consists of 2,805
training examples for which the two annotators agree
(this dataset contains the same examples provided
for training in Subtask 1);
• A training dataset, which consists of 1,015 training
examples for which the two annotators disagree;
• An overall training dataset, which consists of 3,820
training examples. This dataset includes both
instances where annotators have reached a consensus
and those in disagreement;
• The development dataset, which consists of 500
examples (this dataset contains the same examples
provided for development in Subtask 1);
• The monolingual test dataset, which consists of
500 examples (this dataset contains the same
examples provided for test in Subtask 1);
• The cross-lingual test dataset, which consists of
500 examples (this dataset contains the same
examples provided for test in Subtask 1).</p>
      </sec>
    </sec>
    <sec id="sec-3b">
      <title>4. Evaluation</title>
      <p>The ranking of the participating systems is provided
according to each subtask and test set. In other words, for
each subtask, we provide the evaluation in both the
monolingual and the cross-lingual setting.</p>
      <p>The baseline model for the task has been constructed
according to [<xref ref-type="bibr" rid="ref5">5</xref>]. It exploits the BERT architecture [18]
for encoding the target sub-words. To deal with cases in
which the target word is split into multiple sub-tokens, the
first sub-token is considered. Differently from [<xref ref-type="bibr" rid="ref5">5</xref>], we use
XLM-RoBERTa [19] as the pre-trained model and train the
baseline to minimise the difference between the model
prediction and the gold score using the mean squared
error. We set the learning rate to 1e-5 and the weight
decay to 0. The best checkpoint over the ten epochs is
selected using the development data. The binary baseline
for Subtask 1 applies a threshold of 2 to the model
predictions to obtain discrete labels. To ensure fair
reproducibility and comparisons, the evaluation scripts are
available for download
(https://github.com/wic-ita/data/blob/main/evaluation.py).</p>
      <sec id="sec-3b-1">
        <title>4.1. Subtask 1 (Binary Classification)</title>
        <p>Systems' predictions are evaluated against the ground
truth using the macro F1-score, i.e. we compute the
F1-score for each class and we take the average of these
scores to obtain the macro F1-score.</p>
      </sec>
      <sec id="sec-3b-2">
        <title>4.2. Subtask 2 (Ranking)</title>
        <p>Systems' predictions are evaluated against the ground
truth using Spearman's rank correlation. It measures the
rank correlation of two variables X and Y:
ρ = 1 − (6 Σ dᵢ²) / (n(n² − 1))   (1)
where dᵢ = rg(Xᵢ) − rg(Yᵢ) is the difference between
the ranks of each observation and n is the number of
observations.</p>
      </sec>
    </sec>
    <sec id="sec-3c">
      <title>5. Participants</title>
      <p>Overall, four teams participated in the task with 9
distinct runs. We highlight below the main strategies
adopted by the teams to deal with the WiC-ITA tasks.</p>
      <p>The BERT 4EVER team (which did not submit a final
report) proposed three variants of a system based on
BERT. The strategy behind the first model involves using
the LaBSE pre-trained model to perform matching
judgment tasks. It applies four different strategies for
encoding and matching the spliced sentences, including the
addition of [CLS] vectors and siamese vectors. The output
probabilities of the four models are fused, with task 2
treated as a six-class classification task. The second model
for task 1 uses the bert-base-italian-cased pre-trained
model and follows the same encoding and matching
strategies as the first model. Again, the output probabilities
of the four models are fused. For task 2, the LaBSE
pre-trained model is used, and the strategies are identical
to those in Model 1, but the predicted classification results
are averaged. Finally, a third variant combines both the
bert-base-italian-cased and LaBSE pre-trained models. It
applies the same encoding and matching strategies as the
previous models, but this time, the output probabilities of
all eight models (four from each pre-trained model) are
fused.</p>
      <p>The ExtremITA team proposed two models fine-tuned
on the EVALITA 2023 training data. The first system
is based on the Large Language Model from Meta AI
(LLaMA), i.e., the Italian version called Camoscio [<xref ref-type="bibr" rid="ref6">20</xref>].
The model is pre-trained to generate text based on user
instructions and fine-tuned on task-specific triples of &lt;task,
input, output&gt; derived from the training data of the
EVALITA 2023 challenges. The LoRA technique for training
was applied, and the model is further fine-tuned on the
EVALITA 2023 training data. The second system is based
on the Italian version of T5 (IT5) [<xref ref-type="bibr" rid="ref7">21</xref>]. It underwent
fine-tuning on task-specific input-output pairs derived from
the training data of the EVALITA 2023 challenges. The
phrasal forms from the training data were used to train the
model. The details of the models developed by the team
are reported in [<xref ref-type="bibr" rid="ref8">22</xref>].</p>
      <p>The LG team proposed a single system based on the
automatic translation of target words into different
languages. Opus-MT models have been used for the
translation of data into 21 languages. The words are lemmatized
and aligned, and the feature vectors are created from the
equivalence of the target lemma in translation. Then SVMs
are used for solving the tasks. PoS-tagging and
lemmatization of the Italian sentences have been performed with
TreeTagger (https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/),
and lemmatization in the 21 languages has been roughly
performed with Simplemma (https://pypi.org/project/simplemma/).
The details of the models developed by the team are
reported in [<xref ref-type="bibr" rid="ref9">23</xref>].</p>
      <p>The models developed by The Time-Embedding
Travelers team (hereafter TTET) are all based on the
XLM-RoBERTa-base architecture. Each model is a
straightforward threshold-based classifier that utilises the
condition number of the cosine similarity or distance
matrix to make predictions. The embeddings of the target
word are extracted from both sentences, and pairwise
similarities or distances are calculated. The threshold for
classification is tuned by selecting the value that
maximizes accuracy on a combined train and dev set. The
final threshold for prediction is determined as the average
of the threshold values obtained from multiple iterations.
Model 1 and Model 2 use the last 4 layers of embeddings,
while Model 3 uses embeddings from all 12 layers. The
details of the models developed by the team are reported
in [<xref ref-type="bibr" rid="ref10">24</xref>].</p>
    </sec>
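    <p>The Spearman formula used to score Subtask 2 (Equation (1)) can be reproduced in a few lines. This is an illustrative, tie-free re-implementation, not the task's official evaluation script.</p>
    <preformat>
```python
def ranks(values):
    # Rank 1 = smallest value; assumes no ties, as in Eq. (1).
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    # rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)),
    # where d_i = rg(x_i) - rg(y_i).
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n * n - 1))
```
    </preformat>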
    <sec id="sec-4">
      <title>6. Results</title>
      <p>The WiC-ITA task was approached by four different
teams, and the evaluation of their systems revealed
interesting trends. While three of the systems were based on
the Transformer architecture, one team developed an SVM
classifier based on the output of a Machine Translation
system (itself using the Transformer model).</p>
      <p>With respect to the second subtask, where participants
were asked to provide a ranking, none of the proposed
systems outperformed the baseline for the Italian test set,
while the TTET team was ranked first for the
Italian-English test set. The results are reported in Table 5.</p>
      <p>Table 7 presents detailed results for each system,
including the classification of in-vocabulary (IV) and
out-of-vocabulary (OOV) words. The aim is to evaluate the
systems' capability to classify words that were not part
of the training data. In this regard, the LG system
exhibits the highest performance in Subtask 1 for both IV
and OOV words. However, in Subtask 2, only the TTET
system surpasses the baseline for OOV words.</p>
      <p>Interestingly, the performance on OOV targets shows
an overall improvement. We propose that the models may
have become overly specialized to the specific distribution
of IV word classes during training, resulting in overfitting.</p>
    </sec>
    <sec id="sec-5">
      <title>7. Conclusions</title>
      <p>In the binary classification task, the best-performing
systems demonstrated a significant improvement over the
baseline by 14 percentage points on the Italian test set and
17 percentage points on the English test set. However, in
the ranking task, the baseline system outperformed all the
proposed systems for the Italian test set, whereas the
proposed systems achieved a notable enhancement of 14
percentage points over the baseline for the Italian-English
test set.</p>
      <p>For the Italian test set, the best result was achieved
by the system based on SVM and Machine Translation;
this team submitted results only for the monolingual task.
In the English test set, the best result is obtained by the
system based on the XLM-RoBERTa-base architecture.
It is interesting to underline that the worst performances
were obtained by the system that adopts instruction-based
fine-tuning of a specific LLM for Italian. On the one hand,
these results highlight the effectiveness and potential of
the different systems in addressing the classification and
ranking tasks for the meaning of words in context. On the
other hand, the results of the competition highlight that
there is still room for improvement and that results are
still far from those obtained by similar campaigns in
English.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007), under the NRRP MUR program funded by NextGenerationEU.</p>
      <p>[6] Q. Liu, E. M. Ponti, D. McCarthy, I. Vulic, A. Korhonen, AM2iCo: Evaluating Word Meaning in Context across Low-Resource Languages with Adversarial Examples, in: M. Moens, X. Huang, L. Specia, S. W. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, EMNLP 2021, Virtual Event / Punta Cana, Dominican Republic, 7-11 November, 2021, Association for Computational Linguistics, 2021, pp. 7151-7162. URL: https://doi.org/10.18653/v1/2021.emnlp-main.571. doi:10.18653/v1/2021.emnlp-main.571.</p>
      <p>[7] C. S. Armendariz, M. Purver, M. Ulcar, S. Pollak, N. Ljubesic, M. Granroth-Wilding, CoSimLex: A Resource for Evaluating Graded Word Similarity in Context, in: N. Calzolari, F. Béchet, P. Blache, K. Choukri, C. Cieri, T. Declerck, S. Goggi, H. Isahara, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of The 12th Language Resources and Evaluation Conference, LREC 2020, Marseille, France, May 11-16, 2020, European Language Resources Association, 2020, pp. 5878-5886. URL: https://aclanthology.org/2020.lrec-1.720/.</p>
      <p>[8] D. Schlechtweg, S. S. im Walde, S. Eckmann, Diachronic Usage Relatedness (DURel): A Framework for the Annotation of Lexical Semantic Change, in: M. A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 2 (Short Papers), Association for Computational Linguistics, 2018, pp. 169-174. URL: https://doi.org/10.18653/v1/n18-2027. doi:10.18653/v1/n18-2027.</p>
      <p>[9] S. W. Brown, Choosing sense distinctions for WSD: psycholinguistic evidence, in: ACL 2008, Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics, June 15-20, 2008, Columbus, Ohio, USA, Short Papers, The Association for Computer Linguistics, 2008, pp. 249-252. URL: https://aclanthology.org/P08-2063/.</p>
      <p>[10] M. Lai, S. Menini, M. Polignano, V. Russo, R. Sprugnoli, G. Venturi, Evalita 2023: Overview of the 8th evaluation campaign of natural language processing and speech tools for italian, in: Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023), CEUR.org, Parma, Italy, 2023.</p>
      <p>[11] R. Navigli, S. P. Ponzetto, BabelNet: Building a Very Large Multilingual Semantic Network, in: J. Hajic, S. Carberry, S. Clark (Eds.), ACL 2010, Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics, July 11-16, 2010, Uppsala, Sweden, The Association for Computer Linguistics, 2010, pp. 216-225. URL: https://aclanthology.org/P10-1023/.</p>
      <p>[12] L. Bentivogli, E. Pianta, Exploiting parallel texts in the creation of multilingual semantically annotated resources: the MultiSemCor Corpus, Natural Language Engineering 11 (2005) 247-261. doi:10.1017/S1351324905003839.</p>
      <p>[13] G. A. Miller, C. Leacock, R. Tengi, R. Bunker, A Semantic Concordance, in: Human Language Technology: Proc. of a Workshop Held at Plainsboro, New Jersey, USA, March 21-24, 1993, Morgan Kaufmann, 1993. URL: https://aclanthology.org/H93-1061/.</p>
      <p>[14] M. Baroni, S. Bernardini, A. Ferraresi, E. Zanchetta, The wacky wide web: a collection of very large linguistically processed web-crawled corpora, Language resources and evaluation 43 (2009) 209-226.</p>
      <p>[15] M. Baroni, A. Kilgarriff, Large linguistically-processed web corpora for multiple languages, in: EACL'06: Proceedings of the Eleventh Conference of the European Chapter of the Association for Computational Linguistics: Posters &amp; Demonstrations; 2006 Apr 5-6; Trento, Italy. Association for Computational Linguistics, 2006, pp. 87-90.</p>
      <p>[16] H. Schmid, Probabilistic part-of-speech tagging using decision trees, in: New methods in language processing, 2013, p. 154.</p>
      <p>[17] H. Nakayama, T. Kubo, J. Kamura, Y. Taniguchi, X. Liang, doccano: Text annotation tool for human, 2018. Software available from https://github.com/doccano/doccano.</p>
      <p>[18] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4171-4186. URL: https://aclanthology.org/N19-1423. doi:10.18653/v1/N19-1423.</p>
      <p>[19] A. Conneau, K. Khandelwal, N. Goyal, V. Chaudhary, G. Wenzek, F. Guzmán, E. Grave, M. Ott, L. Zettlemoyer, V. Stoyanov, Unsupervised Cross-lingual Representation Learning at Scale, in: D. Jurafsky, J. Chai, N. Schluter, J. R. Tetreault (Eds.),</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] M. Bevilacqua, T. Pasini, A. Raganato, R. Navigli, Recent Trends in Word Sense Disambiguation: A Survey, in: Z. Zhou (Ed.), Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence, IJCAI 2021, Virtual Event / Montreal, Canada, 19-27 August 2021, ijcai.org, 2021, pp. 4330-4338. URL: https://doi.org/10.24963/ijcai.2021/593. doi:10.24963/ijcai.2021/593.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] G. A. Miller, WORDNET: a lexical database for English, in: Speech and Natural Language: Proceedings of a Workshop Held at Harriman, New York, USA, February 23-26, 1992, Morgan Kaufmann, 1992. URL: https://aclanthology.org/H92-1116/.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name><given-names>M. T.</given-names> <surname>Pilehvar</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <article-title>WiC: the Word-in-Context Dataset for Evaluating Context-Sensitive Meaning Representations</article-title>,
          in: J. Burstein, C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers)</source>,
          Association for Computational Linguistics,
          <year>2019</year>, pp.
          <fpage>1267</fpage>-<lpage>1273</lpage>.
          URL: https://doi.org/10.18653/v1/n19-1128. doi:10.18653/v1/n19-1128.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name><given-names>F.</given-names> <surname>Martelli</surname></string-name>,
          <string-name><given-names>N.</given-names> <surname>Kalach</surname></string-name>,
          <string-name><given-names>G.</given-names> <surname>Tola</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Navigli</surname></string-name>,
          <article-title>SemEval-2021 Task 2: Multilingual and Cross-lingual Word-in-Context Disambiguation (MCL-WiC)</article-title>,
          in: A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, X. Zhu (Eds.),
          <source>Proceedings of the 15th International Workshop on Semantic Evaluation, SemEval@ACL/IJCNLP 2021, Virtual Event / Bangkok, Thailand, August 5-6, 2021</source>,
          Association for Computational Linguistics,
          <year>2021</year>, pp.
          <fpage>24</fpage>-<lpage>36</lpage>.
          URL: https://doi.org/10.18653/v1/2021.semeval-1.3. doi:10.18653/v1/2021.semeval-1.3.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name><given-names>A.</given-names> <surname>Raganato</surname></string-name>,
          <string-name><given-names>T.</given-names> <surname>Pasini</surname></string-name>,
          <string-name><given-names>J.</given-names> <surname>Camacho-Collados</surname></string-name>,
          <string-name><given-names>M. T.</given-names> <surname>Pilehvar</surname></string-name>,
          <article-title>XL-WiC: A Multilingual Benchmark for Evaluating Semantic Contextualization</article-title>,
          in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.),
          <source>Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16-20, 2020</source>,
          Association for Computational Linguistics,
          <year>2020</year>, pp.
          <fpage>7193</fpage>-<lpage>7206</lpage>.
          <source>Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5-10, 2020</source>,
          Association for Computational Linguistics, 2020, pp. 8440-8451.
          URL: https://doi.org/10.18653/v1/2020.acl-main.747. doi:10.18653/v1/2020.acl-main.747.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [20]
          <string-name><given-names>A.</given-names> <surname>Santilli</surname></string-name>,
          <article-title>Camoscio: An Italian instruction-tuned LLaMA</article-title>,
          https://github.com/teelinsan/camoscio,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [21]
          <string-name><given-names>G.</given-names> <surname>Sarti</surname></string-name>,
          <string-name><given-names>M.</given-names> <surname>Nissim</surname></string-name>,
          <article-title>IT5: Large-scale text-to-text pretraining for Italian language understanding and generation</article-title>,
          <source>ArXiv preprint 2203.03759</source>
          (<year>2022</year>).
          URL: https://arxiv.org/abs/2203.03759.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [22]
          <string-name><given-names>C. D.</given-names> <surname>Hromei</surname></string-name>,
          <string-name><given-names>D.</given-names> <surname>Croce</surname></string-name>,
          <string-name><given-names>V.</given-names> <surname>Basile</surname></string-name>,
          <string-name><given-names>R.</given-names> <surname>Basili</surname></string-name>,
          <article-title>ExtremITA@EVALITA2023: Multi-task sustainable scaling to large language models at its extreme</article-title>,
          in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>,
          CEUR.org, Parma, Italy,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [23]
          <string-name><given-names>L.</given-names> <surname>Gregori</surname></string-name>,
          <article-title>LG at WiC-ITA: Exploring the relation between semantic shifts and equivalences in translation</article-title>,
          in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>,
          CEUR.org, Parma, Italy,
          <year>2023</year>.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [24]
          <string-name><given-names>F.</given-names> <surname>Periti</surname></string-name>,
          <string-name><given-names>H.</given-names> <surname>Dubossarsky</surname></string-name>,
          <article-title>The time-embedding travelers@WiC-ITA</article-title>,
          in:
          <source>Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2023)</source>,
          CEUR.org, Parma, Italy,
          <year>2023</year>.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>