=Paper=
{{Paper
|id=Vol-3878/116_calamita_preface_long
|storemode=property
|title=CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian
|pdfUrl=https://ceur-ws.org/Vol-3878/116_calamita_preface_long.pdf
|volume=Vol-3878
|authors=Giuseppe Attanasio,Pierpaolo Basile,Federico Borazio,Danilo Croce,Maria Francis,Jacopo Gili,Elio Musacchio,Malvina Nissim,Viviana Patti,Matteo Rinaldi,Daniel Scalena
|dblpUrl=https://dblp.org/rec/conf/clic-it/AttanasioBBCFGM24
}}
==CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian==
Giuseppe Attanasio1,∗,†, Pierpaolo Basile2,∗,†, Federico Borazio3,†, Danilo Croce3,∗,†, Maria Francis4,5,†, Jacopo Gili6,†, Elio Musacchio2,†, Malvina Nissim4,∗,†, Viviana Patti6,∗,†, Matteo Rinaldi6,† and Daniel Scalena7,4,†

1 Instituto de Telecomunicações, Lisbon, Portugal
2 University of Bari “Aldo Moro”, Bari, Italy
3 University of Rome “Tor Vergata”, Rome, Italy
4 CLCG, University of Groningen, Groningen, The Netherlands
5 University of Trento, Trento, Italy
6 Computer Science Department, University of Turin, Turin, Italy
7 University of Milan Bicocca, Milan, Italy

∗ Corresponding authors. † These authors contributed equally.
Abstract
The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track
progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they
predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a
long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced
evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in
ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA
emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian
by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This
paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.
Keywords
Italian Benchmark, Shared Task, Language Models
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
Emails: giuseppe.attanasio@lx.it.pt (G. Attanasio); pierpaolo.basile@uniba.it (P. Basile); borazio@ing.uniroma2.it (F. Borazio); croce@info.uniroma2.it (D. Croce); maria.francis287@gmail.com (M. Francis); jacopo.gili584@edu.unito.it (J. Gili); elio.musacchio@phd.unipi.it (E. Musacchio); m.nissim@rug.nl (M. Nissim); viviana.patti@unito.it (V. Patti); matteo.rinaldi@unito.it (M. Rinaldi); d.scalena@campus.unimib.it (D. Scalena)
Websites: https://gattanasio.cc/ (G. Attanasio); https://swap.di.uniba.it/members/basile.pierpaolo/ (P. Basile); https://github.com/crux82 (D. Croce); https://github.com/rosakun (M. Francis); https://github.com/Jj-source (J. Gili); https://github.com/m-elio (E. Musacchio); https://malvinanissim.github.io (M. Nissim); https://github.com/vivpatti (V. Patti); https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/DanielSc4 (D. Scalena)
ORCID: 0000-0001-6945-3698 (G. Attanasio); 0000-0002-0545-1105 (P. Basile); 0009-0000-0193-2131 (F. Borazio); 0000-0001-9111-1950 (D. Croce); 0009-0007-7638-9963 (M. Francis); 0009-0007-1343-3760 (J. Gili); 0009-0006-9670-9998 (E. Musacchio); 0000-0001-5289-0971 (M. Nissim); 0000-0001-5991-370X (V. Patti); 0009-0004-7488-8855 (M. Rinaldi); 0009-0006-0518-6504 (D. Scalena)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In parallel with the ongoing and constant development of new Large Language Models (LLMs), the need has increased for understanding their abilities, how they differ from one another, and how they improve compared to previous iterations. To meet this need, the last couple of years have witnessed multiple efforts to put together new benchmarks, or revisit existing ones, against which the performance and progress of LLMs can be monitored. These benchmarks include different tasks to test a variety of characteristics and abilities that are assumed to be associated with LLMs to different degrees. To mention a few, these range from multiple-choice questions of various sorts to commonsense and mathematical reasoning and a variety of linguistic phenomena. BIG-bench [1] is currently the largest and most comprehensive benchmark, including over 200 tasks, almost all in English, which have been collaboratively contributed by researchers across the globe.
However, benchmarking progress for languages other than English has not improved with comparable quality. In many cases, evaluation datasets are automatic translations of their English counterparts, yielding not only less native and possibly ungrammatical language but also
a cultural picture that is distant from the target language.
In the Italian NLP landscape, there is a long tradition of evaluation through the contribution of shared tasks. These benchmarks have been collected and run for almost 20 years in the context of the EVALITA campaigns (https://www.evalita.it/). The campaigns have fostered the creation of training and evaluation resources and models natively developed for Italian. Based on such resources, UINAUIL (Unified Interactive Natural Understanding of the Italian Language) [2], an integrated benchmark for Italian NLU including six tasks, has recently been proposed and tested with available Italian and multilingual language models.
Except for CHANGE-IT [3], a generation task focused on headline transformation and organized within the EVALITA 2020 edition, all EVALITA tasks have focused on classification problems (some have been recast as generation problems as part of a resource release within the “Risorse per la Lingua Italiana” (RiTA) community [4]). However, to improve upon existing benchmarks, we wanted the core of a dynamic reference benchmark for Italian to include new tasks specifically focused on testing LLMs’ abilities.
Therefore, in the footsteps of this solid Italian benchmarking tradition, and in line with the most recent developments regarding the evaluation of LLMs, AILC, the Italian Association for Computational Linguistics, has launched “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a large-scale collaborative initiative across the whole Italian NLP community to develop a dynamic and growing benchmark for evaluating LLMs’ capabilities in Italian. This strategy ensures a high diversity of tasks and, thus, of tested capabilities, and it distributes the effort of creating resources natively in Italian across many researchers and practitioners.
In the long term, we aim to establish a continuously growing suite of tasks that can be accessed through a shared platform and a live leaderboard, so that any newly developed LLM, either multilingual or Italian monolingual, can be readily assessed. In the short term, we have started to build the CALAMITA benchmark through a series of challenges collaboratively contributed by the research community (Section 2). We have also established an evaluation framework that enables running the current and possibly future challenges in a centralized and coherent manner. This short paper summarises the collaborative procedure, the challenges currently included in CALAMITA (see the CALAMITA website: https://clic2024.ilc.cnr.it/calamita/), and the evaluation procedure.

2. Collaborative Methodology

The CALAMITA approach is inspired by standard Natural Language Processing shared tasks, giving the benchmark a strong collaborative nature. The Italian Association for Computational Linguistics (AILC, https://www.ai-lc.it) launched a public call, mainly aimed at the Italian NLP community but spread across the standard international communication channels, asking for challenges, and corresponding datasets, on which LLMs could be tested.
Participants contributing a challenge were expected to provide an explanation and motivation for the challenge, as well as a dataset that reflects it. They were also asked to provide any information relevant to the dataset (provenance, annotation, distribution of labels or phenomena, etc.). Evaluation metrics and examples were also expected along with the task and dataset submission. Existing relevant datasets could also be submitted, as long as they made an interesting contribution to the benchmark and were natively created in Italian. To standardize contributions to the CALAMITA benchmark, all proposed tasks, with existing or new datasets, had to follow a predefined template created and distributed by the CALAMITA organizers.
Creating the CALAMITA benchmark and the first round of LLM evaluation required several steps. In the first phase, all prospective participants submitted a pre-proposal. In case of a positive evaluation, based on compliance with the requirements and balance across submissions, participants were then asked to submit the final and complete challenge in phase two, following the provided CALAMITA template. A final report was also requested for each accepted task, providing information on implementing the code for the evaluation.
The data and evaluation team set up the final CALAMITA benchmark by compiling the data and code of all the proposed tasks. We forked the Language Model Evaluation Harness tool (https://github.com/EleutherAI/lm-evaluation-harness) to create a custom CALAMITA version including all the accepted tasks. Once the benchmark was assembled, the CALAMITA organizers ran zero- or few-shot experiments with a selection of LLMs. No tuning materials or experiments are expected at this stage of the project. Also, while we expect that CALAMITA will, in the longer run, be further populated by additional tasks and will have its own publicly accessible leaderboard allowing for model testing, in this first stage the choice of LLMs to be evaluated and the evaluation procedure are centralized.
Table 1. Categories of abilities tested by CALAMITA tasks. Tasks test general abilities such as knowledge about true facts, commonsense, and logical reasoning (top) or specific NLP-oriented abilities such as code generation or machine translation (bottom). Each task may require models to exhibit more than one ability. Each row gives the ability tested, its description, and the count of tasks.
- Commonsense knowledge (19 tasks): general knowledge about the world that is typically taken for granted in everyday life, e.g., everyday cause-and-effect relationships, situational judgments, physical properties, and basic social interactions.
- Factual knowledge (12 tasks): knowledge of concrete, verifiable facts about the world, e.g., definitions, historical events, or scientific concepts.
- Linguistic knowledge (22 tasks): linguistically motivated tasks that test specific language skills, e.g., word sense disambiguation, coreference resolution, or acceptability judgment.
- Formal reasoning (9 tasks): ability to understand and use formal logical principles to solve problems, e.g., mathematical problems.
- Fairness and bias (6 tasks): evaluates a model’s capacity to handle sensitive tasks, including exclusive and stereotyped language understanding and detecting offensive or biased language towards social groups.
- Code generation (1 task): ability to generate fully functioning code for a specific programming language.
- Machine translation (2 tasks): ability to translate a sentence from a source language into another language, with one of the two being Italian.
- Summarization (2 tasks): ability to create relevant summaries of a given excerpt, e.g., news headline generation or news reduction.
3. Challenges

The preliminary call for tasks yielded the submission of over 20 proposals. Almost all of them were retained and are part of the present CALAMITA challenge, apart from the proposals that aimed at testing abilities that LLMs should not be expected to have, such as abilities typical of information retrieval engines, and the proposals that required manual evaluation. In what follows, we briefly describe each task included in CALAMITA and refer the reader to each of the challenges’ reports for further details. In Table 1, we describe the macro categories under which the CALAMITA tasks can be grouped, where categories are broad classes of tested abilities. Table 2 shows which abilities apply to each challenge.

ABRICOT (ABstRactness and Inclusiveness in COntexT) [5] is a task designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.

AMELIA (Argument Mining Evaluation on Legal documents in ItAlian) [6] is a challenge consisting of three classification tasks in the context of argument mining in the legal domain. The tasks are based on a dataset of 225 Italian decisions on Value Added Tax, annotated to identify and categorize argumentative text. The objective of the first task is to classify each argumentative component as a premise or conclusion. In contrast, the second and third tasks aim at classifying the type of premise, legal vs. factual, and its corresponding argumentation scheme. The classes are highly unbalanced, hence evaluation is based on the macro F1 score.
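As a concrete illustration of the macro-averaged scoring used for such unbalanced classification tasks, the snippet below computes a macro F1 score with scikit-learn. This is a minimal sketch with invented labels, not the official CALAMITA evaluation code.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for a binary task such as
# AMELIA's premise-vs-conclusion classification (labels are invented).
gold = ["premise", "premise", "conclusion", "premise", "conclusion"]
pred = ["premise", "conclusion", "conclusion", "premise", "premise"]

# Macro averaging computes F1 per class and takes the unweighted mean,
# so minority classes weigh as much as majority ones.
macro_f1 = f1_score(gold, pred, average="macro",
                    labels=["premise", "conclusion"])
print(f"Macro F1: {macro_f1:.3f}")
```

Macro averaging is the natural choice here precisely because the class distribution is skewed: a model that always predicts the majority class is penalized on the minority classes.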
BEEP (BEst DrivEr’s License Performer) [7] is a benchmark to evaluate large language models in the context of a simulated Italian driver’s license exam. This challenge tests the models’ ability to understand and apply traffic laws, road safety regulations, and vehicle-related knowledge through a series of true/false questions. The dataset is derived from official ministerial materials used in the Italian licensing process, explicitly targeting Category B licenses.

BLM-It (Blackbird Language Matrices) [8] is a task made of linguistic puzzles (matrices) around language-related problems, focusing on formal and semantic properties of language. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that implicitly encodes an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples following corrupted generating rules. The models are prompted in a few-shot setting, and the datasets include a few prompts for this purpose.

DIMMI (Drug InforMation Mining in Italian) [9] is a task aimed at evaluating the proficiency of Large Language Models in extracting drug-specific information from Patient Information Leaflets. The challenge evaluates the effectiveness of processing complex medical information in Italian and is approached as an information extraction task in a zero-shot setting, based on the model’s pre-existing knowledge or through in-context learning. Evaluation is performed against a manually created gold standard.

ECWCA (Educational CrossWord Clues Answering) [10] is designed to evaluate the knowledge and reasoning capabilities of LLMs through crossword clue-answering. The challenge consists of two tasks: a standard question-answering format where the LLM is asked to solve crossword clues, and a variation where the model is given hints about the word lengths of the answers, which is expected to help models with reasoning abilities.

EurekaRebus [11] is a task that tests the ability of LLMs to conduct multi-step, knowledge-intensive inferences while respecting predefined constraints. LLMs are prompted to reason step-by-step to solve verbalized variants of rebus games. Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem entirely text-based. Multiple metrics are used to grasp the models’ performance in knowledge recall, constraint adherence, and re-segmentation abilities across reasoning steps.

GATTINA (GenerAtion of TiTles for Italian News Articles) [12] is a task that aims to assess the ability of LLMs to generate headlines for science news articles. Aspects such as the appropriateness of the summary, creativity, and attractiveness are evaluated through a battery of metrics. The benchmark consists of a large dataset of science news articles and their corresponding published headlines from ANSA Scienza and Galileo, two prominent Italian media outlets.

GEESE (Generating and Evaluating Explanations for Semantic Entailment) [13] is focused on evaluating the impact of generated explanations on the predictive performance of language models for the task of Recognizing Textual Entailment in Italian. Using a dataset enriched with human-written explanations, two large language models are employed to generate and utilize explanations for semantic relationships between sentence pairs. GEESE assesses the quality of generated explanations by measuring changes in prediction accuracy when explanations are provided.

GFG (Gender-Fair Generation) [14] is a task designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios. It includes three tasks: (1) the detection of gender-marked expressions in Italian sentences, (2) the rewriting of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both fair formulation and translation tasks.

GITA (Graded Italian Annotated Dataset) [15] investigates the physical commonsense reasoning capabilities of large language models, assessing their low-level understanding of the physical world using a test set in the Italian language. Three specific tasks are evaluated: identifying plausible and implausible stories within the dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. The dataset is written and annotated by a professional linguist.

INVALSI [16] is a benchmark based on the Invalsi tests administered to students within the Italian school system. Expert pedagogists prepare these tests with the explicit goal of testing average students’ performance over time across Italy. There are two benchmarks: Invalsi MATE (420 questions), which targets the models’ performance on mathematical understanding, and Invalsi ITA (1279 questions), which evaluates language understanding in Italian.

ITA-SENSE (ITAlian word SENSE disambiguation) [17] is a task that assesses LLMs’ abilities in understanding lexical semantics through Word Sense Disambiguation. The classical Word Sense Disambiguation task is cast as a generative problem formalized as two tasks: [T1] given a target word and a sentence in which the word occurs, generate the correct meaning definition; [T2] given a target word and a sentence in which the word occurs, choose the correct meaning definition from a predefined set. For CALAMITA, LLMs are tested in a zero-shot setting.

MACID (Multimodal ACtion IDentification) [18] is a task aimed at evaluating the ability of LLMs to differentiate between closely related action concepts based on textual descriptions alone. The challenge is inspired by the “find the intruder” task, where models must identify an outlier among a set of 4 sentences that describe similar yet distinct actions. The dataset highlights action-predicate mismatches, where the same verb may describe different actions, or different verbs may refer to the same action. Although mono-modal (text-only), the task is designed for future multimodal integration, linking visual and textual representations to enhance action recognition.
MT (Machine Translation) [19] is a task that aims at testing the ability of LLMs in automatic translation between Italian and English (in both directions). The task proposes a benchmark composed of two datasets covering different domains and with varying distribution policies. Performance is reported in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations.

Mult-IT [20] is a large-scale Multiple-Choice Question Answering (MCQA) dataset for evaluating the factual knowledge and reasoning abilities of LLMs in Italian. This contribution aims to counteract the disadvantages of using MCQA benchmarks that are automatically translated from English and may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language. In addition, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. Mult-IT comprises over 110,000 manually written questions sourced directly from preparation quizzes for Italian university entrance exams or for public sector employment exams in Italy.

PejorativITy [21] is a task to investigate misogyny expressed through neutral words that can assume a negative connotation when functioning as pejorative epithets. This challenge addresses a) the disambiguation of such ambiguous words in a given context, and b) the detection of misogyny in instances that contain such polysemic words. The task is divided into two parts, both framed as binary classification. In Task A, the model is asked to determine whether, given a tweet, the target word is used in a pejorative or non-pejorative way. In Task B, the model is asked whether the whole sentence is misogynous.

PERSEID (PERSpEctivist Irony Detection) [22] considers the task of irony detection from short social media conversations collected from Twitter (X) and Reddit. Data is leveraged from MultiPICO, a recent multilingual dataset with disaggregated annotations and annotators’ metadata. The task evaluates whether prompting LLMs with additional annotators’ demographic information (gender only, age only, or the combination of the two) improves performance compared to a baseline in which only the input text is provided.

TRACE-it (Testing Relative clAuses Comprehension through Entailment in ITalian) [23] is a benchmark designed to evaluate the ability of LLMs to comprehend a specific type of complex syntactic construction in Italian: object relative clauses. The challenge is framed as a binary entailment task where, given a complex sentence, the model is tasked with determining whether it logically entails a simpler yes/no implication.

Termite [24] focuses on the Text-to-SQL task in Italian. Natural language queries are written natively in Italian, and the models are expected to turn them into SQL queries. The dataset is built to be invisible to search engines, since it is locked under an encryption key delivered along with the resource, to reduce accidental inclusion in upcoming training sets. It contains hand-crafted databases in different domains, each with a balanced set of NL-SQL query pairs. The NL questions are built in such a way that they can be solved by a model relying only on its linguistic proficiency and an analysis of the schema, with no external knowledge needed.

VeryfIT [25] is designed to evaluate the in-memory factual knowledge of language models on data written by professional fact-checkers, posed as true-or-false questions. Topics of the statements vary, but most are in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and verifying the statements’ accuracy.

ItaEval [26] is a multifaceted evaluation suite comprising three overarching task categories: (i) natural language understanding, (ii) commonsense and factual knowledge, and (iii) bias, fairness, and safety [4]. ItaEval is a collection of 18 tasks encompassing existing and new datasets. The so-compiled ItaEval suite provides a standardized, multifaceted framework for evaluating Italian language models, facilitating more rigorous and comparative assessments of model performance.

Table 2. Abilities tested by each task in CALAMITA. ∗: tasks that require contextualized factual knowledge, e.g., reading comprehension tasks. ∗∗: tasks that require stereotypical commonsense knowledge, e.g., understanding the concept of misogyny. [The per-task ability matrix did not survive text extraction; the table covers ABRICOT, AMELIA, BEEP, BLM-It, DIMMI∗, ECWCA, EurekaRebus, GATTINA, GEESE, GFG, GITA, INVALSI, ITA-SENSE, MACID, MT, Mult-IT, PejorativITy, PERSEID, Termite, TRACE-it, VeryfIT, and the ItaEval tasks ItaCoLA, Belebele-it∗, News-Sum, IronITA, SENTIPOLC, SQuAD-it∗, TruthfulQA-it, ARC-it, XCOPA-it, HellaSwag-it, AMI, HONEST∗∗, GeNTE rephrasing, Multilingual HateCheck∗∗, and HaSpeeDe2.]

4. Evaluation Strategy

Rooted in its very nature, CALAMITA’s biggest challenge is standardizing evaluation across many tasks and scenarios. To account for such high variability, we settled on a few fundamental choices that shape CALAMITA’s core principles (Design choices) and left broad freedom to challenge participants to specify fine-grained aspects of their tasks (Participant choices). Base design choices shared across all tasks and high task-specific customization balance standardization and versatility.
Design choices. Following recent practices for language model evaluation [e.g., 27, 28], we consider every received task as a downstream task to be solved via standard prompting. We support two types of tasks: Multiple-Choice (MC) and Open-Ended (OE) generation. MC tasks require a model to pick one or more correct answers from a finite set; closed-ended question answering is an example of an MC task. OE tasks require models to generate output tokens until a stopping criterion is met; machine translation and summarization are examples of OE tasks. For evaluating multiple-choice tasks, we rank all candidate answers by their likelihood conditioned on the prompt and pick the highest [29], normalizing each option’s probability by its number of tokens (a minimal sketch of this scoring is given below). We do not adopt a single strategy for evaluating OE tasks, as evaluation depends on the semantics of the output. We do, however, standardize the decoding strategy across OE tasks: we use beam search (n = 5) for machine translation and greedy decoding for all other tasks. See Appendix A for the complete details.
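The following snippet illustrates the length-normalized likelihood ranking described above. It is an illustrative sketch built on the HuggingFace transformers API, not the actual CALAMITA evaluation code (which lives in the lm-eval fork); the model name, prompt, and options are placeholders, and it assumes the option text tokenizes cleanly after the prompt boundary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration only.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def option_score(prompt: str, option: str) -> float:
    """Average log-likelihood per token of `option`, conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each token given all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the option tokens, then normalize by their count.
    option_ll = token_ll[:, prompt_ids.shape[1] - 1 :]
    return (option_ll.sum() / option_ll.shape[1]).item()

prompt = "Domanda: Qual è la capitale d'Italia? Risposta: "
options = ["Roma", "Milano", "Napoli"]
print(max(options, key=lambda o: option_score(prompt, o)))
```

The per-token normalization is what keeps long options from being penalized merely for containing more tokens.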
To foster reproducibility, we base CALAMITA’s codebase on open-source tools. We forked and built our evaluation code upon lm-eval [30]. When possible, we recommended that participants release their data publicly and accessibly through the HuggingFace Hub (resulting from the effort for CALAMITA, 35 new datasets have been released with a permissive license). We release our evaluation code at https://github.com/CALAMITA-AILC/lm-evaluation-harness.
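As an indication of how such a fork is typically driven, the snippet below runs a task through lm-eval’s Python entry point. It is a hypothetical usage sketch, assuming the fork is installed and a task name is registered in it; the task name is a placeholder and the snippet is not taken from the CALAMITA repository.

```python
# Hypothetical driver script; assumes `pip install -e .` inside the
# CALAMITA fork of lm-evaluation-harness and a registered task name.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # HuggingFace backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["example_calamita_task"],  # placeholder task name
    num_fewshot=0,  # zero-shot, as in the first CALAMITA round
)
print(results["results"])
```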
Participant choices. In addition to the data associated with the task and its type (MC or OE), we requested that each participating team provide the specifics needed to compile an arbitrary prompt and to evaluate an arbitrary model generation. Among the prompting details, task proposers specified a prompt template and the number of task demonstrations (0 for zero-shot, N for N-shot prompting). In few-shot cases, we requested where to sample the demonstrations from and the sampling strategy (static, dynamic-random, or dynamic-sequential). Among the evaluation details, we requested that participants specify any post-processing function for raw model outputs, one or more evaluation metrics, and related information. For reporting purposes, we collected a single evaluation score (the first metric listed by proposers).
Crucially, we relied upon meta-descriptions and code to streamline the communication between the task proposers and the challenge organizers. Participants were tasked to provide such information through a single file following a set of guidelines (see the guidelines at https://github.com/CALAMITA-AILC/calamita2024 and the information file at https://gist.github.com/g8a9/f5e82d38ce12831323b20dc79b0452c9). A sketch of the kind of information such a file conveys is given below.
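The following is a hypothetical illustration of the information a task description must convey, rendered as a Python dictionary. The field names and values are invented for illustration and do not reproduce the actual CALAMITA file format or guidelines.

```python
# Invented field names: an illustration of what a task description
# conveys, not the actual CALAMITA information file.
task_spec = {
    "name": "example_calamita_task",      # placeholder task name
    "task_type": "MC",                    # MC or OE
    "prompt_template": "Domanda: {question}\nRisposta:",
    "num_fewshot": 0,                     # 0 = zero-shot, N = N-shot
    "fewshot_sampling": "static",         # static, dynamic-random,
                                          # or dynamic-sequential
    "postprocessing": lambda raw: raw.strip().lower(),
    "metrics": ["macro_f1"],              # first metric is the reported score
}
```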
Model Selection. We tested Llama 3.1 8B Instruct [31] and ANITA [32], two state-of-the-art decoder-only language models. Llama’s 3.1 variant introduces multilingual support over the family’s previous iteration; ANITA is a fine-tuned version of Llama 3 specializing in English and Italian tasks.
Our choice was driven by three primary reasons. First, both models are open-weight, well-known within the Italian NLP community, and explicitly support the Italian language. Second, they have been instruction fine-tuned, a training step that facilitates addressing tasks in zero-shot. Third, they are within the 8 billion parameter range, which allows for fast iteration and good performance.

Results. At the time of writing, some of the results are still being collected. To provide a comprehensive and dynamic overview, we refer the reader to the external page where they are regularly updated: https://calamita-ailc.github.io/calamita2024/.

5. Limitations

CALAMITA is not intended to be an exhaustive benchmark for testing the abilities of Italian LLMs, especially at this first release. Considering the strong collaborative nature of this benchmark, coherence across tasks might not be optimal, in spite of the efforts put in by the organisers to make all datasets and the evaluation procedure uniform. Although we have paid attention to this issue, we cannot be absolutely certain that none of the datasets has, in one form or another, already ended up in some training set.

Acknowledgments

The ItaEval tasks submitted to CALAMITA are the result of a joint effort of members of the “Risorse per la Lingua Italiana” community (rita-nlp.org): we thank every member who dedicated their time to the project. For providing the computational resources, we thank CINECA (ISCRA grant: HP10C3RW9F; ISCRA C grant: CALAMITA - HP10CKZDYT), the Center for Information Technology of the University of Groningen for their support and for providing access to the Hábrók high performance computing cluster, and the University of Turin for providing access to the HPC4AI cluster [33]. Malvina Nissim’s work is also part of the “Humane AI” theme of the Dutch Sectorplan for the Humanities. The work of Viviana Patti was partially supported by the “HARMONIA” project - M4-C2, I1.3 Partenariati Estesi - Cascade Call - FAIR - CUP C63C22000770006 - PE PE0000013 under the NextGenerationEU programme. The work by Giuseppe Attanasio was supported by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI) and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020. The work by Pierpaolo Basile and Elio Musacchio was supported by the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU. The work of Matteo Rinaldi and Jacopo Gili has been partly supported by the Spoke “Future HPC & Big Data” of the ICSC - Centro Nazionale di Ricerca in “High Performance Computing, Big Data and Quantum Computing”, funded by European Union - NextGenerationEU.
References

[1] A. Srivastava, D. Kleyjo, Z. Wu, Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research (2023).
[2] V. Basile, L. Bioglio, A. Bosca, C. Bosco, V. Patti, UINAUIL: A unified benchmark for Italian natural language understanding, in: D. Bollegala, R. Huang, A. Ritter (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 348-356. URL: https://aclanthology.org/2023.acl-demo.33. doi:10.18653/v1/2023.acl-demo.33.
[3] L. De Mattei, M. Cafagna, A. AI, F. Dell’Orletta, M. Nissim, A. Gatt, CHANGE-IT @ EVALITA 2020: Change headlines, adapt news, generate, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020) 235.
[4] G. Attanasio, P. Delobelle, M. La Quatra, A. Santilli, B. Savoldi, ItaEval and TweetyIta: A new extensive benchmark and efficiency-first language model for Italian, in: CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Pisa, Italy, December 4 - December 6, 2024.
[5] G. Puccetti, C. Collacciani, A. A. Ravelli, A. Esuli, M. Bolognesi, ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[6] G. Grundler, A. Galassi, P. Santin, A. Fidelangeli, F. Galli, E. Palmieri, F. Lagioia, G. Sartor, P. Torroni, AMELIA - Argument Mining Evaluation on Legal documents in ItAlian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[7] F. Mercorio, D. Potertì, A. Serino, A. Seveso, BEEP - BEst DrivEr’s License Performer: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[8] C. Jiang, G. Samo, V. Nastase, P. Merlo, BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[9] R. Manna, M. P. Di Buono, L. Giordano, DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[10] A. Zugarini, K. Zeinalipour, A. Fusco, A. Zanollo, ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[11] G. Sarti, T. Caselli, A. Bisazza, M. Nissim, EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[12] M. Francis, M. Rinaldi, J. Gili, L. De Cosmo, S. Iannaccone, M. Nissim, V. Patti, GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[13] A. Zaninello, B. Magnini, GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[14] S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri, L. Bentivogli, GFG - Gender-Fair Generation: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[15] G. Pensa, E. Azurmendi, J. Etxaniz, B. Altuna, I. Gonzalez-Dios, GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[16] G. Puccetti, M. Cassese, A. Esuli, INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[17] P. Basile, E. Musacchio, L. Siciliani, ITA-SENSE - Evaluate LLMs’ ability for ITAlian word SENSE disambiguation: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[18] A. A. Ravelli, R. Varvara, L. Gregori, MACID - Multimodal ACtion IDentification: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[19] M. Cettolo, A. Piergentili, S. Papi, M. Gaido, M. Negri, L. Bentivogli, MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[20] M. Rinaldi, J. Gili, M. Francis, M. Goffetti, V. Patti, M. Nissim, Mult-IT - Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[21] A. Muti, PejorativITy - In-Context Pejorative Language Disambiguation: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[22] V. Basile, S. Casola, S. Frenda, S. M. Lo, PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[23] D. Brunato, TRACE-it - Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[24] F. Ranaldi, E. S. Ruzzetti, D. Onorati, F. M. Zanzotto, L. Ranaldi, Termite - Italian Text-to-SQL: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[25] J. Gili, V. Patti, L. Passaro, T. Caselli, VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[26] G. Attanasio, M. La Quatra, A. Santilli, B. Savoldi, ItaEval: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[27] S. Mehta, M. H. Sekhavat, Q. Cao, M. Horton, Y. Jin, C. Sun, S. I. Mirzadeh, M. Najibi, D. Belenko, P. Zatloukal, et al., OpenELM: An efficient language model family with open training and inference framework, in: Workshop on Efficient Systems for Foundation Models II @ ICML 2024, 2024.
[28] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, et al., OLMo: Accelerating the science of language models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15789-15809. URL: https://aclanthology.org/2024.acl-long.841. doi:10.18653/v1/2024.acl-long.841.
[29] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NeurIPS ’20, Curran Associates Inc., Red Hook, NY, USA, 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[30] S. Biderman, H. Schoelkopf, L. Sutawika, L. Gao, J. Tow, B. Abbasi, A. F. Aji, P. S. Ammanamanchi, S. Black, J. Clive, et al., Lessons from the trenches on reproducible evaluation of language models, arXiv preprint arXiv:2405.14782 (2024).
[31] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783 (2024).
[32] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, arXiv preprint arXiv:2405.07101 (2024).
[33] M. Aldinucci, S. Rabellino, M. Pironti, F. Spiga, P. Viviani, M. Drocco, M. Guerzoni, G. Boella, M. Mellia, P. Margara, I. Drago, R. Marturano, G. Marchetto, E. Piccolo, S. Bagnasco, S. Lusso, S. Vallero, G. Attardi, A. Barchiesi, A. Colla, F. Galeazzi, HPC4AI, an AI-on-demand federated platform endeavour, in: ACM Computing Frontiers, Ischia, Italy, 2018. URL: https://iris.unito.it/retrieve/handle/2318/1765596/689772/2018_hpc4ai_ACM_CF.pdf. doi:10.1145/3203217.3205340.

A. Experimental Details

A.1. Technical Details

We run our experiments on the LEONARDO HPC infrastructure (Booster partition, https://www.hpc.cineca.it/systems/hardware/leonardo/). The Booster module partition is based on BullSequana XH2135 supercomputer nodes, each with four NVIDIA Tensor Core GPUs (custom Ampere A100 GPU 64GB HBM2e, NVLink 3.0 (200GB/s)) and a single Intel CPU.
We forked the lm-eval-harness official repository at the commit with hash b2bf7bc4a601c643343757c92c1a51eb69caf1d7. We report all technical details on our official webpage: https://calamita-ailc.github.io/calamita2024/.

A.2. Generation Configuration

Table 3 reports the generation parameters we used for Open-Ended tasks.

Table 3. Generation Parameters. ∗: we set beam search to 5 for machine translation tasks.
- Batch size: 1∗
- Temperature: 0.0
- Sampling: False
- Stopping criteria: \n\n, <|im_end|>, “. ”, <|eot_id|>, <|end_of_text|>
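To make these settings concrete, the sketch below maps Table 3 onto HuggingFace transformers generation calls. It is an illustrative reading of the table, not the CALAMITA evaluation code; the model and input are placeholders, and sequence-level stop strings assume a recent transformers version that supports the `stop_strings` argument.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; CALAMITA runs its lm-eval fork on Llama 3.1 8B
# Instruct and ANITA rather than this standalone script.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Traduci in inglese: Il gatto dorme.", return_tensors="pt")

# Greedy decoding (Table 3): no sampling, so temperature is unused.
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=64)

# Machine translation tasks instead use beam search with 5 beams.
beam_out = model.generate(
    **inputs, do_sample=False, num_beams=5, max_new_tokens=64
)

# String-level stopping criteria such as "\n\n" or ". " can be passed
# via `stop_strings`, which requires also passing the tokenizer.
stopped_out = model.generate(
    **inputs, do_sample=False, max_new_tokens=64,
    stop_strings=["\n\n", ". "], tokenizer=tok,
)
print(tok.decode(stopped_out[0], skip_special_tokens=True))
```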