=Paper=
{{Paper
|id=Vol-3878/116_calamita_preface_long
|storemode=property
|title=CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian
|pdfUrl=https://ceur-ws.org/Vol-3878/116_calamita_preface_long.pdf
|volume=Vol-3878
|authors=Giuseppe Attanasio,Pierpaolo Basile,Federico Borazio,Danilo Croce,Maria Francis,Jacopo Gili,Elio Musacchio,Malvina Nissim,Viviana Patti,Matteo Rinaldi,Daniel Scalena
|dblpUrl=https://dblp.org/rec/conf/clic-it/AttanasioBBCFGM24
}}
==CALAMITA: Challenge the Abilities of LAnguage Models in ITAlian==
Giuseppe Attanasio1,∗,†, Pierpaolo Basile2,∗,†, Federico Borazio3,†, Danilo Croce3,∗,†, Maria Francis4,5,†, Jacopo Gili6,†, Elio Musacchio2,†, Malvina Nissim4,∗,†, Viviana Patti6,∗,†, Matteo Rinaldi6,† and Daniel Scalena7,4,†

1 Instituto de Telecomunicações, Lisbon, Portugal
2 University of Bari “Aldo Moro”, Bari, Italy
3 University of Rome “Tor Vergata”, Rome, Italy
4 CLCG, University of Groningen, Groningen, The Netherlands
5 University of Trento, Trento, Italy
6 Computer Science Department, University of Turin, Turin, Italy
7 University of Milan Bicocca, Milan, Italy

∗ Corresponding authors. † These authors contributed equally.
Abstract
The rapid development of Large Language Models (LLMs) has called for robust benchmarks to assess their abilities, track
progress, and compare iterations. While existing benchmarks provide extensive evaluations across diverse tasks, they
predominantly focus on English, leaving other languages underserved. For Italian, the EVALITA campaigns have provided a
long-standing tradition of classification-focused shared tasks. However, their scope does not fully align with the nuanced
evaluation required for modern LLMs. To address this gap, we introduce “Challenge the Abilities of LAnguage Models in
ITAlian” (CALAMITA), a collaborative effort to create a dynamic and growing benchmark tailored to Italian. CALAMITA
emphasizes diversity in task design to test a wide range of LLM capabilities through resources natively developed in Italian
by the community. This initiative includes a shared platform, live leaderboard, and centralized evaluation framework. This
paper outlines the collaborative process, initial challenges, and evaluation framework of CALAMITA.
Keywords
Italian Benchmark, Shared Task, Language Models
CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
Emails: giuseppe.attanasio@lx.it.pt (G. Attanasio); pierpaolo.basile@uniba.it (P. Basile); borazio@ing.uniroma2.it (F. Borazio); croce@info.uniroma2.it (D. Croce); maria.francis287@gmail.com (M. Francis); jacopo.gili584@edu.unito.it (J. Gili); elio.musacchio@phd.unipi.it (E. Musacchio); m.nissim@rug.nl (M. Nissim); viviana.patti@unito.it (V. Patti); matteo.rinaldi@unito.it (M. Rinaldi); d.scalena@campus.unimib.it (D. Scalena)
Websites: https://gattanasio.cc/ (G. Attanasio); https://swap.di.uniba.it/members/basile.pierpaolo/ (P. Basile); https://github.com/crux82 (D. Croce); https://github.com/rosakun (M. Francis); https://github.com/Jj-source (J. Gili); https://github.com/m-elio (E. Musacchio); https://malvinanissim.github.io (M. Nissim); https://github.com/vivpatti (V. Patti); https://github.com/mrinaldi97 (M. Rinaldi); https://github.com/DanielSc4 (D. Scalena)
ORCID: 0000-0001-6945-3698 (G. Attanasio); 0000-0002-0545-1105 (P. Basile); 0009-0000-0193-2131 (F. Borazio); 0000-0001-9111-1950 (D. Croce); 0009-0007-7638-9963 (M. Francis); 0009-0007-1343-3760 (J. Gili); 0009-0006-9670-9998 (E. Musacchio); 0000-0001-5289-0971 (M. Nissim); 0000-0001-5991-370X (V. Patti); 0009-0004-7488-8855 (M. Rinaldi); 0009-0006-0518-6504 (D. Scalena)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1. Introduction

In parallel with the ongoing and constant development of new Large Language Models (LLMs), the need has increased for understanding their abilities, how they differ from one another, and how they improve compared to previous iterations. To meet this need, the last couple of years have witnessed multiple efforts to put together new benchmarks, or revisit existing ones, against which the performance and progress of LLMs can be monitored. These benchmarks include different tasks to test a variety of characteristics and abilities that are assumed to be associated with LLMs to different degrees. To mention a few, these range from multiple-choice questions of various sorts to commonsense and mathematical reasoning and a variety of linguistic phenomena. BIG-bench [1] is currently the largest and most comprehensive benchmark, including over 200 tasks, almost all in English, which have been collaboratively contributed by researchers across the globe.
However, benchmarking progress for languages other than English has not improved with comparable quality. In many cases, evaluation datasets are automatic translations of their English counterparts, yielding not only less native and possibly ungrammatical language but also
a cultural picture that is distant from the target language.
In the Italian NLP landscape, there is a long tradition of evaluation through the contribution of shared tasks. These benchmarks have been collected and run for almost 20 years in the context of the EVALITA campaigns (https://www.evalita.it/). The campaigns have fostered the creation of training and evaluation resources and models natively developed for Italian. Based on such resources, UINAUIL (Unified Interactive Natural Understanding of the Italian Language) [2], an integrated benchmark for Italian NLU including six tasks, has recently been proposed and tested with available Italian and multilingual language models.
Except for CHANGE-IT [3], a generation task focused on headline transformation and organized within the EVALITA 2020 edition, all EVALITA tasks have focused on classification problems (some have been recast as generation problems as part of a resource release within the “Risorse per la Lingua Italiana” (RiTA) community [4]). However, to improve upon existing benchmarks, we wanted the core of a dynamic reference benchmark for Italian to include new tasks specifically focused on testing LLMs’ abilities.
Therefore, in the footsteps of this solid Italian benchmarking tradition, and in line with the most recent developments regarding the evaluation of LLMs, AILC, the Italian Association for Computational Linguistics, has launched “Challenge the Abilities of LAnguage Models in ITAlian” (CALAMITA), a large-scale collaborative initiative across the whole Italian NLP community to develop a dynamic and growing benchmark for evaluating LLMs’ capabilities in Italian. This strategy ensures a high diversity of tasks and, thus, of tested capabilities, and it distributes the effort of creating resources natively in Italian across many researchers and practitioners.
In the long term, we aim to establish a continuously growing suite of tasks that can be accessed through a shared platform and a live leaderboard, so that any newly developed LLM, either multilingual or Italian monolingual, can be readily assessed. In the short term, we have started to build the CALAMITA benchmark through a series of challenges collaboratively contributed by the research community (Section 2). We have also established an evaluation framework that enables running the current and possibly future challenges in a centralized and coherent manner. This short paper summarises the collaborative procedure, the challenges currently included in CALAMITA (see the CALAMITA website: https://clic2024.ilc.cnr.it/calamita/), and the evaluation procedure.

2. Collaborative Methodology

The CALAMITA approach is inspired by standard Natural Language Processing shared tasks, giving the benchmark a strong collaborative nature. The Italian Association for Computational Linguistics (AILC, https://www.ai-lc.it) launched a public call, mainly aimed at the Italian NLP community but spread across the standard international communication channels, asking for challenges, and corresponding datasets, on which LLMs could be tested.
Participants contributing a challenge were expected to provide an explanation and motivation for the challenge, as well as a dataset that reflects it. They were also asked to provide any information relevant to the dataset (provenance, annotation, distribution of labels or phenomena, etc.). Evaluation metrics and examples were also expected along with the task and dataset submission. Existing relevant datasets could also be submitted, as long as they made an interesting contribution to the benchmark and were natively created in Italian. To standardize contributions to the CALAMITA benchmark, all proposed tasks, with existing or new datasets, had to follow a predefined template created and distributed by the CALAMITA organizers.
Creating the CALAMITA benchmark and the first round of LLM evaluation required several steps. In the first phase, all prospective participants submitted a pre-proposal. In case of a positive evaluation, based on compliance with the requirements and balance across submissions, participants were then asked to submit the final and complete challenge in phase two, following the provided CALAMITA template. A final report was also requested for each accepted task, providing information on implementing the code for the evaluation.
The data and evaluation team set up the final CALAMITA benchmark by compiling the data and code of all the proposed tasks. We forked the Language Model Evaluation Harness tool (https://github.com/EleutherAI/lm-evaluation-harness) to create a custom CALAMITA version including all the accepted tasks. Once the benchmark was assembled, the CALAMITA organizers ran zero- or few-shot experiments with a selection of LLMs. No tuning materials or experiments are expected at this stage of the project. Also, while we expect that CALAMITA will, in the longer run, be further populated by additional tasks and will have its own publicly accessible leaderboard allowing for model testing, in this first stage the choice of LLMs to be evaluated and the evaluation procedure are centralized.
Table 1. Categories of abilities tested by CALAMITA tasks. Tasks test general abilities such as knowledge about true facts, commonsense, and logical reasoning (top) or specific NLP-oriented abilities such as code generation or machine translation (bottom). Each task may require models to exhibit more than one ability. Each row gives the ability tested, its description, and the count of tasks.
- Commonsense knowledge (19 tasks): general knowledge about the world that is typically taken for granted in everyday life, e.g., everyday cause-and-effect relationships, situational judgments, physical properties, and basic social interactions.
- Factual knowledge (12 tasks): knowledge of concrete, verifiable facts about the world, e.g., definitions, historical events, or scientific concepts.
- Linguistic knowledge (22 tasks): linguistically motivated tasks that test specific language skills, e.g., word sense disambiguation, coreference resolution, or acceptability judgment.
- Formal reasoning (9 tasks): ability to understand and use formal logical principles to solve problems, e.g., mathematical problems.
- Fairness and bias (6 tasks): evaluates a model’s capacity to handle sensitive tasks, including exclusive and stereotyped language understanding and detecting offensive or biased language towards social groups.
- Code generation (1 task): ability to generate fully functioning code for a specific programming language.
- Machine translation (2 tasks): ability to translate a sentence from a source language into another language, with one of the two being Italian.
- Summarization (2 tasks): ability to create relevant summaries of a given excerpt, e.g., news headline generation or news reduction.
3. Challenges

The preliminary call for tasks yielded the submission of over 20 proposals. Almost all of them were retained and are part of the present CALAMITA challenge, apart from the proposals that aimed at testing abilities that LLMs should not be expected to have, such as abilities typical of information retrieval engines, and the proposals that required manual evaluation. In what follows, we briefly describe each task included in CALAMITA and refer the reader to each of the challenges’ reports for further details. In Table 1, we describe the macro categories under which the CALAMITA tasks can be grouped, where categories are broad classes of tested abilities. Table 2 shows which abilities apply to each challenge.

ABRICOT (ABstRactness and Inclusiveness in COntexT) [5] is a task designed to evaluate Italian language models on their ability to understand and assess the abstractness and inclusiveness of language, two nuanced features that humans naturally convey in everyday communication. Unlike binary categorizations such as abstract/concrete or inclusive/exclusive, these features exist on a continuous spectrum with varying degrees of intensity. The task is based on a manual collection of sentences that present the same noun phrase (NP) in different contexts, allowing its interpretation to vary between the extremes of abstractness and inclusiveness. This challenge aims to verify how LLMs perceive subtle linguistic variations and their implications in natural language.

AMELIA (Argument Mining Evaluation on Legal documents in ItAlian) [6] is a challenge consisting of three classification tasks in the context of argument mining in the legal domain. The tasks are based on a dataset of 225 Italian decisions on Value Added Tax, annotated to identify and categorize argumentative text. The objective of the first task is to classify each argumentative component as a premise or conclusion. In contrast, the second and third tasks aim at classifying the type of premise, legal vs. factual, and its corresponding argumentation scheme. The classes are highly unbalanced, hence evaluation is based on the macro F1 score.
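As a concrete illustration of the macro-averaged scoring used for such unbalanced classification tasks, the snippet below computes a macro F1 score with scikit-learn. This is a minimal sketch with invented labels, not the official CALAMITA evaluation code.

```python
from sklearn.metrics import f1_score

# Hypothetical gold and predicted labels for a binary task such as
# AMELIA's premise-vs-conclusion classification (labels are invented).
gold = ["premise", "premise", "conclusion", "premise", "conclusion"]
pred = ["premise", "conclusion", "conclusion", "premise", "premise"]

# Macro averaging computes F1 per class and takes the unweighted mean,
# so minority classes weigh as much as majority ones.
macro_f1 = f1_score(gold, pred, average="macro",
                    labels=["premise", "conclusion"])
print(f"Macro F1: {macro_f1:.3f}")
```

Macro averaging is the natural choice here precisely because the class distribution is skewed: a model that always predicts the majority class is penalized on the minority classes.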
BEEP (BEst DrivEr’s License Performer) [7] is a benchmark to evaluate large language models in the context of a simulated Italian driver’s license exam. This challenge tests the models’ ability to understand and apply traffic laws, road safety regulations, and vehicle-related knowledge through a series of true/false questions. The dataset is derived from official ministerial materials used in the Italian licensing process, explicitly targeting Category B licenses.

BLM-It (Blackbird Language Matrices) [8] is a task made of linguistic puzzles (matrices) around language-related problems, focusing on formal and semantic properties of language. A BLM matrix consists of a context set and an answer set. The context is a sequence of sentences that implicitly encodes an underlying generative linguistic rule. The contrastive multiple-choice answer set includes negative examples following corrupted generating rules. The models are prompted in a few-shot setting, and the datasets include a few prompts for this purpose.

DIMMI (Drug InforMation Mining in Italian) [9] is a task aimed at evaluating the proficiency of Large Language Models in extracting drug-specific information from Patient Information Leaflets. The challenge evaluates the effectiveness of processing complex medical information in Italian and is approached as an information extraction task in a zero-shot setting, based on the model’s pre-existing knowledge or through in-context learning. Evaluation is performed against a manually created gold standard.

ECWCA (Educational CrossWord Clues Answering) [10] is designed to evaluate the knowledge and reasoning capabilities of LLMs through crossword clue-answering. The challenge consists of two tasks: a standard question-answering format where the LLM is asked to solve crossword clues, and a variation where the model is given hints about the word lengths of the answers, which is expected to help models with reasoning abilities.

EurekaRebus [11] is a task that tests the ability of LLMs to conduct multi-step, knowledge-intensive inferences while respecting predefined constraints. LLMs are prompted to reason step-by-step to solve verbalized variants of rebus games. Verbalized rebuses replace visual cues with crossword definitions to create an encrypted first pass, making the problem entirely text-based. Multiple metrics are used to grasp the models’ performance in knowledge recall, constraint adherence, and re-segmentation abilities across reasoning steps.

GATTINA (GenerAtion of TiTles for Italian News Articles) [12] is a task that aims to assess the ability of LLMs to generate headlines for science news articles. Aspects such as the appropriateness of the summary, creativity, and attractiveness are evaluated through a battery of metrics. The benchmark consists of a large dataset of science news articles and their corresponding published headlines from ANSA Scienza and Galileo, two prominent Italian media outlets.

GEESE (Generating and Evaluating Explanations for Semantic Entailment) [13] is focused on evaluating the impact of generated explanations on the predictive performance of language models for the task of Recognizing Textual Entailment in Italian. Using a dataset enriched with human-written explanations, two large language models are employed to generate and utilize explanations for semantic relationships between sentence pairs. GEESE assesses the quality of generated explanations by measuring changes in prediction accuracy when explanations are provided.

GFG (Gender-Fair Generation) [14] is a task designed to assess and monitor the recognition and generation of gender-fair language in both mono- and cross-lingual scenarios. It includes three tasks: (1) the detection of gender-marked expressions in Italian sentences, (2) the rewriting of gendered expressions into gender-fair alternatives, and (3) the generation of gender-fair language in automatic translation from English to Italian. The challenge relies on three different annotated datasets: the GFL-it corpus, which contains Italian texts extracted from administrative documents provided by the University of Brescia; GeNTE, a bilingual test set for gender-neutral rewriting and translation built upon a subset of the Europarl dataset; and Neo-GATE, a bilingual test set designed to assess the use of non-binary neomorphemes in Italian for both fair formulation and translation tasks.

GITA (Graded Italian Annotated Dataset) [15] investigates the physical commonsense reasoning capabilities of large language models, assessing their low-level understanding of the physical world using a test set in the Italian language. Three specific tasks are evaluated: identifying plausible and implausible stories within the dataset, identifying the conflict that generates an implausible story, and identifying the physical states that make a story implausible. The dataset is written and annotated by a professional linguist.

INVALSI [16] is a benchmark based on the Invalsi tests administered to students within the Italian school system. Expert pedagogists prepare these tests with the explicit goal of testing average students’ performance over time across Italy. There are two benchmarks: Invalsi MATE (420 questions), which targets the models’ performance on mathematical understanding, and Invalsi ITA (1279 questions), which evaluates language understanding in Italian.

ITA-SENSE (ITAlian word SENSE disambiguation) [17] is a task that assesses LLMs’ abilities in understanding lexical semantics through Word Sense Disambiguation. The classical Word Sense Disambiguation task is cast as a generative problem formalized as two tasks: [T1] given a target word and a sentence in which the word occurs, generate the correct meaning definition; [T2] given a target word and a sentence in which the word occurs, choose the correct meaning definition from a predefined set. For CALAMITA, LLMs are tested in a zero-shot setting.

MACID (Multimodal ACtion IDentification) [18] is a task aimed at evaluating the ability of LLMs to differentiate between closely related action concepts based on textual descriptions alone. The challenge is inspired by the “find the intruder” task, where models must identify an outlier among a set of 4 sentences that describe similar yet distinct actions. The dataset highlights action-predicate mismatches, where the same verb may describe different actions, or different verbs may refer to the same action. Although mono-modal (text-only), the task is designed for future multimodal integration, linking visual and textual representations to enhance action recognition.
MT (Machine Translation) [19] is a task that aims at testing the ability of LLMs in automatic translation between Italian and English (in both directions). The task proposes a benchmark composed of two datasets covering different domains and with varying distribution policies. Performance is reported in terms of four evaluation metrics, whose scores allow an overall evaluation of the quality of the automatically generated translations.

Mult-IT [20] is a large-scale Multiple-Choice Question Answering (MCQA) dataset for evaluating the factual knowledge and reasoning abilities of LLMs in Italian. This contribution aims to counteract the disadvantages of using MCQA benchmarks that are automatically translated from English and may sound unnatural, contain errors, or use linguistic constructions that do not align with the target language. In addition, they may introduce topical and ideological biases reflecting Anglo-centric perspectives. Mult-IT comprises over 110,000 manually written questions sourced directly from preparation quizzes for Italian university entrance exams or for public sector employment exams in Italy.

PejorativITy [21] is a task to investigate misogyny expressed through neutral words that can assume a negative connotation when functioning as pejorative epithets. This challenge addresses a) the disambiguation of such ambiguous words in a given context, and b) the detection of misogyny in instances that contain such polysemic words. The task is divided into two parts, both framed as binary classification. In Task A, the model is asked to determine whether, given a tweet, the target word is used in a pejorative or non-pejorative way. In Task B, the model is asked whether the whole sentence is misogynous.

PERSEID (PERSpEctivist Irony Detection) [22] considers the task of irony detection from short social media conversations collected from Twitter (X) and Reddit. Data is leveraged from MultiPICO, a recent multilingual dataset with disaggregated annotations and annotators’ metadata. The task evaluates whether prompting LLMs with additional annotators’ demographic information (gender only, age only, or the combination of the two) improves performance compared to a baseline in which only the input text is provided.

TRACE-it (Testing Relative clAuses Comprehension through Entailment in ITalian) [23] is a benchmark designed to evaluate the ability of LLMs to comprehend a specific type of complex syntactic construction in Italian: object relative clauses. The challenge is framed as a binary entailment task where, given a complex sentence, the model is tasked with determining whether it logically entails a simpler yes/no implication.

Termite [24] focuses on the Text-to-SQL task in Italian. Natural language queries are written natively in Italian, and the models are expected to turn them into SQL queries. The dataset is built to be invisible to search engines, since it is locked under an encryption key delivered along with the resource, to reduce accidental inclusion in upcoming training sets. It contains hand-crafted databases in different domains, each with a balanced set of NL-SQL query pairs. The NL questions are built in such a way that they can be solved by a model relying only on its linguistic proficiency and an analysis of the schema, with no external knowledge needed.

VeryfIT [25] is designed to evaluate the in-memory factual knowledge of language models on data written by professional fact-checkers, posed as true-or-false questions. Topics of the statements vary, but most are in specific domains related to the Italian government, policies, and social issues. The task presents several challenges: extracting statements from segments of speeches, determining appropriate contextual relevance both temporally and factually, and verifying the statements’ accuracy.

ItaEval [26] is a multifaceted evaluation suite comprising three overarching task categories: (i) natural language understanding, (ii) commonsense and factual knowledge, and (iii) bias, fairness, and safety [4]. ItaEval is a collection of 18 tasks encompassing existing and new datasets. The so-compiled ItaEval suite provides a standardized, multifaceted framework for evaluating Italian language models, facilitating more rigorous and comparative assessments of model performance.

Table 2. Abilities tested by each task in CALAMITA. ∗: tasks that require contextualized factual knowledge, e.g., reading comprehension tasks. ∗∗: tasks that require stereotypical commonsense knowledge, e.g., understanding the concept of misogyny. [The per-task ability matrix did not survive text extraction; the table covers ABRICOT, AMELIA, BEEP, BLM-It, DIMMI∗, ECWCA, EurekaRebus, GATTINA, GEESE, GFG, GITA, INVALSI, ITA-SENSE, MACID, MT, Mult-IT, PejorativITy, PERSEID, Termite, TRACE-it, VeryfIT, and the ItaEval tasks ItaCoLA, Belebele-it∗, News-Sum, IronITA, SENTIPOLC, SQuAD-it∗, TruthfulQA-it, ARC-it, XCOPA-it, HellaSwag-it, AMI, HONEST∗∗, GeNTE rephrasing, Multilingual HateCheck∗∗, and HaSpeeDe2.]

4. Evaluation Strategy

Rooted in its very nature, CALAMITA’s biggest challenge is standardizing evaluation across many tasks and scenarios. To account for such high variability, we settled on a few fundamental choices that shape CALAMITA’s core principles (Design choices) and left broad freedom to challenge participants to specify fine-grained aspects of their tasks (Participant choices). Base design choices shared across all tasks and high task-specific customization balance standardization and versatility.
Design choices. Following recent practices for language model evaluation [e.g., 27, 28], we consider every received task as a downstream task to be solved via standard prompting. We support two types of tasks: Multiple-Choice (MC) and Open-Ended (OE) generation. MC tasks require a model to pick one or more correct answers from a finite set; closed-ended question answering is an example of an MC task. OE tasks require models to generate output tokens until a stopping criterion is met; machine translation and summarization are examples of OE tasks. For evaluating multiple-choice tasks, we rank all candidate answers by their likelihood conditioned on the prompt and pick the highest [29], normalizing each option’s probability by its number of tokens (a minimal sketch of this scoring is given below). We do not adopt a single strategy for evaluating OE tasks, as evaluation depends on the semantics of the output. We do, however, standardize the decoding strategy across OE tasks: we use beam search (n = 5) for machine translation and greedy decoding for all other tasks. See Appendix A for the complete details.
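The following snippet illustrates the length-normalized likelihood ranking described above. It is an illustrative sketch built on the HuggingFace transformers API, not the actual CALAMITA evaluation code (which lives in the lm-eval fork); the model name, prompt, and options are placeholders, and it assumes the option text tokenizes cleanly after the prompt boundary.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration only.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def option_score(prompt: str, option: str) -> float:
    """Average log-likelihood per token of `option`, conditioned on `prompt`."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    full_ids = tok(prompt + option, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities of each token given all preceding tokens.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = full_ids[:, 1:]
    token_ll = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    # Keep only the option tokens, then normalize by their count.
    option_ll = token_ll[:, prompt_ids.shape[1] - 1 :]
    return (option_ll.sum() / option_ll.shape[1]).item()

prompt = "Domanda: Qual è la capitale d'Italia? Risposta: "
options = ["Roma", "Milano", "Napoli"]
print(max(options, key=lambda o: option_score(prompt, o)))
```

The per-token normalization is what keeps long options from being penalized merely for containing more tokens.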
To foster reproducibility, we base CALAMITA’s codebase on open-source tools. We forked and built our evaluation code upon lm-eval [30]. When possible, we recommended that participants release their data publicly and accessibly through the HuggingFace Hub (resulting from the effort for CALAMITA, 35 new datasets have been released with a permissive license). We release our evaluation code at https://github.com/CALAMITA-AILC/lm-evaluation-harness.
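As an indication of how such a fork is typically driven, the snippet below runs a task through lm-eval’s Python entry point. It is a hypothetical usage sketch, assuming the fork is installed and a task name is registered in it; the task name is a placeholder and the snippet is not taken from the CALAMITA repository.

```python
# Hypothetical driver script; assumes `pip install -e .` inside the
# CALAMITA fork of lm-evaluation-harness and a registered task name.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",  # HuggingFace backend
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["example_calamita_task"],  # placeholder task name
    num_fewshot=0,  # zero-shot, as in the first CALAMITA round
)
print(results["results"])
```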
Participant choices. In addition to the data associated with the task and its type (MC or OE), we requested that each participating team provide the specifics needed to compile an arbitrary prompt and to evaluate an arbitrary model generation. Among the prompting details, task proposers specified a prompt template and the number of task demonstrations (0 for zero-shot, N for N-shot prompting). In few-shot cases, we requested where to sample the demonstrations from and the sampling strategy (static, dynamic-random, or dynamic-sequential). Among the evaluation details, we requested that participants specify any post-processing function for raw model outputs, one or more evaluation metrics, and related information. For reporting purposes, we collected a single evaluation score (the first metric listed by proposers).
Crucially, we relied upon meta-descriptions and code to streamline the communication between the task proposers and the challenge organizers. Participants were tasked to provide such information through a single file following a set of guidelines (see the guidelines at https://github.com/CALAMITA-AILC/calamita2024 and the information file at https://gist.github.com/g8a9/f5e82d38ce12831323b20dc79b0452c9). A sketch of the kind of information such a file conveys is given below.
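The following is a hypothetical illustration of the information a task description must convey, rendered as a Python dictionary. The field names and values are invented for illustration and do not reproduce the actual CALAMITA file format or guidelines.

```python
# Invented field names: an illustration of what a task description
# conveys, not the actual CALAMITA information file.
task_spec = {
    "name": "example_calamita_task",      # placeholder task name
    "task_type": "MC",                    # MC or OE
    "prompt_template": "Domanda: {question}\nRisposta:",
    "num_fewshot": 0,                     # 0 = zero-shot, N = N-shot
    "fewshot_sampling": "static",         # static, dynamic-random,
                                          # or dynamic-sequential
    "postprocessing": lambda raw: raw.strip().lower(),
    "metrics": ["macro_f1"],              # first metric is the reported score
}
```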
Model Selection. We tested Llama 3.1 8B Instruct [31] and ANITA [32], two state-of-the-art decoder-only language models. Llama’s 3.1 variant introduces multilingual support over the family’s previous iteration; ANITA is a fine-tuned version of Llama 3 specializing in English and Italian tasks.
Our choice was driven by three primary reasons. First, both models are open-weight, well-known within the Italian NLP community, and explicitly support the Italian language. Second, they have been instruction fine-tuned, a training step that facilitates addressing tasks in zero-shot. Third, they are within the 8 billion parameter range, which allows for fast iteration and good performance.

Results. At the time of writing, some of the results are still being collected. To provide a comprehensive and dynamic overview, we refer the reader to the external page where they are regularly updated: https://calamita-ailc.github.io/calamita2024/.

5. Limitations

CALAMITA is not intended to be an exhaustive benchmark for testing the abilities of Italian LLMs, especially at this first release. Considering the strong collaborative nature of this benchmark, coherence across tasks might not be optimal, in spite of the efforts put in by the organisers to make all datasets and the evaluation procedure uniform. Although we have paid attention to this issue, we cannot be absolutely certain that none of the datasets has, in one form or another, already ended up in some training set.

Acknowledgments

The ItaEval tasks submitted to CALAMITA are the result of a joint effort of members of the “Risorse per la Lingua Italiana” community (rita-nlp.org): we thank every member who dedicated their time to the project. For providing the computational resources, we thank CINECA (ISCRA grant: HP10C3RW9F; ISCRA C grant: CALAMITA - HP10CKZDYT), the Center for Information Technology of the University of Groningen for their support and for providing access to the Hábrók high performance computing cluster, and the University of Turin for providing access to the HPC4AI cluster [33]. Malvina Nissim’s work is also part of the “Humane AI” theme of the Dutch Sectorplan for the Humanities. The work of Viviana Patti was partially supported by the “HARMONIA” project - M4-C2, I1.3 Partenariati Estesi - Cascade Call - FAIR - CUP C63C22000770006 - PE PE0000013 under the NextGenerationEU programme. The work by Giuseppe Attanasio was supported by the Portuguese Recovery and Resilience Plan through project C645008882-00000055 (Center for Responsible AI) and by Fundação para a Ciência e Tecnologia through contract UIDB/50008/2020. The work by Pierpaolo Basile and Elio Musacchio was supported by the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU. The work of Matteo Rinaldi and Jacopo Gili has been partly supported by the Spoke “Future HPC & Big Data” of the ICSC - Centro Nazionale di Ricerca in “High Performance Computing, Big Data and Quantum Computing”, funded by European Union - NextGenerationEU.
References

[1] A. Srivastava, D. Kleyjo, Z. Wu, Beyond the imitation game: Quantifying and extrapolating the capabilities of language models, Transactions on Machine Learning Research (2023).
[2] V. Basile, L. Bioglio, A. Bosca, C. Bosco, V. Patti, UINAUIL: A unified benchmark for Italian natural language understanding, in: D. Bollegala, R. Huang, A. Ritter (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 3: System Demonstrations), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 348-356. URL: https://aclanthology.org/2023.acl-demo.33. doi:10.18653/v1/2023.acl-demo.33.
[3] L. De Mattei, M. Cafagna, A. AI, F. Dell’Orletta, M. Nissim, A. Gatt, CHANGE-IT @ EVALITA 2020: Change headlines, adapt news, generate, EVALITA Evaluation of NLP and Speech Tools for Italian - December 17th, 2020 (2020) 235.
[4] G. Attanasio, P. Delobelle, M. La Quatra, A. Santilli, B. Savoldi, ItaEval and TweetyIta: A new extensive benchmark and efficiency-first language model for Italian, in: CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Pisa, Italy, December 4 - December 6, 2024.
[5] G. Puccetti, C. Collacciani, A. A. Ravelli, A. Esuli, M. Bolognesi, ABRICOT - ABstRactness and Inclusiveness in COntexT: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[6] G. Grundler, A. Galassi, P. Santin, A. Fidelangeli, F. Galli, E. Palmieri, F. Lagioia, G. Sartor, P. Torroni, AMELIA - Argument Mining Evaluation on Legal documents in ItAlian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[7] F. Mercorio, D. Potertì, A. Serino, A. Seveso, BEEP - BEst DrivEr’s License Performer: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[8] C. Jiang, G. Samo, V. Nastase, P. Merlo, BLM-It - Blackbird Language Matrices for Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[9] R. Manna, M. P. Di Buono, L. Giordano, DIMMI - Drug InforMation Mining in Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[10] A. Zugarini, K. Zeinalipour, A. Fusco, A. Zanollo, ECWCA - Educational CrossWord Clues Answering: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[11] G. Sarti, T. Caselli, A. Bisazza, M. Nissim, EurekaRebus - Verbalized Rebus Solving with LLMs: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[12] M. Francis, M. Rinaldi, J. Gili, L. De Cosmo, S. Iannaccone, M. Nissim, V. Patti, GATTINA - GenerAtion of TiTles for Italian News Articles: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[13] A. Zaninello, B. Magnini, GEESE - Generating and Evaluating Explanations for Semantic Entailment: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[14] S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri, L. Bentivogli, GFG - Gender-Fair Generation: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[15] G. Pensa, E. Azurmendi, J. Etxaniz, B. Altuna, I. Gonzalez-Dios, GITA4CALAMITA - Evaluating the Physical Commonsense Understanding of Italian LLMs in a Multi-layered Approach: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[16] G. Puccetti, M. Cassese, A. Esuli, INVALSI - Mathematical and Language Understanding in Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[17] P. Basile, E. Musacchio, L. Siciliani, ITA-SENSE - Evaluate LLMs’ ability for ITAlian word SENSE disambiguation: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[18] A. A. Ravelli, R. Varvara, L. Gregori, MACID - Multimodal ACtion IDentification: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[19] M. Cettolo, A. Piergentili, S. Papi, M. Gaido, M. Negri, L. Bentivogli, MAGNET - MAchines GeNErating Translations: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[20] M. Rinaldi, J. Gili, M. Francis, M. Goffetti, V. Patti, M. Nissim, Mult-IT - Multiple Choice Questions on Multiple Topics in Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[21] A. Muti, PejorativITy - In-Context Pejorative Language Disambiguation: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[22] V. Basile, S. Casola, S. Frenda, S. M. Lo, PERSEID - Perspectivist Irony Detection: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[23] D. Brunato, TRACE-it - Testing Relative clAuses Comprehension through Entailment in ITalian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[24] F. Ranaldi, E. S. Ruzzetti, D. Onorati, F. M. Zanzotto, L. Ranaldi, Termite - Italian Text-to-SQL: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[25] J. Gili, V. Patti, L. Passaro, T. Caselli, VeryfIT - Benchmark of Fact-Checked Claims for Italian: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[26] G. Attanasio, M. La Quatra, A. Santilli, B. Savoldi, ItaEval: A CALAMITA Challenge, in: Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), Pisa, Italy, December 4 - December 6, 2024, CEUR Workshop Proceedings, CEUR-WS.org, 2024.
[27] S. Mehta, M. H. Sekhavat, Q. Cao, M. Horton, Y. Jin, C. Sun, S. I. Mirzadeh, M. Najibi, D. Belenko, P. Zatloukal, et al., OpenELM: An efficient language model family with open training and inference framework, in: Workshop on Efficient Systems for Foundation Models II @ ICML 2024, 2024.
[28] D. Groeneveld, I. Beltagy, P. Walsh, A. Bhagia, R. Kinney, O. Tafjord, A. H. Jha, H. Ivison, I. Magnusson, Y. Wang, et al., OLMo: Accelerating the science of language models, in: Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 15789-15809. URL: https://aclanthology.org/2024.acl-long.841. doi:10.18653/v1/2024.acl-long.841.
[29] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot learners, in: Proceedings of the 34th International Conference on Neural Information Processing Systems, NeurIPS ’20, Curran Associates Inc., Red Hook, NY, USA, 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf.
[30] S. Biderman, H. Schoelkopf, L. Sutawika, L. Gao, J. Tow, B. Abbasi, A. F. Aji, P. S. Ammanamanchi, S. Black, J. Clive, et al., Lessons from the trenches on reproducible evaluation of language models, arXiv preprint arXiv:2405.14782 (2024).
[31] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al., The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783 (2024).
[32] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: LLaMAntino-3-ANITA, arXiv preprint arXiv:2405.07101 (2024).
[33] M. Aldinucci, S. Rabellino, M. Pironti, F. Spiga, P. Viviani, M. Drocco, M. Guerzoni, G. Boella, M. Mellia, P. Margara, I. Drago, R. Marturano, G. Marchetto, E. Piccolo, S. Bagnasco, S. Lusso, S. Vallero, G. Attardi, A. Barchiesi, A. Colla, F. Galeazzi, HPC4AI, an AI-on-demand federated platform endeavour, in: ACM Computing Frontiers, Ischia, Italy, 2018. URL: https://iris.unito.it/retrieve/handle/2318/1765596/689772/2018_hpc4ai_ACM_CF.pdf. doi:10.1145/3203217.3205340.

A. Experimental Details

A.1. Technical Details

We run our experiments on the LEONARDO HPC infrastructure (Booster partition, https://www.hpc.cineca.it/systems/hardware/leonardo/). The Booster module partition is based on BullSequana XH2135 supercomputer nodes, each with four NVIDIA Tensor Core GPUs (custom Ampere A100 GPU 64GB HBM2e, NVLink 3.0 (200GB/s)) and a single Intel CPU.
We forked the lm-eval-harness official repository at the commit with hash b2bf7bc4a601c643343757c92c1a51eb69caf1d7. We report all technical details on our official webpage: https://calamita-ailc.github.io/calamita2024/.

A.2. Generation Configuration

Table 3 reports the generation parameters we used for Open-Ended tasks.

Table 3. Generation Parameters. ∗: we set beam search to 5 for machine translation tasks.
- Batch size: 1∗
- Temperature: 0.0
- Sampling: False
- Stopping criteria: \n\n, <|im_end|>, “. ”, <|eot_id|>, <|end_of_text|>
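To make these settings concrete, the sketch below maps Table 3 onto HuggingFace transformers generation calls. It is an illustrative reading of the table, not the CALAMITA evaluation code; the model and input are placeholders, and sequence-level stop strings assume a recent transformers version that supports the `stop_strings` argument.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; CALAMITA runs its lm-eval fork on Llama 3.1 8B
# Instruct and ANITA rather than this standalone script.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("Traduci in inglese: Il gatto dorme.", return_tensors="pt")

# Greedy decoding (Table 3): no sampling, so temperature is unused.
greedy_out = model.generate(**inputs, do_sample=False, max_new_tokens=64)

# Machine translation tasks instead use beam search with 5 beams.
beam_out = model.generate(
    **inputs, do_sample=False, num_beams=5, max_new_tokens=64
)

# String-level stopping criteria such as "\n\n" or ". " can be passed
# via `stop_strings`, which requires also passing the tokenizer.
stopped_out = model.generate(
    **inputs, do_sample=False, max_new_tokens=64,
    stop_strings=["\n\n", ". "], tokenizer=tok,
)
print(tok.decode(stopped_out[0], skip_special_tokens=True))
```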