<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Leaderboard for Benchmarking LLMs on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Madeddu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Resta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Zanoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Cimmino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Albano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Domyn</institution>
          ,
          <addr-line>Via Principe Amedeo, 5, 20124 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Via Sommarive 18, 38123 Povo, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Torino, Computer Science Department</institution>
          ,
          <addr-line>Corso Svizzera 185, 10149 Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We present Evalita-LLM, a comprehensive benchmark and leaderboard designed to evaluate Large Language Models (LLMs) on Italian tasks. Evalita-LLM covers ten native Italian tasks, including both multiple-choice and generative formats, and enables fair and transparent comparisons by using multiple prompts per task, addressing LLMs' sensitivity to prompt phrasing. The leaderboard supports both zero-shot and few-shot evaluation settings and currently reports results for 23 open-source models. Our findings show consistent performance improvements with few-shot prompting and larger model sizes. Additionally, more recent versions of LLMs generally outperform their predecessors. However, no single model excels across all tasks, which highlights the task-dependent nature of LLM performance. Notably, generative tasks remain significantly more challenging than multiple-choice ones. Hosted on Hugging Face, the Evalita-LLM leaderboard offers a public and continuously updated platform for benchmarking and transparent evaluation of LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Leaderboard</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Leaderboards have become essential tools for assessing performance in the rapidly evolving landscape of Large Language Models (LLMs), offering standardized comparisons across a large variety of tasks, such as language understanding, dialogue, reasoning and code generation. Among available leaderboards, the Hugging Face Open LLM Leaderboard<sup>1</sup> is a popular and widely used resource for researchers, particularly in the open-source community. Now in its second version, it introduces more challenging and reliable benchmarks, including MMLU-Pro, GPQA, MuSR, MATH, IFEval, and BBH. Other notable platforms, such as Scale SEAL<sup>2</sup>, Vellum.ai<sup>3</sup>, and LLM-Stats.com<sup>4</sup>, support evaluation efforts. In addition, open-source initiatives focused on human preference evaluation, like Chatbot Arena<sup>5</sup> and the Chatbot Arena LLM Leaderboard<sup>6</sup>, are playing a key role in advancing the benchmarking landscape.</p>
      <p>Although LLM benchmarks have driven significant progress, they currently show limitations that affect the fairness and completeness of the evaluation process. First, the focus on English makes them less useful for testing models meant to serve other languages, including Italian. This is particularly relevant because of the recent growth of LLMs with specific training on Italian, like for instance LLaMAntino [2], the Minerva family [3], Italia<sup>7</sup>, Velvet<sup>8</sup> and the recent model MIIA<sup>9</sup>. On the other side, current benchmarks for Italian, as for instance Ita-bench<sup>10</sup>, often rely on automatic translations of English datasets, which is not optimal, due to poor translation quality and cultural differences that make fair testing harder. We also want to mention the collaborative CALAMITA effort [4], which gathered a variety of different tasks based on native data from the community.</p>
      <p>A second issue in benchmarking LLMs is that most benchmarks are based on a single-prompt approach (i.e., one prompt is arbitrarily selected for each task). However, it is well known that LLMs are very sensitive to how prompts are phrased [5, 6, 7], and that even small changes in wording can lead to big differences in performance, making single-prompt evaluations less reliable and harder to compare. For example, IberBench [8], a benchmark designed for Iberian languages, employs a single-prompt evaluation methodology. While this simplifies the evaluation pipeline, the authors acknowledge that alternative prompts could lead to different performance outcomes.</p>
      <p>Third, the vast majority of current benchmarks rely almost exclusively on multiple-choice tasks, drastically limiting the capacity to test the generative abilities of LLMs, which have been mainly trained on open-text generation. Although the multiple-choice format simplifies scoring, it often requires artificial task reformulations that hide the model's natural ability to generate text. In contrast, generative tasks, although they better reflect real-world applications, pose challenges, including less reliable evaluation metrics and inconsistent output formatting.</p>
      <p>Figure 1: Evalita-LLM incremental validation methodology.</p>
      <p>1. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
2. https://scale.com/leaderboard
3. https://www.vellum.ai/llm-leaderboard
4. https://llm-stats.com/
5. https://openlm.ai/chatbot-arena/
6. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
7. https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1
8. https://huggingface.co/Almawave/Velvet-14B
9. https://huggingface.co/Fastweb/FastwebMIIA-7B
10. https://huggingface.co/collections/sapienzanlp/ita-bench-italian-benchmarks-for-llms-66337ca59e6df7d7d4933896</p>
      <p>To address the above-mentioned issues, we introduce Evalita-LLM<sup>11</sup>, a comprehensive benchmark with its associated leaderboard, specifically designed to evaluate LLMs on Italian tasks. The benchmark includes a diverse set of carefully validated tasks and uses multiple prompts per task to ensure more consistent and reliable evaluations. All tasks are originally written in Italian, avoiding issues related to translation quality or cultural mismatches. The benchmark combines both multiple-choice and generative tasks, offering a balanced and practical way to assess the full range of model abilities. Evalita-LLM is supported by a public leaderboard hosted on Hugging Face<sup>12</sup>, which allows fair comparisons between models and tasks and helps the community to better understand how Italian LLMs perform and can be improved. The results on the leaderboard confirm that few-shot in-context learning works better than using no examples (zero-shot) for most of the models. Results also confirm that bigger and newer models usually perform better, showing how fast LLMs are improving.</p>
      <p>11. https://github.com/EleutherAI/lm-evaluation-harness
12. https://huggingface.co/spaces/evalitahf/evalita_llm_leaderboard</p>
    </sec>
    <sec id="sec-2">
      <title>2. Benchmarking Methodology</title>
      <p>The Evalita-LLM benchmark is created using existing datasets, almost exclusively from the Evalita campaigns<sup>13</sup>, supported by the Italian Association for Computational Linguistics (AILC<sup>14</sup>). Over the past 15 years, Evalita has produced approximately 70 datasets covering various language tasks. Around 35 of these are freely available through the European Language Grid (ELG)<sup>15</sup>, thanks to the Evalita4ELG project [9] led by the University of Turin.</p>
      <p>We selected 15 native Italian datasets: half for multiple-choice tasks and half for open-ended ones. For each task, we created approximately 20 prompt candidates, adapted from similar tasks (often in English) and refined through several rounds of testing. The prompts were tested on various Italian LLMs using fixed evaluation metrics. During this process, prompts that resulted in weaker performance across the various models were discarded, and overly difficult tasks were also excluded.</p>
      <p>The Evalita-LLM benchmark was developed using the lm-evaluation-harness library<sup>16</sup> [10], which provides a unified interface for evaluating language models across a variety of tasks and formats. Since model performance can be sensitive to parameters, particularly temperature and maximum context length, the library allows users to adjust these settings to some extent. In our setup, we follow the library’s standard configuration to ensure consistency across evaluations. By default, the temperature is set to 0.0, resulting in deterministic (greedy) decoding, which favors reproducibility. To determine each model’s input capacity, the maximum context length (the number of tokens a model can process per input) is retrieved dynamically by inspecting the model’s configuration fields, such as n_positions, max_position_embeddings, or the tokenizer’s model_max_length.</p>
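      <p>As an illustration, the sketch below shows how such a fallback chain over configuration fields can look with the Hugging Face transformers API; the helper name, default value and example model are illustrative, and the actual harness logic may differ.</p>
      <preformat>
# Sketch: derive a model's maximum context length from its configuration.
# The field names mirror those mentioned above; actual lm-evaluation-harness
# behaviour may differ in details.
from transformers import AutoConfig, AutoTokenizer

def get_max_context_length(model_id, default=2048):
    config = AutoConfig.from_pretrained(model_id)
    for field in ("max_position_embeddings", "n_positions"):
        value = getattr(config, field, None)
        if value:
            return int(value)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    value = tokenizer.model_max_length
    # transformers reports a very large placeholder when no real limit is stored.
    if value and value != int(1e30):
        return int(value)
    return default

print(get_max_context_length("google/gemma-2-9b-it"))  # example model id
      </preformat>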
      <p>The benchmark construction followed three main steps:</p>
      <p>• Dataset selection: datasets were converted into Hugging Face (HF) format and uploaded.
• Task definition: creating prompts, choosing few-shot or zero-shot settings, formatting outputs, and setting up metrics. The tasks are defined for evaluation only and are not used for model training.
• Model evaluation: tasks are tested on Italian LLMs during development to check whether the prompts work well.</p>
      <p>Figure 1 shows how the benchmark was created step by step. At the end of the process, we selected ten tasks that cover different language types, text styles and real-world uses.</p>
      <sec id="sec-2-1">
        <title>2.1. Prompting Approach</title>
        <p>Prompt design is crucial, since LLMs are highly sensitive to minor wording changes [11, 12, 13, 5, 6]. To address this issue, Evalita-LLM combines three main strategies: setting general rules for prompt design, using a compositional method to build prompts, and applying multiple prompts per task to ensure robustness and reliability.</p>
        <p>13. https://www.evalita.it
14. https://www.ai-lc.it
15. https://live.european-language-grid.eu
16. https://github.com/EleutherAI/lm-evaluation-harness
17. https://huggingface.co/spaces/evalitahf/evalita_llm_leaderboard</p>
      </sec>
      <sec id="sec-2-1-1">
        <title>2.1.1. General Prompting Rules</title>
        <p>The following rules guide the construction of prompts, to ensure consistency, simplicity and alignment with the objectives of Evalita-LLM. The exact prompts used for each task are available on the leaderboard webpage<sup>17</sup>. Additional examples translated into English can be found in Appendix A.</p>
        <p>• Prompts are entirely in Italian, including output labels.
• We avoid assigning roles to the model (e.g., “You are an assistant...”).
• Prompts are short and simple to reduce bias.
• Each prompt specifies the type of input for the specific task (e.g., tweet, news, sentence).</p>
      </sec>
      <sec id="sec-2-1-2">
        <title>2.1.2. Compositional Prompting</title>
        <p>To ensure flexibility and systematic variation, we adopt a compositional approach, building each prompt from a combination of key elements (a sketch of this composition is shown below):</p>
        <p>• Core question or instruction (required for all prompts);
• High-level task description (optional);
• Answer options (optional, for multiple-choice tasks);
• Output format instructions (optional, for generative tasks).</p>
        <p>Keeping some components fixed reduces unnecessary prompt variations and simplifies evaluation. Around 20 templates were created for each task; after a testing phase, we kept 6 templates for multiple-choice tasks and 4 for generative tasks, due to the higher computational cost of generative evaluation.</p>
      </sec>
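      <p>To make the compositional scheme concrete, the sketch below assembles a prompt from the components listed above; the function and argument names are illustrative and not part of the Evalita-LLM codebase, and the example is translated into English (the actual prompts are in Italian).</p>
      <preformat>
# Illustrative sketch of compositional prompt construction.
# Component names (task_description, question, options, output_format)
# mirror the elements listed above; they are not the actual template fields.
def build_prompt(question, task_description=None, options=None, output_format=None):
    parts = []
    if task_description:
        parts.append(task_description)   # optional high-level task description
    parts.append(question)               # mandatory core question or request
    if options:
        labels = ["A", "B", "C", "D", "E"]
        listed = [f"{label}: {option}" for label, option in zip(labels, options)]
        parts.append("\n".join(listed))  # optional answer options
    if output_format:
        parts.append(output_format)      # optional output format instructions
    return "\n".join(parts)

# Example in the style of "Prompt 4" (task description + question + answers).
print(build_prompt(
    question="The following tweet: '{{text}}' expresses a sentiment that is:",
    task_description="You have to carry out a sentiment analysis task.",
    options=["Positive", "Negative", "Neutral", "Mixed"],
))
      </preformat>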
      <sec id="sec-2-1-3">
        <title>2.1.3. Multiple Prompts for Multiple-choice Tasks</title>
        <p>For multiple-choice tasks, we use six distinct prompt templates, each adapted to the specific task. The templates systematically vary the inclusion of a task description, the core question and the answer options:</p>
        <p>• Prompt 1: Question. A base question that the model must answer, following the general prompt guidelines.
• Prompt 2: Task description + Question. A brief task description is prepended to the question.
• Prompt 3: Question + Answer. The possible answers are appended to the question.
• Prompt 4: Task description + Question + Answer. This combines both the task description and the answer options with the question.
• Prompt 5: Affirmative. A simple affirmative statement that implicitly asks for an answer, without listing options.
• Prompt 6: Task description + Affirmative. The task description is prepended to the affirmative statement.</p>
        <p>It has to be noted that in multiple-choice prompts, the answer options can either be explicitly embedded in the prompt or provided as options to the evaluation process.</p>
        <p>To minimize bias in model evaluation, attention was given to the order of answer choices in multiple-choice prompts. Only Prompt 3 and Prompt 4 are susceptible to such bias, as they explicitly list options (A, B, C, etc.). For tasks with fixed answer sets, like Textual Entailment, options were kept in a natural order (e.g., A: True, B: False) to reflect typical human presentation. In contrast, for tasks with more open-ended answers, such as Admission Tests, the answer choices were shuffled during dataset creation to reduce positional bias.</p>
      </sec>
      <sec id="sec-2-1-4">
        <title>2.1.4. Multiple Prompts for Generative Tasks</title>
        <p>Generative prompts require the model to produce textual output, which is then evaluated for correctness using appropriate metrics. We adopt a compositional approach involving three key elements: (i) a mandatory request expressing the task; (ii) an optional brief task description placed at the beginning; (iii) optional output format instructions at the end.</p>
        <p>Because generative tasks are computationally more expensive than multiple-choice tasks, we created four prompt types, which have been tested pairwise in our tasks. Tasks that need structured outputs get clear formatting instructions to help with parsing and scoring, while others allow freer text generation. The four prompt types are:</p>
        <p>• Prompt 7: Request. A base generative request adhering to the general prompting guidelines.
• Prompt 8: Task description + Request. Adds a short task description before the request.
• Prompt 9: Request + Output format. Adds explicit instructions on the required output format.
• Prompt 10: Task description + Request + Output format. Combines the description, request, and output format instructions.</p>
        <p>This modular design balances prompt diversity and evaluation efficiency across generative tasks.</p>
      </sec>
      <sec id="sec-2-1-5">
        <title>2.1.5. Few-Shot Prompting</title>
        <p>Few-shot prompting helps to improve performance by adding a few examples of inputs and their corresponding correct responses within the prompt. For Evalita-LLM, we used a 5-shot learning method. Except for Relation Extraction (REL) and Named Entity Recognition (NER), the five examples were automatically selected from the training sets using lm-evaluation-harness. For REL and NER, examples were manually chosen to ensure full label coverage and output diversity, as many sentences for these two tasks do not contain any relevant entity or relation.</p>
      </sec>
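      <p>Since Evalita-LLM is built on top of lm-evaluation-harness, a few-shot evaluation can be launched through the library’s Python entry point. The call below is a sketch for recent versions of the library: the task name evalita_ner and the model identifier are placeholders, not the benchmark’s actual registered names.</p>
      <preformat>
# Illustrative 5-shot evaluation with lm-evaluation-harness.
# "evalita_ner" is a placeholder task name; the actual Evalita-LLM task
# identifiers are defined in the benchmark's task configuration files.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=google/gemma-2-9b-it",  # example model
    tasks=["evalita_ner"],
    num_fewshot=5,                                 # 5-shot, as used in Evalita-LLM
    batch_size=8,
)
print(results["results"])
      </preformat>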
      <sec id="sec-2-2">
        <title>2.2. Evaluation Metrics</title>
        <p>To select effective prompts for each task in Evalita-LLM, we adopt four prompt-scoring metrics inspired by [5]: maximum, average, minimum, and combined performance. These are used both to evaluate models over prompts and prompts over models.</p>
        <p>Let m be an LLM, T = {(x, y)} a task, P a set of prompts for T, M a set of models, and Perf(m, T, p) ∈ [0, 1] the performance of model m on task T with prompt p.</p>
        <p>Minimum Performance. Lowest performance of a prompt across all models:
Min(p, T, M) = min_{m ∈ M} Perf(m, T, p)   (1)</p>
        <p>Maximum Performance. Best performance of a model across prompts, and best performance of a prompt across models:
Max(m, T, P) = max_{p ∈ P} Perf(m, T, p)   (2)
Max(p, T, M) = max_{m ∈ M} Perf(m, T, p)   (3)</p>
        <p>Average Performance. Mean model performance over prompts, and mean prompt performance over models:
Avg(m, T, P) = (1/|P|) Σ_{p ∈ P} Perf(m, T, p)   (4)
Avg(p, T, M) = (1/|M|) Σ_{m ∈ M} Perf(m, T, p)   (5)</p>
        <p>Combined Performance Score (CPS). This score integrates both stability (robustness) and best observed performance. First, saturation is defined as:
Sat(m, T, P) = 1 − (Max(m, T, P) − Avg(m, T, P))   (6)
Sat(p, T, M) = 1 − (Max(p, T, M) − Avg(p, T, M))   (7)
Then, the CPS for models and prompts is:
CPS(m, T, P) = Sat(m, T, P) · Max(m, T, P)   (8)
CPS(p, T, M) = Sat(p, T, M) · Max(p, T, M)   (9)</p>
        <p>These metrics filter out unstable or poor-performing prompts and assist in choosing prompt sets that balance reliability and top performance across language models.</p>
      </sec>
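      <p>As a worked illustration of the metrics above, the following sketch computes the maximum, average, saturation and CPS of a single model over its per-prompt scores; the function name and example scores are illustrative, not taken from the leaderboard.</p>
      <preformat>
# Illustrative computation of the prompt-aggregation metrics defined above
# for a single model on a single task. Scores are per-prompt performances
# in [0, 1]; the function name is not part of the Evalita-LLM codebase.
def prompt_metrics(per_prompt_scores):
    max_p = max(per_prompt_scores)                             # Eq. (2)
    avg_p = sum(per_prompt_scores) / len(per_prompt_scores)    # Eq. (4)
    sat = 1.0 - (max_p - avg_p)                                # Eq. (6)
    cps = sat * max_p                                          # Eq. (8)
    return {"max": max_p, "avg": avg_p, "sat": sat, "cps": cps}

# Example: a model scoring differently under six multiple-choice prompts.
print(prompt_metrics([0.62, 0.58, 0.66, 0.61, 0.55, 0.64]))
      </preformat>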
    </sec>
    <sec id="sec-3">
      <title>3. Benchmark Leaderboard</title>
      <p>The Evalita-LLM leaderboard is a comprehensive platform that evaluates LLMs on 10 Italian-language tasks, both multiple-choice and generative. The leaderboard displays detailed metrics for each model and task, such as the average performance over all prompts, the best prompt performance, and a combined score balancing accuracy and prompt consistency. Tasks range from multiple-choice questions, like Hate Speech and Sentiment Analysis, to generative requests, including Named Entity Recognition and Summarization. For each task, results are reported per prompt and combined for the overall ranking. Users can filter and compare models by attributes such as the few-shot learning setup. Currently, the leaderboard presents evaluation results for 23 open-source models in both zero-shot and few-shot settings, with new models being added as they become publicly available on the Hugging Face platform.</p>
      <p>To optimize leaderboard management, models are indexed by their Hugging Face name. Only new, previously unlisted models are considered for evaluation, while revisions of already indexed models are skipped to save computational resources. Likewise, models are not re-evaluated on updated datasets, ensuring that resources are used for assessing new models.</p>
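      <p>A minimal sketch of this policy is shown below; the index file name and helper function are hypothetical and only illustrate skipping models that are already listed.</p>
      <preformat>
# Hypothetical sketch of the "evaluate only unlisted models" policy.
# The index file name and its structure are assumptions, not the actual
# leaderboard implementation.
import json

def models_to_evaluate(candidate_ids, index_path="evaluated_models.json"):
    with open(index_path) as f:
        already_indexed = set(json.load(f))    # Hugging Face model names
    # Revisions of indexed models and re-runs on updated datasets are skipped.
    return [m for m in candidate_ids if m not in already_indexed]

print(models_to_evaluate(["google/gemma-3-27b-it", "some-org/new-model"]))
      </preformat>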
      <sec id="sec-3-1">
        <title>3.1. Evalita-LLM Tasks</title>
        <p>Word in Context (WiC). The Word in Context (WiC) task, proposed at Evalita 2023<sup>18</sup>, focuses on word sense disambiguation in context. It consists of two sub-tasks: binary classification and ranking. For LLM evaluation, we focus on the binary classification task, aimed at determining whether a target word w has the same meaning in two sentences, s1 and s2. The best-performing system in the original challenge achieved an F1-macro score of 85.00. In our experiments, the following dataset<sup>19</sup> was used.</p>
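        <p>For illustration, the task datasets are hosted on the Hugging Face Hub and can be loaded with the datasets library; the dataset identifier comes from the task footnotes, while the split name and record layout below are assumptions to be checked against the dataset card.</p>
        <preformat>
# Illustrative loading of the WiC dataset used by Evalita-LLM.
# The dataset id is taken from the footnotes; the split and column names
# are assumptions and should be verified on the dataset card.
from datasets import load_dataset

wic = load_dataset("evalitahf/wic", split="test")
print(wic[0])   # e.g., target word, the two sentences, gold label (fields may differ)
        </preformat>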
        <p>Textual Entailment (TE). The Recognizing Textual Entailment (RTE) task was introduced at Evalita 2009<sup>20</sup>. It involves determining whether a hypothesis sentence is logically entailed by a given text sentence. The dataset consists of sentences sourced from Italian Wikipedia revision histories, labeled as entailed or not. The best model achieved 71% accuracy. We adapted this dataset<sup>21</sup> for our experiments.</p>
        <p>Sentiment Analysis (SA). The SENTIment POLarity Classification (SENTIPOLC) task was introduced at Evalita 2016<sup>22</sup>. It focuses on sentiment analysis of Italian tweets and includes three subtasks: polarity classification, subjectivity classification and irony detection. The best model achieved an F1-macro score of 66.38. Our study concentrates on polarity classification, which categorizes each tweet’s sentiment as positive, negative, neutral or mixed. We use this processed dataset<sup>23</sup>.</p>
        <p>Hate Speech (HS). The HaSpeeDe 2 challenge at Evalita 2020<sup>24</sup> focuses on detecting hateful content in Italian tweets and news headlines, targeting specific groups such as immigrants, Muslims, and Roma. Top-performing BERT-based models achieved an F1-macro score of 80.88 on Twitter data and 77.44 on headlines. We use the adapted dataset<sup>25</sup>, which combines both sources.</p>
        <p>Frequently Asked Questions &amp; Question Answering (FAQ). The QA4FAQ task, introduced at Evalita 2016<sup>26</sup>, focuses on retrieving the most relevant FAQ entry given a user query. Systems must identify the closest matching question from a database of FAQs and return its answer. We transformed the dataset<sup>27</sup> into a multiple-choice format with four candidate answers per query.</p>
        <p>Admission Tests (AT). The Admission Test task, introduced in [14], is not part of the Evalita campaign. It consists of answering multiple-choice questions from Italian medical specialty entrance exams (SSM), where each question has five options and only one correct answer. The questions cover a wide range of medical topics and often require complex reasoning beyond factual recall. We use this adapted dataset<sup>28</sup>.</p>
        <p>Lexical Substitution (LS). Task A of the Lexical Substitution challenge at Evalita 2009<sup>29</sup> focuses on identifying the most appropriate synonym for a target word given its context, without relying on predefined sense inventories. Systems are required to produce contextually relevant lemmas as substitutes. Evaluation is based on two metrics: Best, which scores the top candidate, and Out-of-Ten (oot), which considers the top 10 suggestions. The best system achieved an F1 score of 7.64 for Best and 38.82 for oot. In our experiments, we use the processed dataset<sup>30</sup> and follow the oot evaluation setting.</p>
        <p>Named Entity Recognition (NER). The Named Entity Recognition task at Evalita 2023<sup>31</sup> focuses on identifying and classifying person, organization, and location entities in Italian texts from multiple domains. The dataset, derived from the Kessler Italian Named-entities Dataset, includes documents from three sources: Wikinews, Literature, and Political Writings. The best model achieved an F1-macro score of 88%. We use this processed dataset<sup>32</sup> in our experiments.</p>
        <p>Relation Extraction (REL). The CLinkaRT task at Evalita 2023<sup>33</sup> addresses relation extraction in the clinical domain, focusing on linking laboratory results (RML) to their corresponding test events (EVENT) in Italian medical narratives [15]. Systems were evaluated using Precision, Recall, and F1 score, with the best model achieving an F1 of 62.99. We use the processed dataset<sup>34</sup>, where entity pairs are restricted to occur within sentence boundaries.</p>
        <p>Summarization (SUM). The summarization task, based on the Fanpage dataset [16], involves generating concise summaries of Italian news articles. The dataset includes news articles with titles, abstracts, and full texts across 9 categories. In the original study, mBART models achieved ROUGE-1: 38.91 and ROUGE-2: 21.38. For evaluation, we use a 10% subset of the original dataset<sup>35</sup>, from which 100 samples were randomly selected for testing.</p>
        <p>18. https://wic-ita.github.io/task
19. https://huggingface.co/datasets/evalitahf/wic
20. https://www.evalita.it/campaigns/evalita-2009/tasks/textual-entailment
21. https://huggingface.co/datasets/evalitahf/textual_entailment
22. https://www.evalita.it/campaigns/evalita-2016/tasks-challenge/sentipolc
23. https://huggingface.co/datasets/evalitahf/sentiment_analysis
24. http://www.di.unito.it/~tutreeb/haspeede-evalita20/index.html
25. https://huggingface.co/datasets/evalitahf/hatespeech_detection
26. https://www.evalita.it/campaigns/evalita-2016/tasks-challenge/qa4faq
27. https://huggingface.co/datasets/evalitahf/faq
28. https://huggingface.co/datasets/evalitahf/admission_test
29. https://www.evalita.it/2009/tasks/lexical
30. https://huggingface.co/datasets/evalitahf/lexical_substitution
31. https://nermud.fbk.eu
32. https://huggingface.co/datasets/evalitahf/entity_recognition
33. https://e3c.fbk.eu/clinkart
34. https://huggingface.co/datasets/evalitahf/relation_extraction
35. https://huggingface.co/datasets/evalitahf/summarization-fp</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Models’ Performance</title>
        <p>Table 2 summarizes the performance of 23 models under two different testing conditions: few-shot (FS) and zero-shot (ZS). In the FS setting, models are given a few examples to guide their responses, while in ZS they are asked to perform the tasks without prior examples. Each model’s performance was evaluated using the specific accuracy measure employed in the original task, and the results are combined into an average Combined Performance Score (AvgCPS) across all tasks. The best performing model in the FS setting is gemma-3-27b-it, achieving an AvgCPS of 57.42, while the lowest is Minerva-7B-base-v1.0 with 35.06. In ZS, scores range from 50.29 AvgCPS (gemma-3-27b-it) down to 30.23 (Volare).</p>
        <p>Table 2: Model performance in few-shot (FS) and zero-shot (ZS) settings, reported in terms of Avg. Combined Performance Score (AvgCPS). Models, sorted in descending order by FS AvgCPS: gemma-3-27b-it, Qwen2.5-14B-Instruct-1M, gemma-3-12b-it, gemma-2-9b-it, Qwen2.5-7B-Instruct, phi-4, Llama-3.1-SuperNova-Lite, granite-3.1-8b-instruct, Phi-3-medium-4k-instruct, Meta-Llama-3.1-8B-Instruct, Phi-3.5-mini-instruct, Llama-3-8b-Ita, LLaMAntino-3-ANITA-8B, maestrale-chat-v0.4-beta, aya-expanse-8b, Mistral-7B-Instruct-v0.3, gemma-3-4b-it, Llama-3-8B-4bit-UltraChat, Volare, occiglot-7b-it-en-instruct, Velvet-14B, Minerva-7B-instruct-v1.0, Minerva-7B-base-v1.0.</p>
        <p>Model results are also compared against established reference scores, which come from the best systems in previous Evalita shared tasks or original task publications. It is important to note that these reference scores were obtained using supervised approaches; that is, models were trained on the corresponding task-specific training data. In contrast, the models evaluated in this study were tested in zero-shot or few-shot configurations, without using any of the training data to fine-tune or train the models on the specific tasks. Despite this difference in setup, the results show that some tasks benefit substantially from the advances in LLMs: for example, Textual Entailment (TE) accuracy improves by over 22%, and Sentiment Analysis (SA) by nearly 22%. On the other hand, some tasks remain challenging: Named Entity Recognition (NER) shows a large accuracy drop of more than 53%, and Relation Extraction (RE) decreases by over 18%.</p>
        <p>Figures 2 and 3 show two important trends regarding model size and in-context learning ability. First, accuracy values tend to increase with model size, although the relationship is not strongly linear.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>In this section we analyze the results of the Evalita-LLM leaderboard from several perspectives, to better understand the strengths and limitations of current LLMs on Italian tasks.</p>
      <p>Model Size vs. Performance. Figure 2 shows a moderate positive correlation between the number of model parameters and accuracy. Specifically, the Pearson correlation coefficient is 0.4816 for the 5-shot setting and 0.4567 for the zero-shot setting. While larger models generally tend to achieve higher accuracy, the relationship is not strongly linear. This indicates that factors beyond model size, such as the model architecture, the quality of the training data and the quality of the instruction tuning, significantly influence performance.</p>
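      <p>For illustration, the reported coefficients can be reproduced from the leaderboard data with a standard statistics routine; the arrays below are placeholder values, not the actual leaderboard numbers.</p>
      <preformat>
# Illustrative computation of the size-vs-accuracy correlation.
# The values here are placeholders; the real inputs are the models'
# parameter counts and their AvgCPS scores from Table 2.
from scipy.stats import pearsonr

params_billions = [27, 14, 12, 9, 7, 8]                  # model sizes (example)
avg_cps_fs = [57.4, 55.0, 54.2, 52.1, 48.7, 47.9]        # FS AvgCPS (example)

r, p_value = pearsonr(params_billions, avg_cps_fs)
print(f"Pearson r = {r:.4f} (p = {p_value:.3f})")
      </preformat>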
      <p>Performance Evolution within a Model Family. We compared two large language models from the same family, Gemma-2 27B and Gemma-3 27B, in both ZS and FS configurations. Our goal was to see whether performance improves from one generation to the next and to identify which tasks benefit most from the newer model. In the FS setting, Gemma-3 shows the best overall performance, with the highest average CPS (57.42), which is 3.56 points higher than Gemma-2. In the ZS setting, however, Gemma-2 slightly outperforms Gemma-3 (50.60 vs. 49.89). Looking at individual tasks, Gemma-3 performs better than Gemma-2 in 9 out of 10 tasks in the FS setting, especially in Relation Extraction (+11.9), Lexical Substitution (+7.6) and Sentiment Analysis (+6.0). In the ZS configuration, Gemma-3 performs better on 6 out of 10 tasks, particularly in Lexical Substitution (+6.37) and Hate Speech Detection (+4.88). Gemma-2 outperforms Gemma-3 on 4 tasks; notably, Relation Extraction and Word in Context show the largest gaps in favor of Gemma-2 (+34.8 and +15, respectively). This result suggests that Gemma-3 is more effectively optimized for in-context learning and prompt-based fine-tuning.</p>
      <p>Model Specialization by Task. The results presented in Table 4 show that different models are better at different tasks. In fact, no single model achieves the best performance on all tasks, which means that performance crucially depends on the characteristics of the individual task. For example, Qwen2.5-14B-Instruct-1M is the best model on multiple-choice tasks such as Textual Entailment and Hate Speech Detection, while gemma-3-12b-it performs best on Sentiment Analysis and the Admission Test.</p>
      <p>Generative vs. Multiple-Choice Tasks. Generative tasks appear to be more challenging for large language models than multiple-choice tasks. Unlike the multiple-choice format, where the output space is constrained and the model only needs to select among predefined options, generative tasks require models not only to understand the content of the request, but also to produce structured outputs in specific formats, which then have to be correctly parsed by a scoring script. As an example, the formatting constraints in the Named Entity Recognition (NER) generative task pose significant challenges for LLMs, regardless of their ability to detect entities. When asked to output entities in the format Entity$Type, models often fail in the zero-shot setting, with low output rates and formatting errors (e.g., using commas instead of the dollar sign as separator). Models improved their performance with 5-shot prompting, mainly due to better adherence to the required output structure.</p>
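      <p>The sketch below illustrates how output in this format can be parsed and scored; it follows the Entity$Type convention described above, but it is not the benchmark’s actual post-processing or scoring code.</p>
      <preformat>
# Illustrative parser and F1 scorer for the generative NER output format
# described above (entities as "Entity$Type", separated by commas, with
# "&amp;&amp;NOENT&amp;&amp;" meaning no entities). Not the benchmark's actual scoring code.
def parse_entities(output):
    output = output.strip()
    if not output or output == "&amp;&amp;NOENT&amp;&amp;":
        return set()
    entities = set()
    for chunk in output.split(","):
        parts = chunk.strip().split("$")
        if len(parts) == 2:                       # malformed chunks are dropped
            entity, entity_type = parts
            entities.add((entity.strip(), entity_type.strip()))
    return entities

def entity_f1(predicted, gold):
    pred, ref = parse_entities(predicted), parse_entities(gold)
    if not pred and not ref:
        return 1.0
    true_pos = len(pred.intersection(ref))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)

print(entity_f1("Mario Rossi$PER, FBK$ORG",
                "Mario Rossi$PER, Trento$LOC, FBK$ORG"))
      </preformat>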
      <p>Additionally, evaluating generative outputs is difficult due to limitations in current metrics like BLEU and ROUGE, which focus on surface-level text overlap. Although advanced metrics like BERTScore and COMET consider context and meaning, they still cannot fully replicate human judgment. Combining multiple metrics might effectively mitigate these limitations by providing a more comprehensive assessment of task complexity from different perspectives.</p>
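      <p>As a sketch of such a metric combination, the snippet below scores a generated summary with both ROUGE (surface overlap) and BERTScore (semantic similarity), using the rouge_score and bert_score packages; the equal weighting is an arbitrary, purely illustrative choice.</p>
      <preformat>
# Illustrative combination of a surface-overlap metric (ROUGE) with a
# semantic one (BERTScore). The 50/50 weighting is an arbitrary example.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def combined_summary_score(prediction, reference):
    rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=False)
    r = rouge.score(reference, prediction)
    rouge_f = (r["rouge1"].fmeasure + r["rouge2"].fmeasure) / 2
    _, _, f1 = bert_score([prediction], [reference], lang="it")
    return 0.5 * rouge_f + 0.5 * float(f1[0])
      </preformat>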
      <p>To better understand how much harder generative tasks are for models, we compared their performance to reference scores from the Evalita benchmarking initiative (or from the original dataset authors when Evalita scores were unavailable). Results in Table 3 confirm that while models often outperform the reference baselines in multiple-choice tasks such as Textual Entailment (+22.04%) and Sentiment Analysis (+21.69%), they have some difficulties on generative tasks. For instance, model accuracy falls short in Named Entity Recognition (–53.68%) and Relation Extraction (–18.15%). It is important to note, however, that the reference baselines were obtained using supervised models trained on task-specific datasets, whereas the models evaluated in this study were tested in zero-shot or few-shot settings, without any task-specific fine-tuning. These results further demonstrate how effectively modern LLMs can generalize to new tasks.</p>
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion</title>
      <p>This study introduced Evalita-LLM, a comprehensive benchmark and leaderboard designed to evaluate LLMs on Italian language tasks. The benchmark and the evaluation metrics consider critical aspects of generative models (e.g., multiple prompting, post-processing of generative task outputs, etc.).</p>
      <p>Our findings show that few-shot settings generally outperform zero-shot settings, especially in generative tasks. This advantage is particularly noticeable in tasks such as Relation Extraction and Named Entity Recognition, where concrete examples help models produce correctly formatted outputs. We also found that mid-sized models benefit the most from few-shot learning. While there is a positive correlation between model size and accuracy, factors such as training data quality and instruction tuning play significant roles. Additionally, newer versions within the same model family tend to outperform their predecessors on many tasks, but not all.</p>
      <p>The publicly available Evalita-LLM leaderboard on Hugging Face can be used as a valuable resource for ongoing benchmarking and transparent comparison of emerging models on Italian tasks. The overall goal is to provide an evaluation tool that is easy to access, that gives a fair assessment of a model, and that can track differences in performance caused by different variables (a model’s size, a model’s version, and more).</p>
      <p>Limitations. The number of datasets included for each task of the Evalita-LLM benchmark is limited in order to allow reasonable running times. In fact, the goal is not to create a repository that gathers all Italian datasets, but rather to provide a tool for strong evaluation of models. The metrics used for each task are the ones proposed in the original challenges and papers, to allow for a direct comparison between systems. For this reason, we opted not to include more recent metrics such as BERTScore, which can be a useful addition in the future.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU. The work of Marco Madeddu and Viviana Patti is partially supported by the “HARMONIA” project - M4C2, I1.3 Partenariati Estesi - Cascade Call - FAIR - CUP C63C22000770006 - PE PE0000013 under the NextGenerationEU programme. We warmly thank Alessandro Ercolani and Samuele Colombo for their invaluable support and guidance in writing the code and implementing this leaderboard.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Prompt Examples for Evalita-LLM Tasks</title>
      <p>Examples of prompts, translated into English (the actual prompts are in Italian):</p>
      <p>• Sentiment Analysis: “The following tweet: ‘{{text}}’ expresses a sentiment that is” (answer choices: Positive, Negative, Neutral, Mixed).
• Sentiment Analysis, with task description: “You have to carry out a sentiment analysis task. The following tweet: ‘{{text}}’ expresses a sentiment that is” (answer choices: Positive, Negative, Neutral, Mixed).
• Summarization: “Summarize the following newspaper article: ‘source’ \n Summary:”
• Summarization, with task description: “You have to carry out an automatic synthesis task. Summarize the following newspaper article: ‘source’ \n Summary:”
• Named Entity Recognition, with output format: “Extract all entities of type PER (person), LOC (place), and ORG (organization) from the following text. Report each entity in the format: Entity$Type, separated by ‘,’. If there are no entities, respond with ‘&amp;&amp;NOENT&amp;&amp;’. \n Text: ‘text’ \n Entities:”
• Named Entity Recognition, with task description and output format: “You have to carry out a named entity recognition task. Extract all entities of type PER (person), LOC (place), and ORG (organization) from the following text. Report each entity in the format: Entity$Type, separated by ‘,’. If there are no entities, respond with ‘&amp;&amp;NOENT&amp;&amp;’. \n Text: ‘text’ \n Entities:”</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>