<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Leaderboard for Benchmarking LLMs on Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Madeddu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Michele Resta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Zanoli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Martin Cimmino</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Albano</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Viviana Patti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Domyn</institution>
          ,
          <addr-line>Via Principe Amedeo, 5, 20124 Milano</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Via Sommarive 18, 38123 Povo, Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>University of Torino, Computer Science Department</institution>
          ,
          <addr-line>Corso Svizzera 185, 10149 Torino</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>We present Evalita-LLM, a comprehensive benchmark and leaderboard designed to evaluate Large Language Models (LLMs) on Italian tasks. Evalita-LLM covers ten native Italian tasks, including both multiple-choice and generative formats, and enables fair and transparent comparisons by using multiple prompts per task, addressing LLMs' sensitivity to prompt phrasing. The leaderboard supports both zero-shot and few-shot evaluation settings and currently reports results for 23 open-source models. Our findings show consistent performance improvements with few-shot prompting and larger model sizes. Additionally, more recent versions of LLMs generally outperform their predecessors. However, no single model excels across all tasks, which highlights the task-dependent nature of LLM performance. Notably, generative tasks remain significantly more challenging than multiple-choice ones. Hosted on Hugging Face, the Evalita-LLM leaderboard offers a public and continuously updated platform for benchmarking and transparent evaluation of LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>LLMs</kwd>
        <kwd>Benchmarking</kwd>
        <kwd>Leaderboard</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Leaderboards have become essential tools for assessing performance in the rapidly evolving landscape of Large Language Models (LLMs), offering standardized comparisons across a large variety of tasks, such as language understanding, dialogue, reasoning and code generation. Among available leaderboards, the Hugging Face Open LLM Leaderboard<sup>1</sup> is a popular and widely used resource for researchers, particularly in the open-source community. Now in its second version, it introduces more challenging and reliable benchmarks, including MMLU-Pro, GPQA, MuSR, MATH, IFEval, and BBH. Other notable platforms, such as Scale SEAL<sup>2</sup>, Vellum.ai<sup>3</sup>, and LLM-Stats.com<sup>4</sup>, support evaluation efforts. In addition, open-source initiatives focused on human preference evaluation, like Chatbot Arena<sup>5</sup> and the Chatbot Arena LLM Leaderboard<sup>6</sup>, are playing a key role in advancing the benchmarking landscape.</p>
      <p>Although LLM benchmarks have driven significant progress, they currently show limitations that affect the fairness and completeness of the evaluation process. First, the focus on English makes them less useful for testing models meant to serve other languages, including Italian. This is particularly relevant because of the recent growth of LLMs with specific training on Italian, like for instance LLaMAntino [2], the Minerva family [3], Italia<sup>7</sup>, Velvet<sup>8</sup> and the recent model MIIA<sup>9</sup>. On the other side, current benchmarks for Italian, as for instance Ita-bench<sup>10</sup>, often rely on automatic translations of English datasets, which is not optimal, due to poor translation quality and cultural differences that make fair testing harder. We also want to mention the collaborative CALAMITA effort [4], which gathered a variety of different tasks based on native data from the community.</p>
      <p>A second issue in benchmarking LLMs is that most benchmarks are based on a single-prompt approach (i.e., one prompt is arbitrarily selected for each task). However, it is well known that LLMs are very sensitive to how prompts are phrased [5, 6, 7], and that even small changes in wording can lead to big differences in performance, making single-prompt evaluations less reliable and harder to compare. For example, IberBench [8], a benchmark designed for Iberian languages, employs a single-prompt evaluation methodology. While this simplifies the evaluation pipeline, the authors acknowledge that alternative prompts could lead to different performance outcomes.</p>
      <p>Third, the vast majority of current benchmarks rely almost exclusively on multiple-choice tasks, drastically limiting the capacity to test the generative abilities of LLMs, which have been mainly trained on open-text generation. Although the multiple-choice format simplifies scoring, it often requires artificial task reformulations that hide the model's natural ability to generate text. In contrast, generative tasks, although they better reflect real-world applications, pose challenges, including less reliable evaluation metrics and inconsistent output formatting.</p>
      <p>Figure 1: Evalita-LLM incremental validation methodology.</p>
      <p>1. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/
2. https://scale.com/leaderboard
3. https://www.vellum.ai/llm-leaderboard
4. https://llm-stats.com/
5. https://openlm.ai/chatbot-arena/
6. https://huggingface.co/spaces/lmarena-ai/chatbot-arena-leaderboard
7. https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1
8. https://huggingface.co/Almawave/Velvet-14B
9. https://huggingface.co/Fastweb/FastwebMIIA-7B
10. https://huggingface.co/collections/sapienzanlp/ita-bench-italian-benchmarks-for-llms-66337ca59e6df7d7d4933896</p>
      <p>To address the above-mentioned issues, we introduce Evalita-LLM<sup>11</sup>, a comprehensive benchmark with its associated leaderboard, specifically designed to evaluate LLMs on Italian tasks. The benchmark includes a diverse set of carefully validated tasks and uses multiple prompts per task to ensure more consistent and reliable evaluations. All tasks are originally written in Italian, avoiding issues related to translation quality or cultural mismatches. The benchmark combines both multiple-choice and generative tasks, offering a balanced and practical way to assess the full range of model abilities. Evalita-LLM is supported by a public leaderboard hosted on Hugging Face<sup>12</sup>, which allows fair comparisons between models and tasks and helps the community to better understand how Italian LLMs perform and can be improved. The results on the leaderboard confirm that few-shot in-context learning works better than using no examples (zero-shot) for most of the models. Results also confirm that bigger and newer models usually perform better, showing how fast LLMs are improving.</p>
      <p>11. https://github.com/EleutherAI/lm-evaluation-harness
12. https://huggingface.co/spaces/evalitahf/evalita_llm_leaderboard</p>
    </sec>
    <sec id="sec-2">
      <title>2. Benchmarking Methodology</title>
      <p>The Evalita-LLM benchmark is created using existing datasets, almost exclusively from the Evalita campaigns<sup>13</sup>, supported by the Italian Association for Computational Linguistics (AILC<sup>14</sup>). Over the past 15 years, Evalita has produced approximately 70 datasets covering various language tasks. Around 35 of these are freely available through the European Language Grid (ELG)<sup>15</sup>, thanks to the Evalita4ELG project [9] led by the University of Turin.</p>
      <p>We selected 15 native Italian datasets: half for multiple-choice tasks and half for open-ended ones. For each task, we created approximately 20 prompt candidates, adapted from similar tasks (often in English) and refined through several rounds of testing. The prompts were tested on various Italian LLMs using fixed evaluation metrics. During this process, prompts that resulted in weaker performance across the various models were discarded, and overly difficult tasks were also excluded.</p>
      <p>The Evalita-LLM benchmark was developed using the lm-evaluation-harness library<sup>16</sup> [10], which provides a unified interface for evaluating language models across a variety of tasks and formats. Since model performance can be sensitive to parameters, particularly temperature and maximum context length, the library allows users to adjust these settings to some extent. In our setup, we follow the library’s standard configuration to ensure consistency across evaluations. By default, the temperature is set to 0.0, resulting in deterministic (greedy) decoding, which favors reproducibility. To determine each model’s input capacity, the maximum context length (the number of tokens a model can process per input) is retrieved dynamically by inspecting the model’s configuration fields, such as n_positions, max_position_embeddings, or the tokenizer’s model_max_length.</p>
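      <p>As an illustration, the sketch below shows how such a fallback chain over configuration fields can look with the Hugging Face transformers API; the helper name, default value and example model are illustrative, and the actual harness logic may differ.</p>
      <preformat>
# Sketch: derive a model's maximum context length from its configuration.
# The field names mirror those mentioned above; actual lm-evaluation-harness
# behaviour may differ in details.
from transformers import AutoConfig, AutoTokenizer

def get_max_context_length(model_id, default=2048):
    config = AutoConfig.from_pretrained(model_id)
    for field in ("max_position_embeddings", "n_positions"):
        value = getattr(config, field, None)
        if value:
            return int(value)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    value = tokenizer.model_max_length
    # transformers reports a very large placeholder when no real limit is stored.
    if value and value != int(1e30):
        return int(value)
    return default

print(get_max_context_length("google/gemma-2-9b-it"))  # example model id
      </preformat>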
      <p>The benchmark construction followed three main steps:</p>
      <p>• Dataset selection: datasets were converted into Hugging Face (HF) format and uploaded.
• Task definition: creating prompts, choosing few-shot or zero-shot settings, formatting outputs, and setting up metrics. The tasks are defined for evaluation only and are not used for model training.
• Model evaluation: tasks are tested on Italian LLMs during development to check whether the prompts work well.</p>
      <p>Figure 1 shows how the benchmark was created step by step. At the end of the process, we selected ten tasks that cover different language types, text styles and real-world uses.</p>
      <sec id="sec-2-1">
        <title>2.1. Prompting Approach</title>
        <p>Prompt design is crucial, since LLMs are highly sensitive to minor wording changes [11, 12, 13, 5, 6]. To address this issue, Evalita-LLM combines three main strategies: setting general rules for prompt design, using a compositional method to build prompts, and applying multiple prompts per task to ensure robustness and reliability.</p>
        <p>13. https://www.evalita.it
14. https://www.ai-lc.it
15. https://live.european-language-grid.eu
16. https://github.com/EleutherAI/lm-evaluation-harness
17. https://huggingface.co/spaces/evalitahf/evalita_llm_leaderboard</p>
      </sec>
      <sec id="sec-2-1-1">
        <title>2.1.1. General Prompting Rules</title>
        <p>The following rules guide the construction of prompts, to ensure consistency, simplicity and alignment with the objectives of Evalita-LLM. The exact prompts used for each task are available on the leaderboard webpage<sup>17</sup>. Additional examples translated into English can be found in Appendix A.</p>
        <p>• Prompts are entirely in Italian, including output labels.
• We avoid assigning roles to the model (e.g., “You are an assistant...”).
• Prompts are short and simple to reduce bias.
• Each prompt specifies the type of input for the specific task (e.g., tweet, news, sentence).</p>
      </sec>
      <sec id="sec-2-1-2">
        <title>2.1.2. Compositional Prompting</title>
        <p>To ensure flexibility and systematic variation, we adopt a compositional approach, building each prompt from a combination of key elements (a sketch of this composition is shown below):</p>
        <p>• Core question or instruction (required for all prompts);
• High-level task description (optional);
• Answer options (optional, for multiple-choice tasks);
• Output format instructions (optional, for generative tasks).</p>
        <p>Keeping some components fixed reduces unnecessary prompt variations and simplifies evaluation. Around 20 templates were created for each task; after a testing phase, we kept 6 templates for multiple-choice tasks and 4 for generative tasks, due to the higher computational cost of generative evaluation.</p>
      </sec>
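      <p>To make the compositional scheme concrete, the sketch below assembles a prompt from the components listed above; the function and argument names are illustrative and not part of the Evalita-LLM codebase, and the example is translated into English (the actual prompts are in Italian).</p>
      <preformat>
# Illustrative sketch of compositional prompt construction.
# Component names (task_description, question, options, output_format)
# mirror the elements listed above; they are not the actual template fields.
def build_prompt(question, task_description=None, options=None, output_format=None):
    parts = []
    if task_description:
        parts.append(task_description)   # optional high-level task description
    parts.append(question)               # mandatory core question or request
    if options:
        labels = ["A", "B", "C", "D", "E"]
        listed = [f"{label}: {option}" for label, option in zip(labels, options)]
        parts.append("\n".join(listed))  # optional answer options
    if output_format:
        parts.append(output_format)      # optional output format instructions
    return "\n".join(parts)

# Example in the style of "Prompt 4" (task description + question + answers).
print(build_prompt(
    question="The following tweet: '{{text}}' expresses a sentiment that is:",
    task_description="You have to carry out a sentiment analysis task.",
    options=["Positive", "Negative", "Neutral", "Mixed"],
))
      </preformat>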
      <sec id="sec-2-1-3">
        <title>2.1.3. Multiple Prompts for Multiple-choice Tasks</title>
        <p>For multiple-choice tasks, we use six distinct prompt templates, each adapted to the specific task. The templates systematically vary the inclusion of a task description, the core question and the answer options:</p>
        <p>• Prompt 1: Question. A base question that the model must answer, following the general prompt guidelines.
• Prompt 2: Task description + Question. A brief task description is prepended to the question.
• Prompt 3: Question + Answer. The possible answers are appended to the question.
• Prompt 4: Task description + Question + Answer. This combines both the task description and the answer options with the question.
• Prompt 5: Affirmative. A simple affirmative statement that implicitly asks for an answer, without listing options.
• Prompt 6: Task description + Affirmative. The task description is prepended to the affirmative statement.</p>
        <p>It has to be noted that in multiple-choice prompts, the answer options can either be explicitly embedded in the prompt or provided as options to the evaluation process.</p>
        <p>To minimize bias in model evaluation, attention was given to the order of answer choices in multiple-choice prompts. Only Prompt 3 and Prompt 4 are susceptible to such bias, as they explicitly list options (A, B, C, etc.). For tasks with fixed answer sets, like Textual Entailment, options were kept in a natural order (e.g., A: True, B: False) to reflect typical human presentation. In contrast, for tasks with more open-ended answers, such as Admission Tests, the answer choices were shuffled during dataset creation to reduce positional bias.</p>
      </sec>
      <sec id="sec-2-1-4">
        <title>2.1.4. Multiple Prompts for Generative Tasks</title>
        <p>Generative prompts require the model to produce textual output, which is then evaluated for correctness using appropriate metrics. We adopt a compositional approach involving three key elements: (i) a mandatory request expressing the task; (ii) an optional brief task description placed at the beginning; (iii) optional output format instructions at the end.</p>
        <p>Because generative tasks are computationally more expensive than multiple-choice tasks, we created four prompt types, which have been tested pairwise in our tasks. Tasks that need structured outputs get clear formatting instructions to help with parsing and scoring, while others allow freer text generation. The four prompt types are:</p>
        <p>• Prompt 7: Request. A base generative request adhering to the general prompting guidelines.
• Prompt 8: Task description + Request. Adds a short task description before the request.
• Prompt 9: Request + Output format. Adds explicit instructions on the required output format.
• Prompt 10: Task description + Request + Output format. Combines the description, request, and output format instructions.</p>
        <p>This modular design balances prompt diversity and evaluation efficiency across generative tasks.</p>
      </sec>
      <sec id="sec-2-1-5">
        <title>2.1.5. Few-Shot Prompting</title>
        <p>Few-shot prompting helps to improve performance by adding a few examples of inputs and their corresponding correct responses within the prompt. For Evalita-LLM, we used a 5-shot learning method. Except for Relation Extraction (REL) and Named Entity Recognition (NER), the five examples were automatically selected from the training sets using lm-evaluation-harness. For REL and NER, examples were manually chosen to ensure full label coverage and output diversity, as many sentences for these two tasks do not contain any relevant entity or relation.</p>
      </sec>
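      <p>Since Evalita-LLM is built on top of lm-evaluation-harness, a few-shot evaluation can be launched through the library’s Python entry point. The call below is a sketch for recent versions of the library: the task name evalita_ner and the model identifier are placeholders, not the benchmark’s actual registered names.</p>
      <preformat>
# Illustrative 5-shot evaluation with lm-evaluation-harness.
# "evalita_ner" is a placeholder task name; the actual Evalita-LLM task
# identifiers are defined in the benchmark's task configuration files.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",                                    # Hugging Face backend
    model_args="pretrained=google/gemma-2-9b-it",  # example model
    tasks=["evalita_ner"],
    num_fewshot=5,                                 # 5-shot, as used in Evalita-LLM
    batch_size=8,
)
print(results["results"])
      </preformat>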
      <sec id="sec-2-2">
        <title>2.2. Evaluation Metrics</title>
        <p>To select effective prompts for each task in Evalita-LLM, we adopt four prompt-scoring metrics inspired by [5]: maximum, average, minimum, and combined performance. These are used both to evaluate models over prompts and prompts over models.</p>
        <p>Let m be an LLM, T = {(x, y)} a task, P a set of prompts for T, M a set of models, and Perf(m, T, p) ∈ [0, 1] the performance of model m on task T with prompt p.</p>
        <p>Minimum Performance. Lowest performance of a prompt across all models:
Min(p, T, M) = min_{m ∈ M} Perf(m, T, p)   (1)</p>
        <p>Maximum Performance. Best performance of a model across prompts, and best performance of a prompt across models:
Max(m, T, P) = max_{p ∈ P} Perf(m, T, p)   (2)
Max(p, T, M) = max_{m ∈ M} Perf(m, T, p)   (3)</p>
        <p>Average Performance. Mean model performance over prompts, and mean prompt performance over models:
Avg(m, T, P) = (1/|P|) Σ_{p ∈ P} Perf(m, T, p)   (4)
Avg(p, T, M) = (1/|M|) Σ_{m ∈ M} Perf(m, T, p)   (5)</p>
        <p>Combined Performance Score (CPS). This score integrates both stability (robustness) and best observed performance. First, saturation is defined as:
Sat(m, T, P) = 1 − (Max(m, T, P) − Avg(m, T, P))   (6)
Sat(p, T, M) = 1 − (Max(p, T, M) − Avg(p, T, M))   (7)
Then, the CPS for models and prompts is:
CPS(m, T, P) = Sat(m, T, P) · Max(m, T, P)   (8)
CPS(p, T, M) = Sat(p, T, M) · Max(p, T, M)   (9)</p>
        <p>These metrics filter out unstable or poor-performing prompts and assist in choosing prompt sets that balance reliability and top performance across language models.</p>
      </sec>
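      <p>As a worked illustration of the metrics above, the following sketch computes the maximum, average, saturation and CPS of a single model over its per-prompt scores; the function name and example scores are illustrative, not taken from the leaderboard.</p>
      <preformat>
# Illustrative computation of the prompt-aggregation metrics defined above
# for a single model on a single task. Scores are per-prompt performances
# in [0, 1]; the function name is not part of the Evalita-LLM codebase.
def prompt_metrics(per_prompt_scores):
    max_p = max(per_prompt_scores)                             # Eq. (2)
    avg_p = sum(per_prompt_scores) / len(per_prompt_scores)    # Eq. (4)
    sat = 1.0 - (max_p - avg_p)                                # Eq. (6)
    cps = sat * max_p                                          # Eq. (8)
    return {"max": max_p, "avg": avg_p, "sat": sat, "cps": cps}

# Example: a model scoring differently under six multiple-choice prompts.
print(prompt_metrics([0.62, 0.58, 0.66, 0.61, 0.55, 0.64]))
      </preformat>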
    </sec>
    <sec id="sec-3">
      <title>3. Benchmark Leaderboard</title>
      <p>The Evalita-LLM leaderboard is a comprehensive platform that evaluates LLMs on 10 Italian-language tasks, both multiple-choice and generative. The leaderboard displays detailed metrics for each model and task, such as the average performance over all prompts, the best prompt performance, and a combined score balancing accuracy and prompt consistency. Tasks range from multiple-choice questions, like Hate Speech and Sentiment Analysis, to generative requests, including Named Entity Recognition and Summarization. For each task, results are reported per prompt and combined for the overall ranking. Users can filter and compare models by attributes such as the few-shot learning setup. Currently, the leaderboard presents evaluation results for 23 open-source models in both zero-shot and few-shot settings, with new models being added as they become publicly available on the Hugging Face platform.</p>
      <p>To optimize leaderboard management, models are indexed by their Hugging Face name. Only new, previously unlisted models are considered for evaluation, while revisions of already indexed models are skipped to save computational resources. Likewise, models are not re-evaluated on updated datasets, ensuring that resources are used for assessing new models.</p>
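      <p>A minimal sketch of this policy is shown below; the index file name and helper function are hypothetical and only illustrate skipping models that are already listed.</p>
      <preformat>
# Hypothetical sketch of the "evaluate only unlisted models" policy.
# The index file name and its structure are assumptions, not the actual
# leaderboard implementation.
import json

def models_to_evaluate(candidate_ids, index_path="evaluated_models.json"):
    with open(index_path) as f:
        already_indexed = set(json.load(f))    # Hugging Face model names
    # Revisions of indexed models and re-runs on updated datasets are skipped.
    return [m for m in candidate_ids if m not in already_indexed]

print(models_to_evaluate(["google/gemma-3-27b-it", "some-org/new-model"]))
      </preformat>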
      <sec id="sec-3-1">
        <title>3.1. Evalita-LLM Tasks</title>
        <p>Word in Context (WiC). The Word in Context (WiC) task, proposed at Evalita 2023<sup>18</sup>, focuses on word sense disambiguation in context. It consists of two sub-tasks: binary classification and ranking. For LLM evaluation, we focus on the binary classification task, aimed at determining whether a target word w has the same meaning in two sentences, s1 and s2. The best-performing system in the original challenge achieved an F1-macro score of 85.00. In our experiments, the following dataset<sup>19</sup> was used.</p>
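        <p>For illustration, the task datasets are hosted on the Hugging Face Hub and can be loaded with the datasets library; the dataset identifier comes from the task footnotes, while the split name and record layout below are assumptions to be checked against the dataset card.</p>
        <preformat>
# Illustrative loading of the WiC dataset used by Evalita-LLM.
# The dataset id is taken from the footnotes; the split and column names
# are assumptions and should be verified on the dataset card.
from datasets import load_dataset

wic = load_dataset("evalitahf/wic", split="test")
print(wic[0])   # e.g., target word, the two sentences, gold label (fields may differ)
        </preformat>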
        <p>Textual Entailment (TE). The Recognizing Textual Entailment (RTE) task was introduced at Evalita 2009<sup>20</sup>. It involves determining whether a hypothesis sentence is logically entailed by a given text sentence. The dataset consists of sentences sourced from Italian Wikipedia revision histories, labeled as entailed or not. The best model achieved 71% accuracy. We adapted this dataset<sup>21</sup> for our experiments.</p>
        <p>Sentiment Analysis (SA). The SENTIment POLarity Classification (SENTIPOLC) task was introduced at Evalita 2016<sup>22</sup>. It focuses on sentiment analysis of Italian tweets and includes three subtasks: polarity classification, subjectivity classification and irony detection. The best model achieved an F1-macro score of 66.38. Our study concentrates on polarity classification, which categorizes each tweet’s sentiment as positive, negative, neutral or mixed. We use this processed dataset<sup>23</sup>.</p>
        <p>Hate Speech (HS). The HaSpeeDe 2 challenge at Evalita 2020<sup>24</sup> focuses on detecting hateful content in Italian tweets and news headlines, targeting specific groups such as immigrants, Muslims, and Roma. Top-performing BERT-based models achieved an F1-macro score of 80.88 on Twitter data and 77.44 on headlines. We use the adapted dataset<sup>25</sup>, which combines both sources.</p>
        <p>Frequently Asked Questions &amp; Question Answering (FAQ). The QA4FAQ task, introduced at Evalita 2016<sup>26</sup>, focuses on retrieving the most relevant FAQ entry given a user query. Systems must identify the closest matching question from a database of FAQs and return its answer. We transformed the dataset<sup>27</sup> into a multiple-choice format with four candidate answers per query.</p>
        <p>Admission Tests (AT). The Admission Test task, introduced in [14], is not part of the Evalita campaign. It consists of answering multiple-choice questions from Italian medical specialty entrance exams (SSM), where each question has five options and only one correct answer. The questions cover a wide range of medical topics and often require complex reasoning beyond factual recall. We use this adapted dataset<sup>28</sup>.</p>
        <p>Lexical Substitution (LS). Task A of the Lexical Substitution challenge at Evalita 2009<sup>29</sup> focuses on identifying the most appropriate synonym for a target word given its context, without relying on predefined sense inventories. Systems are required to produce contextually relevant lemmas as substitutes. Evaluation is based on two metrics: Best, which scores the top candidate, and Out-of-Ten (oot), which considers the top 10 suggestions. The best system achieved an F1 score of 7.64 for Best and 38.82 for oot. In our experiments, we use the processed dataset<sup>30</sup> and follow the oot evaluation setting.</p>
        <p>Named Entity Recognition (NER). The Named Entity Recognition task at Evalita 2023<sup>31</sup> focuses on identifying and classifying person, organization, and location entities in Italian texts from multiple domains. The dataset, derived from the Kessler Italian Named-entities Dataset, includes documents from three sources: Wikinews, Literature, and Political Writings. The best model achieved an F1-macro score of 88%. We use this processed dataset<sup>32</sup> in our experiments.</p>
        <p>Relation Extraction (REL). The CLinkaRT task at Evalita 2023<sup>33</sup> addresses relation extraction in the clinical domain, focusing on linking laboratory results (RML) to their corresponding test events (EVENT) in Italian medical narratives [15]. Systems were evaluated using Precision, Recall, and F1 score, with the best model achieving an F1 of 62.99. We use the processed dataset<sup>34</sup>, where entity pairs are restricted to occur within sentence boundaries.</p>
        <p>Summarization (SUM). The summarization task, based on the Fanpage dataset [16], involves generating concise summaries of Italian news articles. The dataset includes news articles with titles, abstracts, and full texts across 9 categories. In the original study, mBART models achieved ROUGE-1: 38.91 and ROUGE-2: 21.38. For evaluation, we use a 10% subset of the original dataset<sup>35</sup>, from which 100 samples were randomly selected for testing.</p>
        <p>18. https://wic-ita.github.io/task
19. https://huggingface.co/datasets/evalitahf/wic
20. https://www.evalita.it/campaigns/evalita-2009/tasks/textual-entailment
21. https://huggingface.co/datasets/evalitahf/textual_entailment
22. https://www.evalita.it/campaigns/evalita-2016/tasks-challenge/sentipolc
23. https://huggingface.co/datasets/evalitahf/sentiment_analysis
24. http://www.di.unito.it/~tutreeb/haspeede-evalita20/index.html
25. https://huggingface.co/datasets/evalitahf/hatespeech_detection
26. https://www.evalita.it/campaigns/evalita-2016/tasks-challenge/qa4faq
27. https://huggingface.co/datasets/evalitahf/faq
28. https://huggingface.co/datasets/evalitahf/admission_test
29. https://www.evalita.it/2009/tasks/lexical
30. https://huggingface.co/datasets/evalitahf/lexical_substitution
31. https://nermud.fbk.eu
32. https://huggingface.co/datasets/evalitahf/entity_recognition
33. https://e3c.fbk.eu/clinkart
34. https://huggingface.co/datasets/evalitahf/relation_extraction
35. https://huggingface.co/datasets/evalitahf/summarization-fp</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Models’ Performance</title>
        <p>Table 2 summarizes the performance of 23 models under two different testing conditions: few-shot (FS) and zero-shot (ZS). In the FS setting, models are given a few examples to guide their responses, while in ZS they are asked to perform the tasks without prior examples. Each model’s performance was evaluated using the specific accuracy measure employed in the original task, and the results are combined into an average Combined Performance Score (AvgCPS) across all tasks. The best performing model in the FS setting is gemma-3-27b-it, achieving an AvgCPS of 57.42, while the lowest is Minerva-7B-base-v1.0 with 35.06. In ZS, scores range from 50.29 AvgCPS (gemma-3-27b-it) down to 30.23 (Volare).</p>
        <p>Table 2: Model performance in few-shot (FS) and zero-shot (ZS) settings, reported in terms of Avg. Combined Performance Score (AvgCPS). Models, sorted in descending order by FS AvgCPS: gemma-3-27b-it, Qwen2.5-14B-Instruct-1M, gemma-3-12b-it, gemma-2-9b-it, Qwen2.5-7B-Instruct, phi-4, Llama-3.1-SuperNova-Lite, granite-3.1-8b-instruct, Phi-3-medium-4k-instruct, Meta-Llama-3.1-8B-Instruct, Phi-3.5-mini-instruct, Llama-3-8b-Ita, LLaMAntino-3-ANITA-8B, maestrale-chat-v0.4-beta, aya-expanse-8b, Mistral-7B-Instruct-v0.3, gemma-3-4b-it, Llama-3-8B-4bit-UltraChat, Volare, occiglot-7b-it-en-instruct, Velvet-14B, Minerva-7B-instruct-v1.0, Minerva-7B-base-v1.0.</p>
        <p>Model results are also compared against established reference scores, which come from the best systems in previous Evalita shared tasks or original task publications. It is important to note that these reference scores were obtained using supervised approaches; that is, models were trained on the corresponding task-specific training data. In contrast, the models evaluated in this study were tested in zero-shot or few-shot configurations, without using any of the training data to fine-tune or train the models on the specific tasks. Despite this difference in setup, the results show that some tasks benefit substantially from the advances in LLMs: for example, Textual Entailment (TE) accuracy improves by over 22%, and Sentiment Analysis (SA) by nearly 22%. On the other hand, some tasks remain challenging: Named Entity Recognition (NER) shows a large accuracy drop of more than 53%, and Relation Extraction (RE) decreases by over 18%.</p>
        <p>Figures 2 and 3 show two important trends regarding model size and in-context learning ability. First, accuracy values tend to increase with model size, although the relationship is not strongly linear.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>In this section we analyze the results of the Evalita-LLM leaderboard from several perspectives, to better understand the strengths and limitations of current LLMs on Italian tasks.</p>
      <p>Model Size vs. Performance. Figure 2 shows a moderate positive correlation between the number of model parameters and accuracy. Specifically, the Pearson correlation coefficient is 0.4816 for the 5-shot setting and 0.4567 for the zero-shot setting. While larger models generally tend to achieve higher accuracy, the relationship is not strongly linear. This indicates that factors beyond model size, such as the model architecture, the quality of the training data and the quality of the instruction tuning, significantly influence performance.</p>
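      <p>For illustration, the reported coefficients can be reproduced from the leaderboard data with a standard statistics routine; the arrays below are placeholder values, not the actual leaderboard numbers.</p>
      <preformat>
# Illustrative computation of the size-vs-accuracy correlation.
# The values here are placeholders; the real inputs are the models'
# parameter counts and their AvgCPS scores from Table 2.
from scipy.stats import pearsonr

params_billions = [27, 14, 12, 9, 7, 8]                  # model sizes (example)
avg_cps_fs = [57.4, 55.0, 54.2, 52.1, 48.7, 47.9]        # FS AvgCPS (example)

r, p_value = pearsonr(params_billions, avg_cps_fs)
print(f"Pearson r = {r:.4f} (p = {p_value:.3f})")
      </preformat>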
      <p>Performance Evolution within a Model Family. We compared two large language models from the same family, Gemma-2 27B and Gemma-3 27B, in both ZS and FS configurations. Our goal was to see whether performance improves from one generation to the next and to identify which tasks benefit most from the newer model. In the FS setting, Gemma-3 shows the best overall performance, with the highest average CPS (57.42), which is 3.56 points higher than Gemma-2. In the ZS setting, however, Gemma-2 slightly outperforms Gemma-3 (50.60 vs. 49.89). Looking at individual tasks, Gemma-3 performs better than Gemma-2 in 9 out of 10 tasks in the FS setting, especially in Relation Extraction (+11.9), Lexical Substitution (+7.6) and Sentiment Analysis (+6.0). In the ZS configuration, Gemma-3 performs better on 6 out of 10 tasks, particularly in Lexical Substitution (+6.37) and Hate Speech Detection (+4.88). Gemma-2 outperforms Gemma-3 on 4 tasks; notably, Relation Extraction and Word in Context show the largest gaps in favor of Gemma-2 (+34.8 and +15, respectively). This result suggests that Gemma-3 is more effectively optimized for in-context learning and prompt-based fine-tuning.</p>
      <p>Model Specialization by Task. The results presented in Table 4 show that different models are better at different tasks. In fact, no single model achieves the best performance on all tasks, which means that performance crucially depends on the characteristics of the individual task. For example, Qwen2.5-14B-Instruct-1M is the best model on multiple-choice tasks such as Textual Entailment and Hate Speech Detection, while gemma-3-12b-it performs best on Sentiment Analysis and the Admission Test.</p>
      <p>Generative vs. Multiple-Choice Tasks. Generative tasks appear to be more challenging for large language models than multiple-choice tasks. Unlike the multiple-choice format, where the output space is constrained and the model only needs to select among predefined options, generative tasks require models not only to understand the content of the request, but also to produce structured outputs in specific formats, which then have to be correctly parsed by a scoring script. As an example, the formatting constraints in the Named Entity Recognition (NER) generative task pose significant challenges for LLMs, regardless of their ability to detect entities. When asked to output entities in the format Entity$Type, models often fail in the zero-shot setting, with low output rates and formatting errors (e.g., using commas instead of the dollar sign as separator). Models improved their performance with 5-shot prompting, mainly due to better adherence to the required output structure.</p>
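      <p>The sketch below illustrates how output in this format can be parsed and scored; it follows the Entity$Type convention described above, but it is not the benchmark’s actual post-processing or scoring code.</p>
      <preformat>
# Illustrative parser and F1 scorer for the generative NER output format
# described above (entities as "Entity$Type", separated by commas, with
# "&amp;&amp;NOENT&amp;&amp;" meaning no entities). Not the benchmark's actual scoring code.
def parse_entities(output):
    output = output.strip()
    if not output or output == "&amp;&amp;NOENT&amp;&amp;":
        return set()
    entities = set()
    for chunk in output.split(","):
        parts = chunk.strip().split("$")
        if len(parts) == 2:                       # malformed chunks are dropped
            entity, entity_type = parts
            entities.add((entity.strip(), entity_type.strip()))
    return entities

def entity_f1(predicted, gold):
    pred, ref = parse_entities(predicted), parse_entities(gold)
    if not pred and not ref:
        return 1.0
    true_pos = len(pred.intersection(ref))
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)

print(entity_f1("Mario Rossi$PER, FBK$ORG",
                "Mario Rossi$PER, Trento$LOC, FBK$ORG"))
      </preformat>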
      <p>Additionally, evaluating generative outputs is difficult due to limitations in current metrics like BLEU and ROUGE, which focus on surface-level text overlap. Although advanced metrics like BERTScore and COMET consider context and meaning, they still cannot fully replicate human judgment. Combining multiple metrics might effectively mitigate these limitations by providing a more comprehensive assessment of task complexity from different perspectives.</p>
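      <p>As a sketch of such a metric combination, the snippet below scores a generated summary with both ROUGE (surface overlap) and BERTScore (semantic similarity), using the rouge_score and bert_score packages; the equal weighting is an arbitrary, purely illustrative choice.</p>
      <preformat>
# Illustrative combination of a surface-overlap metric (ROUGE) with a
# semantic one (BERTScore). The 50/50 weighting is an arbitrary example.
from rouge_score import rouge_scorer
from bert_score import score as bert_score

def combined_summary_score(prediction, reference):
    rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2"], use_stemmer=False)
    r = rouge.score(reference, prediction)
    rouge_f = (r["rouge1"].fmeasure + r["rouge2"].fmeasure) / 2
    _, _, f1 = bert_score([prediction], [reference], lang="it")
    return 0.5 * rouge_f + 0.5 * float(f1[0])
      </preformat>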
      <p>To better understand how much harder generative tasks are for models, we compared their performance to reference scores from the Evalita benchmarking initiative (or from the original dataset authors when Evalita scores were unavailable). Results in Table 3 confirm that while models often outperform the reference baselines in multiple-choice tasks such as Textual Entailment (+22.04%) and Sentiment Analysis (+21.69%), they have some difficulties on generative tasks. For instance, model accuracy falls short in Named Entity Recognition (–53.68%) and Relation Extraction (–18.15%). It is important to note, however, that the reference baselines were obtained using supervised models trained on task-specific datasets, whereas the models evaluated in this study were tested in zero-shot or few-shot settings, without any task-specific fine-tuning. These results further demonstrate how effectively modern LLMs can generalize to new tasks.</p>
    </sec>
    <sec id="sec-conclusion">
      <title>5. Conclusion</title>
      <p>This study introduced Evalita-LLM, a comprehensive benchmark and leaderboard designed to evaluate LLMs on Italian language tasks. The benchmark and the evaluation metrics consider critical aspects of generative models (e.g., multiple prompting, post-processing of generative task outputs, etc.).</p>
      <p>Our findings show that few-shot settings generally outperform zero-shot settings, especially in generative tasks. This advantage is particularly noticeable in tasks such as Relation Extraction and Named Entity Recognition, where concrete examples help models produce correctly formatted outputs. We also found that mid-sized models benefit the most from few-shot learning. While there is a positive correlation between model size and accuracy, factors such as training data quality and instruction tuning play significant roles. Additionally, newer versions within the same model family tend to outperform their predecessors on many tasks, but not all.</p>
      <p>The publicly available Evalita-LLM leaderboard on Hugging Face can be used as a valuable resource for ongoing benchmarking and transparent comparison of emerging models on Italian tasks. The overall goal is to provide an evaluation tool that is easy to access, that gives a fair assessment of a model, and that can track differences in performance caused by different variables (a model’s size, a model’s version, and more).</p>
      <p>Limitations. The number of datasets included for each task of the Evalita-LLM benchmark is limited in order to allow reasonable running times. In fact, the goal is not to create a repository that gathers all Italian datasets, but rather to provide a tool for strong evaluation of models. The metrics used for each task are the ones proposed in the original challenges and papers, to allow for a direct comparison between systems. For this reason, we opted not to include more recent metrics such as BERTScore, which can be a useful addition in the future.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU. The work of Marco Madeddu and Viviana Patti is partially supported by the “HARMONIA” project - M4C2, I1.3 Partenariati Estesi - Cascade Call - FAIR - CUP C63C22000770006 - PE PE0000013 under the NextGenerationEU programme. We warmly thank Alessandro Ercolani and Samuele Colombo for their invaluable support and guidance in writing the code and implementing this leaderboard.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Prompt Examples for Evalita-LLM Tasks</title>
      <p>Examples of prompts, translated into English (the actual prompts are in Italian):</p>
      <p>• Sentiment Analysis: “The following tweet: ‘{{text}}’ expresses a sentiment that is” (answer choices: Positive, Negative, Neutral, Mixed).
• Sentiment Analysis, with task description: “You have to carry out a sentiment analysis task. The following tweet: ‘{{text}}’ expresses a sentiment that is” (answer choices: Positive, Negative, Neutral, Mixed).
• Summarization: “Summarize the following newspaper article: ‘source’ \n Summary:”
• Summarization, with task description: “You have to carry out an automatic synthesis task. Summarize the following newspaper article: ‘source’ \n Summary:”
• Named Entity Recognition, with output format: “Extract all entities of type PER (person), LOC (place), and ORG (organization) from the following text. Report each entity in the format: Entity$Type, separated by ‘,’. If there are no entities, respond with ‘&amp;&amp;NOENT&amp;&amp;’. \n Text: ‘text’ \n Entities:”
• Named Entity Recognition, with task description and output format: “You have to carry out a named entity recognition task. Extract all entities of type PER (person), LOC (place), and ORG (organization) from the following text. Report each entity in the format: Entity$Type, separated by ‘,’. If there are no entities, respond with ‘&amp;&amp;NOENT&amp;&amp;’. \n Text: ‘text’ \n Entities:”</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) did not use any generative AI tools or services.</p>
    </sec>
  </body>
  <back>
    <ref-list />
  </back>
</article>