<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ITA-Bench: Towards a More Comprehensive Evaluation for Italian LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <email>moroni@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Conia</string-name>
          <email>conia@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Martelli</string-name>
          <email>martelli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <email>navigli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sapienza NLP Group, Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent Large Language Models (LLMs) have shown impressive performance in addressing complex aspects of human language. These models have also demonstrated significant capabilities in processing and generating Italian text, achieving state-of-the-art results on current benchmarks for the Italian language. However, the number and quality of such benchmarks is still insufficient. A case in point is the “Open Ita LLM Leaderboard”, which only supports three benchmarks, despite being one of the most popular evaluation suites for the evaluation of Italian-language LLMs. In this paper, we analyze the current limitations of existing evaluation suites and propose two ways of addressing this gap: i) a new suite of automatically-translated benchmarks, drawn from the most popular English benchmarks; and ii) the adaptation of existing manual datasets so that they can be used to complement the evaluation of Italian LLMs. We discuss the pros and cons of both approaches, releasing our data to foster further research on the evaluation of Italian-language LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Italian Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have been
showing impressive results on an increasing range of
standard benchmarks, thanks in particular to their
reasoning and in-context-learning capabilities [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The
current trend points towards increasingly larger
models trained on massive amounts of data [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
However, despite these advancements, there remains a
significant gap in the availability of high-quality
benchmarks for languages other than English, including
Italian, which is often considered too optimistically as a
high-resource language. Benchmarks are essential for
measuring progress in NLP, providing a standardized
way to evaluate and compare models, and this is now
especially important for Italian given the growing amount
of language-specific models that are being developed for
the language [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ]. High-quality benchmarks
must be well-crafted to ensure they accurately reflect the
complexities of the language and the specific challenges
it presents.
      </p>
      <p>
        As of today, most of the existing Italian benchmarks
are translations of English datasets, which may not fully
capture the nuances and unique characteristics of the
Italian language. Nevertheless, the ability to automatically
translate English benchmarks into Italian is valuable and
enticing for two main reasons. First, it provides a way [...]
In addition to the translations of existing datasets from
English and the creation of brand-new datasets in Italian,
there is the option of adapting existing Italian datasets that
were originally created for a different purpose, to measure
the capabilities of modern LLMs, even though such
repurposed datasets may not allow the capabilities of modern
LLMs to be fully analyzed. This direction has gained
traction over the past few months, with efforts that focus on
repurposing Italian tests (usually designed for humans) to
evaluate LLMs instead [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>In this paper, we follow both directions and introduce
ITA-Bench, a more comprehensive benchmark suite for
the evaluation of Italian-language LLMs. First, ITA-Bench
proposes a new extended suite of benchmarks created
by automatically translating the most popular English
benchmarks into Italian. Second, ITA-Bench includes
existing manually curated datasets, adapted to enhance the
evaluation framework for Italian LLMs. These two
complementary approaches aim to bridge the evaluation gap
and provide a more thorough understanding of the
capabilities of Italian-language LLMs. With ITA-Bench, we
hope to foster further development and refinement of
evaluation techniques for Italian LLMs, ultimately
contributing to the broader field of multilingual NLP.
ITA-Bench is available at https://github.com/sapienzanlp/ita-bench.</p>
      <p>2. ITA-Bench: a New Evaluation Suite for Italian LLMs</p>
      <p>In this section, we introduce our methodology for the
creation of ITA-Bench, a more comprehensive evaluation
suite for Italian LLMs. Our objective is to focus on the
Italian language and, more specifically, to create a
benchmark suite that is able to test a wide variety of aspects
of LLMs that “generate” Italian text. To accomplish this
objective we focus on two distinct directions: i) translating
existing English benchmarks that are currently used to
evaluate the capabilities of state-of-the-art LLMs in
English, and ii) adapting existing Italian benchmarks,
drawing from popular repositories, conferences, shared tasks,
and community initiatives, such as the several EVALITA
editions1 and SemEval tasks.2 In the case of adaptation
of existing datasets, most of the work consists in adapting
the scope of the tasks, i.e., since many of these tasks were
not designed to evaluate LLMs, the core of the work lies
in reframing the problem in a way that a prompt can be
used to test the capability of a particular LLM to solve a
specific task. Table 1 reports the overall statistics of the
datasets that we consider for our ITA-Bench suite.</p>
      <p>
        The most popular and widely-used evaluation suite for
Italian produced via translation is perhaps the “Open Ita
LLM Leaderboard”. This is a collection of three datasets –
HellaSwag [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], MMLU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and ARC-Challenge [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] – that were automatically translated into Italian.
Although this set of three benchmarks is generally
considered to be of high quality (thanks to the fact that the
translations were produced using GPT-3.5), there are still
several issues that limit the quality of this evaluation suite:
      </p>
      <p>Coverage: The Open Ita LLM Leaderboard only covers
three benchmarks. There are plenty of other datasets that
are generally used to test the capabilities of LLMs in
English, so limiting the assessment of Italian LLMs to just
three datasets may result in some important aspects of
their capabilities in Italian being overlooked.</p>
      <p>Reproducibility: The code and models used to translate
these three benchmarks are not directly available, making
it hard – if not impossible – to reproduce the translations.3</p>
      <p>Transparency: The fact that the translations are not
reproducible makes it difficult to analyze whether there
are errors or there is margin for improvement in the
translation process originally used to translate the three
benchmarks.</p>
      <p>English specificity: Despite the translation process,
these benchmarks actually remain tied to the English
language. Indeed, the prompts used as input to the language
model contain parts that are in English (for example, in
the creation of the examples used for few-shot evaluation).
This is undesirable because it inherently favours LLMs
that are bilingual, more specifically, LLMs that can
“speak” fluent English in addition to Italian.</p>
      <p>Uniformity: The translation of benchmarks from
English to a target language is usually done on a
benchmark-by-benchmark basis. On one hand, this allows developers
to specialize the translation code to each dataset; on the
other hand, this approach prevents the translation process
from being comparable across datasets, which makes
performing a root-cause analysis on the origin of an error in
the translated dataset more complex.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1.2. Re-translating English benchmarks</title>
      <p>Here we describe our methodology, which is aimed at
addressing the issues that are present in existing
benchmark translations, including the ones used in the Open
Ita LLM Leaderboard. More specifically, we introduce a
new library called OBenTO (Open Benchmark
Translation for the Others) that is designed to translate
existing benchmarks in a uniform, reproducible and fully
transparent way. Moreover, it is also designed to be
easily extensible, in such a way that the research
community can add new benchmark translations and even
new languages besides Italian.</p>
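      <p>As an illustration of the uniform translation process described above, the core loop can be sketched as follows. This is a minimal sketch, not the actual OBenTO code: the function names and dataset fields are hypothetical, and a toy function stands in for the TowerLLM backbone.</p>
      <p>
```python
import json

def translate_benchmark(instances, translate_fn, text_fields):
    # Apply the same translation routine to every dataset, so that the
    # process stays uniform and any systematic error can be traced back
    # to a single place: the translate_fn backbone.
    translated = []
    for instance in instances:
        new_instance = dict(instance)
        for field in text_fields:
            value = instance[field]
            if isinstance(value, list):
                new_instance[field] = [translate_fn(v) for v in value]
            else:
                new_instance[field] = translate_fn(value)
        translated.append(new_instance)
    return translated

# Toy backbone standing in for an actual translation model.
mock_translate = lambda text: "[it] " + text

bench = [{"question": "What is water made of?", "choices": ["H2O", "CO2"]}]
out = translate_benchmark(bench, mock_translate, ["question", "choices"])
print(json.dumps(out))
```
      </p>
      <p>Because every dataset goes through the same routine, swapping the backbone (e.g., to compare two translation models) only requires changing translate_fn.</p>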
    </sec>
    <sec id="sec-3">
      <p>3: For example, the version of GPT-3.5 used to translate the
benchmarks is not known. Also note that OpenAI has already
deprecated many GPT-3.5 versions.</p>
      <sec id="sec-3-1">
        <title>2.1. Translating English Benchmarks</title>
        <sec id="sec-3-1-1">
          <title>2.1.1. Issues with existing translations</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>1: https://www.evalita.it/campaigns/
2: https://semeval.github.io/</p>
      <p>We release OBenTO at https://github.com/sapienzanlp/obento.</p>
      <p>
        Translation model. The OBenTO library is designed
to be easily adaptable to new backbones but, at the time of
writing this article, the library relies on TowerLLM [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a recent open LLM that is built on top of open-weight
LLMs, such as LLaMA and Mistral. TowerLLM continues
the pretraining stage on 10 languages to improve the
multilingual capabilities of the starting LLM. Moreover,
TowerLLM is fine-tuned on translation and other
translation-related tasks, including grammar error correction,
named entity recognition and post-translation correction.
      </p>
      <p>Translated benchmarks. We translate the following
datasets from English to Italian:</p>
      <p>ARC Challenge and ARC Easy [17, ARC-C, ARC-E]:
These are two benchmarks on reasoning and scientific
knowledge, created from a single dataset; the ARC
Challenge split is obtained by selecting all those questions
that QA systems at the time were not able to answer
correctly.</p>
      <p>
        GSM8K [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: A benchmark that tests the capability of
an LLM to solve simple math problems whose solution
only requires the use of basic arithmetic operations.
      </p>
      <p>
        BoolQ [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]: A benchmark obtained from queries by
search engine users. The task consists in answering
Yes or No depending on an input passage that provides
context.
      </p>
      <p>HellaSwag [15, HS]: A commonsense reasoning dataset
that requires a system to select the most suitable
continuation for a given text, based on implicit commonsense
knowledge.</p>
      <p>In addition to the benchmarks translated with OBenTO,
ITA-Bench adapts existing Italian benchmarks drawn from
the EVALITA campaigns and the SemEval shared tasks.
These sources provide Italian data and annotations for a
variety of tasks, covering a broad spectrum of linguistic
capabilities and phenomena in the Italian language.
The key step in adapting these Italian benchmarks –
originally designed for different use cases – is to reframe
each task as a question answering task, enabling LLMs to
approach and solve them effectively through prompting.
In practice, this involves transforming the input of
each task into a natural question and the output into a
corresponding natural answer or continuation. Where
applicable, we also design a set of incorrect answers or
distractors of varying complexity. In our adaptation
process, we differentiate between two prompting strategies:
multiple-choice and cloze style. In the multiple-choice
approach, the LLM is given a question along with a
predetermined set of possible answers from which it must
choose the correct one. In the setting of adapting
existing benchmarks, the multiple-choice style also
encompasses binary classification prompting, where the only
possible responses are “sì” (yes) or “no”. In the cloze
style approach, instead, the LLM is required to generate
the correct answer based solely on the question or,
equivalently, the correct class verbalization for
classification tasks. Given the large search space of potential
answers in this format, the evaluation focuses on
ensuring that the likelihood of the correct answer is higher
than that of a predefined set of incorrect answers.
We discuss the details of the adaptation process for
each dataset in the following sections and in Appendix C.
We offer multiple-choice and cloze style implementations
for all datasets except QUANDHO and DISCOTEX,
which have only multiple-choice due to their sentence-
and paragraph-length choices.</p>
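      <p>The cloze-style evaluation described above, where the likelihood of the correct answer must exceed that of a predefined set of incorrect answers, can be sketched as follows. A toy scoring function stands in for a real language model's log-likelihood; a real evaluator would sum the model's token log-probabilities instead.</p>
      <p>
```python
import math

def toy_loglikelihood(prompt, continuation):
    # Stand-in for a language model: scores a continuation by trivial
    # token overlap with the prompt, penalized by its length.
    prompt_tokens = set(prompt.lower().split())
    cont_tokens = continuation.lower().split()
    overlap = sum(1 for t in cont_tokens if t in prompt_tokens)
    return overlap - math.log(1 + len(cont_tokens))

def cloze_is_correct(question, gold, distractors, score_fn):
    # The instance counts as solved if the gold answer is assigned a
    # higher likelihood than every distractor.
    gold_score = score_fn(question, gold)
    return all(gold_score > score_fn(question, d) for d in distractors)

q = "Quale animale miagola? Il gatto o il cane?"
print(cloze_is_correct(q, "il gatto miagola", ["il cane abbaia"], toy_loglikelihood))
```
      </p>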
      <p>
        MMLU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: A benchmark which encompasses several
questions over 57 subjects across STEM, the humanities,
the social sciences, and more.
      </p>
      <p>
        PIQA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]: A benchmark that evaluates the capability
of an LLM to reason about physical interactions.
      </p>
      <p>
        SciQ [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]: A reading comprehension test set that
challenges an LLM to extract the answer from a passage
and question given in input.
      </p>
      <p>TruthfulQA [23, TQA]: A question answering
benchmark with a focus on popular misconceptions found
across the Web.</p>
      <p>Winogrande [24, WG]: A commonsense reasoning
dataset that requires choosing between two options
based on coreference resolution.</p>
      <p>AMI [25]: Automatic Misogyny Identification is a
classification task in which the goal is to understand
whether or not a tweet is misogynistic. The original
task is divided into two subtasks, Behaviour and Synth.
Behaviour consists in classifying a tweet into one of
three classes, namely, no misogyny, mild misogyny,
and aggressive misogyny. Instead, Synth consists of a
binary classification task, misogyny vs. no misogyny.
ITA-Bench includes both subtasks, but in this work we
focus on Synth, as Behaviour is more complex due to its
unbalanced class distribution.</p>
    </sec>
    <sec id="sec-6">
      <title>NERMuD [26]: Named Entity Recognition on Multi-domain Documents</title>
      <p>NERMuD was first presented at EVALITA 2023. The
task uses standard NER classes, namely, Person,
Organization, and Location, to tag entities in a text. In
ITA-Bench, we adapt NERMuD and create task instances
comprised of three elements: i) the sentence that contains
the entity mention, ii) the mention of the entity in the
sentence, and iii) the correct class associated with the
mention in the given context. We distinguish between two
subdomains: ADG, writings and speeches from the Italian
politician Alcide De Gasperi, and WN, news texts from
the past decades.</p>
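      <p>The adaptation described above can be illustrated with a minimal sketch; the prompt wording and the field names below are hypothetical, not the actual ITA-Bench templates.</p>
      <p>
```python
def nermud_instances(sentence, mentions):
    # Build one task instance per entity mention: the sentence, the
    # mention itself, and the gold class among Person, Organization,
    # and Location.
    classes = ["Person", "Organization", "Location"]
    instances = []
    for mention, gold in mentions:
        prompt = (
            "Frase: " + sentence + "\n"
            "A quale classe appartiene “" + mention + "”? "
            "Opzioni: " + ", ".join(classes)
        )
        instances.append({"prompt": prompt, "mention": mention, "label": gold})
    return instances

sent = "Alcide De Gasperi parla a Roma."
insts = nermud_instances(sent, [("Alcide De Gasperi", "Person"), ("Roma", "Location")])
print(len(insts), insts[1]["label"])
```
      </p>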
      <sec id="sec-6-1">
        <title>2.2. Adapting Italian Benchmarks</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>In addition to our new automatically-translated benchmarks, ITA-Bench also includes the adaptation of existing Italian benchmarks from two main sources: the EVALITA campaigns and the SemEval shared tasks</title>
      <p>Table 1: Statistics of the ITA-Bench datasets; for each
dataset, the cardinalities of the training, validation and
test sets are reported.
Dataset: Train set / Valid set / Test set
ARC-C: 1068 / 286 / 1132
ARC-E: 2157 / 549 / 2258
BoolQ: 9399 / 3259 / -
GSM8K: 7473 / - / 1319
HS: 39722 / 9998 / -
MMLU: 269 / 1402 / 13127
PIQA: 15038 / 1713 / -
SciQ and TruthfulQA: -- / 799823 / 9-85
Winogrande: 4717 / 1176 / -
AMI: 7014 / - / 2908
WiC: 2805 / 500 / 500
NERMuD: 14529 / 4079 / 3943
PreTENS: 5837 / - / 1560
PRELEARN: 2328 / - / 4699
DISCOTEX: 16000 / - / 1600
GhigliotinAI: 62 / - / 553
QUANDHO: 384 / - / 1416</p>
      <p>PRELEARN [28]: Prerequisite RElation LEARNing is
a task from EVALITA 2020 on concept prerequisite
learning. This task consists in identifying whether a
concept A is a prerequisite of another concept B, i.e.,
if learning concept B requires having already learnt
concept A. The original dataset comes with four domains,
namely, Geometry, Precalculus, Physics, and Data Mining,
and we maintain these same domains in ITA-Bench.</p>
      <p>WiC [29]: Word-in-Context for Italian. We focus
on the binary-classification sub-task of the original
formulation. In ITA-Bench, an LLM is tasked with
determining whether a word w occurring in two different
sentences s1 and s2 has the same meaning in s1 and s2.</p>
      <p>QUANDHO [30]: The QUestion ANswering Data
for Italian HistOry dataset is an Italian question
answering dataset focused on Italy’s history during the
first half of the 20th century. It provides Wikipedia
passages that may contain the answer to specific
questions. Each question in the dataset appears in
multiple (question, answer) pairs, where the answer can
be either correct or incorrect. In ITA-Bench, we select
the pair with an answer marked as correct and three
distractors from the occurrences of incorrect answers
paired with the same question.</p>
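      <p>As an illustration, the WiC adaptation described above can be framed as a binary sì/no prompt; the Italian wording below is hypothetical, not the actual ITA-Bench template.</p>
      <p>
```python
def wic_prompt(word, sentence1, sentence2):
    # Frame Word-in-Context as a binary question: does the target word
    # carry the same meaning in the two sentences?
    return (
        "Frase 1: " + sentence1 + "\n"
        "Frase 2: " + sentence2 + "\n"
        "La parola “" + word + "” ha lo stesso significato "
        "nelle due frasi? Rispondi “sì” o “no”."
    )

prompt = wic_prompt(
    "piano",
    "Suona il piano ogni sera.",
    "Ha un piano per il fine settimana.",
)
print(prompt)
```
      </p>
      <p>Under the multiple-choice formulation, the two admissible continuations (“sì” and “no”) are then scored by likelihood, exactly as for the other binary classification tasks.</p>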
      <p>DISCOTEX [27]: Assessing DIScourse COherence in
Italian TEXts is a task focused on modelling discourse
coherence in real-world Italian texts. In ITA-Bench, we
focus only on the first sub-task of DISCOTEX: Last
Sentence Classification, where, given a short input paragraph
and a sentence, the goal is to tell whether the sentence
is a valid continuation of the paragraph. To assess the
capability of an LLM to solve this task, we reframe
DISCOTEX as a multi-choice question answering task.
More specifically, given an input paragraph, the LLM is
tasked with selecting the most appropriate continuation
from among five options that we provide (the original
dataset does not provide distractors). Therefore, for the
subset of instances with valid continuations, we create
a set of distractors by sampling continuations from
other instances at random. Instead, for the instances
with invalid continuations, we create a new correct
option “nessuna delle precedenti” (none of the above), and
add a set of four random distractors from other instances.</p>
      <p>PreTENS4: Presupposed Taxonomies was first
proposed for SemEval-2022. This task focuses on
semantic competence, and evaluates the ability of
an LLM to recognize valid taxonomic relationships
between two nominal arguments. For example, this
can require recognizing whether or not a concept is a
subclass of another concept. In ITA-Bench, an LLM
is tasked with identifying whether the relationship
between two concepts in the same sentence is acceptable.</p>
      <p>GhigliottinAI: Starting from two different EVALITA
tasks, nlp4fun [31] and ghigliottin-AI [32], we collect
about 600 different games extracted from the TV
show and the board game of “L’Eredità”, a popular
quiz game in Italy. “La Ghigliottina” is a challenging
game that requires extensive knowledge of Italian
culture. The goal is to find a single word that links five
seemingly unrelated words. However, since multiple
solutions are often possible and computing all potential
answers is impractical, in ITA-Bench, we reframe the
problem as a multi-choice question answering task,
i.e., a simplified version in which four possible words
are given and, among these, only one can be linked to
all the five input words. In ITA-Bench, we also select
three distractor words in such a way that the distractors
are linked to three of the five input words. We ensure
that the distractors are not too similar to one another
by maximizing the cosine distance of their FastText
embeddings. The distractors are also designed to be at
most one character shorter or longer than the correct
word, resulting in a task that is easy for humans but
challenging for LLMs.</p>
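      <p>The distractor selection described above, which combines a length constraint with the maximization of pairwise cosine distance between word embeddings, can be sketched as follows. This is a toy illustration: a character-bigram hashing function stands in for the FastText vectors, and the candidate list is invented.</p>
      <p>
```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pick_distractors(answer, candidates, embed, k=3):
    # Keep candidates at most one character shorter or longer than the
    # answer, then choose the k-subset whose words are mutually most
    # distant in embedding space.
    pool = [c for c in candidates if 1 >= abs(len(c) - len(answer))]
    best, best_score = None, -1.0
    for subset in combinations(pool, k):
        score = sum(cosine_distance(embed(a), embed(b))
                    for a, b in combinations(subset, 2))
        if score > best_score:
            best, best_score = subset, score
    return list(best)

def embed(word):
    # Toy character-bigram hashing embedding, just for the example.
    vec = [0.0] * 8
    for i in range(len(word) - 1):
        vec[hash(word[i:i+2]) % 8] += 1.0
    return vec

distractors = pick_distractors("carta", ["porta", "torta", "libro", "sole"], embed)
print(distractors)
```
      </p>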
    </sec>
    <sec id="sec-8">
      <title>3. Evaluation Results</title>
      <p>In this section, we discuss the results of various LLMs
on ITA-Bench: we first present the results on the
automatically-translated benchmarks and then on the
adapted benchmarks. ITA-Bench implements all the
task formulations using the lm-evaluation-harness
library [33], which allows us to calculate the likelihoods
of each possible continuation in a simple and comparable
way, as lm-evaluation-harness is also used by Hugging
Face for the Open LLM Leaderboard.</p>
    </sec>
    <sec id="sec-9">
      <title>3.1. Automatic Translation</title>
      <p>4: https://sites.google.com/view/semeval2022-PreTENS</p>
      <p>The open-weight models we evaluate are, grouped by
size: 0.4B Minerva-350M-base-v1.0; 1B Minerva-1B-base-v1.0;
3B OpenELM-3B, XGLM-2.9B, and Minerva-3B-base-v1.0;
7B OLMo-7B-0724-hf, LLaMAntino-2-7b, Minerva-7B-base-v1.0,
Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1, and
Maestrale-chat-v0.4-beta; 8B Llama-3.1-8B, LLaMa-3.1-8B-Instruct,
and LLaMAntino-3-ANITA; 9B Italia-9B-Instruct-v0.1.</p>
      <p>The results of various LLMs on our translated
benchmarks are reported in Table 2, which provides an
overview of the zero-shot scores on cloze style task
formulations, i.e., the input prompt to an LLM includes only
the question without the possible answers. More
specifically, we compare the results of several open-weight
LLMs having different sizes, ranging from less than 1B
parameters up to 9B parameters, and focusing on LLMs
that have been pretrained, fine-tuned and/or adapted
on/to the Italian language. As we can see, the scores of
the LLMs are roughly correlated to their size in terms of
number of parameters. Notably, the smaller versions of
the Minerva LLMs are able to compete with larger
models, thanks to the fact that a significant portion of their
pretraining dataset is composed of Italian text (rather
than English).</p>
      <p>However, as we can see from the results on GhigliottinAI,
Italian LLMs seem to perform well and surpass the results
obtained by English models. This may indicate that this
task needs a different type of competence and/or
knowledge in order to be solved. Indeed, we hypothesize that
the task requires a deeper understanding of some elements
of the Italian culture, e.g., entities and concepts that are
more commonly known in Italy than in other countries.
Therefore, pretraining and fine-tuning on Italian
documents might be the key to obtaining better results in
GhigliottinAI.</p>
    </sec>
    <sec id="sec-10">
      <title>4. Manual Error Analysis</title>
      <p>
        In order to assess the quality and reliability of our
automatically-translated data, we conduct a manual
error analysis. To this end, we examine the translations
into Italian produced by four language models: two
open-source ones, namely, TowerInstruct-7B-v0.2 (footnote 5)
and TowerInstruct-Mistral-7B-v0.2 (footnote 6) [
        <xref ref-type="bibr" rid="ref24">34</xref>
        ], and two proprietary
ones, that is, GPT-3.5-turbo and GPT-4o-mini [
        <xref ref-type="bibr" rid="ref25">35</xref>
        ] (footnote 7). First,
we describe the data and the analysis procedure
employed. We then discuss the results of our manual analysis
and review some crucial error patterns.
      </p>
      <sec id="sec-10-1">
        <title>3.2. Adapting Italian Datasets</title>
      </sec>
      <sec id="sec-10-2">
        <title>4.1. Data and analysis procedure</title>
      </sec>
    </sec>
    <sec id="sec-11">
      <p>Moving to the adapted benchmarks in ITA-Bench, Table 3
reports the scores of different state-of-the-art models,
ranging from 350M-parameter models to 9B parameters.
Here, we focus on the results of the LLMs in cloze style
tasks, except for QUANDHO and DISCOTEX, as ITA-Bench
supports only the multi-choice formulation for these two
tasks. Unsurprisingly, the size of the LLMs and their
pretraining data are discriminators for reaching better
results. Most importantly, even the strongest Italian LLMs,
such as ANITA, still struggle to compete against their
English counterparts.</p>
      <p>5: https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2
6: https://huggingface.co/Unbabel/TowerInstruct-Mistral-7B-v0.2
7: We employ the OBenTO pipeline to process the translations
generated by the open-source models. As for GPT-3.5-turbo, we
use the translations available at: https://huggingface.co/datasets/alexandrainst/m_arc.
We also translate the datasets using GPT-4o-mini with a pipeline
similar to the one used for GPT-3.5-turbo.</p>
      <p>As the source of the data for our linguistic analysis, we
rely on the ARC dataset, which includes multiple-choice
question answering in a wide range of domains.
Specifically, we randomly select a sample of 100 instances from
the ARC Challenge dataset and we manually analyze the
quality of the translations produced by all language
models considered. For each instance, we assess the degree of
comprehensibility and fidelity of the translation of both
questions and answers, assigning a binary label which
indicates whether a translation is acceptable or not.
Crucially, we distinguish between minor and major errors
depending on the impact on the comprehensibility and
fidelity of the target translation. We then identify error
patterns, some of which we describe below, highlighting
the cases in which the translation impedes understanding
of either the questions or the answers, or fails to
faithfully reproduce the source text, thus altering the original
meaning. Finally, we discuss the results of our analysis.
Annotation guidelines are reported in Appendix A.</p>
      <p>4.2. Results</p>
      <p>Our analysis shows that GPT-4o-mini outperforms all its
competitors. With an error rate8 of 4%, it is markedly
more accurate than TowerInstruct-7B-v0.2, which
exhibits an error rate of 23%. TowerInstruct-Mistral-7B-v0.2
and GPT-3.5-turbo show a similar performance, that
is, 8% and 9% error rate, respectively. Finally, the most
frequent error patterns are omissions, especially when
considering open-source models, and incorrect
terminology.</p>
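      <p>The inter-annotator agreement we report can be reproduced with the standard two-rater Cohen's kappa formula; the annotation labels below are illustrative toy data, not the actual annotations from our analysis.</p>
      <p>
```python
def cohens_kappa(labels_a, labels_b):
    # Standard two-rater Cohen's kappa: observed agreement p_o versus
    # chance agreement p_e estimated from the raters' label marginals.
    n = len(labels_a)
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    categories = set(labels_a).union(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1.0 - p_e)

# Toy binary annotations (acceptable = 1, not acceptable = 0).
rater_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rater_2 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(round(cohens_kappa(rater_1, rater_2), 3))
```
      </p>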
      <sec id="sec-11-1">
        <title>4.1.1. Key error patterns</title>
        <p>As part of our manual annotation process, we identify
error patterns, of which we report four key ones, namely:
i) omissions, which consist in omitting one or multiple
source words in the translation; ii) incorrect terminology,
that is, the incorrect translation of one or multiple terms
into the target language; iii) untranslated source text,
where one or multiple source words are reported as-is
in the translation, despite these words not being
commonly used in the target language; and iv) grammatical
errors, which include orthographic, morphological and
syntactical errors. Instances of the aforementioned error
patterns can be found in Appendix B.</p>
        <p>4.1.2. Inter-annotator agreement</p>
        <p>In order to assess the reliability of our manual
annotations, we compute the inter-annotator agreement. With
this aim in view, we select the already-annotated
translations produced by one randomly-chosen model and
employ a new annotator to assess their quality based on
our guidelines. We obtain a Cohen’s kappa of 0.85, which
indicates a strong agreement.</p>
        <p>8: We emphasize that this error rate does not provide a nuanced
evaluation of fluency, idiomaticity and other crucial aspects of
translation.</p>
        <p>5. Conclusion</p>
        <p>In this paper, we introduce a novel evaluation suite
aimed at advancing the Italian community’s ability to
assess the competencies of LLMs on Italian data. Our
approach follows two main directions. First, we define
a novel pipeline called OBenTO, which involves
translating state-of-the-art English benchmarks into Italian.
Second, we rephrase existing Italian benchmarks to be
used for prompting and testing large language models.
Additionally, we conduct a comprehensive evaluation
of the quality of automatically translated benchmarks,
highlighting the inherent challenges of such an approach
and analyzing the errors made by four LLMs. We hope
that our work can provide a solid evaluation framework
for evaluating the capabilities of current and future LLMs
in Italian.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgments</title>
      <p>Simone Conia gratefully acknowledges the PNRR MUR
project PE0000013-FAIR, which fully funds his fellowship.
Federico Martelli and Roberto Navigli acknowledge the
support of the CREATIVE project (CRoss-modal
understanding and gEnerATIon of Visual and tExtual content,
Progetti di Interesse Nazionale - PRIN 2020). Finally, we
acknowledge the work of the M.Sc. students of Prof.
Navigli’s multilingual NLP course for the Academic Year 2024,
whose contributions provided valuable insights and ideas
for the adaptation of existing Italian benchmarks. We
acknowledge the CINECA award IsB28_medit under the
ISCRA initiative for the availability of high-performance
computing resources and support.</p>
      <p>A. Annotation Guidelines</p>
    </sec>
    <sec id="sec-13">
      <title>In this section, we report the annotation guidelines adopted to ensure consistency throughout our analysis</title>
      <p>Annotators receive a document containing the source text
and the translations produced by four language models,</p>
    </sec>
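      <p>The guidelines above feed into the inter-annotator agreement analysis of Section 4.1.2. As an illustration, agreement between two annotators over a shared set of labeled instances can be quantified with Cohen's kappa; the sketch below uses hypothetical labels and counts, not the paper's actual annotations.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels: "ok" vs. error categories such as "omission".
ann1 = ["ok", "omission", "ok", "terminology", "ok", "ok"]
ann2 = ["ok", "omission", "ok", "ok", "ok", "ok"]
print(round(cohen_kappa(ann1, ann2), 3))  # prints: 0.6
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when one label (here "ok") dominates.</p>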
    <sec id="sec-14">
      <title>6. Inadequate register</title>
      <p>The tone or style of the translation does not align with the context of the source text.</p>
      <p>7. Addition of one or more words: Additional words or phrases (not present in the source text) are included in the translation.</p>
      <p>[...] Un uomo con una combinazione di alleli RR per il tratto produce uno zigote con una donna con una combinazione di alleli rr per il tratto. Quale combinazione di alleli potrebbe verificarsi nello zigote?</p>
      <p>
        B. Examples of key error patterns
      </p>
      <p>
        In this section, we report examples of the key error patterns described in Section 4.1.1. Specifically, we report instances of omissions (Table 4), incorrect terminology (Table 5), untranslated source words (Table 6) and grammatical errors (Table 7). Errors are highlighted in red by square brackets. Importantly, all examples in the aforementioned tables are considered major errors, with the sole exception of the first instance of omission reported in Table 4: the omission of the word pests has a limited impact on the comprehensibility and fidelity of the translation and, therefore, for the purposes of the task at hand and of our analysis, the translation is considered acceptable. As for untranslated source words, we note several issues in the data. As reported in Table 4, GPT-4o-mini translates the term weathering with the Italian equivalent of erosion. However, weathering and erosion are two different geological processes. Weathering (which could be translated into Italian as degradazione meteorica) refers to the breaking down of rocks and minerals at their original location through physical, chemical, or biological means, without the material being moved elsewhere. In contrast, erosion involves the removal and transportation of weathered material by agents such as water, wind, or ice. Hence, by translating weathering with the Italian equivalent of erosion, the model fails to capture the precise meaning of the source term, significantly altering the content of the source text. Our error analysis also shows that MT systems still struggle with the disambiguation of concepts [
        <xref ref-type="bibr" rid="ref27 ref28">37, 38</xref>
        ] and entities [
        <xref ref-type="bibr" rid="ref13 ref29">39, 13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-15">
      <title>C. Adapted Tasks Prompts</title>
      <p>In this section, we report all the prompts chosen for the adapted tasks. The cloze-style prompts are reported in Table 8, while the multi-choice-style prompts can be seen in Table 9. For each task, we also define a system prompt, i.e., a text prepended to the sample prompts given to the model; the proposed system prompts are reported in Table 10. We present all prompts in the same format as the LM-Evaluation-Harness implementation. To ensure clarity and conciseness, we use Jinja templating9 for all prompts.</p>
      <p>D. In-domain Results</p>
      <p>Results for PRELEARN and NERMuD are reported as average accuracies in the main part of this paper. Per-domain results for the two different tasks are reported in Table 11 and Table 12, while the corresponding zero-shot results are reported in Table 13 and Table 14. We report the results twice, separating the multi-choice and cloze-style prompt settings.</p>
      <p>E. Other results for adapted tasks</p>
      <p>In this section, we report further results for the adapted tasks. More precisely, Table 15 collects the metrics for the adapted tasks in the zero-shot setting, where all tasks are proposed in cloze-style prompting, except for DISCOTEX and QUANDHO, which are reported in multi-choice prompting. Since we employed Multi-Choice (MC) style prompting for all adapted tasks, Table 16 presents the results for these tasks in the zero-shot setting, while Table 17 shows the results in the five-shot setting.</p>
    </sec>
    <sec id="sec-16">
      <title>9https://jinja.palletsprojects.com/en/3.1.x/templates/</title>
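      <p>As noted in Appendix C, the prompts use Jinja templating; for instance, the PRELEARN system prompt contains the placeholder {{domain}}. The sketch below only mimics Jinja's {{var}} placeholder syntax with a stdlib regex (the actual pipeline uses the Jinja library itself); the domain value is illustrative.

```python
import re

def render(template: str, **variables) -> str:
    """Fill {{var}}-style placeholders, mimicking Jinja variable substitution."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables[m.group(1)]),
                  template)

# System prompt for PRELEARN as reported in Table 10:
# "I seguenti concetti appartengono al dominio: {{domain}}."
system_prompt = "I seguenti concetti appartengono al dominio: {{domain}}."
print(render(system_prompt, domain="data mining"))
# prints: I seguenti concetti appartengono al dominio: data mining.
```

The rendered system prompt is then prepended to each sample prompt before it is passed to the model.</p>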
      <p>TowerInstruct-7B-v0.2
Una cellula [eugleno] possiede una struttura chiamata [macula occhiolare] che rileva la luce. Un [parameno]
non possiede una [macula occhiolare] e quindi non riesce a rilevare la luce. Perché un [parameno] non ha
bisogno di una [macula occhiolare]?
A plateau is most likely formed by [a] runoff from a river. [b] weathering by waves. [c] erosion of rock debris. [d] a buildup of cooled lava.</p>
      <p>Un plateau è più probabilmente formato da [a] deflusso da un fiume. [b] [erosione] da onde. [c] erosione di detriti rocciosi. [d] un accumulo di lava raffreddata.
In un bicchiere è [stato versato] dell’acqua fino a metà. Sono stati messi cinque cubetti di ghiaccio nel bicchiere, facendo sì che il livello dell’acqua raggiungesse il bordo del bicchiere. Quale delle seguenti affermazioni spiega al meglio l’aumento del livello dell’acqua?
Which of the following is most likely an adaptation that resulted from habitat destruction?</p>
      <p>Qual è più probabile un[’]adattamento che è risultato dalla distruzione dell’habitat?
[Flattened table residue: task labels PRELEARN, NERMuD, PreTENS, QuandHO, DISCOTEX, GhigliottinAI, WiC for the system prompts reported below.]</p>
      <p>Indica se i seguenti tweet presentano caratteristiche misogine.</p>
      <p>Data una frase e un’entità, indica se tale entità rappresenta un luogo, un’organizzazione o una persona.
Indica se le seguenti frasi hanno senso.</p>
      <p>Dati due concetti A e B, indica se il primo concetto è un prerequisito per il secondo.</p>
      <p>Il concetto A è prerequisito per il concetto B, se per comprendere B devi prima aver compreso A.
I seguenti concetti appartengono al dominio: {{domain}}.</p>
      <p>Ti saranno poste domande di storia italiana.</p>
      <p>Identifica quali paragrafi contengono la risposta alle domande date.</p>
      <p>Ti verranno poste delle domande nelle quali è presente un paragrafo, e come possibili risposte varie frasi che
possono essere o meno la continuazione del paragrafo.</p>
      <p>Indica la frase che rappresenta la continuazione più probabile del paragrafo, oppure “nessuna delle precedenti”
se nessuna delle continuazioni è corretta.</p>
      <p>Ti viene chiesto di risolvere il gioco della ghigliottina.</p>
      <p>Il gioco della ghigliottina consiste nel trovare un concetto che lega cinque parole date. Tale concetto è esprimibile
tramite una singola parola.</p>
      <p>Date due frasi, che contengono un lemma in comune, indica se tale lemma ha lo stesso significato in entrambe
le frasi.
[Flattened table residue: models Minerva-350M-base-v1.0, Minerva-1B-base-v1.0, XGLM-2.9B, OpenELM-3B, Minerva-3B-base-v1.0.]</p>
    </sec>
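    <p>Appendices D and E distinguish cloze-style from multi-choice (MC) prompting: in cloze style the model scores each full answer continuation (often length-normalized), while in MC style the options are listed in the prompt and only the option labels are scored. The sketch below uses a stand-in loglik function, not the paper's actual evaluation harness.

```python
def pick_cloze(prompt, choices, loglik):
    """Cloze style: score each full answer text, normalized by its length."""
    return max(choices, key=lambda c: loglik(prompt, c) / max(len(c), 1))

def pick_multi_choice(prompt, choices, loglik):
    """MC style: list the options in the prompt, then score only the labels."""
    labels = [chr(ord("a") + i) for i in range(len(choices))]
    listed = prompt + " " + " ".join(f"[{l}] {c}" for l, c in zip(labels, choices))
    best = max(labels, key=lambda l: loglik(listed, l))
    return choices[labels.index(best)]
```

With a real model, loglik(prompt, continuation) would return the model's log-likelihood of the continuation given the prompt; the two settings can therefore rank the very same choices differently, which is why the results are reported twice.</p>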
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , E. Buchatskaya,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          , D. d. L.
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hendricks</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Welbl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Training compute-optimal large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2203.15556</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>Trans. Mach. Learn. Res</source>
          .
          <year>2022</year>
          (
          <year>2022</year>
          ). URL: https://openreview.net/forum?id=yzkSU5zdwD.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          , G. Fiameni, G. Semeraro,
          <article-title>Llamantino: Llama 2 models for effective text generation in Italian language</article-title>
          ,
          <source>arXiv preprint arXiv:2312.09993</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodolà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Fauno: The Italian large language model that will leave you senza parole!</article-title>
          , in: F.
          <string-name>
            <given-names>M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ferrara (Eds.),
          <source>Proceedings of the 13th Italian Information Retrieval Workshop (IIR</source>
          <year>2023</year>
          ), Pisa, Italy, June 8- 9,
          <year>2023</year>
          , volume
          <volume>3448</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>17</lpage>
          . URL: https://ceur-ws.org/Vol-3448/paper-24.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>DanteLLM: Let's push Italian LLM Research Forward!</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>4343</fpage>
          -
          <lpage>4355</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.388
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Semeraro, Advanced Natural-based interaction for the Italian language: Llamantino-3-anita</article-title>
          , arXiv preprint arXiv:2405.07101 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Barba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Conia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Orlandini</surname>
            , G. Fiameni,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Navigli</surname>
          </string-name>
          , Minerva LLMs:
          <article-title>The first family of Large Language Models trained from scratch on Italian data</article-title>
          ,
          <source>Proc. of CLiC-it 2024 - Tenth Italian Conference on Computational Linguistics</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>ItaEval and TweetyIta: A new extensive benchmark and efficiency-first language model for</article-title>
          <source>Italian</source>
          ,
          <year>2024</year>
          . URL: https://rita-nlp.org/static/ItaEval_TweetyIta_v1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hershcovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lent</surname>
          </string-name>
          , M. de Lhoneux,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bugliarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Cabello</given-names>
            <surname>Piqueras</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fierro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Margatina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <article-title>Challenges and strategies in cross-cultural NLP</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>6997</fpage>
          -
          <lpage>7013</lpage>
          . URL: https://aclanthology.org/2022.acl-long.482. doi:10.18653/v1/2022.acl-long.482.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <article-title>Biases in large language models: Origins, inventory, and discussion</article-title>
          ,
          <source>ACM J. Data Inf. Qual</source>
          .
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <volume>10</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          :
          <fpage>21</fpage>
          . URL: https://doi.org/10.1145/3597307. doi:10.1145/3597307.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. F.</given-names>
            <surname>Minhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Potdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards cross-cultural machine translation with retrieval-augmented generation from multilingual knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.14057.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Puccetti,</surname>
          </string-name>
          <article-title>The invalsi benchmark: measuring language models mathematical and language understanding in Italian</article-title>
          ,
          <source>arXiv preprint arXiv:2403.18697</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: A. Korhonen, D. R. Traum, L. Màrquez (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 4791-4800. URL: https://doi.org/10.18653/v1/p19-1472. doi:10.18653/v1/p19-1472.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, OpenReview.net, 2021. URL: https://openreview.net/forum?id=d7KBjmI3GmQ.
          [...] in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Association for Computational Linguistics, 2022, pp. 3214-3252. URL: https://doi.org/10.18653/v1/2022.acl-long.229.
          [24] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, WinoGrande: An adversarial Winograd Schema Challenge at scale, Communications of the ACM 64 (2021) 99-106.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cowhey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Sab- [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , et al., Ami@ harwal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schoenick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <article-title>Think you have evalita2020: Automatic misogyny identification, in: solved question answering? try arc, the ai2 rea- Proceedings of the 7th evaluation campaign of Natusoning challenge</article-title>
          , arXiv preprint arXiv:
          <year>1803</year>
          .
          <article-title>05457 ral Language Processing and Speech tools for Italian (</article-title>
          <year>2018</year>
          ).
          <source>(EVALITA</source>
          <year>2020</year>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>D. M. Alves</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pombal</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          <string-name>
            <surname>Guerreiro</surname>
            ,
            <given-names>P. H.</given-names>
          </string-name>
          <string-name>
            <surname>Mar-</surname>
            [26]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Palmero Aprosio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Paccosi</surname>
          </string-name>
          , et al., Nermud at tins, J.
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Farajian</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Rei</surname>
          </string-name>
          , P. Fernan- evalita
          <year>2023</year>
          :
          <article-title>overview of the named-entities recogdes, S. Agrawal</article-title>
          , et al.,
          <article-title>Tower: An open multilingual nition on multi-domain documents task, in: CEUR large language model for translation-related tasks</article-title>
          ,
          <source>WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>3473</volume>
          , CEUR, arXiv preprint arXiv:
          <volume>2402</volume>
          .17733 (
          <year>2024</year>
          ).
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          , M. Chen, [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Colla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          , Radicioni,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Discotex</surname>
            <given-names>at evalita R.</given-names>
          </string-name>
          <string-name>
            <surname>Nakano</surname>
          </string-name>
          , et al.,
          <article-title>Training verifiers to solve 2023: overview of the assessing discourse cohermath word problems</article-title>
          ,
          <year>2021</year>
          , URL https://arxiv. ence in Italian texts task, in: CEUR WORKSHOP org/abs/2110.14168 (
          <year>2021</year>
          ). PROCEEDINGS, volume
          <volume>3473</volume>
          ,
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          , T. Kwiatkowski, [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Alzetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Koceva</surname>
          </string-name>
          , M. Collins,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Boolq: Exploring the I. Torre, Prelearn@ evalita 2020: Overview of surprising dificulty of natural yes/no questions, the prerequisite relation learning task for Italian</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <article-title>Proceed- EVALITA Evaluation of NLP and Speech Tools for ings of the 2019 Conference of the North American Italian-December 17th,</article-title>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>363</fpage>
          .
          <article-title>Chapter of the Association for Computational Lin-</article-title>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cassotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Gatto, guistics: Human Language Technologies</article-title>
          ,
          <string-name>
            <surname>NAACL- P. Basile</surname>
          </string-name>
          , et al.,
          <source>Wic-ita at evalita2023: Overview HLT</source>
          <year>2019</year>
          ,
          <article-title>Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          ,
          <article-title>of the evalita2023 word-in-context for Italian task</article-title>
          ., Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for EVALITA</source>
          (
          <year>2023</year>
          ).
          <source>Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2924</fpage>
          -
          <lpage>2936</lpage>
          . [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Uva, “who was pietro URL: https://doi.org/10.18653/v1/n19-
          <fpage>1300</fpage>
          . doi:10. badoglio?”
          <article-title>towards a qa system for Italian his18653/V1/N19- 1300. tory</article-title>
          ,
          <source>in: Proceedings of the Tenth International</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , et al.,
          <source>Piqa: Conference on Language Resources</source>
          and
          <article-title>Evaluation Reasoning about physical commonsense in natural (</article-title>
          <source>LREC'16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>435</lpage>
          . language,
          <source>in: Proceedings of the AAAI confer-</source>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. De Gemmis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Siciliani</surname>
          </string-name>
          , G.
          <source>Semerence on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>aro</fpage>
          ,
          <article-title>Overview of the evalita 2018 solving language 7432-7439. games (nlp4fun) task, EVALITA Evaluation of NLP</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <source>Crowdsourcing and Speech Tools for Italian</source>
          <volume>12</volume>
          (
          <year>2018</year>
          )
          <article-title>75. multiple choice science questions</article-title>
          , in: L. Der- [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lovetere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pascucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanczynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          , T. Baldwin (Eds.), Pro- gati, L. Siciliani,
          <article-title>Ghigliottin-ai@ evalita2020: Evalceedings of the 3rd Workshop on Noisy User- uating artificial players for the language game “la generated Text</article-title>
          ,
          <source>NUT@EMNLP</source>
          <year>2017</year>
          , Copenhagen, ghigliottina”,
          <source>EVALITA Evaluation of NLP and Denmark, September</source>
          <volume>7</volume>
          ,
          <year>2017</year>
          ,
          <article-title>Association for Com- Speech Tools for Italian-December 17th,</article-title>
          <year>2020</year>
          (
          <year>2020</year>
          ) putational Linguistics,
          <year>2017</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>106</lpage>
          . URL: https:
          <fpage>345</fpage>
          . //doi.org/10.18653/v1/w17-
          <lpage>4413</lpage>
          . doi:
          <volume>10</volume>
          .18653/V1/ [33]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          , S. Black,
          <fpage>W17</fpage>
          -
          <lpage>4413</lpage>
          . A. DiPofi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Le Noac</surname>
          </string-name>
          <article-title>'h,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          , Truthfulqa: Measur
          <string-name>
            <surname>- H. Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>McDonell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Muennighof</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Ociepa, ing how models mimic human falsehoods</article-title>
          , in: J.
          <string-name>
            <surname>Phang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schoelkopf</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Skowron</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Sutawika</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Thite</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>namely TowerInstruct-7B-v0.2</source>
          ,
          <string-name>
            <surname>TowerInstruct-MistralA. Zou</surname>
          </string-name>
          ,
          <article-title>A framework for few-shot language model 7B-v0.2, GPT-3.5-turbo and</article-title>
          <string-name>
            <surname>GPT-</surname>
          </string-name>
          4o-mini.
          <source>Annotators evaluation</source>
          ,
          <year>2023</year>
          . URL: https://zenodo.org/records/ are required to determine
          <source>the correctness of a transla10256836. doi:10</source>
          .5281/zenodo.10256836. tion.
          <article-title>In order for a translation to be deemed correct,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [34]
          <string-name>
            <surname>D. M. Alves</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pombal</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          <string-name>
            <surname>Guerreiro</surname>
            ,
            <given-names>P. H.</given-names>
          </string-name>
          <article-title>Mar- two key requirements must be satisfied, namely, comtins</article-title>
          , J.
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Farajian</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Rei</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Fer- prehensibility and fidelity. A translation is considered nandes, S</article-title>
          . Agrawal,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colombo</surname>
          </string-name>
          , J. G. C. de Souza,
          <article-title>comprehensible if a native speaker can easily understand A</article-title>
          . F. T. Martins,
          <string-name>
            <surname>Tower:</surname>
          </string-name>
          <article-title>An open multilingual large the content of both the question and all the answers</article-title>
          .
          <article-title>Filanguage model for translation-related tasks, 2024. delity, on the other hand, refers to the degree to which arXiv:2402.17733. the translation conforms to the English source text</article-title>
          . In
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Akkaya, order to determine whether both requirements are adeF</article-title>
          . L.
          <string-name>
            <surname>Aleman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Altenschmidt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Alt- quately satisfied, we categorize translation errors as miman, S. Anadkat</article-title>
          , et al.,
          <source>Gpt-4 technical report</source>
          , nor or major.
          <source>While minor errors do not usually hamper arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
          <article-title>the overall comprehensibility and fidelity</article-title>
          , major errors -
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>Ratnakar, which might even relate to just one single word - signifiQ.</article-title>
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Macherey</surname>
          </string-name>
          , Experts, errors, and
          <article-title>context: cantly impede comprehensibility and fidelity, potentially A large-scale study of human evaluation for ma- leading to incorrect interpretations. Based on this catchine translation, Transactions of the Association egorization, annotators assign a binary label indicating for Computational Linguistics 9 (</article-title>
          <year>2021</year>
          )
          <fpage>1460</fpage>
          -
          <lpage>1474</lpage>
          .
          <article-title>whether the translation is deemed comprehensible and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>N.</given-names>
            <surname>Campolungo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Saina</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <article-title>Nav- faithful. During the annotation process, annotators are igli, DiBiMT: A novel benchmark for measur- required to identify potential error patterns. Below, we ing Word Sense Disambiguation biases in Ma- report instances of error patterns often encountered in chine Translation</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , Machine Translation [
          <volume>36</volume>
          ]
          <string-name>
            <surname>:</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Villavicencio</surname>
          </string-name>
          (Eds.),
          <article-title>Proceedings of the 60th Annual Meeting of the Association for Com- 1. Incorrect translation of source words: One putational Linguistics (Volume 1: Long Pa- or more source words are inaccurately translated</article-title>
          .
          <source>pers)</source>
          ,
          <article-title>Association for Computational Linguistics, This error category also includes the use of Dublin</article-title>
          , Ireland,
          <year>2022</year>
          , pp.
          <fpage>4331</fpage>
          -
          <lpage>4352</lpage>
          . URL:
          <article-title>https:// incorrect terminology in the translation</article-title>
          . aclanthology.org/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>298</volume>
          . doi:
          <volume>10</volume>
          .18653/ v1/
          <year>2022</year>
          .acl
          <article-title>- long.298. 2. Omission of one or more words: Words from</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perrella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Campolungo</surname>
          </string-name>
          , T. Munda,
          <article-title>the source text are missing in the translation</article-title>
          . S. Koeva,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tiberius</surname>
          </string-name>
          , R. Navigli,
          <article-title>DiBiMT: A Gold Evaluation Benchmark for Studying Lexical Ambi- 3. Incorrect formulation of the output text: guity in Machine Translation, Computational Lin- Errors related to the syntactic and semantic guistics (</article-title>
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>79</lpage>
          . URL: https://doi.org/10.1162/
          <article-title>structure of the output text</article-title>
          .
          <source>coli_a_00541</source>
          . doi:
          <volume>10</volume>
          .1162/coli_a_
          <fpage>00541</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Minhas</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Ilyas, 4. Untranslated source text: One or more source Y. Li, Increasing coverage and precision of tex- words which are reproduced as-is in the output tual information in multilingual knowledge graphs, text, despite these words not being</article-title>
          commonly in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <article-title>Pro- used in the target language</article-title>
          .
          <source>ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Asso- 5</source>
          . Grammatical errors:
          <article-title>Errors in grammatical ciation for Computational Linguistics</article-title>
          , Singapore, agreement,
          <source>including mismatches in gender and 2023</source>
          , pp.
          <fpage>1612</fpage>
          -
          <lpage>1634</lpage>
          . URL: https://aclanthology. number. org/
          <year>2023</year>
          .emnlp-main.
          <volume>100</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          . emnlp- main.100.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>