<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>ITA-Bench: Towards a More Comprehensive Evaluation for Italian LLMs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <email>moroni@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Conia</string-name>
          <email>conia@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Federico Martelli</string-name>
          <email>martelli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <email>navigli@diag.uniroma1.it</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>CLiC-it 2024: Tenth Italian Conference on Computational Linguistics</institution>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Sapienza NLP Group, Dipartimento di Ingegneria Informatica, Automatica e Gestionale, Sapienza University of Rome</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent Large Language Models (LLMs) have shown impressive performance in addressing complex aspects of human language. These models have also demonstrated significant capabilities in processing and generating Italian text, achieving state-of-the-art results on current benchmarks for the Italian language. However, the number and quality of such benchmarks is still insufficient. A case in point is the “Open Ita LLM Leaderboard”, which only supports three benchmarks, despite being one of the most popular evaluation suites for the evaluation of Italian-language LLMs. In this paper, we analyze the current limitations of existing evaluation suites and propose two ways of addressing this gap: i) a new suite of automatically-translated benchmarks, drawn from the most popular English benchmarks; and ii) the adaptation of existing manual datasets so that they can be used to complement the evaluation of Italian LLMs. We discuss the pros and cons of both approaches, releasing our data to foster further research on the evaluation of Italian-language LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Italian Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        In recent years, Large Language Models (LLMs) have been
showing impressive results on an increasing range of
standard benchmarks, thanks in particular to their
reasoning and in-context-learning capabilities [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. The
current trend points towards increasingly larger
models trained on massive amounts of data [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ].
However, despite these advancements, there remains a
significant gap in the availability of high-quality
benchmarks for languages other than English, including
Italian, which is often considered too optimistically as a
high-resource language. Benchmarks are essential for
measuring progress in NLP, providing a standardized
way to evaluate and compare models, and this is now
especially important for Italian given the growing amount
of language-specific models that are being developed for
the language [
        <xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9">5, 6, 7, 8, 9</xref>
        ]. High-quality benchmarks
must be well-crafted to ensure they accurately reflect the
complexities of the language and the specific challenges
it presents.
      </p>
      <p>
        As of today, most of the existing Italian benchmarks
are translations of English datasets, which may not fully
capture the nuances and unique characteristics of the
Italian language. Nevertheless, the ability to automatically
translate English benchmarks into Italian is valuable and
enticing for two main reasons. First, it provides a way [...]
In addition to the translations of existing datasets from
English and the creation of brand-new datasets in Italian,
there is the option of adapting existing Italian datasets that
were originally created for a different purpose, to measure
the capabilities of modern LLMs, even though such
repurposed datasets may not allow the capabilities of modern
LLMs to be fully analyzed. This direction has gained
traction over the past few months, with efforts that focus on
repurposing Italian tests (usually designed for humans) to
evaluate LLMs instead [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <p>In this paper, we follow both directions and introduce
ITA-Bench, a more comprehensive benchmark suite for
the evaluation of Italian-language LLMs. First, ITA-Bench
proposes a new extended suite of benchmarks created
by automatically translating the most popular English
benchmarks into Italian. Second, ITA-Bench includes
existing manually curated datasets, adapted to enhance the
evaluation framework for Italian LLMs. These two
complementary approaches aim to bridge the evaluation gap
and provide a more thorough understanding of the
capabilities of Italian-language LLMs. With ITA-Bench, we
hope to foster further development and refinement of
evaluation techniques for Italian LLMs, ultimately
contributing to the broader field of multilingual NLP.
ITA-Bench is available at https://github.com/sapienzanlp/ita-bench.</p>
      <p>2. ITA-Bench: a New Evaluation Suite for Italian LLMs</p>
      <p>In this section, we introduce our methodology for the
creation of ITA-Bench, a more comprehensive evaluation
suite for Italian LLMs. Our objective is to focus on the
Italian language and, more specifically, to create a
benchmark suite that is able to test a wide variety of aspects
of LLMs that “generate” Italian text. To accomplish this
objective we focus on two distinct directions: i) translating
existing English benchmarks that are currently used to
evaluate the capabilities of state-of-the-art LLMs in
English, and ii) adapting existing Italian benchmarks,
drawing from popular repositories, conferences, shared tasks,
and community initiatives, such as the several EVALITA
editions1 and SemEval tasks.2 In the case of adaptation
of existing datasets, most of the work consists in adapting
the scope of the tasks, i.e., since many of these tasks were
not designed to evaluate LLMs, the core of the work lies
in reframing the problem in a way that a prompt can be
used to test the capability of a particular LLM to solve a
specific task. Table 1 reports the overall statistics of the
datasets that we consider for our ITA-Bench suite.</p>
      <p>
        The most popular and widely-used evaluation suite for
Italian produced via translation is perhaps the “Open Ita
LLM Leaderboard”. This is a collection of three datasets –
HellaSwag [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], MMLU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], and ARC-Challenge [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] – that were automatically translated into Italian.
Although this set of three benchmarks is generally
considered to be of high quality (thanks to the fact that the
translations were produced using GPT-3.5), there are still
several issues that limit the quality of this evaluation suite:
      </p>
      <p>Coverage: The Open Ita LLM Leaderboard only covers
three benchmarks. There are plenty of other datasets that
are generally used to test the capabilities of LLMs in
English, so limiting the assessment of Italian LLMs to just
three datasets may result in some important aspects of
their capabilities in Italian being overlooked.</p>
      <p>Reproducibility: The code and models used to translate
these three benchmarks are not directly available, making
it hard – if not impossible – to reproduce the translations.3</p>
      <p>Transparency: The fact that the translations are not
reproducible makes it difficult to analyze whether there
are errors or there is margin for improvement in the
translation process originally used to translate the three
benchmarks.</p>
      <p>English specificity: Despite the translation process,
these benchmarks actually remain tied to the English
language. Indeed, the prompts used as input to the language
model contain parts that are in English (for example, in
the creation of the examples used for few-shot evaluation).
This is undesirable because it inherently favours LLMs
that are bilingual, more specifically, LLMs that can
“speak” fluent English in addition to Italian.</p>
      <p>Uniformity: The translation of benchmarks from
English to a target language is usually done on a
benchmark-by-benchmark basis. On one hand, this allows developers
to specialize the translation code to each dataset; on the
other hand, this approach prevents the translation process
from being comparable across datasets, which makes
performing a root-cause analysis on the origin of an error in
the translated dataset more complex.</p>
    </sec>
    <sec id="sec-2">
      <title>2.1.2. Re-translating English benchmarks</title>
      <p>Here we describe our methodology, which is aimed at
addressing the issues that are present in existing
benchmark translations, including the ones used in the Open
Ita LLM Leaderboard. More specifically, we introduce a
new library called OBenTO (Open Benchmark
Translation for the Others) that is designed to translate
existing benchmarks in a uniform, reproducible and fully
transparent way. Moreover, it is also designed to be
easily extensible, in such a way that the research
community can add new benchmark translations and even
new languages besides Italian.</p>
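      <p>As an illustration of the uniform translation process described above, the core loop can be sketched as follows. This is a minimal sketch, not the actual OBenTO code: the function names and dataset fields are hypothetical, and a toy function stands in for the TowerLLM backbone.</p>
      <p>
```python
import json

def translate_benchmark(instances, translate_fn, text_fields):
    # Apply the same translation routine to every dataset, so that the
    # process stays uniform and any systematic error can be traced back
    # to a single place: the translate_fn backbone.
    translated = []
    for instance in instances:
        new_instance = dict(instance)
        for field in text_fields:
            value = instance[field]
            if isinstance(value, list):
                new_instance[field] = [translate_fn(v) for v in value]
            else:
                new_instance[field] = translate_fn(value)
        translated.append(new_instance)
    return translated

# Toy backbone standing in for an actual translation model.
mock_translate = lambda text: "[it] " + text

bench = [{"question": "What is water made of?", "choices": ["H2O", "CO2"]}]
out = translate_benchmark(bench, mock_translate, ["question", "choices"])
print(json.dumps(out))
```
      </p>
      <p>Because every dataset goes through the same routine, swapping the backbone (e.g., to compare two translation models) only requires changing translate_fn.</p>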
    </sec>
    <sec id="sec-3">
      <p>3: For example, the version of GPT-3.5 used to translate the
benchmarks is not known. Also note that OpenAI has already
deprecated many GPT-3.5 versions.</p>
      <sec id="sec-3-1">
        <title>2.1. Translating English Benchmarks</title>
        <sec id="sec-3-1-1">
          <title>2.1.1. Issues with existing translations</title>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <p>1: https://www.evalita.it/campaigns/
2: https://semeval.github.io/</p>
      <p>We release OBenTO at https://github.com/sapienzanlp/obento.</p>
      <p>
        Translation model. The OBenTO library is designed
to be easily adaptable to new backbones but, at the time of
writing this article, the library relies on TowerLLM [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], a recent open LLM that is built on top of open-weight
LLMs, such as LLaMA and Mistral. TowerLLM continues
the pretraining stage on 10 languages to improve the
multilingual capabilities of the starting LLM. Moreover,
TowerLLM is fine-tuned on translation and other
translation-related tasks, including grammar error correction,
named entity recognition and post-translation correction.
      </p>
      <p>Translated benchmarks. We translate the following
datasets from English to Italian:</p>
      <p>ARC Challenge and ARC Easy [17, ARC-C, ARC-E]:
These are two benchmarks on reasoning and scientific
knowledge, created from a single dataset; the ARC
Challenge split is obtained by selecting all those questions
that QA systems at the time were not able to answer
correctly.</p>
      <p>
        GSM8K [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: A benchmark that tests the capability of
an LLM to solve simple math problems whose solution
only requires the use of basic arithmetic operations.
      </p>
      <p>
        BoolQ [
        <xref ref-type="bibr" rid="ref20">20</xref>
        ]: A benchmark obtained from queries by
search engine users. The task consists in answering
Yes or No depending on an input passage that provides
context.
      </p>
      <p>HellaSwag [15, HS]: A commonsense reasoning dataset
that requires a system to select the most suitable
continuation for a given text, based on implicit commonsense
knowledge.</p>
      <p>In addition to the benchmarks translated with OBenTO,
ITA-Bench adapts existing Italian benchmarks drawn from
the EVALITA campaigns and the SemEval shared tasks.
These sources provide Italian data and annotations for a
variety of tasks, covering a broad spectrum of linguistic
capabilities and phenomena in the Italian language.
The key step in adapting these Italian benchmarks –
originally designed for different use cases – is to reframe
each task as a question answering task, enabling LLMs to
approach and solve them effectively through prompting.
In practice, this involves transforming the input of
each task into a natural question and the output into a
corresponding natural answer or continuation. Where
applicable, we also design a set of incorrect answers or
distractors of varying complexity. In our adaptation
process, we differentiate between two prompting strategies:
multiple-choice and cloze style. In the multiple-choice
approach, the LLM is given a question along with a
predetermined set of possible answers from which it must
choose the correct one. In the setting of adapting
existing benchmarks, the multiple-choice style also
encompasses binary classification prompting, where the only
possible responses are “sì” (yes) or “no”. In the cloze
style approach, instead, the LLM is required to generate
the correct answer based solely on the question or,
equivalently, the correct class verbalization for
classification tasks. Given the large search space of potential
answers in this format, the evaluation focuses on
ensuring that the likelihood of the correct answer is higher
than that of a predefined set of incorrect answers.
We discuss the details of the adaptation process for
each dataset in the following sections and in Appendix C.
We offer multiple-choice and cloze style implementations
for all datasets except QUANDHO and DISCOTEX,
which have only multiple-choice due to their sentence-
and paragraph-length choices.</p>
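      <p>The cloze-style evaluation described above, where the likelihood of the correct answer must exceed that of a predefined set of incorrect answers, can be sketched as follows. A toy scoring function stands in for a real language model's log-likelihood; a real evaluator would sum the model's token log-probabilities instead.</p>
      <p>
```python
import math

def toy_loglikelihood(prompt, continuation):
    # Stand-in for a language model: scores a continuation by trivial
    # token overlap with the prompt, penalized by its length.
    prompt_tokens = set(prompt.lower().split())
    cont_tokens = continuation.lower().split()
    overlap = sum(1 for t in cont_tokens if t in prompt_tokens)
    return overlap - math.log(1 + len(cont_tokens))

def cloze_is_correct(question, gold, distractors, score_fn):
    # The instance counts as solved if the gold answer is assigned a
    # higher likelihood than every distractor.
    gold_score = score_fn(question, gold)
    return all(gold_score > score_fn(question, d) for d in distractors)

q = "Quale animale miagola? Il gatto o il cane?"
print(cloze_is_correct(q, "il gatto miagola", ["il cane abbaia"], toy_loglikelihood))
```
      </p>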
      <p>
        MMLU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]: A benchmark which encompasses several
questions over 57 subjects across STEM, the humanities,
the social sciences, and more.
      </p>
      <p>
        PIQA [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ]: A benchmark that evaluates the capability
of an LLM to reason about physical interactions.
      </p>
      <p>
        SciQ [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]: A reading comprehension test set that
challenges an LLM to extract the answer from a passage
and question given in input.
      </p>
      <p>TruthfulQA [23, TQA]: A question answering
benchmark with a focus on popular misconceptions found
across the Web.</p>
      <p>Winogrande [24, WG]: A commonsense reasoning
dataset that requires choosing between two options
based on coreference resolution.</p>
      <p>AMI [25]: Automatic Misogyny Identification is a
classification task in which the goal is to understand
whether or not a tweet is misogynistic. The original
task is divided into two subtasks, Behaviour and Synth.
Behaviour consists in classifying a tweet into one of
three classes, namely, no misogyny, mild misogyny,
and aggressive misogyny. Instead, Synth consists of a
binary classification task, misogyny vs. no misogyny.
ITA-Bench includes both subtasks, but in this work we
focus on Synth, as Behaviour is more complex due to its
unbalanced class distribution.</p>
    </sec>
    <sec id="sec-6">
      <title>NERMuD [26]: Named Entity Recognition on Multi-domain Documents</title>
      <p>NERMuD was first presented at EVALITA 2023. The
task uses standard NER classes, namely, Person,
Organization, and Location, to tag entities in a text. In
ITA-Bench, we adapt NERMuD and create task instances
comprised of three elements: i) the sentence that contains
the entity mention, ii) the mention of the entity in the
sentence, and iii) the correct class associated with the
mention in the given context. We distinguish between two
subdomains: ADG, writings and speeches from the Italian
politician Alcide De Gasperi, and WN, news texts from
the past decades.</p>
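      <p>The adaptation described above can be illustrated with a minimal sketch; the prompt wording and the field names below are hypothetical, not the actual ITA-Bench templates.</p>
      <p>
```python
def nermud_instances(sentence, mentions):
    # Build one task instance per entity mention: the sentence, the
    # mention itself, and the gold class among Person, Organization,
    # and Location.
    classes = ["Person", "Organization", "Location"]
    instances = []
    for mention, gold in mentions:
        prompt = (
            "Frase: " + sentence + "\n"
            "A quale classe appartiene “" + mention + "”? "
            "Opzioni: " + ", ".join(classes)
        )
        instances.append({"prompt": prompt, "mention": mention, "label": gold})
    return instances

sent = "Alcide De Gasperi parla a Roma."
insts = nermud_instances(sent, [("Alcide De Gasperi", "Person"), ("Roma", "Location")])
print(len(insts), insts[1]["label"])
```
      </p>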
      <sec id="sec-6-1">
        <title>2.2. Adapting Italian Benchmarks</title>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>In addition to our new automatically-translated benchmarks, ITA-Bench also includes the adaptation of existing Italian benchmarks from two main sources: the EVALITA campaigns and the SemEval shared tasks</title>
      <p>Table 1: Statistics of the ITA-Bench datasets; for each
dataset, the cardinalities of the training, validation and
test sets are reported.
Dataset: Train set / Valid set / Test set
ARC-C: 1068 / 286 / 1132
ARC-E: 2157 / 549 / 2258
BoolQ: 9399 / 3259 / -
GSM8K: 7473 / - / 1319
HS: 39722 / 9998 / -
MMLU: 269 / 1402 / 13127
PIQA: 15038 / 1713 / -
SciQ and TruthfulQA: -- / 799823 / 9-85
Winogrande: 4717 / 1176 / -
AMI: 7014 / - / 2908
WiC: 2805 / 500 / 500
NERMuD: 14529 / 4079 / 3943
PreTENS: 5837 / - / 1560
PRELEARN: 2328 / - / 4699
DISCOTEX: 16000 / - / 1600
GhigliotinAI: 62 / - / 553
QUANDHO: 384 / - / 1416</p>
      <p>PRELEARN [28]: Prerequisite RElation LEARNing is
a task from EVALITA 2020 on concept prerequisite
learning. This task consists in identifying whether a
concept A is a prerequisite of another concept B, i.e.,
if learning concept B requires having already learnt
concept A. The original dataset comes with four domains,
namely, Geometry, Precalculus, Physics, and Data Mining,
and we maintain these same domains in ITA-Bench.</p>
      <p>WiC [29]: Word-in-Context for Italian. We focus
on the binary-classification sub-task of the original
formulation. In ITA-Bench, an LLM is tasked with
determining whether a word w occurring in two different
sentences s1 and s2 has the same meaning in s1 and s2.</p>
      <p>QUANDHO [30]: The QUestion ANswering Data
for Italian HistOry dataset is an Italian question
answering dataset focused on Italy’s history during the
first half of the 20th century. It provides Wikipedia
passages that may contain the answer to specific
questions. Each question in the dataset appears in
multiple (question, answer) pairs, where the answer can
be either correct or incorrect. In ITA-Bench, we select
the pair with an answer marked as correct and three
distractors from the occurrences of incorrect answers
paired with the same question.</p>
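      <p>As an illustration, the WiC adaptation described above can be framed as a binary sì/no prompt; the Italian wording below is hypothetical, not the actual ITA-Bench template.</p>
      <p>
```python
def wic_prompt(word, sentence1, sentence2):
    # Frame Word-in-Context as a binary question: does the target word
    # carry the same meaning in the two sentences?
    return (
        "Frase 1: " + sentence1 + "\n"
        "Frase 2: " + sentence2 + "\n"
        "La parola “" + word + "” ha lo stesso significato "
        "nelle due frasi? Rispondi “sì” o “no”."
    )

prompt = wic_prompt(
    "piano",
    "Suona il piano ogni sera.",
    "Ha un piano per il fine settimana.",
)
print(prompt)
```
      </p>
      <p>Under the multiple-choice formulation, the two admissible continuations (“sì” and “no”) are then scored by likelihood, exactly as for the other binary classification tasks.</p>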
      <p>DISCOTEX [27]: Assessing DIScourse COherence in
Italian TEXts is a task focused on modelling discourse
coherence in real-world Italian texts. In ITA-Bench, we
focus only on the first sub-task of DISCOTEX: Last
Sentence Classification, where, given a short input paragraph
and a sentence, the goal is to tell whether the sentence
is a valid continuation of the paragraph. To assess the
capability of an LLM to solve this task, we reframe
DISCOTEX as a multi-choice question answering task.
More specifically, given an input paragraph, the LLM is
tasked with selecting the most appropriate continuation
from among five options that we provide (the original
dataset does not provide distractors). Therefore, for the
subset of instances with valid continuations, we create
a set of distractors by sampling continuations from
other instances at random. Instead, for the instances
with invalid continuations, we create a new correct
option “nessuna delle precedenti” (none of the above), and
add a set of four random distractors from other instances.</p>
      <p>PreTENS4: Presupposed Taxonomies was first
proposed for SemEval-2022. This task focuses on
semantic competence, and evaluates the ability of
an LLM to recognize valid taxonomic relationships
between two nominal arguments. For example, this
can require recognizing whether or not a concept is a
subclass of another concept. In ITA-Bench, an LLM
is tasked with identifying whether the relationship
between two concepts in the same sentence is acceptable.</p>
      <p>GhigliottinAI: Starting from two different EVALITA
tasks, nlp4fun [31] and ghigliottin-AI [32], we collect
about 600 different games extracted from the TV
show and the board game of “L’Eredità”, a popular
quiz game in Italy. “La Ghigliottina” is a challenging
game that requires extensive knowledge of Italian
culture. The goal is to find a single word that links five
seemingly unrelated words. However, since multiple
solutions are often possible and computing all potential
answers is impractical, in ITA-Bench, we reframe the
problem as a multi-choice question answering task,
i.e., a simplified version in which four possible words
are given and, among these, only one can be linked to
all the five input words. In ITA-Bench, we also select
three distractor words in such a way that the distractors
are linked to three of the five input words. We ensure
that the distractors are not too similar to one another
by maximizing the cosine distance of their FastText
embeddings. The distractors are also designed to be at
most one character shorter or longer than the correct
word, resulting in a task that is easy for humans but
challenging for LLMs.</p>
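      <p>The distractor selection described above, which combines a length constraint with the maximization of pairwise cosine distance between word embeddings, can be sketched as follows. This is a toy illustration: a character-bigram hashing function stands in for the FastText vectors, and the candidate list is invented.</p>
      <p>
```python
import math
from itertools import combinations

def cosine_distance(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def pick_distractors(answer, candidates, embed, k=3):
    # Keep candidates at most one character shorter or longer than the
    # answer, then choose the k-subset whose words are mutually most
    # distant in embedding space.
    pool = [c for c in candidates if 1 >= abs(len(c) - len(answer))]
    best, best_score = None, -1.0
    for subset in combinations(pool, k):
        score = sum(cosine_distance(embed(a), embed(b))
                    for a, b in combinations(subset, 2))
        if score > best_score:
            best, best_score = subset, score
    return list(best)

def embed(word):
    # Toy character-bigram hashing embedding, just for the example.
    vec = [0.0] * 8
    for i in range(len(word) - 1):
        vec[hash(word[i:i+2]) % 8] += 1.0
    return vec

distractors = pick_distractors("carta", ["porta", "torta", "libro", "sole"], embed)
print(distractors)
```
      </p>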
    </sec>
    <sec id="sec-8">
      <title>3. Evaluation Results</title>
      <p>In this section, we discuss the results of various LLMs
on ITA-Bench: we first present the results on the
automatically-translated benchmarks and then on the
adapted benchmarks. ITA-Bench implements all the
task formulations using the lm-evaluation-harness
library [33], which allows us to calculate the likelihoods
of each possible continuation in a simple and comparable
way, as lm-evaluation-harness is also used by Hugging
Face for the Open LLM Leaderboard.</p>
    </sec>
    <sec id="sec-9">
      <title>3.1. Automatic Translation</title>
      <p>4: https://sites.google.com/view/semeval2022-PreTENS</p>
      <p>The open-weight models we evaluate are, grouped by
size: 0.4B Minerva-350M-base-v1.0; 1B Minerva-1B-base-v1.0;
3B OpenELM-3B, XGLM-2.9B, and Minerva-3B-base-v1.0;
7B OLMo-7B-0724-hf, LLaMAntino-2-7b, Minerva-7B-base-v1.0,
Mistral-7B-v0.1, Mistral-7B-Instruct-v0.1, and
Maestrale-chat-v0.4-beta; 8B Llama-3.1-8B, LLaMa-3.1-8B-Instruct,
and LLaMAntino-3-ANITA; 9B Italia-9B-Instruct-v0.1.</p>
      <p>The results of various LLMs on our translated
benchmarks are reported in Table 2, which provides an
overview of the zero-shot scores on cloze style task
formulations, i.e., the input prompt to an LLM includes only
the question without the possible answers. More
specifically, we compare the results of several open-weight
LLMs having different sizes, ranging from less than 1B
parameters up to 9B parameters, and focusing on LLMs
that have been pretrained, fine-tuned and/or adapted
on/to the Italian language. As we can see, the scores of
the LLMs are roughly correlated to their size in terms of
number of parameters. Notably, the smaller versions of
the Minerva LLMs are able to compete with larger
models, thanks to the fact that a significant portion of their
pretraining dataset is composed of Italian text (rather
than English).</p>
      <p>However, as we can see from the results on GhigliottinAI,
Italian LLMs seem to perform well and surpass the results
obtained by English models. This may indicate that this
task needs a different type of competence and/or
knowledge in order to be solved. Indeed, we hypothesize that
the task requires a deeper understanding of some elements
of the Italian culture, e.g., entities and concepts that are
more commonly known in Italy than in other countries.
Therefore, pretraining and fine-tuning on Italian
documents might be the key to obtaining better results in
GhigliottinAI.</p>
    </sec>
    <sec id="sec-10">
      <title>4. Manual Error Analysis</title>
      <p>
        In order to assess the quality and reliability of our
automatically-translated data, we conduct a manual
error analysis. To this end, we examine the translations
into Italian produced by four language models: two
open-source ones, namely, TowerInstruct-7B-v0.2 (footnote 5)
and TowerInstruct-Mistral-7B-v0.2 (footnote 6) [
        <xref ref-type="bibr" rid="ref24">34</xref>
        ], and two proprietary
ones, that is, GPT-3.5-turbo and GPT-4o-mini [
        <xref ref-type="bibr" rid="ref25">35</xref>
        ] (footnote 7). First,
we describe the data and the analysis procedure
employed. We then discuss the results of our manual analysis
and review some crucial error patterns.
      </p>
      <sec id="sec-10-1">
        <title>3.2. Adapting Italian Datasets</title>
      </sec>
      <sec id="sec-10-2">
        <title>4.1. Data and analysis procedure</title>
      </sec>
    </sec>
    <sec id="sec-11">
      <p>Moving to the adapted benchmarks in ITA-Bench, Table 3
reports the scores of different state-of-the-art models,
ranging from 350M-parameter models to 9B parameters.
Here, we focus on the results of the LLMs in cloze style
tasks, except for QUANDHO and DISCOTEX, as ITA-Bench
supports only the multi-choice formulation for these two
tasks. Unsurprisingly, the size of the LLMs and their
pretraining data are discriminators for reaching better
results. Most importantly, even the strongest Italian LLMs,
such as ANITA, still struggle to compete against their
English counterparts.</p>
      <p>5: https://huggingface.co/Unbabel/TowerInstruct-7B-v0.2
6: https://huggingface.co/Unbabel/TowerInstruct-Mistral-7B-v0.2
7: We employ the OBenTO pipeline to process the translations
generated by the open-source models. As for GPT-3.5-turbo, we
use the translations available at: https://huggingface.co/datasets/alexandrainst/m_arc.
We also translate the datasets using GPT-4o-mini with a pipeline
similar to the one used for GPT-3.5-turbo.</p>
      <p>As the source of the data for our linguistic analysis, we
rely on the ARC dataset, which includes multiple-choice
question answering in a wide range of domains.
Specifically, we randomly select a sample of 100 instances from
the ARC Challenge dataset and we manually analyze the
quality of the translations produced by all language
models considered. For each instance, we assess the degree of
comprehensibility and fidelity of the translation of both
questions and answers, assigning a binary label which
indicates whether a translation is acceptable or not.
Crucially, we distinguish between minor and major errors
depending on the impact on the comprehensibility and
fidelity of the target translation. We then identify error
patterns, some of which we describe below, highlighting
the cases in which the translation impedes understanding
of either the questions or the answers, or fails to
faithfully reproduce the source text, thus altering the original
meaning. Finally, we discuss the results of our analysis.
Annotation guidelines are reported in Appendix A.</p>
      <p>4.2. Results</p>
      <p>Our analysis shows that GPT-4o-mini outperforms all its
competitors. With an error rate8 of 4%, it is markedly
more accurate than TowerInstruct-7B-v0.2, which
exhibits an error rate of 23%. TowerInstruct-Mistral-7B-v0.2
and GPT-3.5-turbo show a similar performance, that
is, 8% and 9% error rate, respectively. Finally, the most
frequent error patterns are omissions, especially when
considering open-source models, and incorrect
terminology.</p>
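      <p>The inter-annotator agreement we report can be reproduced with the standard two-rater Cohen's kappa formula; the annotation labels below are illustrative toy data, not the actual annotations from our analysis.</p>
      <p>
```python
def cohens_kappa(labels_a, labels_b):
    # Standard two-rater Cohen's kappa: observed agreement p_o versus
    # chance agreement p_e estimated from the raters' label marginals.
    n = len(labels_a)
    p_o = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    categories = set(labels_a).union(labels_b)
    p_e = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (p_o - p_e) / (1.0 - p_e)

# Toy binary annotations (acceptable = 1, not acceptable = 0).
rater_1 = [1, 1, 1, 0, 0, 1, 0, 1, 1, 0]
rater_2 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 0]
print(round(cohens_kappa(rater_1, rater_2), 3))
```
      </p>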
      <sec id="sec-11-1">
        <title>4.1.1. Key error patterns</title>
        <p>As part of our manual annotation process, we identify
error patterns, of which we report four key ones, namely:
i) omissions, which consist in omitting one or multiple
source words in the translation; ii) incorrect terminology,
that is, the incorrect translation of one or multiple terms
into the target language; iii) untranslated source text,
where one or multiple source words are reported as-is
in the translation, despite these words not being
commonly used in the target language; and iv) grammatical
errors, which include orthographic, morphological and
syntactical errors. Instances of the aforementioned error
patterns can be found in Appendix B.</p>
        <p>4.1.2. Inter-annotator agreement</p>
        <p>In order to assess the reliability of our manual
annotations, we compute the inter-annotator agreement. With
this aim in view, we select the already-annotated
translations produced by one randomly-chosen model and
employ a new annotator to assess their quality based on
our guidelines. We obtain a Cohen’s kappa of 0.85, which
indicates a strong agreement.</p>
        <p>8: We emphasize that this error rate does not provide a nuanced
evaluation of fluency, idiomaticity and other crucial aspects of
translation.</p>
        <p>5. Conclusion</p>
        <p>In this paper, we introduce a novel evaluation suite
aimed at advancing the Italian community’s ability to
assess the competencies of LLMs on Italian data. Our
approach follows two main directions. First, we define
a novel pipeline called OBenTO, which involves
translating state-of-the-art English benchmarks into Italian.
Second, we rephrase existing Italian benchmarks to be
used for prompting and testing large language models.
Additionally, we conduct a comprehensive evaluation
of the quality of automatically translated benchmarks,
highlighting the inherent challenges of such an approach
and analyzing the errors made by four LLMs. We hope
that our work can provide a solid evaluation framework
for evaluating the capabilities of current and future LLMs
in Italian.</p>
      </sec>
    </sec>
    <sec id="sec-12">
      <title>Acknowledgments</title>
      <p>Simone Conia gratefully acknowledges the PNRR MUR
project PE0000013-FAIR, which fully funds his fellowship.
Federico Martelli and Roberto Navigli acknowledge the
support of the CREATIVE project (CRoss-modal
understanding and gEnerATIon of Visual and tExtual content,
Progetti di Interesse Nazionale - PRIN 2020). Finally, we
acknowledge the work of the M.Sc. students of Prof.
Navigli’s multilingual NLP course for the Academic Year 2024,
whose contributions provided valuable insights and ideas
for the adaptation of existing Italian benchmarks. We
acknowledge the CINECA award IsB28_medit under the
ISCRA initiative for the availability of high-performance
computing resources and support.</p>
      <p>A. Annotation Guidelines</p>
    </sec>
    <sec id="sec-13">
      <title>In this section, we report the annotation guidelines adopted to ensure consistency throughout our analysis</title>
      <p>Annotators receive a document containing the source text
and the translations produced by four language models,</p>
    </sec>
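      <p>The guidelines above feed into the inter-annotator agreement analysis of Section 4.1.2. As an illustration, agreement between two annotators over a shared set of labeled instances can be quantified with Cohen's kappa; the sketch below uses hypothetical labels and counts, not the paper's actual annotations.

```python
from collections import Counter

def cohen_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences of equal length."""
    assert len(a) == len(b) and len(a) > 0
    n = len(a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[label] * cb[label] for label in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels: "ok" vs. error categories such as "omission".
ann1 = ["ok", "omission", "ok", "terminology", "ok", "ok"]
ann2 = ["ok", "omission", "ok", "ok", "ok", "ok"]
print(round(cohen_kappa(ann1, ann2), 3))  # prints: 0.6
```

Kappa corrects raw percentage agreement for the agreement expected by chance, which matters when one label (here "ok") dominates.</p>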
    <sec id="sec-14">
      <title>6. Inadequate register</title>
      <p>The tone or style of the translation does not align with the context of the source text.</p>
      <p>7. Addition of one or more words: Additional words or phrases (not present in the source text) are included in the translation.</p>
      <p>[...] Un uomo con una combinazione di alleli RR per il tratto produce uno zigote con una donna con una combinazione di alleli rr per il tratto. Quale combinazione di alleli potrebbe verificarsi nello zigote?</p>
      <p>
        B. Examples of key error patterns
      </p>
      <p>
        In this section, we report examples of the key error patterns described in Section 4.1.1. Specifically, we report instances of omissions (Table 4), incorrect terminology (Table 5), untranslated source words (Table 6) and grammatical errors (Table 7). Errors are highlighted in red by square brackets. Importantly, all examples in the aforementioned tables are considered major errors, with the sole exception of the first instance of omission reported in Table 4: the omission of the word pests has a limited impact on the comprehensibility and fidelity of the translation and, therefore, for the purposes of the task at hand and of our analysis, the translation is considered acceptable. As for untranslated source words, we note several issues in the data. As reported in Table 4, GPT-4o-mini translates the term weathering with the Italian equivalent of erosion. However, weathering and erosion are two different geological processes. Weathering (which could be translated into Italian as degradazione meteorica) refers to the breaking down of rocks and minerals at their original location through physical, chemical, or biological means, without the material being moved elsewhere. In contrast, erosion involves the removal and transportation of weathered material by agents such as water, wind, or ice. Hence, by translating weathering with the Italian equivalent of erosion, the model fails to capture the precise meaning of the source term, significantly altering the content of the source text. Our error analysis also shows that MT systems still struggle with the disambiguation of concepts [
        <xref ref-type="bibr" rid="ref27 ref28">37, 38</xref>
        ] and entities [
        <xref ref-type="bibr" rid="ref13 ref29">39, 13</xref>
        ].
      </p>
    </sec>
    <sec id="sec-15">
      <title>C. Adapted Tasks Prompts</title>
      <p>In this section, we report all the prompts chosen for the adapted tasks. The cloze-style prompts are reported in Table 8, while the multi-choice-style prompts can be seen in Table 9. For each task, we also define a system prompt, i.e., a text prepended to the sample prompts given to the model; the proposed system prompts are reported in Table 10. We present all prompts in the same format as the LM-Evaluation-Harness implementation. To ensure clarity and conciseness, we use Jinja templating9 for all prompts.</p>
      <p>D. In-domain Results</p>
      <p>Results for PRELEARN and NERMuD are reported as average accuracies in the main part of this paper. Per-domain results for the two different tasks are reported in Table 11 and Table 12, while the corresponding zero-shot results are reported in Table 13 and Table 14. We report the results twice, separating the multi-choice and cloze-style prompt settings.</p>
      <p>E. Other results for adapted tasks</p>
      <p>In this section, we report further results for the adapted tasks. More precisely, Table 15 collects the metrics for the adapted tasks in the zero-shot setting, where all tasks are proposed in cloze-style prompting, except for DISCOTEX and QUANDHO, which are reported in multi-choice prompting. Since we employed Multi-Choice (MC) style prompting for all adapted tasks, Table 16 presents the results for these tasks in the zero-shot setting, while Table 17 shows the results in the five-shot setting.</p>
    </sec>
    <sec id="sec-16">
      <title>9https://jinja.palletsprojects.com/en/3.1.x/templates/</title>
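      <p>As noted in Appendix C, the prompts use Jinja templating; for instance, the PRELEARN system prompt contains the placeholder {{domain}}. The sketch below only mimics Jinja's {{var}} placeholder syntax with a stdlib regex (the actual pipeline uses the Jinja library itself); the domain value is illustrative.

```python
import re

def render(template: str, **variables) -> str:
    """Fill {{var}}-style placeholders, mimicking Jinja variable substitution."""
    return re.sub(r"\{\{\s*(\w+)\s*\}\}",
                  lambda m: str(variables[m.group(1)]),
                  template)

# System prompt for PRELEARN as reported in Table 10:
# "I seguenti concetti appartengono al dominio: {{domain}}."
system_prompt = "I seguenti concetti appartengono al dominio: {{domain}}."
print(render(system_prompt, domain="data mining"))
# prints: I seguenti concetti appartengono al dominio: data mining.
```

The rendered system prompt is then prepended to each sample prompt before it is passed to the model.</p>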
      <p>TowerInstruct-7B-v0.2
Una cellula [eugleno] possiede una struttura chiamata [macula occhiolare] che rileva la luce. Un [parameno]
non possiede una [macula occhiolare] e quindi non riesce a rilevare la luce. Perché un [parameno] non ha
bisogno di una [macula occhiolare]?
A plateau is most likely formed by [a] runoff from a river. [b] weathering by waves. [c] erosion of rock debris. [d] a buildup of cooled lava.</p>
      <p>Un plateau è più probabilmente formato da [a] deflusso da un fiume. [b] [erosione] da onde. [c] erosione di detriti rocciosi. [d] un accumulo di lava raffreddata.
In un bicchiere è [stato versato] dell’acqua fino a metà. Sono stati messi cinque cubetti di ghiaccio nel bicchiere, facendo sì che il livello dell’acqua raggiungesse il bordo del bicchiere. Quale delle seguenti affermazioni spiega al meglio l’aumento del livello dell’acqua?
Which of the following is most likely an adaptation that resulted from habitat destruction?</p>
      <p>Qual è più probabile un[’]adattamento che è risultato dalla distruzione dell’habitat?
[Flattened table residue: task labels PRELEARN, NERMuD, PreTENS, QuandHO, DISCOTEX, GhigliottinAI, WiC for the system prompts reported below.]</p>
      <p>Indica se i seguenti tweet presentano caratteristiche misogine.</p>
      <p>Data una frase e un’entità, indica se tale entità rappresenta un luogo, un’organizzazione o una persona.
Indica se le seguenti frasi hanno senso.</p>
      <p>Dati due concetti A e B, indica se il primo concetto è un prerequisito per il secondo.</p>
      <p>Il concetto A è prerequisito per il concetto B, se per comprendere B devi prima aver compreso A.
I seguenti concetti appartengono al dominio: {{domain}}.</p>
      <p>Ti saranno poste domande di storia italiana.</p>
      <p>Identifica quali paragrafi contengono la risposta alle domande date.</p>
      <p>Ti verranno poste delle domande nelle quali è presente un paragrafo, e come possibili risposte varie frasi che
possono essere o meno la continuazione del paragrafo.</p>
      <p>Indica la frase che rappresenta la continuazione più probabile del paragrafo, oppure “nessuna delle precedenti”
se nessuna delle continuazioni è corretta.</p>
      <p>Ti viene chiesto di risolvere il gioco della ghigliottina.</p>
      <p>Il gioco della ghigliottina consiste nel trovare un concetto che lega cinque parole date. Tale concetto è esprimibile
tramite una singola parola.</p>
      <p>Date due frasi, che contengono un lemma in comune, indica se tale lemma ha lo stesso significato in entrambe
le frasi.
[Flattened table residue: models Minerva-350M-base-v1.0, Minerva-1B-base-v1.0, XGLM-2.9B, OpenELM-3B, Minerva-3B-base-v1.0.]</p>
    </sec>
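    <p>Appendices D and E distinguish cloze-style from multi-choice (MC) prompting: in cloze style the model scores each full answer continuation (often length-normalized), while in MC style the options are listed in the prompt and only the option labels are scored. The sketch below uses a stand-in loglik function, not the paper's actual evaluation harness.

```python
def pick_cloze(prompt, choices, loglik):
    """Cloze style: score each full answer text, normalized by its length."""
    return max(choices, key=lambda c: loglik(prompt, c) / max(len(c), 1))

def pick_multi_choice(prompt, choices, loglik):
    """MC style: list the options in the prompt, then score only the labels."""
    labels = [chr(ord("a") + i) for i in range(len(choices))]
    listed = prompt + " " + " ".join(f"[{l}] {c}" for l, c in zip(labels, choices))
    best = max(labels, key=lambda l: loglik(listed, l))
    return choices[labels.index(best)]
```

With a real model, loglik(prompt, continuation) would return the model's log-likelihood of the continuation given the prompt; the two settings can therefore rank the very same choices differently, which is why the results are reported twice.</p>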
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Kojima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Matsuo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Iwasawa</surname>
          </string-name>
          ,
          <article-title>Large language models are zero-shot reasoners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>22199</fpage>
          -
          <lpage>22213</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hofmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          , E. Buchatskaya,
          <string-name>
            <given-names>T.</given-names>
            <surname>Cai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rutherford</surname>
          </string-name>
          , D. d. L.
          <string-name>
            <surname>Casas</surname>
            ,
            <given-names>L. A.</given-names>
          </string-name>
          <string-name>
            <surname>Hendricks</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Welbl</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Training compute-optimal large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2203.15556</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Bommasani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Rafel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yogatama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Metzler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. H.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Hashimoto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Dean</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <article-title>Emergent abilities of large language models</article-title>
          ,
          <source>Trans. Mach. Learn. Res</source>
          .
          <year>2022</year>
          (
          <year>2022</year>
          ). URL: https://openreview.net/forum?id=yzkSU5zdwD.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          , G. Fiameni, G. Semeraro,
          <article-title>Llamantino: Llama 2 models for effective text generation in Italian language</article-title>
          ,
          <source>arXiv preprint arXiv:2312.09993</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Santilli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Rodolà</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>Fauno: The Italian large language model that will leave you senza parole!</article-title>
          , in: F.
          <string-name>
            <given-names>M.</given-names>
            <surname>Nardini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Tonellotto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Ferrara (Eds.),
          <source>Proceedings of the 13th Italian Information Retrieval Workshop (IIR</source>
          <year>2023</year>
          ), Pisa, Italy, June 8- 9,
          <year>2023</year>
          , volume
          <volume>3448</volume>
          <source>of CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>17</lpage>
          . URL: https://ceur-ws.org/Vol-3448/paper-24.pdf
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silvestri</surname>
          </string-name>
          ,
          <article-title>DanteLLM: Let's push Italian LLM Research Forward!</article-title>
          , in: N.
          <string-name>
            <surname>Calzolari</surname>
            , M.-
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Kan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          <string-name>
            <surname>Hoste</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Lenci</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Sakti</surname>
          </string-name>
          , N. Xue (Eds.),
          <source>Proceedings of the 2024 Joint International Conference on Computational Linguistics</source>
          ,
          <article-title>Language Resources and Evaluation (LREC-COLING 2024), ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , pp.
          <fpage>4343</fpage>
          -
          <lpage>4355</lpage>
          . URL: https://aclanthology.org/2024.lrec-main.388
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <surname>G.</surname>
          </string-name>
          <article-title>Semeraro, Advanced Natural-based interaction for the Italian language: Llamantino-3-anita</article-title>
          , arXiv preprint arXiv:2405.07101 (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Barba</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Conia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Orlandini</surname>
            , G. Fiameni,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Navigli</surname>
          </string-name>
          , Minerva LLMs:
          <article-title>The first family of Large Language Models trained from scratch on Italian data</article-title>
          ,
          <source>Proc. of CLiC-it 2024 - Tenth Italian Conference on Computational Linguistics</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <article-title>ItaEval and TweetyIta: A new extensive benchmark and efficiency-first language model for</article-title>
          <source>Italian</source>
          ,
          <year>2024</year>
          . URL: https://rita-nlp.org/static/ItaEval_TweetyIta_v1.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hershcovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Frank</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lent</surname>
          </string-name>
          , M. de Lhoneux,
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brandl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Bugliarello</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Cabello</given-names>
            <surname>Piqueras</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Chalkidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Cui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Fierro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Margatina</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rust</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Søgaard</surname>
          </string-name>
          ,
          <article-title>Challenges and strategies in cross-cultural NLP</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Villavicencio (Eds.),
          <source>Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Dublin, Ireland,
          <year>2022</year>
          , pp.
          <fpage>6997</fpage>
          -
          <lpage>7013</lpage>
          . URL: https://aclanthology.org/2022.acl-long.482. doi:10.18653/v1/2022.acl-long.482.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <article-title>Biases in large language models: Origins, inventory, and discussion</article-title>
          ,
          <source>ACM J. Data Inf. Qual</source>
          .
          <volume>15</volume>
          (
          <year>2023</year>
          )
          <volume>10</volume>
          :
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          :
          <fpage>21</fpage>
          . URL: https://doi.org/10.1145/3597307. doi:10.1145/3597307.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. F.</given-names>
            <surname>Minhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Potdar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Towards cross-cultural machine translation with retrieval-augmented generation from multilingual knowledge graphs</article-title>
          ,
          <source>in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing</source>
          , Association for Computational Linguistics, Miami, Florida, USA,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2410.14057.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Esuli</surname>
          </string-name>
          ,
          <string-name>
            <surname>G. Puccetti,</surname>
          </string-name>
          <article-title>The invalsi benchmark: measuring language models mathematical and language understanding in Italian</article-title>
          ,
          <source>arXiv preprint arXiv:2403.18697</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: A. Korhonen, D. R. Traum, L. Màrquez (Eds.), Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28 - August 2, 2019, Volume 1: Long Papers, Association for Computational Linguistics, 2019, pp. 4791-4800. URL: https://doi.org/10.18653/v1/p19-1472. doi:10.18653/v1/p19-1472.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, in: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021, OpenReview.net, 2021. URL: https://openreview.net/forum?id=d7KBjmI3GmQ.
          [...] in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2022, Dublin, Ireland, May 22-27, 2022, Association for Computational Linguistics, 2022, pp. 3214-3252. URL: https://doi.org/10.18653/v1/2022.acl-long.229.
          [24] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, WinoGrande: An adversarial Winograd Schema Challenge at scale, Communications of the ACM 64 (2021) 99-106.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cowhey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Sab- [25]
          <string-name>
            <given-names>E.</given-names>
            <surname>Fersini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          , et al., Ami@ harwal,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schoenick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          ,
          <article-title>Think you have evalita2020: Automatic misogyny identification, in: solved question answering? try arc, the ai2 rea- Proceedings of the 7th evaluation campaign of Natusoning challenge</article-title>
          , arXiv preprint arXiv:
          <year>1803</year>
          .
          <article-title>05457 ral Language Processing and Speech tools for Italian (</article-title>
          <year>2018</year>
          ).
          <source>(EVALITA</source>
          <year>2020</year>
          ),
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <surname>D. M. Alves</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pombal</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          <string-name>
            <surname>Guerreiro</surname>
            ,
            <given-names>P. H.</given-names>
          </string-name>
          <string-name>
            <surname>Mar-</surname>
            [26]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Palmero Aprosio</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          <string-name>
            <surname>Paccosi</surname>
          </string-name>
          , et al., Nermud at tins, J.
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Farajian</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Rei</surname>
          </string-name>
          , P. Fernan- evalita
          <year>2023</year>
          :
          <article-title>overview of the named-entities recogdes, S. Agrawal</article-title>
          , et al.,
          <article-title>Tower: An open multilingual nition on multi-domain documents task, in: CEUR large language model for translation-related tasks</article-title>
          ,
          <source>WORKSHOP PROCEEDINGS</source>
          , volume
          <volume>3473</volume>
          , CEUR, arXiv preprint arXiv:
          <volume>2402</volume>
          .17733 (
          <year>2024</year>
          ).
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>K.</given-names>
            <surname>Cobbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kosaraju</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bavarian</surname>
          </string-name>
          , M. Chen, [27]
          <string-name>
            <given-names>D.</given-names>
            <surname>Brunato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Colla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Dini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. P. H.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Plappert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tworek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          , Radicioni,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ravelli</surname>
          </string-name>
          , et al.,
          <string-name>
            <surname>Discotex</surname>
            <given-names>at evalita R.</given-names>
          </string-name>
          <string-name>
            <surname>Nakano</surname>
          </string-name>
          , et al.,
          <article-title>Training verifiers to solve 2023: overview of the assessing discourse cohermath word problems</article-title>
          ,
          <year>2021</year>
          , URL https://arxiv. ence in Italian texts task, in: CEUR WORKSHOP org/abs/2110.14168 (
          <year>2021</year>
          ). PROCEEDINGS, volume
          <volume>3473</volume>
          ,
          <string-name>
            <surname>CEUR</surname>
          </string-name>
          ,
          <year>2023</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>8</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chang</surname>
          </string-name>
          , T. Kwiatkowski, [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Alzetta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miaschi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Koceva</surname>
          </string-name>
          , M. Collins,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Boolq: Exploring the I. Torre, Prelearn@ evalita 2020: Overview of surprising dificulty of natural yes/no questions, the prerequisite relation learning task for Italian</article-title>
          , in: J.
          <string-name>
            <surname>Burstein</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.),
          <article-title>Proceed- EVALITA Evaluation of NLP and Speech Tools for ings of the 2019 Conference of the North American Italian-December 17th,</article-title>
          <year>2020</year>
          (
          <year>2020</year>
          )
          <fpage>363</fpage>
          .
          <article-title>Chapter of the Association for Computational Lin-</article-title>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Cassotti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Siciliani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. C.</given-names>
            <surname>Passaro</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Gatto, guistics: Human Language Technologies</article-title>
          ,
          <string-name>
            <surname>NAACL- P. Basile</surname>
          </string-name>
          , et al.,
          <source>Wic-ita at evalita2023: Overview HLT</source>
          <year>2019</year>
          ,
          <article-title>Minneapolis</article-title>
          , MN, USA, June 2-7,
          <year>2019</year>
          ,
          <article-title>of the evalita2023 word-in-context for Italian task</article-title>
          ., Volume
          <volume>1</volume>
          (Long and Short Papers),
          <source>Association for EVALITA</source>
          (
          <year>2023</year>
          ).
          <source>Computational Linguistics</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>2924</fpage>
          -
          <lpage>2936</lpage>
          . [30]
          <string-name>
            <given-names>S.</given-names>
            <surname>Menini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Uva, “who was pietro URL: https://doi.org/10.18653/v1/n19-
          <fpage>1300</fpage>
          . doi:10. badoglio?”
          <article-title>towards a qa system for Italian his18653/V1/N19- 1300. tory</article-title>
          ,
          <source>in: Proceedings of the Tenth International</source>
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          , et al.,
          <source>Piqa: Conference on Language Resources</source>
          and
          <article-title>Evaluation Reasoning about physical commonsense in natural (</article-title>
          <source>LREC'16)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>430</fpage>
          -
          <lpage>435</lpage>
          . language,
          <source>in: Proceedings of the AAAI confer-</source>
          [31]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. De Gemmis</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Siciliani</surname>
          </string-name>
          , G.
          <source>Semerence on artificial intelligence</source>
          , volume
          <volume>34</volume>
          ,
          <year>2020</year>
          , pp.
          <fpage>aro</fpage>
          ,
          <article-title>Overview of the evalita 2018 solving language 7432-7439. games (nlp4fun) task, EVALITA Evaluation of NLP</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <source>Crowdsourcing and Speech Tools for Italian</source>
          <volume>12</volume>
          (
          <year>2018</year>
          )
          <article-title>75. multiple choice science questions</article-title>
          , in: L. Der- [32]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lovetere</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Monti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pascucci</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Sanczynski</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          , T. Baldwin (Eds.), Pro- gati, L. Siciliani,
          <article-title>Ghigliottin-ai@ evalita2020: Evalceedings of the 3rd Workshop on Noisy User- uating artificial players for the language game “la generated Text</article-title>
          ,
          <source>NUT@EMNLP</source>
          <year>2017</year>
          , Copenhagen, ghigliottina”,
          <source>EVALITA Evaluation of NLP and Denmark, September</source>
          <volume>7</volume>
          ,
          <year>2017</year>
          ,
          <article-title>Association for Com- Speech Tools for Italian-December 17th,</article-title>
          <year>2020</year>
          (
          <year>2020</year>
          ) putational Linguistics,
          <year>2017</year>
          , pp.
          <fpage>94</fpage>
          -
          <lpage>106</lpage>
          . URL: https:
          <fpage>345</fpage>
          . //doi.org/10.18653/v1/w17-
          <lpage>4413</lpage>
          . doi:
          <volume>10</volume>
          .18653/V1/ [33]
          <string-name>
            <given-names>L.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tow</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Abbasi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Biderman</surname>
          </string-name>
          , S. Black,
          <fpage>W17</fpage>
          -
          <lpage>4413</lpage>
          . A. DiPofi,
          <string-name>
            <given-names>C.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Golding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hsu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Le Noac</surname>
          </string-name>
          <article-title>'h,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hilton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Evans</surname>
          </string-name>
          , Truthfulqa: Measur
          <string-name>
            <surname>- H. Li</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>McDonell</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Muennighof</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <article-title>Ociepa, ing how models mimic human falsehoods</article-title>
          , in: J.
          <string-name>
            <surname>Phang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Reynolds</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Schoelkopf</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Skowron</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Sutawika</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          <string-name>
            <surname>Tang</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Thite</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Wang</surname>
          </string-name>
          ,
          <source>namely TowerInstruct-7B-v0.2</source>
          ,
          <string-name>
            <surname>TowerInstruct-MistralA. Zou</surname>
          </string-name>
          ,
          <article-title>A framework for few-shot language model 7B-v0.2, GPT-3.5-turbo and</article-title>
          <string-name>
            <surname>GPT-</surname>
          </string-name>
          4o-mini.
          <source>Annotators evaluation</source>
          ,
          <year>2023</year>
          . URL: https://zenodo.org/records/ are required to determine
          <source>the correctness of a transla10256836. doi:10</source>
          .5281/zenodo.10256836. tion.
          <article-title>In order for a translation to be deemed correct,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [34]
          <string-name>
            <surname>D. M. Alves</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pombal</surname>
            ,
            <given-names>N. M.</given-names>
          </string-name>
          <string-name>
            <surname>Guerreiro</surname>
            ,
            <given-names>P. H.</given-names>
          </string-name>
          <article-title>Mar- two key requirements must be satisfied, namely, comtins</article-title>
          , J.
          <string-name>
            <surname>Alves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Farajian</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Rei</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <article-title>Fer- prehensibility and fidelity. A translation is considered nandes, S</article-title>
          . Agrawal,
          <string-name>
            <given-names>P.</given-names>
            <surname>Colombo</surname>
          </string-name>
          , J. G. C. de Souza,
          <article-title>comprehensible if a native speaker can easily understand A</article-title>
          . F. T. Martins,
          <string-name>
            <surname>Tower:</surname>
          </string-name>
          <article-title>An open multilingual large the content of both the question and all the answers</article-title>
          .
          <article-title>Filanguage model for translation-related tasks, 2024. delity, on the other hand, refers to the degree to which arXiv:2402.17733. the translation conforms to the English source text</article-title>
          . In
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Akkaya, order to determine whether both requirements are adeF</article-title>
          . L.
          <string-name>
            <surname>Aleman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Almeida</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Altenschmidt</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <article-title>Alt- quately satisfied, we categorize translation errors as miman, S. Anadkat</article-title>
          , et al.,
          <source>Gpt-4 technical report</source>
          , nor or major.
          <source>While minor errors do not usually hamper arXiv preprint arXiv:2303.08774</source>
          (
          <year>2023</year>
          ).
          <article-title>the overall comprehensibility and fidelity</article-title>
          , major errors -
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>M.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Foster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Grangier</surname>
          </string-name>
          ,
          <string-name>
            <surname>V.</surname>
          </string-name>
          <article-title>Ratnakar, which might even relate to just one single word - signifiQ.</article-title>
          <string-name>
            <surname>Tan</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          <string-name>
            <surname>Macherey</surname>
          </string-name>
          , Experts, errors, and
          <article-title>context: cantly impede comprehensibility and fidelity, potentially A large-scale study of human evaluation for ma- leading to incorrect interpretations. Based on this catchine translation, Transactions of the Association egorization, annotators assign a binary label indicating for Computational Linguistics 9 (</article-title>
          <year>2021</year>
          )
          <fpage>1460</fpage>
          -
          <lpage>1474</lpage>
          .
          <article-title>whether the translation is deemed comprehensible and</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>N.</given-names>
            <surname>Campolungo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Saina</surname>
          </string-name>
          ,
          <string-name>
            <surname>R.</surname>
          </string-name>
          <article-title>Nav- faithful. During the annotation process, annotators are igli, DiBiMT: A novel benchmark for measur- required to identify potential error patterns. Below, we ing Word Sense Disambiguation biases in Ma- report instances of error patterns often encountered in chine Translation</article-title>
          , in: S. Muresan,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          , Machine Translation [
          <volume>36</volume>
          ]
          <string-name>
            <surname>:</surname>
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Villavicencio</surname>
          </string-name>
          (Eds.),
          <article-title>Proceedings of the 60th Annual Meeting of the Association for Com- 1. Incorrect translation of source words: One putational Linguistics (Volume 1: Long Pa- or more source words are inaccurately translated</article-title>
          .
          <source>pers)</source>
          ,
          <article-title>Association for Computational Linguistics, This error category also includes the use of Dublin</article-title>
          , Ireland,
          <year>2022</year>
          , pp.
          <fpage>4331</fpage>
          -
          <lpage>4352</lpage>
          . URL:
          <article-title>https:// incorrect terminology in the translation</article-title>
          . aclanthology.org/
          <year>2022</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>298</volume>
          . doi:
          <volume>10</volume>
          .18653/ v1/
          <year>2022</year>
          .acl
          <article-title>- long.298. 2. Omission of one or more words: Words from</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Perrella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Campolungo</surname>
          </string-name>
          , T. Munda,
          <article-title>the source text are missing in the translation</article-title>
          . S. Koeva,
          <string-name>
            <given-names>C.</given-names>
            <surname>Tiberius</surname>
          </string-name>
          , R. Navigli,
          <article-title>DiBiMT: A Gold Evaluation Benchmark for Studying Lexical Ambi- 3. Incorrect formulation of the output text: guity in Machine Translation, Computational Lin- Errors related to the syntactic and semantic guistics (</article-title>
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>79</lpage>
          . URL: https://doi.org/10.1162/
          <article-title>structure of the output text</article-title>
          .
          <source>coli_a_00541</source>
          . doi:
          <volume>10</volume>
          .1162/coli_a_
          <fpage>00541</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U.</given-names>
            <surname>Minhas</surname>
          </string-name>
          ,
          <string-name>
            <surname>I.</surname>
          </string-name>
          <article-title>Ilyas, 4. Untranslated source text: One or more source Y. Li, Increasing coverage and precision of tex- words which are reproduced as-is in the output tual information in multilingual knowledge graphs, text, despite these words not being</article-title>
          commonly in: H.
          <string-name>
            <surname>Bouamor</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Pino</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          Bali (Eds.),
          <article-title>Pro- used in the target language</article-title>
          .
          <source>ceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Asso- 5</source>
          . Grammatical errors:
          <article-title>Errors in grammatical ciation for Computational Linguistics</article-title>
          , Singapore, agreement,
          <source>including mismatches in gender and 2023</source>
          , pp.
          <fpage>1612</fpage>
          -
          <lpage>1634</lpage>
          . URL: https://aclanthology. number. org/
          <year>2023</year>
          .emnlp-main.
          <volume>100</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2023</year>
          . emnlp- main.100.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>