<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Sustainable Italian LLM Evaluation: Community Perspectives and Methodological Guidelines</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Luca Moroni</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Gianmarco Pappacoda</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Barba</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simone Conia</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Andrea Galassi</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Bernardo Magnini</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Navigli</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paolo Torroni</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Zanoli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler (FBK)</institution>
          ,
          <addr-line>Trento</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Sapienza NLP Group, Sapienza University of Rome</institution>
          ,
          <addr-line>Rome</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università di Bologna</institution>
          ,
          <addr-line>Bologna</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>The evaluation of large language models for Italian faces unique challenges due to morphosyntactic complexity, dialectal variation, culture-specific knowledge, and limited availability of computational resources. This position paper presents a comprehensive framework for Italian LLM benchmarking, in which we identify key dimensions for LLM evaluation, including linguistic capabilities, knowledge domains, task types and prompt variations, and propose high-level methodological guidelines for current and future initiatives. We advocate a community-driven, sustainable benchmarking initiative that incorporates dynamic dataset management, open model prioritization, and collaborative infrastructure utilization. Our framework aims to establish a coordinated effort within the Italian NLP community to ensure rigorous, scientifically sound evaluation practices that can adapt to the evolving landscape of Italian LLMs.</p>
      </abstract>
      <kwd-group>
        <kwd>Benchmarking</kwd>
        <kwd>Italian LLMs</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>A consideration in developing language-specific benchmarks involves the trade-offs between creating native content and translating from existing English resources. Indeed, while translation offers scalability and cross-linguistic comparability, it may fail to capture language-specific phenomena, cultural nuances, and idiomatic expressions that are crucial for comprehensive evaluation. Native Italian benchmarks, conversely, provide authentic linguistic challenges but require substantial expertise and resources to develop and maintain.</p>
      <p>This position paper synthesizes community experiences in benchmarking Italian LLMs and proposes actionable guidelines with the objective of incentivizing the development of more and better Italian LLM evaluation resources in a sustainable manner. We address four fundamental questions:</p>
      <p>• Section 2: What to benchmark – a framework for prioritizing linguistic capabilities, knowledge domains, and task types in Italian LLM evaluation.</p>
      <p>• Section 3: How to benchmark – methodological considerations including prompt engineering, evaluation metrics, and aggregation strategies.</p>
      <p>• Section 4: Where to benchmark – which datasets and tasks to consider for a comprehensive evaluation.</p>
      <p>• Section 5: Sustainable benchmarking – addressing organizational, computational, and financial challenges for long-term viability.</p>
      <sec id="sec-1-1">
        <title>We present empirical insights, practical guidelines, and open research questions to encourage community dialogue toward establishing comprehensive, sustainable evaluation standards for Italian LLMs.</title>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. What to Benchmark</title>
      <p>The fundamental question of what to benchmark in Italian LLM evaluation requires careful consideration of the nature of language understanding and generation capabilities. While English-centric benchmarks have established evaluation paradigms for general language understanding, Italian presents unique linguistic challenges that may require datasets and tasks designed specifically for the language, i.e., native Italian benchmarks, rather than relying solely on translated English resources. Drawing from established evaluation frameworks, as well as Italian-specific initiatives, we propose a systematic approach to characterizing the evaluation space along three critical dimensions that collectively capture the breadth of abilities essential for robust Italian language modeling.</p>
      <p>Italian presents several distinctive features that distinguish it from well-studied languages like English: rich morphological inflection with complex agreement systems, relatively free word order with pragmatic constraints, extensive use of clitics and null subjects, and a wealth of dialectal variation across regions. These characteristics, combined with Italy’s unique cultural and institutional landscape, create specific challenges for language model evaluation that cannot be adequately addressed through direct translation of existing English benchmarks. To address these challenges, we propose a multi-dimensional framework for Italian LLM evaluation that captures the essential linguistic and cultural dimensions of language understanding and generation, as illustrated in Figure 1. Table 1 summarizes the coverage of 25 publicly available datasets within our proposed evaluation ontology, highlighting the need for comprehensive benchmarks that encompass a wide range of linguistic phenomena, knowledge domains, and task types.</p>
      <p>2.1. Linguistic Competence</p>
      <p>This dimension covers the basic language skills needed for understanding at different levels. Italian’s typological characteristics, as a Romance language with rich morphology and relatively flexible syntax, create evaluation challenges distinct from those posed by English or other languages. Our framework distinguishes between five hierarchical levels of linguistic analysis:</p>
      <sec id="sec-2-1">
        <title>Morphological Processing</title>
        <p>Morphological Processing constitutes the foundation, testing models’ ability to handle word formation,
inflection, and morpho-syntactic agreement. Recent
work has demonstrated the value of elementary linguistic
tasks [22] in revealing fundamental model capabilities
that may be obscured in more complex scenarios. For
Italian, this includes evaluating comprehension of
gender and number agreement (la casa bianca vs. i tavoli
bianchi), complex verbal conjugation patterns across
tenses and moods (andrei, andresti, andrebbe), and
productive derivational morphology (camminare →
camminabile → camminabilità). Unlike English, where
morphological complexity is relatively limited, Italian models
must demonstrate robustness to a wide range of
inflectional and derivational forms, including irregular verbs
and noun-adjective agreement patterns.</p>
        <p>Lexical Knowledge assessment focuses on vocabulary breadth, semantic relations, and word-level disambiguation capabilities. This includes traditional tasks, such as word sense disambiguation (WSD), as some Italian verbs are particularly polysemous, like prendere (to take, catch, get, have) and dare (to give, provide, yield). Evaluation must also address lexical-semantic knowledge specific to Italian cultural and linguistic contexts, including understanding of false friends with other Romance languages (burro means butter, not donkey) and recognition of regional lexical variants (anguria vs. cocomero for watermelon).</p>
        <p>Figure 1. Linguistic Competence: Morphology (inflection, conjugation, agreement patterns); Lexicon (vocabulary, idioms, multi-word expressions); Syntax (word order, parsing, complex structures); Semantics (meaning, disambiguation, inference); Pragmatics (context, discourse, communicative intent). Domain &amp; Knowledge Specialization: Domain Coverage (legal, medical, technical, literary texts); Cultural Knowledge (Italian culture, history, social contexts); Register Adaptation (formal, informal, regional varieties). Task Generalization &amp; Instruction Following: Linguistic Instructions (following Italian language directives); Task Generalization (adapting to new task formats); Cross-linguistic Transfer (leveraging multilingual knowledge).</p>
        <p>Syntactic Processing evaluates models’ grasp of Italian sentence structure, including complex phenomena that distinguish Italian from more configurational languages. Key areas include clitic placement and climbing (lo voglio vedere vs. voglio vederlo), null subject licensing and pro-drop parameters, and the pragmatic constraints governing word order flexibility. Italian’s ability to express the same propositional content through multiple syntactic configurations (Mario ha visto Lucia; Lucia, Mario l’ha vista; L’ha vista Mario, Lucia) requires models to understand both structural possibilities and their discourse functions.</p>
        <p>Semantic Processing encompasses both compositional semantics, i.e., how meaning is constructed from constituent parts, and pragmatic inference capabilities. This includes tasks such as textual entailment, semantic parsing, irony detection, and sentiment analysis, which require deeper contextual understanding. Italian’s rich system of grammaticalized aspect and mood markers (stava per partire vs. era sul punto di partire vs. stava partendo) creates semantic distinctions that must be captured in evaluation frameworks.</p>
        <p>Pragmatic Processing represents the highest level of linguistic competence, evaluating models’ ability to understand language in context and interpret communicative intentions beyond literal meaning. Key evaluation areas include discourse coherence and cohesion, where models must track referential relations across extended texts and maintain thematic continuity. Italian’s rich system of discourse markers (magari, dunque, allora, comunque) and the pragmatic functions of syntactic variations require sophisticated contextual understanding. Additionally, models must demonstrate sensitivity to speech acts and politeness, understanding when indirect requests (non è che potresti...) are more appropriate than direct imperatives, and recognizing the pragmatic force of conditional constructions (e.g., sarebbe possibile vs. è possibile).</p>
        <sec id="sec-2-1-1">
          <title>2.2. Domain and Knowledge</title>
          <p>The second dimension addresses the world knowledge encoded in language models, with particular attention to Italian-specific cultural, historical, and institutional contexts. This dimension recognizes that language competence extends beyond linguistic phenomena to encompass domain-specific expertise and cultural awareness, which becomes particularly important given the country’s distinctive historical, geographical, political, legal, and cultural landscape.</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Domain Coverage</title>
          <p>Domain Coverage spans traditional academic disciplines (mathematics, natural sciences, humanities) as well as specialized professional domains where Italian-specific terminology, concepts, and practices may be essential. Legal reasoning presents a particularly challenging case: while mathematical reasoning may transfer readily across languages, Italian legal discourse requires deep familiarity with concepts like concordato preventivo, the distinc-</p>
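          <table-wrap>
            <table>
              <thead>
                <tr><th>Dataset</th><th>Morphology</th><th>Lexical</th><th>Syntax</th><th>Semantics</th><th>Pragmatics</th><th>Domain</th><th>Culture</th><th>Register</th><th>Ling. Instr.</th><th>Task Gen.</th><th>Cross-Ling.</th></tr>
              </thead>
              <tbody>
                <tr><td>AI2-ARC</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>BoolQ</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>GSM8K</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>HellaSwag</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>MMLU</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>PIQA</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>SciQ</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>TruthfulQA</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>WinoGrande</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>Admission Test</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>AMI 2020</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>CLinkaRT 2023</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>DiscoTEX</td><td>✗</td><td>✗</td><td>✓</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>GhigliottinAI</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>HaSpeeDe2</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>LexSub</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>NERMUD</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>PreLearn20</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>PreTENS 22</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>QA4FAQ</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>QuandHo</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>SENTIPOLC</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>Sum-FP</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td></tr>
                <tr><td>Textual Entailment</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>WiC-ITA</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
                <tr><td>ITA-Bench</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✓</td></tr>
                <tr><td>EvalITA-LLM</td><td>✗</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✗</td><td>✓</td><td>✗</td><td>✗</td></tr>
                <tr><td>ITALIC</td><td>✓</td><td>✓</td><td>✓</td><td>✓</td><td>✗</td><td>✓</td><td>✓</td><td>✗</td><td>✗</td><td>✗</td><td>✗</td></tr>
              </tbody>
            </table>
          </table-wrap>
          <p>from the elaborate bureaucratic language of Italian public administration (linguaggio burocratico) to the informal, creative language of social media. Italian’s rich system of honorifics and address forms, e.g., when to use tu, lei, and voi, and the use of conditional forms for politeness (vorrei vs. voglio), requires social awareness that goes beyond linguistic competence. Academic Italian, with its distinctive structures and vocabulary (altresì, peraltro, laddove), represents another crucial register for evaluation.</p>
          <p>2.3. Task Generalization and Instruction Following</p>
          <p>The third dimension captures models’ ability to understand and execute new, unseen instructions, a capability that has become increasingly important in practical LLM applications. This dimension should be equally relevant for Italian LLMs, as instruction-following capabilities must transfer across linguistic and cultural boundaries while maintaining sensitivity to Italian-specific communicative norms and expectations.</p>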
        </sec>
      </sec>
      <sec id="sec-2-2">
        <title>Linguistic Instruction Following</title>
        <p>Linguistic Instruction Following encompasses tasks that require manipulation of language itself, demonstrating meta-linguistic awareness. For Italian, this includes style transfer tasks that require understanding of register differences, e.g., converting formal business correspondence (Con la presente si comunica che...) to informal messaging (Ti scrivo per dirti che...), or adapting academic writing to journalistic style. Grammar presents particular challenges: shifting from passato prossimo to passato remoto depending on regional preferences, converting between active and passive constructions while maintaining appropriate clitic placement, and handling person shifts in embedded structures. Content restructuring, such as summarization with specific constraints (e.g., “riassumi in 50 parole mantenendo un tono formale”), tests not only linguistic competence but also adherence to culturally appropriate communication patterns.</p>
        <p>Task Generalization evaluates models’ ability to adapt to novel task formats and requirements based on natural language descriptions, without task-specific training. This includes assessment of few-shot learning capabilities in Italian contexts, where models must quickly adapt to new domains or specialized vocabularies. For instance, a model might need to learn medical terminology from a few examples and then apply it consistently, or understand the conventions of Italian legal citation formats from brief instructions. The ability to combine multiple sub-tasks in complex workflows, such as extracting information from a bureaucratic document, reformatting it according to specific guidelines, and generating a summary in a different register, represents a crucial capability</p>
        <p>Table 1. Coverage of 25 publicly available datasets and 3 frameworks (ITA-Bench, EvalITA-LLM, and ITALIC) within the proposed Italian LLM evaluation ontology (✓ = covered, ✗ = not).</p>
        <p>tion between dolo and colpa, and the complex structure of Italian administrative law (TAR, Consiglio di Stato). Medical terminology, with its mixture of Latin roots, Italian adaptations, and regional variations, is another similar challenge. Educational contexts require understanding of the Italian school system’s structure (liceo classico, istituto tecnico, scuola dell’infanzia) and grading systems (giudizio vs. voto).</p>
        <p>Cultural and Contextual Knowledge evaluation addresses the understanding of Italian history, geography, social institutions, and contemporary cultural references. This encompasses knowledge of Italy’s regional diversity, ranging from linguistic varieties (understanding when someone uses scialla) to culinary traditions (knowing that ragù varies significantly between Bologna and Naples) to historical references (recognizing allusions to Tangentopoli or the anni di piombo). Models must also be aware of the contemporary Italian media landscape, political discourse, and social issues, with appropriate cultural sensitivity, while at the same time avoiding stereotypes or biases that may arise from training data and also staying updated with new events.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Genre and Register Adaptation</title>
        <p>Genre and Register Adaptation tests models’ sensitivity to different text types and communicative contexts,</p>
        <sec id="sec-2-3-1">
          <title>Question: Given the context "Marco Rossi è nato</title>
          <p>a Milano nel 1985", which entity does "Milano"
refer to?</p>
        </sec>
        <sec id="sec-2-3-2">
          <title>A) Milano, Texas (USA)</title>
        </sec>
        <sec id="sec-2-3-3">
          <title>B) Milano, Italy (city)</title>
        </sec>
        <sec id="sec-2-3-4">
          <title>C) Milano Marittima (resort town)</title>
        </sec>
        <sec id="sec-2-3-5">
          <title>D) Milano Centrale (train station)</title>
          <p>Answer:</p>
          <p>where the model is expected to generate the correct option letter (e.g., "B") as its response. This approach allows for leveraging existing evaluation metrics while adapting to the generative capabilities of modern LLMs.</p>
          <p>for practical applications.</p>
          <p>Cross-Linguistic Instruction Transfer addresses the challenge of Italian LLMs operating in multilingual contexts. This includes handling instructions that may draw upon multilingual contexts (e.g., “traduci questo testo inglese mantenendo il tono ironico”) or require code-switching between Italian and other languages, particularly English in technical contexts. LLMs must demonstrate sensitivity to when code-switching is appropriate versus when maintaining linguistic purity is required, understanding contexts where English technical terms are standard (software, hardware) versus where Italian equivalents are preferred (programma vs. software).</p>
          <p>Guidelines on What to Benchmark. Our proposed framework (Figure 1) could be used for a structured and systematic categorization of Italian LLM evaluation tasks. By encouraging task designers to be explicit and transparent about which dimensions their tasks cover, the research community can more effectively allocate time, expertise, and resources toward areas that are currently underrepresented. This, in turn, would allow for a richer and more fine-grained understanding of model capabilities across a broad spectrum of competencies, as illustrated in Table 1, highlighting concrete gaps, for example, the pressing need for a greater number of evaluation tasks that assess pragmatic processing, adaptation to different registers and sociolinguistic contexts, as well as the ability to transfer instructions across languages in cross-linguistic scenarios.</p>
          <p>Multiple-choice question adaptation has become a prevalent strategy in LLM evaluation [9, 23, 24], including Italian evaluations [19, 21], due to its simplicity (i.e., one only needs to compare the label generated by the model with the correct label) and its low computational cost. However, it is important to note that this approach is not truly reflective of real-world applications, where models are often expected to generate free-form text rather than select from predefined options. Moreover, multiple-choice question evaluation presents several persistent challenges for assessing LLMs. Different evaluation strategies often yield inconsistent results [25], and – with the emergence of reasoning-intensive models [26] – extracting the intended answer is not always straightforward [27].</p>
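          <p>A minimal sketch of the regular-expression route to scoring multiple-choice outputs, assuming the model returns free-form text. The helper names format_mcq_prompt, extract_choice and accuracy, as well as the two patterns, are illustrative assumptions rather than part of any cited benchmark; the example item is the NED question about Milano.</p>
          <preformat><![CDATA[
```python
import re
from typing import Optional, Sequence

LETTERS = "ABCD"


def format_mcq_prompt(question: str, options: Sequence[str]) -> str:
    """Render a question and its options as a lettered multiple-choice prompt."""
    lines = [f"Question: {question}"]
    lines += [f"{LETTERS[i]}) {opt}" for i, opt in enumerate(options)]
    lines.append("Answer:")
    return "\n".join(lines)


# Hypothetical patterns, tried in order:
# 1) a keyword ("answer" / "risposta", any case) followed within 20 characters
#    by a standalone uppercase option letter;
# 2) a bare option letter at the very start of the response, e.g. "B" or "B)".
_PATTERNS = [
    re.compile(r"\b(?i:answer|risposta)\b.{0,20}?\b([A-D])\b"),
    re.compile(r"^\s*\(?([A-D])\b"),
]


def extract_choice(response: str) -> Optional[str]:
    """Return the option letter found in a free-form response, or None."""
    for pattern in _PATTERNS:
        match = pattern.search(response)
        if match:
            return match.group(1)
    return None


def accuracy(responses: Sequence[str], gold_labels: Sequence[str]) -> float:
    """Share of responses whose extracted letter equals the gold label.

    Responses with no extractable letter count as errors.
    """
    hits = sum(extract_choice(r) == g for r, g in zip(responses, gold_labels))
    return hits / len(gold_labels)


ned_prompt = format_mcq_prompt(
    'Given the context "Marco Rossi è nato a Milano nel 1985", '
    'which entity does "Milano" refer to?',
    [
        "Milano, Texas (USA)",
        "Milano, Italy (city)",
        "Milano Marittima (resort town)",
        "Milano Centrale (train station)",
    ],
)
```
]]></preformat>
          <p>Patterns of this kind remain brittle: a reply such as "The answer is A city in Italy" would be read as option A, which is one reason to track extraction failures separately from wrong answers.</p>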
        </sec>
      </sec>
      <sec id="sec-2-4">
        <title>3. How to Benchmark</title>
        <p>3.1. Task Formulation</p>
        <p>The shift towards generative language models requires reconsideration of traditional NLP evaluation paradigms, particularly for discriminative tasks that formed the backbone of earlier evaluation efforts when classification and regression were the primary focus.</p>
        <p>Multiple-Choice Question Adaptation has emerged as an easy-to-implement approach for bridging traditional evaluation paradigms with generative model capabilities. By recasting discriminative tasks as prompted generation problems, this approach enables evaluation of models’ reasoning processes while maintaining compatibility with established evaluation metrics. For example, Named Entity Disambiguation (NED) tasks can be reformulated as multiple-choice questions as follows:</p>
        <p>Open-Ended Generation Tasks represent the most authentic form of generative evaluation, allowing models to produce free-form text responses. However, this approach introduces significant challenges in terms of evaluation consistency and reliability, particularly for tasks that require subjective judgment or cultural context understanding. For example, an Instruction Following (IF) task can be formulated as an open-ended task as follows:</p>
        <sec id="sec-2-4-1">
          <title>Instruction: I am planning a trip to Italy, and</title>
          <p>I would like you to write an itinerary for my
journey in a Shakespearean style. You are not
allowed to use any commas in your response.</p>
          <p>Answer:</p>
          <p>where the model is expected to generate a coherent and correct answer following the guidelines imposed by the instruction (“Shakespearean style”), about a trip to “Italy”. Evaluating a model’s ability to generate a coherent and contextually appropriate response to an open-ended question about Italian culture may require human annotators with specific cultural knowledge, leading to potential biases and inconsistencies in scoring. The open-ended paradigm offers several distinct advantages: it enables assessment of reasoning processes and explanation quality, allows for partial credit scoring based on response components (e.g., a sound trip schedule, and adherence to the writing style), and more closely mirrors real-world deployment scenarios where models must generate free-form responses. However, open-ended formulation introduces significant challenges, including increased computational costs, the need for complex answer validation methods, LLM-as-a-Judge evaluators, and task-specific evaluation metrics that may need to be designed for each domain and application.</p>
          <p>3.2. Task Evaluation</p>
          <p>There are two main strategies for evaluating the output of generative models: probability-based evaluation and generative evaluation. These approaches differ in how they assess model outputs, with significant implications for benchmark design.</p>
          <p>Probability-Based Evaluation relies on computing the likelihood of specific continuations given a context, leveraging the model’s internal probability distribution over tokens. This approach is particularly well-suited for tasks where the model must select among predefined options, such as multiple-choice questions or cloze completion tasks. The evaluation is based on the model’s ability to assign higher probabilities to correct answers compared to incorrect ones. More formally, given a context <inline-formula><tex-math>c</tex-math></inline-formula> and a set of options <inline-formula><tex-math>O = \{o_1, o_2, \dots, o_n\}</tex-math></inline-formula>, the evaluation computes the probabilities <inline-formula><tex-math>P(o_i \mid c)</tex-math></inline-formula> for each option <inline-formula><tex-math>o_i</tex-math></inline-formula> and selects the one with the highest probability as the model’s implicit choice. In the previous example, the model would compute probabilities for each option: <inline-formula><tex-math>P(\text{``Milano, Italy''} \mid \text{context})</tex-math></inline-formula>, <inline-formula><tex-math>P(\text{``Milano, Texas''} \mid \text{context})</tex-math></inline-formula>, etc. Alternatively, for computational efficiency, evaluation can be performed on option labels: <inline-formula><tex-math>P(\text{``B''} \mid \text{context})</tex-math></inline-formula>, though this approach may lose semantic information and introduce artifacts related to label order and bias [28].</p>
          <p>The main advantages of probability-based evaluation include computational efficiency – particularly when computing probabilities of single-token continuations – and the ability to assess model confidence through probability margins. However, this approach faces several limitations that become particularly pronounced in Italian contexts. Length bias can systematically favor shorter options, as longer sequences have lower joint probabilities; this is especially problematic for Italian, where morphological complexity varies significantly across lexical items. Tokenization effects may create systematic biases: Italian compound words or phrases may be tokenized very differently by different tokenizers of multilingual models, leading to inconsistent probability distributions. Moreover, probability-based evaluation cannot capture the reasoning processes that have become increasingly important in current LLM applications, as models cannot leverage their problem-solving strategies, provide explanations, or exhibit the kind of multi-step reasoning that characterizes human-inspired processes (e.g., Chain of Thought) in language tasks.</p>
          <p>Generative Evaluation involves prompting a model to produce a complete, free-form response, which is then assessed against specific criteria or compared to a reference answer. This approach allows for more flexible and natural outputs, unconstrained by predefined answer options. For instance, in the Named Entity Disambiguation (NED) task, generative evaluation might prompt the model to produce a detailed explanation such as: "The correct answer is Milano, Italy (city) because the context mentions Marco Rossi being born there, indicating the major Italian city rather than other places with the same name." Such responses can provide richer insight into the model’s reasoning and capabilities.</p>
          <p>However, evaluating generative outputs remains a significant challenge. In the context of multiple-choice question answering, the evaluation procedure must recover the model’s intended answer from free-form text. Two primary approaches are commonly used: (1) applying hand-crafted regular expressions, which are simple and fast to implement but susceptible to edge cases and failures; and (2) leveraging LLM-based extractors, which offer greater robustness and accuracy but come with increased computational cost. Recent studies have investigated the trade-offs between these methods, revealing that even LLM-based extractors can fail under certain conditions or may be unnecessary in specific scenarios [27].</p>
          <p>For open-ended tasks, evaluation becomes even more complex due to the diversity and richness of possible correct answers. These tasks require assessments across multiple dimensions, such as relevance, coherence, factuality, and completeness. Traditional automatic metrics, such as BLEU [29], ROUGE [30], METEOR [31], BERTScore [32], and COMET [33], are often insufficient to capture the full quality of generated responses.</p>
          <p>For those reasons, LLM-as-a-Judge approaches [34] have recently gained traction for evaluating LLMs in open-ended generation tasks, offering an alternative to traditional, non-generative metrics. However, most of the existing research in this area has focused on the English language. Encouragingly, recent developments in multilingual, open-source LLM-as-a-Judge frameworks, such as Hercule and M-Prometheus [35, 36], have shown promising results in non-English contexts. Still, as of now, there are no open-weight LLM-as-a-Judge models explicitly trained for Italian, which points to a significant gap in the current literature. In general, LLM-as-a-Judge evaluation frameworks can be expensive, especially when based on commercial models. Even open-source alternatives, such as Prometheus [37], require substantial computational resources, e.g., Prometheus is available as a 7B and 35B model, making its deployment resource-intensive. In addition, the LLM-as-a-Judge paradigm faces several open challenges beyond language coverage and efficiency. Notably, robust meta-evaluation is needed to assess the reliability of LLM-based judgments. It is therefore important to pair model-based evaluation with human judgment, especially for mid-resource languages like Italian. Moreover, LLM-based evaluators remain vulnerable to various forms of bias, which can be particularly problematic in sensitive applications [38]. These limitations underscore the urgent need for a well-defined, effective evaluation framework, especially when assessing generative models on Italian language benchmarks.</p>
          <p>3.3. Task Variation</p>
          <p>Few-Shot Learning has been widely adopted in LLM evaluation, allowing models to leverage examples to improve performance on specific tasks. Our experience indicates that few-shot prompting is particularly effective when the answer format is novel or complex with respect to the model’s training data, as it provides crucial context and guidance for generating appropriate responses. However, few-shot prompting also introduces a significant computational overhead and requires careful selection of examples to avoid introducing hidden biases towards specific answers. Perhaps more importantly, few-shot prompting can lead to overfitting on the training examples provided for the given benchmark, which may be too specific and too similar to the test examples, so that results may not generalize well to different domains or tasks. Therefore, while few-shot prompting can enhance model performance, we recommend using zero-shot evaluation as a more representative measure of model capabilities, whereas few-shot prompting can be used as a supplementary task variation and a strong baseline on model performance.</p>
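          <p>The probability-based strategy of Section 3.2 and its length bias can be illustrated with a small sketch. It assumes that per-token log-probabilities for each option have already been obtained from some model; the numbers below are invented for illustration, and the function name pick_option is hypothetical. Averaging log-probabilities over tokens is one common mitigation for length bias, shown here as an option rather than as a recommendation made in the text.</p>
          <preformat><![CDATA[
```python
from typing import Dict, List


def pick_option(option_token_logprobs: Dict[str, List[float]],
                normalize: bool = False) -> str:
    """Select the option with the highest (optionally length-normalized) score.

    The unnormalized score is the sum of token log-probabilities, i.e. the
    log of the joint probability P(option | context). With normalize=True the
    sum is divided by the number of tokens (mean log-probability per token).
    """
    scores = {}
    for option, logprobs in option_token_logprobs.items():
        if not logprobs:
            raise ValueError(f"no token log-probabilities for option {option!r}")
        total = sum(logprobs)
        scores[option] = total / len(logprobs) if normalize else total
    return max(scores, key=scores.get)


# Invented log-probabilities: the longer (correct) option is more likely per
# token but has a lower joint probability because it spans more tokens.
toy_logprobs = {
    "Milano, Texas": [-1.2, -0.9],
    "Milano, Italy (city)": [-0.6, -0.5, -0.7, -0.6, -0.4],
}

raw_choice = pick_option(toy_logprobs)
normalized_choice = pick_option(toy_logprobs, normalize=True)
```
]]></preformat>
          <p>With these toy values the unnormalized score favors the shorter option, while the per-token average favors the longer one, which is the kind of discrepancy that length bias can introduce.</p>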
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>4. Where to Benchmark</title>
      <sec id="sec-3-1">
        <title>The same task can be presented in multiple ways, leading</title>
        <p>to diferent model performances based on the
formulation of the prompt. In our experience with Italian LLMs
and Italian benchmarks, we have identified several key
dimensions of task variation that significantly impact
model performance and evaluation outcomes.</p>
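        <p>As a minimal sketch of how a single task can be presented in
several ways, the snippet below builds prompt variants for one
multiple-choice item. The Italian instructions are the register and
cultural-framing examples given in this section; the sample item, the
English cross-lingual instruction, the field names, and the function
names are illustrative assumptions rather than part of any existing
benchmark or library.</p>
        <preformat>
```python
import random

# Hypothetical multiple-choice item; the field names are illustrative only.
ITEM = {
    "text": "Roma è la capitale d'Italia.",
    "question": "Qual è la capitale d'Italia?",
    "options": ["Milano", "Roma", "Napoli", "Torino"],
    "answer": "Roma",
}

# Instruction templates: the Italian ones follow the examples in the text;
# the English one is an assumed cross-lingual instruction.
INSTRUCTIONS = {
    "formal": "Sulla base del testo fornito, si identifichi l'opzione corretta.",
    "informal": "Leggendo questo testo, qual è la risposta giusta?",
    "cultural": "Come studente italiano, quale risposta sceglieresti?",
    "cross_lingual": "Read the text and pick the correct option. Answer in Italian.",
}

LETTERS = "ABCDEFGH"


def build_prompt(item, variant, seed=None):
    """Return (prompt, gold_letter) for one prompt variant of the same item."""
    options = list(item["options"])
    if seed is not None:
        # Randomization: shuffle the option order reproducibly.
        random.Random(seed).shuffle(options)
    lines = [INSTRUCTIONS[variant], "", item["text"], item["question"]]
    for letter, option in zip(LETTERS, options):
        lines.append(f"{letter}. {option}")
    gold_letter = LETTERS[options.index(item["answer"])]
    return "\n".join(lines), gold_letter


def all_variants(item, seeds=(None, 0, 1)):
    """Cross every instruction variant with every option order."""
    return {
        (variant, seed): build_prompt(item, variant, seed)
        for variant in INSTRUCTIONS
        for seed in seeds
    }
```
        </preformat>
        <p>Scoring a model on every entry returned by all_variants(ITEM),
and reporting the spread of accuracy across variants alongside the mean,
is one possible way to expose sensitivity to prompt formulation; the
gold letter is recomputed after each shuffle so that option reordering
never changes the correct answer.</p>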
      </sec>
      <sec id="sec-3-2">
        <title>Prompt Variation</title>
        <p>Prompt variation is essential for understanding how different
linguistic features influence model performance, as different models may
perform better or worse depending on how the task is presented.
• Register variation: Tests model sensitivity to formality differences
by comparing formal academic language ("Sulla base del testo fornito, si
identifichi l’opzione corretta") versus informal conversational prompts
("Leggendo questo testo, qual è la risposta giusta?"). This is
particularly important for Italian given its system of register markers.
• Cultural framing: Compares culturally specific framings ("Come
studente italiano, quale risposta sceglieresti?") with culturally
neutral ones. This proves particularly important for tasks about
Italian-specific knowledge.
• Instruction explicitness: Varies detail level from minimal prompts
relying on implicit understanding to elaborate instructions with
explicit criteria and response formats.
• Randomization: Introduces random variations in prompt structure, such
as changing the order of options or rephrasing questions, to assess
model robustness to possibly irrelevant changes.</p>
        <p>Cross-Lingual Prompting, which refers to prompting in a language
other than the language in which the model is expected to answer, is a
particularly interesting aspect of Italian LLM evaluation, as it allows
us to leverage the multilingual capabilities of models trained on
diverse datasets. Our observations indicate that Italian models often
perform better when prompted in English with instructions to respond in
Italian, suggesting that current Italian LLMs are benefitting from
higher-quality English training data during pre-training and/or
post-training. Therefore, cross-lingual prompting can be a powerful tool
for measuring cross-linguistic performance and understanding how models
generalize across languages, including coding languages, such as Python,
which are often used in programming tasks.</p>
        <p>The development of an LLM benchmark suite for a target language
typically follows one of three main approaches, each with distinct
advantages and limitations that significantly shape the resulting
evaluation framework. In this section, we outline “where” to obtain the
data to evaluate LLMs or, in the absence of an existing benchmark for a
target language, where to source the data to bootstrap the creation of a
new benchmark.</p>
        <sec id="sec-3-2-1">
          <title>Translation-Based Methodologies</title>
          <p>Translation-based methodologies are the most immediate and
resource-efficient strategy, as they allow us to leverage existing
English benchmarks, such as MMLU [9], HellaSwag [39], ARC [24],
BoolQ [40], and SciQ [41], among many others. This approach enables
rapid deployment of evaluation frameworks and facilitates
cross-linguistic comparison of model capabilities. However, direct
translation, apart from the possibility of translation errors,
introduces systematic biases that may obscure genuine linguistic
differences between Italian and English, potentially leading to
evaluation artifacts that do not reflect authentic Italian language use
patterns.</p>
          <p>Our experience with translating English benchmarks reveals
several aspects that require careful consideration, as they can
significantly impact the task’s validity and complexity. For instance,
WinoGrande [42] is a widely used benchmark for evaluating commonsense
reasoning in English, where the task involves filling in the blanks of
sentences with appropriate words, e.g., "The GPS and map helped me
navigate home. I got lost when the ___ got turned upside down", in which
the correct answer is "map". A possible translation into Italian could
be "Il GPS e la mappa mi hanno aiutato a tornare a casa. Mi sono perso
quando la ___ è stata capovolta", where the correct answer is "mappa".
We observe that the translated task is significantly less complex than
the original, as the word "GPS" is masculine in Italian, while "mappa"
is feminine: the feminine article "la" before the blank already rules
out "GPS", so a model can easily infer the correct answer based on
grammar alone rather than common sense.</p>
        </sec>
      </sec>
      <sec id="sec-3-3">
        <title>Adaptation-Based Methodologies</title>
        <p>Adaptation-based methodologies offer a middle ground between
translation and native development, allowing us to use data that is
already available in Italian while adapting the task design to better
fit the evaluation of LLMs. This approach enables us to create
benchmarks that are more culturally and linguistically relevant than
direct translations, while still leveraging existing resources to reduce
development costs. For instance, misogyny detection on social media
platforms presents significant differences between English and Italian
for several reasons, including the use of different terms, cultural
references, and linguistic structures, so translating English benchmarks
would not necessarily capture the nuances of misogyny in Italian.
Therefore, adaptation-based methodologies can be particularly effective
for tasks that require cultural or contextual understanding, such as
sentiment analysis, hate speech detection, and commonsense reasoning.
However, adaptation also requires careful consideration, as the
adaptation process (e.g., how the prompts or possible answers are
adapted) may introduce biases or artifacts that do not accurately
reflect the evaluation goals of the original benchmark.</p>
        <sec id="sec-3-3-1">
          <title>Native Development Approaches</title>
          <p>Native development approaches represent the most
resource-intensive but potentially most valuable strategy, creating
evaluation frameworks specifically designed for Italian linguistic and
cultural contexts. These approaches, while requiring substantial
investment in linguistic analysis and content creation, offer the
greatest potential for capturing phenomena unique to Italian language
use that may be systematically overlooked by adapted benchmarks. Since
native benchmarks require significant expertise, time, and resources to
develop, the need for them should be carefully weighed against the
potential benefits they offer. In our experience, native benchmarks are
particularly valuable for tasks that require deep cultural
understanding, such as cultural references, idiomatic expressions, and
pragmatic language use. Therefore, we recommend that native development
approaches be prioritized for tasks that are critical for evaluating
LLMs’ capabilities in Italian, while translation and adaptation
methodologies can be used to complement existing benchmarks and fill
gaps in evaluation coverage.</p>
          <p>5. Sustainable Benchmarking</p>
          <p>Sustainable evaluation requires moving away from static
benchmarks toward dynamic, community-driven evaluations. We propose a
living benchmark framework that addresses resource constraints via
adaptive dataset management, open model prioritization, and strategic
infrastructure utilization.</p>
        </sec>
        <sec id="sec-3-3-2">
          <title>Dynamic Task Management</title>
          <p>Our framework envisions dynamic lifecycle management for
datasets, in which evaluation tasks undergo continuous assessment and
are removed upon reaching saturation thresholds or becoming stale.
The research community should propose new tasks and perform a pilot
evaluation to assess complexity, cultural relevance, and computational
requirements before integration, with higher priority given to tasks
capturing emerging linguistic phenomena and leveraging unique aspects
of Italian language and culture.</p>
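          <p>The lifecycle rule described above can be sketched as follows.
This is only an illustration under stated assumptions: the text does not
fix numeric thresholds, so the saturation threshold, the top-k window,
the maximum age, and all class and function names below are
hypothetical.</p>
          <preformat>
```python
from dataclasses import dataclass
from datetime import date

# Assumed values: the framework description gives no concrete numbers.
SATURATION_THRESHOLD = 0.95  # mean top-k score at which a task is saturated
MAX_AGE_DAYS = 730           # age after which a task counts as stale
TOP_K = 3                    # number of best models used in the check


@dataclass
class Task:
    name: str
    added_on: date
    scores: list  # normalized scores in [0, 1], one per evaluated model


def is_saturated(task, threshold=SATURATION_THRESHOLD, top_k=TOP_K):
    """True when the top-k models average at or above the threshold."""
    best = sorted(task.scores, reverse=True)[:top_k]
    return bool(best) and sum(best) / len(best) >= threshold


def is_stale(task, today, max_age_days=MAX_AGE_DAYS):
    """True when the task has been in the suite longer than max_age_days."""
    return (today - task.added_on).days > max_age_days


def review(tasks, today):
    """Split tasks into kept names and (name, reason) pairs flagged for removal."""
    kept, flagged = [], []
    for task in tasks:
        if is_saturated(task):
            flagged.append((task.name, "saturated"))
        elif is_stale(task, today):
            flagged.append((task.name, "stale"))
        else:
            kept.append(task.name)
    return kept, flagged
```
          </preformat>
          <p>In such a sketch, flagged tasks would be handed to the
community for a removal decision rather than deleted automatically, and
a task with no scores yet is never treated as saturated.</p>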
          <p>Open-Source Prioritization: we propose a three-tier
model inclusion hierarchy: fully open-source models
(training code, data pipelines, complete documentation),
open-weight models (public weights and inference code),
and closed systems (limited to significant comparative
baselines). Performance-based curation should flag
underperforming models for removal while maintaining
architectural diversity and preserving historical data.</p>
          <p>Model Transparency and Comparative Context:
our framework would highlight model openness and core
characteristics, such as the number of training tokens
and model parameters. Current leaderboards often lack
a consistent emphasis on these details during
comparisons. For example, given equal parameter counts, it is
reasonable for a fully open model trained on fewer
tokens to underperform relative to a proprietary model
trained on significantly more data. Nonetheless, such
discrepancies should be seen as valuable indicators of the
evaluation gap, encouraging the research community to
close this gap through more equitable and transparent
benchmarking. Table 2 provides a non-exhaustive list of
state-of-the-art LLM families trained on Italian data (e.g.,
Minerva [4], Llama [43], Qwen [44], Salamandra [45],
EuroLLM [46], Almawave’s Velvet, iGenius’ Italia,
Fastweb’s MIIA), for which we report the number of training
tokens and model parameters.</p>
          <p>Community Governance: a community-based
steering committee with short-term rotating roles will govern
the framework, including representatives from Italian
research institutions and industry partners. The committee
establishes dataset inclusion criteria, defines evaluation
protocols, coordinates infrastructure allocation, and
mediates methodological disagreements through transparent
voting procedures.</p>
        </sec>
        <sec id="sec-3-3-3">
          <title>Infrastructure and Cost Management</title>
          <p>The framework leverages national computational resources,
e.g., CINECA’s Leonardo supercomputer, as the primary
infrastructure foundation. These partnerships should
provide access to state-of-the-art GPU clusters while
maintaining community accessibility through existing
institutional allocation systems. Our preliminary cost
analysis reveals that generative evaluation tasks consume 3-5
times more resources than probability-based assessments.
Optimization strategies include batch processing, smart
caching, and hierarchical evaluation protocols.
Overall, a comprehensive evaluation of 10 models across 50
tasks can require approximately 500-750 GPU hours per
quarter, with sustainability achieved through different
funding sources including national support, institutional
commitments, and industry partnerships.</p>
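          <p>To make the arithmetic behind these figures explicit, the
sketch below derives the average cost per model-task pair from the
numbers stated above and shows how a budget could be estimated from a
mix of probability-based and generative tasks. The per-run cost, the
share of generative tasks, and the function names are assumptions chosen
for illustration; only the 10 models, 50 tasks, 500-750 GPU hours, and
the 3-5x factor come from the text.</p>
          <preformat>
```python
def per_run_range(total_low, total_high, models, tasks):
    """Average GPU hours per (model, task) pair implied by a total budget."""
    runs = models * tasks
    return total_low / runs, total_high / runs


def gpu_hours(models, tasks, hours_per_prob_run, generative_share, generative_factor):
    """Estimate GPU hours for one full evaluation round.

    hours_per_prob_run: assumed cost of one probability-based run.
    generative_share:   assumed fraction of tasks that are generative.
    generative_factor:  cost multiplier for generative runs (3-5 in the text).
    """
    runs = models * tasks
    generative_runs = runs * generative_share
    prob_runs = runs - generative_runs
    return hours_per_prob_run * (prob_runs + generative_factor * generative_runs)


# Figures from the text: 10 models, 50 tasks, 500-750 GPU hours per quarter.
low, high = per_run_range(500, 750, models=10, tasks=50)  # (1.0, 1.5)

# Hypothetical mix: 0.5 GPU hours per probability-based run, one quarter of
# the tasks generative, and a 4x generative cost factor.
estimate = gpu_hours(10, 50, hours_per_prob_run=0.5,
                     generative_share=0.25, generative_factor=4)  # 437.5
```
          </preformat>
          <p>Under the stated figures, each of the 500 model-task pairs
costs between 1.0 and 1.5 GPU hours on average; the second call only
illustrates how strongly the share of generative tasks drives the
total.</p>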
          <p>Table 2. Columns: Model; Parameter Size (Billions); Training
Tokens (Trillions). Row groups: Italian First; Multilingual.</p>
          <p>Models: Minerva-350M, Minerva-1B, Minerva-3B, Minerva-7B,
Velvet-2B, Italia-9B, FastwebMIIA-7B, Llama-3.1-8B, Llama-3.2-1B,
Llama-3.2-3B, Salamandra-2B, Salamandra-7B, Velvet-14B, Qwen2.5-1.5B,
Qwen2.5-3B, Qwen2.5-7B, EuroLLM-1.7B.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>6. Conclusion</title>
      <p>This position paper outlines a comprehensive overview
of the Italian LLM evaluation landscape across several
important dimensions. Moreover, we firmly believe that
LLMs require rigorous, standardized evaluation frameworks
that can assess different capabilities in linguistically and
culturally diverse contexts. For Italian, this challenge is
compounded by the complexity of morphosyntactic phenomena,
dialectal variation, and culturally-specific knowledge
requirements that existing benchmarks are yet to fully address.
However, several aspects of benchmarking discussed in the paper,
for instance task formulation, evaluation and variation, can be
applied effectively to languages other than Italian, English
included. We hope that work on Italian can act as a trailblazer,
particularly for other European languages.</p>
    </sec>
    <sec id="sec-5">
      <title>Acknowledgments</title>
      <p>Research partly funded by PNRR - M4C2 - Investimento
1.3, Partenariato Esteso PE00000013 - “FAIR - Future
Artificial Intelligence Research” (Spoke 2 “Integrative AI”,
Spoke 5 “High-Quality AI” and Spoke 8 “Pervasive AI”)
funded by the European Commission under the
NextGeneration EU programme (https://fondazione-fair.it/).
Simone Conia’s fellowship is fully funded by the PNRR
MUR project PE0000013-FAIR. Luca Moroni and Roberto
Navigli gratefully acknowledge the support of the AI
factory IT4LIA project.</p>
      <sec id="sec-5-1">
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Text translation and Improve writing style. After using these tool(s)/service(s), the author(s)
reviewed and edited the content as needed and take(s) full responsibility for the publication’s
content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          <source>sive multitask language understanding</source>
          ,
          <year>2021</year>
          . URL: [1]
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          , L. Siciliani, https://arxiv.org/abs/
          <year>2009</year>
          .03300.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          <string-name>
            <given-names>G.</given-names>
            <surname>Fiameni</surname>
          </string-name>
          , G. Semeraro, Llamantino: Llama 2
          <fpage>mod</fpage>
          - [10]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Ma</surname>
          </string-name>
          , G. Zhang,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chandra</surname>
          </string-name>
          , S. Guo,
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          2023. URL: https://arxiv.org/abs/2312.09993.
          <article-title>and challenging multi-task language understanding</article-title>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          , G. Semeraro, Advanced benchmark,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2406.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          <article-title>natural-based interaction for the italian language: 01574.</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          <string-name>
            <surname>Llamantino-</surname>
          </string-name>
          3-anita,
          <year>2024</year>
          . URL: https://arxiv.org/ [11]
          <string-name>
            <given-names>S.</given-names>
            <surname>Mayhew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Blevins</surname>
          </string-name>
          , S. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Suppa</surname>
          </string-name>
          , H. Gonen,
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <source>abs/2405</source>
          .07101.
          <string-name>
            <surname>J. M. Imperial</surname>
            ,
            <given-names>B. F.</given-names>
          </string-name>
          <string-name>
            <surname>Karlsson</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Ljubešić</surname>
            , [3]
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Moroni</surname>
          </string-name>
          , G. Puccetti, P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
            ,
            <given-names>A. S. N.</given-names>
          </string-name>
          <string-name>
            <surname>Ljubešić</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Miranda</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          <string-name>
            <surname>Plank</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Riabi</surname>
          </string-name>
          , Y. Pin-
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          <article-title>vocabulary adaptation</article-title>
          , in: L.
          <string-name>
            <surname>Chiruzzo</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Rit- 2024
          <source>Conference of the North American Chapter</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          <string-name>
            <surname>Computational</surname>
            <given-names>Linguistics: NAACL</given-names>
          </string-name>
          <year>2025</year>
          ,
          <article-title>Associa- Human Language Technologies (Volume 1: Long</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          <string-name>
            <given-names>New</given-names>
            <surname>Mexico</surname>
          </string-name>
          ,
          <year>2025</year>
          , pp.
          <fpage>6646</fpage>
          -
          <lpage>6660</lpage>
          . URL: https:// tics, Mexico City, Mexico,
          <year>2024</year>
          , pp.
          <fpage>4322</fpage>
          -
          <lpage>4337</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          aclanthology.org/
          <year>2025</year>
          .findings-naacl.
          <volume>371</volume>
          /. doi: 10. URL: https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>243</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          <volume>18653</volume>
          /v1/
          <year>2025</year>
          .findings-naacl.
          <volume>371</volume>
          . doi:
          <volume>10</volume>
          .18653/v1/
          <year>2024</year>
          .
          <article-title>naacl-long</article-title>
          .
          <volume>243</volume>
          . [4]
          <string-name>
            <given-names>R.</given-names>
            <surname>Orlando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          , P.-L. Huguet
          <string-name>
            <surname>Cabot</surname>
            , S. Co- [12]
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Scirè</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Conia</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Ciciliano</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Navigli</surname>
          </string-name>
          , Echoes
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian ACL</source>
          <year>2023</year>
          ,
          <article-title>Association for Computational Lin-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          <string-name>
            <surname>Conference on Computational</surname>
          </string-name>
          <article-title>Linguistics (CLiC- guistics</article-title>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>853</fpage>
          -
          <lpage>867</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          it
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy, URL: https://aclanthology.org/
          <year>2023</year>
          .findings-acl.
          <volume>54</volume>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          <year>2024</year>
          , pp.
          <fpage>707</fpage>
          -
          <lpage>719</lpage>
          . URL: https://aclanthology.org/ doi:10.18653/v1/
          <year>2023</year>
          .findings-acl.
          <volume>54</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          2024.clicit-
          <volume>1</volume>
          .77/. [13]
          <string-name>
            <surname>R. Das</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Hristov</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Dimitrov</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          <article-title>Koy[5] Proceedings of the Sixth Evaluation Campaign of chev, P. Nakov, EXAMS-V: A multi-discipline</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          . Final Workshop (EVALITA
          <year>2018</year>
          )
          <article-title>, volume uating vision language models</article-title>
          , in: L.-W. Ku,
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          2263 of CEUR Workshop Proceedings, CEUR-WS.org, A.
          <string-name>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.),
          <source>Proceedings of</source>
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          2018. URL: https://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2263</volume>
          .
          <article-title>the 62nd Annual Meeting of the Association for [6] Proceedings of the Seventh Evaluation Campaign of Computational Linguistics (Volume 1: Long Pa-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          , CEUR-WS.org,
          <year>2020</year>
          . URL: https://ceur-ws. Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7768</fpage>
          -
          <lpage>7791</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          <source>org/</source>
          Vol-
          <volume>2765</volume>
          . https://aclanthology.org/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>420</volume>
          /. doi:10. [7] Proceedings of the Eighth Evaluation Campaign of 18653/v1/
          <year>2024</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>420</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          <string-name>
            <surname>Natural Language Processing</surname>
            and Speech Tools for [14]
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Du</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Hu</surname>
          </string-name>
          , G. Qi,
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          <string-name>
            <surname>Italian</surname>
          </string-name>
          , CEUR-WS.org,
          <year>2023</year>
          . URL: https://ceur-ws. H.
          <string-name>
            <surname>Jiang</surname>
          </string-name>
          , S. Cheng, B. Tian, MIKE: A new
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          <source>org/</source>
          Vol-
          <volume>3473</volume>
          .
          <article-title>benchmark for fine-grained multimodal entity</article-title>
          [8]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ravishankar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Barrett</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          <article-title>Belinkov, knowledge editing</article-title>
          , in: L.
          <string-name>
            <surname>-W. Ku</surname>
            ,
            <given-names>A</given-names>
          </string-name>
          . Mar-
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          tions, in: D.
          <string-name>
            <surname>Jurafsky</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Chai</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Schluter</surname>
          </string-name>
          , J.
          <source>Tetreault</source>
          <year>2024</year>
          ,
          <article-title>Association for Computational Linguistics,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          (Eds.),
          <source>Proceedings of the 58th Annual Meeting Bangkok, Thailand</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>5018</fpage>
          -
          <lpage>5029</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          <article-title>of the Association for Computational Linguistics</article-title>
          , //aclanthology.org/
          <year>2024</year>
          .findings-acl.
          <volume>298</volume>
          /. doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          <string-name>
            <surname>Association for Computational Linguistics</surname>
          </string-name>
          , On-
          <volume>18653</volume>
          /v1/
          <year>2024</year>
          .findings-acl.
          <volume>298</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          <string-name>
            <surname>line</surname>
          </string-name>
          ,
          <year>2020</year>
          , pp.
          <fpage>7590</fpage>
          -
          <lpage>7604</lpage>
          . URL: https://aclanthology. [15]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mishra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          org/
          <year>2020</year>
          .acl-main.
          <volume>679</volume>
          /. doi:
          <volume>10</volume>
          .18653/v1/2020. D.
          <string-name>
            <surname>Zhou</surname>
          </string-name>
          , L. Hou,
          <article-title>Instruction-following evaluation</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          <source>acl-main.679. for large language models</source>
          ,
          <year>2023</year>
          . URL: https://arxiv. [9]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Zou, org/abs/2311.07911. [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dussolle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Cardeña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Devine</surname>
          </string-name>
          , M- guage tasks, in: A.
          <string-name>
            <surname>Rogers</surname>
          </string-name>
          , J. Boyd-Graber,
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          ation, in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Chiruzzo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ritter</surname>
          </string-name>
          , L. Wang (Eds.),
          <source>for Computational Linguistics: ACL</source>
          <year>2023</year>
          , Asso-
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          <source>guistics: NAACL</source>
          <year>2025</year>
          , Association for Compu- Canada,
          <year>2023</year>
          , pp.
          <fpage>10476</fpage>
          -
          <lpage>10501</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          <string-name>
            <surname>tational Linguistics</surname>
          </string-name>
          , Albuquerque, New Mexico, //aclanthology.org/2023.findings-acl.666/. doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          <year>2025</year>
          , pp.
          <fpage>6161</fpage>
          -
          <lpage>6176</lpage>
          . URL: https://aclanthology.org/ 18653/v1/2023.findings-acl.666.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          2025.findings-naacl.344/. doi:10.18653/v1/2025. [23]
          <string-name>
            <given-names>A.</given-names>
            <surname>Talmor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Herzig</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lourie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Berant</surname>
          </string-name>
          , Com-
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          findings-naacl.344
          . monsenseQA:
          <article-title>A question answering challenge tar</article-title>
          [17]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rawat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>McBride</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ghosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nirmal</surname>
          </string-name>
          , J. Moon,
          <article-title>geting commonsense knowledge</article-title>
          , in: J. Burstein,
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          <string-name>
            <given-names>D.</given-names>
            <surname>Alamuri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>O'Brien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          , DiversityMedQA: C. Doran, T. Solorio (Eds.),
          <source>Proceedings of the 2019</source>
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          in:
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ignat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mihalcea</surname>
          </string-name>
          , Language Technologies
          , Volume
          <volume>1</volume>
          (Long and Short
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          <source>ceedings of the Third Workshop on NLP for Posi- tics</source>
          , Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>4149</fpage>
          -
          <lpage>4158</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          <string-name>
            <surname>tive Impact</surname>
          </string-name>
          , Association for Computational Linguis- URL: https://aclanthology.org/N19-1421/. doi:10.
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          <string-name>
            <surname>tics</surname>
          </string-name>
          , Miami, Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>334</fpage>
          -
          <lpage>348</lpage>
          . URL:
          18653/v1/N19-1421.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          https://aclanthology.org/2024.nlp4pi-1.29/. doi:10. [24]
          <string-name>
            <given-names>P.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Cowhey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Etzioni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Khot</surname>
          </string-name>
          ,
          <string-name>
            <surname>A</surname>
          </string-name>
          . Sab-
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          18653/v1/2024.nlp4pi-1
          .29. harwal, C. Schoenick,
          <string-name>
            <given-names>O.</given-names>
            <surname>Tafjord</surname>
          </string-name>
          , Think you have [18]
          <string-name>
            <given-names>G.</given-names>
            <surname>Attanasio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Basile</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Borazio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Croce</surname>
          </string-name>
          ,
          <string-name>
            <surname>M.</surname>
          </string-name>
          <article-title>Fran- solved question answering? try arc, the ai2 reason-</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          <string-name>
            <surname>cis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gili</surname>
          </string-name>
          , E. Musacchio,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nissim</surname>
          </string-name>
          , V. Patti, ing challenge,
          <year>2018</year>
          . URL: https://arxiv.org/abs/1803.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          <string-name>
            <given-names>M.</given-names>
            <surname>Rinaldi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Scalena</surname>
          </string-name>
          ,
          <source>CALAMITA: Challenge</source>
          05457
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          <article-title>the abilities of LAnguage models in ITAlian</article-title>
          , in: [25]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Weber-Genzel</surname>
          </string-name>
          , P. Röttger,
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          noli (Eds.),
          <source>Proceedings of the 10th Italian</source>
          Confer-
          <article-title>First-token probabilities do not match text answers</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          <article-title>ence on Computational Linguistics (CLiC-it 2024), in instruction-tuned language models</article-title>
          , in: L.-W.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          CEUR Workshop Proceedings
          , Pisa, Italy,
          <year>2024</year>
          , Ku,
          <string-name>
            <given-names>A.</given-names>
            <surname>Martins</surname>
          </string-name>
          , V. Srikumar (Eds.), Findings of the
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          pp.
          <fpage>1054</fpage>
          -
          <lpage>1063</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          . Association for Computational Linguistics: ACL
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          clicit-1.116/. 2024, Association for Computational Linguistics, [19]
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Conia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Martelli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Navigli</surname>
          </string-name>
          , To- Bangkok, Thailand,
          <year>2024</year>
          , pp.
          <fpage>7407</fpage>
          -
          <lpage>7416</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          <article-title>wards a more comprehensive evaluation for Italian //aclanthology</article-title>
          .org/2024.findings-acl.441/. doi: 10.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          LLMs, in:
          <string-name>
            <given-names>F.</given-names>
            <surname>Dell'Orletta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lenci</surname>
          </string-name>
          , S. Montemagni,
          18653/v1/2024.findings-acl.441.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          <string-name>
            <given-names>R.</given-names>
            <surname>Sprugnoli</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 10th Italian</source>
          [26]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Song,
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          it
          <year>2024</year>
          ), CEUR Workshop Proceedings, Pisa, Italy, r1:
          <article-title>Incentivizing reasoning capability in llms via</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          <year>2024</year>
          , pp.
          <fpage>584</fpage>
          -
          <lpage>599</lpage>
          . URL: https://aclanthology.org/ reinforcement learning,
          <year>2025</year>
          . URL: https://arxiv.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          2024.clicit-1.67/. org/abs/2501.12948. [20]
          <string-name>
            <given-names>B.</given-names>
            <surname>Magnini</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Zanoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Resta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cimmino</surname>
          </string-name>
          , P. Al- [27]
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Molfese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Moroni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Giofré</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Scirè</surname>
          </string-name>
          , S. Co-
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          <article-title>marking large language models on italian, 2025. covering the inconsistencies of llm evaluation in</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          URL: https://arxiv.org/abs/2502.02289. multiple-choice
          <source>question answering</source>
          ,
          <year>2025</year>
          . URL: [21]
          <string-name>
            <given-names>A.</given-names>
            <surname>Seveso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Potertì</surname>
          </string-name>
          , E. Federici, M. Mezzanzanica, https://arxiv.org/abs/2503.14996.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          <string-name>
            <given-names>F.</given-names>
            <surname>Mercorio</surname>
          </string-name>
          ,
          ITALIC:
          <article-title>An Italian culture-aware nat-</article-title>
          [28]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , M. Huang,
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2025 Confer- choice selectors</source>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          <source>ence of the Nations of the Americas Chapter of 2309</source>
          .03882.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          the Association for Computational Linguistics: Hu- [29]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu, Bleu:
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          <string-name>
            <surname>man Language</surname>
          </string-name>
          <article-title>Technologies (Volume 1: Long Pa- a method for automatic evaluation of machine</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          Albuquerque
          , New Mexico,
          <year>2025</year>
          , pp.
          <fpage>1469</fpage>
          -
          <lpage>1478</lpage>
          . (Eds.),
          <source>Proceedings of the 40th Annual Meeting of</source>
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          URL: https://aclanthology.org/2025.naacl-long.68/. the Association for Computational Linguistics
          , As-
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          doi:10.18653/v1/2025.
          <article-title>naacl-long.68. sociation for Computational Linguistics</article-title>
          , Philadel[22]
          <string-name>
            <given-names>A.</given-names>
            <surname>Efrat</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Honovich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Levy</surname>
          </string-name>
          , LMentry: A phia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          . URL:
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          <article-title>language model benchmark of elementary lan</article-title>
          - https://aclanthology.org/P02-1040/. doi:
          10.3115/
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          1073083.1073135. //aclanthology.org/2024.emnlp-main.248/. doi:10. [30]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          , ROUGE:
          <article-title>A package for automatic eval</article-title>
          -
          18653/v1/2024.emnlp-main.248.
        </mixed-citation>
      </ref>
      <ref id="ref71">
        <mixed-citation>
          <article-title>uation of summaries</article-title>
          , in: Text Summarization [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , N. Mo-
        </mixed-citation>
      </ref>
      <ref id="ref72">
        <mixed-citation>
          <string-name>
            <surname>guistics</surname>
          </string-name>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL:
          <string-name>
            <surname>Chawla</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Justice or prejudice? quanti-
        </mixed-citation>
      </ref>
      <ref id="ref73">
        <mixed-citation>
          https://aclanthology.org/W04-1013/.
          <article-title>fying biases in llm-as-a-</article-title>
          judge
          ,
          <year>2024</year>
          . URL: https: [31]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          METEOR
          : An automatic met- //arxiv.org/abs/2410.02736. arXiv:
          2410.02736.
        </mixed-citation>
      </ref>
      <ref id="ref74">
        <mixed-citation>
          <article-title>ric for MT evaluation with improved correlation</article-title>
          [39]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zellers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Bisk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          , Y. Choi,
        </mixed-citation>
      </ref>
      <ref id="ref75">
        <mixed-citation>
          <source>Workshop on Intrinsic and Extrinsic Evaluation (Eds.)</source>
          ,
          <source>Proceedings of the 57th Annual Meeting</source>
        </mixed-citation>
      </ref>
      <ref id="ref76">
        <mixed-citation>
          tics, Ann Arbor, Michigan,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          . URL: Florence, Italy,
          <year>2019</year>
          , pp.
          <fpage>4791</fpage>
          -
          <lpage>4800</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref77">
        <mixed-citation>
          https://aclanthology.org/W05-0909/. //aclanthology.org/P19-1472/. doi:10.18653/v1/ [32]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          P19-1472.
        </mixed-citation>
      </ref>
      <ref id="ref78">
        <mixed-citation>
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          , BERTScore: Evaluating text generation [40]
          <string-name>
            <given-names>C.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          , T. Kwiatkowski,
        </mixed-citation>
      </ref>
      <ref id="ref79">
        <mixed-citation>
          <source>with bert</source>
          ,
          <year>2020</year>
          . URL: https://arxiv.org/abs/1904. M. Collins, K. Toutanova, BoolQ: Exploring the
        </mixed-citation>
      </ref>
      <ref id="ref80">
        <mixed-citation>
          09675.
          <article-title>surprising difficulty of natural yes/no questions</article-title>
          , [33]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Stewart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. C.</given-names>
            <surname>Farinha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          , COMET: in:
          <string-name>
            <given-names>J.</given-names>
            <surname>Burstein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Doran</surname>
          </string-name>
          , T. Solorio (Eds.), Proceed-
        </mixed-citation>
      </ref>
      <ref id="ref81">
        <mixed-citation>
          <article-title>A neural framework for MT evaluation</article-title>
          ,
          <source>in: B. Web- ings of the 2019 Conference of the North American</source>
        </mixed-citation>
      </ref>
      <ref id="ref82">
        <mixed-citation>
          <source>of the 2020 Conference on Empirical Methods guistics: Human Language Technologies</source>
          , Volume
        </mixed-citation>
      </ref>
      <ref id="ref83">
        <mixed-citation>
          <source>in Natural Language Processing (EMNLP)</source>
          ,
          <article-title>As- 1 (Long</article-title>
          and Short Papers), Association for Com-
        </mixed-citation>
      </ref>
      <ref id="ref84">
        <mixed-citation>
          <year>2020</year>
          , pp.
          <fpage>2685</fpage>
          -
          <lpage>2702</lpage>
          . URL: https://aclanthology.org/
          <year>2019</year>
          , pp.
          <fpage>2924</fpage>
          -
          <lpage>2936</lpage>
          . URL: https://aclanthology.org/
        </mixed-citation>
      </ref>
      <ref id="ref85">
        <mixed-citation>
          2020.emnlp-main.213/. doi:10.18653/v1/2020. N19-1300/. doi:10.18653/v1/N19-1300.
        </mixed-citation>
      </ref>
      <ref id="ref86">
        <mixed-citation>
          emnlp-main.213. [41]
          <string-name>
            <given-names>J.</given-names>
            <surname>Welbl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <article-title>Crowdsourcing multiple choice science questions</article-title>
          ,
          <year>2017</year>
          . URL: https://arxiv.org/abs/1707.06209.
        </mixed-citation>
      </ref>
      <ref id="ref87">
        <label>34</label>
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Tan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>A survey on llm-as-a-judge</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2411.15594. arXiv:2411.15594.
        </mixed-citation>
      </ref>
      <ref id="ref88">
        <label>35</label>
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Doddapaneni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. S. U. R.</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Venkatesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dabre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kunchukuttan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. M.</given-names>
            <surname>Khapra</surname>
          </string-name>
          ,
          <article-title>Cross-lingual auto evaluation for assessing multilingual LLMs</article-title>
          , in:
          <source>Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
          , Association for Computational Linguistics, Vienna, Austria,
          <year>2025</year>
          , pp.
          <fpage>29297</fpage>
          -
          <lpage>29329</lpage>
          . URL: https://aclanthology.org/2025.acl-long.1419/.
        </mixed-citation>
      </ref>
      <ref id="ref89">
        <label>36</label>
        <mixed-citation>
          <string-name>
            <given-names>J.</given-names>
            <surname>Pombal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Yoon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>A suite of open multilingual llm judges</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2504.04953.
        </mixed-citation>
      </ref>
      <ref id="ref90">
        <label>37</label>
        <mixed-citation>
          <string-name>
            <given-names>S.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Suk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <article-title>Prometheus 2: An open source language model [...]</article-title>
          , in:
          <string-name>
            <given-names>Y.</given-names>
            <surname>Al-Onaizan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-N.</given-names>
            <surname>Chen</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the 2024 Conference on Empiri[...]</source>
          , Florida, USA,
          <year>2024</year>
          , pp.
          <fpage>4334</fpage>
          -
          <lpage>4353</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref91">
        <label>42</label>
        <mixed-citation>
          <string-name>
            <given-names>K.</given-names>
            <surname>Sakaguchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bhagavatula</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Winogrande: An adversarial winograd schema challenge at scale</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1907.10641.
        </mixed-citation>
      </ref>
      <ref id="ref92">
        <label>43</label>
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Grattafiori</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Dubey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Jauhri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pandey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kadian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Al-Dahle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Letman</surname>
          </string-name>
          , et al.,
          <article-title>The llama 3 herd of models</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2407.21783.
        </mixed-citation>
      </ref>
      <ref id="ref93">
        <label>44</label>
        <mixed-citation>
          <string-name>
            <surname>Qwen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hui</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          , et al.,
          <article-title>Qwen2.5 technical report</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2412.15115.
        </mixed-citation>
      </ref>
      <ref id="ref94">
        <label>45</label>
        <mixed-citation>
          <string-name>
            <given-names>A.</given-names>
            <surname>Gonzalez-Agirre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pàmies</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Llop</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Baucells</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. D.</given-names>
            <surname>Dalt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tamayo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. J.</given-names>
            <surname>Saiz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Espuña</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Prats</surname>
          </string-name>
          , [...]
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.08489.
        </mixed-citation>
      </ref>
      <ref id="ref95">
        <label>46</label>
        <mixed-citation>
          <string-name>
            <given-names>P. H.</given-names>
            <surname>Martins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fernandes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. M.</given-names>
            <surname>Guerreiro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Alves</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pombal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farajian</surname>
          </string-name>
          , [...]dow,
          <string-name>
            <given-names>J. G. C.</given-names>
            <surname>de Souza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Birch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. F. T.</given-names>
            <surname>Martins</surname>
          </string-name>
          , [...]
          <year>2024</year>
          . URL: https://arxiv.org/abs/2409.16235.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>