
Sustainable Italian LLM Evaluation: Community Perspectives and Methodological Guidelines

Luca Moroni

Gianmarco Pappacoda

Edoardo Barba

Simone Conia

Andrea Galassi

Bernardo Magnini

Roberto Navigli

Paolo Torroni

Roberto Zanoli

0 Fondazione Bruno Kessler (FBK), Trento, Italy
1 Sapienza NLP Group, Sapienza University of Rome, Rome, Italy
2 Università di Bologna, Bologna, Italy

2025

The evaluation of large language models for Italian faces unique challenges due to morphosyntactic complexity, dialectal variation, culture-specific knowledge, and limited availability of computational resources. This position paper presents a comprehensive framework for Italian LLM benchmarking, in which we identify key dimensions for LLM evaluation, including linguistic capabilities, knowledge domains, task types, and prompt variations, and propose high-level methodological guidelines for current and future initiatives. We advocate a community-driven, sustainable benchmarking initiative that incorporates dynamic dataset management, open model prioritization, and collaborative infrastructure utilization. Our framework aims to establish a coordinated effort within the Italian NLP community to ensure rigorous, scientifically sound evaluation practices that can adapt to the evolving landscape of Italian LLMs.

Keywords: Benchmarking, Italian, LLMs, Large Language Models

1. Introduction

One consideration in developing language-specific benchmarks involves the trade-offs between creating native content and translating from existing English resources. Indeed, while translation offers scalability and cross-linguistic comparability, it may fail to capture language-specific phenomena, cultural nuances, and idiomatic expressions that are crucial for comprehensive evaluation. Native Italian benchmarks, conversely, provide authentic linguistic challenges but require substantial expertise and resources in order to be developed and maintained.

This position paper synthesizes community experiences in benchmarking Italian LLMs and proposes actionable guidelines with the objective of incentivizing the development of more and better Italian LLM evaluation resources in a sustainable manner. We address four fundamental questions:

• Section 2: What to benchmark – a framework for prioritizing linguistic capabilities, knowledge domains, and task types in Italian LLM evaluation.
• Section 3: How to benchmark – methodological considerations including prompt engineering, evaluation metrics, and aggregation strategies.
• Section 4: Where to benchmark – which datasets and tasks to consider for a comprehensive evaluation.
• Section 5: Sustainable benchmarking – addressing organizational, computational, and financial challenges for long-term viability.

We present empirical insights, practical guidelines, and open research questions to encourage community dialogue toward establishing comprehensive, sustainable evaluation standards for Italian LLMs.

2. What to Benchmark

The fundamental question of what to benchmark in Italian LLM evaluation requires careful consideration of the nature of language understanding and generation capabilities. While English-centric benchmarks have established evaluation paradigms for general language understanding, Italian presents unique linguistic challenges that may require datasets and tasks specifically for the language, i.e., native Italian benchmarks, rather than relying solely on translated English resources. Drawing from established evaluation frameworks, as well as Italian-specific initiatives, we propose a systematic approach to characterizing the evaluation space along three critical dimensions that collectively capture the breadth of abilities essential for robust Italian language modeling.

Italian presents several distinctive features that distinguish it from well-studied languages like English: rich morphological inflection with complex agreement systems, relatively free word order with pragmatic constraints, extensive use of clitics and null subjects, and a wealth of dialectal variation across regions. These characteristics, combined with Italy’s unique cultural and institutional landscape, create specific challenges for language model evaluation that cannot be adequately addressed through direct translation of existing English benchmarks. To address these challenges, we propose a multi-dimensional framework for Italian LLM evaluation that captures the essential linguistic and cultural dimensions of language understanding and generation, as illustrated in Figure 1. Table 1 summarizes the coverage of 25 publicly available datasets within our proposed evaluation ontology, highlighting the need for comprehensive benchmarks that encompass a wide range of linguistic phenomena, knowledge domains, and task types.

2.1. Linguistic Competence

This dimension covers the basic language skills needed for understanding at different levels. Italian’s typological characteristics, as a Romance language with rich morphology and relatively flexible syntax, create evaluation challenges distinct from those posed by English or other languages. Our framework distinguishes between five hierarchical levels of linguistic analysis:

Morphological Processing constitutes the foundation, testing models’ ability to handle word formation, inflection, and morpho-syntactic agreement. Recent work has demonstrated the value of elementary linguistic tasks [22] in revealing fundamental model capabilities that may be obscured in more complex scenarios. For Italian, this includes evaluating comprehension of gender and number agreement (la casa bianca vs. i tavoli bianchi), complex verbal conjugation patterns across tenses and moods (andrei, andresti, andrebbe), and productive derivational morphology (camminare → camminabile → camminabilità). Unlike English, where morphological complexity is relatively limited, Italian models must demonstrate robustness to a wide range of inflectional and derivational forms, including irregular verbs and noun-adjective agreement patterns.
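One possible way to probe agreement of this kind is with minimal pairs, where a grammatical phrase is compared against a minimally different ungrammatical one. The sketch below is only an illustration under stated assumptions: the `MinimalPair` type, the `pair_accuracy` helper, and the `toy_score` function are hypothetical names introduced here, and `toy_score` is a crude ending-matching heuristic standing in for a model log-probability, not a real scorer.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class MinimalPair:
    grammatical: str
    ungrammatical: str
    phenomenon: str


# Illustrative pairs built around the agreement examples in the text.
PAIRS = [
    MinimalPair("la casa bianca", "la casa bianco", "gender agreement"),
    MinimalPair("i tavoli bianchi", "i tavoli bianco", "number agreement"),
    MinimalPair("le case bianche", "le case bianca", "number agreement"),
]


def pair_accuracy(pairs: Sequence[MinimalPair],
                  score: Callable[[str], float]) -> float:
    """Fraction of pairs where the grammatical phrase gets the higher score."""
    hits = sum(score(p.grammatical) > score(p.ungrammatical) for p in pairs)
    return hits / len(pairs)


def toy_score(phrase: str) -> float:
    """Hypothetical stand-in for a model log-probability.

    It only checks whether the last two words share their final letter,
    which happens to work for these regular examples and nothing more.
    """
    *_, noun, adjective = phrase.split()
    return 1.0 if noun[-1] == adjective[-1] else 0.0


if __name__ == "__main__":
    print(pair_accuracy(PAIRS, toy_score))
```

In an actual evaluation, `toy_score` would be replaced by a function returning the model's (possibly length-normalized) log-probability of the phrase; irregular nouns and adjectives would require a much larger, curated set of pairs.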

Lexical Knowledge assessment focuses on vocabulary breadth, semantic relations, and word-level disambiguation capabilities. This includes traditional tasks, such as word sense disambiguation (WSD), with some verbs in Italian that are particularly polysemous, like prendere (to take, catch, get, have) and dare (to give, provide, yield). Evaluation must also address lexical-semantic knowledge specific to Italian cultural and linguistic contexts, including understanding of false friends with other Romance languages (burro means butter, not donkey) and recognition of regional lexical variants (anguria vs. cocomero for watermelon).

Syntactic Processing evaluates models’ grasp of Italian sentence structure, including complex phenomena that distinguish Italian from more configurational languages. Key areas include clitic placement and climbing (lo voglio vedere vs. voglio vederlo), null subject licensing and pro-drop parameters, and the pragmatic constraints governing word order flexibility. Italian’s ability to express the same propositional content through multiple syntactic configurations (Mario ha visto Lucia; Lucia, Mario l’ha vista; L’ha vista Mario, Lucia) requires models to understand both structural possibilities and their discourse functions.

Semantic Processing encompasses both compositional semantics, i.e., how meaning is constructed from constituent parts, and pragmatic inference capabilities. This includes tasks such as textual entailment, semantic parsing, irony detection, and sentiment analysis, which require deeper contextual understanding. Italian’s rich system of grammaticalized aspect and mood markers (stava per partire vs. era sul punto di partire vs. stava partendo) creates semantic distinctions that must be captured in evaluation frameworks.

Pragmatic Processing represents the highest level of linguistic competence, evaluating models’ ability to understand language in context and interpret communicative intentions beyond literal meaning. Key evaluation areas include discourse coherence and cohesion, where models must track referential relations across extended texts and maintain thematic continuity. Italian’s rich system of discourse markers (magari, dunque, allora, comunque) and the pragmatic functions of syntactic variations require sophisticated contextual understanding. Additionally, models must demonstrate sensitivity to speech acts and politeness, understanding when indirect requests (non è che potresti...) are more appropriate than direct imperatives, and recognizing the pragmatic force of conditional constructions, such as sarebbe possibile vs. è possibile.

[Figure 1. Labels recovered from the figure:
Linguistic Competence: Morphology (inflection, conjugation, agreement patterns); Lexicon (vocabulary, idioms, multi-word expressions); Syntax (word order, parsing, complex structures); Semantics (meaning, disambiguation, inference); Pragmatics (context, discourse, communicative intent).
Domain & Knowledge Specialization: Domain Coverage (legal, medical, technical, literary texts); Cultural Knowledge (Italian culture, history, social contexts); Register (formal, informal, regional varieties).
Task Generalization & Instruction Following: Linguistic Instructions Following (Italian language directives); Task Generalization (adapting to new task formats); Cross-linguistic Transfer (leveraging multilingual knowledge).]

2.2. Domain and Knowledge

The second dimension addresses the world knowledge encoded in language models, with particular attention to Italian-specific cultural, historical, and institutional contexts. This dimension recognizes that language competence extends beyond linguistic phenomena to encompass domain-specific expertise and cultural awareness, which becomes particularly important given the country’s distinctive historical, geographical, political, legal, and cultural landscape.

Domain Coverage spans traditional academic disciplines (mathematics, natural sciences, humanities) as well as specialized professional domains where Italian-specific terminology, concepts, and practices may be essential. Legal reasoning presents a particularly challenging case: while mathematical reasoning may transfer readily across languages, Italian legal discourse requires deep familiarity with concepts like concordato preventivo, the distinction between dolo and colpa, and the complex structure of Italian administrative law (TAR, Consiglio di Stato). Medical terminology, with its mixture of Latin roots, Italian adaptations, and regional variations, is another similar challenge. Educational contexts require understanding of the Italian school system’s structure (liceo classico, istituto tecnico, scuola dell’infanzia) and grading systems (giudizio vs. voto).

Cultural and Contextual Knowledge evaluation addresses the understanding of Italian history, geography, social institutions, and contemporary cultural references. This encompasses knowledge of Italy’s regional diversity, ranging from linguistic varieties (understanding when someone uses scialla) to culinary traditions (knowing that ragù varies significantly between Bologna and Naples) to historical references (recognizing allusions to Tangentopoli or the anni di piombo). Models must also be aware of the contemporary Italian media landscape, political discourse, and social issues, with appropriate cultural sensitivity, while at the same time avoiding stereotypes or biases that may arise from training data and also staying updated with new events.

Genre and Register Adaptation tests models’ sensitivity to different text types and communicative contexts, from the elaborate bureaucratic language of Italian public administration (linguaggio burocratico) to the informal, creative language of social media. Italian’s rich system of honorifics and address forms, e.g., when to use tu, lei, and voi and the use of conditional forms for politeness (vorrei vs. voglio), requires social awareness that goes beyond linguistic competence. Academic Italian, with its distinctive structures and vocabulary (altresì, peraltro, laddove), represents another crucial register for evaluation.

Table 1: Coverage of 25 publicly available datasets and 3 frameworks (ITA-Bench, EvalITA-LLM, and ITALIC) within the proposed Italian LLM evaluation ontology (✓ = covered, ✗ = not).

| Dataset | Morphology | Lexical | Syntax | Semantics | Pragmatics | Domain | Culture | Register | Ling. Instr. | Task Gen. | Cross-Ling. |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AI2-ARC | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| BoolQ | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| GSM8K | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| HellaSwag | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| MMLU | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| PIQA | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| SciQ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| TruthfulQA | ✗ | ✗ | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| WinoGrande | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Admission Test | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| AMI 2020 | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| CLinkaRT 2023 | ✗ | ✓ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| DiscoTEX | ✗ | ✗ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| GhigliottinAI | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| HaSpeeDe2 | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ |
| LexSub | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| NERMUD | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| PreLearn20 | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| PreTENS 22 | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| QA4FAQ | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ |
| QuandHo | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |
| SENTIPOLC | ✗ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| Sum-FP | ✗ | ✗ | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✓ | ✗ | ✗ |
| Textual Entailment | ✗ | ✗ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| WiC-ITA | ✗ | ✓ | ✗ | ✓ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ | ✗ |
| ITA-Bench | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✗ | ✗ | ✓ |
| EvalITA-LLM | ✗ | ✓ | ✗ | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✗ | ✗ |
| ITALIC | ✓ | ✓ | ✓ | ✓ | ✗ | ✓ | ✓ | ✗ | ✗ | ✗ | ✗ |

2.3. Task Generalization and Instruction Following

The third dimension captures models’ ability to understand and execute new, unseen instructions, which is a capability that has become increasingly important in practical LLM applications. This dimension should be equally relevant for Italian LLMs, as instruction-following capabilities must transfer across linguistic and cultural boundaries while maintaining sensitivity to Italian-specific communicative norms and expectations.

Linguistic Instruction Following encompasses tasks that require manipulation of language itself, demonstrating meta-linguistic awareness. For Italian, this includes style transfer tasks that require understanding of register differences, e.g., converting formal business correspondence (Con la presente si comunica che...) to informal messaging (Ti scrivo per dirti che...), or adapting academic writing to journalistic style. Grammar presents particular challenges: shifting from passato prossimo to passato remoto depending on regional preferences, converting between active and passive constructions while maintaining appropriate clitic placement, and handling person shifts in embedded structures. Content restructuring, such as summarization with specific constraints (e.g., “riassumi in 50 parole mantenendo un tono formale”), tests not only linguistic competence but also adherence to culturally appropriate communication patterns.

Task Generalization evaluates models’ ability to adapt to novel task formats and requirements based on natural language descriptions, without task-specific training. This includes assessment of few-shot learning capabilities in Italian contexts, where models must quickly adapt to new domains or specialized vocabularies. For instance, a model might need to learn medical terminology from a few examples and then apply it consistently, or understand the conventions of Italian legal citation formats from brief instruction. The ability to combine multiple sub-tasks in complex workflows, such as extracting information from a bureaucratic document, reformatting it according to specific guidelines, and generating a summary in a different register, represents a crucial capability for practical applications.

Cross-Linguistic Instruction Transfer addresses the challenge of Italian LLMs operating in multilingual contexts. This includes handling instructions that may draw upon multilingual contexts (e.g., “traduci questo testo inglese mantenendo il tono ironico”) or require code-switching between Italian and other languages, particularly English in technical contexts. LLMs must demonstrate sensitivity to when code-switching is appropriate versus when maintaining linguistic purity is required, understanding contexts where English technical terms are standard (software, hardware) versus where Italian equivalents are preferred (programma vs. software).

Guidelines on What to Benchmark. Our proposed framework (Figure 1) could be used for a structured and systematic categorization of Italian LLM evaluation tasks. By encouraging task designers to be explicit and transparent about which dimensions their tasks cover, the research community can more effectively allocate time, expertise, and resources toward areas that are currently underrepresented. This, in turn, would allow for a richer and more fine-grained understanding of model capabilities across a broad spectrum of competencies, as illustrated in Table 1, highlighting concrete gaps, for example, the pressing need for a greater number of evaluation tasks that assess pragmatic processing, adaptation to different registers and sociolinguistic contexts, as well as the ability to transfer instructions across languages in cross-linguistic scenarios.

3. How to Benchmark

3.1. Task Formulation

The shift towards generative language models requires reconsideration of traditional NLP evaluation paradigms, particularly for discriminative tasks that formed the backbone of earlier evaluation efforts when classification and regression were the primary focus.

Multiple-Choice Question Adaptation has emerged as an easy-to-implement approach for bridging traditional evaluation paradigms with generative model capabilities. By recasting discriminative tasks as prompted generation problems, this approach enables evaluation of models’ reasoning processes while maintaining compatibility with established evaluation metrics. For example, Named Entity Disambiguation (NED) tasks can be reformulated as multiple-choice questions as follows:

Question: Given the context "Marco Rossi è nato a Milano nel 1985", which entity does "Milano" refer to?
A) Milano, Texas (USA)
B) Milano, Italy (city)
C) Milano Marittima (resort town)
D) Milano Centrale (train station)
Answer:

where the model is expected to generate the correct option letter (e.g., "B") as its response. This approach allows for leveraging existing evaluation metrics while adapting to the generative capabilities of modern LLMs.

Multiple-choice question adaptation has become a prevalent strategy in LLM evaluation [9, 23, 24], including Italian evaluations [19, 21], due to its simplicity (i.e., one only needs to compare the label generated by the model with the correct label) and its low computational cost. However, it is important to note that this approach is not truly reflective of real-world applications, where models are often expected to generate free-form text rather than select from predefined options. Moreover, multiple-choice question evaluation presents several persistent challenges for assessing LLMs. Different evaluation strategies often yield inconsistent results [25], and – with the emergence of reasoning-intensive models [26] – extracting the intended answer is not always straightforward [27].

Open-Ended Generation Tasks represent the most authentic form of generative evaluation, allowing models to produce free-form text responses. However, this approach introduces significant challenges in terms of evaluation consistency and reliability, particularly for tasks that require subjective judgment or cultural context understanding. For example, an Instruction Following (IF) task can be formulated as an open-ended task as follows:

Instruction: I am planning a trip to Italy, and I would like you to write an itinerary for my journey in a Shakespearean style. You are not allowed to use any commas in your response.
Answer:

where the model is expected to generate a coherent and correct answer about a trip to “Italy” while following the guidelines imposed by the instruction (“Shakespearean style”). Evaluating a model’s ability to generate a coherent and contextually appropriate response to an open-ended question about Italian culture may require human annotators with specific cultural knowledge, leading to potential biases and inconsistencies in scoring. The open-ended paradigm offers several distinct advantages: it enables assessment of reasoning processes and explanation quality, allows for partial credit scoring based on response components (e.g., a sound trip schedule, and adherence to the writing style), and more closely mirrors real-world deployment scenarios where models must generate free-form responses. However, open-ended formulation introduces significant challenges, including increased computational costs, the need for complex answer validation methods, LLM-as-a-Judge approaches, and task-specific evaluation metrics that may need to be designed for each domain and application.

3.2. Task Evaluation

There are two main strategies for evaluating the output of generative models: probability-based evaluation and generative evaluation. These approaches differ in how they assess model outputs, with significant implications for benchmark design.

Probability-Based Evaluation relies on computing the likelihood of specific continuations given a context, leveraging the model’s internal probability distribution over tokens. This approach is particularly well-suited for tasks where the model must select among predefined options, such as multiple-choice questions or cloze completion tasks. The evaluation is based on the model’s ability to assign higher probabilities to correct answers compared to incorrect ones. More formally, given a context $c$ and a set of options $O = \{o_1, o_2, \dots, o_n\}$, the evaluation computes the probabilities $P(o_i \mid c)$ for each option and selects the one with the highest probability as the model’s implicit choice. In the previous example, the model would compute probabilities for each option: $P(\text{"Milano, Italy"} \mid \text{context})$, $P(\text{"Milano, Texas"} \mid \text{context})$, etc. Alternatively, for computational efficiency, evaluation can be performed on option labels: $P(\text{"B"} \mid \text{context})$, though this approach may lose semantic information and introduce artifacts related to label order and bias [28].

The main advantages of probability-based evaluation include computational efficiency (particularly when computing probabilities of single-token continuations) and the ability to assess model confidence through probability margins. However, this approach faces several limitations that become particularly pronounced in Italian contexts. Length bias can systematically favor shorter options, as longer sequences have lower joint probabilities; this is especially problematic for Italian, where morphological complexity varies significantly across lexical items. Tokenization effects may create systematic biases: Italian compound words or phrases may be tokenized very differently by different tokenizers of multilingual models, leading to inconsistent probability distributions. Moreover, probability-based evaluation cannot capture the reasoning processes that have become increasingly important in current LLM applications, as models cannot leverage their problem-solving strategies, provide explanations, or exhibit the kind of multi-step reasoning that characterizes human-inspired processes (e.g., Chain of Thought) in language tasks.

Generative Evaluation involves prompting a model to produce a complete, free-form response, which is then assessed against specific criteria or compared to a reference answer. This approach allows for more flexible and natural outputs, unconstrained by predefined answer options. For instance, in the Named Entity Disambiguation (NED) task, generative evaluation might prompt the model to produce a detailed explanation such as: "The correct answer is Milano, Italy (city) because the context mentions Marco Rossi being born there, indicating the major Italian city rather than other places with the same name." Such responses can provide richer insight into the model’s reasoning and capabilities.

However, evaluating generative outputs remains a significant challenge. In the context of multiple-choice question answering, the evaluation procedure must recover the model’s intended answer from free-form text. Two primary approaches are commonly used: (1) applying hand-crafted regular expressions, which are simple and fast to implement but susceptible to edge cases and failures; and (2) leveraging LLM-based extractors, which offer greater robustness and accuracy but come with increased computational cost. Recent studies have investigated the trade-offs between these methods, revealing that even LLM-based extractors can fail under certain conditions or may be unnecessary in specific scenarios [27].

For open-ended tasks, evaluation becomes even more complex due to the diversity and richness of possible correct answers. These tasks require assessments across multiple dimensions, such as relevance, coherence, factuality, and completeness. Traditional automatic metrics, such as BLEU [29], ROUGE [30], METEOR [31], BERTScore [32], and COMET [33], are often insufficient to capture the full quality of generated responses.

For those reasons, LLM-as-a-Judge approaches [34] have recently gained traction for evaluating LLMs in open-ended generation tasks, offering an alternative to traditional, non-generative metrics. However, most of the existing research in this area has focused on the English language. Encouragingly, recent developments in multilingual, open-source LLM-as-a-Judge frameworks such as Hercule and M-Prometheus [35, 36] have shown promising results in non-English contexts. Still, as of now, there are no open-weight LLM-as-a-Judge models explicitly trained for Italian, showing that there exists a significant gap in the current literature. In general, LLM-as-a-Judge evaluation frameworks can be expensive, especially when based on commercial models. Even open-source alternatives, such as Prometheus [37], require substantial computational resources, e.g., Prometheus is available as a 7B and 35B model, making its deployment resource-intensive.

In addition, the LLM-as-a-Judge paradigm faces several open challenges beyond language coverage and efficiency. Notably, robust meta-evaluation is needed to assess the reliability of LLM-based judgments. It is therefore important to pair model-based evaluation with human judgment, especially for mid-resource languages like Italian. Moreover, LLM-based evaluators remain vulnerable to various forms of bias, which can be particularly problematic in sensitive applications [38]. These limitations underscore the urgent need for a well-defined, effective evaluation framework, especially when assessing generative models on Italian language benchmarks.

3.3. Task Variation

The same task can be presented in multiple ways, leading to different model performances based on the formulation of the prompt. In our experience with Italian LLMs and Italian benchmarks, we have identified several key dimensions of task variation that significantly impact model performance and evaluation outcomes.

Prompt Variation is essential for understanding how different linguistic features influence model performance, as a different model may perform better or worse depending on how the task is presented.

• Register variation: Tests model sensitivity to formality differences by comparing formal academic language ("Sulla base del testo fornito, si identifichi l’opzione corretta") versus informal conversational prompts ("Leggendo questo testo, qual è la risposta giusta?"). This is particularly important for Italian given its system of register markers.
• Cultural framing: Compares culturally specific framings ("Come studente italiano, quale risposta sceglieresti?") with culturally neutral ones. This proves particularly important for tasks about Italian-specific knowledge.
• Instruction explicitness: Varies detail level from minimal prompts relying on implicit understanding to elaborate instructions with explicit criteria and response formats.
• Randomization: Introduces random variations in prompt structure, such as changing the order of options or rephrasing questions, to assess model robustness to possibly irrelevant changes.

Cross-Lingual Prompting, which refers to prompting in a language other than the language in which the model is expected to answer, is a particularly interesting aspect of Italian LLM evaluation, as it allows us to leverage the multilingual capabilities of models trained on diverse datasets. Our observations indicate that Italian models often perform better when prompted in English with instructions to respond in Italian, suggesting that current Italian LLMs are benefitting from higher-quality English training data during pre-training and/or post-training. Therefore, cross-lingual prompting can be a powerful tool for measuring cross-linguistic performance and understanding how models generalize across languages, including coding languages, such as Python, which are often used in programming tasks.

Few-Shot Learning has been widely adopted in LLM evaluation, allowing models to leverage examples to improve performance on specific tasks. Our experience indicates that few-shot prompting is particularly effective when the answer format is novel or complex with respect to the model’s training data, as it provides crucial context and guidance for generating appropriate responses. However, few-shot prompting also introduces a significant computational overhead and requires careful selection of examples to avoid introducing hidden biases towards specific answers. Perhaps more importantly, few-shot prompting can lead to overfitting on the training examples provided for the given benchmark, which could be too specific and similar to the test examples and thus may not generalize well to different domains or tasks. Therefore, while few-shot prompting can enhance model performance, we recommend using zero-shot evaluation as a more representative measure of model capabilities, whereas few-shot prompting can be used as a supplementary task variation and a strong baseline on model performance.

4. Where to Benchmark

The development of an LLM benchmark suite for a target language typically follows one of three main approaches, each with distinct advantages and limitations that significantly shape the resulting evaluation framework. In this section, we outline “where” to obtain the data to evaluate LLMs, or – in the absence of an existing benchmark for a target language – where to source the data to bootstrap the creation of a new benchmark.

Translation-Based Methodologies are the most immediate and resource-efficient strategy, as they allow us to leverage existing English benchmarks, such as MMLU [9], HellaSwag [39], ARC [24], BoolQ [40], and SciQ [41], among many others. This approach enables rapid deployment of evaluation frameworks and facilitates cross-linguistic comparison of model capabilities. However, direct translation – apart from the possibility of translation errors – introduces systematic biases that may obscure genuine linguistic differences between Italian and English, potentially leading to evaluation artifacts that do not reflect authentic Italian language use patterns.

Our experience with translating English benchmarks reveals several aspects that require careful consideration, as they can significantly impact the task’s validity and complexity. For instance, WinoGrande [42] is a widely used benchmark for evaluating commonsense reasoning in English, where the task involves filling in the blanks of sentences with appropriate words, e.g., The GPS and map helped me navigate home. I got lost when the ___ got turned upside down, in which the correct answer is map. A possible translation into Italian could be Il GPS e la mappa mi hanno aiutato a tornare a casa. Mi sono perso quando la ___ è stata capovolta, where the correct answer is mappa. We observe that the translated task is significantly less complex than the original, as the word GPS is masculine in Italian, while mappa is feminine, i.e., a model can easily infer the correct answer based on grammar alone rather than common sense.

Adaptation-Based Methodologies offer a middle ground between translation and native development, allowing us to use data that is already available in Italian while adapting the task design to better fit the evaluation of LLMs. This approach enables us to create benchmarks that are more culturally and linguistically relevant than direct translations, while still leveraging existing resources to reduce development costs. For instance, misogyny detection on social media platforms presents significant differences between English and Italian for several reasons, including the use of different terms, cultural references, and linguistic structures, i.e., translating English benchmarks would not necessarily capture the nuances of misogyny in Italian. Therefore, adaptation-based methodologies can be particularly effective for tasks that require cultural or contextual understanding, such as sentiment analysis, hate speech detection, and commonsense reasoning. However, adaptation also requires careful consideration, as the adaptation process (e.g., how the prompts or possible answers are adapted) may introduce biases or artifacts that do not accurately reflect the evaluation goals of the original benchmark.

Native Development Approaches represent the most resource-intensive but potentially most valuable strategy, creating evaluation frameworks specifically designed for Italian linguistic and cultural contexts. These approaches, while requiring substantial investment in linguistic analysis and content creation, offer the greatest potential for capturing phenomena unique to Italian language use that may be systematically overlooked by adapted benchmarks. Since native benchmarks require significant expertise, time, and resources to develop, their need should be carefully evaluated against the potential benefits they offer. In our experience, native benchmarks are particularly valuable for tasks that require deep cultural understanding, such as cultural references, idiomatic expressions, and pragmatic language use. Therefore, we recommend that native development approaches be prioritized for tasks that are critical for evaluating LLMs’ capabilities in Italian, while translation and adaptation methodologies can be used to complement existing benchmarks and fill gaps in evaluation coverage.

5. Sustainable Benchmarking

Sustainable evaluation requires moving away from static benchmarks toward dynamic, community-driven evaluations. We propose a living benchmark framework that addresses resource constraints via adaptive dataset management, open model prioritization, and strategic infrastructure utilization.

Dynamic Task Management: our framework envisions dynamic lifecycle management for datasets, in which evaluation tasks are continuously assessed and removed once they reach saturation thresholds or become stale. The research community should propose new tasks and perform a pilot evaluation to assess complexity, cultural relevance, and computational requirements before integration, with higher priority given to tasks capturing emerging linguistic phenomena and leveraging unique aspects of Italian language and culture.
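As a minimal sketch of the task lifecycle described above, the following Python fragment flags a task for retirement when it is saturated or stale. The threshold values, the `Task` fields, and the function name `review_task` are hypothetical choices for illustration; the framework itself does not fix concrete values.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical thresholds: the paper does not specify concrete values.
SATURATION_THRESHOLD = 0.95  # best normalized score above which a task counts as saturated
MAX_AGE_DAYS = 3 * 365       # age after which a task counts as stale


@dataclass
class Task:
    name: str
    added_on: date
    best_score: float  # best normalized score (0-1) reached by any tracked model


def review_task(task: Task, today: date) -> str:
    """Return 'retire' for saturated or stale tasks, otherwise 'keep'."""
    if task.best_score >= SATURATION_THRESHOLD:
        return "retire"  # models have effectively solved the task
    if (today - task.added_on).days > MAX_AGE_DAYS:
        return "retire"  # the task has been in the pool for too long
    return "keep"
```

In practice, the steering committee would presumably set such thresholds per task family rather than globally, and a retired task would be archived rather than deleted so that historical results remain comparable.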

Open-Source Prioritization: we propose a three-tier model inclusion hierarchy: fully open-source models (training code, data pipelines, complete documentation), open-weight models (public weights and inference code), and closed systems (limited to significant comparative baselines). Performance-based curation should flag underperforming models for removal while maintaining architectural diversity and preserving historical data.

Model Transparency and Comparative Context: our framework would highlight model openness and core characteristics, such as the number of training tokens and model parameters. Current leaderboards often lack a consistent emphasis on these details during comparisons. For example, given equal parameter counts, it is reasonable for a fully open model trained on fewer tokens to underperform relative to a proprietary model trained on significantly more data. Nonetheless, such discrepancies should be seen as valuable indicators of the evaluation gap, encouraging the research community to close this gap through more equitable and transparent benchmarking. Table 2 provides a non-exhaustive list of state-of-the-art LLM families trained on Italian data (e.g., Minerva [4], Llama [43], Qwen [44], Salamandra [45], EuroLLM [46], Almawave’s Velvet, iGenius’ Italia, Fastweb’s MIIA), for which we report the number of training tokens and model parameters.
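The three-tier hierarchy and the transparency metadata discussed above could be represented along the following lines. This is only an illustrative sketch: the class names, the field names, and the tokens-per-parameter ratio are assumptions introduced here, not part of the proposed framework.

```python
from dataclasses import dataclass
from enum import IntEnum


class Openness(IntEnum):
    """Inclusion tiers; a lower value means higher priority."""
    FULLY_OPEN = 1   # training code, data pipelines, complete documentation
    OPEN_WEIGHT = 2  # public weights and inference code
    CLOSED = 3       # kept only as significant comparative baselines


@dataclass
class ModelEntry:
    name: str
    openness: Openness
    params_b: float        # model parameters, in billions
    train_tokens_t: float  # training tokens, in trillions

    def tokens_per_param(self) -> float:
        """Training tokens seen per parameter (illustrative context metric)."""
        return (self.train_tokens_t * 1e12) / (self.params_b * 1e9)


def inclusion_order(models: list[ModelEntry]) -> list[ModelEntry]:
    """Sort by openness tier first, then by name for a stable listing."""
    return sorted(models, key=lambda m: (m.openness, m.name))
```

Reporting a ratio such as tokens per parameter next to each score is one possible way to keep the comparative context visible when a fully open model is compared with a model trained on far more data.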

Community Governance: a community-based steering committee with short-term rotating roles will govern the framework, including representatives from Italian research institutions and industry partners. The committee establishes dataset inclusion criteria, defines evaluation protocols, coordinates infrastructure allocation, and mediates methodological disagreements through transparent voting procedures.

Infrastructure and Cost Management: the framework leverages national computational resources, e.g., CINECA’s Leonardo supercomputer, as the primary infrastructure foundation. These partnerships should provide access to state-of-the-art GPU clusters while maintaining community accessibility through existing institutional allocation systems. Our preliminary cost analysis reveals that generative evaluation tasks consume 3-5 times more resources than probability-based assessments. Optimization strategies include batch processing, smart caching, and hierarchical evaluation protocols. Overall, a comprehensive evaluation of 10 models across 50 tasks can require approximately 500-750 GPU hours per quarter, with sustainability achieved through different funding sources, including national support, institutional commitments, and industry partnerships.
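The figures above imply an average of roughly 1.0 to 1.5 GPU hours per model-task pair (500 to 750 GPU hours over 10 × 50 = 500 pairs). The sketch below makes this arithmetic explicit and adds a simple budget estimator. The split between probability-based and generative tasks and the per-run cost in the usage note are hypothetical inputs chosen for illustration, not values reported in the cost analysis.

```python
def gpu_hours_per_run(total_gpu_hours: float, n_models: int, n_tasks: int) -> float:
    """Average GPU hours per (model, task) pair."""
    return total_gpu_hours / (n_models * n_tasks)


def quarterly_gpu_hours(
    n_models: int,
    n_prob_tasks: int,
    n_gen_tasks: int,
    prob_cost_h: float,
    gen_ratio: float,
) -> float:
    """Estimate the quarterly budget.

    prob_cost_h: GPU hours for one probability-based task on one model.
    gen_ratio:   how many times more expensive a generative task is
                 (the text reports a factor between 3 and 5).
    """
    per_model = n_prob_tasks * prob_cost_h + n_gen_tasks * prob_cost_h * gen_ratio
    return n_models * per_model


# Averages implied by the reported range: 1.0 and 1.5 GPU hours per pair.
low = gpu_hours_per_run(500, 10, 50)
high = gpu_hours_per_run(750, 10, 50)
```

For example, with an assumed split of 40 probability-based and 10 generative tasks, 0.8 GPU hours per probability-based run, and a ratio of 4, `quarterly_gpu_hours(10, 40, 10, 0.8, 4.0)` gives about 640 GPU hours, which falls inside the reported 500-750 range.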

[Table 2. Columns: Model; Parameter Size (Billions); Training Tokens (Trillions). Row groups: Italian First; Multilingual. Models listed: Minerva-350M, Minerva-1B, Minerva-3B, Minerva-7B, Velvet-2B, Italia-9B, FastwebMIIA-7B, Llama-3.1-8B, Llama-3.2-1B, Llama-3.2-3B, Salamandra-2B, Salamandra-7B, Velvet-14B, Qwen2.5-1.5B, Qwen2.5-3B, Qwen2.5-7B, EuroLLM-1.7B.]

6. Conclusion

This position paper outlines a comprehensive overview of the Italian LLM evaluation landscape across several important dimensions. Moreover, we firmly believe that the LLMs require rigorous, standardized evaluation frameworks that can assess different capabilities in linguistically and culturally diverse contexts. For Italian, this challenge is compounded by the complexity of morphosyntactic phenomena, dialectal variation, and culturally-specific knowledge requirements that existing benchmarks are yet to fully address. However, several aspects of benchmarking discussed in the paper, for instance task formulation, evaluation and variation, can be applied effectively to languages other than Italian, English included. We hope that work on Italian can act as a trailblazer, particularly for other European languages.

Acknowledgments

Research partly funded by PNRR - M4C2 - Investimento 1.3, Partenariato Esteso PE00000013 - “FAIR - Future Artificial Intelligence Research” (Spoke 2 “Integrative AI”, Spoke 5 “High-Quality AI” and Spoke 8 “Pervasive AI”) funded by the European Commission under the NextGeneration EU programme (https://fondazione-fair.it/). Simone Conia’s fellowship is fully funded by the PNRR MUR project PE0000013-FAIR. Luca Moroni and Roberto Navigli gratefully acknowledge the support of the AI factory IT4LIA project.

Declaration on Generative AI

During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order to: translate text and improve writing style. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.

References

[1] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in italian language, 2023. URL: https://arxiv.org/abs/2312.09993.

[2] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the italian language: Llamantino-3-anita, 2024. URL: https://arxiv.org/abs/2405.07101.

[3] L. Moroni, G. Puccetti, P.-L. Huguet Cabot, A. S. Bejgu, A. Miaschi, E. Barba, F. Dell'Orletta, A. Esuli, R. Navigli, Optimizing LLMs for Italian: Reducing token fertility and enhancing efficiency through vocabulary adaptation, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 6646-6660. URL: https://aclanthology.org/2025.findings-naacl.371/. doi:10.18653/v1/2025.findings-naacl.371.

[4] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli, Minerva LLMs: The first family of large language models trained from scratch on Italian data, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 707-719. URL: https://aclanthology.org/2024.clicit-1.77/.

[5] Proceedings of the Sixth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop (EVALITA 2018), volume 2263 of CEUR Workshop Proceedings, CEUR-WS.org, 2018. URL: https://ceur-ws.org/Vol-2263.

[6] Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, CEUR-WS.org, 2020. URL: https://ceur-ws.org/Vol-2765.

[7] Proceedings of the Eighth Evaluation Campaign of Natural Language Processing and Speech Tools for Italian, CEUR-WS.org, 2023. URL: https://ceur-ws.org/Vol-3473.

[8] M. Abdou, V. Ravishankar, M. Barrett, Y. Belinkov, D. Elliott, A. Søgaard, The sensitivity of language models and humans to Winograd schema perturbations, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7590-7604. URL: https://aclanthology.org/2020.acl-main.679/. doi:10.18653/v1/2020.acl-main.679.

[9] D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, J. Steinhardt, Measuring massive multitask language understanding, 2021. URL: https://arxiv.org/abs/2009.03300.

[10] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, et al., MMLU-Pro: A more robust and challenging multi-task language understanding benchmark, 2024. URL: https://arxiv.org/abs/2406.01574.

[11] S. Mayhew, T. Blevins, S. Liu, M. Suppa, H. Gonen, J. M. Imperial, B. F. Karlsson, P. Lin, N. Ljubešić, L. Miranda, B. Plank, A. Riabi, Y. Pinter, Universal NER: A gold-standard multilingual named entity recognition benchmark, in: Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Mexico City, Mexico, 2024, pp. 4322-4337. URL: https://aclanthology.org/2024.naacl-long.243/. doi:10.18653/v1/2024.naacl-long.243.

[12] A. Scirè, S. Conia, S. Ciciliano, R. Navigli, Echoes from Alexandria: A large resource for multilingual book summarization, in: Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 853-867. URL: https://aclanthology.org/2023.findings-acl.54/. doi:10.18653/v1/2023.findings-acl.54.

[13] R. Das, S. Hristov, H. Li, D. Dimitrov, I. Koychev, P. Nakov, EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7768-7791. URL: https://aclanthology.org/2024.acl-long.420/. doi:10.18653/v1/2024.acl-long.420.

[14] J. Li, M. Du, C. Zhang, Y. Chen, N. Hu, G. Qi, H. Jiang, S. Cheng, B. Tian, MIKE: A new benchmark for fine-grained multimodal entity knowledge editing, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 5018-5029. URL: https://aclanthology.org/2024.findings-acl.298/. doi:10.18653/v1/2024.findings-acl.298.

[15] J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, L. Hou, Instruction-following evaluation for large language models, 2023. URL: https://arxiv.org/abs/2311.07911.

[16] A. Dussolle, A. Cardeña, S. Sato, P. Devine, M-IFEval: Multilingual instruction-following evaluation, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Findings of the Association for Computational Linguistics: NAACL 2025, Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 6161-6176. URL: https://aclanthology.org/2025.findings-naacl.344/. doi:10.18653/v1/2025.findings-naacl.344.

[17] R. Rawat, H. McBride, R. Ghosh, D. Nirmal, J. Moon, D. Alamuri, S. O'Brien, K. Zhu, DiversityMedQA: A benchmark for assessing demographic biases in medical diagnosis using large language models, in: D. Dementieva, O. Ignat, Z. Jin, R. Mihalcea, et al. (Eds.), Proceedings of the Third Workshop on NLP for Positive Impact, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 334-348. URL: https://aclanthology.org/2024.nlp4pi-1.29/. doi:10.18653/v1/2024.nlp4pi-1.29.

[18] G. Attanasio, P. Basile, F. Borazio, D. Croce, M. Francis, J. Gili, E. Musacchio, M. Nissim, V. Patti, M. Rinaldi, D. Scalena, CALAMITA: Challenge the abilities of LAnguage models in ITAlian, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 1054-1063. URL: https://aclanthology.org/2024.clicit-1.116/.

[19] L. Moroni, S. Conia, F. Martelli, R. Navigli, Towards a more comprehensive evaluation for Italian LLMs, in: F. Dell'Orletta, A. Lenci, S. Montemagni, R. Sprugnoli (Eds.), Proceedings of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proceedings, Pisa, Italy, 2024, pp. 584-599. URL: https://aclanthology.org/2024.clicit-1.67/.

[20] B. Magnini, R. Zanoli, M. Resta, M. Cimmino, P. Albano, et al., Evalita-LLM: Benchmarking large language models on italian, 2025. URL: https://arxiv.org/abs/2502.02289.

[21] A. Seveso, D. Potertì, E. Federici, M. Mezzanzanica, F. Mercorio, ITALIC: An Italian culture-aware natural language benchmark, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), Association for Computational Linguistics, Albuquerque, New Mexico, 2025, pp. 1469-1478. URL: https://aclanthology.org/2025.naacl-long.68/. doi:10.18653/v1/2025.naacl-long.68.

[22] A. Efrat, O. Honovich, O. Levy, LMentry: A language model benchmark of elementary language tasks, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 10476-10501. URL: https://aclanthology.org/2023.findings-acl.666/. doi:10.18653/v1/2023.findings-acl.666.

[23] A. Talmor, J. Herzig, N. Lourie, J. Berant, CommonsenseQA: A question answering challenge targeting commonsense knowledge, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 4149-4158. URL: https://aclanthology.org/N19-1421/. doi:10.18653/v1/N19-1421.

[24] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, O. Tafjord, Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018. URL: https://arxiv.org/abs/1803.05457.

[25] X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger, et al., "My answer is C": First-token probabilities do not match text answers in instruction-tuned language models, in: L.-W. Ku, A. Martins, V. Srikumar (Eds.), Findings of the Association for Computational Linguistics: ACL 2024, Association for Computational Linguistics, Bangkok, Thailand, 2024, pp. 7407-7416. URL: https://aclanthology.org/2024.findings-acl.441/. doi:10.18653/v1/2024.findings-acl.441.

[26] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, et al., DeepSeek-R1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL: https://arxiv.org/abs/2501.12948.

[27] F. M. Molfese, L. Moroni, L. Gioffré, A. Scirè, S. Conia, et al., Right answer, wrong score: Uncovering the inconsistencies of llm evaluation in multiple-choice question answering, 2025. URL: https://arxiv.org/abs/2503.14996.

[28] C. Zheng, H. Zhou, F. Meng, J. Zhou, M. Huang, Large language models are not robust multiple choice selectors, 2024. URL: https://arxiv.org/abs/2309.03882.

[29] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318. URL: https://aclanthology.org/P02-1040/. doi:10.3115/1073083.1073135.

[30] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74-81. URL: https://aclanthology.org/W04-1013/.

[31] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65-72. URL: https://aclanthology.org/W05-0909/.

[32] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with bert, 2020. URL: https://arxiv.org/abs/1904.09675.

[33] R. Rei, C. Stewart, A. C. Farinha, A. Lavie, COMET: A neural framework for MT evaluation, in: B. Webber, T. Cohn, Y. He, Y. Liu (Eds.), Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics, Online, 2020, pp. 2685-2702. URL: https://aclanthology.org/2020.emnlp-main.213/. doi:10.18653/v1/2020.emnlp-main.213.

[34] J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang, W. Gao, L. Ni, J. Guo, A survey on llm-as-a-judge, 2025. URL: https://arxiv.org/abs/2411.15594. arXiv:2411.15594.

[35] S. Doddapaneni, M. S. U. R. Khan, D. Venkatesh, R. Dabre, A. Kunchukuttan, M. M. Khapra, Cross-lingual auto evaluation for assessing multilingual LLMs, in: Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Vienna, Austria, 2025, pp. 29297-29329. URL: https://aclanthology.org/2025.acl-long.1419/.

[36] J. Pombal, D. Yoon, P. Fernandes, I. Wu, S. Kim, et al., M-Prometheus: A suite of open multilingual llm judges, 2025. URL: https://arxiv.org/abs/2504.04953.

[37] S. Kim, J. Suk, S. Longpre, B. Y. Lin, J. Shin, et al., Prometheus 2: An open source language model specialized in evaluating other language models, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 4334-4353. URL: https://aclanthology.org/2024.emnlp-main.248/. doi:10.18653/v1/2024.emnlp-main.248.

[38] J. Ye, Y. Wang, Y. Huang, D. Chen, Q. Zhang, N. Moniz, et al., N. V. Chawla, X. Zhang, Justice or prejudice? quantifying biases in llm-as-a-judge, 2024. URL: https://arxiv.org/abs/2410.02736. arXiv:2410.02736.

[39] R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, Y. Choi, HellaSwag: Can a machine really finish your sentence?, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 4791-4800. URL: https://aclanthology.org/P19-1472/. doi:10.18653/v1/P19-1472.

[40] C. Clark, K. Lee, M.-W. Chang, T. Kwiatkowski, M. Collins, K. Toutanova, BoolQ: Exploring the surprising difficulty of natural yes/no questions, in: J. Burstein, C. Doran, T. Solorio (Eds.), Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Association for Computational Linguistics, Minneapolis, Minnesota, 2019, pp. 2924-2936. URL: https://aclanthology.org/N19-1300/. doi:10.18653/v1/N19-1300.

[41] J. Welbl, N. F. Liu, M. Gardner, Crowdsourcing multiple choice science questions, 2017. URL: https://arxiv.org/abs/1707.06209.

[42] K. Sakaguchi, R. L. Bras, C. Bhagavatula, Y. Choi, Winogrande: An adversarial winograd schema challenge at scale, 2019. URL: https://arxiv.org/abs/1907.10641.

[43] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, et al., The llama 3 herd of models, 2024. URL: https://arxiv.org/abs/2407.21783.

[44] Qwen, A. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, et al., Qwen2.5 technical report, 2025. URL: https://arxiv.org/abs/2412.15115.

[45] A. Gonzalez-Agirre, M. Pàmies, J. Llop, I. Baucells, S. D. Dalt, D. Tamayo, J. J. Saiz, F. Espuña, J. Prats, et al., Salamandra technical report, 2025. URL: https://arxiv.org/abs/2502.08489.

[46] P. H. Martins, P. Fernandes, J. Alves, N. M. Guerreiro, R. Rei, D. M. Alves, J. Pombal, A. Farajian, et al., J. G. C. de Souza, A. Birch, A. F. T. Martins, EuroLLM: Multilingual language models for Europe, 2024. URL: https://arxiv.org/abs/2409.16235.