<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A study on the soundness of closed-ended evaluation of Large Language Models adapted to the Italian language</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Elio</forename><surname>Musacchio</surname></persName>
							<email>elio.musacchio@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">National PhD in Artificial Intelligence</orgName>
								<orgName type="institution">University of Pisa</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Siciliani</surname></persName>
							<email>lucia.siciliani@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>pierpaolo.basile@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Edoardo</forename><surname>Michielon</surname></persName>
							<email>edoardo.michielon@consulenti.fastweb.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Fastweb SpA</orgName>
								<address>
									<settlement>Milan</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Pasqualini</surname></persName>
							<email>marco.pasqualini@consulenti.fastweb.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Fastweb SpA</orgName>
								<address>
									<settlement>Milan</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Asia</forename><forename type="middle">Beatrice</forename><surname>Uboldi</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Fastweb SpA</orgName>
								<address>
									<settlement>Milan</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Semeraro</surname></persName>
							<email>giovanni.semeraro@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04-06, 2024</addrLine>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A study on the soundness of closed-ended evaluation of Large Language Models adapted to the Italian language</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3374DE17074283C19C789CFB170A4328</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Natural Language Processing</term>
					<term>Evaluation</term>
					<term>Benchmark</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open weights architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the proposed solutions so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance in each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evident that it is not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) are models based on the Transformer architecture capable of solving a wide variety of Natural Language Generation (NLG) tasks, even those not encountered during training, due to their extensive training and large number of parameters. Thanks to their remarkable skills, interest in LLMs is now at its peak, resulting in a proliferation of open-weight models (e.g. LLaMA, Mistral, and many others). Among the several challenges related to the development of LLMs, one of the most critical is their evaluation <ref type="bibr" target="#b0">[1]</ref>. One approach to tackle this issue has been to build benchmarks that collect different datasets, with the aim of obtaining a more comprehensive evaluation of a model's overall capabilities. Currently, there is a leaderboard [2] (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) which keeps track of the capabilities of openly available LLMs. Specifically, the models are tested on six tasks that span different abilities a language model should have, e.g. reasoning or text completion. Regarding reasoning abilities, the models are tested on closed-ended tasks. Specifically, multiple-choice question answering tasks are provided, where a question is given with a list of possible alternatives, each associated with an identifier (a letter, a number, and so on). Intuitively, since the model has also been pre-trained on closed-ended question-answering data, it should be able to generalize and pick the correct choice out of the available ones. Furthermore, rather than generating the output directly, the probabilities learned by the model are studied, using log-likelihood to assess which option is more likely to be correct. For the English language, this evaluation methodology has been a standard approach to assess the capabilities of LLMs. 
However, when adapting a model to a new language, this methodology may not be as sound, due to the small amount of non-English data used to pre-train such models. The model only has to generate the correct option identifier, so the task does not really test the model's ability to generate high-quality text in another language. The goal of this work is to understand whether a new evaluation setting applied to language-adapted LLMs may give more insight than the traditional approach. Therefore, our contributions are the following:</p><p>• We test two evaluation settings for language-adapted LLMs, changing the structure of closed-ended question answering tasks; • We evaluate the performance of state-of-the-art models on these settings; • We study the sensitivity of the models to the input prompt.</p></div>
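The log-likelihood scoring scheme described above can be sketched as follows. Here `toy_score_fn` and its scores are hypothetical stand-ins for a real LM scorer, which would sum the log-probabilities of the option identifier's tokens conditioned on the prompt; only the selection logic is shown.

```python
def select_by_loglikelihood(score_fn, prompt, option_ids):
    """Pick the option identifier whose continuation the model
    considers most likely (highest log-likelihood)."""
    scores = {oid: score_fn(prompt, oid) for oid in option_ids}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical log-likelihoods a model might assign to each identifier;
# a real score_fn would query the LM for log P(identifier | prompt).
toy = {"A": -2.3, "B": -0.4, "C": -1.7, "D": -3.1}

def toy_score_fn(prompt, option_id):
    return toy[option_id]

best, scores = select_by_loglikelihood(toy_score_fn, "Question: ...", "ABCD")
```

Under this scheme, the model never generates text: the "answer" is simply the identifier with the highest score, which is exactly the property questioned in the rest of the paper.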
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Language Model evaluation has been a research focus ever since the first Decoder-only models, which were designed for natural language generation. One of the most remarkable reasoning-related skills of LLMs has been in-context learning. In particular, few-shot learning has been increasingly used: the idea is that providing input-output examples in the model prompt should positively affect the generation process <ref type="bibr" target="#b2">[3]</ref>.</p><p>There are multiple leaderboards which evaluate open LLMs on non-English languages, e.g. the Open PL LLM Leaderboard <ref type="bibr" target="#b3">[4]</ref> for Polish or the Open KO LLM Leaderboard <ref type="bibr" target="#b4">[5]</ref> for Korean. These leaderboards are often based on the lm-evaluation-harness framework <ref type="bibr" target="#b5">[6]</ref>, which has been a milestone in the evaluation of LLMs. LLM evaluation can also depend on the topic at hand: some works focus on mathematical reasoning <ref type="bibr" target="#b6">[7]</ref> as well as factuality <ref type="bibr" target="#b7">[8]</ref>.</p><p>These evaluation settings often rely on closed-ended tasks, specifically multiple-choice question answering. The idea is to calculate the log-likelihood of the next token to generate for each option identifier. However, this may not be the best setting to evaluate LLMs. Wang et al. <ref type="bibr" target="#b8">[9]</ref> studied this on Instruction-tuned LLMs by training a classifier to predict which option to associate with the generated answer. This was done to look past additional text generated by the model (e.g. the generated text could be "The answer is B." as opposed to the simple "B." token). They found that the log-likelihood decisions and the generated-text decisions often did not match.</p><p>Regarding Italian evaluation, some works have approached this challenge. Bacciu et al. 
<ref type="bibr" target="#b9">[10]</ref> released another version of the Open Italian LLM Leaderboard, considering a different variety of tasks. Mercorio et al. <ref type="bibr" target="#b10">[11]</ref> released a benchmark based on questions from the INVALSI test, an Italian educational test, to further probe the knowledge and reasoning abilities of these models on a dataset that is natively in Italian rather than obtained through machine translation. The latter is one of the main problems when evaluating these models: due to the lack of resources with respect to the English language, the datasets used at the state of the art are translated using machine translation models. Still, all this effort to evaluate Italian-adapted LLMs mainly relies on closed-ended tasks.</p></div>
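The mismatch studied by Wang et al. [9] arises because a free-form answer must first be mapped back to one of the listed options. A rule-based sketch of that mapping step is shown below; it is a hypothetical stand-in for their trained classifier and only handles the simple "standalone letter" case.

```python
import re

def extract_choice(generated, identifiers=("A", "B", "C", "D")):
    """Heuristic sketch: return the first standalone option identifier
    found in a free-form answer (e.g. "The answer is B."), or None if
    the generation does not commit to any listed option. A rule-based
    stand-in for the trained classifier used by Wang et al. [9]."""
    for match in re.finditer(r"\b([A-Z])\b", generated):
        if match.group(1) in identifiers:
            return match.group(1)
    return None

print(extract_choice("The answer is B."))   # finds "B"
print(extract_choice("Non lo so."))         # no identifier -> None
```

Comparing this extracted choice with the log-likelihood argmax is one way to observe the disagreement the authors report.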
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>We study pre-trained and language-adapted models to test their capabilities in solving Italian language tasks. Specifically, we modify the typical formatting used in multiple-choice question answering to study whether the models are capable of correctly following instructions and generating Italian text. Usually, the format shown in Listing 1 is used, where &lt;QUESTION&gt; is the question the model has to answer, while &lt;IDENTIFIER_i&gt; and &lt;OPTION_i&gt; are, respectively, the option identifier (usually a letter or a number) and the text of a possible answer to the question. &lt;CORRECT_IDENTIFIER&gt; is the identifier of the option that is the correct answer to the question.</p><formula xml:id="formula_0">&lt;QUESTION&gt;: &lt;IDENTIFIER_1&gt; &lt;OPTION_1&gt; &lt;IDENTIFIER_2&gt; &lt;OPTION_2&gt; ... &lt;IDENTIFIER_N&gt; &lt;OPTION_N&gt; &lt;CORRECT_IDENTIFIER&gt; Listing 1: closed-ended format</formula><p>We aim to modify the task so that the model has to generate the text of the correct option instead of its identifier. To do so, we consider two main evaluation settings:</p><p>• Open-ended (OE): we remove the available options and only supply the question in the prompt; • Closed-ended no identifiers (CE-NI): we format the options without identifiers, and the model has to write the text of the correct option.</p><p>In particular, for the CE-NI setting, we apply the format shown in Listing 2, where &lt;CORRECT_OPTION&gt; is the text of the option that represents the correct answer to the question. Generally, models are also evaluated by calculating log-likelihoods rather than generating text directly; the chosen option is then selected based on the highest value. 
We choose to perform a generative task instead, to check whether the models are capable of generating only the answer string, without additional text, and whether they generate something outside of the provided options. To evaluate this setting, we use the BLEU, ROUGE-L and BertScore F1 metrics, which are standard reference-based metrics for assessing how closely a generated sentence matches a reference one. BLEU and ROUGE-L focus on matching n-grams, while BertScore leverages pre-trained BERT models to assess the semantic similarity between the words of the two texts. Furthermore, we consider four different prompt formats:</p><p>• Plain (P): there is no formatting, the text of the task is provided as it is in the prompt, and only a "Risposta:" string is added at the end; • Plain few-shot (P-F): the plain format preceded by input-output examples; • Instruct (I): the task is wrapped in the model's chat template; • Instruct few-shot (I-F): the chat-template format preceded by input-output examples. Furthermore, for the few-shot formats, we consider two distinct numbers of examples to provide in the prompt: one-shot and five-shot. The intuition is that a language-adapted LLM should significantly improve its performance even when provided with a single example.</p><p>We consider these prompt formats because most of the evaluation settings for Italian LLMs are run without applying the chat template. We argue that this choice may not be the best one when considering Instruct models that have been trained using a specific prompt format to continue a conversation. They should be evaluated using the same prompt format, since it is also the one that will be used in case of deployment.</p><p>To set up the experimental protocol, we use the lm-evaluation-harness library <ref type="bibr" target="#b5">[6]</ref>, which provides an immediate and intuitive command line to automatically evaluate LLMs on previously defined as well as custom tasks. Specifically, we define custom tasks within the library following the previously defined evaluation settings. 
To do so, we consider the following datasets:</p><p>• ARC-Challenge <ref type="bibr" target="#b11">[12]</ref>: consists of multiple-choice science exam questions; the Challenge set contains complex questions that were answered incorrectly by both a retrieval-based and a co-occurrence method;</p><p>• MMLU <ref type="bibr" target="#b12">[13]</ref>: consists of multiple-choice questions from 57 different topics (e.g. mathematics, computer science, and so on), requiring problem-solving abilities and knowledge to answer correctly; • EXAMS <ref type="bibr" target="#b13">[14]</ref>: consists of multiple-choice questions from high school exams. The dataset contains different subsets curated for different languages and optionally includes additional paragraphs related to the question (extracted from Wikipedia); • WWBM <ref type="bibr" target="#b14">[15]</ref>: consists of multiple-choice questions spanning a wide range of topics. The questions come from the Italian version of the "Who Wants to Be a Millionaire?" board game, where contestants answer progressively more difficult questions.</p><p>The question-answer instances are split into different categories depending on the difficulty of the question itself.</p><p>Regarding the Italian versions of these datasets, both EXAMS and WWBM natively provide splits in the Italian language. For ARC and MMLU, instead, we use the Italian version provided in the library for the okapi task released by Lai et al. <ref type="bibr" target="#b15">[16]</ref>, who performed automatic translation of the original datasets into several languages using GPT-3.5 Turbo. For all of these datasets, we define two custom tasks which apply the OE and CE-NI evaluation settings automatically. The examples used in the few-shot settings are taken from the validation splits of the datasets. 
For EXAMS, we use the train split as a test split (since a test split is not provided), while for WWBM, we remove the first five instances from the original dataset and use them as a validation split.</p><p>Regarding the models, we experiment with the following:</p><p>• Italia-9B-Instruct-v0.1 2 : trained from scratch with a focus on the Italian language (90% of the data in Italian and the rest in English), with instruction tuning for conversational purposes; • LLaMAntino-2-chat-13b-hf-UltraChat-ITA <ref type="bibr" target="#b16">[17]</ref>: instruction tuning of LLaMAntino-2-chat-13b-hf-ITA (an Italian-adapted LLM) using a translated version of the UltraChat dataset; • LLaMAntino-3-ANITA-8B-Inst-DPO-ITA <ref type="bibr" target="#b17">[18]</ref>: fine-tuning, DPO and adaptation using a mixture of Italian and English datasets, starting from the LLaMA-3-8B-Instruct model; • maestrale-chat-v0.4-alpha-sft 3  </p></div>
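As a concrete illustration of the settings above, the prompts can be sketched with a small formatting helper. The exact strings (newline placement, the "Risposta:" suffix of the Plain format, letter identifiers for the classic format) are our assumptions, not the paper's verbatim templates.

```python
def build_prompt(question, options, setting="CE-NI"):
    """Sketch of the evaluated prompt settings:
    OE    -> question only, no options;
    CE-NI -> options listed without identifiers;
    CE    -> classic closed-ended format (Listing 1), for contrast."""
    if setting == "OE":
        body = ""
    elif setting == "CE-NI":
        body = "\n".join(options) + "\n"
    else:  # "CE": options prefixed with letter identifiers
        body = "\n".join(f"{i} {o}" for i, o in zip("ABCD", options)) + "\n"
    return f"{question}:\n{body}Risposta:"

opts = ["Milano", "Roma", "Napoli", "Torino"]
print(build_prompt("Qual è la capitale d'Italia", opts, "OE"))
print(build_prompt("Qual è la capitale d'Italia", opts, "CE-NI"))
```

In the OE setting the gold answer is the full option text with nothing to copy from, while in CE-NI the model can restrict its generation to one of the listed strings.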
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Results for the OE setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset and for each metric is in bold</p><p>• Meta-Llama-3-8B (https://huggingface.co/meta-llama/Meta-Llama-3-8B) and Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct): latest version of the LLaMA family of models released by Meta (base and instruct version, respectively); • Minerva-3B-base-v1.0 (https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0): trained from scratch to be a proficient bilingual (English and Italian) base model; • zefiro-7b-dpo-ITA (https://huggingface.co/mii-community/zefiro-7b-dpo-ITA): based on zephyr by Tunstall et al. <ref type="bibr" target="#b18">[19]</ref>, with DPO training done on top of zefiro-7b-sft-ITA.</p><p>Furthermore, to test whether bilingual training helps the model solve these tasks, we instruction-tuned two new models. We start from the Meta-Llama-3-8B-Instruct checkpoint and fine-tune the model on 40,000 instances from 3 different datasets: databricks-dolly-15k, OpenOrca and UltraChat. The datasets are automatically translated to Italian using ChatGPT 3.5. We consider two different settings: one where 20,000 instances are kept for each language (Italian and English), and one where all 40,000 instances are kept for the Italian language only. For instruction tuning, we used LoRA with r equal to 16 and alpha equal to 16, targeting all linear layers of the model. Other hyperparameters are an effective batch size of 128, a learning rate of 2e-5, a weight decay of 0.01 and 5 warmup steps. In both cases, the instances used during training are chosen at random.</p><p>For all experiments, we use the greedy-decoding generation strategy with a maximum number of tokens to </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Results for the CE-NI setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset and for each metric is in bold generate equal to 64. This limit was set for computational reasons, and the value was chosen after studying the datasets to assess the number of tokens required for each answer. No combination of tokenizer and dataset had a 95th percentile greater than 50 for the answer token count, therefore we can safely set the previously defined boundary. We also set torch.bfloat16 and use flash-attention-2 <ref type="bibr" target="#b19">[20]</ref> to speed up the generation process.</p><p>Inference was always done with the batch size set to 1 to maximize the quality of the generated text. Furthermore, we consider changing the number of few-shot examples given in the prompt. Our assumption is that the models may learn to follow the patterns given in the examples, and therefore Italian language generation may become more likely thanks to the additional information conveyed in the prompt. We aim to mitigate this potential bias by decreasing the number of shots. Thus, the number of shots for all settings using a few-shot strategy was set to either 1 or 5.</p><p>We report the results of the OE setting in Table <ref type="table">1</ref> and of the CE-NI setting in Table <ref type="table">2</ref> and comment on them in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Hardware and Software Configuration</head><p>Our experimental setup consisted of a multi-node cluster provided by Fastweb SpA and equipped with Nvidia H100 GPUs for distributed training and evaluation. We used a suite of open-source libraries, including Transformers from Hugging Face <ref type="bibr" target="#b20">[21]</ref>, which provides seamless integration with PyTorch <ref type="bibr" target="#b21">[22]</ref> and DeepSpeed <ref type="bibr" target="#b22">[23]</ref>, as well as Unsloth <ref type="foot" target="#foot_0">8</ref> and TRL <ref type="bibr" target="#b23">[24]</ref>. This software stack has been instrumental in efficiently handling large data sets and complex models. This configuration allowed for parallelization of computations, significantly reducing training and evaluation time. DeepSpeed optimized memory usage and communication between nodes, allowing us to effortlessly scale evaluation processes across multiple model architectures.</p><p>The hardware-software combination ensured efficient, cost-effective, and reproducible experiments, which are critical for comparing multiple models and training new ones efficiently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Findings and Additional Tests</head><p>Analyzing the results, it is clear that the OE strategy did not yield satisfactory results for BLEU and ROUGE-L. We attribute this to the difficulty of generating a response that exactly matches the ground truth when the text that can be generated is not constrained in any way. To further support this point, we can see that the BertScore of some experiments yields good results, hinting that the semantics of the generated content is similar to that of the ground truth.</p><p>Regarding the CE-NI strategy, the obtained results are much better for all metrics. Therefore, providing the options in the input prompt greatly helped the models limit their generation to the provided options. Surprisingly, whereas fine-tuned versions of the LLaMA 3 family were shown to have much better results on the Italian leaderboard, here their results are in line with the base models (or even worse in some cases). Furthermore, one of the best-performing models is maestrale-chat-v0.4-alpha-sft, which consistently outperforms the LLaMA 3 models in most cases.</p><p>For both settings, the obtained results show that providing input-output examples in the prompt greatly enhances performance.</p><p>For both settings, primarily Instruct models were used. Upon analyzing the generated results, we observed instances where the model provided the correct result but appended an additional substring (e.g., the model began explaining the reasoning behind its response). To assess whether this might have affected the results, we performed an additional test where we checked if the ground truth string was a substring of the generated output (after removing punctuation and trailing whitespace, as well as lowercasing the two strings). We report the complete results in Appendix C. 
Overall, some models show an improvement in performance, but the results still do not beat maestrale-chat-v0.4-alpha-sft.</p><p>We provide some generation examples in Appendix B.</p></div>
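A minimal sketch of the relaxed matching test described above (lowercasing, punctuation removal, whitespace stripping, then a substring check); the exact normalization used in the experiments may differ in detail.

```python
import string

def normalized_substring_match(ground_truth, generated):
    """Return True if the normalized ground truth appears inside the
    normalized generated answer. Normalization: lowercase, strip
    punctuation, trim surrounding whitespace."""
    table = str.maketrans("", "", string.punctuation)
    gt = ground_truth.lower().translate(table).strip()
    gen = generated.lower().translate(table).strip()
    return gt in gen

# A correct answer with appended explanation still counts as a match.
print(normalized_substring_match("Roma", "La risposta corretta è Roma."))
```

This check deliberately over-credits verbose but correct answers, which is exactly the behavior the additional test was designed to measure.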
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions and Future Works</head><p>We have carried out a study on the effectiveness of evaluating Italian-adapted LLMs on closed-ended tasks, specifically multiple-choice question answering.</p><p>We have experimented with two settings: an open-ended one and a closed-ended one without option identifiers. The results show better performance for the latter. Furthermore, they also show significant differences in model performance with respect to the Open Italian LLM Leaderboard. We can conclude that the evaluation of Italian-adapted models should follow a more rigorous procedure which does not mainly rely on closed-ended tasks. We release the code that was used on GitHub 9 . In the future, we plan to further work on this topic and attempt to define best practices for the evaluation of these models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Format ARC_IT MMLU_IT EXAMS WBMM</head><p>Italia-9B-Instruct-v0.   </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table Sub -</head><label>Sub</label><figDesc>string matching results for the OE setting. For the few-shots formats, the number of given shots is also provided next to the format name. The best result for each dataset is in bold</figDesc><table><row><cell>Model</cell><cell cols="5">Format ARC_IT MMLU_IT EXAMS WBMM</cell></row><row><cell></cell><cell>P</cell><cell>0.00</cell><cell>0.38</cell><cell>0.30</cell><cell>73.56</cell></row><row><cell></cell><cell>P-F 1</cell><cell>39.86</cell><cell>33.19</cell><cell>37.53</cell><cell>52.43</cell></row><row><cell></cell><cell>P-F 5</cell><cell>44.74</cell><cell>36.03</cell><cell>40.10</cell><cell>56.62</cell></row><row><cell>Italia-9B-Instruct-v0.1</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>29.77</cell><cell>29.59</cell><cell>26.73</cell><cell>55.91</cell></row><row><cell></cell><cell>I-F 1</cell><cell>26.78</cell><cell>31.08</cell><cell>29.01</cell><cell>55.86</cell></row><row><cell></cell><cell>I-F 5</cell><cell>32.59</cell><cell>31.42</cell><cell>32.77</cell><cell>56.62</cell></row><row><cell></cell><cell>P</cell><cell>43.54</cell><cell>30.08</cell><cell>40.89</cell><cell>58.16</cell></row><row><cell></cell><cell>P-F 1</cell><cell>49.10</cell><cell>38.17</cell><cell>44.65</cell><cell>66.19</cell></row><row><cell></cell><cell>P-F 5</cell><cell>50.90</cell><cell>40.23</cell><cell>45.45</cell><cell>67.32</cell></row><row><cell>LLaMAntino-2-chat-13b-hf-UltraChat-ITA</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>41.66</cell><cell>26.29</cell><cell>34.75</cell><cell>60.56</cell></row><row><cell></cell><cell>I-F 
1</cell><cell>44.23</cell><cell>33.16</cell><cell>38.12</cell><cell>57.95</cell></row><row><cell></cell><cell>I-F 5</cell><cell>48.08</cell><cell>39.50</cell><cell>36.83</cell><cell>62.92</cell></row><row><cell></cell><cell>P</cell><cell>55.86</cell><cell>43.84</cell><cell>52.48</cell><cell>70.44</cell></row><row><cell></cell><cell>P-F 1</cell><cell>60.57</cell><cell>45.34</cell><cell>48.32</cell><cell>72.38</cell></row><row><cell></cell><cell>P-F 5</cell><cell>62.45</cell><cell>46.82</cell><cell>51.49</cell><cell>69.82</cell></row><row><cell>LLaMAntino-3-ANITA-8B-Inst-DPO-ITA</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>61.85</cell><cell>44.93</cell><cell>54.46</cell><cell>75.91</cell></row><row><cell></cell><cell>I-F 1</cell><cell>62.19</cell><cell>43.75</cell><cell>49.51</cell><cell>74.06</cell></row><row><cell></cell><cell>I-F 5</cell><cell>61.42</cell><cell>45.11</cell><cell>52.87</cell><cell>75.14</cell></row><row><cell></cell><cell>P</cell><cell>69.38</cell><cell>50.18</cell><cell>58.71</cell><cell>73.56</cell></row><row><cell></cell><cell>P-F 1</cell><cell>71.43</cell><cell>54.52</cell><cell>58.22</cell><cell>76.88</cell></row><row><cell>maestrale-chat-v0.4-alpha-sft</cell><cell>P-F 5</cell><cell>73.31</cell><cell>55.85</cell><cell>58.02</cell><cell>78.21</cell></row><row><cell></cell><cell>I</cell><cell>46.88</cell><cell>29.83</cell><cell>40.30</cell><cell>60.36</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.63</cell><cell>52.22</cell><cell>56.54</cell><cell>74.58</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.15</cell><cell>54.30</cell><cell>56.73</cell><cell>75.40</cell></row><row><cell></cell><cell>P</cell><cell>57.57</cell><cell>46.30</cell><cell>56.54</cell><cell>75.09</cell></row><row><cell>Meta-Llama-3-8B</cell><cell>P-F 1</cell><cell>63.13</cell><cell>46.88</cell><cell>51.58</cell><cell>71.20</cell></row><row><cell></cell><cell>P-F 
5</cell><cell>66.47</cell><cell>50.49</cell><cell>53.37</cell><cell>75.96</cell></row><row><cell></cell><cell>P</cell><cell>59.54</cell><cell>44.26</cell><cell>53.07</cell><cell>68.85</cell></row><row><cell></cell><cell>P-F 1</cell><cell>66.30</cell><cell>50.13</cell><cell>51.18</cell><cell>72.79</cell></row><row><cell></cell><cell>P-F 5</cell><cell>68.69</cell><cell>52.42</cell><cell>57.43</cell><cell>72.79</cell></row><row><cell>Meta-Llama-3-8B-Instruct</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>57.83</cell><cell>36.04</cell><cell>48.61</cell><cell>74.89</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.29</cell><cell>48.14</cell><cell>54.46</cell><cell>75.40</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.83</cell><cell>54.17</cell><cell>60.10</cell><cell>77.75</cell></row><row><cell></cell><cell>P</cell><cell>47.48</cell><cell>43.71</cell><cell>59.90</cell><cell>73.86</cell></row><row><cell>Minerva-3B-base-v1.0</cell><cell>P-F 1</cell><cell>25.66</cell><cell>28.51</cell><cell>23.86</cell><cell>33.25</cell></row><row><cell></cell><cell>P-F 5</cell><cell>20.10</cell><cell>23.09</cell><cell>22.87</cell><cell>34.94</cell></row><row><cell></cell><cell>P</cell><cell>48.76</cell><cell>39.18</cell><cell>41.58</cell><cell>60.67</cell></row><row><cell></cell><cell>P-F 1</cell><cell>55.00</cell><cell>40.37</cell><cell>46.04</cell><cell>62.56</cell></row><row><cell></cell><cell>P-F 5</cell><cell>60.31</cell><cell>45.34</cell><cell>48.42</cell><cell>64.86</cell></row><row><cell>zefiro-7b-dpo-ITA</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>31.48</cell><cell>31.50</cell><cell>40.40</cell><cell>72.69</cell></row><row><cell></cell><cell>I-F 1</cell><cell>50.98</cell><cell>46.11</cell><cell>45.15</cell><cell>66.55</cell></row><row><cell></cell><cell>I-F 
5</cell><cell>58.26</cell><cell>47.16</cell><cell>50.20</cell><cell>64.55</cell></row><row><cell></cell><cell>P</cell><cell>59.71</cell><cell>44.50</cell><cell>54.16</cell><cell>69.92</cell></row><row><cell></cell><cell>P-F 1</cell><cell>66.04</cell><cell>49.70</cell><cell>50.89</cell><cell>72.53</cell></row><row><cell>LLaMA3-BILINGUAL (Ours)</cell><cell>P-F 5 I</cell><cell>67.58 60.65</cell><cell>52.29 38.61</cell><cell>56.54 50.20</cell><cell>72.84 75.35</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.63</cell><cell>50.00</cell><cell>56.14</cell><cell>75.04</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.49</cell><cell>54.51</cell><cell>60.10</cell><cell>77.90</cell></row><row><cell></cell><cell>P</cell><cell>60.57</cell><cell>45.16</cell><cell>54.26</cell><cell>70.49</cell></row><row><cell></cell><cell>P-F 1</cell><cell>66.21</cell><cell>49.79</cell><cell>51.98</cell><cell>72.43</cell></row><row><cell>LLaMA3-ITA-ONLY (Ours)</cell><cell>P-F 5 I</cell><cell>67.67 59.88</cell><cell>52.38 37.08</cell><cell>57.23 50.40</cell><cell>73.71 75.40</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.21</cell><cell>50.19</cell><cell>56.63</cell><cell>74.94</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.40</cell><cell>54.28</cell><cell>59.41</cell><cell>77.65</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table</head><label></label><figDesc>Substring matching results for the CE-NI setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset is in bold.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_0">https://github.com/unslothai/unsloth</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prompt Formats</head><p>All showcased examples in this section are obtained from Meta-Llama-3-8B-Instruct model.</p><p>Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. Il freddo si sposta dal cubetto di ghiaccio alla sua mano. Risposta:</p><p>Example 1: Prompt in the P-F format for the OE setting Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre sessualmente e asessualmente? Opzioni: Consente alle piante di crescere più in alto. Produce fiori che attraggono gli insetti. Produce more che hanno un sapore migliore. Permette alle piante di more di adattarsi a nuove condizioni. Risposta: Permette alle piante di more di adattarsi a nuove condizioni.</p><p>Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. Il freddo si sposta dal cubetto di ghiaccio alla sua mano. Risposta:</p><p>Example 2: Prompt in the P-F 1 format for the OE setting &lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. 
Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; Example 3: Prompt in the I-F format using LLaMA 3 chat template &lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre sessualmente e asessualmente? Opzioni: Consente alle piante di crescere più in alto. Produce fiori che attraggono gli insetti. Produce more che hanno un sapore migliore. Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; Example 4: Prompt in the I-F 1 format using LLaMA 3 chat template</p></div>
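The I-F prompts shown above are assembled with the LLaMA 3 chat template: each few-shot example becomes a completed user/assistant exchange, and the final question is left with an open assistant header so the model generates the answer. A minimal sketch of that assembly (the exact whitespace around the special tokens is an assumption; in practice the template is normally applied via the tokenizer's `apply_chat_template` utility from the `transformers` library):

```python
def build_llama3_fewshot_prompt(shots, question):
    """Assemble an I-F style prompt with the LLaMA 3 chat template.

    shots: list of (question, answer) pairs used as in-context examples.
    question: the final question the model must answer.
    """
    parts = ["<|begin_of_text|>"]
    for q, a in shots:
        # Each shot is a completed user/assistant exchange.
        parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{q}<|eot_id|>")
        parts.append(f"<|start_header_id|>assistant<|end_header_id|>\n\n{a}<|eot_id|>")
    # Final user turn; the assistant header is left open so the model
    # continues from it with its answer.
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|>")
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

With one (question, answer) shot plus the target question, this reproduces the structure of Example 4; with an empty shot list, the structure of Example 3.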
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Zero-shot Response Examples</head><p>All showcased examples in this section are obtained from the Meta-Llama-3-8B-Instruct model.</p><p>Una sorgente sonora di frequenza f₀ si muove con velocità costante lungo una circonferenza. Nel centro della circonferenza si trova il ricevitore del suono. Quale asserzione è esatta per la frequenza f registrata dal ricevitore? Risposta:</p><p>Ground truth: f = f₀ Generated Answer: La frequenza f registrata dal ricevitore è costante e uguale a f₀ Example 5: Generated answer with additional text for the OE setting Il periodo di rotazione di un satellite artificiale intorno ad un pianeta è T. La distanza tra il satellite ed il centro del pianeta è r. A che distanza dal centro del pianeta ruota un altro satellite se il suo periodo di rotazione è T/8? Opzioni: 8r r/8 4r r/4 Risposta:</p><p>Ground truth: r/4 Generated Answer: r/8 Spiegazione: Se il periodo di rotazione del satellite è T/8, allora la sua distanza dal centro del pianeta è r/8. Infatti, se il satellite ruota in 8T, allora la sua distanza dal centro del pianeta è Example 6: Generated answer with additional text for the CE-NI setting</p></div>
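Examples 5 and 6 illustrate why exact-match scoring penalizes generations that append explanatory text to an otherwise usable answer, which motivates the substring-matching metric. A minimal sketch of that metric (the lowercasing and whitespace normalization here are assumptions, not the authors' exact implementation):

```python
def substring_match(generated: str, gold: str) -> bool:
    """Return True if the gold answer occurs as a substring of the generated text.

    Unlike exact match, this still credits answers followed by extra
    explanatory text, as in Example 5.
    """
    return gold.strip().lower() in generated.strip().lower()


def substring_accuracy(predictions, references):
    """Fraction of examples whose gold answer appears in the generation."""
    hits = sum(substring_match(p, g) for p, g in zip(predictions, references))
    return hits / len(references)
```

Note that substring matching is still strict about surface form: in Example 6 the gold answer "r/4" does not occur anywhere in the generated text, so the prediction is (correctly) counted as wrong.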
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Substring Matching Results</head></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey on evaluation of large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1" to="45" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Fourrier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lozovskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Szafer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<ptr target="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" />
		<title level="m">Open llm leaderboard v2</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Wróbel</surname></persName>
		</author>
		<ptr target="https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard" />
		<title level="m">Open pl llm leaderboard</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>SpeakLeash Team, Cyfronet Team</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<title level="m">Open ko-llm leaderboard: Evaluating large language models in korean with ko-h5 benchmark</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>ACL Main</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Black</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dipofi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Foster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Golding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mcdonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Phang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Reynolds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Thite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.5371628</idno>
		<ptr target="https://doi.org/10.5281/zenodo.5371628" />
		<title level="m">A framework for few-shot language model evaluation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">9</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Large language models for mathematical reasoning: Progresses and challenges</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Verma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</title>
				<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="225" to="237" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Head-to-tail: How knowledgeable are large language models (LLMs)? AKA will LLMs replace knowledge graphs?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Dong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="311" to="325" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">&quot;My answer is C&quot;: First-token probabilities do not match text answers in instruction-tuned language models</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Weber-Genzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Röttger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kreuter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.14499</idno>
		<ptr target="https://arxiv.org/abs/2402.14499" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">DanteLLM: Let&apos;s push Italian LLM research forward!</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bacciu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Campagnano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trappolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silvestri</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.388" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="4343" to="4355" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Disce aut deficere: Evaluating llms proficiency on the invalsi italian benchmark</title>
		<author>
			<persName><forename type="first">F</forename><surname>Mercorio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mezzanzanica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Potertì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Serino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seveso</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.17535</idno>
		<ptr target="https://arxiv.org/abs/2406.17535" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cowhey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schoenick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tafjord</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.05457v1</idno>
		<title level="m">Think you have solved question answering? try arc, the ai2 reasoning challenge</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Measuring massive multitask language understanding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations</title>
				<meeting>the International Conference on Learning Representations</meeting>
		<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">EXAMS: A multisubject high school examinations dataset for cross-lingual and multilingual question answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hardalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zlatkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.438</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.438" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5427" to="5444" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Playing with knowledge: A virtual player for &quot;who wants to be a millionaire?&quot; that leverages question answering techniques</title>
		<author>
			<persName><forename type="first">P</forename><surname>Molino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lops</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Gemmis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.artint.2015.02.003</idno>
		<ptr target="https://doi.org/10.1016/j.artint.2015.02.003" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">222</biblScope>
			<biblScope unit="page" from="157" to="181" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dernoncourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="318" to="327" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fiameni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.09993</idno>
		<title level="m">Llamantino: Llama 2 models for effective text generation in italian language</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.07101</idno>
		<title level="m">Advanced natural-based interaction for the italian language: Llamantino-3-anita</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Tunstall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Beeching</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rasul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belkada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fourrier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sarrazin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sanseviero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.16944</idno>
		<title level="m">Zephyr: Direct distillation of lm alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">FlashAttention-2: Faster attention with better parallelism and work partitioning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Transformers: State-of-the-art natural language processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Funtowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Von Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Plu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gugger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Drame</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lhoest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.emnlp-demos.6" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</title>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="38" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ansel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gimelshein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Voznesensky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Berard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Burovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chauhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chourdia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Constable</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desmaison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Devito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ellison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gschwind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hirsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kalambarkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kirsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lazos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lezcano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Luk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Maher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Puhrsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saroufim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Y</forename><surname>Siraichi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tillet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mathews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
		<idno type="DOI">10.1145/3620665.3640366</idno>
		<ptr target="https://pytorch.org/assets/pytorch2-2.pdf" />
	</analytic>
	<monogr>
		<title level="m">29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
	<note>ASPLOS &apos;24</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">DeepSpeed data efficiency: Improving deep learning model quality and training efficiency via efficient data sampling and routing</title>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.03597</idno>
		<ptr target="https://arxiv.org/abs/2212.03597" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Von Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belkada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tunstall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Beeching</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Thrush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<ptr target="https://github.com/huggingface/trl" />
		<title level="m">TRL: Transformer reinforcement learning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
