<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A study on the soundness of closed-ended evaluation of Large Language Models adapted to the Italian language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Michielon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Pasqualini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asia Beatrice Uboldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fastweb SpA</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National PhD in Artificial Intelligence, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>225</fpage>
      <lpage>237</lpage>
      <abstract>
        <p>With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open-weight architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the solutions proposed so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance on each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evidently not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) are models based on the Transformer architecture capable of solving a wide variety of Natural Language Generation (NLG) tasks, even tasks not encountered during training, thanks to their extensive pre-training and large number of parameters. Owing to these remarkable skills, interest in LLMs is now at its peak, resulting in a proliferation of open-weight models (e.g. LLaMA, Mistral, and many others). Among the several challenges related to the development of LLMs, one of the most critical is their evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One approach to tackle this issue has been to build benchmarks that collect different datasets, with the aim of obtaining a more comprehensive evaluation of a model's overall capabilities. Currently, there is a leaderboard [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] which keeps track of the capabilities of openly available LLMs.
      </p>
      <p>Specifically, the models are tested on six tasks that span different abilities a language model should have, e.g. reasoning or text completion. Regarding their reasoning abilities, the models are tested by solving closed-ended tasks. Specifically, multiple-choice question answering tasks are provided, where a question is given together with a list of possible alternatives, each associated with an identifier (a letter, a number, and so on). Intuitively, since the model has also been pre-trained on closed-ended question-answering data, it should be able to generalize and recognize the correct choice among the available ones. Furthermore, rather than generating the output directly, the probabilities learned by the model are studied, using log-likelihood to assess which option is more likely to be correct. For the English language, this evaluation methodology has been a standard approach to assess the capabilities of LLMs. However, when adapting a model to a new language, this methodology may not be as sound, due to the low amount of non-English data used to pre-train such models. Since the model only has to generate the correct option identifier, the evaluation does not really test its ability to generate high-quality text in another language. The goal of this work is to understand whether a new evaluation setting applied to language-adapted LLMs may give more insight than the traditional approach. Therefore, our contributions are the following:
• We test two evaluation settings for language-adapted LLMs, changing the structure of closed-ended question answering tasks;
• We evaluate the performance of state-of-the-art models on these settings;
• We study the sensitivity that the models have for the input prompt.</p>
    </sec>
    <sec id="sec-1-2">
      <title>2. Related Works</title>
      <p>
        Language Model evaluation has been a research focus ever since the first Decoder-only models, which were designed for natural language generation. One of the most remarkable skills of LLMs regarding reasoning has been in-context learning. In particular, few-shot learning has been increasingly used: the idea is that providing input-output examples in the model prompt should positively affect the generation process [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        There are multiple leaderboards which evaluate open LLMs on non-English languages, e.g. the Open PL LLM Leaderboard [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for Polish or the Open KO LLM Leaderboard [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for Korean. These leaderboards are often based on the lm-evaluation-harness framework [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which has been a milestone in the evaluation of LLMs. LLM evaluation can also depend on the topic at hand: some works focus on mathematical reasoning [7] as well as factuality [8].
      </p>
      <p>These evaluation settings often rely on closed-ended tasks, specifically multiple-choice question answering. The idea is to calculate the log-likelihood of the next token to generate for the option identifiers. However, this may not be the best setting to evaluate LLMs. Wang et al. [9] studied this on Instruction-tuned LLMs by training a classifier to predict which option to associate with the generated answer. This was done to look past additional text generated by the model (e.g. the generated text could be "The answer is B." as opposed to the simple "B." token). They found that the log-likelihood and generated-text decisions were often not matching.</p>
      <p>Regarding Italian evaluation, some works have approached this challenge. Bacciu et al. [10] released another version of the Open Italian LLM Leaderboard, considering a different variety of tasks. Mercorio et al. [11] released a benchmark based on questions that can be found in the INVALSI test, an Italian educational test, to further probe the knowledge and reasoning abilities of these models on a dataset that is natively in Italian rather than obtained through machine translation. The latter is one of the main problems when evaluating these models: due to the lack of resources with respect to the English language, the datasets used at the state of the art are translated using machine translation models. Still, all this effort made to evaluate Italian-adapted LLMs mainly relies on closed-ended tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>We study pre-trained and language-adapted models to test their capabilities in the resolution of Italian language tasks. Specifically, we want to modify the typical formatting used in multiple-choice question answering to study whether the models are capable of correctly following and generating Italian text. Usually, the format shown in Listing 1 is used, where &lt;QUESTION&gt; is the question the model has to answer, while &lt;IDENTIFIER_i&gt; and &lt;OPTION_i&gt; are, respectively, the option identifier (usually a letter or a number) and the text of the possible answer to the previously provided question. &lt;CORRECT_IDENTIFIER&gt; is the identifier of the option that is the correct answer to the question.</p>
      <p>&lt;QUESTION&gt;:
&lt;IDENTIFIER_1&gt; &lt;OPTION_1&gt;
&lt;IDENTIFIER_2&gt; &lt;OPTION_2&gt;
...
&lt;IDENTIFIER_N&gt; &lt;OPTION_N&gt;
&lt;CORRECT_IDENTIFIER&gt;
Listing 1: closed-ended format</p>
      <p>We aim to modify the task so that the model has to generate the text of the correct option instead of the identifier. To do so, we consider two main evaluation settings:
• Open-ended (OE): we remove the available options and only supply the question in the prompt;
• Closed-ended no identifiers (CE-NI): we format the options without an identifier; the model has to write the text of the correct option.</p>
      <p>In particular, for the CE-NI setting, we apply the format shown in Listing 2, where &lt;CORRECT_OPTION&gt; is the text of the option that represents the correct answer to the question.</p>
      <p>&lt;QUESTION&gt;:
&lt;OPTION_1&gt;
&lt;OPTION_2&gt;
...
&lt;OPTION_N&gt;
&lt;CORRECT_OPTION&gt;
Listing 2: closed-ended no identifiers format</p>
      <p>&lt;CORRECT_IDENTIFIER&gt; and &lt;CORRECT_OPTION&gt; are the outputs that we expect the evaluated model to generate. We provide complete examples of the prompt formats in Appendix A.</p>
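      <p>As an illustration of the three formats, the sketch below renders one instance into the closed-ended (Listing 1), OE and CE-NI prompts. The helper names and the sample question are hypothetical, not the paper's actual task code.</p>

```python
def closed_ended(question, options, identifiers):
    # Listing 1: each option is prefixed with its identifier;
    # the model is expected to emit the correct identifier.
    lines = [f"{question}:"] + [f"{i} {o}" for i, o in zip(identifiers, options)]
    return "\n".join(lines) + "\n"

def open_ended(question):
    # OE: the options are removed and only the question is supplied.
    return f"{question}:\n"

def ce_ni(question, options):
    # CE-NI: options are listed without identifiers;
    # the model is expected to emit the text of the correct option.
    return f"{question}:\n" + "\n".join(options) + "\n"

q = "Qual e' la capitale d'Italia"
opts = ["Milano", "Roma", "Napoli"]
print(closed_ended(q, opts, ["A", "B", "C"]))
print(open_ended(q))
print(ce_ni(q, opts))
```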
      <p>Generally, models are also evaluated by calculating the log-likelihood rather than by generating text directly; the chosen option is then selected based on the highest value. We choose to perform a generative task instead, to check whether the models are capable of generating only the answer string, without additional text, and also to check whether they generate something outside of the provided options.</p>
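      <p>The log-likelihood selection described above can be sketched as follows. A toy scorer stands in for a real LM's token log-probabilities; in practice the score would come from the model's logits over the option identifiers.</p>

```python
import math

def select_by_loglikelihood(score_token, prompt, option_identifiers):
    # For each candidate identifier, ask the (stand-in) model for the
    # log-probability of emitting it as the continuation of the prompt,
    # then pick the highest-scoring one -- no text is actually generated.
    scores = {ident: score_token(prompt, ident) for ident in option_identifiers}
    return max(scores, key=scores.get)

# Stand-in scorer: a fixed distribution imitating a model that favors "B".
toy_logprobs = {"A": math.log(0.2), "B": math.log(0.6), "C": math.log(0.2)}
def toy_scorer(prompt, token):
    return toy_logprobs[token]

best = select_by_loglikelihood(toy_scorer, "...prompt...", ["A", "B", "C"])
print(best)  # "B"
```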
      <p>To evaluate this case, we use the BLEU, ROUGE-L and BERTScore F1 metrics, which are reference-based metrics used to evaluate the correspondence of a generated sentence with a reference one. BLEU and ROUGE-L focus on matching n-grams, while BERTScore leverages pre-trained BERT models to assess the semantic similarity between the words of the two texts.</p>
      <p>Furthermore, we consider four different prompt formats:
• Plain (P): there is no formatting; the text of the task is provided as it is in the prompt, and only a "Risposta:" string is added at the end;
• Plain few-shot (P-F): same as P, but multiple input-output examples are provided;
• Instruct (I): the chat template of the model is applied to the text of the task;
• Instruct few-shot (I-F): same as I, but multiple input-output examples are provided.</p>
      <p>For the few-shot formats, we consider two distinct numbers of examples to provide in the prompt: one-shot and five-shot. The intuition is that a language-adapted LLM should significantly improve performance even when provided with a single example.</p>
      <p>We consider these prompt formats because most evaluation settings for Italian LLMs are applied without the chat template. We argue that this choice may not be the best one when considering Instruct models that have been trained using a specific prompt format to continue a conversation. They should be evaluated using the same prompt format, since it is also the one that will be used in case of deployment.</p>
      <p>
        To set up the experimental protocol, we use the lm-evaluation-harness library [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which provides an immediate and intuitive command line to automatically evaluate LLMs on previously defined as well as custom tasks. Specifically, we define custom tasks within the library following the previously defined evaluation settings. To do so, we consider the following datasets:
• ARC-Challenge [12]: multiple-choice science exam questions; the Challenge set consists of complex questions that were not correctly answered by either a retrieval or a co-occurrence method;
• MMLU [13]: multiple-choice questions from 57 different topics (e.g. mathematics, computer science, and so on), requiring problem-solving abilities and knowledge to answer correctly;
• EXAMS [14]: multiple-choice questions from high school exams. The dataset contains different subsets curated for different languages and optionally contains additional paragraphs regarding the question (extracted from Wikipedia);
• WWBM [15]: multiple-choice questions spanning a wide range of topics. The questions come from the Italian version of the "Who Wants to Be a Millionaire?" board game, where contestants answer progressively difficult questions. The question-answer instances are split into different categories depending on the difficulty of the question itself.
      </p>
      <p>Regarding the Italian version of these datasets, both EXAMS and WWBM natively provide splits in the Italian language. For ARC and MMLU, instead, we use the Italian version provided in the library for the okapi task released by Lai et al. [16], who performed automatic translation of the original datasets using GPT-3.5 Turbo for several languages. For all of these datasets, we define two custom tasks which apply the OE and CE-NI evaluation settings automatically. The examples used in the few-shot settings are taken from the validation splits of the datasets. For EXAMS, we use the train split as a test split (since a test split is not provided), while for WWBM, we remove the first five instances from the original dataset and use them as a validation split.</p>
      <p>Regarding the models, we experiment using the following:
• Italia-9B-Instruct-v0.1 (https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1): trained from scratch with a focus on the Italian language (90% of the data in Italian and the rest in English), with instruction tuning for conversational purposes;
• LLaMAntino-2-chat-13b-hf-UltraChat-ITA [17]: instruction tuning of LLaMAntino-2-chat-13b-hf-ITA (an Italian-adapted LLM) using a translated version of the UltraChat dataset;
• LLaMAntino-3-ANITA-8B-Inst-DPO-ITA [18]: fine-tuning, DPO and adaptation using a mixture of Italian and English datasets, starting from the LLaMA-3-8B-Instruct model;
• maestrale-chat-v0.4-alpha-sft (https://huggingface.co/mii-llm/maestrale-chat-v0.4-alpha-sft): instruction tuning for 2 epochs on a conversational dataset consisting of 1.7M instances, starting from an Italian-adapted version of Mistral-7b;
• Meta-Llama-3-8B and Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct): the latest version of the LLaMA family of models;
• Minerva-3B-base-v1.0 (https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0);
• zefiro-7b-dpo-ITA (https://huggingface.co/mii-community/zefiro-7b-dpo-ITA).</p>
      <p>We also train two new models. We start from the Meta-Llama-3-8B-Instruct checkpoint and fine-tune the model on OpenOrca and UltraChat, automatically translated to Italian using ChatGPT 3.5. We consider two different settings: one where 20,000 instances are kept for each language (Italian and English), and one where 40,000 instances are kept for the Italian language only. For instruction tuning, we used LoRA with rank equal to 16 and alpha equal to 16, targeting all linear layers of the model. The other hyperparameters are an effective batch size of 128, a learning rate of 2e-5, a weight decay of 0.01 and 5 warmup steps. In both cases, the instances used during training are chosen at random.</p>
      <p>For all experiments, we use the greedy-decoding generation strategy with a fixed maximum number of tokens to generate.</p>
      <p>[Table: Results for the OE setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset and for each metric is in bold.]</p>
      <p>All models are obtained from Hugging Face [21], which provides seamless integration with PyTorch [22] and DeepSpeed [23], as well as with Unsloth (https://github.com/unslothai/unsloth) and TRL [24]. This software stack has been instrumental in efficiently handling large datasets and complex models. This configuration allowed for parallelization of computations, significantly reducing training and evaluation time. DeepSpeed optimized memory usage and communication between nodes, allowing us to effortlessly scale evaluation processes across multiple model architectures. The hardware-software combination ensured efficient, cost-effective, and reproducible experiments, which are critical for comparing multiple models and training new ones efficiently.</p>
      <sec id="sec-2-29">
        <title>3.2. Findings and Additional Tests</title>
        <p>Analyzing the results, it is clear that the OE strategy did not yield very satisfactory results for BLEU and ROUGE-L. We associate this with the difficulty of generating a response that matches the ground truth exactly when the text being generated is not constrained in any way. To further support this point, we can see that the BERTScore of some experiments yields good results, hinting that the semantics of the generated content is similar to that of the ground truth.</p>
        <p>Regarding the CE-NI strategy, the obtained results are much better for all metrics: providing the options in the input prompt greatly helped the models in limiting their generation to the provided options. Surprisingly, with respect to the Italian leaderboard, where fine-tuned versions of the LLaMA 3 family were shown to have much better results, here the results are in line with those of the base models (or even worse in some cases). Furthermore, one of the best-performing models is maestrale-chat-v0.4-alpha-sft, which consistently outperforms the LLaMA 3 models in most cases.</p>
        <p>For both settings, the obtained results show that providing input-output examples in the prompt greatly enhances the results. However, ground-truth answers are given in the examples, and therefore Italian language generation may become more likely thanks to the additional information conveyed in the prompt; we aim to mitigate this potential bias by decreasing the number of provided examples. For both settings, primarily Instruct models were used.</p>
        <p>Upon analyzing the generated results, we observed instances where the model provided the correct result but appended an additional substring (e.g., the model began explaining the reasoning behind its response). To assess whether this might have affected the results, we performed an additional test in which we checked whether the ground-truth string was a substring of the generated output (after removing punctuation and trailing whitespace, as well as lowercasing the two strings). We report the complete results in Appendix C. Overall, some models show an improvement in performance, but the results still do not beat maestrale-chat-v0.4-alpha-sft. We provide some generation examples in Appendix B.</p>
      </sec>
      <sec id="sec-2-30">
        <title>4. Conclusions and Future Works</title>
        <p>We have carried out a study on the effectiveness of the evaluation of Italian-adapted LLMs on closed-ended tasks, specifically multiple-choice question answering. We have experimented with two settings: an open-ended one and a closed-ended one without option identifiers. The results show better performance for the latter. Furthermore, they also show that, with respect to the Open Italian LLM Leaderboard, there are significant differences regarding model performance. We can conclude that the evaluation of Italian-adapted models should follow a more rigorous procedure which does not rely mainly on closed-ended tasks. We release the code that was used at https://github.com/swapUniba/Closed-ITA-LLM-Evaluation. In the future, we plan to further work on the topic and attempt to define best practices for the evaluation of these models.</p>
      </sec>
      <sec id="sec-2-31">
        <title>Acknowledgments</title>
        <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by NextGenerationEU.</p>
      </sec>
      <sec id="sec-2-26">
        <p>P. Wu, S. Chintala, PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation, in: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), ACM, 2024. URL: https://pytorch.org/assets/pytorch2-2.pdf. doi:10.1145/3620665.3640366.</p>
        <p>[23] C. Li, Z. Yao, X. Wu, M. Zhang, C. Holmes, C. Li, Y. He, DeepSpeed data efficiency: Improving deep learning model quality and training efficiency via efficient data sampling and routing, 2024. URL: https://arxiv.org/abs/2212.03597. arXiv:2212.03597.</p>
        <p>[24] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, TRL: Transformer reinforcement learning, https://github.com/huggingface/trl, 2020.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Prompt Formats</title>
      <p>All showcased examples in this section are obtained from the Meta-Llama-3-8B-Instruct model.
Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Risposta:
Example 1: Prompt in the P format for the OE setting
Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si
riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre
sessualmente e asessualmente? Opzioni:
Consente alle piante di crescere più in alto.</p>
      <p>Produce fiori che attraggono gli insetti.</p>
      <p>Produce more che hanno un sapore migliore.</p>
      <p>Permette alle piante di more di adattarsi a nuove condizioni.</p>
      <p>Risposta: Permette alle piante di more di adattarsi a nuove condizioni.</p>
      <p>Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Risposta:
Example 2: Prompt in the P-F 1 format for the OE setting
&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;
Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
Example 3: Prompt in the I format using LLaMA 3 chat template
Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si
riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre
sessualmente e asessualmente? Opzioni:
Consente alle piante di crescere più in alto.</p>
      <p>Produce fiori che attraggono gli insetti.</p>
      <p>Produce more che hanno un sapore migliore.</p>
      <p>Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;
Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
Example 4: Prompt in the I-F 1 format using LLaMA 3 chat template</p>
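      <p>The plain-format assembly shown in Examples 1 and 2 can be sketched as follows. The helper names and sample strings are hypothetical; the spacing follows the examples above, with the gold answer appended after "Risposta:" only for the in-context examples.</p>

```python
def p_format(question, options, answer=None):
    # Plain (P) format as in the appendix examples: question, "Opzioni:",
    # one option per line, then "Risposta:" (followed by the gold answer
    # only when the block is used as an in-context example).
    block = f"{question} Opzioni:\n" + "\n".join(options) + "\nRisposta:"
    return block + (f" {answer}" if answer else "")

def p_few_shot(examples, test_question, test_options):
    # P-F format: worked examples are prepended to the test instance.
    shots = [p_format(q, o, a) for q, o, a in examples]
    return "\n".join(shots + [p_format(test_question, test_options)])

prompt = p_few_shot(
    [("Domanda di esempio?", ["opzione 1", "opzione 2"], "opzione 2")],
    "Domanda di test?", ["opzione A", "opzione B"],
)
print(prompt)
```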
    </sec>
    <sec id="sec-4">
      <title>B. Zero-shot Response Examples</title>
      <p>All showcased examples in this section are obtained from the Meta-Llama-3-8B-Instruct model.
Una sorgente sonora di frequenza f 0, si muove con velocità costante lungo una circonferenza. Nel centro della circonferenza si
trova il ricevitore del suono. Quale asserzione è esatta per la frequenza f registrata dal ricevitore? Risposta:
Ground truth: f = f 0
Generated Answer: La frequenza f registrata dal ricevitore è costante e uguale a f 0
Example 5: Generated answer with additional text for the OE setting
Il periodo di rotazione di un satellite artificiale intorno ad un pianeta è T . La distanza tra il satellite ed il centro del pianeta è r .
A che distanza dal centro del pianeta ruota un altro satellite se il suo periodo di rotazione è T / 8? Opzioni:
8 r
r/8
4 r
r/4
Risposta:
Ground truth: r/4
Generated Answer: r/8 Spiegazione: Se il periodo di rotazione del satellite è T / 8, allora la sua distanza dal centro del pianeta
è r / 8. Infatti, se il satellite ruota in 8T, allora la sua distanza dal centro del pianeta è
Example 6: Generated answer with additional text for the CE-NI setting</p>
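      <p>A minimal sketch of the substring-matching check described in Section 3.2, which credits generations like Example 6 that append extra text after the answer. The normalization here also collapses internal whitespace, a slight extension of the paper's description, and the sample generations are constructed for illustration.</p>

```python
import string

def normalize(text):
    # Lowercase, drop punctuation and collapse whitespace.
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

def substring_match(ground_truth, generated):
    # Credit an answer when the normalized gold string appears anywhere
    # in the normalized generation, even if extra text follows it.
    return normalize(ground_truth) in normalize(generated)

# Constructed generations illustrating the two outcomes:
print(substring_match("r/4", "r/4. Infatti, il periodo scala con r."))    # True
print(substring_match("r/4", "r/8 Spiegazione: il periodo scala con r."))  # False
```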
    </sec>
    <sec id="sec-5">
      <title>C. Substring Matching Results</title>
      <p>[Table: Substring matching results for Italia-9B-Instruct-v0.1, LLaMAntino-2-chat-13b-hf-UltraChat-ITA, LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, maestrale-chat-v0.4-alpha-sft, Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Minerva-3B-base-v1.0, zefiro-7b-dpo-ITA, LLaMA3-BILINGUAL (Ours) and LLaMA3-ITA-ONLY (Ours), under the P, P-F 1, P-F 5, I, I-F 1 and I-F 5 prompt formats.]</p>
    </sec>
  </body>
</article>