A study on the soundness of closed-ended evaluation of
                                Large Language Models adapted to the Italian language
                                Elio Musacchio1,2,* , Lucia Siciliani1,* , Pierpaolo Basile1,* , Edoardo Michielon3 ,
                                Marco Pasqualini3 , Asia Beatrice Uboldi3 and Giovanni Semeraro1
                                1
                                  Department of Computer Science, University of Bari Aldo Moro, Italy
                                2
                                  National PhD in Artificial Intelligence, University of Pisa, Italy
                                3
                                  Fastweb SpA, Milan, Italy


                                                  Abstract
                                                  With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language
                                                  Generation tasks, an increasing number of open weights architectures have been developed and released online. In contrast
                                                  with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown
                                                  outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural
                                                  language. Nevertheless, evaluating this kind of capability is far from easy. One of the proposed solutions so far is using
                                                  benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance in
                                                  each of these individual tasks can imply having developed a model capable of understanding language. However, while this
                                                  assumption is not incorrect, it is evident that it is not sufficient, and the evaluation of Large Language Models still remains an
                                                  open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and
                                                  how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional
                                                  approaches.

                                                  Keywords
                                                  Large Language Models, Natural Language Processing, Evaluation, Benchmark


                                1. Introduction                                                                                            keeps track of the capabilities of openly available LLMs.
                                                                                                                                           Specifically, the models are tested on six tasks that span
                                Large Language Models (LLMs) are models based on                                                           different abilities a language model should have, e.g. rea-
                                the Transformer architecture capable of solving a wide                                                     soning or text completion. Regarding their reasoning abil-
                                variety of Natural Language Generation (NLG) tasks, even                                                   ities, the models are tested by solving closed-ended tasks.
                                those not encountered during training, due to their ex-                                                    Specifically, multiple-choice question answering tasks are
                                tensive training and large number of parameters. Thanks                                                    provided, where a question is given with a list of possible
                                to their remarkable skills, interest in LLMs is now at its                                                 alternatives associated with an identifier (a letter, a num-
                                climax, resulting in a proliferation of open-weight mod-                                                   ber, and so on). Intuitively, since the model has also been
                                els (e.g. LLaMA, Mistral, and many others). Among the                                                      pre-trained on closed-ended question-answering data, it
                                several challenges related to the development of LLMs,                                                     should be able to generalize and understand the correct
                                one of the most critical is their evaluation [1]. One ap-                                                  choice out of the available ones. Furthermore, rather than
                                proach to tackle this issue has been to build benchmarks                                                   generating the output directly, the probabilities learned
                                that collect different datasets, with the aim of obtaining                                                 by the model are studied, using log-likelihood to assess
                                a more comprehensive evaluation of the model’s overall                                                     which option is more likely to be correct. For the En-
                                capabilities. Currently, there is a leaderboard1 [2] which                                                 glish language, this evaluation methodology has been
                                                                                                                                           a standard approach to assess the capabilities of LLMs.
                                CLiC-it 2024: Tenth Italian Conference on Computational Linguistics,                                       However, when adapting a model to a new language, due
                                Dec 04 — 06, 2024, Pisa, Italy
                                *
                                  Corresponding author.
                                                                                                                                           to the low amount of non-English data that has been used
                                $ elio.musacchio@uniba.it (E. Musacchio); lucia.siciliani@uniba.it                                         to pre-train such models, this methodology may not be
                                (L. Siciliani); pierpaolo.basile@uniba.it (P. Basile);                                                     as sound. The model only has to generate the correct
                                edoardo.michielon@consulenti.fastweb.it (E. Michielon);                                                    option identifier, therefore this is not really testing the
                                marco.pasqualini@consulenti.fastweb.it (M. Pasqualini);                                                    ability of the model of generating high-quality text in
                                asiabeatrice.uboldi@consulenti.fastweb.it (A. B. Uboldi);
                                giovanni.semeraro@uniba.it (G. Semeraro)
                                                                                                                                           another language. The goal of this work is to understand
                                 0009-0006-9670-9998 (E. Musacchio); 0000-0002-1438-280X                                                  whether a new evaluation setting applied to language-
                                (L. Siciliani); 0000-0002-0545-1105 (P. Basile); 0000-0001-6883-1853                                       adapted LLMs may give more insight than the traditional
                                (G. Semeraro)                                                                                              approach. Therefore, our contributions are the following:
                                            © 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License
                                            Attribution 4.0 International (CC BY 4.0).
                                1
                                    https://huggingface.co/spaces/open-llm-leaderboard/open_llm_                                                • We test two evaluation settings for language-
                                    leaderboard                                                                                                   adapted LLMs changing the structure of closed-


CEUR
                  ceur-ws.org
Workshop      ISSN 1613-0073
Proceedings
       ended question answering tasks;                        3. Experiments
     • We evaluate the performance of state-of-the-art
       models on these settings;                             We study pre-trained and language-adapted models to
     • We study the sensitivity that the models have for     test their capabilities in the resolution of Italian language
       the input prompt.                                     tasks. Specifically, we want to modify the typical for-
                                                             matting that is used in multiple choice question answer-
                                                             ing to study if the models are capable of correctly fol-
2. Related Works                                             lowing and generating Italian text. Usually, the format
                                                             shown in Listing 1 is used, where <QUESTION> is the
Language Model evaluation has been a research focus question the model has to answer, <IDENTIFIER_i> and
ever since the first Decoder-only models, which were <OPTION_i> are the option identifier, which is usually
designed for natural language generation.                    a letter or a number, and the text of the possible answer
   One of the most remarkable skills regarding LLMs to the previously provided question respectively. <COR-
reasoning has been in-context learning. In particular, few- RECT_IDENTIFIER> is the identifier of the option that
shot learning has been increasingly used. The idea is that is the correct answer to the question.
providing examples of input-output in the model prompt
should affect positively the generation process [3].
                                                                   <QUESTION>:
   There are multiple leaderboards which evaluate open
                                                                   <IDENTIFIER_1> <OPTION_1>
LLMs on non-English languages, e.g. Open PL LLM
                                                                   <IDENTIFIER_2> <OPTION_2>
Leaderboard [4] for Polish or Open KO LLM Leaderboard
                                                                   ...
[5] for Korean. These leaderboards are often based on the
                                                                   <IDENTIFIER_N> <OPTION_N>
lm evaluation harness framework [6], which has been a
milestone in the evaluation of LLMs. LLM evaluation can
                                                                   <CORRECT_IDENTIFIER>
also depend on the topic at hand. There are some works
which focus on mathematical reasoning [7] as well as                        Listing 1: closed-ended format
factuality [8].
   These evaluation settings often rely on closed-ended
tasks, specifically multiple-choice question answering.         We aim to modify the task so that the model has to
The idea is to calculate the log-likelihood of the next generate the text of the correct option instead of the
token to generate for the option identifiers. However, this identifier. To do so, we consider two main evaluation
may not be the best setting to evaluate LLMs. Wang et al. settings:
[9] studied this on Instruction-tuned LLMs by training
                                                                   • Open-ended (OE): we remove the available op-
a classifier to predict which possible option to associate
                                                                      tions and only supply the question in the prompt;
with the generated answer. This was done to glance over
                                                                   • Closed-ended no identifiers (CE-NI): we for-
additional text generated by the model (e.g. the generated
                                                                      mat the options without an identifier, the model
text could be "The answer is B." as opposed to the simple
                                                                      has to write the corresponding text of the correct
"B." token). They found that the log-likelihood and the
                                                                      option.
generated text decisions were often not matching.
   Regarding Italian evaluation, some works have ap-            In particular, for the CE-NI setting, we apply the format
proached this challenge. Bacciu et al. [10] released an- shown in Listing 2, where <CORRECT_OPTION> is the
other version of the Open Italian LLM Leaderboard, con- text of the option that represents the correct answer to
sidering a different variety of tasks. Mercorio et al. [11] the question.
released a benchmark based on questions that can be
found in the INVALSI test, an Italian educational test,
to further test the knowledge and reasoning abilities of
these models on a dataset that is natively in Italian rather       <QUESTION>:
than obtained through machine translation. The latter is           <OPTION_1>
one of the main problems when evaluating these mod-                <OPTION_2>
els, due to the lack of resources w.r.t. English language,         ...
datasets that are used at the state-of-the-art are trans-          <OPTION_N>
lated using machine translation models. Still, all this
effort made to evaluate Italian-adapted LLMs mainly re-            <CORRECT_OPTION>
lies on closed-ended tasks.                                         Listing 2: closed-ended no identifiers format
   <CORRECT_IDENTIFIER>               and         <COR-               • MMLU [13]: consists of multiple-choice ques-
RECT_OPTION> are the outputs that we expect                             tions from 57 different topics (e.g. mathematics,
the evaluated model should generate.                                    computer science, and so on), requiring problem-
We provide complete examples of the prompt formats in                   solving abilities and knowledge to answer cor-
Appendix A.                                                             rectly;
   Generally models are also evaluated by calculating the             • EXAMS [14]: consists of multiple-choice ques-
log-likelihood rather than generating text directly. The                tions from high school exams. The dataset con-
chosen option is then selected based on the highest value.              tains different subsets curated for different lan-
We choose to perform a generative task instead, to check                guages and optionally contains additional para-
whether the models are capable of generating the answer                 graphs regarding the question (extracted from
string only without additional text and to also check if                Wikipedia);
they generate something outside of the provided options.              • WWBM [15]: consists of multiple-choice ques-
To evaluate this case, we use the BLEU, ROUGE-L and                     tions spanning a wide range of topics. The ques-
BertScore F1 metrics, which are reference metrics used                  tions come from the Italian version of the “Who
to evaluate the correspondence of a generated sentence                  Wants to Be a Millionaire?” board game where
with a base one. BLEU and ROUGE-L focus on matching                     contestants answer progressively difficult ques-
n-grams, while BertScore leverages pre-trained Bert                     tions. The question-answer instances are split
models to assess the semantic similarity between words                  into different categories depending on the diffi-
of the two texts. Furthermore, we consider four different               culty of the question itself.
possible prompt formats:
                                                             For the Italian version of these datasets, both EXAMS
     • Plain (P): there is no formatting, the text of the
                                                          and WWBM are provided with splits in the Italian lan-
       task is provided as it is in the prompt, only a
                                                          guage natively. For ARC and MMLU, instead, we use
       "Risposta:" string is added at the end;
                                                          the Italian version provided in the library for the okapi
     • Plain few-shot (P-F): same as P, but multiple
                                                          task released by Lai et al. [16], who performed automatic
       examples of input-output are provided;
                                                          translation of the original datasets using GPT-3.5 Turbo
     • Instruct (I): the chat template of the model is
                                                          for several languages. For all of these datasets, we de-
       applied to the text of the task;
                                                          fine two custom tasks which apply the OE and CE-NI
     • Instruct few-shot (I-F): same as I, but multiple
                                                          evaluation settings automatically. The examples used in
       examples of input-output are provided.
                                                          the few-shot settings are taken from the validation splits
   Furthermore, for the few-shot formats, we consider of the datasets. For EXAMS, we use the train split as a
two distinct numbers of examples to provide in the test split (since it is not provided), while for WWBM, we
prompt: one-shot and five-shots. The intuition is that remove the first five instances from the original dataset
a language-adapted LLM should significantly improve and use them as a validation split.
performance even when provided with a single example.        Regarding the models, we experiment using the fol-
   We consider these prompt formats because most of lowing:
the evaluation settings for Italian LLMs are done without
applying the chat template. We argue that this choice           • Italia-9B-Instruct-v0.12 : trained from scratch
may not be the best one when considering Instruct models          with a focus on the Italian language (90% of data
that have been trained using a specific prompt format to          in Italian and the rest in English) with instruction-
continue a conversation. They should be evaluated using           tuning for conversational purposes;
the same prompt format since it is also the one that will       • LLaMAntino-2-chat-13b-hf-UltraChat-ITA
be used in case of deployment.                                    [17]: instruction-tuning of LLaMAntino-2-chat-
   To set up the experimental protocol, we use the lm-            13b-hf-ITA (an Italian-adapted LLM) using a
evaluation-harness library [6], which provides an imme-           translated version of the UltraChat dataset;
diate and intuitive command line to automatically evalu-        • LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
ate LLMs on previously defined as well as custom tasks.           [18]: fine-tuning, DPO and adaptation using a
Specifically, we define custom tasks within the library           mixture of Italian and English datasets starting
following the previously defined evaluation settings. To          from the LLaMA-3-8B-Instruct model;
do so, we consider the following datasets:                      • maestrale-chat-v0.4-alpha-sft3 : instruction-
     • ARC-Challenge [12]: consists of multiple-                  tuning for 2 epochs on a conversational dataset
        choice science exam questions, the Challenge              consisting of 1.7M instances, starting from an
        set consists of complex questions that were not           Italian-adapted version of Mistral-7b;
       correctly answered by both a retrieval and co-        2
                                                                 https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1
       occurrence method;                                    3
                                                                 https://huggingface.co/mii-llm/maestrale-chat-v0.4-alpha-sft
                    Model                    Format              ARC_IT                     MMLU_IT                        EXAMS                        WBMM
                                                      BLEU    ROUGE-L Bert-Score   BLEU    ROUGE-L Bert-Score   BLEU    ROUGE-L Bert-Score   BLEU    ROUGE-L Bert-Score
                                                P      0.00      0.05    0.69       0.00      0.30    1.96       0.00      0.38    2.13       0.76     27.70   70.17
                                              P-F 1    2.17     13.43   68.88       1.35      8.72   54.52       1.28     13.25   66.87       2.58     33.47   77.29
                                              P-F 5    3.50     17.95   73.30       2.17     12.94   70.27       2.18     15.60   72.29       7.54     38.56   83.18
             Italia-9B-Instruct-v0.1
                                                I      0.52      7.17   64.30       0.75      6.91   63.13       0.50      6.57   63.11       0.24      7.65   63.36
                                              I-F 1    0.57      6.99   64.33       0.70      7.08   63.35       0.50      6.59   63.25       0.22      6.93   62.63
                                              I-F 5    0.70      8.00   65.35       0.84      7.95   64.45       0.56      7.04   63.52       0.30     10.16   64.77
                                                P      1.01     11.35   66.12       1.28     10.34   61.10       0.84     10.43   64.86       0.57     20.59   69.17
                                              P-F 1    1.99     15.47   71.38       0.99      8.87   62.97       1.42     14.39   69.41       3.35     33.64   81.18
                                              P-F 5    3.49     18.71   73.97       2.69     14.51   71.32       2.29     16.78   73.47       9.93     35.82   83.21
    LLaMAntino-2-chat-13b-hf-UltraChat-ITA
                                                I      0.80      7.50   64.34       0.87      6.94   63.27       0.50      6.25   62.87       0.24      8.51   64.03
                                              I-F 1    0.95      9.70   65.93       1.02      8.03   63.96       0.71      8.59   64.53       0.36     11.43   66.13
                                              I-F 5    1.61     14.15   70.09       0.87      6.94   66.40       1.06     12.57   68.70       2.42     32.73   70.10
                                                P      0.88     10.18   65.71       0.95     10.08   65.39       0.66     10.45   65.01       0.23     15.25   67.05
                                              P-F 1    1.91     14.99   70.49       0.81      8.42   62.37       1.48     16.42   70.67       1.84     34.75   81.27
                                              P-F 5    1.41     15.24   69.40       0.75     10.59   65.00       1.40     17.74   72.63       2.94     35.32   82.36
    LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
                                                I      0.74      8.10   65.34       0.78      8.05   64.44       0.37      6.13   62.75       0.20      8.38   63.05
                                              I-F 1    1.14     11.41   68.83       0.72      9.21   63.29       0.77     14.69   68.03       0.36     11.43   76.91
                                              I-F 5    1.84     14.74   71.50       1.10     11.87   68.81       0.88     15.10   71.28       1.32     33.09   81.10
                                                P      1.26     11.35   65.29       1.50     10.47   57.25       1.03     12.23   60.84       0.76     27.70   70.17
                                              P-F 1    3.43     19.45   73.16       1.49     12.14   65.56       2.86     22.53   73.09       6.75     46.26   84.60
                                              P-F 5    5.33     21.29   74.59      3.40     17.99    72.53      4.48     23.45    75.77      20.66    50.50    87.08
         maestrale-chat-v0.4-alpha-sft
                                                I      0.88      8.38   64,61       0.99      8.15   63.65       0.77     11.05   65.53       0.47     19.98   69.34
                                              I-F 1    1.43     11.77   68.04       1.34      9.73   65.38       1.12     14.93   68.31       1.70     39.04   80.08
                                              I-F 5    2.34     16.27   71.37       1.91     15.11   69.33       2.47     20.83   74.12       2.86     45.05   84.10
                                                P      0.74      7.18   61.89       0.75      7.32   61.02       0.57      5.73   60.63       0.21     11.63   63.49
               Meta-Llama-3-8B                P-F 1    3.35     18.57   73.58       1.31     10.21   63.81       2.99     21.10   72.85       9.06     40.66   83.82
                                              P-F 5   5.59     21.53    74.85       3.23     17.39   72.42       3.16     21.32   74.70      16.34     45.18   85.85
                                                P      0.92     10.10   65.38       1,04     10.03   64.90       0.71      9.03   64.55       0.22     12.92   65.58
                                              P-F 1    2.56     17.29   72.06       1.11      8.85   62.76       1.83     18.00   70.81       3.99     37.27   82.28
                                              P-F 5    4.50     19.70   73.98       3.26     16.67   72.42       3.57     21.11   74.86       9.40     39.28   84.04
           Meta-Llama-3-8B-Instruct
                                                I      0.50      6.07   64.00       0.72      6.19   63.24       0.41      5.15   62.25       0.21      6.69   62.07
                                              I-F 1    0.81      9.62   65.87       1.07      9.64   65.42       0.76     10.96   65.29       0.64     23.33   71.47
                                              I-F 5    2.46     17.44   72.09       2.35     15.41   71.01       0.88     15.10   73.84       5.96     39.86   83.87
                                                P      0.39      4.76   59.43       0.42      4.65   58.24       0.25      4.09   58.78       0.10      3.22   58.07
             Minerva-3B-base-v1.0             P-F 1    0.76      9.75   67.01       0.58      5.90   60.49       0.38      5.57   60.98       2.22     27.03   78.51
                                              P-F 5    2.61     14.08   71.22       1.57      8.92   64.40       2.01     13.65   70.64      10.65     33.59   82.32
                                                P      0.72      4.10   66.25       1.04     10.69   65.11       0.65      9.31   65.32       0.65      9.31   67.70
                                              P-F 1    3.64     16.47   72.60       1.19     11.31   66.58       2.75     17.09   71.21       6.12     33.15   81.85
                                              P-F 5    2.86     17.44   74.66       2.91     15.25   72.26       3.14     19.21   74.44      10.59     35.31   83.31
               zefiro-7b-dpo-ITA
                                                I      0.65      6.96   63.50       0.85      6.91   62.85       0.55      6.23   62.47       0.22      6.96   63.20
                                              I-F 1    1.03      9.57   66.31       0.76      6.20   62.23       0.80      8.66   64.65       0.30      8.32   64.41
                                              I-F 5    1.91     14.50   70.63       1.91     15.11   66.09       1.52     15.36   70.47       0.81     24.60   73.30
                                                P      0.80      9.17   64.41       1.00      9.34   64.13       0.67      8.32   63.68       0.20     11.77   64.80
                                              P-F 1    2.54     17.65   72.12       1.12      9.05   62.93       1.81     18.15   70.87       4.53     37.43   82.58
                                              P-F 5    4.69     19.68   74.09       3.26     16.89   72.24       3.31     20.85   74.61       9.54     39.35   84.03
          LLaMA3-BILINGUAL (Ours)
                                                I      0.54      6.16   64.05       0.73      6.35   63.20       0.34      5.18   62.17       0.21      6.62   61.95
                                              I-F 1    0.90     10.63   66.72       1.19     10.48   65.88       0.91     12.63   66.24       0.77     27.20   73.93
                                              I-F 5    3.33     18.00   72.76       2.90     15.80   71.69       2.64     18.73   73.84       7.23     39.75   83.97
                                                P      0.87      6.75   64.07       0.97      9.10   64.59       0.64      7.78   63.23       0.19     10.51   64.02
                                              P-F 1    2.47     17.74   72.03       1.14      9.13   63.00       1.73     17.94   70.77       4.67     37.67   82.69
                                              P-F 5    2.61     16.64   74.10       3.11     16.97   72.21       3.22     21.04   74.65       8.91     39.34   84.05
           LLaMA3-ITA-ONLY (Ours)
                                                I      0.58      6.05   64.12       0.73      6.35   63.24       0.35      5.21   62.17       0.21      6.90   62.14
                                              I-F 1    1.02     10.94   67.03       1.26     10.79   66.33       0.96     12.95   66.52       0.77     27.20   74.25
                                              I-F 5    3.13     18.35   72.89       2.98     15.87   71.76       2.72     18.45   73.86       7.23     39.75   84.11

Table 1
Results for the OE setting. For the few-shots formats, the number of given shots is also provided next to the format name. The
best result for each dataset and for each metric is in bold


          • Meta-Llama-3-8B4 and Meta-Llama-3-8B-                                          new models. We start from the Meta-LLaMA-3-8B-
            Instruct5 : latest version of the LLaMA family                                 Instruct checkpoint and fine-tune the model on 40, 000
            of models released by META (base and instruct                                  instances from 3 different datasets: databricks-dolly-15k,
            version respectively);                                                         OpenOrca and UltraChat. The datasets are automatically
          • Minerva-3B-base-v1.06 : trained from scratch to                                translated to Italian using ChatGPT 3.5. We consider two
            be a proficient bilingual base model (English and                              different settings, one where 20, 000 instances are kept
            Italian);                                                                      for each language (Italian and English), and one where
          • zefiro-7b-dpo-ITA7 : based on zephyr by Tun-                                   40, 000 instances are kept for the Italian language only.
            stall et al. [19], DPO training done on top of zefiro-                         For instruction tuning, we used LoRA with 𝑟 equal to 16
            7b-sft-ITA.                                                                    and alpha equal to 16, targeting all linear layers of the
                                                                                           model. Other hyperparameters are effective batch size
  Furthermore, to test whether bilingual training helps                                    equal to 128, learning rate equal to 2𝑒 − 5, weight decay
the model solve these tasks, we instruction-tuned two                                      equal to 0.01 and warmup steps equal to 5. In both cases,
                                                                                           the instances to be used during the training are chosen
4
  https://huggingface.co/meta-llama/Meta-Llama-3-8B                                        at random.
5
  https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct                                  For all experiments, we use the greedy-decoding gen-
6
  https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0
7
  https://huggingface.co/mii-community/zefiro-7b-dpo-ITA                                   eration strategy with a maximum number of tokens to
                Model                    Format              ARC_IT                     MMLU_IT                        EXAMS                        WBMM
                                                  BLEU    ROUGE-L Bert-Score   BLEU    ROUGE-L Bert-Score   BLEU    ROUGE-L Bert-Score   BLEU    ROUGE-L Bert-Score
                                            P      0.00      0.00    0.06       0.00      0.59    1.22        0.0      0.38    0.38      15.32     73.40   85.48
                                          P-F 1   53.48     55.09   87.09      36.80     49.17   84.18      55.49     55.00   86.74      45.60     55.00   82.55
                                          P-F 5   56.34     58.89   88.52      44.40     52.41   85.88      61.55     57.38   88.33      53.75     59.73   90.66
         Italia-9B-Instruct-v0.1
                                            I      5.76     21.91   71.17       9.00     27.68   72.64       4.32     18.44   68.91       0.80     20.14   69.70
                                          I-F 1    6.61     26.10   73.02      12.85     34.66   76.37       9.02     31.13   74.74       0.73     19.22   69.88
                                          I-F 5   20.48     42.83   81.79      17.92     40.90   80.14      28.41     47.58   83.99      13.18     48.74   87.45
                                            P     30.12     50.94   81.74      28.16     39.69   69.34      40.63     55.14   82.94      10.43     58.07   83.02
                                          P-F 1   55.05     61.92   86.97      31.61     49.91   82.15      55.25     61.98   85.13      63.84     68.91   90.84
                                          P-F 5   61.89     63.37   89.76      47.52     56.01   86.79      65.37     61.54   89.61      65.36     70.35   93.05
LLaMAntino-2-chat-13b-hf-UltraChat-ITA
                                            I     12.48     28.34   72.03       9.86     20.21   68.39       7.87     22.46   69.09       1.24     22.45   69.34
                                          I-F 1   26.69     47.17   80.57      17.02     32.28   74.05      16.93     37.10   74.83       7.45     69.00   75.40
                                          I-F 5   45.81     57.95   86.78      30.61     48.57   82.92      42.04     51.42   82.78      36.48     65.88   91.00
                                            P     12.15     37.28   74.72      14.69     37.91   75.05      12.21     38.12   75.46       1.30     39.35   76.48
                                          P-F 1   14.47     47.84   79.49      15.84     36.97   72.69      18.55     51.38   83.07       6.42     69.34   90.84
                                          P-F 5   22.85     61.81   85.17      15.85     47.98   79.34      17.64     56.84   84.49       7.37     68.90   91.11
 LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
                                            I     26.20     50.98   77.86      23.28     42.78   75.57      20.46     43.53   74.63       1.71     30.53   68.74
                                          I-F 1   20.74     55.60   84.26      15.74     40.51   75.90      17.07     49.49   81.87       3.89     63.97   88.29
                                          I-F 5   33.17     64.94   88.34      26.53     55.00   84.09      29.73     60.60   87.10       7.08     71.96   91.75
                                            P     42.45     69.92   88.44      38.09     59.54   84.57      46.17     68.57   87.20      15.32     73.40   85.48
                                          P-F 1   79.53     79.04   94.04      34.92     55.74   83.36      62.81     71.17   87.53      69.73     78.49   94.88
                                          P-F 5   81.20    80.55    94.59      62.02    68.65    90.72      72.63     71.42   92.49      73.21    79.76    95.18
     maestrale-chat-v0.4-alpha-sft
                                            I     16.11     34.10   73.41      12.34     24.07   69.21       7.91     28.05   70.58       2.52     32.78   73.04
                                          I-F 1   66.41     74.91   92.45      47.17     62.46   87.87      68.85     69.79   91.52      50.12     75.70   94.13
                                          I-F 5   78.44     77.93   93.85      59.44     67.17   90.14      71.50     70.67   92.14      71.27     77.23   94.60
                                            P      8.38     20.59   68.40       8.91     20.43   67.95       8.35     19.02   67.60       0.77     12.62   64.06
           Meta-Llama-3-8B                P-F 1   70.20     72.06   92.15      26.07     48.25   80.63      67.09     66.66   90.67      70.29     73.23   93.71
                                          P-F 5   73.43     74.69   92.95      56.77     64.59   89.37      67.27     67.61   91.11      73.73     77.71   94.71
                                            P     27.10     57.71   85.67      20.83     48.00   81.40      34.70     60.52   86.87       2.60     54.93   85.40
                                          P-F 1   69.96     74.04   92.17      22.95     41.62   75.98      57.83     65.96   85.58      65.54     74.66   94.09
                                          P-F 5   75.09     75.86   93.29      59.34     66.51   89.89      69.40     71.03   92.02      64.27     74.97   94.05
       Meta-Llama-3-8B-Instruct
                                            I     27.30     46.34   87.41      17.68     29.85   70.09      14.68     35.41   71.00       2.97     36.10   68.84
                                          I-F 1   39.36     68.02   88.52      32.99     51.59   80.93      29.55     57.44   83.34       4.05     61.24   86.41
                                          I-F 5   76.67     77.67   93.89      61.79     67.93   90.33      70.09    72.80    92.50      31.83     78.24   94.61
                                            P      5.26     14.56   64.85       6.19     15.35   64.39       7.18     17.54   66.57       0.67      8.93   62.02
         Minerva-3B-base-v1.0             P-F 1   24.75     38.08   81.24      15.42     31.38   76.28      35.85     42.49   83.13      26.74     38.71   85.39
                                          P-F 5   27.42     35.87   80.43      30.94     40.03   81.48      67.27     67.61   83.40      35.45     41.20   86.05
                                            P     17.93     45.89   81.26      15.32     36.77   77.20      26.47     51.89   85.01       3.62     54.89   87.08
                                          P-F 1   62.63     67.49   89.74      46.24     55.33   86.50      57.02     61.54   85.34      56.91     65.59   91.97
                                          P-F 5   69.99     70.81   91.91      54.02     61.06   88.43      66.22     63.98   90.51      60.84     68.44   92.63
           zefiro-7b-dpo-ITA
                                            I      4.95     15.47   66.80       5.47     14.85   65.80       6.04     16.51   66.77       1.40     43.83   65.65
                                          I-F 1   47.00     62.58   86.61      18.34     37.69   75.45      49.06     59.85   83.95       5.12     51.55   84.52
                                          I-F 5   61.73     68.53   89.21      59.44     67.17   86.33      55.84     64.23   87.26       5.70     58.93   87.96
                                            P     14.41     43.85   79.53      14.00     38.01   76.92      20.49     52.95   83.29       1.40     43.83   80.01
                                          P-F 1   69.27     73.89   92.13      22.31     40.91   75.49      57.96     66.05   85.38      67.20     74.25   94.00
                                          P-F 5   73.31     75.04   93.08      59.53     66.61   89.95      69.32     70.60   91.93      65.09     74.98   94.07
      LLaMA3-BILINGUAL (Ours)
                                            I     27.77     48.26   76.39      19.12     32.17   70.85      15.90     37.02   71.55       2.74     35.59   68.78
                                          I-F 1   40.94     69.83   89.47      34.58     54.21   82.18      37.44     62.63   86.22       6.78     68.31   90.47
                                          I-F 5   76.35     77.70   93.89      61.68     68.25   90.48      71.01     72.55   92.40      38.00     78.90   94.83
                                            P     12.60     38.93   77.42      13.08     35.94   75.97      17.48     49.55   81.90       1.22     39.87   78.14
                                          P-F 1   68.11     73.95   92.28      22.34     40.98   75.53      58.79     67.01   85.64      67.05     74.22   93.98
                                          P-F 5   73.05     75.14   93.07      59.40     66.68   89.96      69.87     70.98   92.02      67.14     75.68   94.26
       LLaMA3-ITA-ONLY (Ours)
                                            I     26.77     48.26   76.15      17.97     30.46   70.25      15.82     36.76   71.42       2.72     35.58   68.78
                                          I-F 1   45.48     71.08   89.89      37.10     55.43   82.88      43.47     64.79   87.24       7.45     68.99   90.73
                                          I-F 5   76.54     77.74   93.88      61.49     68.09   90.39      71.05     72.36   92.37      43.92     78.88   94.93

Table 2
Results for the CE-NI setting. For the few-shots formats, the number of given shots is also provided next to the format name.
The best result for each dataset and for each metric is in bold


generate equal to 64. This limit was set for computational                             shots. Thus, the number of shots for all settings using a
requirements and the value was chosen after studying                                   few-shot strategy was set to either 1 or 5.
the datasets to assess the number of tokens required for                                  We report the results of the OE setting in Table 1 and
each answer. There was no combination of tokenizer and                                 of the CE-NI setting in Table 2 and comment them in the
dataset which had a 95% percentile greater than 50 for                                 following section.
token count, therefore we can safely set the previously
defined boundary. We also set torch.bfloat16 and use                                   3.1. Hardware and Software
flash-attention-2 [20] to speed up the generation process.
Inference was always done with batch size set to 1 to
                                                                                            Configuration
maximize the quality of the generated text.                                            Our experimental setup consisted of a multi-node cluster
   Furthermore, we consider changing the number of                                     provided by Fastweb SpA and equipped with Nvidia H100
few-shots that are given in the prompt. Our assump-                                    GPUs for distributed training and evaluation. We used
tion is that the models may learn to follow the patterns                               a suite of open-source libraries, including Transformers
given in the examples, and therefore the Italian language                              from Hugging Face [21], which provides seamless inte-
generation may become more likely thanks to the addi-                                  gration with PyTorch [22] and DeepSpeed [23], as well
tional information conveyed in the prompt. We aim to
mitigate this potential bias by decreasing the number of
as Unsloth8 and TRL [24]. This software stack has been       4. Conclusions and Future Works
instrumental in efficiently handling large data sets and
complex models.                                              We have carried out a study on the effectiveness of eval-
   This configuration allowed for parallelization of com-    uation of Italian-adapted LLMs on closed-ended tasks,
putations, significantly reducing training and evaluation    multiple-choice question answering tasks specifically.
time. DeepSpeed optimized memory usage and commu-            We have experimented with two settings: an open-ended
nication between nodes, allowing us to effortlessly scale    one and a closed-ended one without option identifiers.
evaluation processes across multiple model architectures.    The results show better performance for the latter. Fur-
   The hardware-software combination ensured efficient,      thermore, they also show that, with respect to the Open
cost-effective, and reproducible experiments, which are      Italian LLM Leaderboard, there are significant differences
critical for comparing multiple models and training new      regarding model performance. We can conclude that
ones efficiently.                                            the evaluation of Italian-adapted models should follow
                                                             a more rigorous procedure which does not mainly rely
                                                             on closed-ended tasks. We release the code that was used
3.2. Findings and Additional Tests                           on GitHub9 . In the future, we plan to further work on
Analyzing the results, it is clear that the OE strategy did the topic and attempt to define best practices for the
not yield very satisfactory results for BLEU and ROUGE- evaluation of these models.
L. We associate this with the difficulty of generating a re-
sponse matching exactly the ground truth when the text
that can be generated is not constrained in any way. To Acknowledgments
further support this point, we can see that the BertScore
                                                             We acknowledge the support of the PNRR project FAIR -
of some experiments yields good results, hinting that the
                                                             Future AI Research (PE00000013), Spoke 6 - Symbiotic AI
semantics of the content that has been generated is simi-
                                                             (CUP H97G22000210007) under the NRRP MUR program
lar to that of the ground truth.
                                                             funded by the NextGenerationEU.
   Regarding the CE-NI strategy, the obtained results are
much better for all metrics. Therefore providing the op-
tions in the input prompt greatly helped the model in lim- References
iting its generation to follow the provided options. Sur-
prisingly, with respect to the Italian leaderboard where       [1] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu,
fine-tuned versions of the LLaMA 3 family were shown to            H. Chen, X. Yi, C. Wang, Y. Wang, et al., A survey on
have much better results, here the results are in line with        evaluation of large language models, ACM Trans-
the base models (or even worse in some cases). Further-            actions on Intelligent Systems and Technology 15
more, one of the best-performing models is maestrale-              (2024) 1–45.
chat-v0.4-alpha-sft, which consistently outperforms the        [2] C. Fourrier, N. Habib, A. Lozovskaya,
LLaMA 3 models in most cases.                                      K. Szafer, T. Wolf, Open llm leader-
   For both settings the obtained results show that pro-           board       v2,        https://huggingface.co/spaces/
viding input-output examples in the prompt greatly en-             open-llm-leaderboard/open_llm_leaderboard,
hances the results for all settings.                               2024.
   For both settings, primarily Instruct models were used. [3] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka-
Upon analyzing the generated results, we observed in-              plan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas-
stances where the model provided the correct result but            try, A. Askell, et al., Language models are few-shot
appended an additional substring (e.g., the model began            learners, Advances in neural information process-
explaining the reasoning behind its response). To assess           ing systems 33 (2020) 1877–1901.
if this might have affected the result, we performed an        [4] K. Wróbel, SpeakLeash Team, Cyfronet Team, Open
additional test where we checked if the ground truth               pl llm leaderboard, https://huggingface.co/spaces/
string was a substring of the generated output (after re-          speakleash/open_pl_llm_leaderboard, 2024.
moving punctuation and trailing whitespaces as well as         [5] C. Park, H. Kim, D. Kim, S. Cho, S. Kim, S. Lee,
lowercasing the two strings). We report the complete               Y. Kim, H. Lee, Open ko-llm leaderboard: Evalu-
results in Appendix C. Overall, some models show an                ating large language models in korean with ko-h5
improvement in performance, but the results still do not           benchmark, in: ACL Main, 2024.
beat maestrale-chat-v0.4-alpha-sft.                            [6] L. Gao, J. Tow, S. Biderman, S. Black, A. DiPofi,
   We provide some generation examples in Appendix B.              C. Foster, L. Golding, J. Hsu, K. McDonell, N. Muen-
                                                                   nighoff, J. Phang, L. Reynolds, E. Tang, A. Thite,

8                                                            9
    https://github.com/unslothai/unsloth                         https://github.com/swapUniba/Closed-ITA-LLM-Evaluation
     B. Wang, K. Wang, A. Zou, A framework for few-              //aclanthology.org/2020.emnlp-main.438. doi:10.
     shot language model evaluation, 2021. URL: https:           18653/v1/2020.emnlp-main.438.
     //doi.org/10.5281/zenodo.5371628. doi:10.5281/         [15] P. Molino, P. Lops, G. Semeraro, M. de Gem-
     zenodo.5371628.                                             mis, P. Basile,       Playing with knowledge: A
 [7] J. Ahn, R. Verma, R. Lou, D. Liu, R. Zhang, W. Yin,         virtual player for “who wants to be a million-
     Large language models for mathematical reasoning:           aire?” that leverages question answering tech-
     Progresses and challenges, in: Proceedings of the           niques, Artificial Intelligence 222 (2015) 157–
     18th Conference of the European Chapter of the              181. URL: https://www.sciencedirect.com/science/
     Association for Computational Linguistics: Student          article/pii/S0004370215000259. doi:https://doi.
     Research Workshop, 2024, pp. 225–237.                       org/10.1016/j.artint.2015.02.003.
 [8] K. Sun, Y. Xu, H. Zha, Y. Liu, X. L. Dong, Head-to-    [16] V. Lai, C. Nguyen, N. Ngo, T. Nguyen, F. Dernon-
     tail: How knowledgeable are large language models           court, R. Rossi, T. Nguyen, Okapi: Instruction-tuned
     (llms)? aka will llms replace knowledge graphs?, in:        large language models in multiple languages with
     Proceedings of the 2024 Conference of the North             reinforcement learning from human feedback, in:
     American Chapter of the Association for Computa-            Proceedings of the 2023 Conference on Empirical
     tional Linguistics: Human Language Technologies             Methods in Natural Language Processing: System
     (Volume 1: Long Papers), 2024, pp. 311–325.                 Demonstrations, 2023, pp. 318–327.
 [9] X. Wang, B. Ma, C. Hu, L. Weber-Genzel, P. Röttger,    [17] P. Basile, E. Musacchio, M. Polignano, L. Siciliani,
     F. Kreuter, D. Hovy, B. Plank, "my answer                   G. Fiameni, G. Semeraro, Llamantino: Llama 2 mod-
     is c": First-token probabilities do not match               els for effective text generation in italian language,
     text answers in instruction-tuned language mod-             arXiv preprint arXiv:2312.09993 (2023).
     els, 2024. URL: https://arxiv.org/abs/2402.14499.      [18] M. Polignano, P. Basile, G. Semeraro,              Ad-
     arXiv:2402.14499.                                           vanced natural-based interaction for the italian
[10] A. Bacciu, C. Campagnano, G. Trappolini, F. Sil-            language: Llamantino-3-anita, arXiv preprint
     vestri, DanteLLM: Let’s push Italian LLM research           arXiv:2405.07101 (2024).
     forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste,       [19] L. Tunstall, E. Beeching, N. Lambert, N. Rajani,
     A. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of           K. Rasul, Y. Belkada, S. Huang, L. von Werra,
     the 2024 Joint International Conference on Com-             C. Fourrier, N. Habib, N. Sarrazin, O. Sanseviero,
     putational Linguistics, Language Resources and              A. M. Rush, T. Wolf, Zephyr: Direct distillation of
     Evaluation (LREC-COLING 2024), ELRA and ICCL,               lm alignment, 2023. arXiv:2310.16944.
     Torino, Italia, 2024, pp. 4343–4355. URL: https:       [20] T. Dao, FlashAttention-2: Faster attention with
     //aclanthology.org/2024.lrec-main.388.                      better parallelism and work partitioning, in: Inter-
[11] F. Mercorio, M. Mezzanzanica, D. Potertì, A. Serino,        national Conference on Learning Representations
     A. Seveso, Disce aut deficere: Evaluating llms              (ICLR), 2024.
     proficiency on the invalsi italian benchmark,          [21] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. De-
     2024. URL: https://arxiv.org/abs/2406.17535.                langue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Fun-
     arXiv:2406.17535.                                           towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
[12] P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sab-           Y. Jernite, J. Plu, C. Xu, T. L. Scao, S. Gugger,
     harwal, C. Schoenick, O. Tafjord, Think you have            M. Drame, Q. Lhoest, A. M. Rush, Transformers:
     solved question answering? try arc, the ai2 reason-         State-of-the-art natural language processing, in:
     ing challenge, arXiv:1803.05457v1 (2018).                   Proceedings of the 2020 Conference on Empirical
[13] D. Hendrycks, C. Burns, S. Basart, A. Zou,                  Methods in Natural Language Processing: System
     M. Mazeika, D. Song, J. Steinhardt, Measuring mas-          Demonstrations, Association for Computational
     sive multitask language understanding, Proceed-             Linguistics, Online, 2020, pp. 38–45. URL: https://
     ings of the International Conference on Learning            www.aclweb.org/anthology/2020.emnlp-demos.6.
     Representations (ICLR) (2021).                         [22] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain,
[14] M. Hardalov, T. Mihaylov, D. Zlatkova, Y. Dinkov,           M. Voznesensky, B. Bao, P. Bell, D. Berard,
     I. Koychev, P. Nakov,           EXAMS: A multi-             E. Burovski, G. Chauhan, A. Chourdia, W. Consta-
     subject high school examinations dataset for                ble, A. Desmaison, Z. DeVito, E. Ellison, W. Feng,
     cross-lingual and multilingual question answer-             J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalam-
     ing, in: B. Webber, T. Cohn, Y. He, Y. Liu                  barkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang,
     (Eds.), Proceedings of the 2020 Conference on               J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch,
     Empirical Methods in Natural Language Process-              M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo,
     ing (EMNLP), Association for Computational Lin-             P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang,
     guistics, Online, 2020, pp. 5427–5444. URL: https:          X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan,
     P. Wu, S. Chintala, Pytorch 2: Faster machine
     learning through dynamic python bytecode trans-
     formation and graph compilation, in: 29th ACM
     International Conference on Architectural Support
     for Programming Languages and Operating Sys-
     tems, Volume 2 (ASPLOS ’24), ACM, 2024. URL:
     https://pytorch.org/assets/pytorch2-2.pdf. doi:10.
     1145/3620665.3640366.
[23] C. Li, Z. Yao, X. Wu, M. Zhang, C. Holmes, C. Li,
     Y. He, Deepspeed data efficiency: Improving deep
     learning model quality and training efficiency via ef-
     ficient data sampling and routing, 2024. URL: https:
     //arxiv.org/abs/2212.03597. arXiv:2212.03597.
[24] L. von Werra, Y. Belkada, L. Tunstall, E. Beech-
     ing, T. Thrush, N. Lambert, S. Huang, Trl: Trans-
     former reinforcement learning, https://github.com/
     huggingface/trl, 2020.
Appendix
A. Prompt Formats
All showcased examples in this section are obtained from Meta-Llama-3-8B-Instruct model.

Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.
Il freddo si sposta dalla sua mano al cubetto di ghiaccio.
Il calore si sposta dal cubetto di ghiaccio alla sua mano.
Il freddo si sposta dal cubetto di ghiaccio alla sua mano.
Risposta:

Example 1: Prompt in the P-F format for the OE setting


Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si
riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre
sessualmente e asessualmente? Opzioni:
Consente alle piante di crescere più in alto.
Produce fiori che attraggono gli insetti.
Produce more che hanno un sapore migliore.
Permette alle piante di more di adattarsi a nuove condizioni.
Risposta: Permette alle piante di more di adattarsi a nuove condizioni.

Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.
Il freddo si sposta dalla sua mano al cubetto di ghiaccio.
Il calore si sposta dal cubetto di ghiaccio alla sua mano.
Il freddo si sposta dal cubetto di ghiaccio alla sua mano.
Risposta:

Example 2: Prompt in the P-F 1 format for the OE setting


<|start_header_id|>user<|end_header_id|>

Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.
Il freddo si sposta dalla sua mano al cubetto di ghiaccio.
Il calore si sposta dal cubetto di ghiaccio alla sua mano.
Il freddo si sposta dal cubetto di ghiaccio alla sua mano.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Example 3: Prompt in the I-F format using LLaMA 3 chat template
<|begin_of_text|><|start_header_id|>user<|end_header_id|>

Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si
riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre
sessualmente e asessualmente? Opzioni:
Consente alle piante di crescere più in alto.
Produce fiori che attraggono gli insetti.
Produce more che hanno un sapore migliore.
Permette alle piante di more di adattarsi a nuove condizioni.<|eot_id|><|start_header_id|>assistant<|end_header_id|>

Permette alle piante di more di adattarsi a nuove condizioni.<|eot_id|><|start_header_id|>user<|end_header_id|>

Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.
Il freddo si sposta dalla sua mano al cubetto di ghiaccio.
Il calore si sposta dal cubetto di ghiaccio alla sua mano.
Il freddo si sposta dal cubetto di ghiaccio alla sua mano.<|eot_id|><|start_header_id|>assistant<|end_header_id|>


Example 4: Prompt in the I-F 1 format using LLaMA 3 chat template


B. Zero-shot Response Examples
All showcased examples in this section are obtained from Meta-Llama-3-8B-Instruct model.

Una sorgente sonora di frequenza f 0, si muove con velocità costante lungo una circonferenza. Nel centro della circonferenza si
trova il ricevitore del suono. Quale asserzione è esatta per la frequenza f registrata dal ricevitore? Risposta:

Ground truth: f = f 0
Generated Answer: La frequenza f registrata dal ricevitore è costante e uguale a f 0

Example 5: Generated answer with additional text for the OE setting


Il periodo di rotazione di un satellite artificiale intorno ad un pianeta è T . La distanza tra il satellite ed il centro del pianeta è r .
A che distanza dal centro del pianeta ruota un altro satellite se il suo periodo di rotazione è T / 8? Opzioni:
8r
r/8
4r
r/4
Risposta:

Ground truth: r/4
Generated Answer: r/8 Spiegazione: Se il periodo di rotazione del satellite è T / 8, allora la sua distanza dal centro del pianeta
è r / 8. Infatti, se il satellite ruota in 8T, allora la sua distanza dal centro del pianeta è

Example 6: Generated answer with additional text for the CE-NI setting


C. Substring Matching Results
                      Model                            Format       ARC_IT        MMLU_IT         EXAMS         WBMM
                                                          P          0.00           0.26            0.20         45.47
                                                        P-F 1        3.94           4.50            5.84         35.96
                                                        P-F 5        5.73           5.00            5.84         36.78
             Italia-9B-Instruct-v0.1
                                                          I          4.96           5.73            7.53         41.07
                                                        I-F 1        4.53           5.86            7.72         41.38
                                                        I-F 5        4.96           5.59            6.73         36.78
                                                          P          6.07           5.91            7.13         32.69
                                                        P-F 1        5.39           5.76            5.84         32.89
                                                        P-F 5        5.82           5.88            7.03         32.12
 LLaMAntino-2-chat-13b-hf-UltraChat-ITA
                                                          I          5.48           5.08            7.62         33.91
                                                        I-F 1        5.90           6.28            7.23         34.48
                                                        I-F 5        6.33           6.41            7.62         32.12
                                                          P          7.44           7.55            10.0         36.62
                                                        P-F 1        7.10           6.58            8.42         34.02
                                                        P-F 5        7.36           7.32            8.91         31.36
  LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
                                                          I          4.96           5.89            7.82         36.42
                                                        I-F 1        6.50           6.91            8.32         35.60
                                                        I-F 5        6.07           6.66            6.63         30.90
                                                          P          7.02           7.49           10.69         45.47
                                                        P-F 1        8.30           8.39           11.68         47.16
                                                        P-F 5        8.13           8.53           11.58         45.01
         maestrale-chat-v0.4-alpha-sft
                                                          I          5.90           7.56           10.69         46.65
                                                        I-F 1        7.19           8.00           10.59         46.29
                                                        I-F 5        8.04           8.60            9.60         44.55
                                                          P          5.48           6.95            9.11         37.85
                Meta-Llama-3-8B                         P-F 1        6.67           7.14            9.70         39.03
                                                        P-F 5        5.73           7.35            9.70          40.0
                                                          P          7.96           7.65            10.0         38.26
                                                        P-F 1        6.67           7.44            7.92         36.78
                                                        P-F 5        6.76           7.54            10.0         35.35
           Meta-Llama-3-8B-Instruct
                                                          I          3.85           5.32            7.43         38.16
                                                        I-F 1        6.16           6.07            9.80         40.56
                                                        I-F 5        7.36           7.41            8.81         36.88
                                                          P          2.57           3.48            4.46         30.49
             Minerva-3B-base-v1.0                       P-F 1        2.31           3.86            5.05         28.59
                                                        P-F 5        3.34           2.74            4.36         30.54
                                                          P          5.39           6.20            2.18         29.67
                                                        P-F 1        4.71           5.69            7.03         31.00
                                                        P-F 5        4.96           6.56            8.42         31.56
                zefiro-7b-dpo-ITA
                                                          I          3.84           5.97            6.24         32.33
                                                        I-F 1        5.82           4.98            6.83         28.54
                                                        I-F 5        5.56           6.54            7.43         29.97
                                                          P          7.96           7.76           10.79         38.57
                                                        P-F 1        6.84           7.54            8.12         36.68
                                                        P-F 5        6.33           7.60            9.31         35.19
         LLaMA3-BILINGUAL (Ours)
                                                          I          3.85           5.47            7.82         38.47
                                                        I-F 1        5.99           6.68            9.51         39.59
                                                        I-F 5        7.36           7.50            8.22         36.57
                                                          P          7.36           7.92           10.69         39.03
                                                        P-F 1        7.02           7.57            8.02         36.78
                                                        P-F 5        6.67           7.63            9.60         36.11
           LLaMA3-ITA-ONLY (Ours)
                                                          I          3.94           5.48            7.82         38.21
                                                        I-F 1        6.59           6.66            10.0         39.23
                                                        I-F 5        7.36           7.59            7.62         36.47

Table
Sub-string matching results for the OE setting. For the few-shots formats, the number of given shots is also provided next to
the format name. The best result for each dataset is in bold
                      Model                            Format       ARC_IT        MMLU_IT         EXAMS         WBMM
                                                          P           0.00          0.38            0.30         73.56
                                                        P-F 1        39.86         33.19           37.53         52.43
                                                        P-F 5        44.74         36.03           40.10         56.62
             Italia-9B-Instruct-v0.1
                                                          I          29.77         29.59           26.73         55.91
                                                        I-F 1        26.78         31.08           29.01         55.86
                                                        I-F 5        32.59         31.42           32.77         56.62
                                                          P          43.54         30.08           40.89         58.16
                                                        P-F 1        49.10         38.17           44.65         66.19
                                                        P-F 5        50.90         40.23           45.45         67.32
 LLaMAntino-2-chat-13b-hf-UltraChat-ITA
                                                          I          41.66         26.29           34.75         60.56
                                                        I-F 1        44.23         33.16           38.12         57.95
                                                        I-F 5        48.08         39.50           36.83         62.92
                                                          P          55.86         43.84           52.48         70.44
                                                        P-F 1        60.57         45.34           48.32         72.38
                                                        P-F 5        62.45         46.82           51.49         69.82
  LLaMAntino-3-ANITA-8B-Inst-DPO-ITA
                                                          I          61.85         44.93           54.46         75.91
                                                        I-F 1        62.19         43.75           49.51         74.06
                                                        I-F 5        61.42         45.11           52.87         75.14
                                                          P          69.38         50.18           58.71         73.56
                                                        P-F 1        71.43         54.52           58.22         76.88
                                                        P-F 5        73.31         55.85           58.02         78.21
         maestrale-chat-v0.4-alpha-sft
                                                          I          46.88         29.83           40.30         60.36
                                                        I-F 1        69.63         52.22           56.54         74.58
                                                        I-F 5        70.15         54.30           56.73         75.40
                                                          P          57.57         46.30           56.54         75.09
                Meta-Llama-3-8B                         P-F 1        63.13         46.88           51.58         71.20
                                                        P-F 5        66.47         50.49           53.37         75.96
                                                          P          59.54         44.26           53.07         68.85
                                                        P-F 1        66.30         50.13           51.18         72.79
                                                        P-F 5        68.69         52.42           57.43         72.79
           Meta-Llama-3-8B-Instruct
                                                          I          57.83         36.04           48.61         74.89
                                                        I-F 1        69.29         48.14           54.46         75.40
                                                        I-F 5        70.83         54.17           60.10         77.75
                                                          P          47.48         43.71           59.90         73.86
             Minerva-3B-base-v1.0                       P-F 1        25.66         28.51           23.86         33.25
                                                        P-F 5        20.10         23.09           22.87         34.94
                                                          P          48.76         39.18           41.58         60.67
                                                        P-F 1        55.00         40.37           46.04         62.56
                                                        P-F 5        60.31         45.34           48.42         64.86
                zefiro-7b-dpo-ITA
                                                          I          31.48         31.50           40.40         72.69
                                                        I-F 1        50.98         46.11           45.15         66.55
                                                        I-F 5        58.26         47.16           50.20         64.55
                                                          P          59.71         44.50           54.16         69.92
                                                        P-F 1        66.04         49.70           50.89         72.53
                                                        P-F 5        67.58         52.29           56.54         72.84
         LLaMA3-BILINGUAL (Ours)
                                                          I          60.65         38.61           50.20         75.35
                                                        I-F 1        69.63         50.00           56.14         75.04
                                                        I-F 5        70.49         54.51           60.10         77.90
                                                          P          60.57         45.16           54.26         70.49
                                                        P-F 1        66.21         49.79           51.98         72.43
                                                        P-F 5        67.67         52.38           57.23         73.71
           LLaMA3-ITA-ONLY (Ours)
                                                          I          59.88         37.08           50.40         75.40
                                                        I-F 1        69.21         50.19           56.63         74.94
                                                        I-F 5        70.40         54.28           59.41         77.65

Table
Sub-string matching results for the CE-NI setting. For the few-shots formats, the number of given shots is also provided next
to the format name. The best result for each dataset is in bold