<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A study on the soundness of closed-ended evaluation of Large Language Models adapted to the Italian language</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Elio Musacchio</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Edoardo Michielon</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Marco Pasqualini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Asia Beatrice Uboldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giovanni Semeraro</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Fastweb SpA</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>National PhD in Artificial Intelligence, University of Pisa</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <volume>1</volume>
      <fpage>225</fpage>
      <lpage>237</lpage>
      <abstract>
        <p>With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open-weight architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the solutions proposed so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance on each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evidently not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Natural Language Processing</kwd>
        <kwd>Evaluation</kwd>
        <kwd>Benchmark</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Large Language Models (LLMs) are models based on the Transformer architecture capable of solving a wide variety of Natural Language Generation (NLG) tasks, even tasks not encountered during training, thanks to their extensive pre-training and large number of parameters. Owing to these remarkable skills, interest in LLMs is now at its peak, resulting in a proliferation of open-weight models (e.g. LLaMA, Mistral, and many others). Among the several challenges related to the development of LLMs, one of the most critical is their evaluation [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One approach to tackle this issue has been to build benchmarks that collect different datasets, with the aim of obtaining a more comprehensive evaluation of a model's overall capabilities. Currently, there is a leaderboard [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] which keeps track of the capabilities of openly available LLMs.
      </p>
      <p>Specifically, the models are tested on six tasks that span different abilities a language model should have, e.g. reasoning or text completion. Regarding their reasoning abilities, the models are tested by solving closed-ended tasks. Specifically, multiple-choice question answering tasks are provided, where a question is given together with a list of possible alternatives, each associated with an identifier (a letter, a number, and so on). Intuitively, since the model has also been pre-trained on closed-ended question-answering data, it should be able to generalize and recognize the correct choice among the available ones. Furthermore, rather than generating the output directly, the probabilities learned by the model are studied, using log-likelihood to assess which option is more likely to be correct. For the English language, this evaluation methodology has been a standard approach to assess the capabilities of LLMs. However, when adapting a model to a new language, this methodology may not be as sound, due to the low amount of non-English data used to pre-train such models. Since the model only has to generate the correct option identifier, the evaluation does not really test its ability to generate high-quality text in another language. The goal of this work is to understand whether a new evaluation setting applied to language-adapted LLMs may give more insight than the traditional approach. Therefore, our contributions are the following:
• We test two evaluation settings for language-adapted LLMs, changing the structure of closed-ended question answering tasks;
• We evaluate the performance of state-of-the-art models on these settings;
• We study the sensitivity that the models have for the input prompt.</p>
    </sec>
    <sec id="sec-1-2">
      <title>2. Related Works</title>
      <p>
        Language Model evaluation has been a research focus ever since the first Decoder-only models, which were designed for natural language generation. One of the most remarkable skills of LLMs regarding reasoning has been in-context learning. In particular, few-shot learning has been increasingly used: the idea is that providing input-output examples in the model prompt should positively affect the generation process [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>
        There are multiple leaderboards which evaluate open LLMs on non-English languages, e.g. the Open PL LLM Leaderboard [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] for Polish or the Open KO LLM Leaderboard [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] for Korean. These leaderboards are often based on the lm-evaluation-harness framework [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which has been a milestone in the evaluation of LLMs. LLM evaluation can also depend on the topic at hand: some works focus on mathematical reasoning [7] as well as factuality [8].
      </p>
      <p>These evaluation settings often rely on closed-ended tasks, specifically multiple-choice question answering. The idea is to calculate the log-likelihood of the next token to generate for the option identifiers. However, this may not be the best setting to evaluate LLMs. Wang et al. [9] studied this on Instruction-tuned LLMs by training a classifier to predict which option to associate with the generated answer. This was done to look past additional text generated by the model (e.g. the generated text could be "The answer is B." as opposed to the simple "B." token). They found that the log-likelihood and generated-text decisions were often not matching.</p>
      <p>Regarding Italian evaluation, some works have approached this challenge. Bacciu et al. [10] released another version of the Open Italian LLM Leaderboard, considering a different variety of tasks. Mercorio et al. [11] released a benchmark based on questions that can be found in the INVALSI test, an Italian educational test, to further probe the knowledge and reasoning abilities of these models on a dataset that is natively in Italian rather than obtained through machine translation. The latter is one of the main problems when evaluating these models: due to the lack of resources with respect to the English language, the datasets used at the state of the art are translated using machine translation models. Still, all this effort made to evaluate Italian-adapted LLMs mainly relies on closed-ended tasks.</p>
    </sec>
    <sec id="sec-2">
      <title>3. Experiments</title>
      <p>We study pre-trained and language-adapted models to test their capabilities in the resolution of Italian language tasks. Specifically, we want to modify the typical formatting used in multiple-choice question answering to study whether the models are capable of correctly following and generating Italian text. Usually, the format shown in Listing 1 is used, where &lt;QUESTION&gt; is the question the model has to answer, while &lt;IDENTIFIER_i&gt; and &lt;OPTION_i&gt; are, respectively, the option identifier (usually a letter or a number) and the text of the possible answer to the previously provided question. &lt;CORRECT_IDENTIFIER&gt; is the identifier of the option that is the correct answer to the question.</p>
      <p>&lt;QUESTION&gt;:
&lt;IDENTIFIER_1&gt; &lt;OPTION_1&gt;
&lt;IDENTIFIER_2&gt; &lt;OPTION_2&gt;
...
&lt;IDENTIFIER_N&gt; &lt;OPTION_N&gt;
&lt;CORRECT_IDENTIFIER&gt;
Listing 1: closed-ended format</p>
      <p>We aim to modify the task so that the model has to generate the text of the correct option instead of the identifier. To do so, we consider two main evaluation settings:
• Open-ended (OE): we remove the available options and only supply the question in the prompt;
• Closed-ended no identifiers (CE-NI): we format the options without an identifier; the model has to write the text of the correct option.</p>
      <p>In particular, for the CE-NI setting, we apply the format shown in Listing 2, where &lt;CORRECT_OPTION&gt; is the text of the option that represents the correct answer to the question.</p>
      <p>&lt;QUESTION&gt;:
&lt;OPTION_1&gt;
&lt;OPTION_2&gt;
...
&lt;OPTION_N&gt;
&lt;CORRECT_OPTION&gt;
Listing 2: closed-ended no identifiers format</p>
      <p>&lt;CORRECT_IDENTIFIER&gt; and &lt;CORRECT_OPTION&gt; are the outputs that we expect the evaluated model to generate. We provide complete examples of the prompt formats in Appendix A.</p>
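      <p>As an illustration of the three formats, the sketch below renders one instance into the closed-ended (Listing 1), OE and CE-NI prompts. The helper names and the sample question are hypothetical, not the paper's actual task code.</p>

```python
def closed_ended(question, options, identifiers):
    # Listing 1: each option is prefixed with its identifier;
    # the model is expected to emit the correct identifier.
    lines = [f"{question}:"] + [f"{i} {o}" for i, o in zip(identifiers, options)]
    return "\n".join(lines) + "\n"

def open_ended(question):
    # OE: the options are removed and only the question is supplied.
    return f"{question}:\n"

def ce_ni(question, options):
    # CE-NI: options are listed without identifiers;
    # the model is expected to emit the text of the correct option.
    return f"{question}:\n" + "\n".join(options) + "\n"

q = "Qual e' la capitale d'Italia"
opts = ["Milano", "Roma", "Napoli"]
print(closed_ended(q, opts, ["A", "B", "C"]))
print(open_ended(q))
print(ce_ni(q, opts))
```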
      <p>Generally, models are also evaluated by calculating the log-likelihood rather than by generating text directly; the chosen option is then selected based on the highest value. We choose to perform a generative task instead, to check whether the models are capable of generating only the answer string, without additional text, and also to check whether they generate something outside of the provided options.</p>
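      <p>The log-likelihood selection described above can be sketched as follows. A toy scorer stands in for a real LM's token log-probabilities; in practice the score would come from the model's logits over the option identifiers.</p>

```python
import math

def select_by_loglikelihood(score_token, prompt, option_identifiers):
    # For each candidate identifier, ask the (stand-in) model for the
    # log-probability of emitting it as the continuation of the prompt,
    # then pick the highest-scoring one -- no text is actually generated.
    scores = {ident: score_token(prompt, ident) for ident in option_identifiers}
    return max(scores, key=scores.get)

# Stand-in scorer: a fixed distribution imitating a model that favors "B".
toy_logprobs = {"A": math.log(0.2), "B": math.log(0.6), "C": math.log(0.2)}
def toy_scorer(prompt, token):
    return toy_logprobs[token]

best = select_by_loglikelihood(toy_scorer, "...prompt...", ["A", "B", "C"])
print(best)  # "B"
```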
      <p>To evaluate this case, we use the BLEU, ROUGE-L and BERTScore F1 metrics, which are reference-based metrics used to evaluate the correspondence of a generated sentence with a reference one. BLEU and ROUGE-L focus on matching n-grams, while BERTScore leverages pre-trained BERT models to assess the semantic similarity between the words of the two texts.</p>
      <p>Furthermore, we consider four different prompt formats:
• Plain (P): there is no formatting; the text of the task is provided as it is in the prompt, and only a "Risposta:" string is added at the end;
• Plain few-shot (P-F): same as P, but multiple input-output examples are provided;
• Instruct (I): the chat template of the model is applied to the text of the task;
• Instruct few-shot (I-F): same as I, but multiple input-output examples are provided.</p>
      <p>For the few-shot formats, we consider two distinct numbers of examples to provide in the prompt: one-shot and five-shot. The intuition is that a language-adapted LLM should significantly improve performance even when provided with a single example.</p>
      <p>We consider these prompt formats because most evaluation settings for Italian LLMs are applied without the chat template. We argue that this choice may not be the best one when considering Instruct models that have been trained using a specific prompt format to continue a conversation. They should be evaluated using the same prompt format, since it is also the one that will be used in case of deployment.</p>
      <p>
        To set up the experimental protocol, we use the lm-evaluation-harness library [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], which provides an immediate and intuitive command line to automatically evaluate LLMs on previously defined as well as custom tasks. Specifically, we define custom tasks within the library following the previously defined evaluation settings. To do so, we consider the following datasets:
• ARC-Challenge [12]: multiple-choice science exam questions; the Challenge set consists of complex questions that were not correctly answered by either a retrieval or a co-occurrence method;
• MMLU [13]: multiple-choice questions from 57 different topics (e.g. mathematics, computer science, and so on), requiring problem-solving abilities and knowledge to answer correctly;
• EXAMS [14]: multiple-choice questions from high school exams. The dataset contains different subsets curated for different languages and optionally contains additional paragraphs regarding the question (extracted from Wikipedia);
• WWBM [15]: multiple-choice questions spanning a wide range of topics. The questions come from the Italian version of the "Who Wants to Be a Millionaire?" board game, where contestants answer progressively difficult questions. The question-answer instances are split into different categories depending on the difficulty of the question itself.
      </p>
      <p>Regarding the Italian version of these datasets, both EXAMS and WWBM natively provide splits in the Italian language. For ARC and MMLU, instead, we use the Italian version provided in the library for the okapi task released by Lai et al. [16], who performed automatic translation of the original datasets using GPT-3.5 Turbo for several languages. For all of these datasets, we define two custom tasks which apply the OE and CE-NI evaluation settings automatically. The examples used in the few-shot settings are taken from the validation splits of the datasets. For EXAMS, we use the train split as a test split (since a test split is not provided), while for WWBM, we remove the first five instances from the original dataset and use them as a validation split.</p>
      <p>Regarding the models, we experiment using the following:
• Italia-9B-Instruct-v0.1 (https://huggingface.co/iGeniusAI/Italia-9B-Instruct-v0.1): trained from scratch with a focus on the Italian language (90% of the data in Italian and the rest in English), with instruction tuning for conversational purposes;
• LLaMAntino-2-chat-13b-hf-UltraChat-ITA [17]: instruction tuning of LLaMAntino-2-chat-13b-hf-ITA (an Italian-adapted LLM) using a translated version of the UltraChat dataset;
• LLaMAntino-3-ANITA-8B-Inst-DPO-ITA [18]: fine-tuning, DPO and adaptation using a mixture of Italian and English datasets, starting from the LLaMA-3-8B-Instruct model;
• maestrale-chat-v0.4-alpha-sft (https://huggingface.co/mii-llm/maestrale-chat-v0.4-alpha-sft): instruction tuning for 2 epochs on a conversational dataset consisting of 1.7M instances, starting from an Italian-adapted version of Mistral-7b;
• Meta-Llama-3-8B and Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct): the latest version of the LLaMA family of models;
• Minerva-3B-base-v1.0 (https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0);
• zefiro-7b-dpo-ITA (https://huggingface.co/mii-community/zefiro-7b-dpo-ITA).</p>
      <p>We also train two new models. We start from the Meta-Llama-3-8B-Instruct checkpoint and fine-tune the model on OpenOrca and UltraChat, automatically translated to Italian using ChatGPT 3.5. We consider two different settings: one where 20,000 instances are kept for each language (Italian and English), and one where 40,000 instances are kept for the Italian language only. For instruction tuning, we used LoRA with rank equal to 16 and alpha equal to 16, targeting all linear layers of the model. The other hyperparameters are an effective batch size of 128, a learning rate of 2e-5, a weight decay of 0.01 and 5 warmup steps. In both cases, the instances used during training are chosen at random.</p>
      <p>For all experiments, we use the greedy-decoding generation strategy with a fixed maximum number of tokens to generate.</p>
      <p>[Table: Results for the OE setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset and for each metric is in bold.]</p>
      <p>All models are obtained from Hugging Face [21], which provides seamless integration with PyTorch [22] and DeepSpeed [23], as well as with Unsloth (https://github.com/unslothai/unsloth) and TRL [24]. This software stack has been instrumental in efficiently handling large datasets and complex models. This configuration allowed for parallelization of computations, significantly reducing training and evaluation time. DeepSpeed optimized memory usage and communication between nodes, allowing us to effortlessly scale evaluation processes across multiple model architectures. The hardware-software combination ensured efficient, cost-effective, and reproducible experiments, which are critical for comparing multiple models and training new ones efficiently.</p>
      <sec id="sec-2-29">
        <title>3.2. Findings and Additional Tests</title>
        <p>Analyzing the results, it is clear that the OE strategy did not yield very satisfactory results for BLEU and ROUGE-L. We associate this with the difficulty of generating a response that matches the ground truth exactly when the text being generated is not constrained in any way. To further support this point, we can see that the BERTScore of some experiments yields good results, hinting that the semantics of the generated content is similar to that of the ground truth.</p>
        <p>Regarding the CE-NI strategy, the obtained results are much better for all metrics: providing the options in the input prompt greatly helped the models in limiting their generation to the provided options. Surprisingly, with respect to the Italian leaderboard, where fine-tuned versions of the LLaMA 3 family were shown to have much better results, here the results are in line with those of the base models (or even worse in some cases). Furthermore, one of the best-performing models is maestrale-chat-v0.4-alpha-sft, which consistently outperforms the LLaMA 3 models in most cases.</p>
        <p>For both settings, the obtained results show that providing input-output examples in the prompt greatly enhances the results. However, ground-truth answers are given in the examples, and therefore Italian language generation may become more likely thanks to the additional information conveyed in the prompt; we aim to mitigate this potential bias by decreasing the number of provided examples. For both settings, primarily Instruct models were used.</p>
        <p>Upon analyzing the generated results, we observed instances where the model provided the correct result but appended an additional substring (e.g., the model began explaining the reasoning behind its response). To assess whether this might have affected the results, we performed an additional test in which we checked whether the ground-truth string was a substring of the generated output (after removing punctuation and trailing whitespace, as well as lowercasing the two strings). We report the complete results in Appendix C. Overall, some models show an improvement in performance, but the results still do not beat maestrale-chat-v0.4-alpha-sft. We provide some generation examples in Appendix B.</p>
      </sec>
      <sec id="sec-2-30">
        <title>4. Conclusions and Future Works</title>
        <p>We have carried out a study on the effectiveness of the evaluation of Italian-adapted LLMs on closed-ended tasks, specifically multiple-choice question answering. We have experimented with two settings: an open-ended one and a closed-ended one without option identifiers. The results show better performance for the latter. Furthermore, they also show that, with respect to the Open Italian LLM Leaderboard, there are significant differences regarding model performance. We can conclude that the evaluation of Italian-adapted models should follow a more rigorous procedure which does not rely mainly on closed-ended tasks. We release the code that was used at https://github.com/swapUniba/Closed-ITA-LLM-Evaluation. In the future, we plan to further work on the topic and attempt to define best practices for the evaluation of these models.</p>
      </sec>
      <sec id="sec-2-31">
        <title>Acknowledgments</title>
        <p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by NextGenerationEU.</p>
      </sec>
      <sec id="sec-2-26">
        <p>P. Wu, S. Chintala, PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation, in: 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2 (ASPLOS '24), ACM, 2024. URL: https://pytorch.org/assets/pytorch2-2.pdf. doi:10.1145/3620665.3640366.</p>
        <p>[23] C. Li, Z. Yao, X. Wu, M. Zhang, C. Holmes, C. Li, Y. He, DeepSpeed data efficiency: Improving deep learning model quality and training efficiency via efficient data sampling and routing, 2024. URL: https://arxiv.org/abs/2212.03597. arXiv:2212.03597.</p>
        <p>[24] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching, T. Thrush, N. Lambert, S. Huang, TRL: Transformer reinforcement learning, https://github.com/huggingface/trl, 2020.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>A. Prompt Formats</title>
      <p>All showcased examples in this section are obtained from the Meta-Llama-3-8B-Instruct model.
Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Risposta:
Example 1: Prompt in the P format for the OE setting
Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si
riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre
sessualmente e asessualmente? Opzioni:
Consente alle piante di crescere più in alto.</p>
      <p>Produce fiori che attraggono gli insetti.</p>
      <p>Produce more che hanno un sapore migliore.</p>
      <p>Permette alle piante di more di adattarsi a nuove condizioni.</p>
      <p>Risposta: Permette alle piante di more di adattarsi a nuove condizioni.</p>
      <p>Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Risposta:
Example 2: Prompt in the P-F 1 format for the OE setting
&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;
Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
Example 3: Prompt in the I format using LLaMA 3 chat template
Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si
riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre
sessualmente e asessualmente? Opzioni:
Consente alle piante di crescere più in alto.</p>
      <p>Produce fiori che attraggono gli insetti.</p>
      <p>Produce more che hanno un sapore migliore.</p>
      <p>Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt;
Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni:
Il calore si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il freddo si sposta dalla sua mano al cubetto di ghiaccio.</p>
      <p>Il calore si sposta dal cubetto di ghiaccio alla sua mano.</p>
      <p>Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt;
Example 4: Prompt in the I-F 1 format using LLaMA 3 chat template</p>
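      <p>The plain-format assembly shown in Examples 1 and 2 can be sketched as follows. The helper names and sample strings are hypothetical; the spacing follows the examples above, with the gold answer appended after "Risposta:" only for the in-context examples.</p>

```python
def p_format(question, options, answer=None):
    # Plain (P) format as in the appendix examples: question, "Opzioni:",
    # one option per line, then "Risposta:" (followed by the gold answer
    # only when the block is used as an in-context example).
    block = f"{question} Opzioni:\n" + "\n".join(options) + "\nRisposta:"
    return block + (f" {answer}" if answer else "")

def p_few_shot(examples, test_question, test_options):
    # P-F format: worked examples are prepended to the test instance.
    shots = [p_format(q, o, a) for q, o, a in examples]
    return "\n".join(shots + [p_format(test_question, test_options)])

prompt = p_few_shot(
    [("Domanda di esempio?", ["opzione 1", "opzione 2"], "opzione 2")],
    "Domanda di test?", ["opzione A", "opzione B"],
)
print(prompt)
```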
    </sec>
    <sec id="sec-4">
      <title>B. Zero-shot Response Examples</title>
      <p>All showcased examples in this section are obtained from the Meta-Llama-3-8B-Instruct model.
Una sorgente sonora di frequenza f 0, si muove con velocità costante lungo una circonferenza. Nel centro della circonferenza si
trova il ricevitore del suono. Quale asserzione è esatta per la frequenza f registrata dal ricevitore? Risposta:
Ground truth: f = f 0
Generated Answer: La frequenza f registrata dal ricevitore è costante e uguale a f 0
Example 5: Generated answer with additional text for the OE setting
Il periodo di rotazione di un satellite artificiale intorno ad un pianeta è T . La distanza tra il satellite ed il centro del pianeta è r .
A che distanza dal centro del pianeta ruota un altro satellite se il suo periodo di rotazione è T / 8? Opzioni:
8 r
r/8
4 r
r/4
Risposta:
Ground truth: r/4
Generated Answer: r/8 Spiegazione: Se il periodo di rotazione del satellite è T / 8, allora la sua distanza dal centro del pianeta
è r / 8. Infatti, se il satellite ruota in 8T, allora la sua distanza dal centro del pianeta è
Example 6: Generated answer with additional text for the CE-NI setting</p>
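      <p>A minimal sketch of the substring-matching check described in Section 3.2, which credits generations like Example 6 that append extra text after the answer. The normalization here also collapses internal whitespace, a slight extension of the paper's description, and the sample generations are constructed for illustration.</p>

```python
import string

def normalize(text):
    # Lowercase, drop punctuation and collapse whitespace.
    stripped = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(stripped.lower().split())

def substring_match(ground_truth, generated):
    # Credit an answer when the normalized gold string appears anywhere
    # in the normalized generation, even if extra text follows it.
    return normalize(ground_truth) in normalize(generated)

# Constructed generations illustrating the two outcomes:
print(substring_match("r/4", "r/4. Infatti, il periodo scala con r."))    # True
print(substring_match("r/4", "r/8 Spiegazione: il periodo scala con r."))  # False
```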
    </sec>
    <sec id="sec-5">
      <title>C. Substring Matching Results</title>
      <p>[Table: Substring matching results for Italia-9B-Instruct-v0.1, LLaMAntino-2-chat-13b-hf-UltraChat-ITA, LLaMAntino-3-ANITA-8B-Inst-DPO-ITA, maestrale-chat-v0.4-alpha-sft, Meta-Llama-3-8B, Meta-Llama-3-8B-Instruct, Minerva-3B-base-v1.0, zefiro-7b-dpo-ITA, LLaMA3-BILINGUAL (Ours) and LLaMA3-ITA-ONLY (Ours), under the P, P-F 1, P-F 5, I, I-F 1 and I-F 5 prompt formats.]</p>
    </sec>
  </body>
</article>