<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">A study on the soundness of closed-ended evaluation of Large Language Models adapted to the Italian language</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Elio</forename><surname>Musacchio</surname></persName>
							<email>elio.musacchio@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="department">National PhD in Artificial Intelligence</orgName>
								<orgName type="institution">University of Pisa</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Siciliani</surname></persName>
							<email>lucia.siciliani@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>pierpaolo.basile@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Edoardo</forename><surname>Michielon</surname></persName>
							<email>edoardo.michielon@consulenti.fastweb.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Fastweb SpA</orgName>
								<address>
									<settlement>Milan</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Marco</forename><surname>Pasqualini</surname></persName>
							<email>marco.pasqualini@consulenti.fastweb.it</email>
							<affiliation key="aff2">
								<orgName type="institution">Fastweb SpA</orgName>
								<address>
									<settlement>Milan</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Asia</forename><forename type="middle">Beatrice</forename><surname>Uboldi</surname></persName>
							<affiliation key="aff2">
								<orgName type="institution">Fastweb SpA</orgName>
								<address>
									<settlement>Milan</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Giovanni</forename><surname>Semeraro</surname></persName>
							<email>giovanni.semeraro@uniba.it</email>
							<affiliation key="aff0">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04-06, 2024</addrLine>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">A study on the soundness of closed-ended evaluation of Large Language Models adapted to the Italian language</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">3374DE17074283C19C789CFB170A4328</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Natural Language Processing</term>
					<term>Evaluation</term>
					<term>Benchmark</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>With the rising interest in Large Language Models, deep architectures capable of solving a wide range of Natural Language Generation tasks, an increasing number of open weights architectures have been developed and released online. In contrast with older architectures, which were aimed at solving specific linguistic assignments, Large Language Models have shown outstanding capabilities in solving several tasks at once, raising the question of whether they can truly comprehend natural language. Nevertheless, evaluating this kind of capability is far from easy. One of the proposed solutions so far is using benchmarks that combine various types of tasks. This approach is based on the premise that achieving good performance in each of these individual tasks can imply having developed a model capable of understanding language. However, while this assumption is not incorrect, it is evident that it is not sufficient, and the evaluation of Large Language Models still remains an open challenge. In this paper, we conduct a study aimed at highlighting the potential and limitations of current datasets and how a new evaluation setting applied to language-adapted Large Language Models may provide more insight than traditional approaches.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Large Language Models (LLMs) are models based on the Transformer architecture capable of solving a wide variety of Natural Language Generation (NLG) tasks, even those not encountered during training, due to their extensive training and large number of parameters. Thanks to their remarkable skills, interest in LLMs is now at its peak, resulting in a proliferation of open-weight models (e.g. LLaMA, Mistral, and many others). Among the several challenges related to the development of LLMs, one of the most critical is their evaluation <ref type="bibr" target="#b0">[1]</ref>. One approach to tackle this issue has been to build benchmarks that collect different datasets, with the aim of obtaining a more comprehensive evaluation of a model's overall capabilities. Currently, there is a leaderboard [2] (https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard) which keeps track of the capabilities of openly available LLMs. Specifically, the models are tested on six tasks that span different abilities a language model should have, e.g. reasoning or text completion. Regarding reasoning abilities, the models are tested on closed-ended tasks. Specifically, multiple-choice question answering tasks are provided, where a question is given with a list of possible alternatives, each associated with an identifier (a letter, a number, and so on). Intuitively, since the model has also been pre-trained on closed-ended question-answering data, it should be able to generalize and pick the correct choice out of the available ones. Furthermore, rather than generating the output directly, the probabilities learned by the model are studied, using log-likelihood to assess which option is more likely to be correct. For the English language, this evaluation methodology has been a standard approach to assess the capabilities of LLMs. 
However, when adapting a model to a new language, this methodology may not be as sound, due to the small amount of non-English data used to pre-train such models. The model only has to generate the correct option identifier, so the task does not really test the model's ability to generate high-quality text in another language. The goal of this work is to understand whether a new evaluation setting applied to language-adapted LLMs may give more insight than the traditional approach. Therefore, our contributions are the following:</p><p>• We test two evaluation settings for language-adapted LLMs, changing the structure of closed-ended question answering tasks; • We evaluate the performance of state-of-the-art models on these settings; • We study the sensitivity of the models to the input prompt.</p></div>
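The log-likelihood scoring scheme described above can be sketched as follows. Here `toy_score_fn` and its scores are hypothetical stand-ins for a real LM scorer, which would sum the log-probabilities of the option identifier's tokens conditioned on the prompt; only the selection logic is shown.

```python
def select_by_loglikelihood(score_fn, prompt, option_ids):
    """Pick the option identifier whose continuation the model
    considers most likely (highest log-likelihood)."""
    scores = {oid: score_fn(prompt, oid) for oid in option_ids}
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical log-likelihoods a model might assign to each identifier;
# a real score_fn would query the LM for log P(identifier | prompt).
toy = {"A": -2.3, "B": -0.4, "C": -1.7, "D": -3.1}

def toy_score_fn(prompt, option_id):
    return toy[option_id]

best, scores = select_by_loglikelihood(toy_score_fn, "Question: ...", "ABCD")
```

Under this scheme, the model never generates text: the "answer" is simply the identifier with the highest score, which is exactly the property questioned in the rest of the paper.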
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Related Works</head><p>Language Model evaluation has been a research focus ever since the first Decoder-only models, which were designed for natural language generation. One of the most remarkable reasoning-related skills of LLMs has been in-context learning. In particular, few-shot learning has been increasingly used: the idea is that providing input-output examples in the model prompt should positively affect the generation process <ref type="bibr" target="#b2">[3]</ref>.</p><p>There are multiple leaderboards which evaluate open LLMs on non-English languages, e.g. the Open PL LLM Leaderboard <ref type="bibr" target="#b3">[4]</ref> for Polish or the Open KO LLM Leaderboard <ref type="bibr" target="#b4">[5]</ref> for Korean. These leaderboards are often based on the lm-evaluation-harness framework <ref type="bibr" target="#b5">[6]</ref>, which has been a milestone in the evaluation of LLMs. LLM evaluation can also depend on the topic at hand: some works focus on mathematical reasoning <ref type="bibr" target="#b6">[7]</ref> as well as factuality <ref type="bibr" target="#b7">[8]</ref>.</p><p>These evaluation settings often rely on closed-ended tasks, specifically multiple-choice question answering. The idea is to calculate the log-likelihood of the next token to generate for each option identifier. However, this may not be the best setting to evaluate LLMs. Wang et al. <ref type="bibr" target="#b8">[9]</ref> studied this on Instruction-tuned LLMs by training a classifier to predict which option to associate with the generated answer. This was done to look past additional text generated by the model (e.g. the generated text could be "The answer is B." as opposed to the simple "B." token). They found that the log-likelihood decisions and the generated-text decisions often did not match.</p><p>Regarding Italian evaluation, some works have approached this challenge. Bacciu et al. 
<ref type="bibr" target="#b9">[10]</ref> released another version of the Open Italian LLM Leaderboard, considering a different variety of tasks. Mercorio et al. <ref type="bibr" target="#b10">[11]</ref> released a benchmark based on questions from the INVALSI test, an Italian educational test, to further probe the knowledge and reasoning abilities of these models on a dataset that is natively in Italian rather than obtained through machine translation. The latter is one of the main problems when evaluating these models: due to the lack of resources with respect to the English language, the datasets used at the state of the art are translated using machine translation models. Still, all this effort to evaluate Italian-adapted LLMs mainly relies on closed-ended tasks.</p></div>
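The mismatch studied by Wang et al. [9] arises because a free-form answer must first be mapped back to one of the listed options. A rule-based sketch of that mapping step is shown below; it is a hypothetical stand-in for their trained classifier and only handles the simple "standalone letter" case.

```python
import re

def extract_choice(generated, identifiers=("A", "B", "C", "D")):
    """Heuristic sketch: return the first standalone option identifier
    found in a free-form answer (e.g. "The answer is B."), or None if
    the generation does not commit to any listed option. A rule-based
    stand-in for the trained classifier used by Wang et al. [9]."""
    for match in re.finditer(r"\b([A-Z])\b", generated):
        if match.group(1) in identifiers:
            return match.group(1)
    return None

print(extract_choice("The answer is B."))   # finds "B"
print(extract_choice("Non lo so."))         # no identifier -> None
```

Comparing this extracted choice with the log-likelihood argmax is one way to observe the disagreement the authors report.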
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Experiments</head><p>We study pre-trained and language-adapted models to test their capabilities in solving Italian language tasks. Specifically, we modify the typical formatting used in multiple-choice question answering to study whether the models are capable of correctly following instructions and generating Italian text. Usually, the format shown in Listing 1 is used, where &lt;QUESTION&gt; is the question the model has to answer, while &lt;IDENTIFIER_i&gt; and &lt;OPTION_i&gt; are, respectively, the option identifier (usually a letter or a number) and the text of a possible answer to the question. &lt;CORRECT_IDENTIFIER&gt; is the identifier of the option that is the correct answer to the question.</p><formula xml:id="formula_0">&lt;QUESTION&gt;: &lt;IDENTIFIER_1&gt; &lt;OPTION_1&gt; &lt;IDENTIFIER_2&gt; &lt;OPTION_2&gt; ... &lt;IDENTIFIER_N&gt; &lt;OPTION_N&gt; &lt;CORRECT_IDENTIFIER&gt; Listing 1: closed-ended format</formula><p>We aim to modify the task so that the model has to generate the text of the correct option instead of its identifier. To do so, we consider two main evaluation settings:</p><p>• Open-ended (OE): we remove the available options and only supply the question in the prompt; • Closed-ended no identifiers (CE-NI): we format the options without identifiers, and the model has to write the text of the correct option.</p><p>In particular, for the CE-NI setting, we apply the format shown in Listing 2, where &lt;CORRECT_OPTION&gt; is the text of the option that represents the correct answer to the question. Generally, models are also evaluated by calculating log-likelihoods rather than generating text directly; the chosen option is then selected based on the highest value. 
We choose to perform a generative task instead, to check whether the models are capable of generating only the answer string, without additional text, and whether they generate something outside of the provided options. To evaluate this setting, we use the BLEU, ROUGE-L and BertScore F1 metrics, which are standard reference-based metrics for assessing how closely a generated sentence matches a reference one. BLEU and ROUGE-L focus on matching n-grams, while BertScore leverages pre-trained BERT models to assess the semantic similarity between the words of the two texts. Furthermore, we consider four different prompt formats:</p><p>• Plain (P): there is no formatting, the text of the task is provided as it is in the prompt, and only a "Risposta:" string is added at the end; • Plain few-shot (P-F): the plain format preceded by input-output examples; • Instruct (I): the task is wrapped in the model's chat template; • Instruct few-shot (I-F): the chat-template format preceded by input-output examples. Furthermore, for the few-shot formats, we consider two distinct numbers of examples to provide in the prompt: one-shot and five-shot. The intuition is that a language-adapted LLM should significantly improve its performance even when provided with a single example.</p><p>We consider these prompt formats because most of the evaluation settings for Italian LLMs are run without applying the chat template. We argue that this choice may not be the best one when considering Instruct models that have been trained using a specific prompt format to continue a conversation. They should be evaluated using the same prompt format, since it is also the one that will be used in case of deployment.</p><p>To set up the experimental protocol, we use the lm-evaluation-harness library <ref type="bibr" target="#b5">[6]</ref>, which provides an immediate and intuitive command line to automatically evaluate LLMs on previously defined as well as custom tasks. Specifically, we define custom tasks within the library following the previously defined evaluation settings. 
To do so, we consider the following datasets:</p><p>• ARC-Challenge <ref type="bibr" target="#b11">[12]</ref>: consists of multiple-choice science exam questions; the Challenge set contains complex questions that were answered incorrectly by both a retrieval-based and a co-occurrence method;</p><p>• MMLU <ref type="bibr" target="#b12">[13]</ref>: consists of multiple-choice questions from 57 different topics (e.g. mathematics, computer science, and so on), requiring problem-solving abilities and knowledge to answer correctly; • EXAMS <ref type="bibr" target="#b13">[14]</ref>: consists of multiple-choice questions from high school exams. The dataset contains different subsets curated for different languages and optionally includes additional paragraphs related to the question (extracted from Wikipedia); • WWBM <ref type="bibr" target="#b14">[15]</ref>: consists of multiple-choice questions spanning a wide range of topics. The questions come from the Italian version of the "Who Wants to Be a Millionaire?" board game, where contestants answer progressively more difficult questions.</p><p>The question-answer instances are split into different categories depending on the difficulty of the question itself.</p><p>Regarding the Italian versions of these datasets, both EXAMS and WWBM natively provide splits in the Italian language. For ARC and MMLU, instead, we use the Italian version provided in the library for the okapi task released by Lai et al. <ref type="bibr" target="#b15">[16]</ref>, who performed automatic translation of the original datasets into several languages using GPT-3.5 Turbo. For all of these datasets, we define two custom tasks which apply the OE and CE-NI evaluation settings automatically. The examples used in the few-shot settings are taken from the validation splits of the datasets. 
For EXAMS, we use the train split as a test split (since a test split is not provided), while for WWBM, we remove the first five instances from the original dataset and use them as a validation split.</p><p>Regarding the models, we experiment with the following:</p><p>• Italia-9B-Instruct-v0.1 2 : trained from scratch with a focus on the Italian language (90% of the data in Italian and the rest in English), with instruction tuning for conversational purposes; • LLaMAntino-2-chat-13b-hf-UltraChat-ITA <ref type="bibr" target="#b16">[17]</ref>: instruction tuning of LLaMAntino-2-chat-13b-hf-ITA (an Italian-adapted LLM) using a translated version of the UltraChat dataset; • LLaMAntino-3-ANITA-8B-Inst-DPO-ITA <ref type="bibr" target="#b17">[18]</ref>: fine-tuning, DPO and adaptation using a mixture of Italian and English datasets, starting from the LLaMA-3-8B-Instruct model; • maestrale-chat-v0.4-alpha-sft 3  </p></div>
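As a concrete illustration of the settings above, the prompts can be sketched with a small formatting helper. The exact strings (newline placement, the "Risposta:" suffix of the Plain format, letter identifiers for the classic format) are our assumptions, not the paper's verbatim templates.

```python
def build_prompt(question, options, setting="CE-NI"):
    """Sketch of the evaluated prompt settings:
    OE    -> question only, no options;
    CE-NI -> options listed without identifiers;
    CE    -> classic closed-ended format (Listing 1), for contrast."""
    if setting == "OE":
        body = ""
    elif setting == "CE-NI":
        body = "\n".join(options) + "\n"
    else:  # "CE": options prefixed with letter identifiers
        body = "\n".join(f"{i} {o}" for i, o in zip("ABCD", options)) + "\n"
    return f"{question}:\n{body}Risposta:"

opts = ["Milano", "Roma", "Napoli", "Torino"]
print(build_prompt("Qual è la capitale d'Italia", opts, "OE"))
print(build_prompt("Qual è la capitale d'Italia", opts, "CE-NI"))
```

In the OE setting the gold answer is the full option text with nothing to copy from, while in CE-NI the model can restrict its generation to one of the listed strings.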
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 1</head><p>Results for the OE setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset and for each metric is in bold</p><p>• Meta-Llama-3-8B (https://huggingface.co/meta-llama/Meta-Llama-3-8B) and Meta-Llama-3-8B-Instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct): latest version of the LLaMA family of models released by Meta (base and instruct version, respectively); • Minerva-3B-base-v1.0 (https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0): trained from scratch to be a proficient bilingual (English and Italian) base model; • zefiro-7b-dpo-ITA (https://huggingface.co/mii-community/zefiro-7b-dpo-ITA): based on zephyr by Tunstall et al. <ref type="bibr" target="#b18">[19]</ref>, with DPO training done on top of zefiro-7b-sft-ITA.</p><p>Furthermore, to test whether bilingual training helps the model solve these tasks, we instruction-tuned two new models. We start from the Meta-Llama-3-8B-Instruct checkpoint and fine-tune the model on 40,000 instances from 3 different datasets: databricks-dolly-15k, OpenOrca and UltraChat. The datasets are automatically translated to Italian using ChatGPT 3.5. We consider two different settings: one where 20,000 instances are kept for each language (Italian and English), and one where all 40,000 instances are kept for the Italian language only. For instruction tuning, we used LoRA with r equal to 16 and alpha equal to 16, targeting all linear layers of the model. Other hyperparameters are an effective batch size of 128, a learning rate of 2e-5, a weight decay of 0.01 and 5 warmup steps. In both cases, the instances used during training are chosen at random.</p><p>For all experiments, we use the greedy-decoding generation strategy with a maximum number of tokens to </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Results for the CE-NI setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset and for each metric is in bold generate equal to 64. This limit was set for computational reasons, and the value was chosen after studying the datasets to assess the number of tokens required for each answer. No combination of tokenizer and dataset had a 95th percentile greater than 50 for the answer token count, therefore we can safely set the previously defined boundary. We also set torch.bfloat16 and use flash-attention-2 <ref type="bibr" target="#b19">[20]</ref> to speed up the generation process.</p><p>Inference was always done with the batch size set to 1 to maximize the quality of the generated text. Furthermore, we consider changing the number of few-shot examples given in the prompt. Our assumption is that the models may learn to follow the patterns given in the examples, and therefore Italian language generation may become more likely thanks to the additional information conveyed in the prompt. We aim to mitigate this potential bias by decreasing the number of shots. Thus, the number of shots for all settings using a few-shot strategy was set to either 1 or 5.</p><p>We report the results of the OE setting in Table <ref type="table">1</ref> and of the CE-NI setting in Table <ref type="table">2</ref> and comment on them in the following section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Hardware and Software Configuration</head><p>Our experimental setup consisted of a multi-node cluster provided by Fastweb SpA and equipped with Nvidia H100 GPUs for distributed training and evaluation. We used a suite of open-source libraries, including Transformers from Hugging Face <ref type="bibr" target="#b20">[21]</ref>, which provides seamless integration with PyTorch <ref type="bibr" target="#b21">[22]</ref> and DeepSpeed <ref type="bibr" target="#b22">[23]</ref>, as well as Unsloth <ref type="foot" target="#foot_0">8</ref> and TRL <ref type="bibr" target="#b23">[24]</ref>. This software stack has been instrumental in efficiently handling large data sets and complex models. This configuration allowed for parallelization of computations, significantly reducing training and evaluation time. DeepSpeed optimized memory usage and communication between nodes, allowing us to effortlessly scale evaluation processes across multiple model architectures.</p><p>The hardware-software combination ensured efficient, cost-effective, and reproducible experiments, which are critical for comparing multiple models and training new ones efficiently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Findings and Additional Tests</head><p>Analyzing the results, it is clear that the OE strategy did not yield satisfactory results for BLEU and ROUGE-L. We attribute this to the difficulty of generating a response that exactly matches the ground truth when the text that can be generated is not constrained in any way. To further support this point, we can see that the BertScore of some experiments yields good results, hinting that the semantics of the generated content is similar to that of the ground truth.</p><p>Regarding the CE-NI strategy, the obtained results are much better for all metrics. Therefore, providing the options in the input prompt greatly helped the models limit their generation to the provided options. Surprisingly, whereas fine-tuned versions of the LLaMA 3 family were shown to have much better results on the Italian leaderboard, here their results are in line with the base models (or even worse in some cases). Furthermore, one of the best-performing models is maestrale-chat-v0.4-alpha-sft, which consistently outperforms the LLaMA 3 models in most cases.</p><p>For both settings, the obtained results show that providing input-output examples in the prompt greatly enhances performance.</p><p>For both settings, primarily Instruct models were used. Upon analyzing the generated results, we observed instances where the model provided the correct result but appended an additional substring (e.g., the model began explaining the reasoning behind its response). To assess whether this might have affected the results, we performed an additional test where we checked if the ground truth string was a substring of the generated output (after removing punctuation and trailing whitespace, as well as lowercasing the two strings). We report the complete results in Appendix C. 
Overall, some models show an improvement in performance, but the results still do not beat maestrale-chat-v0.4-alpha-sft.</p><p>We provide some generation examples in Appendix B.</p></div>
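A minimal sketch of the relaxed matching test described above (lowercasing, punctuation removal, whitespace stripping, then a substring check); the exact normalization used in the experiments may differ in detail.

```python
import string

def normalized_substring_match(ground_truth, generated):
    """Return True if the normalized ground truth appears inside the
    normalized generated answer. Normalization: lowercase, strip
    punctuation, trim surrounding whitespace."""
    table = str.maketrans("", "", string.punctuation)
    gt = ground_truth.lower().translate(table).strip()
    gen = generated.lower().translate(table).strip()
    return gt in gen

# A correct answer with appended explanation still counts as a match.
print(normalized_substring_match("Roma", "La risposta corretta è Roma."))
```

This check deliberately over-credits verbose but correct answers, which is exactly the behavior the additional test was designed to measure.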
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusions and Future Works</head><p>We have carried out a study on the effectiveness of evaluating Italian-adapted LLMs on closed-ended tasks, specifically multiple-choice question answering.</p><p>We have experimented with two settings: an open-ended one and a closed-ended one without option identifiers. The results show better performance for the latter. Furthermore, they also show significant differences in model performance with respect to the Open Italian LLM Leaderboard. We can conclude that the evaluation of Italian-adapted models should follow a more rigorous procedure which does not mainly rely on closed-ended tasks. We release the code that was used on GitHub 9 . In the future, we plan to further work on this topic and attempt to define best practices for the evaluation of these models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Model Format ARC_IT MMLU_IT EXAMS WBMM</head><p>Italia-9B-Instruct-v0.   </p></div><figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table Sub -</head><label>Sub</label><figDesc>string matching results for the OE setting. For the few-shots formats, the number of given shots is also provided next to the format name. The best result for each dataset is in bold</figDesc><table><row><cell>Model</cell><cell cols="5">Format ARC_IT MMLU_IT EXAMS WBMM</cell></row><row><cell></cell><cell>P</cell><cell>0.00</cell><cell>0.38</cell><cell>0.30</cell><cell>73.56</cell></row><row><cell></cell><cell>P-F 1</cell><cell>39.86</cell><cell>33.19</cell><cell>37.53</cell><cell>52.43</cell></row><row><cell></cell><cell>P-F 5</cell><cell>44.74</cell><cell>36.03</cell><cell>40.10</cell><cell>56.62</cell></row><row><cell>Italia-9B-Instruct-v0.1</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>29.77</cell><cell>29.59</cell><cell>26.73</cell><cell>55.91</cell></row><row><cell></cell><cell>I-F 1</cell><cell>26.78</cell><cell>31.08</cell><cell>29.01</cell><cell>55.86</cell></row><row><cell></cell><cell>I-F 5</cell><cell>32.59</cell><cell>31.42</cell><cell>32.77</cell><cell>56.62</cell></row><row><cell></cell><cell>P</cell><cell>43.54</cell><cell>30.08</cell><cell>40.89</cell><cell>58.16</cell></row><row><cell></cell><cell>P-F 1</cell><cell>49.10</cell><cell>38.17</cell><cell>44.65</cell><cell>66.19</cell></row><row><cell></cell><cell>P-F 5</cell><cell>50.90</cell><cell>40.23</cell><cell>45.45</cell><cell>67.32</cell></row><row><cell>LLaMAntino-2-chat-13b-hf-UltraChat-ITA</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>41.66</cell><cell>26.29</cell><cell>34.75</cell><cell>60.56</cell></row><row><cell></cell><cell>I-F 
1</cell><cell>44.23</cell><cell>33.16</cell><cell>38.12</cell><cell>57.95</cell></row><row><cell></cell><cell>I-F 5</cell><cell>48.08</cell><cell>39.50</cell><cell>36.83</cell><cell>62.92</cell></row><row><cell></cell><cell>P</cell><cell>55.86</cell><cell>43.84</cell><cell>52.48</cell><cell>70.44</cell></row><row><cell></cell><cell>P-F 1</cell><cell>60.57</cell><cell>45.34</cell><cell>48.32</cell><cell>72.38</cell></row><row><cell></cell><cell>P-F 5</cell><cell>62.45</cell><cell>46.82</cell><cell>51.49</cell><cell>69.82</cell></row><row><cell>LLaMAntino-3-ANITA-8B-Inst-DPO-ITA</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>61.85</cell><cell>44.93</cell><cell>54.46</cell><cell>75.91</cell></row><row><cell></cell><cell>I-F 1</cell><cell>62.19</cell><cell>43.75</cell><cell>49.51</cell><cell>74.06</cell></row><row><cell></cell><cell>I-F 5</cell><cell>61.42</cell><cell>45.11</cell><cell>52.87</cell><cell>75.14</cell></row><row><cell></cell><cell>P</cell><cell>69.38</cell><cell>50.18</cell><cell>58.71</cell><cell>73.56</cell></row><row><cell></cell><cell>P-F 1</cell><cell>71.43</cell><cell>54.52</cell><cell>58.22</cell><cell>76.88</cell></row><row><cell>maestrale-chat-v0.4-alpha-sft</cell><cell>P-F 5</cell><cell>73.31</cell><cell>55.85</cell><cell>58.02</cell><cell>78.21</cell></row><row><cell></cell><cell>I</cell><cell>46.88</cell><cell>29.83</cell><cell>40.30</cell><cell>60.36</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.63</cell><cell>52.22</cell><cell>56.54</cell><cell>74.58</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.15</cell><cell>54.30</cell><cell>56.73</cell><cell>75.40</cell></row><row><cell></cell><cell>P</cell><cell>57.57</cell><cell>46.30</cell><cell>56.54</cell><cell>75.09</cell></row><row><cell>Meta-Llama-3-8B</cell><cell>P-F 1</cell><cell>63.13</cell><cell>46.88</cell><cell>51.58</cell><cell>71.20</cell></row><row><cell></cell><cell>P-F 
5</cell><cell>66.47</cell><cell>50.49</cell><cell>53.37</cell><cell>75.96</cell></row><row><cell></cell><cell>P</cell><cell>59.54</cell><cell>44.26</cell><cell>53.07</cell><cell>68.85</cell></row><row><cell></cell><cell>P-F 1</cell><cell>66.30</cell><cell>50.13</cell><cell>51.18</cell><cell>72.79</cell></row><row><cell></cell><cell>P-F 5</cell><cell>68.69</cell><cell>52.42</cell><cell>57.43</cell><cell>72.79</cell></row><row><cell>Meta-Llama-3-8B-Instruct</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>57.83</cell><cell>36.04</cell><cell>48.61</cell><cell>74.89</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.29</cell><cell>48.14</cell><cell>54.46</cell><cell>75.40</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.83</cell><cell>54.17</cell><cell>60.10</cell><cell>77.75</cell></row><row><cell></cell><cell>P</cell><cell>47.48</cell><cell>43.71</cell><cell>59.90</cell><cell>73.86</cell></row><row><cell>Minerva-3B-base-v1.0</cell><cell>P-F 1</cell><cell>25.66</cell><cell>28.51</cell><cell>23.86</cell><cell>33.25</cell></row><row><cell></cell><cell>P-F 5</cell><cell>20.10</cell><cell>23.09</cell><cell>22.87</cell><cell>34.94</cell></row><row><cell></cell><cell>P</cell><cell>48.76</cell><cell>39.18</cell><cell>41.58</cell><cell>60.67</cell></row><row><cell></cell><cell>P-F 1</cell><cell>55.00</cell><cell>40.37</cell><cell>46.04</cell><cell>62.56</cell></row><row><cell></cell><cell>P-F 5</cell><cell>60.31</cell><cell>45.34</cell><cell>48.42</cell><cell>64.86</cell></row><row><cell>zefiro-7b-dpo-ITA</cell><cell></cell><cell></cell><cell></cell><cell></cell><cell></cell></row><row><cell></cell><cell>I</cell><cell>31.48</cell><cell>31.50</cell><cell>40.40</cell><cell>72.69</cell></row><row><cell></cell><cell>I-F 1</cell><cell>50.98</cell><cell>46.11</cell><cell>45.15</cell><cell>66.55</cell></row><row><cell></cell><cell>I-F 
5</cell><cell>58.26</cell><cell>47.16</cell><cell>50.20</cell><cell>64.55</cell></row><row><cell></cell><cell>P</cell><cell>59.71</cell><cell>44.50</cell><cell>54.16</cell><cell>69.92</cell></row><row><cell></cell><cell>P-F 1</cell><cell>66.04</cell><cell>49.70</cell><cell>50.89</cell><cell>72.53</cell></row><row><cell>LLaMA3-BILINGUAL (Ours)</cell><cell>P-F 5 I</cell><cell>67.58 60.65</cell><cell>52.29 38.61</cell><cell>56.54 50.20</cell><cell>72.84 75.35</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.63</cell><cell>50.00</cell><cell>56.14</cell><cell>75.04</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.49</cell><cell>54.51</cell><cell>60.10</cell><cell>77.90</cell></row><row><cell></cell><cell>P</cell><cell>60.57</cell><cell>45.16</cell><cell>54.26</cell><cell>70.49</cell></row><row><cell></cell><cell>P-F 1</cell><cell>66.21</cell><cell>49.79</cell><cell>51.98</cell><cell>72.43</cell></row><row><cell>LLaMA3-ITA-ONLY (Ours)</cell><cell>P-F 5 I</cell><cell>67.67 59.88</cell><cell>52.38 37.08</cell><cell>57.23 50.40</cell><cell>73.71 75.40</cell></row><row><cell></cell><cell>I-F 1</cell><cell>69.21</cell><cell>50.19</cell><cell>56.63</cell><cell>74.94</cell></row><row><cell></cell><cell>I-F 5</cell><cell>70.40</cell><cell>54.28</cell><cell>59.41</cell><cell>77.65</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_6"><head>Table</head><label></label><figDesc>Substring matching results for the CE-NI setting. For the few-shot formats, the number of given shots is also provided next to the format name. The best result for each dataset is in bold.</figDesc><table /></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_0">https://github.com/unslothai/unsloth</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>We acknowledge the support of the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Appendix</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prompt Formats</head><p>All showcased examples in this section are obtained from Meta-Llama-3-8B-Instruct model.</p><p>Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. Il freddo si sposta dal cubetto di ghiaccio alla sua mano. Risposta:</p><p>Example 1: Prompt in the P-F format for the OE setting Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre sessualmente e asessualmente? Opzioni: Consente alle piante di crescere più in alto. Produce fiori che attraggono gli insetti. Produce more che hanno un sapore migliore. Permette alle piante di more di adattarsi a nuove condizioni. Risposta: Permette alle piante di more di adattarsi a nuove condizioni.</p><p>Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. Il freddo si sposta dal cubetto di ghiaccio alla sua mano. Risposta:</p><p>Example 2: Prompt in the P-F 1 format for the OE setting &lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. 
Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; Example 3: Prompt in the I-F format using LLaMA 3 chat template &lt;|begin_of_text|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; Le more selvatiche si riproducono asessualmente sprigionando nuove radici quando i loro steli toccano il terreno. Si riproducono anche sessualmente attraverso i loro fiori. Qual è il vantaggio della pianta di more di potersi riprodurre sessualmente e asessualmente? Opzioni: Consente alle piante di crescere più in alto. Produce fiori che attraggono gli insetti. Produce more che hanno un sapore migliore. Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; Permette alle piante di more di adattarsi a nuove condizioni.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;user&lt;|end_header_id|&gt; Anna tiene un cubetto di ghiaccio. Perché si scioglie il cubetto di ghiaccio nella sua mano? Opzioni: Il calore si sposta dalla sua mano al cubetto di ghiaccio. Il freddo si sposta dalla sua mano al cubetto di ghiaccio. Il calore si sposta dal cubetto di ghiaccio alla sua mano. Il freddo si sposta dal cubetto di ghiaccio alla sua mano.&lt;|eot_id|&gt;&lt;|start_header_id|&gt;assistant&lt;|end_header_id|&gt; Example 4: Prompt in the I-F 1 format using LLaMA 3 chat template</p></div>
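The I-F prompts shown above are assembled with the LLaMA 3 chat template: each few-shot example becomes a completed user/assistant exchange, and the final question is left with an open assistant header so the model generates the answer. A minimal sketch of that assembly (the exact whitespace around the special tokens is an assumption; in practice the template is normally applied via the tokenizer's `apply_chat_template` utility from the `transformers` library):

```python
def build_llama3_fewshot_prompt(shots, question):
    """Assemble an I-F style prompt with the LLaMA 3 chat template.

    shots: list of (question, answer) pairs used as in-context examples.
    question: the final question the model must answer.
    """
    parts = ["<|begin_of_text|>"]
    for q, a in shots:
        # Each shot is a completed user/assistant exchange.
        parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{q}<|eot_id|>")
        parts.append(f"<|start_header_id|>assistant<|end_header_id|>\n\n{a}<|eot_id|>")
    # Final user turn; the assistant header is left open so the model
    # continues from it with its answer.
    parts.append(f"<|start_header_id|>user<|end_header_id|>\n\n{question}<|eot_id|>")
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)
```

With one (question, answer) shot plus the target question, this reproduces the structure of Example 4; with an empty shot list, the structure of Example 3.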
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Zero-shot Response Examples</head><p>All showcased examples in this section are obtained from the Meta-Llama-3-8B-Instruct model.</p><p>Una sorgente sonora di frequenza f₀ si muove con velocità costante lungo una circonferenza. Nel centro della circonferenza si trova il ricevitore del suono. Quale asserzione è esatta per la frequenza f registrata dal ricevitore? Risposta:</p><p>Ground truth: f = f₀ Generated Answer: La frequenza f registrata dal ricevitore è costante e uguale a f₀ Example 5: Generated answer with additional text for the OE setting Il periodo di rotazione di un satellite artificiale intorno ad un pianeta è T. La distanza tra il satellite ed il centro del pianeta è r. A che distanza dal centro del pianeta ruota un altro satellite se il suo periodo di rotazione è T/8? Opzioni: 8r r/8 4r r/4 Risposta:</p><p>Ground truth: r/4 Generated Answer: r/8 Spiegazione: Se il periodo di rotazione del satellite è T/8, allora la sua distanza dal centro del pianeta è r/8. Infatti, se il satellite ruota in 8T, allora la sua distanza dal centro del pianeta è Example 6: Generated answer with additional text for the CE-NI setting</p></div>
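Examples 5 and 6 illustrate why exact-match scoring penalizes generations that append explanatory text to an otherwise usable answer, which motivates the substring-matching metric. A minimal sketch of that metric (the lowercasing and whitespace normalization here are assumptions, not the authors' exact implementation):

```python
def substring_match(generated: str, gold: str) -> bool:
    """Return True if the gold answer occurs as a substring of the generated text.

    Unlike exact match, this still credits answers followed by extra
    explanatory text, as in Example 5.
    """
    return gold.strip().lower() in generated.strip().lower()


def substring_accuracy(predictions, references):
    """Fraction of examples whose gold answer appears in the generation."""
    hits = sum(substring_match(p, g) for p, g in zip(predictions, references))
    return hits / len(references)
```

Note that substring matching is still strict about surface form: in Example 6 the gold answer "r/4" does not occur anywhere in the generated text, so the prediction is (correctly) counted as wrong.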
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C. Substring Matching Results</head></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">A survey on evaluation of large language models</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Yi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on Intelligent Systems and Technology</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1" to="45" />
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Fourrier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Lozovskaya</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Szafer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<ptr target="https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard" />
		<title level="m">Open llm leaderboard v2</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Wróbel</surname></persName>
		</author>
		<ptr target="https://huggingface.co/spaces/speakleash/open_pl_llm_leaderboard" />
		<title level="m">Open pl llm leaderboard</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>SpeakLeash Team, Cyfronet Team</note>
</biblStruct>

<biblStruct xml:id="b4">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Park</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Cho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Lee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Lee</surname></persName>
		</author>
		<title level="m">Open ko-llm leaderboard: Evaluating large language models in korean with ko-h5 benchmark</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note>ACL Main</note>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Tow</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Black</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Dipofi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Foster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Golding</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Hsu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mcdonell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Phang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Reynolds</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Thite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<idno type="DOI">10.5281/zenodo.5371628</idno>
		<ptr target="https://doi.org/10.5281/zenodo.5371628" />
		<title level="m">A framework for few-shot language model evaluation</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">9</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<analytic>
		<title level="a" type="main">Large language models for mathematical reasoning: Progresses and challenges</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ahn</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Verma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Lou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yin</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</title>
				<meeting>the 18th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="225" to="237" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Head-to-tail: How knowledgeable are large language models (LLMs)? AKA will LLMs replace knowledge graphs?</title>
		<author>
			<persName><forename type="first">K</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Zha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><forename type="middle">L</forename><surname>Dong</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long Papers</title>
		<meeting>the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</meeting>
		<imprint>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="311" to="325" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<monogr>
		<title level="m" type="main">&quot;My answer is C&quot;: First-token probabilities do not match text answers in instruction-tuned language models</title>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Hu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Weber-Genzel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Röttger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Kreuter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Hovy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Plank</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2402.14499</idno>
		<ptr target="https://arxiv.org/abs/2402.14499" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">DanteLLM: Let&apos;s push Italian LLM research forward!</title>
		<author>
			<persName><forename type="first">A</forename><surname>Bacciu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Campagnano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Trappolini</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Silvestri</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2024.lrec-main.388" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)</title>
				<editor>
			<persName><forename type="first">N</forename><surname>Calzolari</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M.-Y</forename><surname>Kan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Hoste</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Lenci</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Sakti</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Xue</surname></persName>
		</editor>
		<meeting>the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)<address><addrLine>Torino, Italia</addrLine></address></meeting>
		<imprint>
			<publisher>ELRA and ICCL</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="page" from="4343" to="4355" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<title level="m" type="main">Disce aut deficere: Evaluating llms proficiency on the invalsi italian benchmark</title>
		<author>
			<persName><forename type="first">F</forename><surname>Mercorio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mezzanzanica</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Potertì</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Serino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Seveso</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.17535</idno>
		<ptr target="https://arxiv.org/abs/2406.17535" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b11">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Cowhey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Etzioni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Khot</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Sabharwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Schoenick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Tafjord</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1803.05457v1</idno>
		<title level="m">Think you have solved question answering? try arc, the ai2 reasoning challenge</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<analytic>
		<title level="a" type="main">Measuring massive multitask language understanding</title>
		<author>
			<persName><forename type="first">D</forename><surname>Hendrycks</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Burns</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Basart</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Mazeika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Steinhardt</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the International Conference on Learning Representations</title>
				<meeting>the International Conference on Learning Representations</meeting>
		<imprint>
			<publisher>ICLR</publisher>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">EXAMS: A multisubject high school examinations dataset for cross-lingual and multilingual question answering</title>
		<author>
			<persName><forename type="first">M</forename><surname>Hardalov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Mihaylov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zlatkova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dinkov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Koychev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.438</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.438" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5427" to="5444" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Playing with knowledge: A virtual player for &quot;who wants to be a millionaire?&quot; that leverages question answering techniques</title>
		<author>
			<persName><forename type="first">P</forename><surname>Molino</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Lops</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>De Gemmis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<idno type="DOI">10.1016/j.artint.2015.02.003</idno>
		<ptr target="https://doi.org/10.1016/j.artint.2015.02.003" />
	</analytic>
	<monogr>
		<title level="j">Artificial Intelligence</title>
		<imprint>
			<biblScope unit="volume">222</biblScope>
			<biblScope unit="page" from="157" to="181" />
			<date type="published" when="2015">2015</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<analytic>
		<title level="a" type="main">Okapi: Instruction-tuned large language models in multiple languages with reinforcement learning from human feedback</title>
		<author>
			<persName><forename type="first">V</forename><surname>Lai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Dernoncourt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Rossi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Nguyen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</title>
				<meeting>the 2023 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="318" to="327" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b16">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Musacchio</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Siciliani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Fiameni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2312.09993</idno>
		<title level="m">Llamantino: Llama 2 models for effective text generation in italian language</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b17">
	<monogr>
		<author>
			<persName><forename type="first">M</forename><surname>Polignano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Basile</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Semeraro</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2405.07101</idno>
		<title level="m">Advanced natural-based interaction for the italian language: Llamantino-3-anita</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b18">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Tunstall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Beeching</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Rajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Rasul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belkada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Fourrier</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Habib</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Sarrazin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Sanseviero</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.16944</idno>
		<title level="m">Zephyr: Direct distillation of lm alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">FlashAttention-2: Faster attention with better parallelism and work partitioning</title>
		<author>
			<persName><forename type="first">T</forename><surname>Dao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations (ICLR)</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Transformers: State-of-the-art natural language processing</title>
		<author>
			<persName><forename type="first">T</forename><surname>Wolf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Debut</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Sanh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chaumond</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Delangue</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Moi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Cistac</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Rault</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Louf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Funtowicz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Davison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shleifer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Von Platen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Jernite</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Plu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">L</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gugger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Drame</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Lhoest</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">M</forename><surname>Rush</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/2020.emnlp-demos.6" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</title>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations</meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="38" to="45" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Pytorch 2: Faster machine learning through dynamic python bytecode transformation and graph compilation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Ansel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Gimelshein</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Jain</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Voznesensky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Bao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Berard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Burovski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chauhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Chourdia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Constable</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Desmaison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Devito</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Ellison</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Feng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Gschwind</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hirsh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Kalambarkar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Kirsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lazos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lezcano</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Luk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Maher</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Pan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Puhrsch</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Reso</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Saroufim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">Y</forename><surname>Siraichi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Suk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Suo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Tillet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><surname>Zou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mathews</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Chanan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Chintala</surname></persName>
		</author>
		<idno type="DOI">10.1145/3620665.3640366</idno>
		<ptr target="https://pytorch.org/assets/pytorch2-2.pdf" />
	</analytic>
	<monogr>
		<title level="m">29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems</title>
				<imprint>
			<publisher>ACM</publisher>
			<date type="published" when="2024">2024</date>
			<biblScope unit="volume">2</biblScope>
		</imprint>
	</monogr>
	<note>ASPLOS &apos;24</note>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<title level="m" type="main">DeepSpeed data efficiency: Improving deep learning model quality and training efficiency via efficient data sampling and routing</title>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Yao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Holmes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.03597</idno>
		<ptr target="https://arxiv.org/abs/2212.03597" />
		<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Von Werra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Belkada</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Tunstall</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Beeching</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Thrush</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<ptr target="https://github.com/huggingface/trl" />
		<title level="m">TRL: Transformer reinforcement learning</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
