<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Teasing LLMs adapted to Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>André Freitas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli studi Roma Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Instruction-tuned Large Language Models (It-LLMs) are changing NLP thanks to their easy accessibility. These models seem able to grasp language, solve complex tasks, and perform well even with few resources. These abilities and this ease of handling democratize their use, enabling many researchers to produce their own homemade It-LLMs. However, a complete understanding of their potential is still lacking due to the black-box nature of many models and the absence of holistic evaluation studies. We present an evaluation resource for It-LLMs tuned in Italian to address these challenges. Our proposal includes evaluating models on several aspects. We take a holistic approach to analyzing model performance factors, including the pre-training base, instruction-tuning data, and training methods. Our results reveal that data quality is the most crucial factor in scaling model performance. While available open-source models demonstrate impressive ability, they present problems when customized adapters are used. We are encouraged by the rapid development of models by the open-source community. However, we also highlight the need for rigorous evaluation to support such claims.</p>
      </abstract>
      <kwd-group>
<kwd>Instruction-tuned Large Language Models</kwd>
        <kwd>Multilingual LLMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The advent of Instruction-tuned Large Language Models (It-LLMs) marks yet another change in NLP in the last few decades. Indeed, their abilities are evident in numerous applications, from complex problem solving to information retrieval to conversational assistants such as ChatGPT. Examples include GPT-4, which demonstrates abilities in language comprehension and common sense, logical-mathematical problem solving, law, and medicine. However, despite their remarkable competence and adaptability, the full extent of their potential has yet to be fully understood. Indeed, their direction is poorly captured, given many models' simple use, black-box nature, and lack of in-depth and holistic evaluation studies [1, 2, 3].</p>
      <p>To manage these challenges and to understand the abilities of these models more deeply, a series of evaluation benchmarks explicitly designed for the comprehensive evaluation of It-LLMs has been introduced [4, 5, 6, 7, 8, 9].</p>
      <p>However, evaluation resources are only available in English, and it is tricky and misleading to use them to evaluate a model trained on instructions in the Italian language.</p>
      <p>In this paper, we propose evaluation resources for Italian It-LLMs. Furthermore, we tested a set of open-source It-LLMs fine-tuned in the Italian language, demonstrating excellent adaptability but also some gaps in downstream performance. In particular, our methodology, applying a systematic and holistic approach, examines the problem-solving ability, writing ability, and alignment between languages of customized It-LLMs that are fine-tuned in a specific language, i.e., Italian, starting from the work proposed by Chia et al. [5]. Through a rigorous exploration of these factors, we seek to shed light on the vital elements that determine the performance of the models, facilitating an understanding of how these models can best be harnessed to meet our needs. Our contribution is fully available and open-source (https://github.com/LeonardRanaldi/italian-instruct-eval).</p>
    </sec>
    <sec id="sec-1a">
      <title>2. The Open-Source Instructed LLMs</title>
      <p>Large Language Models (LLMs) have caught mainstream attention and have become a comprehensive category of models. LLMs can be understood as pre-trained models that are fine-tuned with general language prompts or with instructions. Therefore, we distinguish between basic and Instructed models, where basic LLMs are pre-trained LLMs that can be fine-tuned on instructions to become Instruction-tuned LLMs (It-LLMs). In particular, in Table 1, we summarize mainly open-source LLMs, given the need for more transparency and reproducibility than closed-source models offer.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Open-source Large Language Models; with ∗ we denote that the data dump is not available.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Backbone</th><th>Size</th><th>Source</th><th>Training</th></tr>
          </thead>
          <tbody>
            <tr><td>Alpaca [11]</td><td>LLaMA</td><td>7-30B</td><td>Alpaca data</td><td>Supervised</td></tr>
            <tr><td>Baize [13]</td><td>LLaMA</td><td>7-30B</td><td>Self-Chat data</td><td>Supervised</td></tr>
            <tr><td>Vicuna [14]</td><td>LLaMA</td><td>7-33B</td><td>ShareGPT data</td><td>Supervised</td></tr>
            <tr><td>Falcon [15]</td><td>Falcon</td><td>7-40B</td><td>RefinedWeb</td><td>Supervised</td></tr>
            <tr><td>ChatGLM∗ [16]</td><td>GLM</td><td>6B</td><td>Unknown</td><td>RLHF</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The essential part of the Instruction-tuning idea is the data used to train LLMs. Indeed, factors such as quality, quantity, and format can determine the behavior of the instructed model. Table 3 presents several open-source resources. There is a growing tendency to exploit synthetic instruction data obtained from closed-source models [10, 11]. While this practice can allow instructed models to mimic the behavior of closed-source models, it can also lead to problems such as the inheritance of the black-box nature of closed-source models and instability due to noisy synthetic instructions [12].</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Open-source Instruction-tuning datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Size</th><th>Domain</th><th>Source</th></tr>
          </thead>
          <tbody>
            <tr><td>Self-Instruct [10]</td><td>52K</td><td>General</td><td>GPT-3</td></tr>
            <tr><td>Alpaca [11]</td><td>52K</td><td>General</td><td>ChatGPT</td></tr>
            <tr><td>Self-Chat [13]</td><td>100K</td><td>Dialogue</td><td>ChatGPT</td></tr>
            <tr><td>ShareGPT [14]</td><td>70K</td><td>Dialogue</td><td>Human annotation</td></tr>
          </tbody>
        </table>
      </table-wrap>
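      <p>To make the data format concrete, the following is a minimal sketch of an Alpaca-style instruction record and the prompt template typically built from it. The field names follow the public Alpaca release, while the Italian strings are purely illustrative.</p>
      <preformat>
# A minimal sketch of an Alpaca-style instruction record and the prompt
# template usually built from it. The Italian example strings are invented
# for illustration; real adapters train on full translated datasets.
example = {
    "instruction": "Riassumi il seguente paragrafo in una frase.",  # task description
    "input": "Le Alpi sono una catena montuosa che attraversa l'Europa.",  # optional context
    "output": "Le Alpi sono una catena montuosa europea.",  # target response
}

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(record: dict) -> str:
    """Render one training prompt; the model learns to continue it with record['output']."""
    return PROMPT_TEMPLATE.format(**record)

print(build_prompt(example))
      </preformat>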
      <p>A holistic overview of the instructed open-source models adapted to Italian can be found in Table 2, where the base model, its size, the instruction dataset, and the training method of each It-LLM are given. We observe a variety of model sizes and datasets. Therefore, this overview of open-source instructed LLMs provides comprehensive factors for evaluation and analysis.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Details of open-source instructed LLMs.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Backbone</th><th>Size</th><th>Source</th><th>Training</th></tr>
          </thead>
          <tbody>
            <tr><td>Camoscio [17]</td><td>LLaMA</td><td>7B</td><td>Alpaca data (Italian)</td><td>Supervised</td></tr>
            <tr><td>Stambecco [18]</td><td>LLaMA</td><td>7-13B</td><td>Alpaca data (Italian)</td><td>Supervised</td></tr>
            <tr><td>Fauno [19]</td><td>LLaMA</td><td>7-13B</td><td>Baize data (Italian)</td><td>Supervised</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Recent work shows the elasticity and customization of It-LLMs in many languages. Santilli and Rodolà [17] translated Alpaca [11] into Italian, proposing Camoscio. Later, in Stambecco [18], the author reproduced the same work while modifying some parameters. In [19], the models of the Baize [13] family were adapted to Italian with Fauno. In this new scenario, evaluation has become increasingly important and challenging. Recent evaluation studies produce concrete results such as accuracy and precision [5, 22]. However, these methodologies are generic and not customized for a specific task and language.</p>
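      <p>As an illustration of how such customized adapters are applied in practice, the following is a minimal sketch that attaches a LoRA adapter to a LLaMA backbone with the Hugging Face peft library. Both checkpoint names are placeholders, not the exact repositories of the models above.</p>
      <preformat>
# Sketch: attaching an Italian LoRA adapter to a LLaMA backbone with the
# Hugging Face `peft` library. Both model identifiers below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "huggyllama/llama-7b"            # assumed base checkpoint
ADAPTER = "your-org/italian-lora-7b"    # hypothetical Italian adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# The adapter stores only the low-rank weight deltas; the backbone is reused.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()
      </preformat>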
      <p>Finally, Ranaldi et al. [4], generalizing previous work, proposed a cross-lingual approach that elicits It-LLMs with a multilingual Alpaca empowered with translation-following demonstrations.</p>
      <p>In this paper, we propose an Italian evaluation method for Italian fine-tuned It-LLMs. Our method is based on various general skills and usage scenarios applicable to the adapted It-LLMs.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. Challenges &amp; Methods in Evaluating Instruction-tuned LLMs</title>
      <sec id="sec-1b-1">
        <title>3.1. Background and Challenges</title>
        <p>The highest wall in evaluating LLMs is the closed-source concept, where creators often hide model details, instruction datasets, and training methods. Such models thus lead to a knowledge vacuum in the research community, as it is impossible to rigorously analyze the reasons for their performance.</p>
        <p>On the other side of the coin is an ongoing open-source development that aims to democratize language model technology. While these efforts are highly encouraged, the pace of development of new models can outpace advances in evaluation studies. Unfortunately, informal evaluations often spotlight new models, and such claims must be clarified when comparing different models. We should consider different factors, such as pre-training and instruction data, to arrive at a holistic understanding of LLMs and It-LLMs. While previous work has conducted in-depth studies in some areas, such as datasets [20] and other factors [21], a complete and comparable picture of what determines model performance is still missing.</p>
      </sec>
      <sec id="sec-1b-2">
        <title>3.2. Proposed Methods</title>
        <p>We propose to translate three well-known resources to evaluate the abilities of several Instruction-tuned Large Language Models. To perform well, the adapted models should have inherited world awareness, multi-hop reasoning, and more, just like the original models. These benchmarks are described below, followed by a sketch of the zero-shot querying they rely on.</p>
        <p>Massive Multitask Language Understanding (MMLU) [23] measures world knowledge and problem-solving ability through multiple-choice questions covering 57 subjects across STEM, the humanities, the social sciences, and other areas.</p>
        <p>Discrete Reasoning Over Paragraphs (DROP) [24] measures reading comprehension on mathematics, where the model must perform discrete reasoning over passages extracted from Wikipedia articles.</p>
        <p>BIG-Bench Hard (BBH) [25] is a subset of challenging tasks related to navigation, logical deduction, and fallacy detection.</p>
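        <p>The following is a minimal sketch of the zero-shot, multiple-choice querying these benchmarks rely on. It mirrors the shape of the evaluation rather than the exact prompts of the framework in [5]; the Italian item and the score function are invented for illustration.</p>
        <preformat>
# Sketch of a zero-shot, MMLU-style multiple-choice query. The model scores
# each candidate letter and the argmax is compared with the gold answer.
def format_mmlu_item(question: str, choices: list) -> str:
    letters = "ABCD"
    lines = [question]
    lines += ["{}. {}".format(letters[i], c) for i, c in enumerate(choices)]
    lines.append("Risposta:")  # "Answer:" in the Italian version
    return "\n".join(lines)

prompt = format_mmlu_item(
    "Qual è la capitale d'Italia?",  # illustrative Italian item
    ["Milano", "Roma", "Napoli", "Torino"],
)

# `score(prompt, letter)` stands in for the model's log-likelihood of the
# letter given the prompt; any scoring function can be plugged in, e.g.:
# prediction = max("ABCD", key=lambda letter: score(prompt, letter))
        </preformat>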
        <p>Evaluation. Each benchmark was translated into Italian using the Google Translate API; a minimal sketch of this step is given after this paragraph. Then, zero-shot evaluations were performed on the original versions in English and on our versions in Italian, using the framework proposed in [5] (https://github.com/declare-lab/instruct-eval).</p>
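        <p>The sketch below shows the translation step with the google-cloud-translate v2 client; it assumes API credentials are already configured in the environment and is one way, not necessarily the exact one used, to drive the Google API.</p>
        <preformat>
# Sketch of the benchmark translation step with the Google Cloud Translation
# API (v2 client). Credentials are assumed to be configured via the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
from google.cloud import translate_v2 as translate

client = translate.Client()

def to_italian(text: str) -> str:
    """Translate one benchmark field from English to Italian."""
    result = client.translate(text, source_language="en", target_language="it")
    return result["translatedText"]

# Example: translating one multiple-choice question and its options.
question_it = to_italian("Which planet is known as the Red Planet?")
options_it = [to_italian(o) for o in ["Venus", "Mars", "Jupiter", "Saturn"]]
        </preformat>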
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Results</title>
      <p>Customized Instruction-tuned Large Language Models (It-LLMs) need further refinement. This statement is supported by the results shown in Table 4 for the models fine-tuned and evaluated on the Italian benchmarks. Firstly, the original Alpaca-Lora and Baize models evaluated on the English benchmarks outperformed Camoscio, Stambecco, and Fauno evaluated on the Italian benchmarks.</p>
      <p>Secondly, the differences between Camoscio, Stambecco, and Fauno and the original Alpaca-Lora and Baize evaluated on the Italian benchmarks (Italian data in Table 4) are noteworthy: while the original models stay very close to their English results, the scores of the customized Italian adapters on the Italian-language benchmarks are remarkably lower, even compared with original models with fewer parameters. In conclusion, fine-tuning on a customized resource, in this case customized English-language resources, was insufficient to increase performance. This phenomenon may be due to the quality of the data used for homemade fine-tuning and also suggests that fine-tuning on custom It-LLMs may have introduced a bias. These gaps should be further investigated, and the scientific community should pay more attention to them.</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>Evaluation results. We denote the accuracy across the benchmarks as Acc., while Δ denotes the performance change compared to the original version trained and evaluated on English datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Data</th><th>Model</th><th>Acc.</th><th>Δ</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="6">Original data</td><td>∗Alpaca-Lora 7B</td><td>35.6</td><td>–</td></tr>
            <tr><td>♣Alpaca-Lora 13B</td><td>50.9</td><td>–</td></tr>
            <tr><td>+Alpaca-Lora 30B</td><td>58.4</td><td>–</td></tr>
            <tr><td>◦Baize 7B</td><td>43.5</td><td>–</td></tr>
            <tr><td>⋄Baize 13B</td><td>50.9</td><td>–</td></tr>
            <tr><td>-Baize 30B</td><td>59.8</td><td>–</td></tr>
            <tr><td rowspan="6">Italian data</td><td>∗Alpaca-Lora 7B</td><td>35.1</td><td>-0.5</td></tr>
            <tr><td>♣Alpaca-Lora 13B</td><td>50.6</td><td>-0.3</td></tr>
            <tr><td>+Alpaca-Lora 30B</td><td>57.9</td><td>-0.5</td></tr>
            <tr><td>◦Baize 7B</td><td>44.3</td><td>-0.8</td></tr>
            <tr><td>⋄Baize 13B</td><td>51.2</td><td>-0.3</td></tr>
            <tr><td>-Baize 30B</td><td>59.5</td><td>-0.5</td></tr>
            <tr><td rowspan="2">Italian Adapters</td><td>∗Camoscio 7B</td><td>30.2</td><td>-5.4</td></tr>
            <tr><td>∗Stambecco 7B</td><td>28.2</td><td>-7.4</td></tr>
          </tbody>
        </table>
      </table-wrap>
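      <p>For clarity, the Δ column in Table 4 is the plain difference in accuracy against the corresponding original model evaluated on the English benchmarks, as in the following worked check:</p>
      <preformat>
# How the Δ column of Table 4 is read: change relative to the corresponding
# original model evaluated on the English benchmarks (values from the table).
original_alpaca_7b = 35.6   # Original data, Alpaca-Lora 7B
camoscio_7b = 30.2          # Italian adapter on the same 7B backbone
delta = round(camoscio_7b - original_alpaca_7b, 1)
print(delta)  # -5.4, matching the Δ entry for Camoscio 7B
      </preformat>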
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>In this paper, we have presented a systematic evaluation of four resources for Instruction-tuned Large Language Models (It-LLMs). Our holistic approach analyzed critical performance factors and showed that efforts to customize It-LLMs are not always rewarded with better performance.</p>
      <p>We underline the importance of the open-source community's contribution in proposing new solutions to meet specific needs, and we emphasize the significance of data quality in scaling model performance. Additionally, our translated benchmarks provide valuable insights into the adaptability and effectiveness of It-LLMs for specific language tasks. By addressing key evaluation challenges, our work contributes to the responsible and effective utilization of It-LLMs, fostering further advancements in NLP.</p>
      <p>In future developments, we will investigate lightweight approaches to elicit adapters' multi- and cross-lingual skills, inspired by what has been done in [4, 9].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref24">
        <label>24</label>
        <mixed-citation>D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, in: Proceedings of NAACL-HLT, 2019. doi:10.18653/v1/N19-1246.</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <mixed-citation>M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, J. Wei, Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022. arXiv:2210.09261.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>