<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Teasing LLMs adapted to Italian</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Leonardo Ranaldi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giulia Pucci</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elena Sofia Ruzzetti</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fabio Massimo Zanzotto</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>André Freitas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Manchester</institution>
          ,
          <country country="UK">UK</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Idiap Research Institute</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Università degli studi Roma Tor Vergata</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <abstract>
<p>Instruction-tuned Large Language Models (It-LLMs) are changing NLP thanks to their easy accessibility. These models seem able to grasp language, solve complex tasks, and perform well even with few resources. These abilities and this ease of handling democratize their use, enabling many researchers to produce their own homemade It-LLMs. However, a complete understanding of their potential is still lacking due to the black-box nature of many models and the absence of holistic evaluation studies. We present an evaluation resource for It-LLMs tuned in Italian to address these challenges. Our proposal includes evaluating models on several aspects. We take a holistic approach to analyzing model performance factors, including the pre-training base, instruction-tuning data, and training methods. Our results reveal that data quality is the most crucial factor in scaling model performance. While available open-source models demonstrate impressive ability, they present problems when customized adapters are used. We are encouraged by the rapid development of models by the open-source community. However, we also highlight the need for rigorous evaluation to support such claims.</p>
      </abstract>
      <kwd-group>
<kwd>Instruction-tuned Large Language Models</kwd>
        <kwd>Multilingual LLMs</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>The advent of Instruction-tuned Large Language Models (It-LLMs) marks yet another change in NLP in the last few decades. Indeed, their abilities are evident in numerous applications, from complex problem solving to information retrieval to conversational assistants such as ChatGPT. Examples include GPT-4, which demonstrates abilities in language comprehension and common sense, logical-mathematical problem solving, law, and medicine. However, despite their remarkable competence and adaptability, the full extent of their potential has yet to be fully understood. Indeed, their direction is poorly captured, given many models' simple use, black-box nature, and lack of in-depth and holistic evaluation studies [1, 2, 3].</p>
      <p>To manage these challenges and to understand the abilities of these models more deeply, a series of evaluation benchmarks explicitly designed for the comprehensive evaluation of It-LLMs has been introduced [4, 5, 6, 7, 8, 9].</p>
      <p>However, evaluation resources are only available in English, and it is tricky and misleading to use them to evaluate a model trained on instructions in the Italian language.</p>
      <p>In this paper, we propose evaluation resources for Italian It-LLMs. Furthermore, we tested a set of open-source It-LLMs fine-tuned in the Italian language, demonstrating excellent adaptability but also some gaps in downstream performance. In particular, our methodology, applying a systematic and holistic approach, examines the problem-solving ability, writing ability, and alignment between languages of customized It-LLMs that are fine-tuned in a specific language, i.e., Italian, starting from the work proposed by Chia et al. [5]. Through a rigorous exploration of these factors, we seek to shed light on the vital elements that determine the performance of the models, facilitating an understanding of how these models can best be harnessed to meet our needs. Our contribution is fully available and open-source (https://github.com/LeonardRanaldi/italian-instruct-eval).</p>
    </sec>
    <sec id="sec-1a">
      <title>2. The Open-Source Instructed LLMs</title>
      <p>Large Language Models (LLMs) have caught mainstream attention and have become a comprehensive category of models. LLMs can be understood as pre-trained models that are fine-tuned with general language prompts or with instructions. Therefore, we distinguish between basic and Instructed models, where basic LLMs are pre-trained LLMs that can be fine-tuned on instructions to become Instruction-tuned LLMs (It-LLMs). In particular, in Table 1, we summarize mainly open-source LLMs, given the need for more transparency and reproducibility than closed-source models offer.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Open-source Large Language Models; with ∗ we denote that the data dump is not available.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Backbone</th><th>Size</th><th>Source</th><th>Training</th></tr>
          </thead>
          <tbody>
            <tr><td>Alpaca [11]</td><td>LLaMA</td><td>7-30B</td><td>Alpaca data</td><td>Supervised</td></tr>
            <tr><td>Baize [13]</td><td>LLaMA</td><td>7-30B</td><td>Self-Chat data</td><td>Supervised</td></tr>
            <tr><td>Vicuna [14]</td><td>LLaMA</td><td>7-33B</td><td>ShareGPT data</td><td>Supervised</td></tr>
            <tr><td>Falcon [15]</td><td>Falcon</td><td>7-40B</td><td>RefinedWeb</td><td>Supervised</td></tr>
            <tr><td>ChatGLM∗ [16]</td><td>GLM</td><td>6B</td><td>Unknown</td><td>RLHF</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>The essential part of the Instruction-tuning idea is the data used to train LLMs. Indeed, factors such as quality, quantity, and format can determine the behavior of the instructed model. Table 3 presents several open-source resources. There is a growing tendency to exploit synthetic instruction data obtained from closed-source models [10, 11]. While this practice can allow instructed models to mimic the behavior of closed-source models, it can also lead to problems such as the inheritance of the black-box nature of closed-source models and instability due to noisy synthetic instructions [12].</p>
      <table-wrap id="tab3">
        <label>Table 3</label>
        <caption>
          <p>Open-source Instruction-tuning datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Size</th><th>Domain</th><th>Source</th></tr>
          </thead>
          <tbody>
            <tr><td>Self-Instruct [10]</td><td>52K</td><td>General</td><td>GPT-3</td></tr>
            <tr><td>Alpaca [11]</td><td>52K</td><td>General</td><td>ChatGPT</td></tr>
            <tr><td>Self-Chat [13]</td><td>100K</td><td>Dialogue</td><td>ChatGPT</td></tr>
            <tr><td>ShareGPT [14]</td><td>70K</td><td>Dialogue</td><td>Human annotation</td></tr>
          </tbody>
        </table>
      </table-wrap>
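      <p>To make the data format concrete, the following is a minimal sketch of an Alpaca-style instruction record and the prompt template typically built from it. The field names follow the public Alpaca release, while the Italian strings are purely illustrative.</p>
      <preformat>
# A minimal sketch of an Alpaca-style instruction record and the prompt
# template usually built from it. The Italian example strings are invented
# for illustration; real adapters train on full translated datasets.
example = {
    "instruction": "Riassumi il seguente paragrafo in una frase.",  # task description
    "input": "Le Alpi sono una catena montuosa che attraversa l'Europa.",  # optional context
    "output": "Le Alpi sono una catena montuosa europea.",  # target response
}

PROMPT_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input "
    "that provides further context. Write a response that appropriately "
    "completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n"
)

def build_prompt(record: dict) -> str:
    """Render one training prompt; the model learns to continue it with record['output']."""
    return PROMPT_TEMPLATE.format(**record)

print(build_prompt(example))
      </preformat>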
      <p>A holistic overview of the instructed open-source models adapted to Italian can be found in Table 2, where the base model, its size, the instruction dataset, and the training method of each It-LLM are given. We observe a variety of model sizes and datasets. Therefore, this overview of open-source instructed LLMs provides comprehensive factors for evaluation and analysis.</p>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Details of open-source instructed LLMs.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Model</th><th>Backbone</th><th>Size</th><th>Source</th><th>Training</th></tr>
          </thead>
          <tbody>
            <tr><td>Camoscio [17]</td><td>LLaMA</td><td>7B</td><td>Alpaca data (Italian)</td><td>Supervised</td></tr>
            <tr><td>Stambecco [18]</td><td>LLaMA</td><td>7-13B</td><td>Alpaca data (Italian)</td><td>Supervised</td></tr>
            <tr><td>Fauno [19]</td><td>LLaMA</td><td>7-13B</td><td>Baize data (Italian)</td><td>Supervised</td></tr>
          </tbody>
        </table>
      </table-wrap>
      <p>Recent work shows the elasticity and customization of It-LLMs in many languages. Santilli and Rodolà [17] translated Alpaca [11] into Italian, proposing Camoscio. Later, in Stambecco [18], the author reproduced the same work while modifying some parameters. In [19], the models of the Baize [13] family were adapted to Italian with Fauno. In this new scenario, evaluation has become increasingly important and challenging. Recent evaluation studies produce concrete results such as accuracy and precision [5, 22]. However, these methodologies are generic and not customized for a specific task and language.</p>
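      <p>As an illustration of how such customized adapters are applied in practice, the following is a minimal sketch that attaches a LoRA adapter to a LLaMA backbone with the Hugging Face peft library. Both checkpoint names are placeholders, not the exact repositories of the models above.</p>
      <preformat>
# Sketch: attaching an Italian LoRA adapter to a LLaMA backbone with the
# Hugging Face `peft` library. Both model identifiers below are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "huggyllama/llama-7b"            # assumed base checkpoint
ADAPTER = "your-org/italian-lora-7b"    # hypothetical Italian adapter repo

tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(BASE)

# The adapter stores only the low-rank weight deltas; the backbone is reused.
model = PeftModel.from_pretrained(base_model, ADAPTER)
model.eval()
      </preformat>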
      <p>Finally, Ranaldi et al. [4], generalizing previous work, proposed a cross-lingual approach that elicits It-LLMs with a multilingual Alpaca empowered with translation-following demonstrations.</p>
      <p>In this paper, we propose an Italian evaluation method for Italian fine-tuned It-LLMs. Our method is based on various general skills and usage scenarios applicable to the adapted It-LLMs.</p>
    </sec>
    <sec id="sec-1b">
      <title>3. Challenges &amp; Methods in Evaluating Instruction-tuned LLMs</title>
      <sec id="sec-1b-1">
        <title>3.1. Background and Challenges</title>
        <p>The highest wall in evaluating LLMs is the closed-source concept, where creators often hide model details, instruction datasets, and training methods. Such models thus lead to a knowledge vacuum in the research community, as it is impossible to rigorously analyze the reasons for their performance.</p>
        <p>On the other side of the coin is an ongoing open-source development that aims to democratize language model technology. While these efforts are highly encouraged, the pace of development of new models can outpace advances in evaluation studies. Unfortunately, informal evaluations often spotlight new models, and such claims must be clarified when comparing different models. We should consider different factors, such as pre-training and instruction data, to arrive at a holistic understanding of LLMs and It-LLMs. While previous work has conducted in-depth studies in some areas, such as datasets [20] and other factors [21], a complete and comparable picture of what determines model performance is still missing.</p>
      </sec>
      <sec id="sec-1b-2">
        <title>3.2. Proposed Methods</title>
        <p>We propose to translate three well-known resources to evaluate the abilities of several Instruction-tuned Large Language Models. To perform well, the adapted models should have inherited world awareness, multi-hop reasoning, and more, just like the original models. These benchmarks are described below, followed by a sketch of the zero-shot querying they rely on.</p>
        <p>Massive Multitask Language Understanding (MMLU) [23] measures world knowledge and problem-solving ability through multiple-choice questions covering 57 subjects across STEM, the humanities, the social sciences, and other areas.</p>
        <p>Discrete Reasoning Over Paragraphs (DROP) [24] measures reading comprehension on mathematics, where the model must perform discrete reasoning over passages extracted from Wikipedia articles.</p>
        <p>BIG-Bench Hard (BBH) [25] is a subset of challenging tasks related to navigation, logical deduction, and fallacy detection.</p>
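        <p>The following is a minimal sketch of the zero-shot, multiple-choice querying these benchmarks rely on. It mirrors the shape of the evaluation rather than the exact prompts of the framework in [5]; the Italian item and the score function are invented for illustration.</p>
        <preformat>
# Sketch of a zero-shot, MMLU-style multiple-choice query. The model scores
# each candidate letter and the argmax is compared with the gold answer.
def format_mmlu_item(question: str, choices: list) -> str:
    letters = "ABCD"
    lines = [question]
    lines += ["{}. {}".format(letters[i], c) for i, c in enumerate(choices)]
    lines.append("Risposta:")  # "Answer:" in the Italian version
    return "\n".join(lines)

prompt = format_mmlu_item(
    "Qual è la capitale d'Italia?",  # illustrative Italian item
    ["Milano", "Roma", "Napoli", "Torino"],
)

# `score(prompt, letter)` stands in for the model's log-likelihood of the
# letter given the prompt; any scoring function can be plugged in, e.g.:
# prediction = max("ABCD", key=lambda letter: score(prompt, letter))
        </preformat>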
        <p>Evaluation. Each benchmark was translated into Italian using the Google Translate API; a minimal sketch of this step is given after this paragraph. Then, zero-shot evaluations were performed on the original versions in English and on our versions in Italian, using the framework proposed in [5] (https://github.com/declare-lab/instruct-eval).</p>
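        <p>The sketch below shows the translation step with the google-cloud-translate v2 client; it assumes API credentials are already configured in the environment and is one way, not necessarily the exact one used, to drive the Google API.</p>
        <preformat>
# Sketch of the benchmark translation step with the Google Cloud Translation
# API (v2 client). Credentials are assumed to be configured via the
# GOOGLE_APPLICATION_CREDENTIALS environment variable.
from google.cloud import translate_v2 as translate

client = translate.Client()

def to_italian(text: str) -> str:
    """Translate one benchmark field from English to Italian."""
    result = client.translate(text, source_language="en", target_language="it")
    return result["translatedText"]

# Example: translating one multiple-choice question and its options.
question_it = to_italian("Which planet is known as the Red Planet?")
options_it = [to_italian(o) for o in ["Venus", "Mars", "Jupiter", "Saturn"]]
        </preformat>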
      </sec>
    </sec>
    <sec id="sec-2">
      <title>4. Results</title>
      <p>Customized Instruction-tuned Large Language Models (It-LLMs) need further refinement. This statement is supported by the results shown in Table 4 for the models fine-tuned and evaluated on the Italian benchmarks. Firstly, the original Alpaca-Lora and Baize models evaluated on the English benchmarks outperformed Camoscio, Stambecco, and Fauno evaluated on the Italian benchmarks.</p>
      <p>Secondly, the differences between Camoscio, Stambecco, and Fauno and the original Alpaca-Lora and Baize evaluated on the Italian benchmarks (Italian data in Table 4) are noteworthy: while the original models stay very close to their English results, the scores of the customized Italian adapters on the Italian-language benchmarks are remarkably lower, even compared with original models with fewer parameters. In conclusion, fine-tuning on a customized resource, in this case customized English-language resources, was insufficient to increase performance. This phenomenon may be due to the quality of the data used for homemade fine-tuning and also suggests that fine-tuning on custom It-LLMs may have introduced a bias. These gaps should be further investigated, and the scientific community should pay more attention to them.</p>
      <table-wrap id="tab4">
        <label>Table 4</label>
        <caption>
          <p>Evaluation results. We denote the accuracy across the benchmarks as Acc., while Δ denotes the performance change compared to the original version trained and evaluated on English datasets.</p>
        </caption>
        <table>
          <thead>
            <tr><th>Data</th><th>Model</th><th>Acc.</th><th>Δ</th></tr>
          </thead>
          <tbody>
            <tr><td rowspan="6">Original data</td><td>∗Alpaca-Lora 7B</td><td>35.6</td><td>–</td></tr>
            <tr><td>♣Alpaca-Lora 13B</td><td>50.9</td><td>–</td></tr>
            <tr><td>+Alpaca-Lora 30B</td><td>58.4</td><td>–</td></tr>
            <tr><td>◦Baize 7B</td><td>43.5</td><td>–</td></tr>
            <tr><td>⋄Baize 13B</td><td>50.9</td><td>–</td></tr>
            <tr><td>-Baize 30B</td><td>59.8</td><td>–</td></tr>
            <tr><td rowspan="6">Italian data</td><td>∗Alpaca-Lora 7B</td><td>35.1</td><td>-0.5</td></tr>
            <tr><td>♣Alpaca-Lora 13B</td><td>50.6</td><td>-0.3</td></tr>
            <tr><td>+Alpaca-Lora 30B</td><td>57.9</td><td>-0.5</td></tr>
            <tr><td>◦Baize 7B</td><td>44.3</td><td>-0.8</td></tr>
            <tr><td>⋄Baize 13B</td><td>51.2</td><td>-0.3</td></tr>
            <tr><td>-Baize 30B</td><td>59.5</td><td>-0.5</td></tr>
            <tr><td rowspan="2">Italian Adapters</td><td>∗Camoscio 7B</td><td>30.2</td><td>-5.4</td></tr>
            <tr><td>∗Stambecco 7B</td><td>28.2</td><td>-7.4</td></tr>
          </tbody>
        </table>
      </table-wrap>
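      <p>For clarity, the Δ column in Table 4 is the plain difference in accuracy against the corresponding original model evaluated on the English benchmarks, as in the following worked check:</p>
      <preformat>
# How the Δ column of Table 4 is read: change relative to the corresponding
# original model evaluated on the English benchmarks (values from the table).
original_alpaca_7b = 35.6   # Original data, Alpaca-Lora 7B
camoscio_7b = 30.2          # Italian adapter on the same 7B backbone
delta = round(camoscio_7b - original_alpaca_7b, 1)
print(delta)  # -5.4, matching the Δ entry for Camoscio 7B
      </preformat>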
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>In this paper, we have presented a systematic evaluation of four resources for Instruction-tuned Large Language Models (It-LLMs). Our holistic approach analyzed critical performance factors and showed that efforts to customize It-LLMs are not always rewarded with better performance.</p>
      <p>We underline the importance of the open-source community's contribution in proposing new solutions to meet specific needs, and we emphasize the significance of data quality in scaling model performance. Additionally, our translated benchmarks provide valuable insights into the adaptability and effectiveness of It-LLMs for specific language tasks. By addressing key evaluation challenges, our work contributes to the responsible and effective utilization of It-LLMs, fostering further advancements in NLP.</p>
      <p>In future developments, we will investigate lightweight approaches to elicit adapters' multi- and cross-lingual skills, inspired by what has been done in [4, 9].</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref24">
        <label>24</label>
        <mixed-citation>D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, M. Gardner, DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs, in: Proceedings of NAACL-HLT, 2019. doi:10.18653/v1/N19-1246.</mixed-citation>
      </ref>
      <ref id="ref25">
        <label>25</label>
        <mixed-citation>M. Suzgun, N. Scales, N. Schärli, S. Gehrmann, Y. Tay, H. W. Chung, A. Chowdhery, Q. V. Le, E. H. Chi, D. Zhou, J. Wei, Challenging BIG-Bench tasks and whether chain-of-thought can solve them, 2022. arXiv:2210.09261.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>