<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Italian and English Small Language Models for Domain-based QA in Low-Resource Scenario</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Siragusa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pirrone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering, University of Palermo</institution>
          ,
          <addr-line>Palermo, 90128, Sicily</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Usage of open-source Large Language Models, which can be run locally, modified, fine-tuned, and queried without APIs that require data sharing, is required when dealing with sensitive or confidential information. In addition, suitable computational resources are needed to infer and fine-tune such models. The objective of this work is to assess the potentialities of Small Language Models in low-resource scenarios in which quantization may be required. In particular, the focus is on the usage of these models for the Italian and English languages, from both a purely quantitative and a resource-oriented evaluation, across two Question Answering data sets: a generic closed-answer one and a domain-based one with open answers.</p>
      </abstract>
      <kwd-group>
<kwd>LLM</kwd>
        <kwd>QA</kwd>
        <kwd>Quantization</kwd>
        <kwd>Fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Generative Large Language Models (LLMs) are mainly oriented towards the paradigm "the bigger the better", involving both closed-source models such as GPT [<xref ref-type="bibr" rid="ref1">1</xref>], Claude [<xref ref-type="bibr" rid="ref2">2</xref>] and Gemini [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>], but also Llama (Llama 3.1 405B [<xref ref-type="bibr" rid="ref5">5</xref>] or Llama 4 Maverick 400B [<xref ref-type="bibr" rid="ref6">6</xref>]) and DeepSeek (DeepSeek R1 671B [<xref ref-type="bibr" rid="ref7">7</xref>]) models. Despite the impressive capabilities of such models, in both textual and multimodal setups, significant issues arise when dealing with their size. In particular, high computational resources are needed during the training phase, which is performed only once and asynchronously. The inference phase, on the other hand, despite requiring fewer computational resources, may become a bottleneck of the final distributed application, for which concurrency and the related GPU resources are needed. Pay-per-use APIs resolve the computational aspects but lead to privacy-related issues. Applications that involve the use of Artificial Intelligence (AI) models as support systems in private companies or hospitals, where data is confidential or sensitive and any breach must be avoided, should be compliant with those restrictive requirements and not allow sharing data with third parties.</p>
      <p>To address these privacy-related issues, the focus of this work is on open-source models which can be trained locally and inferred in a low-resource scenario, both with full precision and in a quantization setup [<xref ref-type="bibr" rid="ref8">8</xref>]. In doing this, the most recent Small Language Models (SLMs), released from late 2024 until April 2025, are considered, for which an instruction tuning phase was performed and which support both English and Italian. This research led to the selection of models belonging to the following families, namely Qwen 3 [<xref ref-type="bibr" rid="ref9">9</xref>], Gemma 3 [<xref ref-type="bibr" rid="ref10">10</xref>], Phi 4 [<xref ref-type="bibr" rid="ref11">11, 12</xref>] and Ministral [<xref ref-type="bibr" rid="ref13">13</xref>], for which only the freely available models below 20B parameters are considered. Performances of these models were evaluated in both the full-precision and the 8/4-bit quantization scenarios. The evaluation was carried out with the generic benchmark MMLU [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>] and with UniQA [<xref ref-type="bibr" rid="ref16">16</xref>], a domain-specific Question Answering (QA) data set in the university domain. Both data sets cover English and Italian, and the relative evaluations were performed in both languages. Statistics for the evaluation time and the GPU memory used are also calculated. To further stress the potentialities of these models, the smallest ones were fine-tuned with two diverse strategies over the UniQA data set, and the relative performances on both selected benchmarks have been analyzed. Thus, the main contributions of this work can be summarized as follows.</p>
      <list list-type="order">
        <list-item><p>Evaluation of open-source SLMs with the MMLU benchmark and the UniQA data set in different quantization scenarios, from both a quantitative and a computational perspective;</p></list-item>
        <list-item><p>Fine-tuning with two proposed strategies over the UniQA data set;</p></list-item>
        <list-item><p>Comprehensive evaluation of the fine-tuned models over both MMLU and UniQA.</p></list-item>
      </list>
      <p>The remainder of the paper is organized as follows: the relevant background is reported in Section 2, the selected models in Section 3, the data sets in Section 4, and the experimental setup in Section 5. The results are discussed in Section 6, while the concluding remarks are drawn in Section 7.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background</title>
      <p>Capabilities of closed-source and huge LLMs around different tasks are well known, but in the context of real applications the usage of such models is impracticable. This can be mainly ascribed to costly pay-per-use APIs and to the sharing of private data with third parties, which may lead to data breaches. The Natural Language Processing (NLP) community is exploring not only the capabilities of larger [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>] and expert-based LLMs [<xref ref-type="bibr" rid="ref4 ref6 ref7">6, 7, 4, 30</xref>], but also smaller models obtained through a distillation procedure from larger models [<xref ref-type="bibr" rid="ref10">10</xref>], thus providing the general public with a valuable alternative.</p>
      <p>Small Language Models are the focus of this research, which is limited to multilingual generative models that explicitly support the Italian language in addition to English. In particular, only models based on a transformer decoder-only architecture [31] and on instruction fine-tuning are considered. Instruct models are capable of generating text given an instruction, thus making them suitable for the proposed evaluation scenario, which includes closed and open QA tasks.</p>
      <p>In addition, given the increasingly fast development of newer models, only models released from the last months of 2024 to April 2025 are examined. More in detail, we considered only models with less than 20B parameters, which have been sub-grouped into 4B, 8B, and 12B-14B models to better evaluate their performance. The selected models are described in Section 3 along with their principal characteristics.</p>
      <p>Fine-tuning a pre-trained LLM in the context of domain and task adaptation involves strategies both for the fine-tuning proper and for reducing the overall fine-tuning computational cost while keeping its effectiveness. Supervised Fine-Tuning (SFT) strategies are used for instruction tuning and for domain, language, or task adaptation [<xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>]. As a supervised method, both the input and the desired output are provided to the model, and, following a teacher forcing methodology, the model is forced to use the expected golden target token, even if the wrong one has been previously generated [<xref ref-type="bibr" rid="ref20">20</xref>]. In the case of QA tasks, a training sample consists of a question, the associated answer, and an optional context from which the answer should be derived.</p>
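      <p>As a minimal sketch of how such a supervised QA sample can be arranged for a causal language model, assuming a Hugging Face tokenizer, the prompt tokens can be masked so that the teacher-forced loss is computed only on the gold answer; the model identifier and field layout are illustrative.</p>
      <preformat># Minimal sketch: building one supervised QA training sample for a causal LM.
# Teacher forcing is obtained by letting the model predict the gold answer tokens;
# prompt tokens are masked with -100 so the loss is computed on the answer only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # illustrative instruct SLM

def build_sample(question, answer, context=None, max_len=2048):
    prompt = f"QUESTION: {question}\n"
    if context is not None:                      # optional supporting documents
        prompt += f"DOCUMENTS: {context}\n"
    prompt += "ANSWER: "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}</preformat>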
      <p>Parameter-Efficient Fine-Tuning (PEFT) techniques are adopted in conjunction with SFT to speed up the fine-tuning phase and reduce the computational resources required. In particular, they involve freezing, quantization, and Low-Rank Adaptation (LoRA) [<xref ref-type="bibr" rid="ref21">21</xref>]. In freezing, only the weights of selected layers are actually trained, while the rest are kept frozen. In the quantization strategy [<xref ref-type="bibr" rid="ref8">8</xref>] the precision of the weight representation in the model is reduced from 32-bit to a 16-, 8-, or 4-bit representation. This technique can be used at both training and inference time, thus decreasing the computational resources needed in terms of GPU memory. Lastly, LoRA [<xref ref-type="bibr" rid="ref22">22</xref>] is one of the most used PEFT techniques, in which low-weight adapters associated to selected layers are trained instead of the original weights. In doing this, the number of trainable parameters is greatly decreased, and the computational resources needed for the fine-tuning process are reduced accordingly. In addition, these techniques can be combined to better fit computational constraints, as in Quantized Low-Rank Adaptation (QLoRA) [<xref ref-type="bibr" rid="ref23">23</xref>], in which quantization is applied along with LoRA during training.</p>
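      <p>A minimal sketch of how these techniques can be combined, assuming the transformers, bitsandbytes and peft packages: the model is loaded in 4-bit and LoRA adapters are attached, so that only the low-rank matrices are trained (QLoRA). The model identifier and hyper-parameters are illustrative.</p>
      <preformat># Minimal QLoRA sketch: load a causal LM in 4-bit and attach LoRA adapters,
# so that only the low-rank matrices require gradients.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B",
                                             quantization_config=bnb_config,
                                             device_map="auto")

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable</preformat>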
      <p>For effective fine-tuning, models are trained on average for a few training epochs, mainly ranging from 3 to 15 [<xref ref-type="bibr" rid="ref24 ref25">24, 25, 26, 27</xref>], usually combining different PEFT strategies [28, 29].</p>
    </sec>
    <sec id="sec-2">
      <title>3. Models</title>
      <p>Phi 4 [<xref ref-type="bibr" rid="ref11">11, 12</xref>] is a family of Microsoft models that showed impressive capabilities despite the reduced number of parameters compared to other models. The higher performance of these models can be attributed to the three-stage training procedure and to the data curation process, which involves a data decontamination step with respect to the most used benchmarks. In addition, a greater variety in the data and the attention devoted to synthetic data for Chain of Thought (CoT) and reasoning capabilities contributed to enhancing the overall behavior of these models. Phi 4 was released in its full version, which consists of 14B parameters, and in its mini version with 3.8B parameters, which will be considered as a 4B model in the subsequent analysis.</p>
      <p>Qwen 3 [<xref ref-type="bibr" rid="ref9">9</xref>] is a family of multilingual models released by the Chinese company Alibaba Cloud. Along with the large Mixture-of-Experts models of 30B and 235B parameters, smaller models have been released, ranging from 1B to 32B parameters. Only the models with 4B, 8B and 14B parameters are considered in this analysis. Great attention in Qwen 3 models was devoted to reasoning and CoT, both in the training data selection and at inference time, where the explicit thinking mode can be enabled or disabled.</p>
      <p>Gemma 3 [<xref ref-type="bibr" rid="ref10">10</xref>] is a family of multimodal and multilingual models developed by Google DeepMind, co-designed with the Gemini models [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>], with which they share the same tokenizer. A Grouped-Query Attention (GQA) mechanism [32] was used, with post-norm and pre-norm RMSNorm [33] and support for longer contexts. Gemma 3 models range from 1B to 27B parameters and were trained with a knowledge distillation strategy. In the context of this research, only the 4B and 12B versions are considered.</p>
      <p>Ministral [<xref ref-type="bibr" rid="ref13">13</xref>] is a model from the French company Mistral AI, released in 3B and 8B parameter versions. Ministral models are the newer version of Mistral 7B [34], which uses an interleaved sliding-window attention pattern to provide a faster, more computationally efficient, and lower-latency solution at inference time. As the 3B version is not open-source, only the 8B version was considered in this work.</p>
    </sec>
    <sec id="sec-2b">
      <title>4. Data sets</title>
      <p>Three English and Italian data sets have been considered for evaluation purposes: two are closed QA data sets, and the other is an open QA data set. In the first case, the model is asked to answer with one of the provided answers, while in the second case a free-text answer is expected. As closed QA, the generic Massive Multitask Language Understanding (MMLU) task was selected in its English and Italian versions. From here on, the English version of MMLU will be referred to as MMLU-EN and the Italian version as MMLU-IT, while MMLU will be used to refer to both splits. On the other hand, UniQA was selected as a domain-specific open-answer QA data set in the university domain, available both in English and Italian.</p>
      <p>
        MMLU-EN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a generic benchmark task to
evaluate the capabilities of LLMs after their training
phase. It is a closed QA task involving 57 different
subjects in STEM, humanities, and social science with
diverse complexity ranging from elementary level to
advanced and professional level. It consists of 14079
questions, and the models are queried with a 5-shot
strategy in which 5 sample questions are provided
for each subject. Accuracy is the proposed metric for
performance evaluation.
      </p>
      <p>
        MMLU-IT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is the translated version of the MMLU
data set, which is also referenced in the Language
Model Evaluation Harness framework [35]. Translation
was obtained automatically using an ad hoc developed
prompt for ChatGPT. No further checks have been
conducted on the data set to evaluate its correctness in
terms of translation.
      </p>
      <p>UniQA [<xref ref-type="bibr" rid="ref16">16</xref>] is a QA data set for the university domain that comprises nearly 14k QA pairs and more than 1k documents, which serve as context for the questions. The data set has been generated in a semi-automated manner using the data retrieved from the website of the University of Palermo, covering information about the bachelor and master degree courses for the academic year 2024/2025. Data are natively both in Italian and English, i.e. no translation procedure was involved in its development. From here on, UniQA-EN will be used for the English split, UniQA-IT for the Italian one, and the general form UniQA will be used for both splits.</p>
    </sec>
    <sec id="sec-2c">
      <title>5. Experimental setup</title>
      <p>Models in an out-of-the-box setup were tested with the MMLU and UniQA data sets at different levels of quantization, namely in their base, 8-bit (Q8) and 4-bit (Q4) quantized versions [<xref ref-type="bibr" rid="ref8">8</xref>]. These evaluations were performed to assess the performance of quantized models versus their base version, along with the effective computational resources involved, such as GPU memory and inference time. Quantization was performed with the bitsandbytes library (https://github.com/bitsandbytes-foundation/bitsandbytes) in combination with the transformers library [36], in both the 8-bit and 4-bit quantization settings [<xref ref-type="bibr" rid="ref26">37</xref>].</p>
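      <p>A minimal sketch of this quantized loading, assuming the transformers and bitsandbytes packages; the model identifier is illustrative, and the same call with load_in_4bit is used for the Q4 runs.</p>
      <preformat># Minimal sketch: load a model in 8-bit (or 4-bit) with bitsandbytes through transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-4B"                              # illustrative model id
quant_config = BitsAndBytesConfig(load_in_8bit=True)    # or load_in_4bit=True for Q4

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,   # omit this argument for full-precision inference
    device_map="auto",
)</preformat>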
      <sec id="sec-2-1">
        <title>You are Unipa-GPT, the chatbot and vir</title>
        <p>tual assistant of the University of Palermo.</p>
      </sec>
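      <p>A possible invocation of the harness through its Python API is sketched below; the model identifier is illustrative, and the task name for the Italian translation depends on the installed harness version.</p>
      <preformat># Minimal sketch of a 5-shot MMLU run via the Language Model Evaluation Harness API.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-4B,load_in_8bit=True",
    tasks=["mmlu"],        # MMLU-EN; the Italian split uses the corresponding translated task
    num_fewshot=5,
)
print(results["results"])  # per-task accuracy</preformat>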
      <sec id="sec-2-2">
        <title>Provide an answer to the provided QUESTION concerning the University of Palermo, relying on the given DOCUMENTS</title>
      </sec>
      <sec id="sec-2-3">
        <title>If the question is in English, answer in English.</title>
      </sec>
      <sec id="sec-2-4">
        <title>If the question is in Italian, answer in Italian.</title>
      </sec>
      <sec id="sec-2-5">
        <title>QUESTION:</title>
        <p>question</p>
      </sec>
      <sec id="sec-2-6">
        <title>DOCUMENTS:</title>
        <p>documents</p>
      <p>For UniQA, we used the default generation configuration suggested by the developers of the selected models. In particular, the thinking mode was disabled for Qwen 3, while the sampling strategy in the generation phase was disabled for the Gemma 3 models. Whenever a model was not able to generate an answer, the default empty answer was considered as the generated one.</p>
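      <p>A minimal sketch of a single UniQA inference call under these settings, assuming an already loaded model and tokenizer and the question and documents strings; enable_thinking is the Qwen 3 chat-template switch, while do_sample=False corresponds to disabling sampling as done for the Gemma 3 models.</p>
      <preformat># Minimal sketch of one UniQA query with the generation settings described above.
prompt = (
    "You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo.\n"
    "Provide an answer to the provided QUESTION concerning the University of Palermo, "
    "relying on the given DOCUMENTS\n"
    "If the question is in English, answer in English.\n"
    "If the question is in Italian, answer in Italian.\n"
    f"QUESTION:\n{question}\nDOCUMENTS:\n{documents}"
)
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,      # Qwen 3 only: disable the explicit thinking mode
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=False)
answer = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)</preformat>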
      <p>As evaluation metrics, BLEU [<xref ref-type="bibr" rid="ref28">39</xref>], ROUGE [<xref ref-type="bibr" rid="ref29">40</xref>], METEOR [<xref ref-type="bibr" rid="ref30">41</xref>] and BERTScore [<xref ref-type="bibr" rid="ref31">42</xref>], computed with the multilingual model XLM-RoBERTa Large [<xref ref-type="bibr" rid="ref32">43</xref>], were calculated. Since the F1 BERTScore provides a more comprehensive evaluation of the meaning and significance of the generated answer, it was the only metric considered for evaluation purposes in the context of this work. In the Appendix, all the calculated metrics for the UniQA data set are reported for each inference configuration tested (Table 6).</p>
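      <p>A minimal sketch of the metric computation on the generated answers, assuming the evaluate and bert-score packages; the prediction and reference lists are placeholders.</p>
      <preformat># Minimal sketch: BLEU, ROUGE, METEOR and BERTScore (XLM-RoBERTa Large) on generated answers.
import evaluate
from bert_score import score as bert_score

predictions = ["..."]   # placeholder: model answers
references  = ["..."]   # placeholder: gold UniQA answers

bleu   = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge  = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
precision, recall, f1 = bert_score(predictions, references,
                                   model_type="xlm-roberta-large")
print(bleu["bleu"], rouge["rougeL"], meteor["meteor"], f1.mean().item())</preformat>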
      <sec id="sec-2-7">
        <title>5.1. Fine-tuning strategies</title>
        <p>Only the smallest models, namely Gemma 3 4B, Phi 4 mini, and Qwen 3 4B, have been fine-tuned over the English and Italian training splits of the UniQA data set. In particular, two different fine-tuning strategies have been proposed and used in this phase, namely w/ docs (with documents) and w/o docs (without documents). They differ in the arrangement of the training samples and in the associated instruction prompt, as reported in Table 1.</p>
        <p>Following the approaches described in Section 2, our choice was to perform a full fine-tuning limited to selected layers. A unique strategy was designed that is suitable for heterogeneous models with a different number of decoder layers. We fully fine-tuned only the last 25% of the decoder layers and the classification head, while freezing the remaining layers. The proposed strategy resulted in a valuable trade-off between PEFT techniques and full fine-tuning. In addition, this strategy meets the proposed research question of analyzing the impact of quantization at the inference phase and not during training. The models have been trained for five epochs: a larger number of training epochs does not lead to significant improvements given the considered training data. A validation set was split off from the training set with a 90:10 ratio, and it was used as a criterion to select the best model according to the validation loss.</p>
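        <p>A minimal sketch of this partial fine-tuning setup, assuming the common Hugging Face module layout (model.model.layers and model.lm_head, which may differ across architectures); the model identifier is illustrative.</p>
        <preformat># Minimal sketch of the adopted partial fine-tuning: freeze everything, then
# re-enable gradients only for the last 25% of decoder layers and the LM head.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")

for param in model.parameters():
    param.requires_grad = False

layers = model.model.layers                 # list of decoder blocks
n_trainable = max(1, len(layers) // 4)      # last 25% of the decoder layers
for layer in layers[-n_trainable:]:
    for param in layer.parameters():
        param.requires_grad = True

for param in model.lm_head.parameters():    # classification head
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")</preformat>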
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Instruction prompts designed for fine-tuning w/ and w/o documents.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Strategy</th><th>Prompt text</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>w/ docs</td>
                <td>You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide an answer to the provided QUESTION concerning the University of Palermo, relying on the given DOCUMENTS. If the question is in English, answer in English. If the question is in Italian, answer in Italian. QUESTION: &lt;QUESTION&gt; DOCUMENTS: &lt;DOCUMENTS&gt;</td>
              </tr>
              <tr>
                <td>w/o docs</td>
                <td>You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide an answer to the provided QUESTION concerning the University of Palermo. If the question is in English, answer in English. If the question is in Italian, answer in Italian. QUESTION: &lt;QUESTION&gt;</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In the w/ docs strategy, annotated documents were fed as input in the training sample, thus allowing the model to read the documents and forcing it to extract and re-paraphrase the desired snippet of the document containing the answer. On the other hand, in the w/o docs strategy, no additional context was provided in the prompt, allowing the model to learn the QA pairs directly and, at inference time, to integrate the knowledge provided by the documents in a context-learning set-up [<xref ref-type="bibr" rid="ref27">38</xref>].</p>
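        <p>A minimal sketch of how a training sample can be arranged under the two strategies, following the prompts of Table 1; the helper name and the example values are illustrative.</p>
        <preformat># Minimal sketch: arranging one UniQA training sample for the w/ docs and w/o docs strategies.
SYSTEM = ("You are Unipa-GPT, the chatbot and virtual assistant "
          "of the University of Palermo.")

def build_prompt(question, documents=None):
    if documents is not None:  # w/ docs strategy
        return (f"{SYSTEM}\nProvide an answer to the provided QUESTION concerning "
                f"the University of Palermo, relying on the given DOCUMENTS\n"
                f"If the question is in English, answer in English.\n"
                f"If the question is in Italian, answer in Italian.\n"
                f"QUESTION:\n{question}\nDOCUMENTS:\n{documents}")
    # w/o docs strategy: no supporting documents in the training sample
    return (f"{SYSTEM}\nProvide an answer to the provided QUESTION concerning "
            f"the University of Palermo.\n"
            f"If the question is in English, answer in English.\n"
            f"If the question is in Italian, answer in Italian.\n"
            f"QUESTION:\n{question}")

# Illustrative example: the gold answer is used as the supervised target.
sample = {"prompt": build_prompt("Quali corsi di laurea magistrale offre l'Ateneo?"),
          "answer": "..."}</preformat>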
        <p>Inferences were run on a local machine with a single 48 GB NVIDIA RTX 6000 Ada Generation GPU (machine 1) and on a cluster node with one NVIDIA A100 64 GB GPU from the Leonardo supercomputer (https://leonardo-supercomputer.cineca.eu/it/home-it/) via an ISCRA-C application (machine 2), while fine-tuning was executed on machine 2. On the same machines, the occupied GPU memory and the inference time were monitored to simulate and provide an estimation of the computational resources required in the low-resource scenario.</p>
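        <p>A minimal sketch of the per-query monitoring, assuming a single CUDA device and an already loaded model and tokenizer; torch.cuda.max_memory_allocated reports the peak GPU memory since the last reset.</p>
        <preformat># Minimal sketch: measuring inference time and peak GPU memory for one query.
import time
import torch

def timed_generate(model, tokenizer, prompt, **gen_kwargs):
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, **gen_kwargs)
    elapsed = time.perf_counter() - start                     # seconds per query
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3     # peak GPU memory in GB
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer, elapsed, peak_gb</preformat>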
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Results</title>
      <p>
        In Table 2 a comprehensive evaluation of the selected
models is reported. Evaluations also include
performances over bigger models such as Mistral Small [
        <xref ref-type="bibr" rid="ref33">44</xref>
        ],
Llama 4 Scout Instruct [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Claude 3.5 Sonnet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
GPT 4o Mini [
        <xref ref-type="bibr" rid="ref27">38</xref>
        ]. These models have been considered
since their performance for both tasks was available from
public leaderboards [
        <xref ref-type="bibr" rid="ref34 ref35 ref36">45, 46, 47</xref>
        ]. In this phase, no spot
checks or roundtrip translations have been conducted to
further investigate errors in the automatically translated
MMLU-IT split, to assess whether some inference errors
derive from actual model limitations or from translation
artifacts.
      </p>
      <p>The overall best results are achieved by Claude 3.5 Sonnet, followed by GPT 4o mini. Nevertheless, Phi 4 is a valuable alternative, since it reaches performance comparable to Mistral Small and is only 0.2 points below GPT 4o mini on MMLU-EN. On MMLU-IT, scores tend to be lower compared to the English split, and again Phi achieves the best results among the evaluated models. With reference to the smaller models, Qwen 3 in its 4B and 8B versions outperforms the other models on the MMLU tasks, while showing a significantly higher average inference time compared with its competitors. Performance generally decreases in quantized models. The decrease is significant in the case of Q4, while the average inference time per question decreases for Q8 and tends to increase for Q4, especially for Qwen.</p>
      <sec id="sec-3-1">
        <title>2https://leonardo-supercomputer.cineca.eu/it/home-it/</title>
      <p>Generally speaking, the quantization procedure at inference time can increase the answer time due to the additional computation required for quantization [<xref ref-type="bibr" rid="ref26">37</xref>]. This behavior is highly emphasized in the UniQA evaluation, where the input provided to the models can be significantly longer than the MMLU samples. The results for UniQA are reported in Table 3, together with the average GPU memory occupied for each inference. To better compare the obtained results, the standard deviation of the average inference time and GPU usage is also reported. Note that the average inference time is reported in seconds, while the GPU usage is in GB; the associated standard deviations follow the same scales and, in the GPU case, mostly result in 0.0, since the corresponding variation is lower than 0.1 GB.</p>
      <p>In full-precision inference, the best results are achieved by the Gemma 3 models, which reach a BERT-F1 score of 0.88 on average in both the 4B and 12B versions, also surpassing the larger 14B models. In this context, the performance of Gemma 3 4B is much more interesting from a computational perspective, since it is 64% smaller than the 12B version while reaching comparable performance. The smallest model in this set-up is Phi 4 mini, which occupies less than 16 GB and reaches the smallest inference time, which is desirable in the context of real-time applications. Regarding quantized inference, GPU memory values decrease by 70% and 80% for Q8 and Q4, respectively, compared to models inferred with full precision. In terms of inference time, significant increases are found for Q8, while a reduction is found for Q4, which is mainly related to the quantization strategy adopted by bitsandbytes [<xref ref-type="bibr" rid="ref26">37</xref>]. Overall, for both performance and computational resource usage, Phi and Ministral are the models that benefit the most from quantization, keeping comparable performance over the selected benchmarks despite a slight decrease. The worst performances are obtained by the Gemma models, which deeply suffer from the quantization procedure, leading to empty output strings (Q4) or meaningless output in undesired languages (Q8).</p>
      <p>In contrast with the MMLU case, in which a slight discrepancy can be found between the English and the Italian split, in the UniQA case all performances are at the same level and, in some cases, lean slightly towards the Italian split. This behavior can be explained through an analysis of the data set, in which the presence of the context can guide the model more effectively in generating the desired answer, and through language-related characteristics and understanding.</p>
        <p>In Tables 4 and 5 the results over both benchmarks are
reported using the two proposed fine-tuning strategies,
with and without documents.</p>
      <p>No improvements are found after the fine-tuning phase with the two proposed strategies in terms of performance on the MMLU benchmarks. Models fine-tuned with the w/ docs strategy tend to better maintain the performance obtained by the base models. These results show that fine-tuning on a specific task did not lead to a degradation in performance on a generic benchmark and that the generalization capability of the considered LLMs is maintained. This is mainly due to the light fine-tuning strategy adopted, which does not cause the model to overfit.</p>
      <p>Regarding UniQA performance, both strategies proved successful, since the overall BERT F1 score increased. More specifically, better results are obtained with the w/o docs strategy, both in terms of evaluation metrics and of average inference time, which is reduced. Improvements are found in both the base and the quantized inferences. As in the inference without fine-tuning, the average time for quantized models deeply penalizes Gemma 3 4B, while Qwen 3 4B trained with the w/ docs strategy results in the overall best model on both the MMLU and UniQA benchmarks. Qwen 3 4B, in fact, better maintains the same level of performance across the different quantization levels. In addition, the w/o docs fine-tuning strategy was crucial to improve the capabilities of Phi 4 mini, in particular in the base and Q8 quantized inference. A general speed-up is found in fine-tuned models on the UniQA benchmark, while no improvements are found for Gemma 3 4B in the Q4 setup, where performance remains low.</p>
      <p>The results obtained show that recent progress in developing multilingual LLMs provides the opportunity to use a valuable out-of-the-box model, also for domain-specific tasks, with appropriate prompt engineering. In addition, the two proposed fine-tuning strategies, coupled with an overall light training phase in terms of number of epochs, trainable layers and, consequently, resources needed, prove crucial to improve the capabilities of the SLMs under consideration, as for Phi 4 mini and Qwen 3 4B. These models, trained in a target domain for a desired QA task of interest, were able to outperform models three times larger in size while requiring on-budget resources. In general, both models should be considered as a valuable alternative to develop a custom LLM in a low-resource scenario. Phi tends to perform better after a w/o docs fine-tuning in terms of BERT score. On the other hand, Qwen presents strong performance in the traditional metrics such as the BLEU, ROUGE, and METEOR scores (Table 6), with both fine-tuning strategies and different quantization levels. Depending on the actual computational resources available, Phi is preferred, since it is smaller than Qwen. Although the metrics are really close to each other, the w/o docs training strategy is the best and the fastest one in the training phase.</p>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions</title>
      <p>In this work, we evaluated recent open-source instruction-tuned multilingual Small Language Models belonging to different families, with a focus on their performance upon base inference and after Q8 and Q4 quantization. In particular, both closed- and open-answer QA tasks were analyzed in Italian and English. Performance was evaluated from a quantitative perspective with the general MMLU benchmark and with UniQA, a QA data set based on a specific domain, for which relevant documents are associated to each question.</p>
      <p>The results show that, among the largest models under evaluation, Phi 4 14B almost reached Claude 3.5 Sonnet and GPT 4o Mini on the MMLU benchmark, while Gemma 3 12B obtained interesting performance when inferred with full precision on UniQA. Among the smaller models, Qwen 3 4B and Phi 4 mini were the most promising ones: both models scale better in terms of performance after Q8 and Q4 quantization on the selected benchmarks.</p>
      <p>In addition, two fine-tuning strategies were proposed for the last 25% of the layers and the classification head of the smaller models, using the training split of the UniQA data set. The results proved that Qwen 3 4B benefits the most from the training when evaluated over UniQA, while maintaining good general performance on the MMLU task. Such considerations, together with its flexibility towards quantization and its smaller inference time, make Qwen 3 4B a valuable model to implement custom LLM-based applications in a low-resource scenario after a suitable fine-tuning phase.</p>
      <p>More tests are needed to evaluate the performance
of the investigated models from a qualitative
perspective. More in detail, additional tests will be conducted
to simulate a real-case scenario, involving both human
evaluation of the quality of the provided answers and
truly open-ended QA in the domain of interest.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We thank Giampiero Barbaro, who contributed to developing the training strategy in the early stages of this work, as part of his master thesis. This work is supported by the CUP project J73C24000070007, "CAESAR" (Cognitive evolution in AI: Explainable and Self-Aware Robots through multimodal data processing). The works presented were partially developed on the Leonardo supercomputer, within the Italian Super Computing Resource Allocation class C project IscrC_DOCVLM2 (HP10C97VNN).</p>
    </sec>
    <sec id="sec-genai">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Writefull for grammar and spelling checks. After using these tools, the authors reviewed and edited the content as needed and assume full responsibility for the content of the publication.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Evaluation metrics</title>
      <p>In Table 6, the full set of calculated metrics over the UniQA data set is reported for the different quantization and fine-tuning strategies.</p>
      <p>Table 6: Overview of the calculated metrics on the UniQA-EN and UniQA-IT splits. BERT-prec and BERT-rec stand for the BERT precision and recall scores, respectively, while FT and QTN refer to the fine-tuning and quantization strategy adopted. Average execution time is reported in seconds and GPU memory usage in GB. Bold values are the highest ones for each block, while starred ones are the overall best.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI, GPT-4o
          <source>System Card, arXiv preprint arXiv:2410.21276</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <source>The Claude 3 Model Family: Opus</source>
          , Sonnet, Haiku,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>GeminiTeam</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Anil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Millican</surname>
          </string-name>
          , et al.,
          <source>Gemini: A Family of Highly Capable Multimodal Models</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Google</surname>
            <given-names>DeepMind</given-names>
          </string-name>
          ,
          <source>Gemini</source>
          <volume>2</volume>
          .
          <article-title>5: Our most intelligent AI model</article-title>
          ,
          <year>2025</year>
          . blog. google/technology/google-deepmind/
          <article-title>gemini-model-thinking-updates-march-</article-title>
          <year>2025</year>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>LlamaTeam</surname>
          </string-name>
          ,
          <source>The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>[6] LlamaTeam, The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation</article-title>
          ,
          <year>2025</year>
          . https://ai.meta.com/blog/ llama-4
          <string-name>
            <surname>-</surname>
          </string-name>
          multimodal-intelligence/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <article-title>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kligys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <article-title>Quantization and training of neural networks for eficient integer-arithmetic-only inference</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] QwenTeam,
          <source>Qwen3 Technical Report, arXiv preprint arXiv:2505.09388</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>GemmaTeam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kamath</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ferret</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Vieillard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Merhej</surname>
          </string-name>
          , et al.,
          <source>Gemma 3 Technical Report, arXiv preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          et al.,
          <source>Phi-4 Technical Report, arXiv preprint arXiv:2412.08905</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A.</given-names> <surname>Abouelenin</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Ashfaq</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Atkinson</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Awadalla</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Bach</surname></string-name>
          , et al.,
          <article-title>Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</article-title>
          ,
          <source>arXiv preprint arXiv:2503.01743</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>MistralAITeam</surname>
          </string-name>
          , Un Ministral, des Ministraux,
          <year>2024</year>
          . [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          , X. Ma, https://mistral.ai/news/ministraux. A.
          <string-name>
            <surname>Efrat</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , G. Ghosh,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lewis</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , LIMA: Less Is More for
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          , Measuring Alignment,
          <source>arXiv preprint arXiv:2305.11206</source>
          (
          <year>2023</year>
          ).
          <article-title>massive multitask language understanding</article-title>
          , arXiv [27]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Buehler</surname>
          </string-name>
          , Fine-tuning large lanpreprint arXiv:
          <year>2009</year>
          .
          <volume>03300</volume>
          (
          <year>2021</year>
          ).
          <article-title>guage models for domain adaptation: Exploration</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Ngo</surname>
          </string-name>
          , T. Nguyen,
          <article-title>of training strategies, scaling, model merging</article-title>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Okapi: synergistic capabilities,
          <year>2024</year>
          .
          <article-title>Instruction-tuned large language models in mul-</article-title>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chalumattu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          , L. Mascarell,
          <article-title>tiple languages with reinforcement learning from AdaptEval: Evaluating Large Language Models on human feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2307</source>
          .
          <article-title>16039 Domain Adaptation for Text Summarization</article-title>
          , in: (
          <year>2023</year>
          ).
          <source>Proceedings of the 1st Workshop on Customizable</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>I. Siragusa</surname>
          </string-name>
          , R. Pirrone,
          <article-title>UniQA: an italian and english NLP: Progress and Challenges in Customizing NLP question-answering data set based on educational for a Domain, Application</article-title>
          , Group, or Individual documents,
          <source>Proceedings of the Eighth Workshop on (CustomNLP4U)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          .
          <source>Natural Language for Artificial Intelligence (NL4AI</source>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <year>2024</year>
          ) co
          <article-title>-located with 23th International Confer- S. Wu, Dragft: Adapting large language modence of the Italian Association for Artificial Intelli- els with dictionary and retrieval augmented finegence (AI*IA</article-title>
          <year>2024</year>
          )
          <article-title>(2024). tuning for domain-specific machine translation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Alpaca-LoRA</surname>
          </string-name>
          , https://github.com/tloen/alpaca-lora,
          <source>arXiv preprint arXiv:2402.15061</source>
          (
          <year>2024</year>
          ).
          <year>2023</year>
          . [30]
          <string-name>
            <surname>MistralAITeam</surname>
          </string-name>
          , Large Enough,
          <year>2024</year>
          . https://mistral.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. McAuley</surname>
          </string-name>
          ,
          <string-name>
            <surname>Baize:</surname>
          </string-name>
          <article-title>An open- ai/news/mistral-large-2407. source chat model with parameter-eficient tuning [31]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
          <article-title>on self-chat data</article-title>
          ,
          <source>arXiv preprint arXiv:2304</source>
          .01196
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , L. u. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , (
          <year>2023</year>
          ).
          <article-title>Attention is All you Need</article-title>
          , in: Advances in Neural
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <source>Sil- Information Processing Systems</source>
          ,
          <year>2017</year>
          . vestri,
          <source>DanteLLM: Let's push Italian LLM research</source>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ainslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee-Thorp</surname>
          </string-name>
          , M. de Jong, Y. Zemlyanskiy, forward!,
          <source>in: Proceedings of the 2024</source>
          Joint In- F. Lebron, S. Sanghai, GQA:
          <article-title>Training generalized ternational Conference on Computational Linguis- multi-query transformer models from multi-head tics, Language Resources and Evaluation (LREC- checkpoints</article-title>
          , in
          <source>: Proceedings of the 2023 ConferCOLING</source>
          <year>2024</year>
          ),
          <article-title>ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , ence on Empirical Methods in Natural Language pp.
          <fpage>4343</fpage>
          -
          <lpage>4355</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          . Processing, Association for Computational Linguislrec-main.
          <volume>388</volume>
          /. tics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>4895</fpage>
          -
          <lpage>4901</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 2025.
          [33] B. Zhang, R. Sennrich, Root mean square layer normalization, Curran Associates Inc., Red Hook, NY, USA, 2019.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] V. Lialin, V. Deshpande, X. Yao, A. Rumshisky, Scaling down to scale up: A guide to parameter-efficient fine-tuning, arXiv preprint arXiv:2303.15647 (2024).
          [34] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, in: International Conference on Learning Representations, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs, arXiv preprint arXiv:2305.14314 (2023).
          [35] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, A. Zou, A framework for few-shot language model evaluation, 2024. URL: https://zenodo.org/records/12608602. doi:10.5281/zenodo.12608602.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford alpaca: An instruction-following llama model, https://github.com/tatsu-lab/stanford_alpaca, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: Llamantino-3-anita, 2024.
          [36] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, et al., Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38-45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>LLM.int8(): 8-bit matrix multiplication for transformers at scale</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2208.07339. arXiv:2208.07339.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language Models are Few-Shot Learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          , in: Text Summarization Branches Out,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating Text Generation with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <source>Unsupervised Cross-lingual Representation Learning at Scale</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [44]
          <string-name>
            <surname>Mistral AI Team</surname>
          </string-name>
          ,
          <source>Mistral Small 3.1</source>
          ,
          <year>2025</year>
          . https://mistral.ai/news/mistral-small-3-1.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [45]
          <article-title>Multi-task Language Understanding on MMLU</article-title>
          , https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Way</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <source>Multilingual MMLU Benchmark Leaderboard</source>
          ,
          <year>2024</year>
          . https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [47]
          <article-title>Classifica generale degli LLM italiani (general leaderboard of Italian LLMs)</article-title>
          , https://huggingface.co/spaces/miillm/open_ita_llm_leaderboard,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>