<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Evaluation of Italian and English Small Language Models for Domain-based QA in Low-Resource Scenario</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Irene Siragusa</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Roberto Pirrone</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Engineering, University of Palermo</institution>
          ,
          <addr-line>Palermo, 90128, Sicily</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Usage of open-source Large Language Models, which can be run locally, modified, fine-tuned, and queried without APIs that require data sharing, is required when dealing with sensitive or confidential information. In addition, suitable computational resources are needed to infer and fine-tune such models. The objective of this work is to assess the potentialities of Small Language Models in low-resource scenarios in which quantization may be required. In particular, the focus is on the usage of these models for the Italian and English languages, from both a purely quantitative and a resource-oriented evaluation, across two Question Answering data sets: a generic closed-answer one and a domain-based one with open answers.</p>
      </abstract>
      <kwd-group>
<kwd>LLM</kwd>
        <kwd>QA</kwd>
        <kwd>Quantization</kwd>
        <kwd>Fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Generative Large Language Models (LLMs) are mainly oriented towards the paradigm "the bigger the better", involving both closed-source models such as GPT [<xref ref-type="bibr" rid="ref1">1</xref>], Claude [<xref ref-type="bibr" rid="ref2">2</xref>] and Gemini [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>], but also Llama (Llama 3.1 405B [<xref ref-type="bibr" rid="ref5">5</xref>] or Llama 4 Maverick 400B [<xref ref-type="bibr" rid="ref6">6</xref>]) and DeepSeek (DeepSeek R1 671B [<xref ref-type="bibr" rid="ref7">7</xref>]) models. Despite the impressive capabilities of such models, in both textual and multimodal setups, significant issues arise when dealing with their size. In particular, high computational resources are needed during the training phase, which is performed only once and asynchronously. The inference phase, on the other hand, despite requiring fewer computational resources, may become a bottleneck of the final distributed application, for which concurrency and the related GPU resources are needed. Pay-per-use APIs resolve the computational aspects but lead to privacy-related issues. Applications that involve the use of Artificial Intelligence (AI) models as support systems in private companies or hospitals, where data is confidential or sensitive and any breach must be avoided, should be compliant with those restrictive requirements and not allow sharing data with third parties.</p>
      <p>To address these privacy-related issues, the focus of this work is on open-source models which can be trained locally and inferred in a low-resource scenario, both with full precision and in a quantization setup [<xref ref-type="bibr" rid="ref8">8</xref>]. In doing this, the most recent Small Language Models (SLMs), released from late 2024 until April 2025, are considered, for which an instruction tuning phase was performed and which support both English and Italian. This research led to the selection of models belonging to the following families, namely Qwen 3 [<xref ref-type="bibr" rid="ref9">9</xref>], Gemma 3 [<xref ref-type="bibr" rid="ref10">10</xref>], Phi 4 [<xref ref-type="bibr" rid="ref11">11, 12</xref>] and Ministral [<xref ref-type="bibr" rid="ref13">13</xref>], for which only the freely available models below 20B parameters are considered. Performances of these models were evaluated in both the full-precision and the 8/4-bit quantization scenarios. The evaluation was carried out with the generic benchmark MMLU [<xref ref-type="bibr" rid="ref14 ref15">14, 15</xref>] and with UniQA [<xref ref-type="bibr" rid="ref16">16</xref>], a domain-specific Question Answering (QA) data set in the university domain. Both data sets cover English and Italian, and the relative evaluations were performed in both languages. Statistics for the evaluation time and the GPU memory used are also calculated. To further stress the potentialities of these models, the smallest ones were fine-tuned with two diverse strategies over the UniQA data set, and the relative performances on both selected benchmarks have been analyzed. Thus, the main contributions of this work can be summarized as follows.</p>
      <list list-type="order">
        <list-item><p>Evaluation of open-source SLMs with the MMLU benchmark and the UniQA data set in different quantization scenarios, from both a quantitative and a computational perspective;</p></list-item>
        <list-item><p>Fine-tuning with two proposed strategies over the UniQA data set;</p></list-item>
        <list-item><p>Comprehensive evaluation of the fine-tuned models over both MMLU and UniQA.</p></list-item>
      </list>
      <p>The remainder of the paper is organized as follows: the relevant background is reported in Section 2, the selected models in Section 3, the data sets in Section 4, and the experimental setup in Section 5. The results are discussed in Section 6, while the concluding remarks are drawn in Section 7.</p>
    </sec>
    <sec id="sec-1b">
      <title>2. Background</title>
      <p>Capabilities of closed-source and huge LLMs around different tasks are well known, but in the context of real applications the usage of such models is impracticable. This can be mainly ascribed to costly pay-per-use APIs and to the sharing of private data with third parties, which may lead to data breaches. The Natural Language Processing (NLP) community is exploring not only the capabilities of larger [<xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>] and expert-based LLMs [<xref ref-type="bibr" rid="ref4 ref6 ref7">6, 7, 4, 30</xref>], but also smaller models obtained through a distillation procedure from larger models [<xref ref-type="bibr" rid="ref10">10</xref>], thus providing the general public with a valuable alternative.</p>
      <p>Small Language Models are the focus of this research, which is limited to multilingual generative models that explicitly support the Italian language in addition to English. In particular, only models based on a transformer decoder-only architecture [31] and on instruction fine-tuning are considered. Instruct models are capable of generating text given an instruction, thus making them suitable for the proposed evaluation scenario, which includes closed and open QA tasks.</p>
      <p>In addition, given the increasingly fast development of newer models, only models released from the last months of 2024 to April 2025 are examined. More in detail, we considered only models with less than 20B parameters, which have been sub-grouped into 4B, 8B, and 12B-14B models to better evaluate their performance. The selected models are described in Section 3 along with their principal characteristics.</p>
      <p>Fine-tuning a pre-trained LLM in the context of domain and task adaptation involves strategies both for the fine-tuning proper and for reducing the overall fine-tuning computational cost while keeping its effectiveness. Supervised Fine-Tuning (SFT) strategies are used for instruction tuning and for domain, language, or task adaptation [<xref ref-type="bibr" rid="ref17 ref18 ref19">17, 18, 19</xref>]. As a supervised method, both the input and the desired output are provided to the model, and, following a teacher forcing methodology, the model is forced to use the expected golden target token, even if the wrong one has been previously generated [<xref ref-type="bibr" rid="ref20">20</xref>]. In the case of QA tasks, a training sample consists of a question, the associated answer, and an optional context from which the answer should be derived.</p>
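      <p>As a minimal sketch of how such a supervised QA sample can be arranged for a causal language model, assuming a Hugging Face tokenizer, the prompt tokens can be masked so that the teacher-forced loss is computed only on the gold answer; the model identifier and field layout are illustrative.</p>
      <preformat># Minimal sketch: building one supervised QA training sample for a causal LM.
# Teacher forcing is obtained by letting the model predict the gold answer tokens;
# prompt tokens are masked with -100 so the loss is computed on the answer only.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # illustrative instruct SLM

def build_sample(question, answer, context=None, max_len=2048):
    prompt = f"QUESTION: {question}\n"
    if context is not None:                      # optional supporting documents
        prompt += f"DOCUMENTS: {context}\n"
    prompt += "ANSWER: "
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer + tokenizer.eos_token,
                           add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_len]
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}</preformat>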
      <p>Parameter-Efficient Fine-Tuning (PEFT) techniques are adopted in conjunction with SFT to speed up the fine-tuning phase and reduce the computational resources required. In particular, they involve freezing, quantization, and Low-Rank Adaptation (LoRA) [<xref ref-type="bibr" rid="ref21">21</xref>]. In freezing, only the weights of selected layers are actually trained, while the rest are kept frozen. In the quantization strategy [<xref ref-type="bibr" rid="ref8">8</xref>] the precision of the weight representation in the model is reduced from 32-bit to a 16-, 8-, or 4-bit representation. This technique can be used at both training and inference time, thus decreasing the computational resources needed in terms of GPU memory. Lastly, LoRA [<xref ref-type="bibr" rid="ref22">22</xref>] is one of the most used PEFT techniques, in which low-weight adapters associated to selected layers are trained instead of the original weights. In doing this, the number of trainable parameters is greatly decreased, and the computational resources needed for the fine-tuning process are reduced accordingly. In addition, these techniques can be combined to better fit computational constraints, as in Quantized Low-Rank Adaptation (QLoRA) [<xref ref-type="bibr" rid="ref23">23</xref>], in which quantization is applied along with LoRA during training.</p>
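      <p>A minimal sketch of how these techniques can be combined, assuming the transformers, bitsandbytes and peft packages: the model is loaded in 4-bit and LoRA adapters are attached, so that only the low-rank matrices are trained (QLoRA). The model identifier and hyper-parameters are illustrative.</p>
      <preformat># Minimal QLoRA sketch: load a causal LM in 4-bit and attach LoRA adapters,
# so that only the low-rank matrices require gradients.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

bnb_config = BitsAndBytesConfig(load_in_4bit=True,
                                bnb_4bit_compute_dtype=torch.bfloat16)
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-4B",
                                             quantization_config=bnb_config,
                                             device_map="auto")

lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q_proj", "v_proj"],
                         task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable</preformat>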
      <p>For effective fine-tuning, models are trained on average for a few training epochs, mainly ranging from 3 to 15 [<xref ref-type="bibr" rid="ref24 ref25">24, 25, 26, 27</xref>], usually combining different PEFT strategies [28, 29].</p>
    </sec>
    <sec id="sec-2">
      <title>3. Models</title>
      <p>Phi 4 [<xref ref-type="bibr" rid="ref11">11, 12</xref>] is a family of Microsoft models that showed impressive capabilities despite the reduced number of parameters compared to other models. The higher performance of these models can be attributed to the three-stage training procedure and to the data curation process, which involves a data decontamination step with respect to the most used benchmarks. In addition, a greater variety in the data and the attention devoted to synthetic data for Chain of Thought (CoT) and reasoning capabilities contributed to enhancing the overall behavior of these models. Phi 4 was released in its full version, which consists of 14B parameters, and in its mini version with 3.8B parameters, which will be considered as a 4B model in the subsequent analysis.</p>
      <p>Qwen 3 [<xref ref-type="bibr" rid="ref9">9</xref>] is a family of multilingual models released by the Chinese company Alibaba Cloud. Along with the large Mixture-of-Experts models of 30B and 235B parameters, smaller models have been released, ranging from 1B to 32B parameters. Only the models with 4B, 8B and 14B parameters are considered in this analysis. Great attention in Qwen 3 models was devoted to reasoning and CoT, both in the training data selection and at inference time, where the explicit thinking mode can be enabled or disabled.</p>
      <p>Gemma 3 [<xref ref-type="bibr" rid="ref10">10</xref>] is a family of multimodal and multilingual models developed by Google DeepMind, co-designed with the Gemini models [<xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>], with which they share the same tokenizer. A Grouped-Query Attention (GQA) mechanism [32] was used, with post-norm and pre-norm RMSNorm [33] and support for longer contexts. Gemma 3 models range from 1B to 27B parameters and were trained with a knowledge distillation strategy. In the context of this research, only the 4B and 12B versions are considered.</p>
      <p>Ministral [<xref ref-type="bibr" rid="ref13">13</xref>] is a model from the French company Mistral AI, released in 3B and 8B parameter versions. Ministral models are the newer version of Mistral 7B [34], which uses an interleaved sliding-window attention pattern to provide a faster, more computationally efficient, and lower-latency solution at inference time. As the 3B version is not open-source, only the 8B version was considered in this work.</p>
    </sec>
    <sec id="sec-2b">
      <title>4. Data sets</title>
      <p>Three English and Italian data sets have been considered for evaluation purposes: two are closed QA data sets, and the other is an open QA data set. In the first case, the model is asked to answer with one of the provided answers, while in the second case a free-text answer is expected. As closed QA, the generic Massive Multitask Language Understanding (MMLU) task was selected in its English and Italian versions. From here on, the English version of MMLU will be referred to as MMLU-EN and the Italian version as MMLU-IT, while MMLU will be used to refer to both splits. On the other hand, UniQA was selected as a domain-specific open-answer QA data set in the university domain, available both in English and Italian.</p>
      <p>
        MMLU-EN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] is a generic benchmark task to
evaluate the capabilities of LLMs after their training
phase. It is a closed QA task involving 57 different
subjects in STEM, humanities, and social science with
diverse complexity ranging from elementary level to
advanced and professional level. It consists of 14079
questions, and the models are queried with a 5-shot
strategy in which 5 sample questions are provided
for each subject. Accuracy is the proposed metric for
performance evaluation.
      </p>
      <p>
        MMLU-IT [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is the translated version of the MMLU
data set, which is also referenced in the Language
Model Evaluation Harness framework [35]. Translation
was obtained automatically using an ad hoc developed
prompt for ChatGPT. No further checks have been
conducted on the data set to evaluate its correctness in
terms of translation.
      </p>
      <p>UniQA [<xref ref-type="bibr" rid="ref16">16</xref>] is a QA data set for the university domain that comprises nearly 14k QA pairs and more than 1k documents, which serve as context for the questions. The data set has been generated in a semi-automated manner using the data retrieved from the website of the University of Palermo, covering information about the bachelor and master degree courses for the academic year 2024/2025. Data are natively both in Italian and English, i.e. no translation procedure was involved in its development. From here on, UniQA-EN will be used for the English split, UniQA-IT for the Italian one, and the general form UniQA will be used for both splits.</p>
    </sec>
    <sec id="sec-2c">
      <title>5. Experimental setup</title>
      <p>Models in an out-of-the-box setup were tested with the MMLU and UniQA data sets at different levels of quantization, namely in their base, 8-bit (Q8) and 4-bit (Q4) quantized versions [<xref ref-type="bibr" rid="ref8">8</xref>]. These evaluations were performed to assess the performance of quantized models versus their base version, along with the effective computational resources involved, such as GPU memory and inference time. Quantization was performed with the bitsandbytes library (https://github.com/bitsandbytes-foundation/bitsandbytes) in combination with the transformers library [36], in both the 8-bit and 4-bit quantization settings [<xref ref-type="bibr" rid="ref26">37</xref>].</p>
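      <p>A minimal sketch of this quantized loading, assuming the transformers and bitsandbytes packages; the model identifier is illustrative, and the same call with load_in_4bit is used for the Q4 runs.</p>
      <preformat># Minimal sketch: load a model in 8-bit (or 4-bit) with bitsandbytes through transformers.
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Qwen/Qwen3-4B"                              # illustrative model id
quant_config = BitsAndBytesConfig(load_in_8bit=True)    # or load_in_4bit=True for Q4

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quant_config,   # omit this argument for full-precision inference
    device_map="auto",
)</preformat>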
      <sec id="sec-2-1">
        <title>You are Unipa-GPT, the chatbot and vir</title>
        <p>tual assistant of the University of Palermo.</p>
      </sec>
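      <p>A possible invocation of the harness through its Python API is sketched below; the model identifier is illustrative, and the task name for the Italian translation depends on the installed harness version.</p>
      <preformat># Minimal sketch of a 5-shot MMLU run via the Language Model Evaluation Harness API.
from lm_eval import simple_evaluate

results = simple_evaluate(
    model="hf",
    model_args="pretrained=Qwen/Qwen3-4B,load_in_8bit=True",
    tasks=["mmlu"],        # MMLU-EN; the Italian split uses the corresponding translated task
    num_fewshot=5,
)
print(results["results"])  # per-task accuracy</preformat>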
      <sec id="sec-2-2">
        <title>Provide an answer to the provided QUESTION concerning the University of Palermo, relying on the given DOCUMENTS</title>
      </sec>
      <sec id="sec-2-3">
        <title>If the question is in English, answer in English.</title>
      </sec>
      <sec id="sec-2-4">
        <title>If the question is in Italian, answer in Italian.</title>
      </sec>
      <sec id="sec-2-5">
        <title>QUESTION:</title>
        <p>question</p>
      </sec>
      <sec id="sec-2-6">
        <title>DOCUMENTS:</title>
        <p>documents</p>
      <p>For UniQA, we used the default generation configuration suggested by the developers of the selected models. In particular, the thinking mode was disabled for Qwen 3, while the sampling strategy in the generation phase was disabled for the Gemma 3 models. Whenever a model was not able to generate an answer, the default empty answer was considered as the generated one.</p>
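      <p>A minimal sketch of a single UniQA inference call under these settings, assuming an already loaded model and tokenizer and the question and documents strings; enable_thinking is the Qwen 3 chat-template switch, while do_sample=False corresponds to disabling sampling as done for the Gemma 3 models.</p>
      <preformat># Minimal sketch of one UniQA query with the generation settings described above.
prompt = (
    "You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo.\n"
    "Provide an answer to the provided QUESTION concerning the University of Palermo, "
    "relying on the given DOCUMENTS\n"
    "If the question is in English, answer in English.\n"
    "If the question is in Italian, answer in Italian.\n"
    f"QUESTION:\n{question}\nDOCUMENTS:\n{documents}"
)
messages = [{"role": "user", "content": prompt}]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,      # Qwen 3 only: disable the explicit thinking mode
    return_tensors="pt",
).to(model.device)
output = model.generate(inputs, max_new_tokens=512, do_sample=False)
answer = tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True)</preformat>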
      <p>As evaluation metrics, BLEU [<xref ref-type="bibr" rid="ref28">39</xref>], ROUGE [<xref ref-type="bibr" rid="ref29">40</xref>], METEOR [<xref ref-type="bibr" rid="ref30">41</xref>] and BERTScore [<xref ref-type="bibr" rid="ref31">42</xref>], computed with the multilingual model XLM-RoBERTa Large [<xref ref-type="bibr" rid="ref32">43</xref>], were calculated. Since the F1 BERTScore provides a more comprehensive evaluation of the meaning and significance of the generated answer, it was the only metric considered for evaluation purposes in the context of this work. In the Appendix, all the calculated metrics for the UniQA data set are reported for each inference configuration tested (Table 6).</p>
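      <p>A minimal sketch of the metric computation on the generated answers, assuming the evaluate and bert-score packages; the prediction and reference lists are placeholders.</p>
      <preformat># Minimal sketch: BLEU, ROUGE, METEOR and BERTScore (XLM-RoBERTa Large) on generated answers.
import evaluate
from bert_score import score as bert_score

predictions = ["..."]   # placeholder: model answers
references  = ["..."]   # placeholder: gold UniQA answers

bleu   = evaluate.load("bleu").compute(predictions=predictions, references=references)
rouge  = evaluate.load("rouge").compute(predictions=predictions, references=references)
meteor = evaluate.load("meteor").compute(predictions=predictions, references=references)
precision, recall, f1 = bert_score(predictions, references,
                                   model_type="xlm-roberta-large")
print(bleu["bleu"], rouge["rougeL"], meteor["meteor"], f1.mean().item())</preformat>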
      <sec id="sec-2-7">
        <title>5.1. Fine-tuning strategies</title>
        <p>Only the smallest models, namely Gemma 3 4B, Phi 4 mini, and Qwen 3 4B, have been fine-tuned over the English and Italian training splits of the UniQA data set. In particular, two different fine-tuning strategies have been proposed and used in this phase, namely w/ docs (with documents) and w/o docs (without documents). They differ in the arrangement of the training samples and in the associated instruction prompt, as reported in Table 1.</p>
        <p>Following the approaches described in Section 2, our choice was to perform a full fine-tuning limited to selected layers. A unique strategy was designed that is suitable for heterogeneous models with a different number of decoder layers. We fully fine-tuned only the last 25% of the decoder layers and the classification head, while freezing the remaining layers. The proposed strategy resulted in a valuable trade-off between PEFT techniques and full fine-tuning. In addition, this strategy meets the proposed research question of analyzing the impact of quantization at the inference phase and not during training. The models have been trained for five epochs: a larger number of training epochs does not lead to significant improvements given the considered training data. A validation set was split off from the training set with a 90:10 ratio, and it was used as a criterion to select the best model according to the validation loss.</p>
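        <p>A minimal sketch of this partial fine-tuning setup, assuming the common Hugging Face module layout (model.model.layers and model.lm_head, which may differ across architectures); the model identifier is illustrative.</p>
        <preformat># Minimal sketch of the adopted partial fine-tuning: freeze everything, then
# re-enable gradients only for the last 25% of decoder layers and the LM head.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("microsoft/Phi-4-mini-instruct")

for param in model.parameters():
    param.requires_grad = False

layers = model.model.layers                 # list of decoder blocks
n_trainable = max(1, len(layers) // 4)      # last 25% of the decoder layers
for layer in layers[-n_trainable:]:
    for param in layer.parameters():
        param.requires_grad = True

for param in model.lm_head.parameters():    # classification head
    param.requires_grad = True

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")</preformat>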
        <table-wrap id="tab1">
          <label>Table 1</label>
          <caption>
            <p>Instruction prompts designed for fine-tuning w/ and w/o documents.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Strategy</th><th>Prompt text</th></tr>
            </thead>
            <tbody>
              <tr>
                <td>w/ docs</td>
                <td>You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide an answer to the provided QUESTION concerning the University of Palermo, relying on the given DOCUMENTS. If the question is in English, answer in English. If the question is in Italian, answer in Italian. QUESTION: &lt;QUESTION&gt; DOCUMENTS: &lt;DOCUMENTS&gt;</td>
              </tr>
              <tr>
                <td>w/o docs</td>
                <td>You are Unipa-GPT, the chatbot and virtual assistant of the University of Palermo. Provide an answer to the provided QUESTION concerning the University of Palermo. If the question is in English, answer in English. If the question is in Italian, answer in Italian. QUESTION: &lt;QUESTION&gt;</td>
              </tr>
            </tbody>
          </table>
        </table-wrap>
        <p>In the w/ docs strategy, annotated documents were fed as input in the training sample, thus allowing the model to read the documents and forcing it to extract and re-paraphrase the desired snippet of the document containing the answer. On the other hand, in the w/o docs strategy, no additional context was provided in the prompt, allowing the model to learn the QA pairs directly and, at inference time, to integrate the knowledge provided by the documents in a context-learning set-up [<xref ref-type="bibr" rid="ref27">38</xref>].</p>
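        <p>A minimal sketch of how a training sample can be arranged under the two strategies, following the prompts of Table 1; the helper name and the example values are illustrative.</p>
        <preformat># Minimal sketch: arranging one UniQA training sample for the w/ docs and w/o docs strategies.
SYSTEM = ("You are Unipa-GPT, the chatbot and virtual assistant "
          "of the University of Palermo.")

def build_prompt(question, documents=None):
    if documents is not None:  # w/ docs strategy
        return (f"{SYSTEM}\nProvide an answer to the provided QUESTION concerning "
                f"the University of Palermo, relying on the given DOCUMENTS\n"
                f"If the question is in English, answer in English.\n"
                f"If the question is in Italian, answer in Italian.\n"
                f"QUESTION:\n{question}\nDOCUMENTS:\n{documents}")
    # w/o docs strategy: no supporting documents in the training sample
    return (f"{SYSTEM}\nProvide an answer to the provided QUESTION concerning "
            f"the University of Palermo.\n"
            f"If the question is in English, answer in English.\n"
            f"If the question is in Italian, answer in Italian.\n"
            f"QUESTION:\n{question}")

# Illustrative example: the gold answer is used as the supervised target.
sample = {"prompt": build_prompt("Quali corsi di laurea magistrale offre l'Ateneo?"),
          "answer": "..."}</preformat>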
        <p>Inferences were run on a local machine with a single 48 GB NVIDIA RTX 6000 Ada Generation GPU (machine 1) and on a cluster node with one NVIDIA A100 64 GB GPU from the Leonardo supercomputer (https://leonardo-supercomputer.cineca.eu/it/home-it/) via an ISCRA-C application (machine 2), while fine-tuning was executed on machine 2. On the same machines, the occupied GPU memory and the inference time were monitored to simulate and provide an estimation of the computational resources required in the low-resource scenario.</p>
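        <p>A minimal sketch of the per-query monitoring, assuming a single CUDA device and an already loaded model and tokenizer; torch.cuda.max_memory_allocated reports the peak GPU memory since the last reset.</p>
        <preformat># Minimal sketch: measuring inference time and peak GPU memory for one query.
import time
import torch

def timed_generate(model, tokenizer, prompt, **gen_kwargs):
    torch.cuda.reset_peak_memory_stats()
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    with torch.no_grad():
        output = model.generate(**inputs, **gen_kwargs)
    elapsed = time.perf_counter() - start                     # seconds per query
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3     # peak GPU memory in GB
    answer = tokenizer.decode(output[0], skip_special_tokens=True)
    return answer, elapsed, peak_gb</preformat>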
      </sec>
    </sec>
    <sec id="sec-3">
      <title>6. Results</title>
      <p>
        In Table 2 a comprehensive evaluation of the selected
models is reported. Evaluations also include
performances over bigger models such as Mistral Small [
        <xref ref-type="bibr" rid="ref33">44</xref>
        ],
Llama 4 Scout Instruct [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], Claude 3.5 Sonnet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] and
GPT 4o Mini [
        <xref ref-type="bibr" rid="ref27">38</xref>
        ]. These models have been considered
since their performance for both tasks was available from
public leaderboards [
        <xref ref-type="bibr" rid="ref34 ref35 ref36">45, 46, 47</xref>
        ]. In this phase, no spot
checks or roundtrip translations have been conducted to
further investigate errors in the automatically translated
MMLU-IT split, to assess whether some inference errors
derive from actual model limitations or from translation
artifacts.
      </p>
      <p>The overall best results are achieved by Claude 3.5 Sonnet, followed by GPT 4o mini. Nevertheless, Phi 4 is a valuable alternative, since it reaches performance comparable to Mistral Small and is only 0.2 points below GPT 4o mini on MMLU-EN. On MMLU-IT, scores tend to be lower compared to the English split, and again Phi achieves the best results among the evaluated models. With reference to the smaller models, Qwen 3 in its 4B and 8B versions outperforms the other models on the MMLU tasks, while showing a significantly higher average inference time compared with its competitors. Performance generally decreases in quantized models. The decrease is significant in the case of Q4, while the average inference time per question decreases for Q8 and tends to increase for Q4, especially for Qwen.</p>
      <sec id="sec-3-1">
        <title>2https://leonardo-supercomputer.cineca.eu/it/home-it/</title>
      <p>Generally speaking, the quantization procedure at inference time can increase the answer time due to the additional computation required for quantization [<xref ref-type="bibr" rid="ref26">37</xref>]. This behavior is highly emphasized in the UniQA evaluation, where the input provided to the models can be significantly longer than the MMLU samples. The results for UniQA are reported in Table 3, together with the average GPU memory occupied for each inference. To better compare the obtained results, the standard deviation of the average inference time and GPU usage is also reported. Note that the average inference time is reported in seconds, while the GPU usage is in GB; the associated standard deviations follow the same scales and, in the GPU case, mostly result in 0.0, since the corresponding variation is lower than 0.1 GB.</p>
      <p>In full-precision inference, the best results are achieved by the Gemma 3 models, which reach a BERT-F1 score of 0.88 on average in both the 4B and 12B versions, also surpassing the larger 14B models. In this context, the performance of Gemma 3 4B is much more interesting from a computational perspective, since it is 64% smaller than the 12B version while reaching comparable performance. The smallest model in this set-up is Phi 4 mini, which occupies less than 16 GB and reaches the smallest inference time, which is desirable in the context of real-time applications. Regarding quantized inference, GPU memory values decrease by 70% and 80% for Q8 and Q4, respectively, compared to models inferred with full precision. In terms of inference time, significant increases are found for Q8, while a reduction is found for Q4, which is mainly related to the quantization strategy adopted by bitsandbytes [<xref ref-type="bibr" rid="ref26">37</xref>]. Overall, for both performance and computational resource usage, Phi and Ministral are the models that benefit the most from quantization, keeping comparable performance over the selected benchmarks despite a slight decrease. The worst performances are obtained by the Gemma models, which deeply suffer from the quantization procedure, leading to empty output strings (Q4) or meaningless output in undesired languages (Q8).</p>
      <p>In contrast with the MMLU case, in which a slight discrepancy can be found between the English and the Italian split, in the UniQA case all performances are at the same level and, in some cases, lean slightly towards the Italian split. This behavior can be explained through an analysis of the data set, in which the presence of the context can guide the model more effectively in generating the desired answer, and through language-related characteristics and understanding.</p>
        <p>In Tables 4 and 5 the results over both benchmarks are
reported using the two proposed fine-tuning strategies,
with and without documents.</p>
      <p>No improvements are found after the fine-tuning phase with the two proposed strategies in terms of performance on the MMLU benchmarks. Models fine-tuned with the w/ docs strategy tend to better maintain the performance obtained by the base models. These results show that fine-tuning on a specific task did not lead to a degradation in performance on a generic benchmark and that the generalization capability of the considered LLMs is maintained. This is mainly due to the light fine-tuning strategy adopted, which does not cause the model to overfit.</p>
      <p>Regarding UniQA performance, both strategies proved successful, since the overall BERT F1 score increased. More specifically, better results are obtained with the w/o docs strategy, both in terms of evaluation metrics and of average inference time, which is reduced. Improvements are found in both the base and the quantized inferences. As in the inference without fine-tuning, the average time for quantized models deeply penalizes Gemma 3 4B, while Qwen 3 4B trained with the w/ docs strategy results in the overall best model on both the MMLU and UniQA benchmarks. Qwen 3 4B, in fact, better maintains the same level of performance across the different quantization levels. In addition, the w/o docs fine-tuning strategy was crucial to improve the capabilities of Phi 4 mini, in particular in the base and Q8 quantized inference. A general speed-up is found in fine-tuned models on the UniQA benchmark, while no improvements are found for Gemma 3 4B in the Q4 setup, where performance remains low.</p>
      <p>The results obtained show that recent progress in developing multilingual LLMs provides the opportunity to use a valuable out-of-the-box model, also for domain-specific tasks, with appropriate prompt engineering. In addition, the two proposed fine-tuning strategies, coupled with an overall light training phase in terms of number of epochs, trainable layers and, consequently, resources needed, prove crucial to improve the capabilities of the SLMs under consideration, as for Phi 4 mini and Qwen 3 4B. These models, trained in a target domain for a desired QA task of interest, were able to outperform models three times larger in size while requiring on-budget resources. In general, both models should be considered as a valuable alternative to develop a custom LLM in a low-resource scenario. Phi tends to perform better after a w/o docs fine-tuning in terms of BERT score. On the other hand, Qwen presents strong performance in the traditional metrics such as the BLEU, ROUGE, and METEOR scores (Table 6), with both fine-tuning strategies and different quantization levels. Depending on the actual computational resources available, Phi is preferred, since it is smaller than Qwen. Although the metrics are really close to each other, the w/o docs training strategy is the best and the fastest one in the training phase.</p>
    </sec>
    <sec id="sec-4">
      <title>7. Conclusions</title>
      <p>In this work, we evaluated recent open-source instruction-tuned multilingual Small Language Models belonging to different families, with a focus on their performance upon base inference and after Q8 and Q4 quantization. In particular, both closed- and open-answer QA tasks were analyzed in Italian and English. Performance was evaluated from a quantitative perspective with the general MMLU benchmark and with UniQA, a QA data set based on a specific domain, for which relevant documents are associated to each question.</p>
      <p>The results show that, among the largest models under evaluation, Phi 4 14B almost reached Claude 3.5 Sonnet and GPT 4o Mini on the MMLU benchmark, while Gemma 3 12B obtained interesting performance when inferred with full precision on UniQA. Among the smaller models, Qwen 3 4B and Phi 4 mini were the most promising ones: both models scale better in terms of performance after Q8 and Q4 quantization on the selected benchmarks.</p>
      <p>In addition, two fine-tuning strategies were proposed for the last 25% of the layers and the classification head of the smaller models, using the training split of the UniQA data set. The results proved that Qwen 3 4B benefits the most from the training when evaluated over UniQA, while maintaining good general performance on the MMLU task. Such considerations, together with its flexibility towards quantization and its smaller inference time, make Qwen 3 4B a valuable model to implement custom LLM-based applications in a low-resource scenario after a suitable fine-tuning phase.</p>
      <p>More tests are needed to evaluate the performance
of the investigated models from a qualitative
perspective. More in detail, additional tests will be conducted
to simulate a real-case scenario, involving both human
evaluation of the quality of the provided answers and
truly open-ended QA in the domain of interest.</p>
    </sec>
    <sec id="sec-ack">
      <title>Acknowledgments</title>
      <p>We thank Giampiero Barbaro, who contributed to developing the training strategy in the early stages of this work, as part of his master thesis. This work is supported by the CUP project J73C24000070007, "CAESAR" (Cognitive evolution in AI: Explainable and Self-Aware Robots through multimodal data processing). The works presented were partially developed on the Leonardo supercomputer, within the Italian Super Computing Resource Allocation class C project IscrC_DOCVLM2 (HP10C97VNN).</p>
    </sec>
    <sec id="sec-genai">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used Writefull for grammar and spelling checks. After using these tools, the authors reviewed and edited the content as needed and assume full responsibility for the content of the publication.</p>
    </sec>
    <sec id="sec-6">
      <title>A. Evaluation metrics</title>
      <p>In Table 6, the full set of calculated metrics over the UniQA data set is reported for the different quantization and fine-tuning strategies.</p>
      <p>Table 6: Overview of the calculated metrics on the UniQA-EN and UniQA-IT splits. BERT-prec and BERT-rec stand for the BERT precision and recall scores, respectively, while FT and QTN refer to the fine-tuning and quantization strategy adopted. Average execution time is reported in seconds and GPU memory usage in GB. Bold values are the highest ones for each block, while starred ones are the overall best.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] OpenAI, GPT-4o
          <source>System Card, arXiv preprint arXiv:2410.21276</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <surname>Anthropic</surname>
          </string-name>
          ,
          <source>The Claude 3 Model Family: Opus</source>
          , Sonnet, Haiku,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <surname>GeminiTeam</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Anil</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          <string-name>
            <surname>Millican</surname>
          </string-name>
          , et al.,
          <source>Gemini: A Family of Highly Capable Multimodal Models</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <surname>Google</surname>
            <given-names>DeepMind</given-names>
          </string-name>
          ,
          <source>Gemini</source>
          <volume>2</volume>
          .
          <article-title>5: Our most intelligent AI model</article-title>
          ,
          <year>2025</year>
          . blog. google/technology/google-deepmind/
          <article-title>gemini-model-thinking-updates-march-</article-title>
          <year>2025</year>
          /.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <surname>LlamaTeam</surname>
          </string-name>
          ,
          <source>The Llama 3 Herd of Models, arXiv preprint arXiv:2407.21783</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          <article-title>[6] LlamaTeam, The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation</article-title>
          ,
          <year>2025</year>
          . https://ai.meta.com/blog/ llama-4
          <string-name>
            <surname>-</surname>
          </string-name>
          multimodal-intelligence/.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <surname>DeepSeek-AI</surname>
          </string-name>
          ,
          <article-title>DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning</article-title>
          ,
          <source>arXiv preprint arXiv:2501.12948</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jacob</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Kligys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Adam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <article-title>Quantization and training of neural networks for eficient integer-arithmetic-only inference</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] QwenTeam,
          <source>Qwen3 Technical Report, arXiv preprint arXiv:2505.09388</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>GemmaTeam</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Kamath</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Ferret</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Pathak</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          <string-name>
            <surname>Vieillard</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Merhej</surname>
          </string-name>
          , et al.,
          <source>Gemma 3 Technical Report, arXiv preprint arXiv:2503.19786</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>M.</given-names>
            <surname>Abdin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Aneja</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Behl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          et al.,
          <source>Phi-4 Technical Report, arXiv preprint arXiv:2412.08905</source>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name><given-names>A.</given-names> <surname>Abouelenin</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Ashfaq</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Atkinson</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Awadalla</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Bach</surname></string-name>
          , et al.,
          <article-title>Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs</article-title>
          ,
          <source>arXiv preprint arXiv:2503.01743</source>
          (
          <year>2025</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <surname>MistralAITeam</surname>
          </string-name>
          , Un Ministral, des Ministraux,
          <year>2024</year>
          . [26]
          <string-name>
            <given-names>C.</given-names>
            <surname>Zhou</surname>
          </string-name>
          , P. Liu,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Iyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mao</surname>
          </string-name>
          , X. Ma, https://mistral.ai/news/ministraux. A.
          <string-name>
            <surname>Efrat</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          <string-name>
            <surname>Zhang</surname>
            , G. Ghosh,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Lewis</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>D.</given-names>
            <surname>Hendrycks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Burns</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Basart</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <surname>O. Levy</surname>
          </string-name>
          , LIMA: Less Is More for
          <string-name>
            <given-names>M.</given-names>
            <surname>Mazeika</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Steinhardt</surname>
          </string-name>
          , Measuring Alignment,
          <source>arXiv preprint arXiv:2305.11206</source>
          (
          <year>2023</year>
          ).
          <article-title>massive multitask language understanding</article-title>
          , arXiv [27]
          <string-name>
            <given-names>W.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. K.</given-names>
            <surname>Luu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. J.</given-names>
            <surname>Buehler</surname>
          </string-name>
          , Fine-tuning large lanpreprint arXiv:
          <year>2009</year>
          .
          <volume>03300</volume>
          (
          <year>2021</year>
          ).
          <article-title>guage models for domain adaptation: Exploration</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>V. D.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. V.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. T.</given-names>
            <surname>Ngo</surname>
          </string-name>
          , T. Nguyen,
          <article-title>of training strategies, scaling, model merging</article-title>
          and
          <string-name>
            <given-names>F.</given-names>
            <surname>Dernoncourt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. A.</given-names>
            <surname>Rossi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. H.</given-names>
            <surname>Nguyen</surname>
          </string-name>
          , Okapi: synergistic capabilities,
          <year>2024</year>
          .
          <article-title>Instruction-tuned large language models in mul-</article-title>
          [28]
          <string-name>
            <given-names>A.</given-names>
            <surname>Afzal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Chalumattu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Matthes</surname>
          </string-name>
          , L. Mascarell,
          <article-title>tiple languages with reinforcement learning from AdaptEval: Evaluating Large Language Models on human feedback</article-title>
          ,
          <source>arXiv preprint arXiv:2307</source>
          .
          <article-title>16039 Domain Adaptation for Text Summarization</article-title>
          , in: (
          <year>2023</year>
          ).
          <source>Proceedings of the 1st Workshop on Customizable</source>
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <surname>I. Siragusa</surname>
          </string-name>
          , R. Pirrone,
          <article-title>UniQA: an italian and english NLP: Progress and Challenges in Customizing NLP question-answering data set based on educational for a Domain, Application</article-title>
          , Group, or Individual documents,
          <source>Proceedings of the Eighth Workshop on (CustomNLP4U)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>76</fpage>
          -
          <lpage>85</lpage>
          .
          <source>Natural Language for Artificial Intelligence (NL4AI</source>
          [29]
          <string-name>
            <given-names>J.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Su</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <year>2024</year>
          ) co
          <article-title>-located with 23th International Confer- S. Wu, Dragft: Adapting large language modence of the Italian Association for Artificial Intelli- els with dictionary and retrieval augmented finegence (AI*IA</article-title>
          <year>2024</year>
          )
          <article-title>(2024). tuning for domain-specific machine translation,</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <surname>Alpaca-LoRA</surname>
          </string-name>
          , https://github.com/tloen/alpaca-lora,
          <source>arXiv preprint arXiv:2402.15061</source>
          (
          <year>2024</year>
          ).
          <year>2023</year>
          . [30]
          <string-name>
            <surname>MistralAITeam</surname>
          </string-name>
          , Large Enough,
          <year>2024</year>
          . https://mistral.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <surname>J. McAuley</surname>
          </string-name>
          ,
          <string-name>
            <surname>Baize:</surname>
          </string-name>
          <article-title>An open- ai/news/mistral-large-2407. source chat model with parameter-eficient tuning [31]</article-title>
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          , J. Uszkoreit,
          <article-title>on self-chat data</article-title>
          ,
          <source>arXiv preprint arXiv:2304</source>
          .01196
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          , L. u. Kaiser,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          , (
          <year>2023</year>
          ).
          <article-title>Attention is All you Need</article-title>
          , in: Advances in Neural
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bacciu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Campagnano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Trappolini</surname>
          </string-name>
          ,
          <string-name>
            <surname>F.</surname>
          </string-name>
          <source>Sil- Information Processing Systems</source>
          ,
          <year>2017</year>
          . vestri,
          <source>DanteLLM: Let's push Italian LLM research</source>
          [32]
          <string-name>
            <given-names>J.</given-names>
            <surname>Ainslie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee-Thorp</surname>
          </string-name>
          , M. de Jong, Y. Zemlyanskiy, forward!,
          <source>in: Proceedings of the 2024</source>
          Joint In- F. Lebron, S. Sanghai, GQA:
          <article-title>Training generalized ternational Conference on Computational Linguis- multi-query transformer models from multi-head tics, Language Resources and Evaluation (LREC- checkpoints</article-title>
          , in
          <source>: Proceedings of the 2023 ConferCOLING</source>
          <year>2024</year>
          ),
          <article-title>ELRA</article-title>
          and
          <string-name>
            <given-names>ICCL</given-names>
            ,
            <surname>Torino</surname>
          </string-name>
          , Italia,
          <year>2024</year>
          , ence on Empirical Methods in Natural Language pp.
          <fpage>4343</fpage>
          -
          <lpage>4355</lpage>
          . URL: https://aclanthology.org/
          <year>2024</year>
          . Processing, Association for Computational Linguislrec-main.
          <volume>388</volume>
          /. tics, Singapore,
          <year>2023</year>
          , pp.
          <fpage>4895</fpage>
          -
          <lpage>4901</lpage>
          . URL: https:
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20] D. Jurafsky, J. H. Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition with Language Models, 2025.
          [33] B. Zhang, R. Sennrich, Root mean square layer normalization, Curran Associates Inc., Red Hook, NY, USA, 2019.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21] V. Lialin, V. Deshpande, X. Yao, A. Rumshisky, Scaling down to scale up: A guide to parameter-efficient fine-tuning, arXiv preprint arXiv:2303.15647 (2024).
          [34] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, arXiv preprint arXiv:2310.06825 (2023).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-Rank Adaptation of Large Language Models, in: International Conference on Learning Representations, 2022.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23] T. Dettmers, A. Pagnoni, A. Holtzman, L. Zettlemoyer, QLoRA: Efficient Finetuning of Quantized LLMs, arXiv preprint arXiv:2305.14314 (2023).
          [35] L. Gao, J. Tow, B. Abbasi, S. Biderman, S. Black, A. DiPofi, C. Foster, L. Golding, J. Hsu, A. Le Noac'h, H. Li, K. McDonell, N. Muennighoff, C. Ociepa, J. Phang, L. Reynolds, H. Schoelkopf, A. Skowron, L. Sutawika, E. Tang, A. Thite, B. Wang, K. Wang, A. Zou, A framework for few-shot language model evaluation, 2024. URL: https://zenodo.org/records/12608602. doi:10.5281/zenodo.12608602.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford alpaca: An instruction-following llama model, https://github.com/tatsu-lab/stanford_alpaca, 2023.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25] M. Polignano, P. Basile, G. Semeraro, Advanced natural-based interaction for the Italian language: Llamantino-3-anita, 2024.
          [36] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, et al., Transformers: State-of-the-Art Natural Language Processing, in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, Association for Computational Linguistics, Online, 2020, pp. 38-45. URL: https://www.aclweb.org/anthology/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>T.</given-names>
            <surname>Dettmers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Lewis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Belkada</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>LLM.int8(): 8-bit matrix multiplication for transformers at scale</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2208.07339. arXiv:2208.07339.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language Models are Few-Shot Learners</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>BLEU: a Method for Automatic Evaluation of Machine Translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A Package for Automatic Evaluation of Summaries</article-title>
          , in: Text Summarization Branches Out,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          ,
          <year>2005</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating Text Generation with BERT</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>A.</given-names>
            <surname>Conneau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Khandelwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Chaudhary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wenzek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Guzmán</surname>
          </string-name>
          , E. Grave,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ott</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zettlemoyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Stoyanov</surname>
          </string-name>
          ,
          <source>Unsupervised Cross-lingual Representation Learning at Scale</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [44]
          <string-name>
            <surname>Mistral AI Team</surname>
          </string-name>
          ,
          <source>Mistral Small 3.1</source>
          ,
          <year>2025</year>
          . https://mistral.ai/news/mistral-small-3-1.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [45]
          <article-title>Multi-task Language Understanding on MMLU</article-title>
          , https://paperswithcode.com/sota/multi-task-language-understanding-on-mmlu,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sakai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Geng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Way</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <source>Multilingual MMLU Benchmark Leaderboard</source>
          ,
          <year>2024</year>
          . https://huggingface.co/spaces/StarscreamDeceptions/Multilingual-MMLU-Benchmark-Leaderboard.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [47]
          <article-title>Classifica generale degli LLM italiani (general leaderboard of Italian LLMs)</article-title>
          , https://huggingface.co/spaces/miillm/open_ita_llm_leaderboard,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>