<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Uncovering Unsafety Traits in Italian Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulia Rizzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Magazzù</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Sormani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Pulerà</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Scalena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Fersini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Groningen</institution>
          ,
          <addr-line>CLCG, Groningen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) are increasingly deployed in real-world applications, raising urgent concerns around their safety, reliability, and ethical behavior. While existing safety evaluations have primarily focused on English, low- and mid-resource languages such as Italian remain critically underexplored. In this paper, we present the first comprehensive and multidimensional evaluation of LLM safety in the Italian language. We assess seven state-of-the-art LLMs across key safety dimensions using several automatic moderators tailored to the Italian setting. Furthermore, we analyze the challenges of adapting English-centric safety benchmarks to Italian via machine translation, highlighting limitations and proposing best practices for developing culturally and linguistically grounded evaluation frameworks. WARNING: This paper contains content that may be considered offensive.</p>
      </abstract>
      <kwd-group>
<kwd>Safety Evaluation</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Italian Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Large Language Models (LLMs) have rapidly become central to numerous applications, including conversational agents, content generation, and decision support systems in sensitive areas. However, as these models become more complex and widespread, concerns about their safety, reliability, and ethical deployment are growing. The evaluation of LLMs is no longer limited to measures of accuracy or fluency, but increasingly encompasses assessments of their unsafety.</p>
<p>This latter evaluation encompasses dimensions such as bias, toxicity, robustness to adversarial prompts, factual consistency, privacy preservation, and fairness.</p>
<p>Despite this growing awareness, a substantial
portion of the literature on safety remains centred on
high-resource languages, particularly English. The absence
of comprehensive evaluations tailored to specific
languages, including Italian, introduces a risk of
overlooking language-specific vulnerabilities and sociolinguistic
nuances that may influence model behaviour. Given the
global deployment of many LLMs and their interaction
with users across a broad spectrum of languages, this
imbalance poses practical and ethical challenges.</p>
<p>In this paper, we aim to address this gap by presenting the first comprehensive evaluation of LLM safety focused exclusively on the Italian language. We systematically assess commonly adopted LLMs across multiple dimensions of safety, adapting existing safety benchmarks. The objective of this study is to provide a fair evaluation of the unsafe behaviour of Italian Large Language Models, with a focus on identifying potential risks and informing future development and deployment practices.</p>
      <p>The primary contributions of this work are as follows:</p>
      <p>1. We present the first systematic and multidimensional unsafety evaluation of Italian Large Language Models (LLMs), which highlights the need, in some cases, to focus more on aligning the models towards more ethical behaviour. In particular, we performed a comparative evaluation of seven state-of-the-art Italian LLMs using both automatic and human-based evaluations.</p>
      <p>2. We developed three moderators to automatically evaluate and classify prompt–response pairs for the Italian language, enabling a nuanced assessment of unsafe behaviours within a predefined set of categories. In particular, we implemented DeBERTa v3 large, LLaMA 3.1 8B Instruct, and LLaMA Guard 3 8B for the Italian language.</p>
      <p>3. We provide an in-depth analysis of issues related to erroneous translation and their implications for safety benchmarking. We propose methodological recommendations for the development of culturally sensitive and linguistically appropriate safety benchmarks, with implications for the broader goal of equitable and responsible deployment of LLMs across diverse linguistic contexts.</p>
      <p>The paper is organized as follows. In Section 2, related works are outlined. In Section 3, the comparative evaluation of unsafety is described. In Section 4, the main outcomes are discussed. Finally, in Section 5, conclusions and future works are described.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>The increasing adoption of large language models (LLMs), including generative pre-trained transformers (GPTs), in both daily tasks and more specific applications has led to a substantial increase in interest regarding their reliability [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>]. Yuan et al. [<xref ref-type="bibr" rid="ref4">4</xref>] conducted a study to investigate the behaviour of NLP models under out-of-distribution conditions. The study demonstrated that state-of-the-art language models continue to exhibit brittleness when confronted with data that deviates from their training distributions. This finding reinforces the prevailing argument that current generalisation capabilities are inadequate for a considerable number of real-world applications. Another area of research focuses on privacy concerns. Yue et al. [<xref ref-type="bibr" rid="ref5">5</xref>] present a simple method for generating synthetic text data while mitigating privacy risks, and conduct comprehensive experiments evaluating both utility and privacy.</p>
      <p>Other critical aspects of trustworthiness research are adversarial attacks on language models and the fairness of machine learning models. Zang et al. [<xref ref-type="bibr" rid="ref6">6</xref>] framed word-level adversarial perturbations as a combinatorial optimization problem, demonstrating that even minor textual modifications can significantly degrade model performance. Zemel et al. [<xref ref-type="bibr" rid="ref7">7</xref>] proposed a methodology for learning fair representations, which balances predictive accuracy with group fairness. Although not specific to LLMs, this framework laid the groundwork for ongoing research into algorithmic bias and equitable model behavior. A significant contribution to this field is the DecodingTrust framework proposed by Wang et al. [8], which offers a comprehensive assessment of GPT-3.5 and GPT-4. Their study evaluates these models along several axes, including toxicity, bias, adversarial robustness, privacy, and fairness. Notwithstanding the fact that GPT-4 generally exhibits superior performance across a multitude of benchmarks, the study reveals that the model remains vulnerable to carefully crafted adversarial prompts (i.e., given jailbreaking system or user prompts) and inadvertent privacy leaks. This finding highlights concerns regarding the safe deployment of such systems.</p>
      <p>To meet this crucial need, safety benchmarks specifically designed for evaluating LLMs, attack, and defense methods have been proposed. For instance, SALAD-Bench [9] has been specifically designed for evaluating LLMs, attack, and defense methods. The experiments carried out by the authors provide insight into the resilience of LLMs to emerging threats and the efficacy of contemporary defence tactics. A large-scale, comprehensive safety evaluation of the current LLM landscape is proposed in [10]. The authors evaluate 39 LLMs on a multilingual benchmark (i.e., M-ALERT) and highlight the importance of language- and category-specific safety analysis.</p>
      <p>While significant progress has been made in developing Italian benchmarks for LLMs, current evaluations predominantly focus on comprehension and reasoning capabilities, with limited attention to safety considerations [11]. BeaverTails-IT [12] represents the first safety benchmark specifically designed for the Italian language, addressing this critical gap in evaluation resources. In light of the existing literature, which highlights the critical need for robust and comprehensive multilingual safety practices in LLMs, we propose the first evaluation of widely adopted language models specifically in the Italian language, aiming to bridge current evaluation gaps and support safer deployment in this linguistic context.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluating LLMs’ Safety</title>
      <sec id="sec-3-1">
        <title>3.1. Large Language Models</title>
        <p>The landscape of Italian-language large language models (LLMs) has recently undergone significant expansion, with the development of several notable architectures tailored for instructional and general-purpose natural language processing (NLP) tasks. In the following list, * marks models fine-tuned on Italian, while † marks models trained from scratch on Italian.</p>
        <p>• DanteLLM* [13] is based on the Mistral [14] architecture and fine-tuned on Italian data using LoRA, a parameter-efficient tuning method. The fine-tuning phase made use of several Italian datasets, including the Italian SQuAD dataset [15], 25,000 sentences from the Europarl dataset [16], Fauno’s Quora dataset, and the Camoscio dataset. We adopted the Hugging Face model: rstless-research/DanteLLM-7B-Instruct-Italian-v0.1.</p>
        <p>• Camoscio* [17] is a LoRA fine-tuning of LLaMA, with 7 billion parameters, trained on an Italian translation of the Alpaca dataset [18]. We use the following Hugging Face model: sag-uniroma2/extremITA-Camoscio-7b.</p>
        <p>• LLaMAntino* [19] is an instruction-tuned version of Meta-Llama-3-8b-instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), a fine-tuned LLaMA 3 model. The model has been supervised fine-tuned (SFT) using QLoRA on instruction-based datasets. We adopted the instruction-tuned version, which was fine-tuned on English and Italian language datasets, available on Hugging Face: swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA.</p>
        <p>• Modello Italia† is an instruction-tuned model, based on the GPT-NeoX architecture, trained with a focus on the Italian language (90% of data in Italian and the remaining 10% in English). We adopted sapienzanlp/modello-italia-9b-bf16, available on Hugging Face.</p>
        <p>• Minerva† [20] is the first family of LLMs trained entirely from scratch on native Italian texts using a portion of FineWeb, which includes filtered and deduplicated Common Crawl dumps with various timestamps. We adopted the instruction-tuned version, available at: sapienzanlp/Minerva-7B-instruct-v1.0.</p>
      <sec id="sec-1-1">
        <title>These prompts are designed to elicit one of the 14 difer</title>
        <p>ent categories of unsafe responses (1. Animal Abuse, 2.</p>
        <p>Child Abuse, 3. Controversial Topics, Politics, 4.
Discrimination, Stereotype, Injustice, 5. Drug Abuse, Weapons,
Banned Substance, 6. Financial Crime, Property Crime,
Theft, 7. Hate Speech, Ofensive Language, 8.
Misinformation regarding ethics, laws, and safety, 9.
NonViolent Unethical Behavior, 10. Privacy Violation, 11.</p>
        <p>Self-Harm, 12. Sexually Explicit, Adult Content, 13.
Terrorism, Organized Crime, 14. Violence, Aiding and
Abetting, Incitement.) An in-depth analysis of issues related
to erroneous translation and their implications for safety
benchmarking has been conducted. The results obtained
demonstrate how semantic distortions may compromise
the intended safety intent. Overall, 57.2% of translations
were unanimously judged error-free by the annotators.</p>
        <p>Semantic errors were the most common (11.2%),
primarily involving distortions or loss of the original prompt’s
intent, while grammatical issues were found in 7.4% of
cases. Further details and a breakdown of error types are
provided in [12].</p>
        <p>3.3. Evaluation Strategy
• Velvet* is a family of instruction models
finetuned using a combination of open-source
instruction datasets and synthetic datasets tailored for
solving long context problems. We adopted the 14
billion parameters version available on Hugging
Face as: Almawave/Velvet-14B.</p>
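        <p>For reference, the following minimal sketch illustrates how the listed checkpoints can be loaded and queried with the transformers library [24]; the prompt, sampling parameters, and chat-template usage are illustrative assumptions rather than our exact experimental configuration.</p>
        <preformat># Illustrative sketch: generating a response from one of the evaluated
# Italian checkpoints. Generation parameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sapienzanlp/Minerva-7B-instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt (BeaverTails-IT prompts are Italian questions).
messages = [{"role": "user",
             "content": "Come posso proteggere i miei dati personali online?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))</preformat>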
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>The BeaverTails dataset [21] is a large-scale benchmark, annotated by humans, designed to support the development and evaluation of large language models (LLMs) that are aligned with safety. Consisting of over 330,000 question–answer pairs labelled across 14 fine-grained harm categories, it also includes more than 360,000 human preference comparisons that independently rank responses for helpfulness and harmfulness. It provides a valuable foundation for advancing alignment methodologies in modern LLMs. In order to evaluate Italian LLMs, we adopted BeaverTails-IT [12] (https://huggingface.co/datasets/MIND-Lab/BeaverTails-IT-Evaluation), a comprehensive safety benchmark for the Italian language obtained through machine translation. The BeaverTails-IT dataset includes 700 prompts originally introduced in the BeaverTails dataset and translated into Italian using X-ALMA-13B.</p>
        <p>These prompts are designed to elicit one of 14 different categories of unsafe responses: 1. Animal Abuse; 2. Child Abuse; 3. Controversial Topics, Politics; 4. Discrimination, Stereotype, Injustice; 5. Drug Abuse, Weapons, Banned Substance; 6. Financial Crime, Property Crime, Theft; 7. Hate Speech, Offensive Language; 8. Misinformation regarding ethics, laws, and safety; 9. Non-Violent Unethical Behavior; 10. Privacy Violation; 11. Self-Harm; 12. Sexually Explicit, Adult Content; 13. Terrorism, Organized Crime; 14. Violence, Aiding and Abetting, Incitement.</p>
        <p>An in-depth analysis of issues related to erroneous translation and their implications for safety benchmarking has been conducted. The results obtained demonstrate how semantic distortions may compromise the intended safety intent. Overall, 57.2% of translations were unanimously judged error-free by the annotators. Semantic errors were the most common (11.2%), primarily involving distortions or loss of the original prompt’s intent, while grammatical issues were found in 7.4% of cases. Further details and a breakdown of error types are provided in [12].</p>
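        <p>For illustration, the benchmark can be retrieved with the datasets library; the split and column names in this sketch are assumptions to be checked against the dataset card.</p>
        <preformat># Illustrative sketch: loading BeaverTails-IT from the Hugging Face Hub.
# Split and column names are assumptions; consult the dataset card.
from datasets import load_dataset

beavertails_it = load_dataset("MIND-Lab/BeaverTails-IT")
print(beavertails_it)  # available splits and features

# Inspect one record, e.g., the Italian prompt and its category labels.
first_split = list(beavertails_it.keys())[0]
print(next(iter(beavertails_it[first_split])))</preformat>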
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Strategy</title>
        <p>In order to perform the analysis of unsafety, the prompts from BeaverTails-IT were adopted to generate responses from several widely used Italian large language models (LLMs), including both open-source and proprietary systems. To evaluate the safety of the resulting QA pairs, a dual approach has been employed, combining automatic and human assessments. Specifically, safety classification models (moderators) are investigated to automatically detect potentially harmful outputs based on predefined risk categories. Subsequently, human annotators evaluated a selection of responses, providing both qualitative and quantitative validation of the automatic evaluations. This process ensured the acquisition of more robust and nuanced insights into the safety behaviour of the models in the Italian language.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Safety Classification</title>
          <p>To automatically assess the safety of the LLMs, we trained several QA moderators by performing fine-tuning on a bilingual classification dataset to predict safety labels. This dataset comprised Italian QA pairs from BeaverTails-IT (https://huggingface.co/datasets/MIND-Lab/BeaverTails-IT) and English QA pairs from BeaverTails. We employed models of different nature and architecture: DeBERTa v3 large [22], an encoder-based classifier; Llama 3.1 8B Instruct [23], a generative model adapted for multi-label classification with a classification head; and Llama Guard 3 8B [23], a specialized generative model for safety classification tailored on the BeaverTails taxonomy.</p>
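          <p>As a concrete reference for the encoder-based moderator, the sketch below outlines multi-label fine-tuning of DeBERTa v3 large [22] over the 14 BeaverTails categories using transformers [24]; the toy data, label encoding, and hyperparameters are illustrative assumptions, not our exact setup.</p>
          <preformat># Illustrative sketch: multi-label safety classification with DeBERTa v3.
# Toy data and hyperparameters are assumptions, not the paper's setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_CATEGORIES = 14  # BeaverTails harm taxonomy

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large",
    num_labels=NUM_CATEGORIES,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

def encode(example):
    # A QA pair is classified as a single sequence: question plus answer.
    enc = tokenizer(example["question"], example["answer"], truncation=True)
    enc["labels"] = [float(x) for x in example["category_labels"]]
    return enc

toy = Dataset.from_dict({
    "question": ["Come posso falsificare un documento?"],
    "answer": ["Mi dispiace, non posso aiutarti con questa richiesta."],
    "category_labels": [[0.0] * NUM_CATEGORIES],  # safe refusal: no category
}).map(encode)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-moderator", num_train_epochs=1),
    train_dataset=toy,
    processing_class=tokenizer,
)
trainer.train()</preformat>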
      <sec id="sec-1-2">
        <title>2https://huggingface.co/datasets/MIND-Lab/</title>
        <p>BeaverTails-IT-Evaluation</p>
      </sec>
      <sec id="sec-1-3">
        <title>3https://huggingface.co/datasets/MIND-Lab/BeaverTails-IT</title>
        <p>trained safety classifiers have been made publicly
available on Hugging Face4,5,6.</p>
        <p>The models are evaluated on the bilingual test set and
compared against three baselines: Beaver-Dam-7B7, a
classifier fine-tuned on Beavertails, and two versions of
Llama Guard using in-context learning (ICL), where the
taxonomy is explicitly defined within the chat template.
We assessed the performance on multi-label safety
classification (Table 1) and binary classification (Table 2).</p>
      </sec>
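          <p>The scores in Tables 1 and 2 can be reproduced with standard metrics; a minimal sketch is given below, where the predictions and ground truth are toy arrays and the averaging choices are assumptions on our part.</p>
          <preformat># Minimal sketch: scoring a moderator on the multi-label and binary tasks.
# Toy predictions and labels; averaging choices are assumptions.
import numpy as np
from sklearn.metrics import f1_score

# Multi-label task: one binary indicator per harm category (14 in total).
y_true = np.array([[1, 0, 0], [0, 1, 1]])  # 3 categories shown for brevity
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(y_true, y_pred, average="macro"))

# Binary task: a QA pair is unsafe if any category is active.
print(f1_score(y_true.any(axis=1), y_pred.any(axis=1)))</preformat>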
      <sec id="sec-1-4">
        <title>All fine-tuned models outperform the three baselines</title>
        <p>on both tasks, maintaining consistent performance across
English and Italian data splits, whereas the baselines
show significant variation. Although Llama Guard and
Beavertails exhibit some overlapping categories in their
taxonomies, our results demonstrate that ICL is
inefective and necessitates fine-tuning. Binary classification
results show a significant performance gain compared to
the Llama Guard with ICL baselines, though it exhibits a
higher false-positive rate.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Implementation Details We fine-tuned all models</title>
        <p>using Hugging Face’s transformers [24] library (and TRL
[25] for Llama Guard 3), employing DeepSpeed with
ZeRO Stage 2 [26] (with the exception of DeBERTa). For</p>
      </sec>
      <sec id="sec-1-6">
        <title>4https://huggingface.co/saiteki-kai/QA-DeBERTa-v3-large 5https://huggingface.co/saiteki-kai/QA-Llama-Guard-3-8B 6https://huggingface.co/saiteki-kai/QA-Llama-3.1 7https://huggingface.co/PKU-Alignment/beaver-dam-7b</title>
        <p>
          Llama Guard 3, we employed LoRA fine-tuning [
          <xref ref-type="bibr" rid="ref8">27</xref>
          ] with
the standard causal language modeling loss. For Llama
and DeBERTa, we performed full fine-tuning and
optimized them for multi-label classification using
crossentropy loss. For each moderator model, hyperparameter
tuning was performed utilizing a 10% hold-out validation
split.
3.3.2. Human Evaluation
        </p>
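          <p>For orientation, the following sketch shows a LoRA configuration of the kind described above for Llama Guard 3 with TRL [25]; the hyperparameter values and the toy training record are illustrative assumptions, not the values used in our experiments.</p>
          <preformat># Illustrative sketch: LoRA fine-tuning of a causal LM with TRL,
# mirroring the setup described above. Hyperparameters are assumptions.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Toy stand-in for the bilingual safety-classification data (assumed format).
train_data = Dataset.from_dict({
    "text": ["User: Come posso ...?\nAssistant: ...\nSafety label: unsafe"]
})

peft_config = LoraConfig(
    r=16,                              # LoRA rank (assumed)
    lora_alpha=32,                     # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-Guard-3-8B",
    args=SFTConfig(output_dir="qa-llama-guard-3-8b", num_train_epochs=1),
    train_dataset=train_data,
    peft_config=peft_config,  # trains LoRA adapters with the standard LM loss
)
trainer.train()</preformat>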
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Human Evaluation</title>
      <sec id="sec-1-7">
        <title>To better assess the ability of the proposed moderators</title>
        <p>to identify unsafe content, a human evaluation has also
been performed. In particular, native Italian speakers
were involved to evaluate the responses generated by
three models (i.e., Velvet, Modello Italia, and Minerva).</p>
        <p>The original BeaverTails annotation guidelines and
accompanying examples were manually translated into
Italian and validated by domain experts. This translation
process aimed to preserve the original intent and nuance
of the guidelines, ensuring a faithful and accurate
adaptation to the target language. Such examples serve as a
valuable instrument for the calibration of understanding
and the alignment of judgments.</p>
        <p>During the evaluation, annotators were presented with
question–answer (QA) pairs and asked to determine
whether each response could belong to one or more of
the 14 harm categories defined in the BeaverTails-IT
taxonomy. A QA pair is therefore labeled as safe if it is
risk-neutral across all 14 harm categories, and unsafe if
it introduces potential harm in any of these categories.</p>
          <p>To further maximize the reliability of the annotation process, each QA pair was evaluated by multiple labellers, and inter-annotator agreement was computed to assess the consistency of the judgments. These human annotations serve as the gold standard against which we benchmarked the performance of the moderation models under study. To assess the level of agreement among the annotators, we computed Fleiss’ Kappa statistics [<xref ref-type="bibr" rid="ref28">28</xref>]. We obtained a score of 0.493, indicating a moderate level of agreement among the annotators who evaluated the unsafety of the content. This suggests that while annotators were generally aligned in their judgments, there was still a notable degree of inconsistency, possibly due to subjective interpretations of what constitutes “unsafe” content. An important aspect contributing to this modest agreement stems from the original BeaverTails guidelines, which assume that any content not clearly falling into predefined unsafe categories is to be considered safe. This can lead to discrepancies in cases where annotators perceive implicit risks not covered by the provided categories. Additionally, some generated content is intrinsically borderline (for instance, content that presents unsafe material while simultaneously including warnings or alerts about its unsafety), creating further ambiguity in how safety should be assessed.</p>
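          <p>The agreement statistic can be reproduced with standard tooling; below is a minimal sketch using statsmodels, assuming binary safe/unsafe judgments from three annotators (the rating matrix shown is toy data, not our annotations).</p>
          <preformat># Minimal sketch: Fleiss' Kappa over binary safe/unsafe judgments.
# The ratings below are toy data, not the study's annotations.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per QA pair, one column per annotator; 0 = safe, 1 = unsafe.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# aggregate_raters turns subject-by-rater labels into subject-by-category counts.
counts, _ = aggregate_raters(ratings)
print(fleiss_kappa(counts))  # ~0.33 for this toy matrix; the paper reports 0.493</preformat>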
          <p>An example of a QA pair that led the annotators to disagreement is reported in Table 3. The reported example illustrates how prompts involving sensitive socio-political issues, such as systemic racism, can lead to divergent safety assessments among annotators due to differing perspectives and cultural interpretations. In fact, two annotators labeled the example as safe, while the third considered it unsafe, perceiving the presence of political and controversial content. This highlights the inherent difficulty in assessing safety, as such judgments often rely on contextual understanding and cultural sensitivity. As a counterpart, Table 4 presents an example in which all annotators reached a consensus regarding the presence of unsafe content. The prompt explicitly seeks methods to psychologically torment an individual without detection, and the response contains a list of emotionally abusive statements aimed at demeaning and devaluing a person. Given the clearly harmful nature of both the prompt and the response, annotators unanimously agreed that the content violates safety guidelines and constitutes psychological abuse.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>A first analysis of the unsafety of the selected Italian large language models has been performed through the developed moderators. In particular, QA pairs, composed of prompts from the BeaverTails-IT dataset and the answers generated by the models, have been evaluated. Figure 1 reports the percentage of QA pairs predicted as unsafe by the three moderators trained on the Italian language. The analysis of unsafe QA pairs across the various Italian LLMs reveals substantial disparities in unsafety generation, with certain models exhibiting alarmingly high rates of unsafe outputs. We can easily notice that, while Llama Guard and DeBERTa exhibit a similar behaviour, Llama 3.1 tends to be less conservative, identifying a smaller number of unsafe QA pairs. As expected, the reported results identify Camoscio as the most unsafe model. This evaluation reflects the fact that Camoscio was released without safety alignment and was trained using unfiltered web data. It is therefore able to generate harmful, toxic, or illegal content and assist with malicious tasks, confirming the conclusions of the authors, who acknowledge that the model exhibits hallucinations, factual inaccuracies, and various forms of bias. In contrast, models like Minerva and LLaMAntino 3 maintain substantially lower unsafety rates (around 4–7%), suggesting more effective safety controls or alignment strategies. Interestingly, while the different QA moderators (LLaMA Guard 3 8B, LLaMA 3.1 8B, and DeBERTa v3 Large) show minor variability in their assessments, the relative safety ranking of the models remains broadly consistent. This consistency strengthens confidence in the comparative unsafety measurements. The performance gap across models highlights the importance of rigorous safety evaluation and benchmarking before deploying LLMs in real-world applications.</p>
      <p>In Table 5, we also report the classification performance of the developed Italian moderation models, i.e., Llama Guard 3, Llama 3.1 8B, and DeBERTa v3 large, in identifying unsafe content with respect to human annotations (ground truth). Performance is evaluated in terms of F1-scores according to two distinct evaluation setups. The setting “1 over 3” denotes a ground truth where a sentence has been considered unsafe if at least one annotator marked the generated text as unsafe. The other setting, “2 over 3”, denotes a ground truth where a sentence has been considered unsafe if the majority of the annotators marked the generated text as unsafe. The reported performance allows us to evaluate the reliability of the developed moderators when detecting safe and unsafe content generated by the Italian language models.</p>
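      <p>The two ground-truth settings amount to different thresholds when aggregating the three annotators’ binary judgments; a minimal sketch of this aggregation, with assumed variable names and toy votes, is shown below.</p>
      <preformat># Minimal sketch: building the two ground-truth settings from three
# binary annotator judgments (1 = unsafe). Variable names are assumed.
annotations = [
    [1, 0, 0],  # one annotator flagged the response
    [1, 1, 0],  # the majority flagged it
    [0, 0, 0],  # nobody flagged it
]

# "1 over 3": unsafe if at least one annotator marked it unsafe.
gt_1_over_3 = [int(any(votes)) for votes in annotations]

# "2 over 3": unsafe only under majority agreement.
gt_2_over_3 = [int(sum(votes) >= 2) for votes in annotations]

print(gt_1_over_3)  # [1, 1, 0]
print(gt_2_over_3)  # [0, 1, 0]</preformat>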
      <sec id="sec-1-8">
        <title>While the first setting represents a strict scenario, the sec</title>
        <p>ond one considers the majority of annotators, resulting
in a less conservative scenario.</p>
        <p>Considering both settings, Llama Guard 3 consistently
achieves the highest overall F1-Scores. The more
permissive setting (2 over 3), as expected, achieves the highest
F1-score, reflecting a larger agreement on what is
considered safe and unsafe. In contrast, the restrictive setting
(1 over 3) shows modest recognition capabilities. These
ifndings suggest that moderation performance is
sensitive to what can be perceived as unsafe, with Llama
Guard 3 ofering the most reliable moderator across
different settings. In particular, the highest recognition
performances under the majority voting setting suggest
that the developed moderators tend to be more
permissive when labelling content as unsafe. This approach
aligns closely with the majority of perceptions, where
content is typically considered unsafe only when there
is clear, shared agreement on its harmfulness. In this
sense, majority voting filters out individual model biases
and amplifies the collective judgment of the moderation
systems, efectively approximating the majority opinion
of human evaluators.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusions</title>
      <sec id="sec-2-1">
        <title>This work presented the first systematic and multidi</title>
        <p>mensional evaluation of safety in Italian Large Language
Models. Our findings reveal that despite overall progress
in LLM capabilities, significant safety issues persist across
multiple models, particularly in the dimensions of bias,
toxicity, and fairness. By developing dedicated
Italianlanguage moderators and highlighting the limitations of
translation-based approaches, we underscore the need for
language-specific tools and methodologies. This study
not only sheds light on overlooked vulnerabilities in
underrepresented languages like Italian but also sets a
foundation for more culturally and linguistically aware model
evaluation practices. Future work will focus on
expanding the set of safety dimensions, incorporating broader
social contexts, and applying our framework to other
low- and mid-resource languages to promote equitable
and responsible AI development globally.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>We acknowledge the support of the PNRR ICSC National</title>
        <p>Research Centre for High Performance Computing, Big
Data and Quantum Computing (CN00000013), under the
NRRP MUR program funded by the NextGenerationEU.</p>
        <p>This work has also been supported by ReGAInS,
Department of Excellence. The authors would also like to thank
Fastweb S.p.a. for providing the computational resources
that enabled the safety evaluation. Their support was
fundamental in facilitating such a large-scale analysis.
conference on machine learning, PMLR, 2013, pp. translation summit x: papers, 2005, pp. 79–86.
325–333. [17] A. Santilli, E. Rodolà, Camoscio: an
ital[8] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, ian instruction-tuned llama, arXiv preprint
C. Xu, Z. Xiong, R. Dutta, R. Schaefer, et al., De- arXiv:2307.16456 (2023).
codingtrust: A comprehensive assessment of trust- [18] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
worthiness in gpt models., 2023. C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
[9] L. Li, B. Dong, R. Wang, X. Hu, W. Zuo, D. Lin, alpaca: An instruction-following llama model, 2023.</p>
        <p>Y. Qiao, J. Shao, Salad-bench: A hierarchical and [19] M. Polignano, P. Basile, G. Semeraro, Advanced
comprehensive safety benchmark for large lan- natural-based interaction for the italian language:
guage models, in: Findings of the Association Llamantino-3-anita, 2024. arXiv:2405.07101.
for Computational Linguistics: ACL 2024, 2024, pp. [20] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S.
Co3923–3954. nia, E. Barba, S. Orlandini, G. Fiameni, R.
Nav[10] F. Friedrich, S. Tedeschi, P. Schramowski, M. Brack, igli, Minerva LLMs: The first family of large
R. Navigli, H. Nguyen, B. Li, K. Kersting, Llms lost in language models trained from scratch on Italian
translation: M-alert uncovers cross-linguistic safety data, in: F. Dell’Orletta, A. Lenci, S. Montemagni,
gaps, arXiv preprint arXiv:2412.15035 (2024). R. Sprugnoli (Eds.), Proceedings of the 10th Italian
[11] L. Moroni, S. Conia, F. Martelli, R. Navigli, To- Conference on Computational Linguistics
(CLiCwards a more comprehensive evaluation for Italian it 2024), CEUR Workshop Proceedings, Pisa, Italy,
LLMs, in: F. Dell’Orletta, A. Lenci, S. Montemagni, 2024, pp. 707–719. URL: https://aclanthology.org/
R. Sprugnoli (Eds.), Proceedings of the 10th Italian 2024.clicit-1.77/.</p>
        <p>Conference on Computational Linguistics (CLiC- [21] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu,
it 2024), CEUR Workshop Proceedings, Pisa, Italy, H. Chen, X. Yi, C. Wang, Y. Wang, et al., A
sur2024, pp. 584–599. URL: https://aclanthology.org/ vey on evaluation of large language models, ACM
2024.clicit-1.67/. transactions on intelligent systems and technology
[12] G. Magazzù, A. Sormani, G. Rizzi, F. Pulerà, 15 (2024) 1–45.</p>
        <p>D. Scalena, S. Cariddi, E. Michielon, M. Pasqualini, [22] P. He, J. Gao, W. Chen, Debertav3:
ImprovC. Stamile, E. Fersini, BeaverTails-IT: Towards A ing deberta using electra-style pre-training with
Safety Benchmark for Evaluating Italian Large Lan- gradient-disentangled embedding sharing, 2021.
guage Models, in: Proceedings of the Eleventh arXiv:2111.09543.</p>
        <p>Italian Conference on Computational Linguistics [23] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A.
Ka(CLiC-it 2025), 2025. dian, A. Al-Dahle, A. Letman, A. Mathur, A.
Schel[13] A. Bacciu, C. Campagnano, G. Trappolini, F. Sil- ten, A. Vaughan, et al., The llama 3 herd of models,
vestri, DanteLLM: Let‘s push Italian LLM research arXiv preprint arXiv:2407.21783 (2024).
forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste, [24] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C.
DeA. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of langue, A. Moi, P. Cistac, T. Rault, R. Louf, M.
Funthe 2024 Joint International Conference on Com- towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
putational Linguistics, Language Resources and Y. Jernite, J. Plu, C. Xu, T. Le Scao, S.
GugEvaluation (LREC-COLING 2024), ELRA and ICCL, ger, M. Drame, Q. Lhoest, A. Rush,
TransTorino, Italia, 2024, pp. 4343–4355. URL: https: formers: State-of-the-art natural language
pro//aclanthology.org/2024.lrec-main.388/. cessing, in: Q. Liu, D. Schlangen (Eds.),
Pro[14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam- ceedings of the 2020 Conference on Empirical
ford, D. S. Chaplot, D. de las Casas, F. Bressand, Methods in Natural Language Processing: System
G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.- Demonstrations, Association for Computational
A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, Linguistics, Online, 2020, pp. 38–45. URL: https:
T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https: //aclanthology.org/2020.emnlp-demos.6/. doi:10.
//arxiv.org/abs/2310.06825. arXiv:2310.06825. 18653/v1/2020.emnlp-demos.6.
[15] D. Croce, A. Zelenanska, R. Basili, Neural learn- [25] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching,
ing for question answering in italian, in: AI* IA T. Thrush, N. Lambert, S. Huang, K. Rasul, Q.
Gal2018–Advances in Artificial Intelligence: XVIIth louédec, Trl: Transformer reinforcement learning,
International Conference of the Italian Association https://github.com/huggingface/trl, 2020.
for Artificial Intelligence, Trento, Italy, November [26] G. Wang, H. Qin, S. Ade Jacobs, X. Wu, C. Holmes,
20–23, 2018, Proceedings 17, Springer, 2018, pp. 389– Z. Yao, S. Rajbhandari, O. Ruwase, F. Yang, L. Yang,
402. Y. He, Zero++: Extremely eficient collective
com[16] P. Koehn, Europarl: A parallel corpus for statistical munication for large model training, in: ICLR 2024,
machine translation, in: Proceedings of machine 2024.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword and Grammar and spelling check. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          , E. Ježek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2025</year>
          ),
          <source>in: Proceedings of the Eleventh Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Sakib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Arifin</surname>
          </string-name>
          , Risks, causes, and
          <article-title>mitigations of widespread deployments of large language models (llms): A survey</article-title>
          ,
          <source>in: 2024 2nd International Conference on Artificial Intelligence</source>
          , Blockchain, and
          <article-title>Internet of Things (AIBThings)</article-title>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Ton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          , H. Cheng, Y. Klochkov,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Taufiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Trustworthy llms: a survey and guideline for evaluating large language models' alignment</article-title>
          ,
          <source>arXiv preprint arXiv:2308.05374</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Cui,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zou</surname>
          </string-name>
          , X. Cheng, H.
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Revisiting out-of-distribution robustness in nlp: Benchmarks, analysis, and llms evaluations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>58478</fpage>
          -
          <lpage>58507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAnallen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shajari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Levitan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sim</surname>
          </string-name>
          ,
          <article-title>Synthetic text generation with diferential privacy: A simple and practical recipe</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1342</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>74</volume>
          /. doi:
          <volume>10</volume>
          . 18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>74</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Q. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Word-level textual adversarial attacking as combinatorial optimization</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6066</fpage>
          -
          <lpage>6080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pitassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Learning fair representations</article-title>
          , in: International
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen, LoRA:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id= nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Fleiss</surname>
          </string-name>
          ,
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .,
          <source>Psychological bulletin 76</source>
          (
          <year>1971</year>
          )
          <fpage>378</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>