<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Uncovering Unsafety Traits in Italian Language Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Giulia Rizzi</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Giuseppe Magazzù</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alberto Sormani</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Francesca Pulerà</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniel Scalena</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Elisabetta Fersini</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Groningen</institution>
          ,
          <addr-line>CLCG, Groningen</addr-line>
          ,
          <country country="NL">The Netherlands</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Milano-Bicocca</institution>
          ,
          <addr-line>Milan</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>Large Language Models (LLMs) are increasingly deployed in real-world applications, raising urgent concerns around their safety, reliability, and ethical behavior. While existing safety evaluations have primarily focused on English, low- and mid-resource languages such as Italian remain critically underexplored. In this paper, we present the first comprehensive and multidimensional evaluation of LLM safety in the Italian language. We assess seven state-of-the-art LLMs across key safety dimensions using several automatic moderators tailored to the Italian setting. Furthermore, we analyze the challenges of adapting English-centric safety benchmarks to Italian via machine translation, highlighting limitations and proposing best practices for developing culturally and linguistically grounded evaluation frameworks. WARNING: This paper contains content that may be considered offensive.</p>
      </abstract>
      <kwd-group>
<kwd>Safety Evaluation</kwd>
        <kwd>Large Language Models (LLMs)</kwd>
        <kwd>Italian Language</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
<p>Large Language Models (LLMs) have rapidly become central to numerous applications, including conversational agents, content generation, and decision support systems in sensitive areas. However, as these models become more complex and widespread, concerns about their safety, reliability, and ethical deployment are growing. The evaluation of LLMs is no longer limited to measures of accuracy or fluency, but increasingly encompasses assessments of their unsafety.</p>
<p>This latter evaluation encompasses dimensions such as bias, toxicity, robustness to adversarial prompts, factual consistency, privacy preservation, and fairness.</p>
<p>Despite this growing awareness, a substantial
portion of the literature on safety remains centred on
high-resource languages, particularly English. The absence
of comprehensive evaluations tailored to specific
languages, including Italian, introduces a risk of
overlooking language-specific vulnerabilities and sociolinguistic
nuances that may influence model behaviour. Given the
global deployment of many LLMs and their interaction
with users across a broad spectrum of languages, this
imbalance poses practical and ethical challenges.</p>
<p>In this paper, we aim to address this gap by presenting the first comprehensive evaluation of LLM safety focused exclusively on the Italian language. We systematically assess commonly adopted LLMs across multiple dimensions of safety, adapting existing safety benchmarks. The objective of this study is to provide a fair evaluation of the unsafe behaviour of Italian Large Language Models, with a focus on identifying potential risks and informing future development and deployment practices.</p>
      <p>The primary contributions of this work are as follows:</p>
      <p>1. We present the first systematic and multidimensional unsafety evaluation of Italian Large Language Models (LLMs), which highlights the need, in some cases, to focus more on aligning the models towards more ethical behaviour. In particular, we performed a comparative evaluation of seven state-of-the-art Italian LLMs using both automatic and human-based evaluations.</p>
      <p>2. We developed three moderators to automatically evaluate and classify prompt–response pairs for the Italian language, enabling a nuanced assessment of unsafe behaviours within a predefined set of categories. In particular, we implemented DeBERTa v3 large, LLaMA 3.1 8B Instruct, and LLaMA Guard 3 8B for the Italian language.</p>
      <p>3. We provide an in-depth analysis of issues related to erroneous translation and their implications for safety benchmarking. We propose methodological recommendations for the development of culturally sensitive and linguistically appropriate safety benchmarks, with implications for the broader goal of equitable and responsible deployment of LLMs across diverse linguistic contexts.</p>
      <p>The paper is organized as follows. In Section 2, related works are outlined. In Section 3, the comparative evaluation of unsafety is described. In Section 4, the main outcomes are discussed. Finally, in Section 5, conclusions and future works are described.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Works</title>
      <p>The increasing adoption of large language models (LLMs), including generative pre-trained transformers (GPTs), in both daily tasks and more specific applications has led to a substantial increase in interest regarding their reliability [<xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>]. Yuan et al. [<xref ref-type="bibr" rid="ref4">4</xref>] conducted a study to investigate the behaviour of NLP models under out-of-distribution conditions. The study demonstrated that state-of-the-art language models continue to exhibit brittleness when confronted with data that deviates from their training distributions. This finding reinforces the prevailing argument that current generalisation capabilities are inadequate for a considerable number of real-world applications. Another area of research focuses on privacy concerns. Yue et al. [<xref ref-type="bibr" rid="ref5">5</xref>] present a simple method for generating synthetic text data while mitigating privacy risks, and conduct comprehensive experiments evaluating both utility and privacy.</p>
      <p>Other critical aspects of trustworthiness research are adversarial attacks on language models and the fairness of machine learning models. Zang et al. [<xref ref-type="bibr" rid="ref6">6</xref>] framed word-level adversarial perturbations as a combinatorial optimization problem, demonstrating that even minor textual modifications can significantly degrade model performance. Zemel et al. [<xref ref-type="bibr" rid="ref7">7</xref>] proposed a methodology for learning fair representations, which balances predictive accuracy with group fairness. Although not specific to LLMs, this framework laid the groundwork for ongoing research into algorithmic bias and equitable model behavior. A significant contribution to this field is the DecodingTrust framework proposed by Wang et al. [8], which offers a comprehensive assessment of GPT-3.5 and GPT-4. Their study evaluates these models along several axes, including toxicity, bias, adversarial robustness, privacy, and fairness. Notwithstanding the fact that GPT-4 generally exhibits superior performance across a multitude of benchmarks, the study reveals that the model remains vulnerable to carefully crafted adversarial prompts (i.e., given jailbreaking system or user prompts) and inadvertent privacy leaks. This finding highlights concerns regarding the safe deployment of such systems.</p>
      <p>To meet this crucial need, safety benchmarks specifically designed for evaluating LLMs, attack, and defense methods have been proposed. For instance, SALAD-Bench [9] has been specifically designed for evaluating LLMs, attack, and defense methods. The experiments carried out by the authors provide insight into the resilience of LLMs to emerging threats and the efficacy of contemporary defence tactics. A large-scale, comprehensive safety evaluation of the current LLM landscape is proposed in [10]. The authors evaluate 39 LLMs on a multilingual benchmark (i.e., M-ALERT) and highlight the importance of language- and category-specific safety analysis.</p>
      <p>While significant progress has been made in developing Italian benchmarks for LLMs, current evaluations predominantly focus on comprehension and reasoning capabilities, with limited attention to safety considerations [11]. BeaverTails-IT [12] represents the first safety benchmark specifically designed for the Italian language, addressing this critical gap in evaluation resources. In light of the existing literature, which highlights the critical need for robust and comprehensive multilingual safety practices in LLMs, we propose the first evaluation of widely adopted language models specifically in the Italian language, aiming to bridge current evaluation gaps and support safer deployment in this linguistic context.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Evaluating LLMs’ Safety</title>
      <sec id="sec-3-1">
        <title>3.1. Large Language Models</title>
        <p>The landscape of Italian-language large language models (LLMs) has recently undergone significant expansion, with the development of several notable architectures tailored for instructional and general-purpose natural language processing (NLP) tasks. In the following list, * marks models fine-tuned on Italian, while † marks models trained from scratch on Italian.</p>
        <p>• DanteLLM* [13] is based on the Mistral [14] architecture and fine-tuned on Italian data using LoRA, a parameter-efficient tuning method. The fine-tuning phase made use of several Italian datasets, including the Italian SQuAD dataset [15], 25,000 sentences from the Europarl dataset [16], Fauno’s Quora dataset, and the Camoscio dataset. We adopted the Hugging Face model: rstless-research/DanteLLM-7B-Instruct-Italian-v0.1.</p>
        <p>• Camoscio* [17] is a LoRA fine-tuning of LLaMA, with 7 billion parameters, trained on an Italian translation of the Alpaca dataset [18]. We use the following Hugging Face model: sag-uniroma2/extremITA-Camoscio-7b.</p>
        <p>• LLaMAntino* [19] is an instruction-tuned version of Meta-Llama-3-8b-instruct (https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct), a fine-tuned LLaMA 3 model. The model has been supervised fine-tuned (SFT) using QLoRA on instruction-based datasets. We adopted the instruction-tuned version, which was fine-tuned on English and Italian language datasets, available on Hugging Face: swap-uniba/LLaMAntino-3-ANITA-8B-Inst-DPO-ITA.</p>
        <p>• Modello Italia† is an instruction-tuned model, based on the GPT-NeoX architecture, trained with a focus on the Italian language (90% of data in Italian and the remaining 10% in English). We adopted sapienzanlp/modello-italia-9b-bf16, available on Hugging Face.</p>
        <p>• Minerva† [20] is the first family of LLMs trained entirely from scratch on native Italian texts using a portion of FineWeb, which includes filtered and deduplicated Common Crawl dumps with various timestamps. We adopted the instruction-tuned version, available at: sapienzanlp/Minerva-7B-instruct-v1.0.</p>
      <sec id="sec-1-1">
        <title>These prompts are designed to elicit one of the 14 difer</title>
        <p>ent categories of unsafe responses (1. Animal Abuse, 2.</p>
        <p>Child Abuse, 3. Controversial Topics, Politics, 4.
Discrimination, Stereotype, Injustice, 5. Drug Abuse, Weapons,
Banned Substance, 6. Financial Crime, Property Crime,
Theft, 7. Hate Speech, Ofensive Language, 8.
Misinformation regarding ethics, laws, and safety, 9.
NonViolent Unethical Behavior, 10. Privacy Violation, 11.</p>
        <p>Self-Harm, 12. Sexually Explicit, Adult Content, 13.
Terrorism, Organized Crime, 14. Violence, Aiding and
Abetting, Incitement.) An in-depth analysis of issues related
to erroneous translation and their implications for safety
benchmarking has been conducted. The results obtained
demonstrate how semantic distortions may compromise
the intended safety intent. Overall, 57.2% of translations
were unanimously judged error-free by the annotators.</p>
        <p>Semantic errors were the most common (11.2%),
primarily involving distortions or loss of the original prompt’s
intent, while grammatical issues were found in 7.4% of
cases. Further details and a breakdown of error types are
provided in [12].</p>
        <p>3.3. Evaluation Strategy
• Velvet* is a family of instruction models
finetuned using a combination of open-source
instruction datasets and synthetic datasets tailored for
solving long context problems. We adopted the 14
billion parameters version available on Hugging
Face as: Almawave/Velvet-14B.</p>
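        <p>For reference, the following minimal sketch illustrates how the listed checkpoints can be loaded and queried with the transformers library [24]; the prompt, sampling parameters, and chat-template usage are illustrative assumptions rather than our exact experimental configuration.</p>
        <preformat># Illustrative sketch: generating a response from one of the evaluated
# Italian checkpoints. Generation parameters are assumptions.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "sapienzanlp/Minerva-7B-instruct-v1.0"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

# Build a chat-formatted prompt (BeaverTails-IT prompts are Italian questions).
messages = [{"role": "user",
             "content": "Come posso proteggere i miei dati personali online?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))</preformat>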
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Dataset</title>
        <p>The BeaverTails dataset [21] is a large-scale benchmark, annotated by humans, designed to support the development and evaluation of large language models (LLMs) that are aligned with safety. Consisting of over 330,000 question–answer pairs labelled across 14 fine-grained harm categories, it also includes more than 360,000 human preference comparisons that independently rank responses for helpfulness and harmfulness. It provides a valuable foundation for advancing alignment methodologies in modern LLMs. In order to evaluate Italian LLMs, we adopted BeaverTails-IT [12] (https://huggingface.co/datasets/MIND-Lab/BeaverTails-IT-Evaluation), a comprehensive safety benchmark for the Italian language obtained through machine translation. The BeaverTails-IT dataset includes 700 prompts originally introduced in the BeaverTails dataset and translated into Italian using X-ALMA-13B.</p>
        <p>These prompts are designed to elicit one of 14 different categories of unsafe responses: 1. Animal Abuse; 2. Child Abuse; 3. Controversial Topics, Politics; 4. Discrimination, Stereotype, Injustice; 5. Drug Abuse, Weapons, Banned Substance; 6. Financial Crime, Property Crime, Theft; 7. Hate Speech, Offensive Language; 8. Misinformation regarding ethics, laws, and safety; 9. Non-Violent Unethical Behavior; 10. Privacy Violation; 11. Self-Harm; 12. Sexually Explicit, Adult Content; 13. Terrorism, Organized Crime; 14. Violence, Aiding and Abetting, Incitement.</p>
        <p>An in-depth analysis of issues related to erroneous translation and their implications for safety benchmarking has been conducted. The results obtained demonstrate how semantic distortions may compromise the intended safety intent. Overall, 57.2% of translations were unanimously judged error-free by the annotators. Semantic errors were the most common (11.2%), primarily involving distortions or loss of the original prompt’s intent, while grammatical issues were found in 7.4% of cases. Further details and a breakdown of error types are provided in [12].</p>
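        <p>For illustration, the benchmark can be retrieved with the datasets library; the split and column names in this sketch are assumptions to be checked against the dataset card.</p>
        <preformat># Illustrative sketch: loading BeaverTails-IT from the Hugging Face Hub.
# Split and column names are assumptions; consult the dataset card.
from datasets import load_dataset

beavertails_it = load_dataset("MIND-Lab/BeaverTails-IT")
print(beavertails_it)  # available splits and features

# Inspect one record, e.g., the Italian prompt and its category labels.
first_split = list(beavertails_it.keys())[0]
print(next(iter(beavertails_it[first_split])))</preformat>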
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Evaluation Strategy</title>
        <p>In order to perform the analysis of unsafety, the prompts from BeaverTails-IT were adopted to generate responses from several widely used Italian large language models (LLMs), including both open-source and proprietary systems. To evaluate the safety of the resulting QA pairs, a dual approach has been employed, combining automatic and human assessments. Specifically, safety classification models (moderators) are investigated to automatically detect potentially harmful outputs based on predefined risk categories. Subsequently, human annotators evaluated a selection of responses, providing both qualitative and quantitative validation of the automatic evaluations. This process ensured the acquisition of more robust and nuanced insights into the safety behaviour of the models in the Italian language.</p>
        <sec id="sec-3-3-1">
          <title>3.3.1. Safety Classification</title>
          <p>To automatically assess the safety of the LLMs, we trained several QA moderators by performing fine-tuning on a bilingual classification dataset to predict safety labels. This dataset comprised Italian QA pairs from BeaverTails-IT (https://huggingface.co/datasets/MIND-Lab/BeaverTails-IT) and English QA pairs from BeaverTails. We employed models of different nature and architecture: DeBERTa v3 large [22], an encoder-based classifier; Llama 3.1 8B Instruct [23], a generative model adapted for multi-label classification with a classification head; and Llama Guard 3 8B [23], a specialized generative model for safety classification tailored on the BeaverTails taxonomy.</p>
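          <p>As a concrete reference for the encoder-based moderator, the sketch below outlines multi-label fine-tuning of DeBERTa v3 large [22] over the 14 BeaverTails categories using transformers [24]; the toy data, label encoding, and hyperparameters are illustrative assumptions, not our exact setup.</p>
          <preformat># Illustrative sketch: multi-label safety classification with DeBERTa v3.
# Toy data and hyperparameters are assumptions, not the paper's setup.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

NUM_CATEGORIES = 14  # BeaverTails harm taxonomy

tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-large")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/deberta-v3-large",
    num_labels=NUM_CATEGORIES,
    problem_type="multi_label_classification",  # sigmoid outputs + BCE loss
)

def encode(example):
    # A QA pair is classified as a single sequence: question plus answer.
    enc = tokenizer(example["question"], example["answer"], truncation=True)
    enc["labels"] = [float(x) for x in example["category_labels"]]
    return enc

toy = Dataset.from_dict({
    "question": ["Come posso falsificare un documento?"],
    "answer": ["Mi dispiace, non posso aiutarti con questa richiesta."],
    "category_labels": [[0.0] * NUM_CATEGORIES],  # safe refusal: no category
}).map(encode)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="qa-moderator", num_train_epochs=1),
    train_dataset=toy,
    processing_class=tokenizer,
)
trainer.train()</preformat>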
      <sec id="sec-1-2">
        <title>2https://huggingface.co/datasets/MIND-Lab/</title>
        <p>BeaverTails-IT-Evaluation</p>
      </sec>
      <sec id="sec-1-3">
        <title>3https://huggingface.co/datasets/MIND-Lab/BeaverTails-IT</title>
        <p>trained safety classifiers have been made publicly
available on Hugging Face4,5,6.</p>
        <p>The models are evaluated on the bilingual test set and
compared against three baselines: Beaver-Dam-7B7, a
classifier fine-tuned on Beavertails, and two versions of
Llama Guard using in-context learning (ICL), where the
taxonomy is explicitly defined within the chat template.
We assessed the performance on multi-label safety
classification (Table 1) and binary classification (Table 2).</p>
      </sec>
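          <p>The scores in Tables 1 and 2 can be reproduced with standard metrics; a minimal sketch is given below, where the predictions and ground truth are toy arrays and the averaging choices are assumptions on our part.</p>
          <preformat># Minimal sketch: scoring a moderator on the multi-label and binary tasks.
# Toy predictions and labels; averaging choices are assumptions.
import numpy as np
from sklearn.metrics import f1_score

# Multi-label task: one binary indicator per harm category (14 in total).
y_true = np.array([[1, 0, 0], [0, 1, 1]])  # 3 categories shown for brevity
y_pred = np.array([[1, 0, 0], [0, 1, 0]])
print(f1_score(y_true, y_pred, average="macro"))

# Binary task: a QA pair is unsafe if any category is active.
print(f1_score(y_true.any(axis=1), y_pred.any(axis=1)))</preformat>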
      <sec id="sec-1-4">
        <title>All fine-tuned models outperform the three baselines</title>
        <p>on both tasks, maintaining consistent performance across
English and Italian data splits, whereas the baselines
show significant variation. Although Llama Guard and
Beavertails exhibit some overlapping categories in their
taxonomies, our results demonstrate that ICL is
inefective and necessitates fine-tuning. Binary classification
results show a significant performance gain compared to
the Llama Guard with ICL baselines, though it exhibits a
higher false-positive rate.</p>
      </sec>
      <sec id="sec-1-5">
        <title>Implementation Details We fine-tuned all models</title>
        <p>using Hugging Face’s transformers [24] library (and TRL
[25] for Llama Guard 3), employing DeepSpeed with
ZeRO Stage 2 [26] (with the exception of DeBERTa). For</p>
      </sec>
      <sec id="sec-1-6">
        <title>4https://huggingface.co/saiteki-kai/QA-DeBERTa-v3-large 5https://huggingface.co/saiteki-kai/QA-Llama-Guard-3-8B 6https://huggingface.co/saiteki-kai/QA-Llama-3.1 7https://huggingface.co/PKU-Alignment/beaver-dam-7b</title>
        <p>
          Llama Guard 3, we employed LoRA fine-tuning [
          <xref ref-type="bibr" rid="ref8">27</xref>
          ] with
the standard causal language modeling loss. For Llama
and DeBERTa, we performed full fine-tuning and
optimized them for multi-label classification using
crossentropy loss. For each moderator model, hyperparameter
tuning was performed utilizing a 10% hold-out validation
split.
3.3.2. Human Evaluation
        </p>
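          <p>For orientation, the following sketch shows a LoRA configuration of the kind described above for Llama Guard 3 with TRL [25]; the hyperparameter values and the toy training record are illustrative assumptions, not the values used in our experiments.</p>
          <preformat># Illustrative sketch: LoRA fine-tuning of a causal LM with TRL,
# mirroring the setup described above. Hyperparameters are assumptions.
from datasets import Dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

# Toy stand-in for the bilingual safety-classification data (assumed format).
train_data = Dataset.from_dict({
    "text": ["User: Come posso ...?\nAssistant: ...\nSafety label: unsafe"]
})

peft_config = LoraConfig(
    r=16,                              # LoRA rank (assumed)
    lora_alpha=32,                     # scaling factor (assumed)
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model="meta-llama/Llama-Guard-3-8B",
    args=SFTConfig(output_dir="qa-llama-guard-3-8b", num_train_epochs=1),
    train_dataset=train_data,
    peft_config=peft_config,  # trains LoRA adapters with the standard LM loss
)
trainer.train()</preformat>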
        </sec>
        <sec id="sec-3-3-2">
          <title>3.3.2. Human Evaluation</title>
      <sec id="sec-1-7">
        <title>To better assess the ability of the proposed moderators</title>
        <p>to identify unsafe content, a human evaluation has also
been performed. In particular, native Italian speakers
were involved to evaluate the responses generated by
three models (i.e., Velvet, Modello Italia, and Minerva).</p>
        <p>The original BeaverTails annotation guidelines and
accompanying examples were manually translated into
Italian and validated by domain experts. This translation
process aimed to preserve the original intent and nuance
of the guidelines, ensuring a faithful and accurate
adaptation to the target language. Such examples serve as a
valuable instrument for the calibration of understanding
and the alignment of judgments.</p>
        <p>During the evaluation, annotators were presented with
question–answer (QA) pairs and asked to determine
whether each response could belong to one or more of
the 14 harm categories defined in the BeaverTails-IT
taxonomy. A QA pair is therefore labeled as safe if it is
risk-neutral across all 14 harm categories, and unsafe if
it introduces potential harm in any of these categories.</p>
          <p>To further maximize the reliability of the annotation process, each QA pair was evaluated by multiple labellers, and inter-annotator agreement was computed to assess the consistency of the judgments. These human annotations serve as the gold standard against which we benchmarked the performance of the moderation models under study. To assess the level of agreement among the annotators, we computed Fleiss’ Kappa statistics [<xref ref-type="bibr" rid="ref28">28</xref>]. We obtained a score of 0.493, indicating a moderate level of agreement among the annotators who evaluated the unsafety of the content. This suggests that while annotators were generally aligned in their judgments, there was still a notable degree of inconsistency, possibly due to subjective interpretations of what constitutes “unsafe” content. An important aspect contributing to this modest agreement stems from the original BeaverTails guidelines, which assume that any content not clearly falling into predefined unsafe categories is to be considered safe. This can lead to discrepancies in cases where annotators perceive implicit risks not covered by the provided categories. Additionally, some generated content is intrinsically borderline (for instance, content that presents unsafe material while simultaneously including warnings or alerts about its unsafety), creating further ambiguity in how safety should be assessed.</p>
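          <p>The agreement statistic can be reproduced with standard tooling; below is a minimal sketch using statsmodels, assuming binary safe/unsafe judgments from three annotators (the rating matrix shown is toy data, not our annotations).</p>
          <preformat># Minimal sketch: Fleiss' Kappa over binary safe/unsafe judgments.
# The ratings below are toy data, not the study's annotations.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# One row per QA pair, one column per annotator; 0 = safe, 1 = unsafe.
ratings = np.array([
    [0, 0, 0],
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 0],
])

# aggregate_raters turns subject-by-rater labels into subject-by-category counts.
counts, _ = aggregate_raters(ratings)
print(fleiss_kappa(counts))  # ~0.33 for this toy matrix; the paper reports 0.493</preformat>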
          <p>An example of a QA pair that led the annotators to disagreement is reported in Table 3. The reported example illustrates how prompts involving sensitive socio-political issues, such as systemic racism, can lead to divergent safety assessments among annotators due to differing perspectives and cultural interpretations. In fact, two annotators labeled the example as safe, while the third considered it unsafe, perceiving the presence of political and controversial content. This highlights the inherent difficulty in assessing safety, as such judgments often rely on contextual understanding and cultural sensitivity. As a counterpart, Table 4 presents an example in which all annotators reached a consensus regarding the presence of unsafe content. The prompt explicitly seeks methods to psychologically torment an individual without detection, and the response contains a list of emotionally abusive statements aimed at demeaning and devaluing a person. Given the clearly harmful nature of both the prompt and the response, annotators unanimously agreed that the content violates safety guidelines and constitutes psychological abuse.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Discussion</title>
      <p>A first analysis of the unsafety of the selected Italian large language models has been performed through the developed moderators. In particular, QA pairs, composed of prompts from the BeaverTails-IT dataset and the answers generated by the models, have been evaluated. Figure 1 reports the percentage of QA pairs predicted as unsafe by the three moderators trained on the Italian language. The analysis of unsafe QA pairs across the various Italian LLMs reveals substantial disparities in unsafety generation, with certain models exhibiting alarmingly high rates of unsafe outputs. We can easily notice that, while Llama Guard and DeBERTa exhibit a similar behaviour, Llama 3.1 tends to be less conservative, identifying a smaller number of unsafe QA pairs. As expected, the reported results identify Camoscio as the most unsafe model. This evaluation reflects the fact that Camoscio was released without safety alignment and was trained using unfiltered web data. It is therefore able to generate harmful, toxic, or illegal content and assist with malicious tasks, confirming the conclusions of the authors, who acknowledge that the model exhibits hallucinations, factual inaccuracies, and various forms of bias. In contrast, models like Minerva and LLaMAntino 3 maintain substantially lower unsafety rates (around 4–7%), suggesting more effective safety controls or alignment strategies. Interestingly, while the different QA moderators (LLaMA Guard 3 8B, LLaMA 3.1 8B, and DeBERTa v3 Large) show minor variability in their assessments, the relative safety ranking of the models remains broadly consistent. This consistency strengthens confidence in the comparative unsafety measurements. The performance gap across models highlights the importance of rigorous safety evaluation and benchmarking before deploying LLMs in real-world applications.</p>
      <p>In Table 5, we also report the classification performance of the developed Italian moderation models, i.e., Llama Guard 3, Llama 3.1 8B, and DeBERTa v3 large, in identifying unsafe content with respect to human annotations (ground truth). Performance is evaluated in terms of F1-scores according to two distinct evaluation setups. The setting “1 over 3” denotes a ground truth where a sentence has been considered unsafe if at least one annotator marked the generated text as unsafe. The other setting, “2 over 3”, denotes a ground truth where a sentence has been considered unsafe if the majority of the annotators marked the generated text as unsafe. The reported performance allows us to evaluate the reliability of the developed moderators when detecting safe and unsafe content generated by the Italian language models.</p>
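      <p>The two ground-truth settings amount to different thresholds when aggregating the three annotators’ binary judgments; a minimal sketch of this aggregation, with assumed variable names and toy votes, is shown below.</p>
      <preformat># Minimal sketch: building the two ground-truth settings from three
# binary annotator judgments (1 = unsafe). Variable names are assumed.
annotations = [
    [1, 0, 0],  # one annotator flagged the response
    [1, 1, 0],  # the majority flagged it
    [0, 0, 0],  # nobody flagged it
]

# "1 over 3": unsafe if at least one annotator marked it unsafe.
gt_1_over_3 = [int(any(votes)) for votes in annotations]

# "2 over 3": unsafe only under majority agreement.
gt_2_over_3 = [int(sum(votes) >= 2) for votes in annotations]

print(gt_1_over_3)  # [1, 1, 0]
print(gt_2_over_3)  # [0, 1, 0]</preformat>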
      <sec id="sec-1-8">
        <title>While the first setting represents a strict scenario, the sec</title>
        <p>ond one considers the majority of annotators, resulting
in a less conservative scenario.</p>
        <p>Considering both settings, Llama Guard 3 consistently
achieves the highest overall F1-Scores. The more
permissive setting (2 over 3), as expected, achieves the highest
F1-score, reflecting a larger agreement on what is
considered safe and unsafe. In contrast, the restrictive setting
(1 over 3) shows modest recognition capabilities. These
ifndings suggest that moderation performance is
sensitive to what can be perceived as unsafe, with Llama
Guard 3 ofering the most reliable moderator across
different settings. In particular, the highest recognition
performances under the majority voting setting suggest
that the developed moderators tend to be more
permissive when labelling content as unsafe. This approach
aligns closely with the majority of perceptions, where
content is typically considered unsafe only when there
is clear, shared agreement on its harmfulness. In this
sense, majority voting filters out individual model biases
and amplifies the collective judgment of the moderation
systems, efectively approximating the majority opinion
of human evaluators.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>5. Conclusions</title>
      <sec id="sec-2-1">
        <title>This work presented the first systematic and multidi</title>
        <p>mensional evaluation of safety in Italian Large Language
Models. Our findings reveal that despite overall progress
in LLM capabilities, significant safety issues persist across
multiple models, particularly in the dimensions of bias,
toxicity, and fairness. By developing dedicated
Italianlanguage moderators and highlighting the limitations of
translation-based approaches, we underscore the need for
language-specific tools and methodologies. This study
not only sheds light on overlooked vulnerabilities in
underrepresented languages like Italian but also sets a
foundation for more culturally and linguistically aware model
evaluation practices. Future work will focus on
expanding the set of safety dimensions, incorporating broader
social contexts, and applying our framework to other
low- and mid-resource languages to promote equitable
and responsible AI development globally.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Acknowledgments</title>
      <sec id="sec-3-1">
        <title>We acknowledge the support of the PNRR ICSC National</title>
        <p>Research Centre for High Performance Computing, Big
Data and Quantum Computing (CN00000013), under the
NRRP MUR program funded by the NextGenerationEU.</p>
        <p>This work has also been supported by ReGAInS,
Department of Excellence. The authors would also like to thank
Fastweb S.p.a. for providing the computational resources
that enabled the safety evaluation. Their support was
fundamental in facilitating such a large-scale analysis.
conference on machine learning, PMLR, 2013, pp. translation summit x: papers, 2005, pp. 79–86.
325–333. [17] A. Santilli, E. Rodolà, Camoscio: an
ital[8] B. Wang, W. Chen, H. Pei, C. Xie, M. Kang, C. Zhang, ian instruction-tuned llama, arXiv preprint
C. Xu, Z. Xiong, R. Dutta, R. Schaefer, et al., De- arXiv:2307.16456 (2023).
codingtrust: A comprehensive assessment of trust- [18] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
worthiness in gpt models., 2023. C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
[9] L. Li, B. Dong, R. Wang, X. Hu, W. Zuo, D. Lin, alpaca: An instruction-following llama model, 2023.</p>
        <p>Y. Qiao, J. Shao, Salad-bench: A hierarchical and [19] M. Polignano, P. Basile, G. Semeraro, Advanced
comprehensive safety benchmark for large lan- natural-based interaction for the italian language:
guage models, in: Findings of the Association Llamantino-3-anita, 2024. arXiv:2405.07101.
for Computational Linguistics: ACL 2024, 2024, pp. [20] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S.
Co3923–3954. nia, E. Barba, S. Orlandini, G. Fiameni, R.
Nav[10] F. Friedrich, S. Tedeschi, P. Schramowski, M. Brack, igli, Minerva LLMs: The first family of large
R. Navigli, H. Nguyen, B. Li, K. Kersting, Llms lost in language models trained from scratch on Italian
translation: M-alert uncovers cross-linguistic safety data, in: F. Dell’Orletta, A. Lenci, S. Montemagni,
gaps, arXiv preprint arXiv:2412.15035 (2024). R. Sprugnoli (Eds.), Proceedings of the 10th Italian
[11] L. Moroni, S. Conia, F. Martelli, R. Navigli, To- Conference on Computational Linguistics
(CLiCwards a more comprehensive evaluation for Italian it 2024), CEUR Workshop Proceedings, Pisa, Italy,
LLMs, in: F. Dell’Orletta, A. Lenci, S. Montemagni, 2024, pp. 707–719. URL: https://aclanthology.org/
R. Sprugnoli (Eds.), Proceedings of the 10th Italian 2024.clicit-1.77/.</p>
        <p>Conference on Computational Linguistics (CLiC- [21] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu,
it 2024), CEUR Workshop Proceedings, Pisa, Italy, H. Chen, X. Yi, C. Wang, Y. Wang, et al., A
sur2024, pp. 584–599. URL: https://aclanthology.org/ vey on evaluation of large language models, ACM
2024.clicit-1.67/. transactions on intelligent systems and technology
[12] G. Magazzù, A. Sormani, G. Rizzi, F. Pulerà, 15 (2024) 1–45.</p>
        <p>D. Scalena, S. Cariddi, E. Michielon, M. Pasqualini, [22] P. He, J. Gao, W. Chen, Debertav3:
ImprovC. Stamile, E. Fersini, BeaverTails-IT: Towards A ing deberta using electra-style pre-training with
Safety Benchmark for Evaluating Italian Large Lan- gradient-disentangled embedding sharing, 2021.
guage Models, in: Proceedings of the Eleventh arXiv:2111.09543.</p>
        <p>Italian Conference on Computational Linguistics [23] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A.
Ka(CLiC-it 2025), 2025. dian, A. Al-Dahle, A. Letman, A. Mathur, A.
Schel[13] A. Bacciu, C. Campagnano, G. Trappolini, F. Sil- ten, A. Vaughan, et al., The llama 3 herd of models,
vestri, DanteLLM: Let‘s push Italian LLM research arXiv preprint arXiv:2407.21783 (2024).
forward!, in: N. Calzolari, M.-Y. Kan, V. Hoste, [24] T. Wolf, L. Debut, V. Sanh, J. Chaumond, C.
DeA. Lenci, S. Sakti, N. Xue (Eds.), Proceedings of langue, A. Moi, P. Cistac, T. Rault, R. Louf, M.
Funthe 2024 Joint International Conference on Com- towicz, J. Davison, S. Shleifer, P. von Platen, C. Ma,
putational Linguistics, Language Resources and Y. Jernite, J. Plu, C. Xu, T. Le Scao, S.
GugEvaluation (LREC-COLING 2024), ELRA and ICCL, ger, M. Drame, Q. Lhoest, A. Rush,
TransTorino, Italia, 2024, pp. 4343–4355. URL: https: formers: State-of-the-art natural language
pro//aclanthology.org/2024.lrec-main.388/. cessing, in: Q. Liu, D. Schlangen (Eds.),
Pro[14] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bam- ceedings of the 2020 Conference on Empirical
ford, D. S. Chaplot, D. de las Casas, F. Bressand, Methods in Natural Language Processing: System
G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.- Demonstrations, Association for Computational
A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, Linguistics, Online, 2020, pp. 38–45. URL: https:
T. Lacroix, W. E. Sayed, Mistral 7b, 2023. URL: https: //aclanthology.org/2020.emnlp-demos.6/. doi:10.
//arxiv.org/abs/2310.06825. arXiv:2310.06825. 18653/v1/2020.emnlp-demos.6.
[15] D. Croce, A. Zelenanska, R. Basili, Neural learn- [25] L. von Werra, Y. Belkada, L. Tunstall, E. Beeching,
ing for question answering in italian, in: AI* IA T. Thrush, N. Lambert, S. Huang, K. Rasul, Q.
Gal2018–Advances in Artificial Intelligence: XVIIth louédec, Trl: Transformer reinforcement learning,
International Conference of the Italian Association https://github.com/huggingface/trl, 2020.
for Artificial Intelligence, Trento, Italy, November [26] G. Wang, H. Qin, S. Ade Jacobs, X. Wu, C. Holmes,
20–23, 2018, Proceedings 17, Springer, 2018, pp. 389– Z. Yao, S. Rajbhandari, O. Ruwase, F. Yang, L. Yang,
402. Y. He, Zero++: Extremely eficient collective
com[16] P. Koehn, Europarl: A parallel corpus for statistical munication for large model training, in: ICLR 2024,
machine translation, in: Proceedings of machine 2024.</p>
        <p>Declaration on Generative AI
During the preparation of this work, the author(s) used ChatGPT (OpenAI) and Grammarly in order
to: Paraphrase and reword and Grammar and spelling check. After using these tool(s)/service(s), the
author(s) reviewed and edited the content as needed and take(s) full responsibility for the</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C.</given-names>
            <surname>Bosco</surname>
          </string-name>
          , E. Ježek,
          <string-name>
            <given-names>M.</given-names>
            <surname>Polignano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sanguinetti</surname>
          </string-name>
          ,
          <article-title>Preface to the Eleventh Italian Conference on Computational Linguistics (CLiC-it</article-title>
          <year>2025</year>
          ),
          <source>in: Proceedings of the Eleventh Italian Conference on Computational Linguistics</source>
          (CLiC-it
          <year>2025</year>
          ),
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M. N.</given-names>
            <surname>Sakib</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Islam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Pathak</surname>
          </string-name>
          ,
          <string-name>
            <surname>M. M. Arifin</surname>
          </string-name>
          , Risks, causes, and
          <article-title>mitigations of widespread deployments of large language models (llms): A survey</article-title>
          ,
          <source>in: 2024 2nd International Conference on Artificial Intelligence</source>
          , Blockchain, and
          <article-title>Internet of Things (AIBThings)</article-title>
          , IEEE,
          <year>2024</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>7</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-F.</given-names>
            <surname>Ton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Guo</surname>
          </string-name>
          , H. Cheng, Y. Klochkov,
          <string-name>
            <given-names>M. F.</given-names>
            <surname>Taufiq</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Trustworthy llms: a survey and guideline for evaluating large language models' alignment</article-title>
          ,
          <source>arXiv preprint arXiv:2308.05374</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          , G. Cui,
          <string-name>
            <given-names>H.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Zou</surname>
          </string-name>
          , X. Cheng, H.
          <string-name>
            <surname>Ji</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          <string-name>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Revisiting out-of-distribution robustness in nlp: Benchmarks, analysis, and llms evaluations</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2023</year>
          )
          <fpage>58478</fpage>
          -
          <lpage>58507</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Inan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kumar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>McAnallen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Shajari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Levitan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sim</surname>
          </string-name>
          ,
          <article-title>Synthetic text generation with diferential privacy: A simple and practical recipe</article-title>
          , in: A.
          <string-name>
            <surname>Rogers</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Boyd-Graber</surname>
          </string-name>
          , N. Okazaki (Eds.),
          <source>Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume</source>
          <volume>1</volume>
          :
          <string-name>
            <surname>Long</surname>
            <given-names>Papers)</given-names>
          </string-name>
          ,
          <source>Association for Computational Linguistics</source>
          , Toronto, Canada,
          <year>2023</year>
          , pp.
          <fpage>1321</fpage>
          -
          <lpage>1342</lpage>
          . URL: https://aclanthology.org/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>74</volume>
          /. doi:
          <volume>10</volume>
          . 18653/v1/
          <year>2023</year>
          .
          <article-title>acl-long</article-title>
          .
          <volume>74</volume>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Q. Liu,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Word-level textual adversarial attacking as combinatorial optimization</article-title>
          ,
          <source>in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>6066</fpage>
          -
          <lpage>6080</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>R.</given-names>
            <surname>Zemel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Swersky</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Pitassi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Dwork</surname>
          </string-name>
          ,
          <article-title>Learning fair representations</article-title>
          , in: International
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          , W. Chen, LoRA:
          <article-title>Low-rank adaptation of large language models</article-title>
          ,
          <source>in: International Conference on Learning Representations</source>
          ,
          <year>2022</year>
          . URL: https://openreview.net/forum?id= nZeVKeeFYf9.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J. L.</given-names>
            <surname>Fleiss</surname>
          </string-name>
          ,
          <article-title>Measuring nominal scale agreement among many raters</article-title>
          .,
          <source>Psychological bulletin 76</source>
          (
          <year>1971</year>
          )
          <fpage>378</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>