<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Piergentili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beatrice Savoldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Negri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luisa Bentivogli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, 38123, Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>via Sommarive 5, 38123, Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Gender-neutral rewriting (GNR) aims to reformulate text to eliminate unnecessary gender specifications while preserving meaning, a particularly challenging task in grammatical-gender languages like Italian. In this work, we conduct the first systematic evaluation of state-of-the-art large language models (LLMs) for Italian GNR, introducing a two-dimensional framework that measures both neutrality and semantic fidelity to the input. We compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Our findings show that open-weight LLMs outperform the only existing model dedicated to GNR in Italian, whereas our fine-tuned models match or exceed the best open-weight LLM's performance at a fraction of its size. Finally, we discuss the trade-off between optimizing the training data for neutrality and meaning preservation.</p>
      </abstract>
      <kwd-group>
        <kwd>Ethics</kwd>
        <kwd>fairness</kwd>
        <kwd>gender rewriting</kwd>
        <kwd>large language models</kwd>
        <kwd>fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Language technologies reinforce existing gender stereotypes and binary assumptions by disproportionately favoring masculine references or representations [<xref ref-type="bibr" rid="ref1">1</xref>], especially when gender information is ambiguous or unspecified [<xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>]. Such biases result in the under-representation or misrepresentation of certain gender groups, reinforcing existing societal stereotypes, and erasing non-binary identities [<xref ref-type="bibr" rid="ref5">5, 6</xref>]. Addressing these biases through gender-inclusive approaches is increasingly important to ensure language technologies contribute to more inclusive and equitable communication [7, 8, 9].</p>
      <p>Gender-neutral rewriting (GNR) has emerged as a natural language generation task aimed at producing texts free from unnecessary gender specifications [10, 11]. This task is particularly challenging in grammatical-gender languages, such as Italian, due to the pervasive encoding of gender in the morphology. Consider the sentence ‘Tutti i senatori sono stati informati’ (equivalent to AllM theM senatorsM have beenM informedM): almost every word is morphologically inflected for (masculine) gender. Rephrasing this sentence in a gender-neutral way may require significant changes, e.g. ‘Ogni membro del Senato ha ricevuto l’informazione’ (Every member of the Senate has received the information). A further challenge in automatic GNR is preserving the meaning of the original sentence beyond gender expression, to avoid generating output sentences that are neutral but semantically divergent from the input.</p>
      <p>So far, GNR system development has been mostly confined to English [10, 11, 12, inter alia], where gender is expressed through specific sets of words, such as pronouns (e.g., he/she, him/her) and lexically gendered terms (e.g., policeman/policewoman), and gender-neutral alternatives (e.g., the singular they or synonyms like police officer) are generally available and attested. GNR systems for grammatical-gender languages generally target specific gendered phenomena, such as member nouns [13], or use neologistic [14] inclusive devices such as neomorphemes and graphemic solutions [15, 16, 17] that convey neutrality, but are not necessarily acceptable in all contexts. Currently, the sole model dedicated to Italian GNR was developed by Greco et al. [18]; this model, however, was developed and tested on proprietary, not publicly available data, hindering reproducibility and progress.</p>
      <p>Towards addressing this gap, this paper explores the potential of state-of-the-art (SOTA) large language models (LLMs) to perform GNR in Italian. Specifically, we explore both prompting and fine-tuning approaches and assess both neutrality and meaning preservation in the reformulated texts.</p>
      <p>Our contributions are threefold: i) the first systematic evaluation of SOTA LLMs for Italian GNR under a two-dimensional framework measuring both neutrality and meaning preservation; ii) a set of experiments in fine-tuning LLMs for GNR, enabling compact models to rival significantly larger-sized models; iii) an investigation of the GNR performance trade-off between meaning preservation and neutrality in the outputs of LLMs fine-tuned on sentence similarity-optimized data.1</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author: apiergentili@fbk.eu (A. Piergentili); bsavoldi@fbk.eu (B. Savoldi); negri@fbk.eu (M. Negri); bentivo@fbk.eu (L. Bentivogli). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Table 1: Example of an Italian mGeNTE entry. REF-G: ‘Spero di essere stato chiaro su questo punto.’ (EN: ‘I hope that I am clear in this.’) REF-N: ‘Spero di avere espresso con chiarezza questo punto.’ (EN: ‘I hope that I have expressed this point clearly.’)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>Gender-Inclusive Language</title>
        <p>Inclusive language aims to prevent expressions that reinforce gender hierarchies or render non-binary identities invisible, promoting fairness and inclusion in alignment with the UN Sustainable Development Goal of gender equality.2 In grammatical-gender languages like Italian, inclusive language is both particularly challenging and increasingly urgent due to their entrenched gender systems [19, 20, 21] and the widespread use of masculine forms as default to mark generic or mixed-gender referents [22].3 To address this issue, two main strategies have emerged, as reviewed by Rosola et al. [24] within the Italian linguistic context. On the one hand, innovative forms using neomorphemes and symbols (e.g., tutt* or tutt@) are mostly used in informal contexts like social media and online LGBTQIA+ communities, and are generally not accepted in more formal contexts [25]. Instead, conservative gender-neutral language strategies retool existing forms and grammar to avoid unnecessary gendered expressions [26, 27], e.g. by replacing i professori with la docenza [9]. As attested by Piergentili et al. [<xref ref-type="bibr" rid="ref6">28</xref>], such neutral solutions are increasingly accepted in communication and are endorsed by institutions and universities to embrace all gender identities [<xref ref-type="bibr" rid="ref7">29</xref>].4</p>
        <sec id="sec-2-1-1">
          <title>Gender-Inclusive Rewriting</title>
          <p>In recent years, sexism and gender-exclusionary practices have been increasingly addressed in NLP, focusing initially on binary gender bias and more recently expanding to non-binary inclusive language technologies [<xref ref-type="bibr" rid="ref4">6, 4</xref>]. NLP work has explored the modeling of inclusive language across various tasks [<xref ref-type="bibr" rid="ref8 ref9">30, 31</xref>], including inclusive language generation. For instance, Bartl and Leavy [12] explored stereotype reduction in English LLMs fine-tuned on inclusive seeds and lexicon.</p>
          <p>Intralingual inclusive rewriting has primarily been explored in English [10, 11, 12], where gender marking is scarce. Similar efforts in languages with grammatical gender include research on German [15], Portuguese [16], and French [17, 13], either by using innovative forms or by targeting specific instances of gendered language, such as masculine generics in member nouns. In Italian, prior work has explored gender-neutral translation [<xref ref-type="bibr" rid="ref10 ref11">32, 33</xref>], whereas intra-lingual rewriting remains mostly limited to benchmarking efforts [<xref ref-type="bibr" rid="ref12">34</xref>]. Attanasio et al. [<xref ref-type="bibr" rid="ref13">35</xref>] compared several instruction-following models prompted across fairness-related tasks, including GNR, but these underperformed, achieving less than 50% success in neutralization. Frenda et al. [<xref ref-type="bibr" rid="ref12">34</xref>] proposed the gender-fair generation (GFG) challenge, where for one of the tasks models are prompted to reformulate gendered Italian sentences in a neutral way. Closest to our work, Greco et al. [18] developed a rewriter by fine-tuning language models specifically for Italian gender-neutral language. However, the data used for testing and developing these models are not publicly available, hampering further research and comparability.</p>
          <p>3. Experimental settings</p>
          <p>We define GNR as the task of reformulating a sentence to remove explicit gender markings referring to human entities, without altering the sentence beyond what is necessary for neutralization, ensuring semantic equivalence to the input. We run a set of experiments evaluating different systems and approaches to GNR. Here, we first discuss the evaluation data and metrics (§3.1) and the set of models we experiment with (§3.2). Then, we describe two approaches to GNR: few-shot prompting SOTA LLMs (§3.3) and fine-tuning a subset of those LLMs on repurposed Italian data (§3.4).</p>
          <p>1 We release models and data at https://huggingface.co/FBK-MT
2 See https://sdgs.un.org/goals/goal5
3 English presents fewer challenges as gender marking is primarily limited to pronouns, allowing focused solutions like the singular they [23].
4 See for instance the EU Parliament guidelines for gender-neutral language: https://www.europarl.europa.eu/cmsdata/151780/GNL_Guidelines_EN.pdf</p>
          <p>3.1. Evaluation</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Test data</title>
          <p>Following Frenda et al. [34], we conduct our GNR experiments on mGeNTE [<xref ref-type="bibr" rid="ref11">33</xref>], a benchmark for gender-neutral translation from English into several grammatical-gender languages, including Italian. mGeNTE provides 1,500 parallel gendered and gender-neutral references created by professionals (REF-G and REF-N respectively), differing only in gender expression (see Table 1 for an example of an Italian mGeNTE entry). It is organized into two subsets: Set-G, containing sentences that require neutralization, and Set-N, containing sentences that do not. For our GNR experiments, we use the 750 Italian gendered references from Set-N as input.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Metrics</title>
          <p>To evaluate gender-neutrality, we use the LLM-as-a-Judge [43] approach proposed by Piergentili et al. [44], which provides sentence-level binary gendered/neutral assessments, and was shown to be highly accurate on both human- and model-generated texts. We use their optimal configuration for monolingual evaluation.5 We compute the percentage of neutralized sentences over the whole test set (750 entries).</p>
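          <p>The neutrality score can be sketched as follows (a minimal illustration; the function and label names are ours, while the actual binary judgments come from the LLM judge described above):</p>

```python
def neutrality_percentage(judge_labels):
    """Share of test sentences the LLM judge labels as neutral, in percent."""
    return 100 * sum(label == "neutral" for label in judge_labels) / len(judge_labels)
```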
          <p>To evaluate meaning preservation in GNR, we use BERTScore [45], an attested BERT-based [46] metric measuring the semantic similarity of two texts (the higher, the better). We use BERTScore rather than common string-matching metrics like BLEU [47] and TER [<xref ref-type="bibr" rid="ref15">48</xref>] because gender-neutralization can have a notable impact on the lexicon, morphology, and structure of a sentence [9], which such metrics would penalize. By contrast, BERTScore was found to be rather insensitive to gender-neutralization [<xref ref-type="bibr" rid="ref6">28</xref>]. Therefore, lower BERTScore values should be attributed to differences in the meaning of the sentences beyond gender, which we evaluate separately, as described above.</p>
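          <p>To see why surface-overlap metrics penalize neutralization even when meaning is preserved, consider a toy unigram-precision score on the Table 1 pair (a deliberately simplified stand-in for BLEU, not the actual metric):</p>

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(token in ref for token in cand) / len(cand)

ref_g = "Spero di essere stato chiaro su questo punto."        # gendered reference
ref_n = "Spero di avere espresso con chiarezza questo punto."  # neutral rewrite
# The meaning-preserving rewrite shares only half of its tokens with REF-G.
```

BLEU and TER behave analogously at the n-gram and edit-distance level, which motivates the choice of a semantic metric above.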
          <p>To identify reference values to guide the interpretation of BERTScore in GNR, we compute the distribution of BERTScore of mGeNTE REF-N sentences against the respective REF-G.6 As these neutral reformulations were produced by human experts, the BERTScore distribution provides an empirical estimate of human-level performance in meaning preservation in GNR. We take the mean BERTScore minus one standard deviation as the quality threshold.</p>
          <p>5 Prompt: ‘Mono+P+L’; GPT model: gpt-4o-2024-08-06
6 We only use Set-N entries in this computation.</p>
          <p>Table 3: The prompts used in our experiments. GFG, Italian: Riformula la seguente frase utilizzando un linguaggio neutro rispetto al genere dei referenti umani, evitando l’uso di forme maschili e femminili.</p>
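          <p>The human-level threshold described above (mean minus one standard deviation of the REF-N vs. REF-G BERTScore distribution) can be sketched as follows; the score values here are hypothetical placeholders, not the paper's data:</p>

```python
import numpy as np

# Hypothetical BERTScore values of human REF-N reformulations against REF-G
human_scores = np.array([0.97, 0.95, 0.99, 0.93, 0.96])

# Quality threshold: mean minus one (population) standard deviation
threshold = human_scores.mean() - human_scores.std()
```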
          <p>GFG, English: Rewrite the following Italian sentence using gender-neutral language in reference to human beings, avoiding masculine or feminine forms.</p>
          <p>Rewrite, Italian: Sei un riscrittore di frasi italiane con l’obiettivo di rendere i testi neutrali rispetto al genere dei referenti umani. Ti viene fornita una frase che contiene riferimenti a persone in forme marcate per genere, come il maschile sovraesteso o coppie binarie. Il tuo compito è riformulare la frase in modo da:
• rimuovere riferimenti espliciti al genere quando non necessari;
• mantenere inalterato il significato originale;
• preservare lo stile e la leggibilità del testo.
Per farlo, usa strategie come:
• sostantivi collettivi (“la cittadinanza”, “il personale”, “l’utenza”);
• perifrasi impersonali (“si dovrebbe”, “si consiglia”);
• forme passive (“l’accesso è consentito”);
• forme imperative (“allega il documento”);
• pronomi relativi e costruzioni subordinate (“chi ha svolto attività di pesca”);
• termini epiceni (“ogni giudice”, “gentile collega”);
• termini neutri (“l’individuo”, “la persona interessata”, “il membro”).
IMPORTANTE:
• evita l’uso del maschile come forma generica e non usare forme grafiche non standard come asterischi o schwa;
• evita doppie formulazioni come “il/a cittadino/a” oppure “il professore o la professoressa”;
• non rimuovere parti della frase che non richiedono modifiche (ad esempio, i nomi propri);
• fornisci solo la frase riformulata.</p>
          <p>Rewrite, English: You are a rewriter of Italian sentences with the goal of making texts gender-neutral with respect to human referents. You are given a sentence that contains references to people using gender-marked forms (such as masculine generics or binary pairs). Your task is to rewrite the sentence to:
• remove explicit gender references when they are not necessary;
• preserve the original meaning;
• maintain the style and readability of the text.
To do this, use strategies such as:
• collective nouns (“la cittadinanza”, “il personale”, “l’utenza”);
• impersonal phrases (“si dovrebbe”, “si consiglia”);
• passive constructions (“l’accesso è consentito”);
• imperative constructions (“allega il documento”);
• relative pronouns and subordinate clauses (“chi ha svolto attività di pesca”);
• epicene terms (“ogni giudice”, “gentile collega”);
• neutral terms (“l’individuo”, “la persona interessata”, “il membro”).
IMPORTANT:
• avoid using the masculine form as a generic and do not use non-standard spellings such as asterisks or schwa;
• avoid binary formulations such as “il/a cittadino/a” or “il professore o la professoressa”;
• do not remove any part of the sentence that does not need to be rewritten (e.g. proper names);
• only return the reformulated sentence.</p>
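          <p>A few-shot prompt built from such an instruction can be sketched as a chat-message builder (the exemplar pair shown in the test is illustrative, not one of the paper's 8 shots):</p>

```python
def build_messages(instruction, shots, sentence):
    """Assemble a few-shot chat prompt: system instruction, exemplars, then the query."""
    messages = [{"role": "system", "content": instruction}]
    for gendered, neutral in shots:
        messages.append({"role": "user", "content": gendered})
        messages.append({"role": "assistant", "content": neutral})
    messages.append({"role": "user", "content": sentence})
    return messages
```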
        </sec>
        <sec id="sec-2-1-4">
          <title>3.3. Few-Shot Prompting</title>
          <p>We run few-shot prompting experiments with all models in the selection described above,9 to investigate the performance of LLMs without any task-specific fine-tuning. We use two prompt formats:
• GFG: a concise rewriting instruction, originally used by Frenda et al. [<xref ref-type="bibr" rid="ref12">34</xref>] in their gender-fair generation challenge for Italian LLMs.
• Rewrite: a more detailed and analytical prompt, also featuring essential guidelines for the task.</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>9 Except for Inclusively, which does not support few-shot prompting.</title>
          <p>We instead test its off-the-shelf generation capabilities.</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>Prompt formats and languages</title>
          <p>These prompts allow us to explore the impact of more complex instructions on models’ performance. Moreover, we experiment with these two prompt formats by formulating them in both Italian and English, to investigate whether the language used is a relevant factor as well. The content of the prompts is reported in Table 3. We include the same 8 task exemplars, or shots, with all prompts, to elicit the in-context learning ability of LLMs [<xref ref-type="bibr" rid="ref17">50</xref>]. We use vLLM [<xref ref-type="bibr" rid="ref18">51</xref>] as the inference engine.</p>
        </sec>
        <sec id="sec-2-1-7">
          <title>3.4. Fine-tuning</title>
          <p>We perform fine-tuning experiments to assess whether, and to what extent, smaller open-weight LLMs can be adapted to the GNR task and approach the performance of larger models or closed systems. Namely, we fine-tune LLaMAntino, Velvet, LLama 3.1, Phi 4, and the 8B and 14B Qwen3 models.</p>
          <p>3.4.1. Data</p>
          <p>We fine-tune the models on data originally used to train a neutrality classifier.10 This data consists of gendered Italian sentences and their gender-neutral counterparts, all generated starting from a dictionary of masculine, feminine, and neutral expressions, through a multi-step prompting pipeline. We repurpose this data to fine-tune autoregressive LLMs for GNR. We prepare the data as chat-formatted input, where each instance consists of a user role message containing a gendered sentence, and an assistant role message containing the corresponding neutral sentence. Consistent with the models’ prior instruction-following fine-tuning, this method adopts a conversational prompt–response format while strictly adhering to a causal token-prediction objective [<xref ref-type="bibr" rid="ref19">52</xref>].</p>
          <p>As the sentences were partly LLM-generated, we note that the content of the gendered-neutral pairs may not always be aligned, due to the unpredictability of LLMs in open-ended generation.11 To investigate this aspect, we compare the gendered and neutral sentences in the dataset using BERTScore, to identify dataset entries with semantically divergent gendered-neutral sentences. Figure 1 reports the BERTScore values for the entire dataset. We observe that while the score distribution is skewed towards almost-perfect values, there is a notable tail of gendered-neutral sentence pairs with rather divergent semantic content. To investigate the impact of such data in GNR fine-tuning, we construct a subset to be used for training alongside the full dataset: a clean subset obtained by filtering out the bottom 50% of sentence pairs based on the BERTScore values. Statistics about the fine-tuning data are reported in Table 4.</p>
          <p>10 More specifically, we use the cleaned version of the dataset later released by Savoldi et al. [<xref ref-type="bibr" rid="ref10">32</xref>] at https://github.com/hlt-mt/fbk-NEUTR-evAL/blob/main/solutions/GeNTE.md
11 While this is not necessarily an issue in the development of a classifier, where individual sentences are simply paired with neutrality labels, for a rewriting task the input-output sentences should be identical except for the attribute of interest, i.e., in this case, gender.</p>
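          <p>The clean subset construction (dropping the bottom 50% of pairs by BERTScore) can be sketched as follows; the pairs and scores in the test are hypothetical:</p>

```python
def clean_subset(pairs, scores):
    """Keep the top half of (gendered, neutral) pairs, ranked by BERTScore."""
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    keep = len(ranked) // 2
    return [pair for pair, _ in ranked[:keep]]
```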
          <p>3.4.2. Method</p>
        </sec>
        <sec id="sec-2-1-8">
          <title>Fine-tuning method</title>
          <p>We fine-tune the selected models using Low-Rank Adaptation (LoRA) [53]. Following common practices in LoRA fine-tuning [54], we set the rank and alpha at 32, and use the following hyperparameters to strike a balance between hardware constraints12 and consistency across model sizes and requirements: learning rate 2 × 10⁻⁴; batch size 8 for the 8B models and 4 for the 14B models. We use early stopping with a patience of 20 steps for the 8B models and 40 steps for the 14B models.</p>
          <p>12 We run our experiments on nodes with 4 NVIDIA A100 GPUs with 64 GB VRAM each.</p>
          <p>4. Results</p>
          <p>4.1. Few-Shot Prompting Results</p>
          <p>Figure 2 summarizes the results of the few-shot prompting experiments, showing all models’ performance in neutrality and meaning preservation. Higher values on both axes indicate better performance; therefore, systems closer to the top-right corner perform best. As no consistent trend emerged across prompt formats (GFG vs. Rewrite, see Section 3.3) and languages (Italian vs. English), we report each model’s average performance, along with the range of neutrality and BERTScore values observed across prompting conditions. In Appendix A we provide the complete and detailed results obtained with the two prompt formats, separately for Italian and English instructions.</p>
          <p>Generally, and with rare exceptions, all models’ BERTScore values are well above the quality threshold we identified in §3.1. This means that the models do not generate unrelated or additional text, confirming that their outputs remain adherent to the input and free of “hallucinations” [<xref ref-type="bibr" rid="ref22">55</xref>]. Neutrality scores, on the contrary, vary significantly across models. Looking at our baseline, the GNR-dedicated model Inclusively, we observe that it performs rather poorly in neutrality. Across LLMs, we notice similar behavior within the groups. The “Italian” models, in the bottom left quarter of the chart, generally fail to neutralize, and alter the sentences the most. Within the multilingual LLMs group, only Phi 4, Qwen3 32B, and LLama 3.3 perform better than the Italian models. The rest of the Qwen3 family generally underperforms, with the high BERTScore suggesting that they make little to no change to the gendered sentences. The only model performing well on both axes is GPT 4.1, which tops at 89.07% neutralization accuracy and 93.21 BERTScore, indicating that it correctly alters the parts of the sentences expressing the gender of human beings while leaving the rest untouched.</p>
          <p>Overall, we find that the LLMs we tested perform very differently in GNR in Italian, and that failure in this setting consists in overlooking the relevant (gendered) parts of the input to act upon, and/or unsuccessfully rendering them gender-neutral.</p>
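          <p>The size-dependent fine-tuning settings listed in §3.4 can be encoded compactly as follows (a convenience sketch; the function name and dictionary keys are ours):</p>

```python
def finetuning_config(model_size_b):
    """LoRA fine-tuning settings by model size, in billions of parameters."""
    config = {"lora_rank": 32, "lora_alpha": 32, "learning_rate": 2e-4}
    if model_size_b >= 14:
        config.update(batch_size=4, early_stopping_patience=40)
    else:  # 8B models
        config.update(batch_size=8, early_stopping_patience=20)
    return config
```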
        </sec>
        <sec id="sec-2-1-9">
          <title>4.2. Fine-Tuning Results</title>
          <p>Results of the fine-tuning experiments are reported in Figure 3. We first notice that on the neutrality axis all fine-tuned models outperform the baseline, except for the LLamantino/clean configuration. LLamantino shows the narrowest gains overall, and in one case even a drop in neutrality, echoing its weaker few-shot prompting results and suggesting it may be ill-suited to GNR. In four out of six instances, and always with the full dataset, the fine-tuned models also outperform the best performer among the open-weight models in the prompting experiments, i.e. LLama 3.3 70B with the GFG English prompt, though with a significant drop in BERTScore.</p>
          <p>
            Such a drop indicates that these models fail by
hallucinating unrelated content in their attempt to neutralize,
rather than by leaving the input sentences untouched
as observed in the prompting experiments (§4.1). This
is possibly due to two factors: the significantly smaller
size of the fine-tuned models with respect to LLama 3.3
70B (1/9 or 1/5, for the 8B and 14B models respectively),
as larger LLMs have been shown to exhibit greater
robustness and lower variance in downstream performance
after fine-tuning compared to smaller counterparts [
            <xref ref-type="bibr" rid="ref23">56</xref>
            ],
and/or the presence of many divergent gender-neutral
sentence pairs in the fine-tuning dataset (see §3.4.1).
          </p>
          <p>While full yields the highest improvements in
neutrality, only clean improves performance on both axes
while keeping BERTScore within the human-level range.</p>
          <p>However, it yields significantly smaller gains in neutrality
and even causes drops for two models (LLamantino, Phi
4). We hypothesize that clean may be excessively
conditioned by the data filtering method, i.e. a BERTScore
based selection. In other words, by selecting only dataset
entries with almost perfect BERTScore values we are
optimizing the models to perform well on the sentence
similarity dimension—as measured by BERTScore—rather
than GNR.</p>
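          <p>The analysis that follows compares BERTScore and BARTScore with Pearson r and Spearman ρ, computed with SciPy (see footnote 14). A toy illustration of how the two coefficients can diverge, with made-up scores that are monotonically but not linearly related:</p>

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical metric values for four outputs: identical ranking, non-linear relation
bert = [0.90, 0.92, 0.94, 0.96]
bart = [-3.0, -2.5, -2.4, -0.1]

r, _ = pearsonr(bert, bart)      # linear correlation of raw scores
rho, _ = spearmanr(bert, bart)   # rank correlation
# rho is 1.0 (the rankings agree perfectly), while r is noticeably lower.
```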
          <p>The impact of metric-based data selection. To investigate the hypothesis above, we evaluate the same outputs against the gendered inputs with another semantic similarity metric: BARTScore [57].13 BERTScore and BARTScore evaluations are visualized in Figure 4. To understand whether the outputs of the models fine-tuned on clean are actually very semantically similar to the corresponding input, and whether those models simply learned to game BERTScore, we compute14 the Pearson r and Spearman ρ correlation coefficients between the BERTScore and BARTScore assessments. The first captures linear correlations between the two metrics’ raw scores, while the latter measures how well the relationship between the two variables can be described by a monotonic function, by comparing the rankings of the scores rather than their raw values. This combination allows us to assess both the alignment of the scores and the consistency with which the two metrics rank the outputs.</p>
          <p>We find that in full, r equals 0.814 and ρ equals 0.907, whereas in clean they are 0.914 and 0.679 respectively.15 r is high in both cases, indicating a strong linear correlation between the two metrics, stronger in clean, as in that case the data points are more tightly clustered and skewed towards higher values. This confirms that the metrics generally agree on the quality of the outputs. The substantial drop in ρ, instead, indicates that there are many instances in clean where the monotonic trend is broken, i.e., higher BERTScore does not necessarily correspond to higher BARTScore. This suggests that the clean models also learned to game BERTScore by reproducing features rewarded by that metric.</p>
          <p>With respect to our hypothesis: by selecting high-similarity pairs for the clean dataset, we effectively steered models toward preserving semantic alignment with the input; however, this emphasis on similarity appears to have hampered their improvement in neutralization. Indeed, the models learned to preserve the input to an excessive degree, as confirmed by the high r coefficient and high BARTScore values shown in Figure 4. We interpret our results as evidence of a broader trade-off between optimizing for neutrality and for sentence similarity. Our findings underscore the need for data curation strategies that strike a balance between neutrality and similarity, achieving the flexibility required for effective GNR.</p>
          <p>13 While similar in name and scope, BERTScore and BARTScore function differently. The first computes a sum of token-level cosine similarities between two sentences’ embeddings encoded by a BERT (encoder-only) model; the latter is computed as the weighted sum of the log-probabilities that a pretrained BART (encoder-decoder) model assigns to each token in the generated text.</p>
          <p>Through fine-tuning experiments we showed that compact models can match or exceed the best open-weight LLM at a fraction of its size. Moreover, our BERTScore-based data cleaning highlighted a trade-off: models trained on cleaned data achieve human-level BERTScore but show smaller neutrality gains and exhibit ranking differences against another similarity metric, signaling over-fitting on BERTScore. Future work should take this trade-off into account and create dedicated, high-quality parallel data, aiming to reach the performance of the commercial system with open-weight models.</p>
          <p>Acknowledgments</p>
          <p>We acknowledge the support of the project InnovAction: Network Italiano dei Centri per l’Innovazione Tecnologica (CUP B47H2200437000), funded by MIMIT with NPRR - NextGenerationEU funds, in collaboration with Piazza Copernico S.r.l. We also received funding from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU. Finally, we acknowledge the CINECA award under the ISCRA initiative (AGeNTE), for the availability of high-performance computing resources and support.</p>
          <p>References</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>We presented the first systematic investigation of state-of-the-art large language models for Italian gender-neutral rewriting under a two-dimensional evaluation of neutrality and meaning preservation. In our few-shot prompting experiments, open-weight models outperformed the only existing Italian-specific system but remained behind a closed commercial system.</p>
      <p>13 (continued) In our experiments, we use the BART model facebook/bart-large [58].
14 We use the Python library SciPy [59].
15 All p-values &lt; 0.05.</p>
      <p>[6] S. Dev, M. Monajatipoor, A. Ovalle, A. Subramonian, J. Phillips, K.-W. Chang, Harms of gender exclusivity and challenges in non-binary representation in language technologies, in: Proc. of the 2021 Conference on Empirical Methods in Natural Language Processing, ACL, Online and Punta Cana, Dominican Republic, 2021, pp. 1968–1994. URL: https://aclanthology.org/2021.emnlp-main.150/.</p>
      <p>[7] U. Gabriel, P. M. Gygax, E. A. Kuhn, Neutralising linguistic sexism: Promising but cumbersome?, Group Processes &amp; Intergroup Relations 21 (2018) 844–858.</p>
      <p>[8] APA, Publication Manual of the APA, 7th ed., 2020.</p>
      <p>[9] A. Piergentili, D. Fucci, B. Savoldi, L. Bentivogli, M. Negri, Gender neutralization for an inclusive machine translation: from theoretical foundations to open challenges, in: Proc. of the First Workshop on Gender-Inclusive Translation Technologies, EAMT, Tampere, Finland, 2023, pp. 71–83. URL: https://aclanthology.org/2023.gitt-1.7/.</p>
      <p>[10] T. Sun, K. Webster, A. Shah, W. Y. Wang, M. Johnson, They, them, theirs: Rewriting with gender-neutral English, 2021. arXiv:2102.06788.</p>
      <p>[11] E. Vanmassenhove, C. Emmery, D. Shterionov, NeuTral Rewriter: A rule-based and neural approach to automatic rewriting into gender neutral alternatives, in: Proc. of the 2021 Conference on Empirical Methods in Natural Language Processing, ACL, Online and Punta Cana, Dominican Republic, 2021, pp. 8940–8948. URL: https://aclanthology.org/2021.emnlp-main.704/.</p>
      <p>[12] M. Bartl, S. Leavy, From ‘showgirls’ to ‘performers’: Fine-tuning with gender-inclusive language for bias reduction in LLMs, in: Proc. of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), ACL, Bangkok, Thailand, 2024, pp. 280–294. URL: https://aclanthology.org/2024.gebnlp-1.18/.</p>
      <p>[13] E. Doyen, A. Todirascu, Genre: A French gender-neutral rewriting system using collective nouns, 2025. arXiv:2505.23630.</p>
      <p>[14] E. Rose, M. Winig, J. Nash, K. Roepke, K. Conrod, Variation in acceptability of neologistic English pronouns, Proc. of the Linguistic Society of America 8 (2023) 5526. URL: https://journals.linguisticsociety.org/proceedings/index.php/PLSA/article/view/5526.</p>
      <p>[15] D. Pomerenke, Inclusify: A benchmark and a model for gender-inclusive German, 2022. arXiv:2212.02564.</p>
      <p>[16] L. Veloso, L. Coheur, R. Ribeiro, A rewriting approach for gender inclusivity in Portuguese, in: Findings of the ACL: EMNLP 2023, ACL, Singapore, 2023, pp. 8747–8759. URL: https://aclanthology.org/2023.findings-emnlp.585/.</p>
      <p>[17] P. Lerner, C. Grouin, INCLURE: a dataset and toolkit for inclusive French translation, in: Proc. of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 59–68. URL: https://aclanthology.org/2024.bucc-1.7/.</p>
      <p>[18] S. Greco, M. La Quatra, L. Cagliero, T. Cerquitelli, Towards AI-assisted inclusive language writing in Italian formal communications, ACM Trans. Intell. Syst. Technol. (2025). URL: https://doi.org/10.1145/3729237.</p>
      <p>[19] B. Papadopoulos, Morphological Gender Innovations in Spanish of Genderqueer Speakers, Department of Spanish and Portuguese, University of California, UC Berkeley, 2019. URL: https://escholarship.org/uc/item/6j73t666.</p>
      <p>[20] G. S. di Carlo, Is Italy ready for gender-inclusive language? An attitude and usage study among Italian speakers, in: Inclusiveness Beyond the (Non)binary in Romance Languages, 1st ed., Routledge, 2024, p. 21. URL: https://doi.org/10.4324/9781003432906.</p>
      <p>[21] G. V. Silva, C. Soares, Inclusiveness Beyond the (Non)binary in Romance Languages: Research and Classroom Implementation, 1st ed., Routledge, London, 2024. doi:10.4324/9781003432906.</p>
      <p>[22] P. Gygax, S. Sato, A. Öttl, U. Gabriel, The masculine form in grammatically gendered languages and its multiple interpretations: a challenge for our cognitive system, Language Sciences 83 (2021) 101328. URL: https://www.sciencedirect.com/science/article/pii/S0388000120300619.</p>
      <p>[23] L. Ackerman, Syntactic and cognitive issues in investigating gendered coreference, Glossa: a journal of general linguistics 4 (2019).</p>
      <p>[24] M. Rosola, S. Frenda, A. T. Cignarella, M. Pellegrini, A. Marra, M. Floris, Beyond obscuration and visibility: Thoughts on the different strategies of gender-fair language in Italian, in: Proc. of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proc., Venice, Italy, 2023, pp. 369–378. URL: https://aclanthology.org/2023.clicit-1.44/.</p>
      <p>[25] G. Comandini, Salve a tutt@, tutt*, tuttu, tuttx e tutt@: l’uso delle strategie di neutralizzazione di genere nella comunità queer online. Indagine su un corpus di italiano scritto informale sul web [Hello everyone: the use of gender neutralization strategies in the online queer community. A survey of a corpus of informal written Italian on the web], Testo e Senso 23 (2021) 43–64.</p>
      <p>[26] J. Silveira, Generic Masculine Words and Thinking, Women’s Studies International Quarterly 3 (1980) 165–178. URL: https://www.sciencedirect.com/science/article/pii/S0148068580921132.</p>
      <p>[27] A. H. Bailey, A. Williams, A. Cimpian, Based on billions of words on the internet, people = men, Science Advances 8 (2022) eabm2463.</p>
      <sec id="sec-3-1">
        <title>BERTScore</title>
        <p>[Table 6 residue: per-model BERTScore results with columns Model, Size (B), GFG Ita, GFG Eng, Rewrite Ita, Rewrite Eng, and AVG; rows include Inclusively. The best scores across the categories are highlighted, and the best overall performer is in bold.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: paraphrase and reword, improve writing style, and check grammar and spelling. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] B. Savoldi, S. Papi, M. Negri, A. Guerberof-Arenas, L. Bentivogli, What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024. URL: https://aclanthology.org/2024.emnlp-main.1002/. doi:10.18653/v1/2024.emnlp-main.1002.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] H. Kotek, R. Dockum, D. Sun, Gender bias and stereotypes in large language models, in: Proc. of The ACM Collective Intelligence Conference, CI '23, ACM, New York, NY, USA, 2023, pp. 12–24. URL: https://doi.org/10.1145/3582269.3615599.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] R. Ostrow, A. Lopez, LLMs reproduce stereotypes of sexual and gender minorities, 2025. arXiv:2501.05926.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] B. Savoldi, J. Bastings, L. Bentivogli, E. Vanmassenhove, A decade of gender bias in machine translation, Patterns (2025) 101257. URL: https://www.sciencedirect.com/science/article/pii/S2666389925001059.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] S. L. Blodgett, S. Barocas, H. Daumé III, H. Wallach, Language (technology) is power: A critical survey of “bias” in NLP, in: Proc. of the 58th Annual Meeting of the ACL, ACL, Online, 2020, pp. 5454–5476. URL: https://aclanthology.org/2020.acl-main.485/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[28] A. Piergentili, B. Savoldi, D. Fucci, M. Negri, L. Bentivogli, Hi guys or hi folks? Benchmarking gender-neutral machine translation with the GeNTE corpus, in: Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing, ACL, Singapore, 2023, pp. 14124–14140. URL: https://aclanthology.org/2023.emnlp-main.873/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[29] F. Höglund, M. Flinkfeldt, De-gendering parents: Gender inclusion and standardised language in screen-level bureaucracy, International Journal of Social Welfare (2023).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[30] Y. T. Cao, H. Daumé III, Toward gender-inclusive coreference resolution, in: Proc. of the 58th Annual Meeting of the ACL, ACL, Online, 2020, pp. 4568–4595. URL: https://aclanthology.org/2020.acl-main.418/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[31] A. Waldis, J. Birrer, A. Lauscher, I. Gurevych, The Lou dataset - exploring the impact of gender-fair language in German text classification, in: Proc. of the 2024 Conference on Empirical Methods in Natural Language Processing, ACL, Miami, Florida, USA, 2024, pp. 10604–10624. URL: https://aclanthology.org/2024.emnlp-main.592/.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[32] B. Savoldi, A. Piergentili, D. Fucci, M. Negri, L. Bentivogli, A prompt response to the demand for automatic gender-neutral translation, in: Proc. of the 18th Conference of the European Chapter of the ACL (Volume 2: Short Papers), ACL, St. Julian’s, Malta, 2024, pp. 256–267. URL: https://aclanthology.org/2024.eacl-short.23/.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[33] B. Savoldi, G. Attanasio, E. Cupin, E. Gkovedarou, J. Hackenbuchner, A. Lauscher, M. Negri, A. Piergentili, M. Thind, L. Bentivogli, Mind the inclusivity gap: Multilingual gender-neutral translation evaluation with mGeNTE, 2025. URL: https://openreview.net/forum?id=dBUHC2QyBh.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[34] S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri, L. Bentivogli, GFG - gender-fair generation: A CALAMITA challenge, in: Proc. of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proc., Pisa, Italy, 2024, pp. 1106–1115. URL: https://aclanthology.org/2024.clicit-1.122/.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[35] G. Attanasio, P. Delobelle, M. La Quatra, A. Santilli, B. Savoldi, ItaEval and TweetyIta: A new extensive benchmark and efficiency-first language model for Italian, in: Proc. of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proc., Pisa, Italy, 2024, pp. 39–51. URL: https://aclanthology.org/2024.clicit-1.6/.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[37] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[38] Almawave, Velvet, 2025. URL: https://www.almawave.com/it/tecnologia/velvet/.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[39] Llama Team, The Llama 3 herd of models, 2024. arXiv:2407.21783.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[40] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kaufmann, J. R. Lee, Y. T. Lee, Y. Li, W. Liu, C. C. T. Mendes, A. Nguyen, E. Price, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, X. Wang, R. Ward, Y. Wu, D. Yu, C. Zhang, Y. Zhang, Phi-4 technical report, 2024. arXiv:2412.08905.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[41] Qwen Team, Qwen3 technical report, 2025. arXiv:2505.09388.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[42] OpenAI, Introducing GPT-4.1 in the API, 2025. URL: https://openai.com/index/gpt-4-1/, accessed: 2025-05-15.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[43] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu, L. Cheng, H. Liu, From generation to judgment: Opportunities and challenges of LLM-as-a-judge, 2025. arXiv:2411.16594.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[44] A. Piergentili, B. Savoldi, M. Negri, L. Bentivogli, An LLM-as-a-judge approach for scalable gender-neutral translation evaluation, in: Proceedings of the 3rd Workshop on Gender-Inclusive Translation Technologies (GITT 2025), EAMT, Geneva, Switzerland, 2025, pp. 46–63. URL: https://aclanthology.org/2025.gitt-1.3/.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[45] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[46] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of the 2019 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers), ACL, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423/.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[47] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proc. of the 40th Annual Meeting of the ACL, ACL, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [36] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [48] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, J. Makhoul, A study of translation edit rate with targeted human annotation, in: Proc. of the 7th Conference of the AMTA: Technical Papers, AMTA, Cambridge, Massachusetts, USA, 2006, pp. 223-231. URL: https://aclanthology.org/2006.amta-papers.25/. [57] W. Yuan, G. Neubig, P. Liu, Bartscore: evaluating generated text as text generation, in: Proc. of the 35th International Conference on NeurIPS, NIPS '21, Curran Associates Inc., Red Hook, NY, USA, 2021.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [49] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for Italian language understanding and generation, in: Proc. of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 9422-9433. URL: https://aclanthology.org/2024.lrec-main.823/. [58] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [50] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, et al., Language models are few-shot learners, in: Advances in NeurIPS, volume 33, Curran Associates, Inc., 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. [59] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods 17 (2020) 261-272.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [51] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with pagedattention, 2023. arXiv:2309.06180.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [52] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, in: Proc. of the 36th International Conference on NeurIPS, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2022. A. Detailed results. Tables 5 and 6 report the detailed results of our fine-tuning experiments.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [53] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [54] Unsloth Documentation, LoRA hyperparameters guide, 2025. URL: https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [55] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems 43 (2025) 1-55. URL: http://dx.doi.org/10.1145/3703155.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [56] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tai, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei, Scaling instruction-finetuned language models, J. Mach. Learn. Res. 25 (2024).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>