<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Gender-Neutral Rewriting in Italian: Models, Approaches, and Trade-offs</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Andrea Piergentili</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Beatrice Savoldi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Matteo Negri</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Luisa Bentivogli</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Fondazione Bruno Kessler</institution>
          ,
          <addr-line>Via Sommarive 18, 38123, Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>University of Trento</institution>
          ,
          <addr-line>via Sommarive 5, 38123, Povo (TN)</addr-line>
          ,
          <country country="IT">Italy</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>Gender-neutral rewriting (GNR) aims to reformulate text to eliminate unnecessary gender specifications while preserving meaning, a particularly challenging task in grammatical-gender languages like Italian. In this work, we conduct the first systematic evaluation of state-of-the-art large language models (LLMs) for Italian GNR, introducing a two-dimensional framework that measures both neutrality and semantic fidelity to the input. We compare few-shot prompting across multiple LLMs, fine-tune selected models, and apply targeted cleaning to boost task relevance. Our findings show that open-weight LLMs outperform the only existing model dedicated to GNR in Italian, whereas our fine-tuned models match or exceed the best open-weight LLM's performance at a fraction of its size. Finally, we discuss the trade-off between optimizing the training data for neutrality and meaning preservation.</p>
      </abstract>
      <kwd-group>
        <kwd>Ethics</kwd>
        <kwd>fairness</kwd>
        <kwd>gender rewriting</kwd>
        <kwd>large language models</kwd>
        <kwd>fine-tuning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Language technologies reinforce existing gender stereotypes and binary assumptions by disproportionately favoring masculine references or representations [<xref ref-type="bibr" rid="ref1">1</xref>], especially when gender information is ambiguous or unspecified [<xref ref-type="bibr" rid="ref2 ref3 ref4">2, 3, 4</xref>]. Such biases result in the under-representation or misrepresentation of certain gender groups, reinforcing existing societal stereotypes, and erasing non-binary identities [<xref ref-type="bibr" rid="ref5">5, 6</xref>]. Addressing these biases through gender-inclusive approaches is increasingly important to ensure language technologies contribute to more inclusive and equitable communication [7, 8, 9].</p>
      <p>Gender-neutral rewriting (GNR) has emerged as a natural language generation task aimed at producing texts free from unnecessary gender specifications [10, 11]. This task is particularly challenging in grammatical-gender languages, such as Italian, due to the pervasive encoding of gender in the morphology. Consider the sentence ‘Tutti i senatori sono stati informati’ (equivalent to AllM theM senatorsM have beenM informedM): almost every word is morphologically inflected for (masculine) gender. Rephrasing this sentence in a gender-neutral way may require significant changes, e.g. ‘Ogni membro del Senato ha ricevuto l’informazione’ (Every member of the Senate has received the information). A further challenge in automatic GNR is preserving the meaning of the original sentence beyond gender expression, to avoid generating output sentences that are neutral but semantically divergent from the input.</p>
      <p>So far, GNR system development has been mostly confined to English [10, 11, 12, inter alia], where gender is expressed through specific sets of words, such as pronouns (e.g., he/she, him/her) and lexically gendered terms (e.g., policeman/policewoman), and gender-neutral alternatives (e.g., the singular they or synonyms like police officer) are generally available and attested. GNR systems for grammatical-gender languages generally target specific gendered phenomena, such as member nouns [13], or use neologistic [14] inclusive devices such as neomorphemes and graphemic solutions [15, 16, 17] that convey neutrality, but are not necessarily acceptable in all contexts. Currently, the sole model dedicated to Italian GNR was developed by Greco et al. [18]; this model, however, was developed and tested on proprietary, not publicly available data, hindering reproducibility and progress.</p>
      <p>Towards addressing this gap, this paper explores the potential of state-of-the-art (SOTA) large language models (LLMs) to perform GNR in Italian. Specifically, we explore both prompting and fine-tuning approaches and assess both neutrality and meaning preservation in the reformulated texts.</p>
      <p>Our contributions are threefold: i) the first systematic evaluation of SOTA LLMs for Italian GNR under a two-dimensional framework measuring both neutrality and meaning preservation; ii) a set of experiments in fine-tuning LLMs for GNR, enabling compact models to rival significantly larger-sized models; iii) an investigation of the GNR performance trade-off between meaning preservation and neutrality in the outputs of LLMs fine-tuned on sentence similarity-optimized data.1</p>
      <p>CLiC-it 2025: Eleventh Italian Conference on Computational Linguistics, September 24–26, 2025, Cagliari, Italy. * Corresponding author: apiergentili@fbk.eu (A. Piergentili); bsavoldi@fbk.eu (B. Savoldi); negri@fbk.eu (M. Negri); bentivo@fbk.eu (L. Bentivogli). © 2025 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).</p>
      <p>Table 1: Example of an Italian mGeNTE entry. REF-G: ‘Spero di essere stato chiaro su questo punto.’ (EN: ‘I hope that I am clear in this.’) REF-N: ‘Spero di avere espresso con chiarezza questo punto.’ (EN: ‘I hope that I have expressed this point clearly.’)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background</title>
      <sec id="sec-2-1">
        <title>Gender-Inclusive Language</title>
        <p>Inclusive language aims to prevent expressions that reinforce gender hierarchies or render non-binary identities invisible, promoting fairness and inclusion in alignment with the UN Sustainable Development Goal of gender equality.2 In grammatical-gender languages like Italian, inclusive language is both particularly challenging and increasingly urgent due to their entrenched gender systems [19, 20, 21] and the widespread use of masculine forms as default to mark generic or mixed-gender referents [22].3 To address this issue, two main strategies have emerged, as reviewed by Rosola et al. [24] within the Italian linguistic context. On the one hand, innovative forms using neomorphemes and symbols (e.g., tutt* or tutt@) are mostly used in informal contexts like social media and online LGBTQIA+ communities, and are generally not accepted in more formal contexts [25]. Instead, conservative gender-neutral language strategies retool existing forms and grammar to avoid unnecessary gendered expressions [26, 27], e.g. by replacing i professori with la docenza [9]. As attested by Piergentili et al. [<xref ref-type="bibr" rid="ref6">28</xref>], such neutral solutions are increasingly accepted in communication and are endorsed by institutions and universities to embrace all gender identities [<xref ref-type="bibr" rid="ref7">29</xref>].4</p>
        <sec id="sec-2-1-1">
          <title>Gender-Inclusive Rewriting</title>
          <p>In recent years, sexism and gender-exclusionary practices have been increasingly addressed in NLP, focusing initially on binary gender bias and more recently expanding to non-binary inclusive language technologies [<xref ref-type="bibr" rid="ref4">6, 4</xref>]. NLP work has explored the modeling of inclusive language across various tasks [<xref ref-type="bibr" rid="ref8 ref9">30, 31</xref>], including inclusive language generation. For instance, Bartl and Leavy [12] explored stereotype reduction in English LLMs fine-tuned on inclusive seeds and lexicon.</p>
          <p>Intralingual inclusive rewriting has primarily been explored in English [10, 11, 12], where gender marking is scarce. Similar efforts in languages with grammatical gender include research on German [15], Portuguese [16], and French [17, 13], either by using innovative forms or by targeting specific instances of gendered language, such as masculine generics in member nouns. In Italian, prior work has explored gender-neutral translation [<xref ref-type="bibr" rid="ref10 ref11">32, 33</xref>], whereas intra-lingual rewriting remains mostly limited to benchmarking efforts [<xref ref-type="bibr" rid="ref12">34</xref>]. Attanasio et al. [<xref ref-type="bibr" rid="ref13">35</xref>] compared several instruction-following models prompted across fairness-related tasks, including GNR, but these underperformed, achieving less than 50% success in neutralization. Frenda et al. [<xref ref-type="bibr" rid="ref12">34</xref>] proposed the gender-fair generation (GFG) challenge, where for one of the tasks models are prompted to reformulate gendered Italian sentences in a neutral way. Closest to our work, Greco et al. [18] developed a rewriter by fine-tuning language models specifically for Italian gender-neutral language. However, the data used for testing and developing these models are not publicly available, hampering further research and comparability.</p>
          <p>3. Experimental settings</p>
          <p>We define GNR as the task of reformulating a sentence to remove explicit gender markings referring to human entities, without altering the sentence beyond what is necessary for neutralization, ensuring semantic equivalence to the input. We run a set of experiments evaluating different systems and approaches to GNR. Here, we first discuss the evaluation data and metrics (§3.1) and the set of models we experiment with (§3.2). Then, we describe two approaches to GNR: few-shot prompting SOTA LLMs (§3.3) and fine-tuning a subset of those LLMs on repurposed Italian data (§3.4).</p>
          <p>1 We release models and data at https://huggingface.co/FBK-MT
2 See https://sdgs.un.org/goals/goal5
3 English presents fewer challenges as gender marking is primarily limited to pronouns, allowing focused solutions like the singular they [23].
4 See for instance the EU Parliament guidelines for gender-neutral language: https://www.europarl.europa.eu/cmsdata/151780/GNL_Guidelines_EN.pdf</p>
          <p>3.1. Evaluation</p>
        </sec>
        <sec id="sec-2-1-2">
          <title>Test data</title>
          <p>Following Frenda et al. [34], we conduct our GNR experiments on mGeNTE [<xref ref-type="bibr" rid="ref11">33</xref>], a benchmark for gender-neutral translation from English into several grammatical-gender languages, including Italian. mGeNTE provides 1,500 parallel gendered and gender-neutral references created by professionals (REF-G and REF-N respectively), differing only in gender expression (see Table 1 for an example of an Italian mGeNTE entry). It is organized into two subsets: Set-G, containing sentences that require neutralization, and Set-N, containing sentences that do not. For our GNR experiments, we use the 750 Italian gendered references from Set-N as input.</p>
        </sec>
        <sec id="sec-2-1-3">
          <title>Metrics</title>
          <p>To evaluate gender-neutrality, we use the LLM-as-a-Judge [43] approach proposed by Piergentili et al. [44], which provides sentence-level binary gendered/neutral assessments, and was shown to be highly accurate on both human- and model-generated texts. We use their optimal configuration for monolingual evaluation.5 We compute the percentage of neutralized sentences over the whole test set (750 entries).</p>
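          <p>The neutrality score can be sketched as follows (a minimal illustration; the function and label names are ours, while the actual binary judgments come from the LLM judge described above):</p>

```python
def neutrality_percentage(judge_labels):
    """Share of test sentences the LLM judge labels as neutral, in percent."""
    return 100 * sum(label == "neutral" for label in judge_labels) / len(judge_labels)
```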
          <p>To evaluate meaning preservation in GNR, we use BERTScore [45], an attested BERT-based [46] metric measuring the semantic similarity of two texts (the higher, the better). We use BERTScore rather than common string-matching metrics like BLEU [47] and TER [<xref ref-type="bibr" rid="ref15">48</xref>] because gender-neutralization can have a notable impact on the lexicon, morphology, and structure of a sentence [9], which such metrics would penalize. By contrast, BERTScore was found to be rather insensitive to gender-neutralization [<xref ref-type="bibr" rid="ref6">28</xref>]. Therefore, lower BERTScore values should be attributed to differences in the meaning of the sentences beyond gender, which we evaluate separately, as described above.</p>
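          <p>To see why surface-overlap metrics penalize neutralization even when meaning is preserved, consider a toy unigram-precision score on the Table 1 pair (a deliberately simplified stand-in for BLEU, not the actual metric):</p>

```python
def unigram_precision(candidate, reference):
    """Fraction of candidate tokens that also appear in the reference."""
    cand = candidate.lower().split()
    ref = set(reference.lower().split())
    return sum(token in ref for token in cand) / len(cand)

ref_g = "Spero di essere stato chiaro su questo punto."        # gendered reference
ref_n = "Spero di avere espresso con chiarezza questo punto."  # neutral rewrite
# The meaning-preserving rewrite shares only half of its tokens with REF-G.
```

BLEU and TER behave analogously at the n-gram and edit-distance level, which motivates the choice of a semantic metric above.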
          <p>To identify reference values to guide the interpretation of BERTScore in GNR, we compute the distribution of BERTScore of mGeNTE REF-N sentences against the respective REF-G.6 As these neutral reformulations were produced by human experts, the BERTScore distribution provides an empirical estimate of human-level performance in meaning preservation in GNR. We take the mean BERTScore minus one standard deviation as the quality threshold.</p>
          <p>5 Prompt: ‘Mono+P+L’; GPT model: gpt-4o-2024-08-06
6 We only use Set-N entries in this computation.</p>
          <p>Table 3: The prompts used in our experiments. GFG, Italian: Riformula la seguente frase utilizzando un linguaggio neutro rispetto al genere dei referenti umani, evitando l’uso di forme maschili e femminili.</p>
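          <p>The human-level threshold described above (mean minus one standard deviation of the REF-N vs. REF-G BERTScore distribution) can be sketched as follows; the score values here are hypothetical placeholders, not the paper's data:</p>

```python
import numpy as np

# Hypothetical BERTScore values of human REF-N reformulations against REF-G
human_scores = np.array([0.97, 0.95, 0.99, 0.93, 0.96])

# Quality threshold: mean minus one (population) standard deviation
threshold = human_scores.mean() - human_scores.std()
```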
          <p>GFG, English: Rewrite the following Italian sentence using gender-neutral language in reference to human beings, avoiding masculine or feminine forms.</p>
          <p>Rewrite, Italian: Sei un riscrittore di frasi italiane con l’obiettivo di rendere i testi neutrali rispetto al genere dei referenti umani. Ti viene fornita una frase che contiene riferimenti a persone in forme marcate per genere, come il maschile sovraesteso o coppie binarie. Il tuo compito è riformulare la frase in modo da:
• rimuovere riferimenti espliciti al genere quando non necessari;
• mantenere inalterato il significato originale;
• preservare lo stile e la leggibilità del testo.
Per farlo, usa strategie come:
• sostantivi collettivi (“la cittadinanza”, “il personale”, “l’utenza”);
• perifrasi impersonali (“si dovrebbe”, “si consiglia”);
• forme passive (“l’accesso è consentito”);
• forme imperative (“allega il documento”);
• pronomi relativi e costruzioni subordinate (“chi ha svolto attività di pesca”);
• termini epiceni (“ogni giudice”, “gentile collega”);
• termini neutri (“l’individuo”, “la persona interessata”, “il membro”).
IMPORTANTE:
• evita l’uso del maschile come forma generica e non usare forme grafiche non standard come asterischi o schwa;
• evita doppie formulazioni come “il/a cittadino/a” oppure “il professore o la professoressa”;
• non rimuovere parti della frase che non richiedono modifiche (ad esempio, i nomi propri);
• fornisci solo la frase riformulata.</p>
          <p>Rewrite, English: You are a rewriter of Italian sentences with the goal of making texts gender-neutral with respect to human referents. You are given a sentence that contains references to people using gender-marked forms (such as masculine generics or binary pairs). Your task is to rewrite the sentence to:
• remove explicit gender references when they are not necessary;
• preserve the original meaning;
• maintain the style and readability of the text.
To do this, use strategies such as:
• collective nouns (“la cittadinanza”, “il personale”, “l’utenza”);
• impersonal phrases (“si dovrebbe”, “si consiglia”);
• passive constructions (“l’accesso è consentito”);
• imperative constructions (“allega il documento”);
• relative pronouns and subordinate clauses (“chi ha svolto attività di pesca”);
• epicene terms (“ogni giudice”, “gentile collega”);
• neutral terms (“l’individuo”, “la persona interessata”, “il membro”).
IMPORTANT:
• avoid using the masculine form as a generic and do not use non-standard spellings such as asterisks or schwa;
• avoid binary formulations such as “il/a cittadino/a” or “il professore o la professoressa”;
• do not remove any part of the sentence that does not need to be rewritten (e.g. proper names);
• only return the reformulated sentence.</p>
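          <p>A few-shot prompt built from such an instruction can be sketched as a chat-message builder (the exemplar pair shown in the test is illustrative, not one of the paper's 8 shots):</p>

```python
def build_messages(instruction, shots, sentence):
    """Assemble a few-shot chat prompt: system instruction, exemplars, then the query."""
    messages = [{"role": "system", "content": instruction}]
    for gendered, neutral in shots:
        messages.append({"role": "user", "content": gendered})
        messages.append({"role": "assistant", "content": neutral})
    messages.append({"role": "user", "content": sentence})
    return messages
```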
        </sec>
        <sec id="sec-2-1-4">
          <title>3.3. Few-Shot Prompting</title>
          <p>We run few-shot prompting experiments with all models in the selection described above,9 to investigate the performance of LLMs without any task-specific fine-tuning. We use two prompt formats:
• GFG: a concise rewriting instruction, originally used by Frenda et al. [<xref ref-type="bibr" rid="ref12">34</xref>] in their gender-fair generation challenge for Italian LLMs.
• Rewrite: a more detailed and analytical prompt, also featuring essential guidelines for the task.</p>
        </sec>
        <sec id="sec-2-1-5">
          <title>9 Except for Inclusively, which does not support few-shot prompting.</title>
          <p>We instead test its off-the-shelf generation capabilities.</p>
        </sec>
        <sec id="sec-2-1-6">
          <title>Prompt formats and languages</title>
          <p>These prompts allow us to explore the impact of more complex instructions on models’ performance. Moreover, we experiment with these two prompt formats by formulating them in both Italian and English, to investigate whether the language used is a relevant factor as well. The content of the prompts is reported in Table 3. We include the same 8 task exemplars, or shots, with all prompts, to elicit the in-context learning ability of LLMs [<xref ref-type="bibr" rid="ref17">50</xref>]. We use vLLM [<xref ref-type="bibr" rid="ref18">51</xref>] as the inference engine.</p>
        </sec>
        <sec id="sec-2-1-7">
          <title>3.4. Fine-tuning</title>
          <p>We perform fine-tuning experiments to assess whether, and to what extent, smaller open-weight LLMs can be adapted to the GNR task and approach the performance of larger models or closed systems. Namely, we fine-tune LLaMAntino, Velvet, LLama 3.1, Phi 4, and the 8B and 14B Qwen3 models.</p>
          <p>3.4.1. Data</p>
          <p>We fine-tune the models on data originally used to train a neutrality classifier.10 This data consists of gendered Italian sentences and their gender-neutral counterparts, all generated starting from a dictionary of masculine, feminine, and neutral expressions, through a multi-step prompting pipeline. We repurpose this data to fine-tune autoregressive LLMs for GNR. We prepare the data as chat-formatted input, where each instance consists of a user role message containing a gendered sentence, and an assistant role message containing the corresponding neutral sentence. Consistent with the models’ prior instruction-following fine-tuning, this method adopts a conversational prompt–response format while strictly adhering to a causal token-prediction objective [<xref ref-type="bibr" rid="ref19">52</xref>].</p>
          <p>As the sentences were partly LLM-generated, we note that the content of the gendered-neutral pairs may not always be aligned, due to the unpredictability of LLMs in open-ended generation.11 To investigate this aspect, we compare the gendered and neutral sentences in the dataset using BERTScore, to identify dataset entries with semantically divergent gendered-neutral sentences. Figure 1 reports the BERTScore values for the entire dataset. We observe that while the score distribution is skewed towards almost-perfect values, there is a notable tail of gendered-neutral sentence pairs with rather divergent semantic content. To investigate the impact of such data in GNR fine-tuning, we construct a subset to be used for training alongside the full dataset: a clean subset obtained by filtering out the bottom 50% of sentence pairs based on the BERTScore values. Statistics about the fine-tuning data are reported in Table 4.</p>
          <p>10 More specifically, we use the cleaned version of the dataset later released by Savoldi et al. [<xref ref-type="bibr" rid="ref10">32</xref>] at https://github.com/hlt-mt/fbk-NEUTR-evAL/blob/main/solutions/GeNTE.md
11 While this is not necessarily an issue in the development of a classifier, where individual sentences are simply paired with neutrality labels, for a rewriting task the input-output sentences should be identical except for the attribute of interest, i.e., in this case, gender.</p>
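          <p>The clean subset construction (dropping the bottom 50% of pairs by BERTScore) can be sketched as follows; the pairs and scores in the test are hypothetical:</p>

```python
def clean_subset(pairs, scores):
    """Keep the top half of (gendered, neutral) pairs, ranked by BERTScore."""
    ranked = sorted(zip(pairs, scores), key=lambda item: item[1], reverse=True)
    keep = len(ranked) // 2
    return [pair for pair, _ in ranked[:keep]]
```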
          <p>3.4.2. Method</p>
        </sec>
        <sec id="sec-2-1-8">
          <title>Fine-tuning method</title>
          <p>We fine-tune the selected models using Low-Rank Adaptation (LoRA) [53]. Following common practices in LoRA fine-tuning [54], we set the rank and alpha at 32, and use the following hyperparameters to strike a balance between hardware constraints12 and consistency across model sizes and requirements: learning rate 2 × 10⁻⁴; batch size 8 for the 8B models and 4 for the 14B models. We use early stopping with a patience of 20 steps for the 8B models and 40 steps for the 14B models.</p>
          <p>12 We run our experiments on nodes with 4 NVIDIA A100 GPUs with 64 GB VRAM each.</p>
          <p>4. Results</p>
          <p>4.1. Few-Shot Prompting Results</p>
          <p>Figure 2 summarizes the results of the few-shot prompting experiments, showing all models’ performance in neutrality and meaning preservation. Higher values on both axes indicate better performance; therefore, systems closer to the top-right corner perform best. As no consistent trend emerged across prompt formats (GFG vs. Rewrite, see Section 3.3) and languages (Italian vs. English), we report each model’s average performance, along with the range of neutrality and BERTScore values observed across prompting conditions. In Appendix A we provide the complete and detailed results obtained with the two prompt formats, separately for Italian and English instructions.</p>
          <p>Generally, and with rare exceptions, all models’ BERTScore values are well above the quality threshold we identified in §3.1. This means that the models do not generate unrelated or additional text, confirming that their outputs remain adherent to the input and free of “hallucinations” [<xref ref-type="bibr" rid="ref22">55</xref>]. Neutrality scores, on the contrary, vary significantly across models. Looking at our baseline, the GNR-dedicated model Inclusively, we observe that it performs rather poorly in neutrality. Across LLMs, we notice similar behavior within the groups. The “Italian” models, in the bottom left quarter of the chart, generally fail to neutralize, and alter the sentences the most. Within the multilingual LLMs group, only Phi 4, Qwen3 32B, and LLama 3.3 perform better than the Italian models. The rest of the Qwen3 family generally underperforms, with the high BERTScore suggesting that they make little to no change to the gendered sentences. The only model performing well on both axes is GPT 4.1, which tops at 89.07% neutralization accuracy and 93.21 BERTScore, indicating that it correctly alters the parts of the sentences expressing the gender of human beings while leaving the rest untouched.</p>
          <p>Overall, we find that the LLMs we tested perform very differently in GNR in Italian, and that failure in this setting consists in overlooking the relevant (gendered) parts of the input to act upon, and/or unsuccessfully rendering them gender-neutral.</p>
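          <p>The size-dependent fine-tuning settings listed in §3.4 can be encoded compactly as follows (a convenience sketch; the function name and dictionary keys are ours):</p>

```python
def finetuning_config(model_size_b):
    """LoRA fine-tuning settings by model size, in billions of parameters."""
    config = {"lora_rank": 32, "lora_alpha": 32, "learning_rate": 2e-4}
    if model_size_b >= 14:
        config.update(batch_size=4, early_stopping_patience=40)
    else:  # 8B models
        config.update(batch_size=8, early_stopping_patience=20)
    return config
```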
        </sec>
        <sec id="sec-2-1-9">
          <title>4.2. Fine-Tuning Results</title>
          <p>Results of the fine-tuning experiments are reported in Figure 3. We first notice that on the neutrality axis all fine-tuned models outperform the baseline, except for the LLamantino/clean configuration. LLamantino shows the narrowest gains overall, and in one case even a drop in neutrality, echoing its weaker few-shot prompting results and suggesting it may be ill-suited to GNR. In four out of six instances, and always with the full dataset, the fine-tuned models also outperform the best performer among the open-weight models in the prompting experiments, i.e. LLama 3.3 70B with the GFG English prompt, though with a significant drop in BERTScore.</p>
          <p>
            Such a drop indicates that these models fail by
hallucinating unrelated content in their attempt to neutralize,
rather than by leaving the input sentences untouched
as observed in the prompting experiments (§4.1). This
is possibly due to two factors: the significantly smaller
size of the fine-tuned models with respect to LLama 3.3
70B (1/9 or 1/5, for the 8B and 14B models respectively),
as larger LLMs have been shown to exhibit greater
robustness and lower variance in downstream performance
after fine-tuning compared to smaller counterparts [
            <xref ref-type="bibr" rid="ref23">56</xref>
            ],
and/or the presence of many divergent gender-neutral
sentence pairs in the fine-tuning dataset (see §3.4.1).
          </p>
          <p>While full yields the highest improvements in
neutrality, only clean improves performance on both axes
while keeping BERTScore within the human-level range.</p>
          <p>However, it yields significantly smaller gains in neutrality
and even causes drops for two models (LLamantino, Phi
4). We hypothesize that clean may be excessively
conditioned by the data filtering method, i.e. a BERTScore
based selection. In other words, by selecting only dataset
entries with almost perfect BERTScore values we are
optimizing the models to perform well on the sentence
similarity dimension—as measured by BERTScore—rather
than GNR.</p>
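          <p>The analysis that follows compares BERTScore and BARTScore with Pearson r and Spearman ρ, computed with SciPy (see footnote 14). A toy illustration of how the two coefficients can diverge, with made-up scores that are monotonically but not linearly related:</p>

```python
from scipy.stats import pearsonr, spearmanr

# Hypothetical metric values for four outputs: identical ranking, non-linear relation
bert = [0.90, 0.92, 0.94, 0.96]
bart = [-3.0, -2.5, -2.4, -0.1]

r, _ = pearsonr(bert, bart)      # linear correlation of raw scores
rho, _ = spearmanr(bert, bart)   # rank correlation
# rho is 1.0 (the rankings agree perfectly), while r is noticeably lower.
```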
          <p>The impact of metric-based data selection. To investigate the hypothesis above, we evaluate the same outputs against the gendered inputs with another semantic similarity metric: BARTScore [57].13 BERTScore and BARTScore evaluations are visualized in Figure 4. To understand whether the outputs of the models fine-tuned on clean are actually very semantically similar to the corresponding input, and whether those models simply learned to game BERTScore, we compute14 the Pearson r and Spearman ρ correlation coefficients between the BERTScore and BARTScore assessments. The first captures linear correlations between the two metrics’ raw scores, while the latter measures how well the relationship between the two variables can be described by a monotonic function, by comparing the rankings of the scores rather than their raw values. This combination allows us to assess both the alignment of the scores and the consistency with which the two metrics rank the outputs.</p>
          <p>We find that in full, r equals 0.814 and ρ equals 0.907, whereas in clean they are 0.914 and 0.679 respectively.15 r is high in both cases, indicating a strong linear correlation between the two metrics, stronger in clean, as in that case the data points are more tightly clustered and skewed towards higher values. This confirms that the metrics generally agree on the quality of the outputs. The substantial drop in ρ, instead, indicates that there are many instances in clean where the monotonic trend is broken, i.e., higher BERTScore does not necessarily correspond to higher BARTScore. This suggests that the clean models also learned to game BERTScore by reproducing features rewarded by that metric.</p>
          <p>With respect to our hypothesis: by selecting high-similarity pairs for the clean dataset, we effectively steered models toward preserving semantic alignment with the input; however, this emphasis on similarity appears to have hampered their improvement in neutralization. Indeed, the models learned to preserve the input to an excessive degree, as confirmed by the high r coefficient and high BARTScore values shown in Figure 4. We interpret our results as evidence of a broader trade-off between optimizing for neutrality and for sentence similarity. Our findings underscore the need for data curation strategies that strike a balance between neutrality and similarity, achieving the flexibility required for effective GNR.</p>
          <p>13 While similar in name and scope, BERTScore and BARTScore function differently. The first computes a sum of token-level cosine similarities between two sentences’ embeddings encoded by a BERT (encoder-only) model; the latter is computed as the weighted sum of the log-probabilities that a pretrained BART (encoder-decoder) model assigns to each token in the generated text.</p>
          <p>Through fine-tuning experiments we showed that compact models can match or exceed the best open-weight LLM at a fraction of its size. Moreover, our BERTScore-based data cleaning highlighted a trade-off: models trained on cleaned data achieve human-level BERTScore but show smaller neutrality gains and exhibit ranking differences against another similarity metric, signaling over-fitting on BERTScore. Future work should take this trade-off into account and create dedicated, high-quality parallel data, aiming to reach the performance of the commercial system with open-weight models.</p>
          <p>Acknowledgments</p>
          <p>We acknowledge the support of the project InnovAction: Network Italiano dei Centri per l’Innovazione Tecnologica (CUP B47H2200437000), funded by MIMIT with NPRR - NextGenerationEU funds, in collaboration with Piazza Copernico S.r.l. We also received funding from the PNRR project FAIR - Future AI Research (PE00000013), under the NRRP MUR program funded by NextGenerationEU. Finally, we acknowledge the CINECA award under the ISCRA initiative (AGeNTE), for the availability of high-performance computing resources and support.</p>
          <p>References</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>5. Conclusions</title>
      <p>We presented the first systematic investigation of state-of-the-art large language models for Italian gender-neutral rewriting under a two-dimensional evaluation of neutrality and meaning preservation. In our few-shot prompting experiments, open-weight models outperformed the only existing Italian-specific system but remained behind a closed commercial system.</p>
      <p>13 (continued) In our experiments, we use the BART model facebook/bart-large [58].
14 We use the Python library SciPy [59].
15 All p-values &lt; 0.05.</p>
      <p>[6] S. Dev, M. Monajatipoor, A. Ovalle, A. Subramonian, J. Phillips, K.-W. Chang, Harms of gender exclusivity and challenges in non-binary representation in language technologies, in: Proc. of the 2021 Conference on Empirical Methods in Natural Language Processing, ACL, Online and Punta Cana, Dominican Republic, 2021, pp. 1968–1994. URL: https://aclanthology.org/2021.emnlp-main.150/.</p>
      <p>[7] U. Gabriel, P. M. Gygax, E. A. Kuhn, Neutralising linguistic sexism: Promising but cumbersome?, Group Processes &amp; Intergroup Relations 21 (2018) 844–858.</p>
      <p>[8] APA, Publication Manual of the APA, 7th ed., 2020.</p>
      <p>[9] A. Piergentili, D. Fucci, B. Savoldi, L. Bentivogli, M. Negri, Gender neutralization for an inclusive machine translation: from theoretical foundations to open challenges, in: Proc. of the First Workshop on Gender-Inclusive Translation Technologies, EAMT, Tampere, Finland, 2023, pp. 71–83. URL: https://aclanthology.org/2023.gitt-1.7/.</p>
      <p>[10] T. Sun, K. Webster, A. Shah, W. Y. Wang, M. Johnson, They, them, theirs: Rewriting with gender-neutral English, 2021. arXiv:2102.06788.</p>
      <p>[11] E. Vanmassenhove, C. Emmery, D. Shterionov, NeuTral Rewriter: A rule-based and neural approach to automatic rewriting into gender neutral alternatives, in: Proc. of the 2021 Conference on Empirical Methods in Natural Language Processing, ACL, Online and Punta Cana, Dominican Republic, 2021, pp. 8940–8948. URL: https://aclanthology.org/2021.emnlp-main.704/.</p>
      <p>[12] M. Bartl, S. Leavy, From ‘showgirls’ to ‘performers’: Fine-tuning with gender-inclusive language for bias reduction in LLMs, in: Proc. of the 5th Workshop on Gender Bias in Natural Language Processing (GeBNLP), ACL, Bangkok, Thailand, 2024, pp. 280–294. URL: https://aclanthology.org/2024.gebnlp-1.18/.</p>
      <p>[13] E. Doyen, A. Todirascu, Genre: A French gender-neutral rewriting system using collective nouns, 2025. arXiv:2505.23630.</p>
      <p>[14] E. Rose, M. Winig, J. Nash, K. Roepke, K. Conrod, Variation in acceptability of neologistic English pronouns, Proc. of the Linguistic Society of America 8 (2023) 5526. URL: https://journals.linguisticsociety.org/proceedings/index.php/PLSA/article/view/5526.</p>
      <p>[15] D. Pomerenke, Inclusify: A benchmark and a model for gender-inclusive German, 2022. arXiv:2212.02564.</p>
      <p>[16] L. Veloso, L. Coheur, R. Ribeiro, A rewriting approach for gender inclusivity in Portuguese, in: Findings of the ACL: EMNLP 2023, ACL, Singapore, 2023, pp. 8747–8759. URL: https://aclanthology.org/2023.findings-emnlp.585/.</p>
      <p>[17] P. Lerner, C. Grouin, INCLURE: a dataset and toolkit for inclusive French translation, in: Proc. of the 17th Workshop on Building and Using Comparable Corpora (BUCC) @ LREC-COLING 2024, ELRA and ICCL, Torino, Italia, 2024, pp. 59–68. URL: https://aclanthology.org/2024.bucc-1.7/.</p>
      <p>[18] S. Greco, M. La Quatra, L. Cagliero, T. Cerquitelli, Towards AI-assisted inclusive language writing in Italian formal communications, ACM Trans. Intell. Syst. Technol. (2025). URL: https://doi.org/10.1145/3729237.</p>
      <p>[19] B. Papadopoulos, Morphological Gender Innovations in Spanish of Genderqueer Speakers, Department of Spanish and Portuguese, University of California, UC Berkeley, 2019. URL: https://escholarship.org/uc/item/6j73t666.</p>
      <p>[20] G. S. di Carlo, Is Italy ready for gender-inclusive language? An attitude and usage study among Italian speakers, in: Inclusiveness Beyond the (Non)binary in Romance Languages, 1st ed., Routledge, 2024, p. 21. URL: https://doi.org/10.4324/9781003432906.</p>
      <p>[21] G. V. Silva, C. Soares, Inclusiveness Beyond the (Non)binary in Romance Languages: Research and Classroom Implementation, 1st ed., Routledge, London, 2024. doi:10.4324/9781003432906.</p>
      <p>[22] P. Gygax, S. Sato, A. Öttl, U. Gabriel, The masculine form in grammatically gendered languages and its multiple interpretations: a challenge for our cognitive system, Language Sciences 83 (2021) 101328. URL: https://www.sciencedirect.com/science/article/pii/S0388000120300619.</p>
      <p>[23] L. Ackerman, Syntactic and cognitive issues in investigating gendered coreference, Glossa: a journal of general linguistics 4 (2019).</p>
      <p>[24] M. Rosola, S. Frenda, A. T. Cignarella, M. Pellegrini, A. Marra, M. Floris, Beyond obscuration and visibility: Thoughts on the different strategies of gender-fair language in Italian, in: Proc. of the 9th Italian Conference on Computational Linguistics (CLiC-it 2023), CEUR Workshop Proc., Venice, Italy, 2023, pp. 369–378. URL: https://aclanthology.org/2023.clicit-1.44/.</p>
      <p>[25] G. Comandini, Salve a tutt@, tutt*, tuttu, tuttx e tutt@: l’uso delle strategie di neutralizzazione di genere nella comunità queer online. Indagine su un corpus di italiano scritto informale sul web [Hello everyone: the use of gender neutralization strategies in the online queer community. A survey of a corpus of informal written Italian on the web], Testo e Senso 23 (2021) 43–64.</p>
      <p>[26] J. Silveira, Generic Masculine Words and Thinking, Women’s Studies International Quarterly 3 (1980) 165–178. URL: https://www.sciencedirect.com/science/article/pii/S0148068580921132.</p>
      <p>[27] A. H. Bailey, A. Williams, A. Cimpian, Based on billions of words on the internet, people = men, Science Advances 8 (2022) eabm2463.</p>
      <sec id="sec-3-1">
        <title>BERTScore</title>
        <p>[Table 6 residue: per-model BERTScore results with columns Model, Size (B), GFG Ita, GFG Eng, Rewrite Ita, Rewrite Eng, and AVG; rows include Inclusively. The best scores across the categories are highlighted, and the best overall performer is in bold.]</p>
      </sec>
      <sec id="sec-3-3">
        <title>Declaration on Generative AI</title>
        <p>During the preparation of this work, the author(s) used ChatGPT (OpenAI) in order to: paraphrase and reword, improve writing style, and check grammar and spelling. After using these tool(s)/service(s), the author(s) reviewed and edited the content as needed and take(s) full responsibility for the publication’s content.</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] B. Savoldi, S. Papi, M. Negri, A. Guerberof-Arenas, L. Bentivogli, What the Harm? Quantifying the Tangible Impact of Gender Bias in Machine Translation with a Human-centered Study, in: Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Miami, Florida, USA, 2024. URL: https://aclanthology.org/2024.emnlp-main.1002/. doi:10.18653/v1/2024.emnlp-main.1002.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] H. Kotek, R. Dockum, D. Sun, Gender bias and stereotypes in large language models, in: Proc. of The ACM Collective Intelligence Conference, CI '23, ACM, New York, NY, USA, 2023, pp. 12–24. URL: https://doi.org/10.1145/3582269.3615599.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] R. Ostrow, A. Lopez, LLMs reproduce stereotypes of sexual and gender minorities, 2025. arXiv:2501.05926.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] B. Savoldi, J. Bastings, L. Bentivogli, E. Vanmassenhove, A decade of gender bias in machine translation, Patterns (2025) 101257. URL: https://www.sciencedirect.com/science/article/pii/S2666389925001059.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] S. L. Blodgett, S. Barocas, H. Daumé III, H. Wallach, Language (technology) is power: A critical survey of “bias” in NLP, in: Proc. of the 58th Annual Meeting of the ACL, ACL, Online, 2020, pp. 5454–5476. URL: https://aclanthology.org/2020.acl-main.485/.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[28] A. Piergentili, B. Savoldi, D. Fucci, M. Negri, L. Bentivogli, Hi guys or hi folks? Benchmarking gender-neutral machine translation with the GeNTE corpus, in: Proc. of the 2023 Conference on Empirical Methods in Natural Language Processing, ACL, Singapore, 2023, pp. 14124–14140. URL: https://aclanthology.org/2023.emnlp-main.873/.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[29] F. Höglund, M. Flinkfeldt, De-gendering parents: Gender inclusion and standardised language in screen-level bureaucracy, International Journal of Social Welfare (2023).</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[30] Y. T. Cao, H. Daumé III, Toward gender-inclusive coreference resolution, in: Proc. of the 58th Annual Meeting of the ACL, ACL, Online, 2020, pp. 4568–4595. URL: https://aclanthology.org/2020.acl-main.418/.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[31] A. Waldis, J. Birrer, A. Lauscher, I. Gurevych, The Lou dataset - exploring the impact of gender-fair language in German text classification, in: Proc. of the 2024 Conference on Empirical Methods in Natural Language Processing, ACL, Miami, Florida, USA, 2024, pp. 10604–10624. URL: https://aclanthology.org/2024.emnlp-main.592/.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[32] B. Savoldi, A. Piergentili, D. Fucci, M. Negri, L. Bentivogli, A prompt response to the demand for automatic gender-neutral translation, in: Proc. of the 18th Conference of the European Chapter of the ACL (Volume 2: Short Papers), ACL, St. Julian’s, Malta, 2024, pp. 256–267. URL: https://aclanthology.org/2024.eacl-short.23/.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[33] B. Savoldi, G. Attanasio, E. Cupin, E. Gkovedarou, J. Hackenbuchner, A. Lauscher, M. Negri, A. Piergentili, M. Thind, L. Bentivogli, Mind the inclusivity gap: Multilingual gender-neutral translation evaluation with mGeNTE, 2025. URL: https://openreview.net/forum?id=dBUHC2QyBh.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[34] S. Frenda, A. Piergentili, B. Savoldi, M. Madeddu, M. Rosola, S. Casola, C. Ferrando, V. Patti, M. Negri, L. Bentivogli, GFG - gender-fair generation: A CALAMITA challenge, in: Proc. of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proc., Pisa, Italy, 2024, pp. 1106–1115. URL: https://aclanthology.org/2024.clicit-1.122/.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[35] G. Attanasio, P. Delobelle, M. La Quatra, A. Santilli, B. Savoldi, ItaEval and TweetyIta: A new extensive benchmark and efficiency-first language model for Italian, in: Proc. of the 10th Italian Conference on Computational Linguistics (CLiC-it 2024), CEUR Workshop Proc., Pisa, Italy, 2024, pp. 39–51. URL: https://aclanthology.org/2024.clicit-1.6/.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[37] P. Basile, E. Musacchio, M. Polignano, L. Siciliani, G. Fiameni, G. Semeraro, Llamantino: Llama 2 models for effective text generation in Italian language, 2023. arXiv:2312.09993.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[38] Almawave, Velvet, 2025. URL: https://www.almawave.com/it/tecnologia/velvet/.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[39] Llama Team, The Llama 3 herd of models, 2024. arXiv:2407.21783.</mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>[40] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, S. Gunasekar, M. Harrison, R. J. Hewett, M. Javaheripi, P. Kaufmann, J. R. Lee, Y. T. Lee, Y. Li, W. Liu, C. C. T. Mendes, A. Nguyen, E. Price, G. de Rosa, O. Saarikivi, A. Salim, S. Shah, X. Wang, R. Ward, Y. Wu, D. Yu, C. Zhang, Y. Zhang, Phi-4 technical report, 2024. arXiv:2412.08905.</mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>[41] Qwen Team, Qwen3 technical report, 2025. arXiv:2505.09388.</mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>[42] OpenAI, Introducing GPT-4.1 in the API, 2025. URL: https://openai.com/index/gpt-4-1/, accessed: 2025-05-15.</mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>[43] D. Li, B. Jiang, L. Huang, A. Beigi, C. Zhao, Z. Tan, A. Bhattacharjee, Y. Jiang, C. Chen, T. Wu, K. Shu, L. Cheng, H. Liu, From generation to judgment: Opportunities and challenges of LLM-as-a-judge, 2025. arXiv:2411.16594.</mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>[44] A. Piergentili, B. Savoldi, M. Negri, L. Bentivogli, An LLM-as-a-judge approach for scalable gender-neutral translation evaluation, in: Proceedings of the 3rd Workshop on Gender-Inclusive Translation Technologies (GITT 2025), EAMT, Geneva, Switzerland, 2025, pp. 46–63. URL: https://aclanthology.org/2025.gitt-1.3/.</mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>[45] T. Zhang*, V. Kishore*, F. Wu*, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2020. URL: https://openreview.net/forum?id=SkeHuCVFDr.</mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>[46] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, in: Proc. of the 2019 Conference of the North American Chapter of the ACL: Human Language Technologies, Volume 1 (Long and Short Papers), ACL, Minneapolis, Minnesota, 2019, pp. 4171–4186. URL: https://aclanthology.org/N19-1423/.</mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>[47] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proc. of the 40th Annual Meeting of the ACL, ACL, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040/.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [36] R. Orlando, L. Moroni, P.-L. Huguet Cabot, S. Conia, E. Barba, S. Orlandini, G. Fiameni, R. Navigli,
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [48] M. Snover, B. Dorr, R. Schwartz, L. Micciulla, J. Makhoul, A study of translation edit rate with targeted human annotation, in: Proc. of the 7th Conference of the AMTA: Technical Papers, AMTA, Cambridge, Massachusetts, USA, 2006, pp. 223-231. URL: https://aclanthology.org/2006.amta-papers.25/. [57] W. Yuan, G. Neubig, P. Liu, Bartscore: evaluating generated text as text generation, in: Proc. of the 35th International Conference on NeurIPS, NIPS '21, Curran Associates Inc., Red Hook, NY, USA, 2021.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [49] G. Sarti, M. Nissim, IT5: Text-to-text pretraining for Italian language understanding and generation, in: Proc. of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), ELRA and ICCL, Torino, Italia, 2024, pp. 9422-9433. URL: https://aclanthology.org/2024.lrec-main.823/. [58] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, CoRR abs/1910.13461 (2019). URL: http://arxiv.org/abs/1910.13461. arXiv:1910.13461.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [50] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, et al., Language models are few-shot learners, in: Advances in NeurIPS, volume 33, Curran Associates, Inc., 2020, pp. 1877-1901. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf. [59] P. Virtanen, R. Gommers, T. E. Oliphant, M. Haberland, T. Reddy, D. Cournapeau, E. Burovski, P. Peterson, W. Weckesser, J. Bright, S. J. van der Walt, M. Brett, J. Wilson, K. J. Millman, N. Mayorov, A. R. J. Nelson, E. Jones, R. Kern, E. Larson, C. J. Carey, İ. Polat, Y. Feng, E. W. Moore, J. VanderPlas, D. Laxalde, J. Perktold, R. Cimrman, I. Henriksen, E. A. Quintero, C. R. Harris, A. M. Archibald, A. H. Ribeiro, F. Pedregosa, P. van Mulbregt, SciPy 1.0 Contributors, SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python, Nature Methods 17 (2020) 261-272.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [51] W. Kwon, Z. Li, S. Zhuang, Y. Sheng, L. Zheng, C. H. Yu, J. E. Gonzalez, H. Zhang, I. Stoica, Efficient memory management for large language model serving with pagedattention, 2023. arXiv:2309.06180.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [52] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with human feedback, in: Proc. of the 36th International Conference on NeurIPS, NIPS '22, Curran Associates Inc., Red Hook, NY, USA, 2022. A. Detailed results. Tables 5 and 6 report the detailed results of our fine-tuning experiments.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [53] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, 2021. arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [54] Unsloth Documentation, LoRA hyperparameters guide, 2025. URL: https://docs.unsloth.ai/get-started/fine-tuning-llms-guide/lora-hyperparameters-guide.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [55] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, T. Liu, A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions, ACM Transactions on Information Systems 43 (2025) 1-55. URL: http://dx.doi.org/10.1145/3703155.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [56] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tai, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, A. Webson, S. S. Gu, Z. Dai, M. Suzgun, X. Chen, A. Chowdhery, A. Castro-Ros, M. Pellat, K. Robinson, D. Valter, S. Narang, G. Mishra, A. Yu, V. Zhao, Y. Huang, A. Dai, H. Yu, S. Petrov, E. H. Chi, J. Dean, J. Devlin, A. Roberts, D. Zhou, Q. V. Le, J. Wei, Scaling instruction-finetuned language models, J. Mach. Learn. Res. 25 (2024).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>