<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>nikita.sushko at TextDetox CLEF 2025: Exploring A Sage-T5-Like Approach For Text Detoxification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandr Voronin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniil Moskovsky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Sushko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIRI</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Skoltech</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents our submission to the Multilingual Text Detoxification task at PAN 2025. We explore a Sage-T5-like approach that combines three training objectives: paraphrasing (seq2seq loss), token-level toxicity detection (classification loss), and semantic representation learning (contrastive loss). To address the challenge of limited annotated data across 15 languages, we adopt the synthetic data generation pipeline from SynthDetoxM and introduce a token-level annotation method using multilingual toxic lexicons. Our experiments on Russian, French, and Spanish demonstrate that combining the classification and contrastive objectives significantly boosts detoxification performance, as measured by Style Transfer Accuracy (STA), Semantic Similarity (SIM), and their combined J-score, but the gains do not hold after expansion to more languages. Our resulting model outperforms 5 out of 7 baselines in the automatic evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Text style transfer</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>encoder-decoder transformers</kwd>
        <kwd>synthetic data generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid spread of toxic content online has created a pressing demand for effective systems that
can detoxify text in multiple languages. Despite notable achievements in developing monolingual
detoxification systems, the multilingual landscape still poses a varied set of challenges. For instance,
languages exhibit distinct grammatical structures, vocabulary, and cultural references, which can make
it difficult to develop a one-size-fits-all approach. Furthermore, many languages lack sufficient labeled
data, hindering the training of accurate detoxification models. To address these challenges, this paper
explores the potential of cross-lingual transfer learning for enhancing multilingual text detoxification.
By exploiting shared linguistic patterns across languages, our approach aims to reduce the need for
large amounts of language-specific training data. This, in turn, enables us to develop more efficient and
effective detoxification systems, particularly for languages with limited resources, while also preserving
the original meaning and context of the text.</p>
      <p>
        In this article, we introduce a novel multilingual detoxification framework based on the Sage-T5
architecture. The proposed model follows the approach of Sage-T5 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and employs a multitask
learning objective, combining a seq2seq loss for paraphrase generation, a classification loss for token-level
toxicity detection, and a contrastive loss for improved semantic representation learning. Furthermore, we
reuse the methodology of SynthDetoxM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the collection and annotation of datasets. Additionally,
we created a pipeline for token toxicity markup, which is crucial for training the model with the
classification loss.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous work</title>
      <p>
        In 2024, the Multilingual Text Detoxification task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was presented as one of the tracks of the PAN Lab
at the CLEF conference. Participants were asked to create a text detoxification system with limited training data for
9 languages, which required the use of cross-lingual transfer and unsupervised methods. Top solutions
included few-shot prompting of an uncensored version of Llama 3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by the SomethingAwful team, and fine-tuning seq2seq models such as mT0 and mT5 on augmented datasets
together with application of the ORPO technique by the SmurfCat team [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The SAGE-T5 article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] approaches the problem of spelling correction by utilizing three losses:
a seq2seq loss for training the corrector model, a contrastive loss on the encoder of the encoder-decoder
model to ensure a close semantic match between the original and corrected sentences, and a token
classification loss on the encoder to improve the accuracy of typo detection.
This approach reached state-of-the-art results on the spelling correction task.
      </p>
      <p>
        In the SynthDetoxM paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors proposed a multistage approach to synthetic detoxification
data generation using pretrained decoder-only models, along with a large-scale parallel synthetic dataset for
training text detoxification models. Models trained on this dataset show better performance than
models trained on human-labeled data.
      </p>
      <p>We adapt the methodology of the Sage-T5 paper to the more complex task of text detoxification and
utilize the SynthDetoxM methodology for generating synthetic data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <p>
        The TextDetox 2025 track [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of the PAN Lab at CLEF 2025 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] covers 15 languages (9 from the previous year's
track and 6 new ones): English, Spanish, Italian, French, Chinese, Japanese, Hindi, Hinglish, Arabic, German,
Russian, Ukrainian, Amharic, Hebrew, and Tatar.
      </p>
      <p>The track consists of two stages: development and test.</p>
      <p>During the development stage, the organizers provided a training dataset comprising 600 non-parallel
examples for each of the 9 languages from the previous year's track, as well as 100 examples for each of the
6 newly introduced languages. The data was presented in a standardized format consisting of three
components: toxic text, neutral text, and language identification (lang).</p>
      <p>For the test stage, the organizers provided:
• MultiParaDetox1 — a dataset with 400 parallel samples for 9 languages;
• Multilingual Toxicity Dataset2 — contains non-parallel toxic and neutral sentences: 2.01k
samples for Hebrew, 4.36k for Hinglish, and 5k for every other language. For all languages except
Hebrew the proportion of toxic and neutral sentences is equal; the Hebrew data contains 60%
neutral sentences and 40% toxic sentences;
1https://huggingface.co/datasets/textdetox/multilingual_paradetox
2https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset</p>
      <p>• Multilingual Toxicity Lexicon3 — includes toxic words and expressions for all 15 languages.</p>
      <sec id="sec-3-1">
        <title>3.1. Data preprocessing</title>
        <p>To leverage the classification loss in our model, we introduced an additional classification head on
top of the encoder. The primary function of this classification head is to predict a toxicity label for
each token, categorizing it as either toxic or non-toxic. This allows the model to learn a more nuanced
representation of the input text, where each token is associated with a specific toxicity classification,
enabling the model to better capture the toxic language.</p>
        <p>We used the Multilingual Toxicity Dataset to mark up token toxicity. The toxicity markup was carried out
in 3 stages. First, the input data and the toxic lexicon were tokenized using the target model’s tokenizer.
Then, we created a function that aligns toxic expressions with toxic sentences: since a single toxic expression
usually consists of several tokens, we had to check that all of its tokens are present in the toxic sentence.
Finally, we applied this function to all available data in all languages and obtained the token markup.</p>
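        <p>As an illustration, the alignment step can be sketched as follows. This is a simplified sketch assuming plain token lists rather than the target model's subword tokenizer, with expressions matched contiguously; the function name and interface are our own, not taken from the paper's codebase.</p>

```python
def mark_toxic_tokens(sentence_tokens, lexicon_token_seqs):
    """Label each sentence token 1 (toxic) if it lies inside a full match
    of some tokenized lexicon expression, else 0 (neutral)."""
    labels = [0] * len(sentence_tokens)
    for expr in lexicon_token_seqs:
        n = len(expr)
        if n == 0:
            continue
        # Slide a window over the sentence; every token of the expression
        # must be present for the span to be marked toxic.
        for i in range(len(sentence_tokens) - n + 1):
            if sentence_tokens[i:i + n] == expr:
                for j in range(i, i + n):
                    labels[j] = 1
    return labels
```

        <p>Applied to every sentence in every language, this yields the 0/1 token markup consumed by the classification head.</p>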
        <sec id="sec-3-1-1">
          <title>3.1.1. Synthetic Data Collection</title>
          <p>
            Using the toxicity identification dataset provided by the organizers 4, we collect a synthetic
parallel detoxification dataset. In our collection pipeline, we follow the approach introduced in
SynthDetoxM [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>In the context of this task, we utilize newer models, and not only open-source ones. Namely, we use
Gemini 2.5 Flash5, Qwen 3 235B6 in non-reasoning mode, Llama 4 Maverick 400B7, Mistral Saba8 and
DeepSeek Chat v3 03249.
To source the non-parallel toxic sentences, we used textdetox/multilingual_toxicity_dataset,
provided by the competition organizers. The resulting dataset consisted of 33528 pairs of sentences in 15
languages. For Hebrew, Ukrainian and Tatar the detoxification quality turned out to be low, so only a
few examples in these languages were used in the final data mix. The distribution of the number of
selected sentences per model is shown in Table 1.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Metrics</title>
        <p>The evaluation metrics for the TextDetox 2025 track remained consistent with those used in the
TextDetox 2024 track. Throughout our development process, three primary metrics were employed:
Style Transfer Accuracy (STA), Semantic Similarity (SIM), and Fluency (FL). To obtain the final score,
we combined these metrics by calculating their product, resulting in a unified J-score.
3https://huggingface.co/datasets/textdetox/multilingual_toxic_lexicon
4https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset
5https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash
6https://qwenlm.github.io/blog/qwen3/
7https://www.cerebras.ai/press-release/maverick
8https://mistral.ai/news/mistral-saba
9https://huggingface.co/deepseek-ai/DeepSeek-V3-0324</p>
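        <p>As a minimal sketch (our own helper, not the official evaluation code), the J-score aggregation over a set of samples looks like this:</p>

```python
def j_score(sta, sim, fl):
    """Per-sample product of STA, SIM and FL, averaged over the dataset."""
    assert len(sta) == len(sim) == len(fl)
    return sum(s * m * f for s, m, f in zip(sta, sim, fl)) / len(sta)
```

        <p>Because J is a product, a model must score well on all three metrics at once; collapsing any single one drives the combined score toward zero.</p>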
        <p>
          The STA metric, which evaluates the quality of style transfer, was computed using the
textdetox/xlmr-large-toxicity-classifier-v2 model 10. SIM measures the similarity between the toxic and detoxified versions of the
same sentence as the cosine similarity between their LaBSE [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] embeddings, computed with the
sentence-transformers/LaBSE11 model. FL measures the similarity between detoxified sentences and human-written
detoxified versions and is calculated with the myyycroft/XCOMET-lite [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] 12 model.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Model selection</title>
        <p>The primary objective was to fine-tune a Sage-T5-like model using a combination of three loss functions.</p>
        <p>During training, two types of textual data were utilized: toxic texts with toxic span classification
labels and pairs of toxic texts and their corresponding detoxified versions for paraphrase learning. The
training process incorporates three loss functions. The training framework is presented in Figure 1.</p>
        <p>
          For the preliminary experiments, the bigscience/mt0-large [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]13 model was used. For the final training,
s-nlp/mt0-xl-detox-orpo [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]14 was selected. This model was the winning model of the PAN 2024
detoxification task.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Losses</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Seq2seq loss</title>
          <p>Besides the regular seq2seq loss, a classification loss and a contrastive loss are present during model training.
To ensure that the model generates fluent and coherent detoxified sentences, we use a standard
sequence-to-sequence (seq2seq) loss. It encourages the model to produce target tokens that match the reference
detoxified output at each position, while ignoring padding.</p>
          <p>The sequence-to-sequence cross-entropy loss with a padding mask is defined as
\[ \mathcal{L}_{s2s} = -\frac{1}{\sum_{t=1}^{T} m_t} \sum_{t=1}^{T} m_t \cdot \log p_t(y_t), \]
where \(T\) is the length of the target sequence, \(V\) is the size of the vocabulary, \(m_t \in \{0, 1\}\) is the mask,
\(y_t \in \{1, \ldots, V\}\) is the ground-truth token ID at timestep \(t\), and \(p_t \in \mathbb{R}^{V}\) is the predicted probability
distribution over the vocabulary of the trained model at timestep \(t\).</p>
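          <p>For illustration, a direct pure-Python transcription of this masked loss (a sketch only; the actual training code would use a framework implementation such as PyTorch's cross-entropy with padding positions ignored):</p>

```python
import math

def seq2seq_loss(probs, targets, mask):
    """Masked cross-entropy: average of -log p_t(y_t) over
    non-padding positions (mask m_t = 1)."""
    total = sum(m * -math.log(p[y]) for p, y, m in zip(probs, targets, mask))
    return total / sum(mask)
```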
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Classification loss</title>
          <p>A classification head is added to the encoder part of the decoder model and then trained with a simple
cross-entropy loss. This classification head is trained simultaneously with the whole model, ensuring
that the encoder embeddings of toxic sentences contain information necessary to distinguish toxic and
neutral tokens.</p>
          <p>The binary cross-entropy loss with a padding mask is defined as
\[ \mathcal{L}_{cls} = -\frac{1}{\sum_{i=1}^{N} m_i} \sum_{i=1}^{N} m_i \cdot \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right], \]
where \(N\) is the total number of tokens in the batch, \(m_i \in \{0, 1\}\) is the mask, \(y_i \in \{0, 1\}\) is the
ground-truth label for the \(i\)-th token, and \(p_i \in (0, 1)\) is the predicted probability of class 1 (toxic) for the
\(i\)-th token.</p>
          <p>10https://huggingface.co/textdetox/xlmr-large-toxicity-classifier-v2
11https://huggingface.co/sentence-transformers/LaBSE
12https://huggingface.co/myyycroft/XCOMET-lite
13https://huggingface.co/bigscience/mt0-large
14https://huggingface.co/s-nlp/mt0-xl-detox-orpo</p>
          <p>This requires additional training data, specifically token classes (0 for neutral and 1 for toxic) for each
sentence. The underlying idea is that the information learned by the encoder will assist the decoder in
detoxifying the text, thus improving the overall performance of the model. It is important to note that
the classification head is required only during the training process, after which this head is removed.</p>
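          <p>A pure-Python sketch of this masked token-level binary cross-entropy (illustrative only; in training it would operate on the encoder classification head's predictions):</p>

```python
import math

def classification_loss(probs, labels, mask):
    """Masked binary cross-entropy over per-token toxicity predictions."""
    total = sum(
        m * -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y, m in zip(probs, labels, mask)
    )
    return total / sum(mask)
```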
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Contrastive loss</title>
          <p>
            For our contrastive loss function, we decided to use the InfoNCE function [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This function is
particularly well-suited for training scenarios built around paired examples, where each pair consists of an
anchor example and a positive example, accompanied by a large number of in-batch negative examples.
In this context, the anchor and positive examples refer to two instances that share the same semantic
meaning (a toxic sentence and its detoxified version), whereas the negative examples represent instances
with different meanings (the rest of the in-batch toxic and neutral sentences).
          </p>
          <p>Given a set \(X = \{x_1, \ldots, x_N\}\) of \(N\) random samples containing one positive sample from \(p(x_{t+k} \mid c_t)\)
and \(N - 1\) negative samples from the ’proposal’ distribution \(p(x_{t+k})\), we optimize:</p>
          <p>\[ \mathcal{L}_{N} = -\mathbb{E}_{X} \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right] \quad (1) \]</p>
          <p>The primary objective of the InfoNCE function is to minimize the distance between the embeddings
of the anchor and positive examples, while simultaneously maximizing the distance between the anchor
and all negative examples in the batch. By doing so, the model learns to produce embeddings that are
closer together for semantically similar instances (i.e., the anchor and positive examples) and farther
apart for semantically dissimilar instances (i.e., the anchor and negative examples).</p>
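          <p>A sketch of InfoNCE for a single anchor, using cosine similarity as the scoring function and a temperature of our own choosing (both are illustrative assumptions; the paper does not specify these details):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log softmax score of the positive against positive + in-batch negatives."""
    scores = [cosine(anchor, positive) / temperature]
    scores += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp for the denominator.
    mx = max(scores)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return log_z - scores[0]
```

          <p>Pulling the anchor toward its positive lowers the loss, while a negative that scores as high as the positive raises it to log 2 and beyond.</p>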
        </sec>
        <sec id="sec-4-2-4">
          <title>4.2.4. Final loss formulation</title>
          <p>The final loss is calculated as the direct sum of the three losses:
\[ \mathcal{L} = \mathcal{L}_{s2s} + \mathcal{L}_{cls} + \mathcal{L}_{contr}. \]</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Hyperparameters</title>
        <p>
          For the experimental training run of the bigscience/mt0-large model 15, the effective batch size was set to
128, with the Adafactor [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] optimizer and a learning rate of 3e-4. For the final runs on the s-nlp/mt0-xl-detox-orpo [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
model 16, the effective batch size and the optimizer remained the same, while the learning rate was changed to
5e-5. All models were trained for 7 epochs.
        </p>
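        <p>A hedged sketch of how these hyperparameters might map onto the Hugging Face Trainer API; the output path and the split of the effective batch into per-device batch size and gradient accumulation are our own illustrative assumptions, not the authors' exact configuration.</p>

```python
# Sketch only: reported hyperparameters expressed as Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mt0-large-detox",    # placeholder path
    per_device_train_batch_size=16,  # 16 x 8 accumulation = effective batch 128
    gradient_accumulation_steps=8,
    learning_rate=3e-4,              # 5e-5 for the mt0-xl-detox-orpo runs
    optim="adafactor",
    num_train_epochs=7,
)
```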
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>First, we validated the idea with a smaller model on a subset of the synthetic SynthDetoxM dataset, then
trained the same model on a custom data mix, and finally trained a bigger model for the final prediction.</p>
      <sec id="sec-5-1">
        <title>5.1. Preliminary experiments</title>
        <p>To confirm the hypothesis that a Sage-T5-like approach would increase detoxification scores, we
randomly sampled 600 examples per language from the SynthDetoxM dataset as test data in Russian, French
and Spanish, and trained three detoxification models on the remaining 3400 training examples
per language. One model was the baseline, trained on the sequence-to-sequence paraphrasing task. The
second model was trained for sequence-to-sequence paraphrasing together with a classification
head. The third model was trained using both a classification head and a contrastive loss for the encoder.
Following the SynthDetoxM methodology, the models were evaluated with the STA and SIM scores.
15https://huggingface.co/bigscience/mt0-large
16https://huggingface.co/s-nlp/mt0-xl-detox-orpo</p>
        <p>The results are presented in Table 2. The addition of the classification and contrastive losses
increases the STA scores but slightly decreases the SIM score. This means that the models learn to
paraphrase better, making the detoxified outputs less toxic, but these outputs differ more from the original
texts. This confirmed the validity of the approach on a clean, high-quality synthetic dataset.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Expanding the evaluation to more languages</title>
        <p>The second stage of our experiments consisted of two parts: creating a data mix for training a massively
multilingual model, and training a set of smaller models to evaluate our approach on the test set of the
competition. The data mix comprised the public part of the MultiParaDetox dataset (400 examples per
language), the SynthDetoxM dataset (4000 examples per language), and our
synthetic SynthDetoxM-like generated dataset.</p>
        <p>To further validate the approach, we again trained three models: a baseline paraphrasing model,
a model with a classification loss added to the paraphrasing loss, and a model with all three losses. As the
base model, bigscience/mt0-large was selected. The results were evaluated on the test data using the
CodaLab leaderboard.</p>
        <p>In contrast to the results of the preliminary experiments on a small set of languages, adding the
additional losses when training on all 15 languages did not improve detoxification quality. On average,
adding the classification loss helped slightly with the new languages (Italian, French, Hebrew, Hinglish,
Tatar, Japanese) and slightly decreased the scores for the old languages. Adding all three losses decreased all
scores. Detailed scores are shown in Table 3.</p>
        <p>Looking at the per-language scores, we can see that the largest increase in detoxification
quality from adding a classification head is in Amharic, Hebrew, Hinglish, Japanese, Italian and Russian.
All of these languages except Russian are low-resource languages that did not dominate
the pretraining dataset of the model, so we can say that adding a classification head works best for low-resource
detoxification training. Per-language scores are shown in Table 4.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Final model training</title>
        <p>Since our expanded evaluation showed that, on average, training the model only on paraphrases yields
the best quality, this approach was selected for the final model training. For this training pass, the
s-nlp/mt0-xl-detox-orpo model was selected as the base model and
then finetuned on our dataset mix, consisting of MultiParaDetox, SynthDetoxM and our synthetic
dataset. The final results and comparisons to the baselines are shown in Table 5.</p>
        <p>Our model placed 14th in the final ranking, outperforming all simple baselines, gpt4o and o3-mini,
while losing to the gpt4 and mt0 baselines. Detoxification examples can be seen in Appendix A.</p>
        <p>Additionally, the organizers provided an LLM-as-a-judge final evaluation, in which a Llama-3-8B-Instruct model
was finetuned on the manual annotations from the previous year's competition for toxicity
pairwise comparison and similarity tasks. The fluency metric was still calculated by the xcomet-lite model.
The results of this evaluation can be seen in Appendix B.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and discussion</title>
      <p>Our approach demonstrated strong performance on three selected languages. However, it did not
generalize well to a broader set of languages. We hypothesize that this limitation stems from the
low quality of toxic token annotation in some languages, likely due to incomplete or inconsistent
toxic lexicons. Future work should investigate more robust methods for toxic span detection, such as
leveraging large language models to improve annotation quality.</p>
      <p>We attribute the gap between our final model and GPT-4 primarily to the significantly smaller size of
our backbone model, mT0-XL, which contains only 3 billion parameters. However, another important
observation emerged during our experiments: directly fine-tuning the original mT0 model on our
custom multilingual detoxification data mixture resulted in a noticeable drop in performance compared
to the initial zero-shot baseline.</p>
      <p>
        This degradation may be explained by the shift in training methodology. The original model was
trained using ORPO [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] optimization. This optimization can create a fragile equilibrium in the model’s
parameter space, where the learned behaviors depend heavily on maintaining the alignment enforced
during ORPO.
      </p>
      <p>When we applied regular supervised fine-tuning (SFT) on raw, unfiltered training data, it is likely
that this alignment was disrupted. SFT tends to push the model back toward the mode of the new data
distribution, which may conflict with the preference-aligned behavior established by ORPO. As a result,
the model may regress or exhibit erratic outputs, especially in nuanced tasks like detoxification, where
subtle distributional shifts can lead to pronounced degradation in quality. This highlights the need
for more careful integration of aligned models and raw training data, particularly when extending or
adapting preference-optimized backbones to new domains.</p>
      <p>Furthermore, we did not apply any data cleaning procedures, and the suboptimal quality of the
MultiParaDetox dataset may have further impacted the model's effectiveness.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we propose a novel approach to text detoxification using two auxiliary losses. If high-quality
markup is used for training the encoder classification head, our approach significantly outperforms
plain seq2seq training. However, with weak markup, seq2seq training still works better than our approach.</p>
      <p>Our final submission outperformed all simple baselines, o3-mini and gpt4o on the private test set, coming
close to the detoxification quality of the much larger gpt4 model.</p>
      <p>Our data preprocessing and model training scripts can be found on GitHub 17. Our trained models
can be found on HuggingFace:
• Model with detox and classification losses 18;
• Model with detox and contrastive losses 19;
• Model with all losses 20.</p>
      <p>Our collected dataset is also available at our HuggingFace repository21.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and Future Work</title>
      <p>Our exploration of a Sage-T5-like approach for multilingual text detoxification, while yielding valuable
insights, encountered several limitations. The primary challenge was the inconsistent generalizability
of the multi-task learning benefits (seq2seq, classification, and contrastive losses) when scaling from
a few well-performing languages to the full set of 15. This suggests that the uniform application of
these auxiliary losses might not be optimal across diverse linguistic structures and data availabilities.
Furthermore, the performance of our token-level classification, and consequently of the entire multi-task
model, was heavily reliant on the quality of toxic token annotations derived from multilingual lexicons.
The incompleteness of, and inconsistencies within, these lexicons likely introduced noise, particularly for
less-resourced languages. Future work could address these issues by improving annotation quality via
more sophisticated toxic span detection methods, such as leveraging large language models for few-shot
annotation.
17https://github.com/chameleon-lizard/Sage-Detox
18https://huggingface.co/alexandro767/SageDetox_detox_classification
19https://huggingface.co/alexandro767/SageDetox_detox_contrastive
20https://huggingface.co/alexandro767/SageDetox_detox_classification_contrastive
21https://huggingface.co/datasets/alexandro767/CLEF_2025_dataset_full</p>
      <p>Another set of limitations pertains to the model architecture and data handling. The choice of
bigscience/mt0-xl as the backbone, while competitive, is significantly smaller than some leading
proprietary models, inherently constraining its capacity. Moreover, fine-tuning the
s-nlp/mt0-xl-detox-orpo model (originally trained with ORPO) using our direct supervised approach led to performance
degradation, indicating a potential mismatch in training paradigms or catastrophic forgetting. The
quality of the aggregated training data, which did not undergo extensive cleaning, might also have
impacted performance. Future research could benefit from experimenting with larger, more capable
open-source multilingual models and implementing data filtering and cleaning protocols.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini and Grammarly. Gemini was used for the
translation of the toxic and non-toxic claims in the paper, and both Gemini and Grammarly were used
for checking grammar and spelling. After using these tools/services, the author(s) reviewed and edited
the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Generation examples</title>
      <p>Original text → paraphrased version (language code):
• uk: Просто забирайся нахуй з моєї голови. → я просто забирайся з моєї голови.
• tt: заебали инде шул ватс апыгыз белэн)) → алар бик начар кешеләр.
• hin: Maro. Maderchodo ko chodo mat.. Ek ek ko gus ke maro → yeh log apni baat ko shuru kar rahe hain.
• fr: Je n ai jamais rencontré d individu aussi bas que ce rat d égout ! → je n ai jamais rencontré d individu aussi bas que ce rat !
• es: Si le votan a este energúmeno de verdad no aman a su ciudad. → si le votan a este político de verdad no aman a su ciudad.
• ru: А нахуй ты тут персональный чатик устроил, дегенерат? → а почему ты тут персональный чат устроил?
• en: Maybe so, but at least American’s are smarter than dumbass Brits. → maybe so, but at least Americans are smarter than those of brits.
• de: Stoppt denn keiner diese Dummbacke! → stoppt denn keiner diese menschen!</p>
    </sec>
    <sec id="sec-11">
      <title>B. LLM-as-a-Judge evaluation results</title>
      <p>LLM-as-a-Judge evaluation on languages with parallel markup; our submission is highlighted in bold, the rest are baselines.</p>
      <p>After the end of the competition, the organizers performed another evaluation round, using a finetuned
Llama-3-8B-Instruct model as a judge. This evaluation produced a shakeup in the rankings. The results
can be seen in Table 7 and Table 8.</p>
      <p>Our model outperformed the gpt4 baseline in this evaluation round on the languages with available parallel
data, while not surpassing the same baseline on the languages without parallel markup. This can be
attributed to the low-resource nature of said languages: the models underperformed in them due
to tokenization quality, the low amount of pretraining data, and these languages being generally out of
distribution for the models which were used as a base for our detoxifiers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] N. Martynov, M. Baushenko, A. Kozlova, K. Kolomeytseva, A. Abramov, A. Fenogenova, A methodology for generative spelling correction via natural spelling errors emulation across multiple domains and languages, in: Y. Graham, M. Purver (Eds.), Findings of the Association for Computational Linguistics: EACL 2024, St. Julian's, Malta, March 17-22, 2024, Association for Computational Linguistics, 2024, pp. 138-155. URL: https://aclanthology.org/2024.findings-eacl.10.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] D. Moskovskiy, N. Sushko, S. Pletenev, E. Tutubalina, A. Panchenko, SynthDetoxM: Modern LLMs are few-shot parallel detoxification data annotators, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, Association for Computational Linguistics, 2025, pp. 5714-5733. URL: https://aclanthology.org/2025.naacl-long.294/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] D. Dementieva, D. Moskovskiy, N. Babakov, A. A. Ayele, N. Rizwan, F. Schneider, X. Wang, S. M. Yimam, D. Ustalov, E. Stakovskii, A. Smirnova, A. Elnagar, A. Mukherjee, A. Panchenko, Overview of the multilingual text detoxification task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] S. Pletenev, SomethingAwful at PAN 2024 TextDetox: Uncensored Llama 3 helps to censor better, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 2843-2851. URL: https://ceur-ws.org/Vol-3740/paper-273.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] E. Rykov, K. Zaytsev, I. Anisimov, A. Voronin, SmurfCat at PAN 2024 TextDetox: Alignment of multilingual transformers for text detoxification, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 2866-2871. URL: https://ceur-ws.org/Vol-3740/paper-276.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] D. Dementieva, V. Protasov, N. Babakov, N. Rizwan, I. Alimova, C. Brune, V. Konovalov, A. Muti, C. Liebeskind, M. Litvak, D. Nozza, S. Shah Khan, S. Takeshita, N. Vanetik, A. A. Ayele, F. Schneider, X. Wang, S. M. Yimam, A. Elnagar, A. Mukherjee, A. Panchenko, Overview of the multilingual text detoxification task at PAN 2025, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] J. Bevendorff, D. Dementieva, M. Fröbe, B. Gipp, A. Greiner-Petter, J. Karlgren, M. Mayerl, P. Nakov, A. Panchenko, M. Potthast, A. Shelmanov, E. Stamatatos, B. Stein, Y. Wang, M. Wiegmann, E. Zangerle, Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, CoRR abs/2007.01852 (2020). URL: https://arxiv.org/abs/2007.01852. arXiv:2007.01852.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] D. Larionov, M. Seleznyov, V. Viskov, A. Panchenko, S. Eger, xCOMET-lite: Bridging the gap between efficiency and quality in learned MT evaluation metrics, in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Association for Computational Linguistics, 2024, pp. 21934-21949. URL: https://aclanthology.org/2024.emnlp-main.1223.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 15991-16111. URL: https://doi.org/10.18653/v1/2023.acl-long.891. doi:10.18653/v1/2023.acl-long.891.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, CoRR abs/1807.03748 (2018). URL: http://arxiv.org/abs/1807.03748. arXiv:1807.03748.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] N. Shazeer, M. Stern, Adafactor: Adaptive learning rates with sublinear memory cost, in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 4603-4611. URL: http://proceedings.mlr.press/v80/shazeer18a.html.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] J. Hong, N. Lee, J. Thorne, ORPO: Monolithic preference optimization without reference model, in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Association for Computational Linguistics, 2024, pp. 11170-11189. URL: https://aclanthology.org/2024.emnlp-main.626.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>