<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>nikita.sushko at TextDetox CLEF 2025: Exploring A Sage-T5-Like Approach For Text Detoxification</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Alexandr Voronin</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Daniil Moskovsky</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nikita Sushko</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>AIRI</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Skoltech</institution>
          ,
          <country country="RU">Russia</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
<p>This paper presents our submission to the Multilingual Text Detoxification task at PAN 2025. We explore a Sage-T5-like approach that combines three training objectives: paraphrasing (seq2seq loss), token-level toxicity detection (classification loss), and semantic representation learning (contrastive loss). To address the challenge of limited annotated data across 15 languages, we adopt the synthetic data generation pipeline from SynthDetoxM and introduce a token-level annotation method using multilingual toxic lexicons. Our experiments on Russian, French, and Spanish demonstrate that combining the classification and contrastive objectives significantly boosts detoxification performance, as measured by Style Transfer Accuracy (STA), Semantic Similarity (SIM), and their combined J-score, but the gains do not hold after expansion to more languages. Our resulting model outperforms 5 out of 7 baselines in the automatic evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Text style transfer</kwd>
        <kwd>contrastive learning</kwd>
        <kwd>encoder-decoder transformers</kwd>
        <kwd>synthetic data generation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>The rapid spread of toxic content online has created a pressing demand for effective systems that
can detoxify text in multiple languages. Despite notable achievements in developing monolingual
detoxification systems, the multilingual landscape still poses a varied set of challenges. For instance,
languages exhibit distinct grammatical structures, vocabulary, and cultural references, which can make
it difficult to develop a one-size-fits-all approach. Furthermore, many languages lack sufficient labeled
data, hindering the training of accurate detoxification models. To address these challenges, this paper
explores the potential of cross-lingual transfer learning for enhancing multilingual text detoxification.
By exploiting shared linguistic patterns across languages, our approach aims to reduce the need for
large amounts of language-specific training data. This, in turn, enables us to develop more efficient and
effective detoxification systems, particularly for languages with limited resources, while also preserving
the original meaning and context of the text.</p>
      <p>
        In this article, we introduce a novel multilingual detoxification framework based on the Sage-T5
architecture. The proposed model follows the approach of Sage-T5 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] and employs a multitask
learning objective, combining a seq2seq loss for paraphrase generation, a classification loss for token-level
toxicity detection, and a contrastive loss for improved semantic representation learning. Furthermore, we
reuse the methodology of SynthDetoxM [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the collection and annotation of datasets. Additionally,
we created a pipeline for token toxicity markup, which is crucial for training the model with the
classification loss.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Previous work</title>
      <p>
        In 2024, the Multilingual Text Detoxification task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] was presented as one of the tracks of the PAN Lab
at the CLEF conference. Participants were asked to create a text detoxification system with limited training data for
9 languages, which required the use of cross-lingual transfer and unsupervised methods. Top solutions
included few-shot prompting of an uncensored version of Llama 3 [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] by the SomethingAwful team, and fine-tuning seq2seq models such as mT0 and mT5 on augmented datasets
together with application of the ORPO technique by the SmurfCat team [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The SAGE-T5 article [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] approaches the problem of spelling correction by utilizing three losses:
a seq2seq loss for training the corrector model, a contrastive loss on the encoder of the encoder-decoder
model to ensure a close semantic match between the original and corrected sentences, and a token
classification loss on the encoder to improve the accuracy of typo detection.
This approach reached state-of-the-art results on the spelling correction task.
      </p>
      <p>
        In the SynthDetoxM paper [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], the authors proposed a multistage approach to synthetic detoxification
data generation using pretrained decoder-only models, along with a large-scale parallel synthetic dataset for
training text detoxification models. Models trained on this dataset show better performance than
models trained on human-labeled data.
      </p>
      <p>We adapt the methodology of the Sage-T5 paper to the more complex task of text detoxification and
utilize the SynthDetoxM methodology for generating synthetic data.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Data</title>
      <p>
        The TextDetox 2025 track [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] of the PAN Lab at CLEF 2025 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] covers 15 languages (9 from the previous year's
track and 6 new ones): English, Spanish, Italian, French, Chinese, Japanese, Hindi, Hinglish, Arabic, German,
Russian, Ukrainian, Amharic, Hebrew, and Tatar.
      </p>
      <p>The track consists of two stages: development and test.</p>
      <p>During the development stage, the organizers provided a training dataset comprising 600 non-parallel
examples for each of the 9 languages from the previous year's track, as well as 100 examples for each of the
6 newly introduced languages. The data was presented in a standardized format consisting of three
components: toxic text, neutral text, and language identification (lang).</p>
      <p>For the test stage, the organizers provided:
• MultiParaDetox1 — a dataset with 400 parallel samples for 9 languages;
• Multilingual Toxicity Dataset2 — contains non-parallel toxic and neutral sentences: 2.01k
samples for Hebrew, 4.36k for Hinglish, and 5k for every other language. For all languages except
Hebrew the proportion of toxic and neutral sentences is equal; the Hebrew data contains 60%
neutral sentences and 40% toxic sentences;
1https://huggingface.co/datasets/textdetox/multilingual_paradetox
2https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset</p>
      <p>• Multilingual Toxicity Lexicon3 — includes toxic words and expressions for all 15 languages.</p>
      <sec id="sec-3-1">
        <title>3.1. Data preprocessing</title>
        <p>To leverage the classification loss in our model, we introduced an additional classification head on
top of the encoder. The primary function of this classification head is to predict a toxicity label for
each token, categorizing it as either toxic or non-toxic. This allows the model to learn a more nuanced
representation of the input text, where each token is associated with a specific toxicity classification,
enabling the model to better capture the toxic language.</p>
        <p>We used the Multilingual Toxicity Dataset to mark up token toxicity. The toxicity markup was carried out
in 3 stages. First, the input data and the toxic lexicon were tokenized using the target model’s tokenizer.
Then, we created a function that aligns toxic expressions with toxic sentences: since a single toxic expression
usually consists of several tokens, we had to check that all of its tokens are present in the toxic sentence.
Finally, we applied this function to all available data in all languages and obtained the token markup.</p>
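        <p>As an illustration, the alignment step can be sketched as follows. This is a simplified sketch assuming plain token lists rather than the target model's subword tokenizer, with expressions matched contiguously; the function name and interface are our own, not taken from the paper's codebase.</p>

```python
def mark_toxic_tokens(sentence_tokens, lexicon_token_seqs):
    """Label each sentence token 1 (toxic) if it lies inside a full match
    of some tokenized lexicon expression, else 0 (neutral)."""
    labels = [0] * len(sentence_tokens)
    for expr in lexicon_token_seqs:
        n = len(expr)
        if n == 0:
            continue
        # Slide a window over the sentence; every token of the expression
        # must be present for the span to be marked toxic.
        for i in range(len(sentence_tokens) - n + 1):
            if sentence_tokens[i:i + n] == expr:
                for j in range(i, i + n):
                    labels[j] = 1
    return labels
```

        <p>Applied to every sentence in every language, this yields the 0/1 token markup consumed by the classification head.</p>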
        <sec id="sec-3-1-1">
          <title>3.1.1. Synthetic Data Collection</title>
          <p>
            Using the toxicity identification dataset provided by the organizers 4, we collect a synthetic
parallel detoxification dataset. In our collection pipeline, we follow the approach introduced in
SynthDetoxM [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ].
          </p>
          <p>In the context of this task, we utilize newer models, and not only open-source ones. Namely, we use
Gemini 2.5 Flash5, Qwen 3 235B6 in non-reasoning mode, Llama 4 Maverick 400B7, Mistral Saba8 and
DeepSeek Chat v3 03249.
To source the non-parallel toxic sentences, we used textdetox/multilingual_toxicity_dataset,
provided by the competition organizers. The resulting dataset consisted of 33528 pairs of sentences in 15
languages. For Hebrew, Ukrainian and Tatar the detoxification quality turned out to be low, so only a
few examples in these languages were used in the final data mix. The distribution of the number of
selected sentences per model is shown in Table 1.</p>
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Metrics</title>
        <p>The evaluation metrics for the TextDetox 2025 track remained consistent with those used in the
TextDetox 2024 track. Throughout our development process, three primary metrics were employed:
Style Transfer Accuracy (STA), Semantic Similarity (SIM), and Fluency (FL). To obtain the final score,
we combined these metrics by calculating their product, resulting in a unified J-score.
3https://huggingface.co/datasets/textdetox/multilingual_toxic_lexicon
4https://huggingface.co/datasets/textdetox/multilingual_toxicity_dataset
5https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-5-flash
6https://qwenlm.github.io/blog/qwen3/
7https://www.cerebras.ai/press-release/maverick
8https://mistral.ai/news/mistral-saba
9https://huggingface.co/deepseek-ai/DeepSeek-V3-0324</p>
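        <p>As a minimal sketch (our own helper, not the official evaluation code), the J-score aggregation over a set of samples looks like this:</p>

```python
def j_score(sta, sim, fl):
    """Per-sample product of STA, SIM and FL, averaged over the dataset."""
    assert len(sta) == len(sim) == len(fl)
    return sum(s * m * f for s, m, f in zip(sta, sim, fl)) / len(sta)
```

        <p>Because J is a product, a model must score well on all three metrics at once; collapsing any single one drives the combined score toward zero.</p>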
        <p>
          The STA metric, which evaluates the quality of style transfer, was computed using the
textdetox/xlmr-large-toxicity-classifier-v2 model 10. SIM measures the similarity between the toxic and detoxified versions of the
same sentence as the cosine similarity between their LaBSE [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] embeddings, computed with the
sentence-transformers/LaBSE11 model. FL measures the similarity between detoxified sentences and human-written
detoxified versions and is calculated with the myyycroft/XCOMET-lite [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] 12 model.
        </p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <sec id="sec-4-1">
        <title>4.1. Model selection</title>
        <p>The primary objective was to fine-tune a Sage-T5-like model using a combination of three loss functions.</p>
        <p>During training, two types of textual data were utilized: toxic texts with toxic span classification
labels and pairs of toxic texts and their corresponding detoxified versions for paraphrase learning. The
training process incorporates three loss functions. The training framework is presented in Figure 1.</p>
        <p>
          For the preliminary experiments, the bigscience/mt0-large [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]13 model was used. For the final training,
s-nlp/mt0-xl-detox-orpo [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]14 was selected. This model was the winning model of the PAN 2024
detoxification task.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Losses</title>
        <sec id="sec-4-2-1">
          <title>4.2.1. Seq2seq loss</title>
          <p>Besides the regular seq2seq loss, a classification loss and a contrastive loss are present during model training.
To ensure that the model generates fluent and coherent detoxified sentences, we use a standard
sequence-to-sequence (seq2seq) loss. It encourages the model to produce target tokens that match the reference
detoxified output at each position, while ignoring padding.</p>
          <p>The sequence-to-sequence cross-entropy loss with a padding mask is defined as
\[ \mathcal{L}_{s2s} = -\frac{1}{\sum_{t=1}^{T} m_t} \sum_{t=1}^{T} m_t \cdot \log p_t(y_t), \]
where \(T\) is the length of the target sequence, \(V\) is the size of the vocabulary, \(m_t \in \{0, 1\}\) is the mask,
\(y_t \in \{1, \ldots, V\}\) is the ground-truth token ID at timestep \(t\), and \(p_t \in \mathbb{R}^{V}\) is the predicted probability
distribution over the vocabulary of the trained model at timestep \(t\).</p>
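          <p>For illustration, a direct pure-Python transcription of this masked loss (a sketch only; the actual training code would use a framework implementation such as PyTorch's cross-entropy with padding positions ignored):</p>

```python
import math

def seq2seq_loss(probs, targets, mask):
    """Masked cross-entropy: average of -log p_t(y_t) over
    non-padding positions (mask m_t = 1)."""
    total = sum(m * -math.log(p[y]) for p, y, m in zip(probs, targets, mask))
    return total / sum(mask)
```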
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. Classification loss</title>
          <p>A classification head is added to the encoder part of the decoder model and then trained with a simple
cross-entropy loss. This classification head is trained simultaneously with the whole model, ensuring
that the encoder embeddings of toxic sentences contain information necessary to distinguish toxic and
neutral tokens.</p>
          <p>The binary cross-entropy loss with a padding mask is defined as
\[ \mathcal{L}_{cls} = -\frac{1}{\sum_{i=1}^{N} m_i} \sum_{i=1}^{N} m_i \cdot \left[ y_i \cdot \log(p_i) + (1 - y_i) \cdot \log(1 - p_i) \right], \]
where \(N\) is the total number of tokens in the batch, \(m_i \in \{0, 1\}\) is the mask, \(y_i \in \{0, 1\}\) is the
ground-truth label for the \(i\)-th token, and \(p_i \in (0, 1)\) is the predicted probability of class 1 (toxic) for the
\(i\)-th token.</p>
          <p>10https://huggingface.co/textdetox/xlmr-large-toxicity-classifier-v2
11https://huggingface.co/sentence-transformers/LaBSE
12https://huggingface.co/myyycroft/XCOMET-lite
13https://huggingface.co/bigscience/mt0-large
14https://huggingface.co/s-nlp/mt0-xl-detox-orpo</p>
          <p>This requires additional training data, specifically token classes (0 for neutral and 1 for toxic) for each
sentence. The underlying idea is that the information learned by the encoder will assist the decoder in
detoxifying the text, thus improving the overall performance of the model. It is important to note that
the classification head is required only during the training process, after which this head is removed.</p>
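          <p>A pure-Python sketch of this masked token-level binary cross-entropy (illustrative only; in training it would operate on the encoder classification head's predictions):</p>

```python
import math

def classification_loss(probs, labels, mask):
    """Masked binary cross-entropy over per-token toxicity predictions."""
    total = sum(
        m * -(y * math.log(p) + (1 - y) * math.log(1 - p))
        for p, y, m in zip(probs, labels, mask)
    )
    return total / sum(mask)
```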
        </sec>
        <sec id="sec-4-2-3">
          <title>4.2.3. Contrastive loss</title>
          <p>
            For our contrastive loss function, we decided to use the InfoNCE function [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. This function is
particularly well-suited for training scenarios built around paired examples, where each pair consists of an
anchor example and a positive example, accompanied by a large number of in-batch negative examples.
In this context, the anchor and positive examples refer to two instances that share the same semantic
meaning (a toxic sentence and its detoxified version), whereas the negative examples represent instances
with different meanings (the rest of the in-batch toxic and neutral sentences).
          </p>
          <p>Given a set \(X = \{x_1, \ldots, x_N\}\) of \(N\) random samples containing one positive sample from \(p(x_{t+k} \mid c_t)\)
and \(N - 1\) negative samples from the ’proposal’ distribution \(p(x_{t+k})\), we optimize:</p>
          <p>\[ \mathcal{L}_{N} = -\mathbb{E}_{X} \left[ \log \frac{f_k(x_{t+k}, c_t)}{\sum_{x_j \in X} f_k(x_j, c_t)} \right] \quad (1) \]</p>
          <p>The primary objective of the InfoNCE function is to minimize the distance between the embeddings
of the anchor and positive examples, while simultaneously maximizing the distance between the anchor
and all negative examples in the batch. By doing so, the model learns to produce embeddings that are
closer together for semantically similar instances (i.e., the anchor and positive examples) and farther
apart for semantically dissimilar instances (i.e., the anchor and negative examples).</p>
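          <p>A sketch of InfoNCE for a single anchor, using cosine similarity as the scoring function and a temperature of our own choosing (both are illustrative assumptions; the paper does not specify these details):</p>

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """-log softmax score of the positive against positive + in-batch negatives."""
    scores = [cosine(anchor, positive) / temperature]
    scores += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp for the denominator.
    mx = max(scores)
    log_z = mx + math.log(sum(math.exp(s - mx) for s in scores))
    return log_z - scores[0]
```

          <p>Pulling the anchor toward its positive lowers the loss, while a negative that scores as high as the positive raises it to log 2 and beyond.</p>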
        </sec>
        <sec id="sec-4-2-4">
          <title>4.2.4. Final loss formulation</title>
          <p>The final loss is calculated as the direct sum of the three losses:
\[ \mathcal{L} = \mathcal{L}_{s2s} + \mathcal{L}_{cls} + \mathcal{L}_{contr}. \]</p>
        </sec>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Hyperparameters</title>
        <p>
          For the experimental training run of the bigscience/mt0-large model 15, the effective batch size was set to
128, with the Adafactor [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] optimizer and a learning rate of 3e-4. For the final runs on the s-nlp/mt0-xl-detox-orpo [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]
model 16, the effective batch size and the optimizer remained the same, while the learning rate was changed to
5e-5. All models were trained for 7 epochs.
        </p>
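        <p>A hedged sketch of how these hyperparameters might map onto the Hugging Face Trainer API; the output path and the split of the effective batch into per-device batch size and gradient accumulation are our own illustrative assumptions, not the authors' exact configuration.</p>

```python
# Sketch only: reported hyperparameters expressed as Seq2SeqTrainingArguments.
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="mt0-large-detox",    # placeholder path
    per_device_train_batch_size=16,  # 16 x 8 accumulation = effective batch 128
    gradient_accumulation_steps=8,
    learning_rate=3e-4,              # 5e-5 for the mt0-xl-detox-orpo runs
    optim="adafactor",
    num_train_epochs=7,
)
```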
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments</title>
      <p>First, we validated the idea with a smaller model on a subset of the synthetic SynthDetoxM dataset, then
trained the same model on a custom data mix, and finally trained a bigger model for the final prediction.</p>
      <sec id="sec-5-1">
        <title>5.1. Preliminary experiments</title>
        <p>To confirm the hypothesis that a Sage-T5-like approach would increase detoxification scores, we
randomly sampled 600 examples per language from the SynthDetoxM dataset as test data in Russian, French
and Spanish, and trained three detoxification models on the remaining 3400 training examples
per language. One model was the baseline, trained on the sequence-to-sequence paraphrasing task. The
second model was trained for sequence-to-sequence paraphrasing together with a classification
head. The third model was trained using both a classification head and a contrastive loss for the encoder.
Following the SynthDetoxM methodology, the models were evaluated with the STA and SIM scores.
15https://huggingface.co/bigscience/mt0-large
16https://huggingface.co/s-nlp/mt0-xl-detox-orpo</p>
        <p>The results are presented in Table 2. The addition of the classification and contrastive losses
increases the STA scores but slightly decreases the SIM score. This means that the models learn to
paraphrase better, making the detoxified outputs less toxic, but these outputs differ more from the original
texts. This confirmed the validity of the approach on a clean, high-quality synthetic dataset.</p>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Expanding the evaluation to more languages</title>
        <p>The second stage of our experiments consisted of two parts: creating a data mix for training a massively
multilingual model, and training a set of smaller models to evaluate our approach on the test set of the
competition. The data mix comprised the public part of the MultiParaDetox dataset (400 examples per
language), the SynthDetoxM dataset (4000 examples per language), and our
synthetic SynthDetoxM-like generated dataset.</p>
        <p>To further validate the approach, we again trained three models: a baseline paraphrasing model,
a model with a classification loss added to the paraphrasing loss, and a model with all three losses. As the
base model, bigscience/mt0-large was selected. The results were evaluated on the test data using the
CodaLab leaderboard.</p>
        <p>In contrast to the results of the preliminary experiments on a small set of languages, adding the
additional losses when training on all 15 languages did not improve detoxification quality. On average,
adding the classification loss helped slightly with the new languages (Italian, French, Hebrew, Hinglish,
Tatar, Japanese) and slightly decreased the scores for the old languages. Adding all three losses decreased all
scores. Detailed scores are shown in Table 3.</p>
        <p>Looking at the per-language scores, we can see that the largest increase in detoxification
quality from adding a classification head is in Amharic, Hebrew, Hinglish, Japanese, Italian and Russian.
All of these languages except Russian are low-resource languages that did not dominate
the pretraining dataset of the model, so we can say that adding a classification head works best for low-resource
detoxification training. Per-language scores are shown in Table 4.</p>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Final model training</title>
        <p>Since our expanded evaluation showed that, on average, training the model only on paraphrases yields
the best quality, this approach was selected for the final model training. For this training pass, the
s-nlp/mt0-xl-detox-orpo model was selected as the base model and
then finetuned on our dataset mix, consisting of MultiParaDetox, SynthDetoxM and our synthetic
dataset. The final results and comparisons to the baselines are shown in Table 5.</p>
        <p>Our model placed 14th in the final ranking, outperforming all simple baselines, gpt4o and o3-mini,
while losing to the gpt4 and mt0 baselines. Detoxification examples can be seen in Appendix A.</p>
        <p>Additionally, the organizers provided an LLM-as-a-judge final evaluation, in which a Llama-3-8B-Instruct model
was finetuned on the manual annotations from the previous year's competition for toxicity
pairwise comparison and similarity tasks. The fluency metric was still calculated by the xcomet-lite model.
The results of this evaluation can be seen in Appendix B.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Results and discussion</title>
      <p>Our approach demonstrated strong performance on three selected languages. However, it did not
generalize well to a broader set of languages. We hypothesize that this limitation stems from the
low quality of toxic token annotation in some languages, likely due to incomplete or inconsistent
toxic lexicons. Future work should investigate more robust methods for toxic span detection, such as
leveraging large language models to improve annotation quality.</p>
      <p>We attribute the gap between our final model and GPT-4 primarily to the significantly smaller size of
our backbone model, mT0-XL, which contains only 3 billion parameters. However, another important
observation emerged during our experiments: directly fine-tuning the original mT0 model on our
custom multilingual detoxification data mixture resulted in a noticeable drop in performance compared
to the initial zero-shot baseline.</p>
      <p>
        This degradation may be explained by the shift in training methodology. The original model was
trained using ORPO [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] optimization. This optimization can create a fragile equilibrium in the model’s
parameter space, where the learned behaviors depend heavily on maintaining the alignment enforced
during ORPO.
      </p>
      <p>When we applied regular supervised fine-tuning (SFT) on raw, unfiltered training data, it is likely
that this alignment was disrupted. SFT tends to push the model back toward the mode of the new data
distribution, which may conflict with the preference-aligned behavior established by ORPO. As a result,
the model may regress or exhibit erratic outputs, especially in nuanced tasks like detoxification, where
subtle distributional shifts can lead to pronounced degradation in quality. This highlights the need
for more careful integration of aligned models and raw training data, particularly when extending or
adapting preference-optimized backbones to new domains.</p>
      <p>Furthermore, we did not apply any data cleaning procedures, and the suboptimal quality of the
MultiParaDetox dataset may have further impacted the model's effectiveness.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusions</title>
      <p>In this paper, we propose a novel approach to text detoxification using two auxiliary losses. If high-quality
markup is used for training the encoder classification head, our approach significantly outperforms
plain seq2seq training. However, with weak markup, seq2seq training still works better than our approach.</p>
      <p>Our final submission outperformed all simple baselines, o3-mini and gpt4o on the private test set, coming
close to the detoxification quality of the much larger gpt4 model.</p>
      <p>Our data preprocessing and model training scripts can be found on GitHub 17. Our trained models
can be found on HuggingFace:
• Model with detox and classification losses 18;
• Model with detox and contrastive losses 19;
• Model with all losses 20.</p>
      <p>Our collected dataset is also available at our HuggingFace repository21.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Limitations and Future Work</title>
      <p>Our exploration of a Sage-T5-like approach for multilingual text detoxification, while yielding valuable
insights, encountered several limitations. The primary challenge was the inconsistent generalizability
of the multi-task learning benefits (seq2seq, classification, and contrastive losses) when scaling from
a few well-performing languages to the full set of 15. This suggests that the uniform application of
these auxiliary losses might not be optimal across diverse linguistic structures and data availabilities.
Furthermore, the performance of our token-level classification, and consequently of the entire multi-task
model, was heavily reliant on the quality of toxic token annotations derived from multilingual lexicons.
The incompleteness of, and inconsistencies within, these lexicons likely introduced noise, particularly for
less-resourced languages. Future work could address these issues by improving annotation quality via
more sophisticated toxic span detection methods, such as leveraging large language models for few-shot
annotation.
17https://github.com/chameleon-lizard/Sage-Detox
18https://huggingface.co/alexandro767/SageDetox_detox_classification
19https://huggingface.co/alexandro767/SageDetox_detox_contrastive
20https://huggingface.co/alexandro767/SageDetox_detox_classification_contrastive
21https://huggingface.co/datasets/alexandro767/CLEF_2025_dataset_full</p>
      <p>Another set of limitations pertains to the model architecture and data handling. The choice of
bigscience/mt0-xl as the backbone, while competitive, is significantly smaller than some leading
proprietary models, inherently constraining its capacity. Moreover, fine-tuning the
s-nlp/mt0-xl-detox-orpo model (originally trained with ORPO) using our direct supervised approach led to performance
degradation, indicating a potential mismatch in training paradigms or catastrophic forgetting. The
quality of the aggregated training data, which did not undergo extensive cleaning, might also have
impacted performance. Future research could benefit from experimenting with larger, more capable
open-source multilingual models and implementing data filtering and cleaning protocols.</p>
    </sec>
    <sec id="sec-9">
      <title>9. Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used Gemini and Grammarly. Gemini was used for the
translation of the toxic and non-toxic claims in the paper, and both Gemini and Grammarly were used
for checking grammar and spelling. After using these tools/services, the author(s) reviewed and edited
the content as needed and take(s) full responsibility for the publication's content.</p>
    </sec>
    <sec id="sec-10">
      <title>A. Generation examples</title>
      <p>Original text → paraphrased version (language code):
• uk: Просто забирайся нахуй з моєї голови. → я просто забирайся з моєї голови.
• tt: заебали инде шул ватс апыгыз белэн)) → алар бик начар кешеләр.
• hin: Maro. Maderchodo ko chodo mat.. Ek ek ko gus ke maro → yeh log apni baat ko shuru kar rahe hain.
• fr: Je n ai jamais rencontré d individu aussi bas que ce rat d égout ! → je n ai jamais rencontré d individu aussi bas que ce rat !
• es: Si le votan a este energúmeno de verdad no aman a su ciudad. → si le votan a este político de verdad no aman a su ciudad.
• ru: А нахуй ты тут персональный чатик устроил, дегенерат? → а почему ты тут персональный чат устроил?
• en: Maybe so, but at least American’s are smarter than dumbass Brits. → maybe so, but at least Americans are smarter than those of brits.
• de: Stoppt denn keiner diese Dummbacke! → stoppt denn keiner diese menschen!</p>
    </sec>
    <sec id="sec-11">
      <title>B. LLM-as-a-Judge evaluation results</title>
      <p>LLM-as-a-Judge evaluation on languages with parallel markup; our submission is highlighted in bold, the rest are baselines.</p>
      <p>After the end of the competition, the organizers performed another evaluation round, using a finetuned
Llama-3-8B-Instruct model as a judge. This evaluation produced a shakeup in the rankings. The results
can be seen in Table 7 and Table 8.</p>
      <p>Our model outperformed the gpt4 baseline in this evaluation round on the languages with available parallel
data, while not surpassing the same baseline on the languages without parallel markup. This can be
attributed to the low-resource nature of said languages: the models underperformed in them due
to tokenization quality, the low amount of pretraining data, and these languages being generally out of
distribution for the models which were used as a base for our detoxifiers.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] N. Martynov, M. Baushenko, A. Kozlova, K. Kolomeytseva, A. Abramov, A. Fenogenova, A methodology for generative spelling correction via natural spelling errors emulation across multiple domains and languages, in: Y. Graham, M. Purver (Eds.), Findings of the Association for Computational Linguistics: EACL 2024, St. Julian's, Malta, March 17-22, 2024, Association for Computational Linguistics, 2024, pp. 138-155. URL: https://aclanthology.org/2024.findings-eacl.10.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] D. Moskovskiy, N. Sushko, S. Pletenev, E. Tutubalina, A. Panchenko, SynthDetoxM: Modern LLMs are few-shot parallel detoxification data annotators, in: L. Chiruzzo, A. Ritter, L. Wang (Eds.), Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL 2025 - Volume 1: Long Papers, Albuquerque, New Mexico, USA, April 29 - May 4, 2025, Association for Computational Linguistics, 2025, pp. 5714-5733. URL: https://aclanthology.org/2025.naacl-long.294/.
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] D. Dementieva, D. Moskovskiy, N. Babakov, A. A. Ayele, N. Rizwan, F. Schneider, X. Wang, S. M. Yimam, D. Ustalov, E. Stakovskii, A. Smirnova, A. Elnagar, A. Mukherjee, A. Panchenko, Overview of the multilingual text detoxification task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] S. Pletenev, SomethingAwful at PAN 2024 TextDetox: Uncensored Llama 3 helps to censor better, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 2843-2851. URL: https://ceur-ws.org/Vol-3740/paper-273.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] E. Rykov, K. Zaytsev, I. Anisimov, A. Voronin, SmurfCat at PAN 2024 TextDetox: Alignment of multilingual transformers for text detoxification, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2024), Grenoble, France, 9-12 September, 2024, volume 3740 of CEUR Workshop Proceedings, CEUR-WS.org, 2024, pp. 2866-2871. URL: https://ceur-ws.org/Vol-3740/paper-276.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] D. Dementieva, V. Protasov, N. Babakov, N. Rizwan, I. Alimova, C. Brune, V. Konovalov, A. Muti, C. Liebeskind, M. Litvak, D. Nozza, S. Shah Khan, S. Takeshita, N. Vanetik, A. A. Ayele, F. Schneider, X. Wang, S. M. Yimam, A. Elnagar, A. Mukherjee, A. Panchenko, Overview of the multilingual text detoxification task at PAN 2025, in: G. Faggioli, N. Ferro, P. Rosso, D. Spina (Eds.), Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] J. Bevendorff, D. Dementieva, M. Fröbe, B. Gipp, A. Greiner-Petter, J. Karlgren, M. Mayerl, P. Nakov, A. Panchenko, M. Potthast, A. Shelmanov, E. Stamatatos, B. Stein, Y. Wang, M. Wiegmann, E. Zangerle, Overview of PAN 2025: Voight-Kampff Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection, in: J. C. de Albornoz, J. Gonzalo, L. Plaza, A. G. S. de Herrera, J. Mothe, F. Piroi, P. Rosso, D. Spina, G. Faggioli, N. Ferro (Eds.), Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF 2025), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2025.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] F. Feng, Y. Yang, D. Cer, N. Arivazhagan, W. Wang, Language-agnostic BERT sentence embedding, CoRR abs/2007.01852 (2020). URL: https://arxiv.org/abs/2007.01852. arXiv:2007.01852.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] D. Larionov, M. Seleznyov, V. Viskov, A. Panchenko, S. Eger, xCOMET-lite: Bridging the gap between efficiency and quality in learned MT evaluation metrics, in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Association for Computational Linguistics, 2024, pp. 21934-21949. URL: https://aclanthology.org/2024.emnlp-main.1223.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 15991-16111. URL: https://doi.org/10.18653/v1/2023.acl-long.891. doi:10.18653/v1/2023.acl-long.891.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11] A. van den Oord, Y. Li, O. Vinyals, Representation learning with contrastive predictive coding, CoRR abs/1807.03748 (2018). URL: http://arxiv.org/abs/1807.03748. arXiv:1807.03748.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12] N. Shazeer, M. Stern, Adafactor: Adaptive learning rates with sublinear memory cost, in: J. G. Dy, A. Krause (Eds.), Proceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10-15, 2018, volume 80 of Proceedings of Machine Learning Research, PMLR, 2018, pp. 4603-4611. URL: http://proceedings.mlr.press/v80/shazeer18a.html.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13] J. Hong, N. Lee, J. Thorne, ORPO: Monolithic preference optimization without reference model, in: Y. Al-Onaizan, M. Bansal, Y. Chen (Eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, Association for Computational Linguistics, 2024, pp. 11170-11189. URL: https://aclanthology.org/2024.emnlp-main.626.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>