PAN 2024 Multilingual TextDetox: Exploring Cross-lingual Transfer Using Large Language Models
Notebook for PAN at CLEF 2024

Vitaly Protasov (vitasprotas@gmail.com)
Artificial Intelligence Research Institute (AIRI), Moscow, Russia

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09–12, 2024, Grenoble, France

Abstract
Text detoxification is a text-to-text generation task that relies on available data for experiments. In recent years, this task has primarily focused on well-resourced languages while neglecting lower-resource ones. This work explores various approaches to building a multilingual solution, with an emphasis on the nine languages of the Multilingual Text Detoxification Task at PAN 2024. Throughout the experiments, we consider not only different model types but also fine-tuning on various combinations of datasets. As a result, we achieve third place in the human evaluation and show promising progress towards a multilingual solution for the text detoxification task using large language models such as mT0 and XGLM. We also observe that fine-tuning on combinations of relatively similar languages is a promising direction, especially when real data for some languages is lacking.

Keywords
PAN 2024, Multilingual Text Detoxification (TextDetox) 2024, cross-lingual transfer, large language models

1. Introduction

Text detoxification is the process of rewriting a given text to remove or rephrase toxic or rude elements, making it more respectful and appropriate for a wider audience. This task has gained significant attention due to the growing concern for creating a safer and more inclusive online environment [1].

The text detoxification task presents several challenges. One primary issue is establishing clear and formalized criteria for defining inappropriate content. Another challenge lies in deciding the appropriate action for detected inappropriate parts of a sentence (whether they should be deleted, rewritten, or preserved) and whether the original meaning should be revised. While annotation criteria can help address these concerns, ensuring a consistent understanding across different research studies and proposed datasets is crucial for making detoxification methods more deterministic and stable.

Another significant challenge is the application of detoxification methods across various languages [2]. Current research predominantly focuses on high-resource languages such as English and Russian, while languages with fewer resources and data remain underrepresented. To address this gap, PAN at CLEF 2024 has introduced the Multilingual Text Detoxification task [3, 4], offering data for nine languages to support and advance research and development in this critical area.

In this study, we focus on developing a multilingual solution for the text detoxification task. We explore various methodologies, including fine-tuning encoder-decoder models such as mBART [5] and mT0 [6], as well as the decoder-only large language model (LLM) XGLM [7]. Additionally, we conduct experiments on cross-lingual transfer by training models on different language combinations to achieve the highest performance on the test evaluation set. Combining predictions from mT0 and XGLM proved to be our best solution, securing fourth place in the test stage based on automatic evaluation and third place according to manual human evaluation.
2. Related work

In earlier works, unsupervised methods like CondBERT [8] demonstrated effectiveness in the text detoxification task by identifying and rephrasing toxic parts of the text. However, these methods were eventually surpassed by encoder-decoder approaches [9, 2]. Consequently, newer methods emerged that treat detoxification similarly to a machine translation task, where the toxic text is the input and its detoxified version is the output. This approach has led to the inclusion of a growing number of languages, fostering solutions for previously unaddressed languages. For example, [10] organized a competition centered on detoxifying Russian text, highlighting various methods, including decoder-only models. Additionally, [11] introduced a dataset for English and proposed new detoxification methods. Following this, [12] investigated strategies for transferring knowledge to new languages using translation models as an intermediary step. Our work aims to address the underrepresentation of methods for different languages by considering a new dataset covering nine languages.

3. Experimental setup

3.1. Models

Most recent high-performing text detoxification approaches rely on encoder-decoder models. We start with models of this architecture and also include decoder-only LLMs, which have shown impressive results in natural language processing (NLP) [13]. Our experiments consider the encoder-decoder models mBART and mT0, while for the decoder-only approach we focus on XGLM. Previous studies have highlighted the effectiveness of models like mBART and mT5 [14], but we also consider mT0 because its multitask fine-tuning could be advantageous for text detoxification. As the decoder-only model, we choose XGLM because of its multilingual nature and strong reported performance. It is worth noting that XGLM was not trained on Amharic data; thus, no results are reported for this language. For mT0, we consider several model sizes: base (bigscience/mt0-base), large (bigscience/mt0-large), and xl (bigscience/mt0-xl). For XGLM, we use only the 7.5B variant (facebook/xglm-7.5B).

3.2. Fine-tuning pipeline

Along with the dataset provided for development (textdetox/multilingual_paradetox), we utilize the English (https://huggingface.co/datasets/s-nlp/paradetox) and Russian (https://huggingface.co/datasets/s-nlp/ru_paradetox) ParaDetox [11] datasets during the experiments. Our objective is to use these monolingual datasets to tailor models to text detoxification before fine-tuning on the provided multilingual dataset. As previously stated in Section 1, potential discrepancies exist between the collected datasets; therefore, we do not merge them for fine-tuning and use them separately. Although we consider both encoder-decoder and decoder-only models, the fine-tuning procedure is essentially the same for both. We examine two types of datasets during fine-tuning: the English and Russian ParaDetox (PD) datasets and the MultilingualParaDetox (MPD) dataset covering nine languages. With the former, we explore whether initial fine-tuning on English and Russian improves the convergence of multilingual models during the final fine-tuning on MPD.
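As an illustration, the following sketch shows how this two-stage fine-tuning could be set up with the Hugging Face transformers and datasets libraries. It is a minimal sketch rather than the exact competition code: the column names "toxic" and "neutral", the use of a plain "train" split, and the hyperparameter values are illustrative assumptions, since the actual ParaDetox and MPD datasets may use different field names.

# Minimal sketch of the two-stage fine-tuning: PD first, then MPD.
# Column names "toxic"/"neutral" and the "train" split are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

MODEL_NAME = "bigscience/mt0-large"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_NAME)

def preprocess(batch):
    # Toxic text is the source sequence; the detoxified rewrite is the target.
    enc = tokenizer(batch["toxic"], truncation=True, max_length=128)
    labels = tokenizer(text_target=batch["neutral"], truncation=True, max_length=128)
    enc["labels"] = labels["input_ids"]
    return enc

def finetune(dataset_name, output_dir):
    train = load_dataset(dataset_name, split="train")
    train = train.map(preprocess, batched=True, remove_columns=train.column_names)
    args = Seq2SeqTrainingArguments(output_dir=output_dir,
                                    per_device_train_batch_size=8,
                                    learning_rate=1e-4,
                                    num_train_epochs=3,
                                    save_strategy="epoch")
    trainer = Seq2SeqTrainer(model=model, args=args, train_dataset=train,
                             data_collator=DataCollatorForSeq2Seq(tokenizer, model=model))
    trainer.train()

# Stage 1: monolingual ParaDetox (PD); Stage 2: the multilingual MPD data.
finetune("s-nlp/paradetox", "mt0-large-pd")
finetune("textdetox/multilingual_paradetox", "mt0-large-pd-mpd")

Consistent with the setup described above, no prompt or instruction is prepended to the toxic input; the model is trained purely on the parallel pairs.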
Finally, we conduct fine-tuning on the multilingual dataset using different language combinations: (i) monolingual fine-tuning on each language independently; (ii) multilingual fine-tuning on all available languages; (iii) fine-tuning on different combinations of languages. For the third approach, our hypothesis is that fine-tuning on closely related languages together benefits performance more than a purely monolingual approach. Also, since MPD provides little annotated data (only 400 examples per language), combining similar languages augments the training data and may lead to improved performance on holdout sets.

Due to the computational cost of LLMs such as mT0-xl and XGLM, we choose not to fine-tune all of their weights. Instead, we use Low-Rank Adaptation (LoRA) [15] to make training these large models feasible; a sketch of this setup is given in Section 4.2.

4. Results

We do not use prompts during the experiments and rely solely on model convergence, without providing additional instructions during fine-tuning and evaluation. Additionally, early stopping [16] is employed. All reported results are metric values measured automatically on the test set. We test our hypotheses iteratively rather than exploring every possible scenario in each experiment, so we do not present results for every experimental setup.

4.1. Multilingual fine-tuning

First, we investigate the multilingual performance of various models and fine-tune them on different combinations of datasets, including PD and MPD. Table 1 presents the results of these experiments. Notably, utilizing the PD dataset enhances target performance in most languages after fine-tuning mBART and mT0-base. For the other models, we decided not to explore different dataset combinations due to their large size and the complexity of the experiments; for them, we only include results for PD and MPD together. As we can see, mT0-large achieves the highest target performance in five languages, mT0-xl+LoRA is the best for one language only, and XGLM+LoRA performs best in three languages.

Table 1
Test results of different multilingual models based on the datasets used during fine-tuning.

Model         Dataset   en    es     de    zh    ar     hi     uk    ru    am
mBART         MPD       0.36  0.28   0.31  0.05  0.19   0.11   0.24  0.27  0.08
mBART         PD+MPD    0.38  0.26   0.28  0.07  0.24   0.14   0.25  0.34  0.11
mT0-base      MPD       0.35  0.25   0.28  0.07  0.31   0.12   0.31  0.30  0.09
mT0-base      PD+MPD    0.38  0.29   0.32  0.06  0.36   0.13   0.34  0.32  0.11
mT0-large     PD+MPD    0.48  0.39   0.42  0.11  0.43   0.20   0.48  0.46  0.17
mT0-xl+LoRA   PD+MPD    0.41  0.41   0.45  0.10  0.47   0.16   0.43  0.41  0.15
XGLM+LoRA     PD+MPD    0.44  0.466  0.43  0.08  0.505  0.303  0.42  0.40  -

4.2. Fine-tuning across different combinations of languages

In Section 4.1, we found that mT0-large and XGLM+LoRA showed the most promising performance. However, conducting experiments with LLMs like XGLM requires significant computational resources and time. Since our focus in this section is to explore fine-tuning across various language combinations, we conduct the experiments here with mT0-large only, given the trade-off between its size and multilingual performance. In Table 2, we present the results obtained from fine-tuning on different combinations of languages; a sketch of how such combined training sets can be assembled, together with the LoRA setup, is shown below.
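The combined-language fine-tuning and the LoRA setup from Section 3.2 could be realized as in the following sketch, which assumes the Hugging Face datasets and peft libraries; the per-language configuration names for MPD ("ru", "uk") and the chosen LoRA hyperparameters are assumptions made for illustration, not the exact values used in our runs.

# Illustrative sketch: build a combined Russian + Ukrainian training set from
# MPD and attach LoRA adapters to mT0-xl so that only the low-rank adapter
# weights are updated during fine-tuning.
from datasets import concatenate_datasets, load_dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForSeq2SeqLM

# Per-language configuration names are an assumption about the MPD layout.
ru = load_dataset("textdetox/multilingual_paradetox", "ru", split="train")
uk = load_dataset("textdetox/multilingual_paradetox", "uk", split="train")
combined = concatenate_datasets([ru, uk]).shuffle(seed=42)

base = AutoModelForSeq2SeqLM.from_pretrained("bigscience/mt0-xl")
lora_config = LoraConfig(task_type=TaskType.SEQ_2_SEQ_LM,
                         r=16, lora_alpha=32, lora_dropout=0.05,
                         target_modules=["q", "v"])  # attention projections in the T5-style blocks
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable

# `combined` and `model` can then be passed to the same Seq2SeqTrainer setup
# sketched in Section 3.2.

Since only the adapter weights are optimized, the memory footprint of fine-tuning the larger models (mT0-xl, XGLM) stays manageable on our hardware.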
Specifically, we observe that fine-tuning the model on Russian or Ukrainian separately yields poorer performance than fine-tuning on their combination. A similar pattern is observed in the experiments with Hindi and Amharic, where training on their combination gives the best performance. However, for German, English, Spanish, and Arabic, fine-tuning on their combinations shows worse results than fine-tuning on each language separately; nevertheless, Spanish still improves when combined with English. As a result, we can confirm our hypothesis that combining closely related languages enriches the training data and improves results.

Table 2
Test results for mT0-large based on the combination of languages used during fine-tuning.

Dataset             en     es    de     zh  ar     hi     uk     ru     am
PD RU + MPD RU      -      -     -      -   -      -      -      0.456  -
PD RU + MPD UK      -      -     -      -   -      -      0.543  -      -
PD RU + MPD RU+UK   -      -     -      -   -      -      0.583  0.525  -
MPD HI              -      -     -      -   -      0.245  -      -      -
MPD AM              -      -     -      -   -      -      -      -      0.285
MPD HI+AM           -      -     -      -   -      0.274  -      -      0.298
PD EN + MPD EN      0.525  -     -      -   -      -      -      -      -
MPD DE              -      -     0.502  -   -      -      -      -      -
PD EN + MPD EN+DE   0.471  -     0.481  -   -      -      -      -      -
MPD ES              -      0.35  -      -   -      -      -      -      -
PD EN + MPD ES+EN   0.498  0.39  -      -   -      -      -      -      -
MPD AR              -      -     -      -   0.502  -      -      -      -
MPD ES+AR           -      0.37  -      -   0.491  -      -      -      -

4.3. Excluding toxic lexicon from combined results

Based on the reported results, our final submission combines the top-performing outputs of the mT0-large model for English, German, Ukrainian, Russian, and Amharic with those of XGLM+LoRA for Spanish, Arabic, and Hindi. Although we could not exceed the performance of the delete baseline for Chinese, we replicated its results and included them in our final submission. Afterward, we preprocess these combined results by excluding words found in the multilingual toxic lexicon dataset (textdetox/multilingual_toxic_lexicon) provided in the competition. Table 3 compares the results before and after excluding toxic lexicon words. As we can see, the removal of such words positively impacts almost all languages, though it does not affect the results for Chinese and German.

Table 3
Test results of the submission with the best combined outcomes before and after excluding toxic lexicon words.

         en     es     de     zh     ar     hi     uk     ru     am
Before   0.525  0.466  0.502  0.175  0.505  0.303  0.583  0.525  0.298
After    0.531  0.472  0.502  0.175  0.523  0.320  0.629  0.542  0.311
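The lexicon-based post-processing above amounts to a simple token-level filter. The sketch below assumes that the shared lexicon (textdetox/multilingual_toxic_lexicon) is organised into per-language subsets with a "text" column and that model outputs are available as plain lists of strings; both are assumptions for illustration, not a description of the official data format.

# Minimal sketch of the post-processing in Section 4.3: drop any token that
# appears in the multilingual toxic lexicon for the given language.
from datasets import load_dataset

def strip_toxic_words(sentences, lang):
    # The per-language configuration name and the "text" column are
    # assumptions about how the lexicon dataset is organised.
    lexicon = load_dataset("textdetox/multilingual_toxic_lexicon", lang, split="train")
    toxic = {word.lower() for word in lexicon["text"]}
    cleaned = []
    for sentence in sentences:
        # Remove every whitespace-separated token that matches a lexicon entry.
        kept = [token for token in sentence.split() if token.lower() not in toxic]
        cleaned.append(" ".join(kept))
    return cleaned

# Hypothetical usage: `model_outputs` maps each language to the detoxified
# sentences taken from the better-performing model for that language
# (mT0-large for en/de/uk/ru/am, XGLM+LoRA for es/ar/hi, the delete baseline for zh).
model_outputs = {"en": ["an example detoxified sentence"]}
final_submission = {lang: strip_toxic_words(sentences, lang)
                    for lang, sentences in model_outputs.items()}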
4.4. Manual evaluation results

As mentioned earlier, our top submission secured fourth place in the automatic test evaluation, yet it reached third place in the manual evaluation through human annotation (refer to Table 4). Notably, according to human evaluation, our results for Spanish, Hindi, and Arabic are among the best, indicating that decoder-only LLMs are effective at handling the text detoxification task and generate more human-like text.

Table 4
The leaderboard with the first five participants, ranked by the average target metric; the Human References row is included for comparison.

Participant        average  en    es    de    zh    ar    hi    uk    ru    am
Human References   0.85     0.88  0.79  0.71  0.93  0.82  0.97  0.90  0.80  0.85
SomethingAwful     0.77     0.86  0.83  0.89  0.53  0.74  0.86  0.69  0.84  0.71
adugeen            0.74     0.83  0.73  0.70  0.60  0.82  0.68  0.84  0.76  0.71
VitalyProtasov     0.72     0.69  0.81  0.77  0.49  0.79  0.87  0.67  0.73  0.68
nikita.sushko      0.71     0.70  0.62  0.79  0.47  0.89  0.84  0.67  0.74  0.68
erehulka           0.71     0.88  0.71  0.85  0.68  0.78  0.52  0.63  0.65  0.69

5. Conclusion

This study explored fine-tuning various models with different architectures for the task of text detoxification. Our experiments also investigated the use of varied combinations of datasets and languages during fine-tuning. By combining different approaches, we achieved fourth place in the automatic test evaluation and third place in the human evaluation. In particular, we demonstrated that cross-lingual transfer between languages is a promising approach, improving languages such as Ukrainian and Amharic by transferring knowledge from languages such as Russian and Hindi, respectively. We also showed that training decoder-only LLMs is a promising direction, yielding our best results according to human evaluation, in line with recent advances in NLP.

References

[1] G. Floto, M. M. T. pour, P. Farinneya, Z. Tang, A. Pesaranghader, M. Bharadwaj, S. Sanner, Diffudetox: A mixed diffusion model for text detoxification, ArXiv abs/2306.08505 (2023). URL: https://api.semanticscholar.org/CorpusID:259164399.
[2] D. Moskovskiy, D. Dementieva, A. Panchenko, Exploring cross-lingual text detoxification with large multilingual language models, ArXiv abs/2206.02252 (2022). URL: https://api.semanticscholar.org/CorpusID:249394890.
[3] D. Dementieva, D. Moskovskiy, N. Babakov, A. A. Ayele, N. Rizwan, F. Schneider, X. Wang, S. M. Yimam, D. Ustalov, E. Stakovskii, A. Smirnova, A. Elnagar, A. Mukherjee, A. Panchenko, Overview of the Multilingual Text Detoxification Task at PAN 2024, in: G. Faggioli, N. Ferro, P. Galuščáková, A. G. S. de Herrera (Eds.), Working Notes of CLEF 2024 - Conference and Labs of the Evaluation Forum, CEUR-WS.org, 2024.
[4] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[5] Y. Liu, J. Gu, N. Goyal, X. Li, S. Edunov, M. Ghazvininejad, M. Lewis, L. Zettlemoyer, Multilingual denoising pre-training for neural machine translation, Transactions of the Association for Computational Linguistics 8 (2020) 726–742. URL: https://api.semanticscholar.org/CorpusID:210861178.
[6] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z.-X. Yong, H. Schoelkopf, X. Tang, D. R. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: Annual Meeting of the Association for Computational Linguistics, 2023. URL: https://api.semanticscholar.org/CorpusID:253264914.
[7] X. V. Lin, T. Mihaylov, M. Artetxe, T. Wang, S. Chen, D. Simig, M. Ott, N. Goyal, S. Bhosale, J. Du, R. Pasunuru, S. Shleifer, P. S. Koura, V. Chaudhary, B. O'Horo, J. Wang, L. Zettlemoyer, Z. Kozareva, M. T. Diab, V. Stoyanov, X. Li, Few-shot learning with multilingual language models, ArXiv abs/2112.10668 (2021). URL: https://api.semanticscholar.org/CorpusID:260651613.
[8] D. Dale, A. Voronov, D. Dementieva, V. Logacheva, O. Kozlova, N. Semenov, A. Panchenko, Text detoxification using large pre-trained neural models, ArXiv abs/2109.08914 (2021). URL: https://api.semanticscholar.org/CorpusID:237572304.
[9] L. Laugier, J. Pavlopoulos, J. S. Sorensen, L. Dixon, Civil rephrases of toxic texts with self-supervised transformers, ArXiv abs/2102.05456 (2021). URL: https://api.semanticscholar.org/CorpusID:231861515.
[10] V. Logacheva, D. Dementieva, I. Krotova, A. Fenogenova, I. Nikishina, T. Shavrina, A. Panchenko, A study on manual and automatic evaluation for text style transfer: The case of detoxification, in: Proceedings of the 2nd Workshop on Human Evaluation of NLP Systems (HumEval), 2022. URL: https://api.semanticscholar.org/CorpusID:248780050.
[11] V. Logacheva, D. Dementieva, S. Ustyantsev, D. Moskovskiy, D. Dale, I. V. Krotova, N. Semenov, A. Panchenko, ParaDetox: Detoxification with parallel data, in: Annual Meeting of the Association for Computational Linguistics, 2022. URL: https://api.semanticscholar.org/CorpusID:248780527.
[12] D. Dementieva, D. Moskovskiy, D. Dale, A. Panchenko, Exploring methods for cross-lingual text style transfer: The case of text detoxification, in: International Joint Conference on Natural Language Processing, 2023. URL: https://api.semanticscholar.org/CorpusID:265445167.
[13] H. Naveed, A. U. Khan, S. Qiu, M. Saqib, S. Anwar, M. Usman, N. Barnes, A. S. Mian, A comprehensive overview of large language models, ArXiv abs/2307.06435 (2023). URL: https://api.semanticscholar.org/CorpusID:259847443.
[14] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: North American Chapter of the Association for Computational Linguistics, 2020. URL: https://api.semanticscholar.org/CorpusID:225040574.
[15] J. E. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, W. Chen, LoRA: Low-rank adaptation of large language models, ArXiv abs/2106.09685 (2021). URL: https://api.semanticscholar.org/CorpusID:235458009.
[16] J. Dodge, G. Ilharco, R. Schwartz, A. Farhadi, H. Hajishirzi, N. A. Smith, Fine-tuning pretrained language models: Weight initializations, data orders, and early stopping, ArXiv abs/2002.06305 (2020). URL: https://api.semanticscholar.org/CorpusID:211132951.