<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>TextDetox CLEF 2025/Multilingual Text Detoxification 2025 Jiaozipi: A Multilingual Text Detoxification Method Based on Large Language Model-Based Ensemble Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Xiaohui Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yusheng Yi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zhaotian Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Simin Xu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Zijun Ke</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Guo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yubo Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Wenxuan Zhang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jiayi Chen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Han</string-name>
          <email>hanyong2005@fosu.edu.cn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Foshan University</institution>
          ,
          <addr-line>Foshan</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <abstract>
        <p>This paper proposes a solution for the multilingual text detoxification task at CLEF 2025. The task requires detoxifying explicitly toxic texts across 15 languages while preserving the main content as much as possible. To address the task, we propose a solution based on prompt engineering and an ensemble of LLMs. As a first step, we extend the official dataset to construct a parallel text detoxification dataset and a toxic keywords list. We then employ the RISE prompting framework to generate initial system instructions. These instructions, combined with few-shot examples and user input, form structured prompts that guide multiple commercial large language models (DeepSeek, Qwen, Kimi) to produce detoxified outputs. Finally, the best result is selected via a multi-dimensional evaluation considering semantic preservation, toxicity reduction, style consistency, and fluency. Our method ranked 9th on the automatic evaluation metrics.</p>
      </abstract>
      <kwd-group>
        <kwd>PAN 2025</kwd>
        <kwd>multilingual text detoxification</kwd>
        <kwd>large language model</kwd>
        <kwd>RISE</kwd>
        <kwd>Few-shot Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        With the rapid development of social media, toxic texts on online platforms have increased sharply,
including racially discriminatory remarks, personal attacks, hate speech, and other inappropriate content.
To address this issue, text detoxification has been proposed as an intervention approach grounded
in natural language generation. Advanced approaches to text detoxification primarily employ
deep learning models to automatically detect toxic elements in text, such as insulting or discriminatory
expressions, and then transform them into neutral formulations that preserve the
original semantics [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
      </p>
      <p>
        The multilingual text detoxification task at CLEF 2025 [
        <xref ref-type="bibr" rid="ref2 ref3">2, 3</xref>
        ] aims at producing a neutral version
of a user message that preserves the original meaning. The task covers 15 languages, including
high-resource languages such as English, Chinese, and Spanish, as well as low-resource or morphologically
complex languages such as Amharic and Tatar.
      </p>
      <p>The main challenge of this task is implicit toxicity, such as sarcasm, passive aggressiveness, or direct
hate toward a group, where no neutral content can be found. Such implicit toxicity is difficult
to detoxify in a way that renders the underlying intent genuinely non-toxic.</p>
      <p>CLEF 2025 Working Notes, 9 – 12 September 2025, Madrid, Spain
* Corresponding authors: Yi is the first corresponding author and Han is the second corresponding author.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        Text detoxification tasks aim to convert toxic text into neutral expressions while preserving the original
meaning. In 2024, Peng et al. proposed a method based on few-shot learning and the CO-STAR
framework, combined with chat models such as Kimi, for multilingual text detoxification. By generating
few-shot learning contexts and structured prompts, this approach significantly improved detoxification
performance in high-resource languages such as English and Chinese [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. In the same year, Řehulka and
Šuppa explored retrieval-augmented generation (RAG) and dynamic prompt construction to enrich large
language models (LLMs) with external knowledge, achieving competitive results in multilingual
detoxification tasks [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. However, for some low-resource languages such as Amharic, the lack of sufficient
training data substantially limited performance. They therefore adopted a deletion approach, directly
removing toxic keywords to ensure detoxification effectiveness. These approaches demonstrate notable
progress in detoxification for high-resource languages, but their effectiveness remains constrained
by limited multilingual training data. Effectively leveraging existing data to improve detoxification
performance on low-resource languages remains a challenge.
      </p>
      <p>
        LLMs, pretrained on massive corpora via self-supervised learning, acquire broad and emergent
linguistic capabilities. However, achieving strong performance on specific downstream tasks often
necessitates fine-tuning, which requires substantial annotated data and computational resources. In
contrast, prompt engineering activates latent model capabilities through the design of effective
prompt instructions, which can improve the relevance, coherence, and accuracy of
model outputs [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
      <p>
        Prompt engineering is a systematic approach to designing, writing, and optimizing input prompts for
LLMs to guide them toward the expected output. To enhance the effectiveness of prompt engineering,
various prompt frameworks have been proposed. For example, chain-of-thought (CoT) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and few-shot
prompting [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] improve the interpretability and adaptability of LLMs in logical reasoning tasks and
low-data scenarios by guiding models to break down complex problems or by providing example references.
      </p>
      <p>
        Although prompt engineering has enhanced the ability of LLMs to perform text detoxification tasks,
single models still face challenges such as output instability and residual toxicity. Ensemble learning is
a method that integrates the predictions of multiple base models to improve the robustness, accuracy,
and generalization ability of a system [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. However, existing ensemble methods are often static,
relying on simple strategies such as majority voting or average scoring, which limits their flexibility
and effectiveness in complex generation tasks like text detoxification.
      </p>
    </sec>
    <sec id="sec-3">
      <title>3. Datasets</title>
      <p>
        In this task, we need to detoxify texts in 15 languages. However, the provided parallel text detoxification
dataset1 [
        <xref ref-type="bibr" rid="ref10 ref11 ref12">10, 11, 12</xref>
        ] covers only 9 languages. We therefore used Yuanbao AI2 to translate the English
portion of the parallel text detoxification dataset into Italian, French, Hebrew, Hinglish, Japanese, and
Tatar, with 100 translations for each language. The process is shown in Table 1.
      </p>
      <sec id="sec-3-1">
        <p>1: https://huggingface.co/datasets/textdetox/multilingual paradetox; 2: https://yuanbao.tencent.com/</p>
        <p>Although current mainstream commercial LLMs have significantly improved their coverage of parallel text
detoxification data, they still fail to recognize toxic keywords involving cultural dependence,
semantic ambiguity, or distorted expressions.</p>
        <p>To enhance the recognition of fine-grained toxic texts, we first attempted to extract toxic words using
the Toxic Keywords3 [13, 14] list provided in the task introduction. However, it is insufficient to support the
replacement of toxic text because it contains few entries; for example, the Amharic toxic keywords
list has only 245 records. We therefore extracted the negative words in Toxic Spans4 [15] and merged them with
Toxic Keywords. The extraction process is shown in Table 2; for example:
"all you trump c∗owns are seriously m∗ssed up." → c∗owns, m∗ssed;
"allowing whole colonies of such r∗bbish to arise should be p∗nishable by f∗ring the officials." → r∗bbish, p∗nishable, f∗ring;
"almost as f∗cked up as the cia funding and arming bin laden." → f∗cked up;
"amy, your ignorance is showing again." → i∗norance;
"and start sending c∗nts home." → c∗nts.</p>
        <p>Note: the negative connotations are what we extract and merge with Toxic Keywords.</p>
        <p>As a first step, we generate a parallel text detoxification dataset and a toxic keywords list from the
official dataset.</p>
        <p>Ultimately, we obtained datasets and toxic keywords lists for 15 languages, as follows:
• The extended datasets: There are 100 samples each for Italian (it), French (fr), Hebrew (he),
Hinglish (hin), Japanese (ja), and Tatar (tt), and 400 samples each for English (en), Spanish (es),
German (de), Chinese (zh), Arabic (ar), Hindi (hi), Ukrainian (uk), Russian (ru), and Amharic (am).
These samples are provided as examples to the large language models to optimize their outputs.
• Toxic Keywords List: The number of entries for each language is summarized in Table 3. These toxic keywords
are replaced with * in the toxic sentence. The replaced sentence is called the toxic voc replaced
result below.</p>
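The keyword replacement described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the masking style (keeping the first and last characters and starring the interior, as in the Table 2 examples) and the sample keyword list are assumptions.

```python
import re

def mask_toxic_keywords(sentence, toxic_keywords):
    """Produce the 'toxic voc replaced' result: each toxic keyword in the
    sentence is replaced by a partially starred form."""
    masked = sentence
    for word in toxic_keywords:
        if len(word) > 2:
            # Keep the first and last characters, star out the interior.
            replacement = word[0] + "*" * (len(word) - 2) + word[-1]
        else:
            replacement = "*" * len(word)
        # Whole-word, case-insensitive replacement.
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        masked = pattern.sub(replacement, masked)
    return masked

print(mask_toxic_keywords("you ignorant clowns", ["ignorant", "clowns"]))
# → you i******t c****s
```

This masked sentence later serves as one of the candidates passed to the ensemble evaluator.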
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Method</title>
      <p>Our method consists of three main steps: 1) constructing prompts using the RISE framework; 2) inputting
toxic sentences into three LLMs (Kimi5, DeepSeek6, and Qwen7) to generate detoxification results; 3)
feeding the detoxification results of the large models and the toxic voc replaced results into Qwen for
quality evaluation, which finally returns the best result as the output.</p>
      <sec id="sec-4-1">
        <p>3: https://huggingface.co/datasets/textdetox/multilingual toxic lexicon; 4: https://huggingface.co/datasets/textdetox/multilingual toxic spans; 5: https://www.kimi.com; 6: https://chat.deepseek.com; 7: https://www.tongyi.com/</p>
        <sec id="sec-4-1-1">
          <title>4.1. Constructing input texts</title>
        </sec>
      </sec>
      <sec id="sec-4-2">
        <title>1. Input prompt guided by the RISE framework</title>
        <p>Practical prompt construction is essential for eliciting optimal responses from LLMs. The RISE
framework serves as our structured template for prompt design, as illustrated in Figure 2. Its
application to this task is as follows. Role (R): the model is required to
act as a domain expert in linguistic processing, specifically tasked with text detoxification.
Input (I): the source material consists of the toxic text along with supplementary contextual data,
used to guide and refine the model's output. Steps (S): a systematic approach, comprising keyword
elimination and syntactic optimization, is employed to ensure precision and operational feasibility.
Expectation (E): the output must preserve the original meaning while achieving semantic
equivalence, linguistic fluency, and formal coherence. The response must follow a JSON output
format like {"toxic sentence": "", "neutral sentence": "", "lang": ""}.
2. Generate few-shot learning context</p>
        <p>This section describes how we generate the contents of the few-shot learning context.
Task demonstration: to help LLMs accurately understand the task requirements, we
provide a brief description of the task.</p>
        <p>Few-shot learning content: to help the model understand the neutral version of toxic text, we
provide few-shot learning content. This content contains toxic sentences and their corresponding
neutral sentences from the parallel text detoxification dataset of the target language. Figure 4 shows
an example for English (en); the processing for other languages is the same. The
parallel text detoxification dataset of each language is stored in dictionary form, making it easy
to retrieve the few-shot learning content for the corresponding language later.
3. Input toxic sentences
As Figure 5 shows, we insert a toxic sentence into the template &lt; |toxic sentence| &gt;&lt; |toxic sentence| &gt;
and send it to the large language model. With the help of few-shot learning and prompts based
on the RISE framework, the large language model returns formatted neutral sentences; Figure
5 demonstrates the real detoxification process.</p>
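The assembly of the three input components can be sketched as follows. The exact RISE wording, the few-shot dictionary contents, and the plain-text layout are illustrative assumptions, not the authors' verbatim templates (those appear in Figures 2-5).

```python
# Few-shot examples per language, stored in dictionary form as described above.
# The example pairs here are invented placeholders.
FEW_SHOT = {
    "en": [
        ("you are a complete idiot", "you are not being reasonable"),
        ("shut your stupid mouth", "please stop talking"),
    ],
}

RISE_TEMPLATE = """Role: You are a linguistic-processing expert specializing in text detoxification.
Input: A toxic sentence plus contextual examples.
Steps: Eliminate toxic keywords, then optimize the syntax for fluency.
Expectation: Preserve the original meaning; return JSON like
{{"toxic sentence": "", "neutral sentence": "", "lang": ""}}.

Examples:
{examples}

Toxic sentence: <|toxic sentence|>{sentence}<|toxic sentence|>
Language: {lang}"""

def build_prompt(sentence, lang):
    """Combine the RISE instructions, the language-specific few-shot
    pairs, and the toxic sentence into one structured prompt."""
    examples = "\n".join(
        f'toxic: "{t}" -> neutral: "{n}"' for t, n in FEW_SHOT.get(lang, [])
    )
    return RISE_TEMPLATE.format(examples=examples, sentence=sentence, lang=lang)
```

The resulting string is what gets sent to each of the three chat models.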
        <sec id="sec-4-2-1">
          <title>4.2. Evaluation</title>
          <p>In this section, we introduce how to use the Qwen model as an evaluator to evaluate and select optimal
detoxification results.</p>
          <p>1. Input prompt for evaluation
This prompt asks the large language model to select the optimal detoxified output from a list of
candidate texts. Our selection criteria and weights are as follows: lowest toxicity
score (weight: 0.3); highest semantic similarity to the original text (weight: 0.4); fluency and
naturalness of the generated sentences (weight: 0.2); consistency in style (weight: 0.1). We
require a JSON output format like {"toxic sentence": "", "neutral sentence": "", "lang": ""}.
2. The evaluation process of the large model
We insert the toxic sentence, the list of neutral sentences, and the corresponding language into
a template like that in Figure 7 and send it to the Qwen model. Guided by the prompt, the large language
model returns a formatted neutral sentence. Figure 7 demonstrates the real evaluation process.</p>
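Constructing the evaluator input can be sketched as follows. The prompt wording is a paraphrase of the criteria above, not the template in Figure 7, and no real API client is shown.

```python
EVAL_TEMPLATE = """Select the best detoxified version of the toxic sentence.
Weigh the candidates by: toxicity (weight 0.3, lower is better),
semantic similarity to the original (0.4), fluency and naturalness (0.2),
and style consistency (0.1).
Return JSON like {{"toxic sentence": "", "neutral sentence": "", "lang": ""}}.

Toxic sentence: {toxic}
Language: {lang}
Candidates:
{candidates}"""

def build_eval_prompt(toxic, candidates, lang):
    """Build the Qwen evaluation prompt. The candidate list contains the
    outputs of DeepSeek, Qwen, and Kimi plus the toxic voc replaced result."""
    listing = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return EVAL_TEMPLATE.format(toxic=toxic, lang=lang, candidates=listing)
```

The weighted selection itself is performed by the Qwen model in response to this prompt, not by local scoring code.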
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiment</title>
      <sec id="sec-5-1">
        <title>5.1. Settings</title>
        <p>For all 15 languages, we repeat the following steps:
1. Input of the few-shot learning context: construct the few-shot learning context using the datasets
and input it into the model. For different languages, replace the context sample information and
language identifiers accordingly.
2. Input prompts guided by the RISE framework: input the prompts into the large model
to guide it toward the correct output.
3. Input of the toxic sentence: embed the toxic text between &lt;toxic sentence&gt; and &lt;toxic sentence&gt; in
the framework (as shown in Figure 5), and then input it into the large language model.
4. Evaluate: the results of DeepSeek, Qwen, and Kimi, together with the toxic voc replaced results, are input
into the Qwen model for evaluation (as shown in Figure 7). Finally, the best result is
returned as the output.</p>
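The per-sentence loop over these steps might look like the following sketch. `detoxify_with_model` and `evaluate_with_qwen` are hypothetical stand-ins for the actual chat-model API calls; only the control flow of the ensemble reflects the method described above.

```python
MODELS = ["deepseek", "qwen", "kimi"]

def detoxify_with_model(model, prompt):
    # Placeholder for a real chat-completion call to DeepSeek/Qwen/Kimi
    # returning the JSON described in Section 4.
    return {"toxic sentence": "", "neutral sentence": f"[{model} output]", "lang": "en"}

def evaluate_with_qwen(toxic, candidates, lang):
    # Placeholder for the Qwen-based multi-dimensional evaluation;
    # here it trivially returns the first candidate.
    return candidates[0]

def run_pipeline(toxic_sentence, lang, prompt_builder, mask_fn, keywords):
    """Steps 1-3: build the few-shot/RISE prompt and query each model.
    Step 4: add the toxic voc replaced result and let Qwen pick the best."""
    prompt = prompt_builder(toxic_sentence, lang)
    candidates = [detoxify_with_model(m, prompt)["neutral sentence"] for m in MODELS]
    candidates.append(mask_fn(toxic_sentence, keywords))
    return evaluate_with_qwen(toxic_sentence, candidates, lang)
```

Injecting the prompt builder and masking function keeps the loop identical across all 15 languages; only the few-shot context and keyword list change per language.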
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Result</title>
        <p>We applied our method in systematic comparative experiments on the official datasets. In the
comparative experiment on the prompt framework, in order to control the variables, we first
fixed a single-model baseline, using DeepSeek as the large language model; the prompt
framework combined with the word replacement strategy was uniformly applied. We focus on the performance
differences between the CO-STAR framework and the RISE framework. The experimental data show that the
RISE framework has significant advantages on the core indicators, with an AvgP value of 0.636 and
an AvgNP value of 0.565, compared with 0.623 and 0.553, respectively, for the CO-STAR framework
(see Table 4 for details). Based on this empirical result, we decided to
use the RISE framework as the prompt framework for the large models in the follow-up experiments to
ensure the best detoxification effect. We compared four strategies:
1. Single large model + Prompt + Word replacement: using a single large model combined
with the prompt framework, targeted vocabulary replacement is performed on the parts with poor
results to maintain the basic semantic structure after context processing;
2. Single large model + Prompt + Back-translation: using a single large model combined
with the prompt framework, cross-language conversion and secondary detoxification of the
preliminary results are applied to improve the effect of multilingual detoxification;
3. Single large model + Prompt + Translation detoxification: using a single large model
combined with the prompt framework, we first translate texts in the model's weak languages,
then uniformly use the large model for detoxification, and finally translate the results
back into the original language;
4. Multiple large models + Prompt + Word replacement: we integrate the detoxification results
output by multiple large models and select the optimal detoxified text in combination
with word replacement.</p>
        <p>As Table 5 shows, among the single-model detoxification schemes, Strategy 1, which combines the
prompt framework and the word replacement strategy, exhibits the best detoxification effect. Compared
to the other two single-model methods, it demonstrates significant advantages in six languages: German
(de), Arabic (ar), Ukrainian (uk), Russian (ru), Tatar (tt), and Hinglish (hin), with both its AvgP (0.636)
and AvgNP (0.572) metrics outperforming those of the other single-model methods. We then
compared it with the multi-model ensemble detoxification method (Strategy 4) in a comparative
experiment. The further comparison shows that the multi-model ensemble detoxification method
achieves a breakthrough in detoxification effect. Not only did the detoxification score for French (fr)
jump to 0.801, but the method also surpassed the single-model detoxification performance in all test languages
except Amharic (am). This multilingual text detoxification method achieved the best experimental
results so far, increasing the AvgP to 0.656 and the AvgNP to 0.607.</p>
        <p>As Table 6 shows, our model outperforms most of the baseline methods in terms of AvgP score,
including baseline gpt4, baseline o3mini, baseline gpt4o, baseline delete, baseline backtranslation, and
baseline duplicate. Among the languages evaluated, Ukrainian (uk; 5th), Spanish (es; 3rd), and Hindi
(hi; 2nd) achieved top-5 rankings. Furthermore, our AvgNP score outperforms
all baseline models and achieves 4th place overall in the test-phase evaluation. For this ranking, the
top-performing languages are Japanese (ja; 4th), French (fr; 3rd), Hinglish (hin; 4th), and Hebrew (he; 5th).</p>
        <p>However, this method cannot completely solve the problem of homophones across languages and
cultures. For example, the English word "house" overlaps phonetically with the Chinese word "haosi", which means
"a good end". When this homophone appears in a toxic sentence such as "You'll die a miserable
death", the method cannot find the corresponding Chinese meaning, and the sentence may be understood
as "You won't have a good house".</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Summary</title>
      <p>This paper briefly describes our work on the multilingual text detoxification task at PAN 2025. We
propose using an ensemble of LLMs combined with prompts from the RISE framework to detoxify
text across multiple languages. Initially, we constructed a parallel toxic-neutral text dataset
and a toxic keywords list using the official dataset. Model inputs were created by integrating the
RISE framework with few-shot methods. These inputs were used to drive multiple commercial LLMs
(DeepSeek, Qwen, Kimi) to generate detoxified candidate outputs. Finally, the optimal output was
selected through multi-dimensional evaluation considering toxicity score, semantic integrity, and
language fluency. For the code, please refer to our release on GitHub8.</p>
      <p>As shown in Table 6, the results demonstrate that our proposed method effectively handles
the task of multilingual text detoxification, showing good adaptability and stability across different
languages. However, the method does not adequately address homophones present in various languages
and cultures. Future work will require more data for contextualization and research into frameworks
for understanding homophones in LLMs. Additionally, we plan to enhance the tone restoration of
detoxified text and construct a corresponding knowledge base to guide the result generation of LLMs.</p>
      <p>8: https://github.com/lxh44126/Detoxification/tree/code</p>
    </sec>
    <sec id="sec-7">
      <title>7. Acknowledgments</title>
      <p>This work is supported by the National Natural Science Foundation of China (No.62276064).</p>
    </sec>
    <sec id="sec-8">
      <title>8. Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used DouBao9 and YuanBao for grammar and
spelling checking and for paraphrasing and rewording. After using these tools, the authors reviewed and edited the
content as needed and take full responsibility for the publication's content.</p>
      <p>E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
Multi-author writing style analysis, multilingual text detoxification, oppositional thinking analysis,
and generative AI authorship verification - extended abstract, volume 14613 of Lecture Notes in
Computer Science, Springer, 2024, pp. 3–10.</p>
      <p>[13] D. Dementieva, D. Moskovskiy, N. Babakov, A. A. Ayele, N. Rizwan, F. Schneider, X. Wang,
S. M. Yimam, D. Ustalov, E. Stakovskii, A. Smirnova, A. Elnagar, A. Mukherjee, A. Panchenko, Overview
of the multilingual text detoxification task at PAN 2024, CEUR-WS.org, 2024.
[14] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D.
Korencic, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova,
E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024:
Multi-author writing style analysis, multilingual text detoxification, oppositional thinking analysis,
and generative AI authorship verification - extended abstract, volume 14613 of Lecture Notes in
Computer Science, Springer, 2024, pp. 3–10.
[15] D. Dementieva, N. Babakov, A. Ronen, A. A. Ayele, N. Rizwan, F. Schneider, X. Wang, S. M. Yimam,
D. A. Moskovskiy, E. Stakovskii, E. Kaufman, A. Elnagar, A. Mukherjee, A. Panchenko, Multilingual
and explainable text detoxification with parallel corpora, Association for Computational Linguistics,
Abu Dhabi, UAE, 2025, pp. 7998–8025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , L. Yang,
          <string-name>
            <given-names>H.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Towards comprehensive detection of chinese harmful memes</article-title>
          , volume
          <volume>37</volume>
          ,
          Curran Associates, Inc.,
          <year>2024</year>
          , pp.
          <fpage>13302</fpage>
          -
          <lpage>13320</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Gipp</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Greiner-Petter</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Karlgren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Shelmanov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stamatatos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wiegmann</surname>
          </string-name>
          , E. Zangerle, Overview of PAN 2025:
          <article-title>Voight-Kampf Generative AI Detection, Multilingual Text Detoxification, Multi-Author Writing Style Analysis, and Generative Plagiarism Detection</article-title>
          , in: J.
          <string-name>
            <surname>C. de Albornoz</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Gonzalo</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          <string-name>
            <surname>Plaza</surname>
          </string-name>
          ,
          <string-name>
            <surname>A. G. S. de Herrera</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Mothe</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          <string-name>
            <surname>Piroi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          <string-name>
            <surname>Rosso</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <string-name>
            <surname>Spina</surname>
          </string-name>
          , G. Faggioli, N. Ferro (Eds.),
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Sixteenth International Conference of the CLEF Association (CLEF</source>
          <year>2025</year>
          ), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Protasov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          , I. Alimova,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brune</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Konovalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Muti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liebeskind</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Litvak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Nozza</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. Shah</given-names>
            <surname>Khan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Takeshita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Vanetik</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <article-title>Overview of the multilingual text detoxification task at PAN 2025</article-title>
          , in:
          <string-name>
            <given-names>G.</given-names>
            <surname>Faggioli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Spina</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2025 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings, CEUR-WS.org</source>
          ,
          <year>2025</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Han</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <article-title>A Machine-Generated Text Detection Model Based on Text Multi-Feature Fusion</article-title>
          ,
          <source>CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>2593</fpage>
          -
          <lpage>2602</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>E.</given-names>
            <surname>Řehulka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Šuppa</surname>
          </string-name>
          ,
          <article-title>RAG Meets Detox: Enhancing Text Detoxification Using Open-Source Large Language Models with Retrieval Augmented Generation</article-title>
          ,
          <source>CEUR-WS.org</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>3021</fpage>
          -
          <lpage>3031</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hayashi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Neubig</surname>
          </string-name>
          ,
          <article-title>Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing</article-title>
          ,
          <source>ACM Computing Surveys</source>
          <volume>55</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>35</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schuurmans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bosma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Chi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <article-title>Chain-of-thought prompting elicits reasoning in large language models</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>24824</fpage>
          -
          <lpage>24837</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Prompt-based meta-learning for few-shot text classification</article-title>
          , Association for Computational Linguistics, Abu Dhabi, United Arab Emirates,
          <year>2022</year>
          , pp.
          <fpage>1342</fpage>
          -
          <lpage>1357</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Mathur</surname>
          </string-name>
          ,
          <article-title>Ensemble learning method using stacking with base learner, a comparison</article-title>
          , Springer, Singapore,
          <year>2023</year>
          , pp.
          <fpage>181</fpage>
          -
          <lpage>192</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ronen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. A.</given-names>
            <surname>Moskovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stakovskii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Kaufman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <article-title>Multilingual and explainable text detoxification with parallel corpora</article-title>
          , Association for Computational Linguistics, Abu Dhabi, UAE,
          <year>2025</year>
          , pp.
          <fpage>7998</fpage>
          -
          <lpage>8025</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Moskovskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Babakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. A.</given-names>
            <surname>Ayele</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Rizwan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Schneider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. M.</given-names>
            <surname>Yimam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ustalov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Stakovskii</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <article-title>Overview of the multilingual text detoxification task at PAN 2024</article-title>
          ,
          <source>CEUR-WS.org</source>
          ,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>J.</given-names>
            <surname>Bevendorff</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X. B.</given-names>
            <surname>Casals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chulvi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dementieva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Elnagar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Freitag</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Fröbe</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Korencic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mayerl</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mukherjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Panchenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Rangel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Rosso</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Smirnova</surname>
          </string-name>
          ,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>