SomethingAwful at PAN 2024 TextDetox: Uncensored Llama 3 Helps to Censor Better
Notebook for PAN at CLEF 2024

Sergey Pletenev 1,2,3
1 Higher School of Economics, Moscow, Russia
2 Skolkovo Institute of Science and Technology, Moscow, Russia
3 Artificial Intelligence Research Institute, Moscow, Russia

Abstract
In this paper, we report on our system for the Multilingual Text Detoxification task at PAN 2024, in which participants had to detoxify a multilingual corpus of texts. We propose an approach based on a large language model of the Llama 3 family, combined with a method for jailbreaking the model's generation refusals. In the manual evaluation, our approach outperforms the Human References for multiple languages, and it outperforms the baselines in the automatic detoxification benchmark. Our work contributes to the ongoing effort to assess the vulnerability of LLMs to jailbreaking attacks, underscoring the latent capabilities of large models.

Keywords
PAN 2024, Multilingual Detoxification, NLP, LLM, Refusals, Model Jailbreak

1. Introduction

The proliferation of online platforms has led to an increase in the use of harmful language, including offensive, abusive, and hateful content. Despite significant efforts to develop accurate models for detecting toxic language, this reactive approach has often resulted in the removal of content, potentially limiting freedom of expression and ignoring the informative aspects of user-generated content. Traditional methods of filtering harmful text, such as deleting and censoring specific words, have become ineffective due to the evolving nature of toxic language. Toxic language is constantly changing, with new expressions, slang, and insults emerging on a regular basis, making it challenging for static models to remain effective. Different online platforms attract different user demographics, leading to variations in how toxicity manifests itself on each platform. This diversity in communication norms means that there is no one-size-fits-all approach to addressing toxic language, and efforts must be tailored to each platform's unique characteristics. Simply identifying and removing toxic content is not sufficient to address the root cause and may result in the deletion of valuable information along with the toxic content.

Previous works [1, 2, 3, 4] have explored the concept of text detoxification. Approaching text detoxification as a text-to-text sequence learning task, ParaDetox [5] and RuDetox [6] were introduced as the first detoxification models built on crowd-sourced parallel corpora for the English and Russian languages, respectively. These models outperformed their unsupervised counterparts in the text detoxification task, but they were designed for a single language and were published at a time when sequence-to-sequence models were dominant [7]. The landscape of natural language processing has since evolved with the emergence of large language models such as Mistral [8], ChatGPT (https://chat.openai.com/), the LLaMA series [9, 10], and others, which have shown promising results on various language modeling tasks. These models are capable of generating high-quality text, making them suitable for a range of applications, including detoxification. In this paper, we explore the use of large language models (LLMs) for the task of text detoxification, obtaining high-quality and relevant outputs without expensive fine-tuning.

CLEF 2024: Conference and Labs of the Evaluation Forum, September 09-12, 2024, Grenoble, France
alex010rey@gmail.com (S. Pletenev), ORCID: 0000-0003-2325-4268 (S. Pletenev)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

Our Contributions:
• Development of a multilingual detoxification method based on the Llama 3 LLM.
• Adaptation and testing of a model jailbreaking technique for text generation (see Figure 1).
• Publication of experimental results and source code to facilitate future research in this area (https://github.com/A1exRey/UncensorLlamaIsBetterCensor).

[Figure 1 diagram: the baseline pipeline sends "Rewrite the text into non-toxic language: {toxic_sentence}" to the LLM, which often answers "I cannot write hateful content"; our pipeline prepends {special_prompt} and {few-shot examples} and queries an activation-patched LLM, which outputs {neutral_sentence}.]

Figure 1: The pipelines of the baseline and our approach. In the baseline case, the model often refuses to generate a text. In our case, with the help of a special prompt, few-shot examples, and activation patching, the model no longer generates refusals.

2. Related works

2.1. Style transfer and Detoxification

Style transfer models for detoxification can be broadly classified into three main categories:

• Editing-based approaches: These methods, such as those described in [11, 12], utilize a sequence of simple transformations (e.g., removal, replacement, addition) to modify the input text. While these transformations are typically learned independently and combined in a pipeline, they provide high levels of explainability and interpretability. However, their focus on identifying and replacing specific words with a desired style limits their performance on more complex tasks that require structural changes.

• Sequence-to-Sequence and Language Generation approaches: Drawing inspiration from text generation tasks such as machine translation, summarization, and paraphrasing, this approach [13, 14] translates the source text into a latent representation using an encoder and then uses a decoder to generate the detoxified text sequentially. While achieving promising results in style transfer and detoxification, a significant challenge lies in preserving the original context, particularly for longer texts, due to the limitations of sequence-to-sequence models.

• Hybrid methods: This approach [15, 16] combines elements of both editing-based and sequence-to-sequence methods. It involves creating word alignments and generating sentences end-to-end. This strategy aims to utilize the strengths of both techniques, potentially providing a more comprehensive solution.

Each method has its own strengths and weaknesses. The choice of approach should be based on the specific requirements of the task and the desired balance between explainability, preservation of context, and model complexity.

Table 1: Examples of text detoxification refusals for Llama 3 70B.

| Model Refusals |
|---|
| I cannot generate a [...] Is there anything else I can help you with? |
| I understand you're frustrated, but being aggressive won't help the situation. |
| I apologize for any inconvenience. [...] Please provide more details about the issue. |
| I strongly disagree with [...] |
| I cannot write content that is discriminatory or promotes hate speech. |

2.2. Jailbreaking models

Despite significant efforts to align large language models (LLMs) with human values, recent studies have highlighted their susceptibility to security breaches [17].
These vulnerabilities can lead to the creation of harmful content and the misuse of these powerful tools. One type of attack involves manipulating input instructions to exploit the model's weaknesses, for example by explicitly guiding the model's response or by appending suffixes that bypass its defenses [18, 19].

Mechanistic Interpretability (MI) aims to understand how a model functions by reverse engineering its specific behaviors, giving insight into how the model processes information and makes decisions. These reverse engineering efforts typically focus on specific components of a neural network, such as neurons, representations, or attention heads [20]. The goal is to identify the components that are related to a particular behavior of interest and to understand their role within the network [21]. This understanding can help in designing more robust and safe models. Additionally, understanding the safety mechanisms of a model from a mechanistic perspective can contribute to developing safer models [22]. For instance, it has been found that the key parameters responsible for safety are located in a relatively small part of the network, making them more susceptible to changes or perturbations [22].

3. Experimental setup

3.1. Dataset

For our approach, we use a multilingual parallel dataset for text detoxification, prepared for the Multilingual Text Detoxification (CLEF TextDetox 2024) shared task [23, 24]. The dataset consists of texts in nine languages: English, Spanish, German, Chinese, Arabic, Hindi, Ukrainian, Russian, and Amharic. For each of these 9 languages, 1,000 pairs of parallel texts were collected, split into 400 pairs for the development (dev) set and 600 pairs for the test set. In the competition, detoxified pairs are only available for the dev set, while only the toxic half of each pair is available for the test set. For the Amharic language, we use the full dev set as our training dataset. For the few-shot case, we use only the first 10 pairs from the dev set for each language.

3.2. Activation Patching

Activation Patching [25] is a technique based on Mechanistic Interpretability that locates the critical components responsible for a specific behavior. It involves replacing the activation produced by a particular component for a given input with the activation generated for another input that does not produce the behavior in question. The significance of each component is determined by the impact the intervention has on the final output. To illustrate, suppose we have an original input I_ori, a harmful instruction such as "How can I make a bomb?". We can create an intervened version, I_itv, by replacing the harmful tokens with safe ones, turning it into a harmless instruction like "How can I make a pie?". This process allows us to identify the components responsible for the harmful behavior and modify them to achieve the desired outcome. In our case, we are not looking for answers to such questions. Instead, we need to detect when the model fails to produce a detoxification, as shown in Table 1. For the activation patching, we used all available toxic data from the English development part of the dataset. As counterexamples, we took neutral sentences from the same development set.
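To make the procedure concrete, below is a minimal sketch of activation patching with PyTorch forward hooks on a Hugging Face Llama model. The layer index, the module path model.model.layers, and the choice to patch the full residual stream of a single decoder layer are illustrative assumptions, not the exact configuration used in our experiments.

```python
# Minimal activation-patching sketch (illustrative; the layer index and
# the decision to patch one layer's residual stream are assumptions).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "meta-llama/Meta-Llama-3-8B-Instruct"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 14  # hypothetical layer suspected to mediate refusals

def capture(prompt: str) -> torch.Tensor:
    """Run one forward pass and record the residual stream at LAYER."""
    cache = {}
    def hook(_module, _inputs, output):
        cache["h"] = output[0].detach()  # hidden states: (batch, seq, hidden)
    handle = model.model.layers[LAYER].register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

# Activation from a counterexample (neutral) input that triggers no refusal.
clean_h = capture("How can I make a pie?")

def patch(_module, _inputs, output):
    """Overwrite the prompt-pass activations with the clean ones."""
    h = output[0]
    if h.shape[1] > 1:  # patch only the prefill pass, not decoding steps
        n = min(h.shape[1], clean_h.shape[1])
        h = h.clone()
        h[:, :n, :] = clean_h[:, :n, :]
    return (h,) + output[1:]

handle = model.model.layers[LAYER].register_forward_hook(patch)
inputs = tok("Rewrite the text into non-toxic language: ...", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64)
handle.remove()
print(tok.decode(out[0], skip_special_tokens=True))
```

Comparing the output with and without the patch indicates how strongly the chosen component mediates the refusal behavior.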
3.3. Models

In this paper, we utilized several approaches and baselines provided by the competition organizers:

Duplicate Baseline. This approach does not modify the input text in any way, serving as a lower threshold for detoxification quality and providing a 100% similarity score by design.

Delete Baseline. This method relies on a predefined list of toxic words and phrases for each language (https://huggingface.co/datasets/textdetox/multilingual_toxic_lexicon). Words are removed from the text if they appear in this list.

Backtranslation Baseline. Similar to [4], texts in all languages except English are first translated into English using the neural machine translation model NLLB-3.3B [26] (https://huggingface.co/facebook/nllb-200-3.3B). The English text is then detoxified using the BART-based [27] text detoxification model (https://huggingface.co/s-nlp/bart-base-detox) and subsequently translated back into the original language.

mT5 Baseline. This sequence-to-sequence model [28] (https://huggingface.co/textdetox/mt5-xl-detox-baseline) was trained sequentially for each language using the parallel dev portion of the dataset (https://huggingface.co/datasets/textdetox/multilingual_paradetox) released for the test phase of the shared task.

We use two types of models: a decoder-only Llama 3 model (8B, https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct, and 70B, https://huggingface.co/meta-llama/Meta-Llama-3-70B-Instruct) and the sequence-to-sequence mT0-XL model. For the decoder-only models, we use a prompt similar to the one used in GreenLLaMA [29], shown below as Prompt 1. For the mT0-XL model, we use the source prefix "Detoxify:". We do not fine-tune Llama 3, but we do use the few-shot variants, 5-shot and 10-shot, for both Llama 3 sizes. The mT0-XL model is fine-tuned for 3 epochs with a learning rate of 5e-5, without using any additional data beyond what is provided by the competition organizers.

Prompt 1: "Your task is text style transfer. You rewrite the text into non-toxic language. You must match the target style and preserve the original meaning as much as possible. You cannot hallucinate or add anything outside the original input text. You should not include the input text in the response. You should only generate the target text."
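The sketch below shows how the 10-shot input might be assembled for the instruction-tuned Llama 3 chat template. The exact message layout, with few-shot pairs encoded as alternating user/assistant turns, is our assumption; the paper only fixes the system prompt (Prompt 1) and the fact that the first 10 dev pairs per language serve as examples.

```python
# Sketch: assembling the few-shot detoxification prompt for Llama 3
# (the alternating-turn layout is an assumption, not the paper's spec).
from transformers import AutoTokenizer

SYSTEM_PROMPT = (
    "Your task is text style transfer. You rewrite the text into non-toxic "
    "language. You must match the target style and preserve the original "
    "meaning as much as possible. You cannot hallucinate or add anything "
    "outside the original input text. You should not include the input text "
    "in the response. You should only generate the target text."
)

def build_messages(toxic_sentence, few_shot_pairs):
    """few_shot_pairs: list of (toxic, neutral) pairs from the dev set."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for toxic, neutral in few_shot_pairs:
        messages.append({"role": "user", "content": toxic})
        messages.append({"role": "assistant", "content": neutral})
    messages.append({"role": "user", "content": toxic_sentence})
    return messages

tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-70B-Instruct")
pairs = [("You are an idiot and you can't block anyone.",
          "You can't block anyone.")]  # first N dev pairs for the language
prompt_ids = tok.apply_chat_template(
    build_messages("this piece of human garbage should rot in jail.", pairs),
    add_generation_prompt=True, return_tensors="pt")
```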
4. Results and Discussion

The biggest limitation of Llama 3 for us was its tendency to refuse generation. As mentioned above, there are several methods for model jailbreaking, and we used three of them simultaneously: a special prompt, few-shot examples, and activation patching. The results of this approach can be seen in Table 2, and the pipeline in Figure 1.

We did not engineer a prompt ourselves, but rather used the one described in the GreenLLaMA [29] detoxification framework, which its authors claim performs best. Despite this, the model still exhibited significant refusal to generate, particularly in the 0-shot scenario, where 24% of queries yielded no usable output. This behaviour is likely due to Llama 3's internal safety mechanisms, which may detect potentially harmful content within the input even when the task is detoxification itself. The 10-shot variant performs better, but even then it refused to generate for 5 examples.

Table 2: Different approaches to removing generation refusals for Llama 3 70B.

| Language | N-shot | Activation Patching | Refusals | Refusals as % of answers |
|---|---|---|---|---|
| English | 0-shot | non-patched | 94 | 24% |
| English | 10-shot | non-patched | 5 | 1% |
| English | 0-shot | patched | 47 | 12% |
| English | 10-shot | patched | 0 | 0% |
| Russian | 0-shot | non-patched | 59 | 15% |
| Russian | 10-shot | non-patched | 0 | 0% |
| Russian | 0-shot | patched | 31 | 8% |
| Russian | 10-shot | patched | 0 | 0% |

Table 3: Results of the automatic evaluation of several of our proposed methods versus simple and strong baselines by the shared task organizers. The reported metric is Joint for each of the languages. (A.P.) stands for the activation-patched model; (Final) stands for the results of the final submission. The best Joint score for each language is in bold. Generation examples can be found in Table 5.

| Method | AVG | EN | ES | DE | ZH | AR | HI | UK | RU | AM |
|---|---|---|---|---|---|---|---|---|---|---|
| Llama3 70b 10-shot (Final) | **0.431** | **0.522** | 0.475 | 0.551 | 0.147 | 0.514 | 0.269 | 0.584 | **0.516** | **0.299** |
| Llama3 70b 10-shot (A.P.) | 0.417 | **0.522** | 0.475 | 0.551 | 0.147 | 0.514 | 0.269 | 0.584 | **0.516** | 0.180 |
| Llama3 8b 10-shot | 0.414 | 0.499 | **0.476** | 0.561 | 0.137 | 0.514 | **0.276** | **0.588** | 0.500 | 0.180 |
| Llama3 70b 5-shot | 0.409 | 0.477 | 0.475 | 0.561 | 0.137 | 0.514 | **0.276** | **0.588** | 0.476 | 0.180 |
| mt0-XL | 0.416 | 0.519 | 0.458 | **0.569** | 0.111 | **0.536** | 0.222 | 0.587 | 0.489 | **0.299** |
| Backtranslation baseline | 0.205 | 0.506 | 0.275 | 0.233 | 0.027 | 0.206 | 0.104 | 0.201 | 0.223 | 0.075 |
| Delete baseline | 0.302 | 0.447 | 0.319 | 0.362 | **0.175** | 0.456 | 0.105 | 0.328 | 0.255 | 0.270 |
| mt5 baseline | 0.315 | 0.418 | 0.359 | 0.384 | 0.096 | 0.389 | 0.170 | 0.433 | 0.432 | 0.157 |
| Duplicate baseline | 0.126 | 0.061 | 0.090 | 0.287 | 0.069 | 0.294 | 0.035 | 0.032 | 0.048 | 0.217 |

Table 4: Results of the manual evaluation of our proposed method versus simple and strong baselines by the shared task organizers. The reported metric is Joint for each of the languages. The best Joint score for each language is in bold.

| Method | AVG | EN | ES | DE | ZH | AR | HI | UK | RU | AM |
|---|---|---|---|---|---|---|---|---|---|---|
| Human References | **0.85** | **0.88** | 0.79 | 0.71 | **0.93** | **0.82** | **0.97** | **0.90** | 0.80 | **0.85** |
| Llama3 70b 10-shot (Final) | 0.77 | 0.86 | **0.83** | **0.89** | 0.53 | 0.74 | 0.86 | 0.69 | **0.84** | 0.71 |
| Delete baseline | 0.56 | 0.47 | 0.55 | 0.57 | 0.43 | 0.65 | 0.65 | 0.60 | 0.49 | 0.63 |
| mT5 baseline | 0.54 | 0.68 | 0.47 | 0.64 | 0.43 | 0.63 | 0.60 | 0.42 | 0.40 | 0.61 |
| Backtranslation baseline | 0.41 | 0.73 | 0.56 | 0.34 | 0.34 | 0.42 | 0.33 | 0.23 | 0.22 | 0.54 |

The integration of activation patching proved beneficial, reducing the number of refusals by 50% in the 0-shot setting and eliminating them completely in the 10-shot scenario. Additionally, we tested the performance of activation patching for the Russian language. The model was not restricted in any way, and only the patching data obtained for English was utilized. As expected, the results were similar to those for English, except that in the 10-shot case both the patched and non-patched models stopped generating refusals. In the 0-shot case, however, there was a large difference in favor of activation patching. These results suggest that although a specialized prompt and few-shot learning can enhance the performance of Llama 3, activation patching plays a crucial role in mitigating generation refusals. Further research is needed to understand the specific triggers of refusals and to develop more robust solutions to overcome this limitation.
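Computing the refusal counts in Table 2 requires classifying each model output as a refusal or not. The paper does not describe this step, so the prefix-matching heuristic below, built from the refusal patterns shown in Table 1, is purely an illustrative assumption.

```python
# Sketch: counting refusals by prefix matching (heuristic; the markers
# are assumptions derived from the example refusals in Table 1).
REFUSAL_MARKERS = (
    "i cannot generate",
    "i cannot write",
    "i apologize",
    "i understand you're frustrated",
    "i strongly disagree",
)

def is_refusal(output: str) -> bool:
    text = output.strip().lower()
    # An empty output also counts as a failure to detoxify.
    return text == "" or any(text.startswith(m) for m in REFUSAL_MARKERS)

def refusal_rate(outputs):
    """Return (number of refusals, refusals as a fraction of answers)."""
    n_refused = sum(is_refusal(o) for o in outputs)
    return n_refused, n_refused / len(outputs)

# e.g. refusal_rate(outputs) would yield (94, 0.24) for the English
# 0-shot non-patched run in Table 2.
```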
The second issue was that Llama 3 did not perform as well as expected for Amharic. Llama 3 was not designed as a multilingual model, and no information is available on the distribution of languages in its training data. Amharic is a relatively rare language and may have been underrepresented in the pretraining set: although the Llama 3 tokenizer includes Amharic characters, the model's weak performance suggests a small training corpus for this language. Therefore, for Amharic only, we decided to use a separate model, mT0-XL [30], which, according to its authors, is multilingual and supports Amharic. We trained this model using all available languages, and the results can be seen in Table 3. Based on the automatic metrics, mT0-XL performed well in detoxifying all 9 languages included in the competition. Despite the overall proficiency of mT0-XL, the activation-patched 10-shot variant of Llama 3 70B demonstrated superior performance in our evaluation. We therefore adopted a hybrid approach, combining the activation-patched 10-shot Llama 3 70B model with the Amharic-specific component of mT0-XL, leveraging the strengths of both models. This hybrid method presents a practical solution to language-specific challenges in large language models: by integrating specialized models for underrepresented languages with a robust base model, we enhance performance and expand detoxification capabilities across a broader range of languages.
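Conceptually, the final hybrid submission reduces to a per-language dispatch, sketched below. The helper functions mt0_detoxify and llama3_detoxify are hypothetical placeholders for the two pipelines described above; the paper states only that the two systems' outputs were combined per language.

```python
def mt0_detoxify(sentence: str) -> str:
    """Placeholder for the fine-tuned mT0-XL pipeline ('Detoxify:' prefix)."""
    raise NotImplementedError

def llama3_detoxify(sentence: str, n_shot: int = 10) -> str:
    """Placeholder for the activation-patched few-shot Llama 3 70B pipeline."""
    raise NotImplementedError

def detoxify(sentence: str, lang: str) -> str:
    # Amharic goes to mT0-XL; all other languages go to Llama 3 70B.
    if lang == "am":
        return mt0_detoxify(sentence)
    return llama3_detoxify(sentence, n_shot=10)
```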
5. Conclusion

In this work, we present a text detoxification approach based on few-shot generation using an activation-patched Llama 3 70B. According to the final round of (manual) evaluation, our solution is the best among more than 25 competitors. Moreover, our solution is better than, or at least comparable to, the Human References, which were designed to serve as ground-truth references for the annotators and the automatic evaluation systems. In addition, we explore different model jailbreaking techniques to enhance the final generation and detoxification quality.

References

[1] C. Nogueira dos Santos, I. Melnyk, I. Padhi, Fighting offensive language on social media with unsupervised text style transfer, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 189–194. URL: https://aclanthology.org/P18-2031. doi:10.18653/v1/P18-2031.
[2] D. Dale, I. Markov, V. Logacheva, O. Kozlova, N. Semenov, A. Panchenko, SkoltechNLP at SemEval-2021 task 5: Leveraging sentence-level pre-training for toxic span detection, in: A. Palmer, N. Schneider, N. Schluter, G. Emerson, A. Herbelot, X. Zhu (Eds.), Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), Association for Computational Linguistics, Online, 2021, pp. 927–934. URL: https://aclanthology.org/2021.semeval-1.126. doi:10.18653/v1/2021.semeval-1.126.
[3] D. Moskovskiy, D. Dementieva, A. Panchenko, Exploring cross-lingual text detoxification with large multilingual language models, in: S. Louvan, A. Madotto, B. Madureira (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 346–354. URL: https://aclanthology.org/2022.acl-srw.26. doi:10.18653/v1/2022.acl-srw.26.
[4] D. Dementieva, D. Moskovskiy, D. Dale, A. Panchenko, Exploring methods for cross-lingual text style transfer: The case of text detoxification, in: J. C. Park, Y. Arase, B. Hu, W. Lu, D. Wijaya, A. Purwarianti, A. A. Krisnadhi (Eds.), Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Nusa Dua, Bali, 2023, pp. 1083–1101. URL: https://aclanthology.org/2023.ijcnlp-main.70. doi:10.18653/v1/2023.ijcnlp-main.70.
[5] V. Logacheva, D. Dementieva, S. Ustyantsev, D. Moskovskiy, D. Dale, I. Krotova, N. Semenov, A. Panchenko, ParaDetox: Detoxification with parallel data, in: S. Muresan, P. Nakov, A. Villavicencio (Eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Dublin, Ireland, 2022, pp. 6804–6818. URL: https://aclanthology.org/2022.acl-long.469. doi:10.18653/v1/2022.acl-long.469.
[6] D. Dementieva, V. Logacheva, I. Nikishina, A. Fenogenova, D. Dale, I. Krotova, N. Semenov, T. Shavrina, A. Panchenko, RUSSE-2022: Findings of the first Russian detoxification shared task based on parallel corpora, Computational Linguistics and Intellectual Technologies (2022). URL: https://api.semanticscholar.org/CorpusID:253169495.
[7] S. Pletenev, Between denoising and translation: Experiments in text detoxification, Computational Linguistics and Intellectual Technologies (2022). URL: https://api.semanticscholar.org/CorpusID:253197815.
[8] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, L. R. Lavaud, M.-A. Lachaux, P. Stock, T. L. Scao, T. Lavril, T. Wang, T. Lacroix, W. E. Sayed, Mistral 7B, 2023. arXiv:2310.06825.
[9] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, CoRR abs/2302.13971 (2023). URL: https://doi.org/10.48550/arXiv.2302.13971. doi:10.48550/ARXIV.2302.13971. arXiv:2302.13971.
[10] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, G. Lample, LLaMA: Open and efficient foundation language models, 2023. arXiv:2302.13971.
[11] J. Li, R. Jia, H. He, P. Liang, Delete, retrieve, generate: a simple approach to sentiment and style transfer, in: M. A. Walker, H. Ji, A. Stent (Eds.), Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), Association for Computational Linguistics, 2018, pp. 1865–1874. URL: https://doi.org/10.18653/v1/n18-1169. doi:10.18653/V1/N18-1169.
[12] T. Shen, T. Lei, R. Barzilay, T. S. Jaakkola, Style transfer from non-parallel text by cross-alignment, in: I. Guyon, U. von Luxburg, S. Bengio, H. M. Wallach, R. Fergus, S. V. N. Vishwanathan, R. Garnett (Eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, 2017, pp. 6830–6841. URL: https://proceedings.neurips.cc/paper/2017/hash/2d2c8394e31101a261abf1784302bf75-Abstract.html.
[13] V. John, L. Mou, H. Bahuleyan, O. Vechtomova, Disentangled representation learning for non-parallel text style transfer, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 424–434. URL: https://aclanthology.org/P19-1041. doi:10.18653/v1/P19-1041.
[14] D. Dale, A. Voronov, D. Dementieva, V. Logacheva, O. Kozlova, N. Semenov, A. Panchenko, Text detoxification using large pre-trained neural models, CoRR abs/2109.08914 (2021). URL: https://arxiv.org/abs/2109.08914. arXiv:2109.08914.
[15] F. Huang, Z. Chen, C. H. Wu, Q. Guo, X. Zhu, M. Huang, NAST: A non-autoregressive generator with word alignment for unsupervised text style transfer, in: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics: Findings, 2021.
[16] F. Luo, P. Li, J. Zhou, P. Yang, B. Chang, Z. Sui, X. Sun, A dual reinforcement learning framework for unsupervised text style transfer, in: Proceedings of the 28th International Joint Conference on Artificial Intelligence, IJCAI 2019, 2019.
[17] X. Qi, Y. Zeng, T. Xie, P. Chen, R. Jia, P. Mittal, P. Henderson, Fine-tuning aligned language models compromises safety, even when users do not intend to!, CoRR abs/2310.03693 (2023). URL: https://doi.org/10.48550/arXiv.2310.03693. doi:10.48550/ARXIV.2310.03693. arXiv:2310.03693.
[18] Z. Liao, H. Sun, AmpleGCG: Learning a universal and transferable generative model of adversarial suffixes for jailbreaking both open and closed LLMs, 2024. arXiv:2404.07921.
[19] A. Zou, Z. Wang, J. Z. Kolter, M. Fredrikson, Universal and transferable adversarial attacks on aligned language models, CoRR abs/2307.15043 (2023). URL: https://doi.org/10.48550/arXiv.2307.15043. doi:10.48550/ARXIV.2307.15043. arXiv:2307.15043.
[20] W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, D. Bertsimas, Finding neurons in a haystack: Case studies with sparse probing, 2023. arXiv:2305.01610.
[21] H. Sajjad, N. Durrani, F. Dalvi, Neuron-level interpretation of deep NLP models: A survey, 2022. arXiv:2108.13138.
[22] B. Wei, K. Huang, Y. Huang, T. Xie, X. Qi, M. Xia, P. Mittal, M. Wang, P. Henderson, Assessing the brittleness of safety alignment via pruning and low-rank modifications, in: ICLR 2024 Workshop on Reliable and Responsible Foundation Models, 2024. URL: https://openreview.net/forum?id=XMLQ2e0Axb.
[23] M. Fröbe, M. Wiegmann, N. Kolyada, B. Grahm, T. Elstner, F. Loebe, M. Hagen, B. Stein, M. Potthast, Continuous Integration for Reproducible Shared Tasks with TIRA.io, in: J. Kamps, L. Goeuriot, F. Crestani, M. Maistro, H. Joho, B. Davis, C. Gurrin, U. Kruschwitz, A. Caputo (Eds.), Advances in Information Retrieval. 45th European Conference on IR Research (ECIR 2023), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2023, pp. 236–241. URL: https://link.springer.com/chapter/10.1007/978-3-031-28241-6_20. doi:10.1007/978-3-031-28241-6_20.
[24] J. Bevendorff, X. B. Casals, B. Chulvi, D. Dementieva, A. Elnagar, D. Freitag, M. Fröbe, D. Korenčić, M. Mayerl, A. Mukherjee, A. Panchenko, M. Potthast, F. Rangel, P. Rosso, A. Smirnova, E. Stamatatos, B. Stein, M. Taulé, D. Ustalov, M. Wiegmann, E. Zangerle, Overview of PAN 2024: Multi-Author Writing Style Analysis, Multilingual Text Detoxification, Oppositional Thinking Analysis, and Generative AI Authorship Verification, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the Fourteenth International Conference of the CLEF Association (CLEF 2024), Lecture Notes in Computer Science, Springer, Berlin Heidelberg New York, 2024.
[25] A. Arditi, O. Obeso, A. Syed, D. Paleka, N. Rimsky, W. Gurnee, N. Nanda, Refusal in language models is mediated by a single direction, 2024. URL: https://arxiv.org/abs/2406.11717. arXiv:2406.11717.
[26] M. R. Costa-jussà, J. Cross, O. Çelebi, M. Elbayad, K. Heafield, K. Heffernan, E. Kalbassi, J. Lam, D. Licht, J. Maillard, A. Y. Sun, S. Wang, G. Wenzek, A. Youngblood, B. Akula, L. Barrault, G. M. Gonzalez, P. Hansanti, J. Hoffman, S. Jarrett, K. R. Sadagopan, D. Rowe, S. Spruit, C. Tran, P. Andrews, N. F. Ayan, S. Bhosale, S. Edunov, A. Fan, C. Gao, V. Goswami, F. Guzmán, P. Koehn, A. Mourachko, C. Ropers, S. Saleem, H. Schwenk, J. Wang, No language left behind: Scaling human-centered machine translation, CoRR abs/2207.04672 (2022). URL: https://doi.org/10.48550/arXiv.2207.04672. doi:10.48550/ARXIV.2207.04672. arXiv:2207.04672.
[27] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer, BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension, in: D. Jurafsky, J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7871–7880. URL: https://aclanthology.org/2020.acl-main.703. doi:10.18653/v1/2020.acl-main.703.
[28] L. Xue, N. Constant, A. Roberts, M. Kale, R. Al-Rfou, A. Siddhant, A. Barua, C. Raffel, mT5: A massively multilingual pre-trained text-to-text transformer, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tür, I. Beltagy, S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2021, Online, June 6-11, 2021, Association for Computational Linguistics, 2021, pp. 483–498. URL: https://doi.org/10.18653/v1/2021.naacl-main.41. doi:10.18653/V1/2021.NAACL-MAIN.41.
[29] M. T. I. Khondaker, M. Abdul-Mageed, L. V. S. Lakshmanan, GreenLLaMA: A framework for detoxification with explanations, CoRR abs/2402.15951 (2024). URL: https://doi.org/10.48550/arXiv.2402.15951. doi:10.48550/ARXIV.2402.15951. arXiv:2402.15951.
[30] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. L. Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. L. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, Association for Computational Linguistics, 2023, pp. 15991–16111. URL: https://doi.org/10.18653/v1/2023.acl-long.891. doi:10.18653/V1/2023.ACL-LONG.891.

A. Appendix

Table 5: Examples from the original dev dataset, with additional examples generated by the baseline (mt5 baseline) and our method.
| Method | Text |
|---|---|
| Original toxic sentence | they should have fled , i feel that these guys are fucked , who turns themselves in to an angry mob ? . |
| Original neutral sentence | They should have escaped, i feel that these are are done, who turns themselves in to an angry mob |
| mt5 baseline | they should have fled, i feel that these guys are bad, who turns themselves in |
| Llama3 70b 10-shot (Final) | They should have fled; I feel that these individuals are in a precarious situation, who would voluntarily surrender to an angry crowd? |
| Original toxic sentence | Delete my 200 articles Whhore |
| Original neutral sentence | Delete my 200 articles |
| mt5 baseline | Delete my 200 articles |
| Llama3 70b 10-shot (Final) | Delete my 200 articles, please. |
| Original toxic sentence | this piece of human garbage should rot in jail . |
| Original neutral sentence | this criminal should be in jail |
| mt5 baseline | This piece should rot in jail. |
| Llama3 70b 10-shot (Final) | This individual should face the consequences of their actions. |
| Original toxic sentence | See all the many Google links, titled "Wikipedia sucks." |
| Original neutral sentence | See all the many Google links, titled "Wikipedia is bad." |
| mt5 baseline | See all the many Google links, titled "Wikipedia sucks." |
| Llama3 70b 10-shot (Final) | See all the many Google links, titled "Wikipedia has issues." |
| Original toxic sentence | i'm so fucking sick of reddit assuming people with a dissenting opinion are being paid by someone else . |
| Original neutral sentence | i'm so sick of reddit assuming people with a dissenting opinion are being paid by someone else |
| mt5 baseline | I'm so sick of reddit assuming people with a dissenting opinion are being paid by someone |
| Llama3 70b 10-shot (Final) | I'm so tired of Reddit assuming people with dissenting opinions are being paid by someone else. |
| Original toxic sentence | p. s. i do own all of those pictures bitch . |
| Original neutral sentence | i own all those pictures. |
| mt5 baseline | I do own all of those pictures. |
| Llama3 70b 10-shot (Final) | P.S. I do own all of those pictures. |
| Original toxic sentence | You are an idiot and you can't block anyone. |
| Original neutral sentence | You can't block anyone. |
| mt5 baseline | You can't block anyone. |
| Llama3 70b 10-shot (Final) | You are misinformed, and you can't block anyone. |
| Original toxic sentence | throwing his ass in jail is the best thing possible , since it protects others from his willful drunk driving . |
| Original neutral sentence | Being sent to jail is the best thing possible, since it protects others from willful drunk driving. |
| mt5 baseline | throwing him in jail is the best thing possible, since it protects others from his |
| Llama3 70b 10-shot (Final) | Throwing him in jail is the best possible outcome, as it protects others from his reckless behavior. |
| Original toxic sentence | the object is to waste peoples time deconstructing your foolish effusions . |
| Original neutral sentence | the object is to waste peoples time deconstructing your effusions . |
| mt5 baseline | The object is to waste peoples time deconstructing your ideas. |
| Llama3 70b 10-shot (Final) | The object is to waste people's time deconstructing your statements. |