<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mitigating Toxicity in Dialogue Agents through Adversarial Reinforcement Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Guillermo</forename><surname>Villate-Castillo</surname></persName>
							<email>guillermo.villate@tecnalia.com</email>
							<affiliation key="aff0">
								<orgName type="department">TECNALIA</orgName>
								<orgName type="institution">Basque Research and Technology Alliance (BRTA)</orgName>
								<address>
									<postCode>48160</postCode>
									<settlement>Derio, Bizkaia</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Borja</forename><surname>Sanz</surname></persName>
							<email>borja.sanz@deusto.es</email>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Engineering</orgName>
								<orgName type="institution">University of Deusto</orgName>
								<address>
									<addrLine>Avenida de las Universidades 24</addrLine>
									<postCode>48007</postCode>
									<settlement>Bilbao</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Javier</forename><forename type="middle">Del</forename><surname>Ser</surname></persName>
							<email>javier.delser@tecnalia.com</email>
							<affiliation key="aff0">
								<orgName type="department">TECNALIA</orgName>
								<orgName type="institution">Basque Research and Technology Alliance (BRTA)</orgName>
								<address>
									<postCode>48160</postCode>
									<settlement>Derio, Bizkaia</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of the Basque Country (UPV/EHU)</orgName>
								<address>
									<postCode>48013</postCode>
									<settlement>Bilbao</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Mitigating Toxicity in Dialogue Agents through Adversarial Reinforcement Learning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4E96A8682BEDF9DA4FAE481762C53FC8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Toxicity</term>
					<term>Alignment</term>
					<term>Large Language Models</term>
					<term>Reinforcement Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have revolutionized dialogue agents, but they still suffer from biases, inconsistencies, and factual inaccuracies. This paper focuses on addressing toxicity, a critical aspect of the "Diversity, non-discrimination, and fairness" pillar of Trustworthy AI, in dialogue agents. We propose a methodology inspired by InstructGPT and ChatGPT to mitigate toxicity in chatbots by incorporating toxicity detection tools from industry leaders, such as Microsoft and Google Jigsaw, into a reward model. The reward model was extended with ToxDialogDefender, a context-aware toxic language identification model developed in this work. To evaluate our approach, we curate a dataset of 1.5 million comments, with 14.13% serving as successful adversarial examples, to induce toxicity in the BlenderBot 1 90M model. While our primary focus is on BlenderBot 1, our approach is applicable to models with similar Seq2Seq architectures. Experimental results demonstrate a substantial reduction in toxicity levels from 24% to 5%, as validated by a subset analysis. This research highlights the potential for integrating toxicity mitigation techniques into the training paradigm of dialogue agents, paving the way for more aligned and unbiased conversational AI systems.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Dialogue agents driven by open-domain chatbots <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> play a pivotal role in applications like restaurant reservations <ref type="bibr" target="#b2">[3]</ref>, healthcare <ref type="bibr" target="#b3">[4]</ref> and online shopping <ref type="bibr" target="#b4">[5]</ref>. More recent examples of general-purpose dialogue agents include ChatGPT <ref type="bibr" target="#b5">[6]</ref> and Llama 2 <ref type="bibr" target="#b6">[7]</ref>, which have been trained to follow societal norms. These models undergo training with extensive datasets from platforms like Reddit 1 , Twitter (currently X 2 ), and 4chan 3 , with examples including BlenderBot 1 <ref type="bibr" target="#b0">[1]</ref>, TwitterBot Tay <ref type="bibr" target="#b7">[8]</ref>, and Luda <ref type="bibr" target="#b8">[9]</ref>. However, these data sources are known for producing toxic content <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12]</ref>, leading to undesirable behaviors observed in the output of these models. Toxicity mitigation is a key task at a time when the research community is fervently engaged in AI alignment and in ensuring that AI adopts human principles such as respect, fairness, and non-discrimination <ref type="bibr" target="#b12">[13]</ref>.</p><p>This research focuses on mitigating toxic speech in dialogue agents, which has been defined repeatedly as rude, disrespectful, or unreasonable comments likely to disrupt conversations 4 , often related to gender, politics, race, or culture <ref type="bibr" target="#b13">[14]</ref>. 
Previous efforts aimed at reducing toxicity in dialogue agents include continuous curation of datasets <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref>, toxic behavior detection during text generation <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>, and safety layers <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19]</ref>. While effective, these approaches have limitations (𝐿):</p><p>• 𝐿 1 : The continuous curation of datasets is expensive, requiring human annotators at every stage. In addition, removing toxic comments may lead to the model generating unrealistic responses due to the consequent shortage of training data.</p><p>• 𝐿 2 : Current toxic content detectors do not take into account the conversation history at training time, thus lacking contextualization.</p><p>• 𝐿 3 : Safety layers mitigate toxicity during inference, but do not entirely eliminate it from the model's internal knowledge base. Moreover, toxicity detectors, known for their biases, can introduce unwanted biases <ref type="bibr" target="#b19">[20]</ref>. Additionally, since next token probabilities are conditioned on the toxic detector, this may lead to incoherent responses <ref type="bibr" target="#b20">[21]</ref>.</p><p>The aforementioned limitations of current methodologies lie in three distinctive areas: adversarial training data gathering (𝐿 1 ); contextualization of comments to mitigate false positives and negatives in context-sensitive comments (𝐿 2 ); and intrinsic removal of toxicity from model weights (𝐿 3 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Motivation and Research Questions</head><p>To address chatbot toxicity limitations, we explore the following research questions (𝑅𝑄) in corresponding order of the limitations exposed above:</p><p>• 𝑅𝑄 1 : Are there any existing queries that drive chatbots to respond in a toxic manner?</p><p>• 𝑅𝑄 2 : How well do toxicity detectors perform in dialog contexts?</p><p>• 𝑅𝑄 3 : Can we eliminate toxic traits within the model without adding complexity to its architecture?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Main contributions</head><p>In this work we propose a novel methodology for mitigating toxicity in dialog agent-based models. This methodology addresses the three forms in which toxicity can manifest in a discussion: implicit toxicity, explicit toxicity, and toxicity detected within the dialog context. To the best of our knowledge, this work is the first to utilize such a unique approach in dialog settings based on our literature review. Additionally, our proposed dialog context-based toxicity detector is designed to assist in situations where isolated comments are insufficient for assessing the toxicity level, particularly in cases where the model responds affirmatively to toxic questions or statements. Furthermore, we expand the pool of adversarial examples introduced in <ref type="bibr" target="#b21">[22]</ref> for BlenderBot 1 by analyzing an additional 1.5 million examples. Finally, by leveraging Reinforcement Learning (RL), we are able to mitigate toxicity within the inner model weights.</p><p>Paper structure The article is organized as follows: Section 2 introduces fundamental concepts that are essential for understanding the terminology used in this research. Furthermore, it provides an overview of prior research of relevance to understand the contribution to the state of the art. Section 3 details the proposed methodology, whereas Section 4 describes the experimental setup and evaluation protocol used to assess its performance. Section 5 summarizes the main experimental outcomes, and Section 6 discusses key findings from our investigation and outlines future research directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Foundations and Background</head><p>Before detailing the proposed methodology, this section starts with some fundamental concepts concerning toxicity (Section 2.1), followed by a discussion of existing methods for detecting toxicity (Section 2.2) and the accompanying challenges. Subsequently, historical perspectives on toxicity within LLMs are examined in Section 2.3, together with an analysis of RL and its utilization in training chatbots (Section 2.4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Toxicity Definition</head><p>Toxicity is a multifaceted term that continually evolves, shaped by the cultural contexts within which it develops. Although Perspective API (the leading toxicity detector, developed by Google Jigsaw) defines toxicity as stated in Section 1, this definition remains far from exhaustive. The work by Sheth et al. <ref type="bibr" target="#b22">[23]</ref> categorizes toxicity into groups including threats, obscenities, insults, identity-based hate, harassment, misinformation, radicalization, and gender-based violence. Assessing toxicity in dialogue contexts requires systems capable of identifying its diverse forms: explicit, implicit, and contextualized:</p><p>• Explicit toxicity: Conspicuously harmful content, including hate speech, profanity, threats, or direct insults, which requires no additional interpretation for recognizing its negative nature.</p><p>• Implicit toxicity: Content that lacks overtly harmful elements but may carry negative connotations, biases, or concealed meanings. It is characterized by the absence of explicit toxic language, like insults and slurs. The detection of this type of toxicity demands deeper analysis or cultural familiarity for recognition. This category may encompass subtle forms of discrimination, microaggressions, or insinuations.</p><p>• Toxicity within a context: The concept of toxicity in context refers to evaluating whether content is toxic or harmful based on the specific situation or circumstances in which it is presented. This assessment involves considering both the intent behind the content and the intended audience. It acknowledges that the same words or actions may have varying impacts depending on the context in which they are produced. This notion is crucial for dialogue agents, where the conversation history is needed for the analysis of toxic content.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Toxicity Detectors</head><p>The development of toxicity detection systems often relies on human annotations and machine learning techniques. In research, Google's Perspective API <ref type="bibr" target="#b23">[24]</ref> stands out for its ability to recognize characteristics beyond toxicity, including identity-based hate, profanity, and threats, among others <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b21">22]</ref>. Other examples include HateSonar <ref type="bibr" target="#b24">[25]</ref> and ToxiGen HateBERT <ref type="bibr" target="#b25">[26]</ref>, the latter being specialized in detecting implicit toxicity. A significant challenge in toxicity detectors is the existence of biases and their limited applicability in diverse contexts. Research on detectors, particularly that spearheaded by Perspective API, has revealed substantial biases, including gender bias <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref> and biases against minority groups <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30]</ref>. Biases often emerge from tainted datasets during annotation, exacerbated by a lack of heterogeneous participants <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b28">29]</ref> and the nature of the content inside the dataset at hand <ref type="bibr" target="#b31">[32]</ref>. Importantly, such biases tend to amplify when deployed in real-world applications, from data preprocessing to web content moderation <ref type="bibr" target="#b32">[33]</ref>.</p><p>Among predictive models that operate within a contextual framework, one line of work stands out for analyzing toxicity within a specific context by incorporating stance detection. 
This is particularly relevant for the demanding task of detecting implicit contextual toxicity in questions related to the model's stance <ref type="bibr" target="#b33">[34]</ref>. Much of the existing work in detecting toxicity within a given context revolves around assessing the necessity and appropriateness of such an analysis, termed context sensitivity estimation <ref type="bibr" target="#b34">[35]</ref>. However, even when annotations change with the observed context, the changes are not substantial enough to significantly affect the analyzed data <ref type="bibr" target="#b35">[36]</ref>. This idea is supported by another study, which suggests that depending on the type of data, context can be beneficial, but could potentially lead to an increase in false positives and false negatives overall <ref type="bibr" target="#b36">[37]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Toxicity in Language Models</head><p>Despite the versatility of LLMs, a primary concern remains the proliferation of toxicity, including the dissemination of harmful information <ref type="bibr" target="#b37">[38]</ref>, propagation of misinformation <ref type="bibr" target="#b38">[39]</ref>, and the generation of toxic comments <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. Dialogue agents based on generative open-domain chatbots, as mentioned in the introduction, prominently exhibit toxicity issues.</p><p>Previous research addresses toxicity in dialogue agents through various approaches, from i) creating iterative environments for stress-testing and improving chatbot responses through Supervised Learning (SL) <ref type="bibr" target="#b14">[15]</ref>, to ii) the incorporation of classifiers for identifying and filtering toxic content in chatbot-generated responses <ref type="bibr" target="#b0">[1]</ref>; and iii) the introduction of safety layers to prevent inappropriate queries <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19]</ref>. Toxic comments are also addressed through specially crafted datasets designed to elicit positive responses to toxic comments from both models and users <ref type="bibr" target="#b39">[40]</ref>. Other methods include attribute conditioning (ATCON) <ref type="bibr" target="#b16">[17]</ref>, a data-based method that further pretrains an LLM by prepending a toxicity attribute token (toxic or not toxic). By using the prepended token, the model learns the characteristics of toxic and non-toxic sentences, which allows toxicity to be reduced during decoding by conditioning on these tokens.</p><p>In decoding-based strategies, recent efforts have been focused on addressing toxicity during the next token prediction phase. 
We can divide the research activity into two general groups depending on the methodology under consideration: i) at generation time and ii) at training time. At generation time, Plug-and-Play LM <ref type="bibr" target="#b40">[41]</ref> is a decoding-based strategy that utilizes a simple discriminator to direct the generation process. Additionally, DExperts <ref type="bibr" target="#b41">[42]</ref> utilizes expert models (trained on non-toxic data) and anti-expert models (trained on toxic data) to guide the base LLM's generation process. This guidance aims to make the produced content closer to that generated by the expert LLM and further from that produced by the anti-expert, thereby minimizing the likelihood of producing toxic sentences as determined by the anti-expert. Other decoding strategies leverage the in-context learning and multitask learning capabilities of LLMs to steer the model away from generating toxic comments. One representative example of such methodologies is Detox-Chain <ref type="bibr" target="#b42">[43]</ref>, which uses a toxicity span detector to locate the toxic part of the comment. Once located, Detox-Chain masks the toxic part and generates a new comment using the mask-filling capabilities. This can be extended to use a foundational model instead of the same model. Another case is CRITIC <ref type="bibr" target="#b43">[44]</ref>, which resorts to external tools (e.g., Perspective API) to assess the toxicity of the comment, and then uses the in-context learning capabilities of the model to correct and generate a new comment.</p><p>At training time, methods primarily hinge on RL <ref type="bibr" target="#b44">[45]</ref> and quantization with controllable tokens <ref type="bibr" target="#b45">[46]</ref> to expose the model to fewer toxic comments and improve the quality of the generated content. 
In this line, the so-called SELF-CORRECT method <ref type="bibr" target="#b46">[47]</ref> uses a generator and a corrector to improve the generation by training the corrector to generate less toxic comments given a hypothesis and an input to be corrected.</p><p>An emerging research area involves the creation and utilization of adversarial examples to evaluate language model toxicity. Various methods, such as scrutinizing datasets to assess their comments' capacity to induce toxic attributes in models <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b21">22]</ref>, reveal that not only can toxic comments engender toxicity, but non-toxic comments can also exert a similar influence. An alternative approach focuses on generating adversarial prompts using search and optimization algorithms, guided by predefined malevolent word sets <ref type="bibr" target="#b47">[48]</ref>. Researchers leverage LLMs and prompt engineering to generate adversarial prompts, investigating the utility of training models through RL or SL to generate sentences leading to toxic content <ref type="bibr" target="#b48">[49]</ref>. Additional efforts consider using explicit names of social groups followed by benign actions to induce toxicity in masked language models <ref type="bibr" target="#b49">[50]</ref>.</p></div>
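As a brief illustration of the DExperts-style decoding mentioned above, the next-token logits of the base model can be steered toward an expert and away from an anti-expert by a simple linear combination. The following is a toy sketch with made-up logits and a 4-token vocabulary, not the authors' implementation:

```python
import numpy as np

def dexperts_logits(base, expert, anti_expert, alpha=2.0):
    """DExperts-style ensembling: steer next-token logits toward the
    non-toxic expert and away from the toxic anti-expert."""
    return base + alpha * (expert - anti_expert)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy 4-token vocabulary; index 3 stands in for a toxic token.
base = np.array([1.0, 0.5, 0.2, 1.2])
expert = np.array([1.0, 0.6, 0.3, -2.0])   # expert assigns the toxic token low mass
anti = np.array([0.2, 0.1, 0.0, 3.0])      # anti-expert assigns it high mass

steered = softmax(dexperts_logits(base, expert, anti))
# The toxic token's probability drops relative to unsteered decoding.
assert steered[3] < softmax(base)[3]
```

The weight `alpha` controls how strongly generation is pushed away from the anti-expert's preferences; larger values reduce toxicity further at the cost of drifting from the base distribution.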
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Reinforcement Learning</head><p>Unlike the SL and unsupervised learning paradigms, RL involves an agent interacting with an environment, receiving rewards and observations as the result of its actions. In the context of RL, an environment is characterized by its Markovian property, meaning that its learning dynamics rely solely on the present state, disregarding past states or historical information. The primary aim within this framework is to achieve the highest attainable reward over an episode, with the focus on optimizing actions based on the current state.</p><p>As illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, the fine-tuning process begins with two identical models exposed to an adversarial prompt dataset. The trained model 𝜋 PPO , or policy, generates an output evaluated using a reward function and contributes to the calculation of KL divergence in comparison to a reference model copy, 𝜋 base . The KL divergence is a measure of how one probability distribution diverges from a second, and is used in this context with 𝜆 KL to control the maximum divergence. The final reward is determined as the aggregate of these components. This resulting reward then informs the adjustment of the RL algorithm (in our case, Proximal Policy Optimization, PPO <ref type="bibr" target="#b50">[51]</ref>) to refine the policy's parameters <ref type="bibr" target="#b51">[52]</ref>.</p><p>RL applied in the domain of Natural Language Processing (NLP) is a relatively recent technique that has gained adoption. Within this framework, RL is customized to align with the unique components of NLP systems. Here, the environment dynamically adapts to the task at hand, which may involve representing a target model for attacking or utilizing a dataset for initializing observations to enhance a task's performance. 
Initially, both the base model 𝜋 base and the trained model or policy 𝜋 PPO receive the initial input, which could be a sentence requiring a response or a discriminative task to execute. The state reaches its finalization when the model selects the action to execute, such as predicting the next token or formulating the subsequent sentence in the case of generation tasks. If the reward is dense, the reward model generates a scalar indicative of the state's quality (i.e., the current sentence). This scalar is then incorporated into a policy constraint metric to ensure that the model remains within a reasonable deviation from its initial capabilities. Upon obtaining the reward, the policy update occurs based on the chosen algorithm. In the context of text generation, an episode typically refers to the process of generating a set of tokens or sentences until reaching the end-of-sequence token. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the main RL process. For further clarity, the subsequent key terms are defined:</p><p>• In RL with NLP, the State Space 𝑆 depends on the generation process, which can involve either next sentence prediction or next token prediction given a sentence or a set of words. The dimensionality of the state space is equivalent to the size of the vocabulary raised to the power of the number of outputs.</p><p>• The Action Space 𝐴 comprises all tokens that can be used for next token prediction, corresponding to the vocabulary of the natural language model under consideration, or the next possible sentence.</p><p>• The Reward 𝑅 is typically a composite of a reward model and a policy change constraint. In the literature, the Kullback-Leibler (KL) <ref type="bibr" target="#b52">[53]</ref> divergence metric has been widely used as an asymmetric measure of similarity between two probability distributions. 
The reward signal can exhibit either sparsity, when provided at the sentence level, or density, when furnished at the token level, and can be formulated as:</p><formula xml:id="formula_0">𝑅 𝑡+1 = 𝑅 0 − 𝜆 • 𝑅 𝐾𝐿</formula><p>where 𝑅 0 is the immediate reward, 𝜆 is the KL penalty coefficient, a positive value (between 0 and 1) that weights the impact of the KL divergence in the training process, and 𝑅 𝐾𝐿 is the KL divergence value.</p></div>
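The shaped reward above can be sketched numerically. In this sketch the KL term is approximated per token by the log-probability ratio between the policy and the base model, a common practical estimator used here as an assumption rather than the paper's exact computation:

```python
import math

def kl_penalized_reward(r0, logp_policy, logp_base, lam=0.2):
    """R_{t+1} = R_0 - lambda * R_KL, with R_KL approximated by the sum of
    per-token log-probability ratios log pi_PPO(a|s) - log pi_base(a|s)."""
    r_kl = sum(lp - lb for lp, lb in zip(logp_policy, logp_base))
    return r0 - lam * r_kl

# Toy example: the policy assigns its generated tokens higher probability
# than the base model does, so the KL estimate is positive and the shaped
# reward ends up below the raw reward R_0 = 1.0.
logp_policy = [math.log(0.5), math.log(0.4)]
logp_base = [math.log(0.2), math.log(0.2)]
shaped = kl_penalized_reward(1.0, logp_policy, logp_base, lam=0.2)
assert shaped < 1.0
```

When the policy matches the base model exactly, the penalty vanishes and the shaped reward equals 𝑅 0 , which is why the coefficient 𝜆 effectively bounds how far fine-tuning can pull the policy from its starting point.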
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Summary and Contribution</head><p>Efforts to address toxicity in dialogue agents have been commendable, utilizing strategies like SL, content filtering classifiers, and safety layers. Decoding-oriented approaches have used word filtering, control tokens, two-dimensional representations, in-context learning combined with external tools, RL, and less toxic expert models to steer generation. A promising approach involves using adversarial examples to evaluate and counteract language model toxicity, providing valuable insights into the impact of both toxic and non-toxic comments. Contextual similarities between adversarial datasets and real-world social media content are noteworthy.</p><p>Contribution Our research tackles toxicity using a diverse array of experts within an RL environment.</p><p>Each expert focuses on one of the various forms toxicity can take: implicit toxicity, explicit toxicity, and contextualized toxicity. The latter is an innovative approach to toxicity mitigation, considering the current lack of robust detection models, despite ongoing efforts in dataset collections <ref type="bibr" target="#b10">[11]</ref>. As detailed in Section 3.1, we integrate all three forms of toxicity into a single model capable of assessing toxicity.</p><p>In addition, we curated a set of adversarial examples following the procedure of Si et al. <ref type="bibr" target="#b21">[22]</ref>, who observed notable contextual similarities between adversarial datasets and real-world social media content. Even though similar approaches have been used in the literature <ref type="bibr" target="#b44">[45]</ref>, the particularities of dialogue agents, such as the conversation history, substantially shape the methodological approach to the problem. This approach involves iterative changes guided by the policy, optimized through exposure to adversarial examples. 
The process effectively mitigates toxicity, with the reward function serving as an expert, enabling a more automated and efficient approach to addressing toxicity in dialogue agents. To the best of our knowledge, this research constitutes a pioneering effort and a new technical approach to this particular challenge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this section, we elucidate the methodology employed to mitigate toxicity by framing it as an RL problem, encompassing the entire process from data acquisition to model evaluation. The methodology is divided into a three-step process, as depicted in Figure <ref type="figure" target="#fig_1">2</ref>, the first of which is considered optional:</p><p>• Enhancement of the base model via SL: In this initial phase, the base model is trained using supplementary datasets through an SL paradigm. The objective here is to augment the model's capabilities, thereby enhancing its proficiency in handling specific tasks or adapting it to novel contexts. In our particular context, this stage serves the purpose of bolstering the model's capacity to generate contextually relevant responses within extended dialogue histories <ref type="bibr" target="#b54">[55]</ref>. Furthermore, it is employed to improve the model's aptitude for generating less toxic responses, while concurrently reducing the prevalence of generic responses, particularly in response to toxic comments <ref type="bibr" target="#b53">[54,</ref><ref type="bibr" target="#b55">56]</ref>.</p><p>• Formulation of the reward function or reward model: Within the realm of RL, it is essential to devise a reward function that facilitates the target task, as this function serves as the primary metric for evaluating the agent's performance. In the intersection of NLP and RL, it is conventional to construct a reward model capable of assessing various facets or a singular aspect that requires enhancement within the model <ref type="bibr" target="#b50">[51]</ref>. 
In our specific case, we have devised a composite reward model comprising three distinct sub-models, as described in Section 3.1.</p><p>• RL fine-tuning: In this third phase, we leverage a specialized framework tailored for training LLMs through RL, namely RL4LMs <ref type="bibr" target="#b56">[57]</ref>. This open-source software, developed by the Allen Institute for AI, is employed for our purposes. Within this framework, we apply the PPO algorithm, recognized as the state-of-the-art approach for such applications.</p><p>In Section 3.1, we delve into the genesis and functionality of the reward function. Moving on to Section 3.2, we elaborate on the process of gathering adversarial attack examples. Finally, in Section 3.3, we elucidate the criteria employed in selecting the RL training tool.</p></div>
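For intuition on the PPO update used in the third phase, the clipped surrogate objective at the core of the algorithm can be written as a scalar function of the policy-probability ratio and the advantage. This is a didactic sketch of the standard PPO objective, not the RL4LMs implementation:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: take the more pessimistic of the raw and
    clipped policy-ratio terms, so overly large policy updates earn no
    extra reward."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)

# If the new policy raises an action's probability too much (ratio > 1+eps),
# the extra gain is clipped away ...
assert abs(ppo_clip_objective(1.5, 1.0) - 1.2) < 1e-9
# ... while the pessimistic min never hides a harmful update.
assert abs(ppo_clip_objective(0.5, -1.0) - (-0.8)) < 1e-9
```

In practice the policy gradient maximizes the expectation of this quantity over sampled generations; the clipping range `eps` plays a role analogous to the KL penalty of Section 2.4 in keeping 𝜋 PPO close to 𝜋 base .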
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Reward Function</head><p>As illustrated in Figure <ref type="figure">3</ref>, the reward function in this study is constructed upon three distinct models, each assigned a specific objective related to identifying various forms of toxicity. Google's Perspective API is tasked with detecting explicit toxicity, while implicit toxicity is discerned using Microsoft's ToxiGen HateBERT. Additionally, a model designed to evaluate toxicity within a conversational context, ToxDialogDefender <ref type="foot" target="#foot_0">5</ref> , is developed within the scope of this research to address specific knowledge gaps.</p><p>These knowledge gaps primarily encompass identifying non-toxic responses to toxic inputs, such as countering a toxic question or statement, as well as recognizing sarcasm or irony in responses to toxic inputs.</p><p>In the development of our toxicity detection model, capable of assessing toxicity within a given context, we have aligned our approach with the prevailing trends in the literature <ref type="bibr" target="#b57">[58,</ref><ref type="bibr" target="#b58">59]</ref>. Notably, we have leveraged state-of-the-art architectures including GRU <ref type="bibr" target="#b59">[60]</ref>, BiLSTM <ref type="bibr" target="#b60">[61]</ref>, BERT <ref type="bibr" target="#b61">[62]</ref>, and RoBERTa <ref type="bibr" target="#b62">[63]</ref>. Given our limited corpus of toxic instances, we have adopted the use of language representation models such as DistilBERT and RoBERTa due to their remarkable adaptability to new tasks <ref type="bibr" target="#b57">[58,</ref><ref type="bibr" target="#b58">59]</ref>. 
In addition to the aforementioned models, we conducted training with DeBERTa <ref type="bibr" target="#b63">[64]</ref>, which has exhibited superior performance in the SuperGLUE benchmark <ref type="foot" target="#foot_1">6</ref> and demonstrated enhanced contextual comprehension through its enhanced mask decoder and disentangled attention mechanism. The training datasets employed for these models encompassed the Dialogue Safety dataset <ref type="bibr" target="#b14">[15]</ref> and the Bot Adversarial <ref type="bibr" target="#b0">[1]</ref> dataset. During the training process, the dialogue context was incorporated as part of the input using special tokens, following the input schema "…".</p><p>During the formulation of the reward function, numerous uncertainties arose regarding the capabilities of different models, particularly those not developed as part of this research project. As mentioned earlier, the Perspective API has been reported to exhibit biases and challenges, especially concerning the underrepresentation of minority groups. Additionally, ToxiGen HateBERT has primarily been assessed for its performance in implicit toxicity detection. However, it has not been analyzed for explicit toxicity, despite being built upon another toxic detection model designed for detecting explicit toxicity.</p><p>To address and mitigate these uncertainties, we systematically collected two datasets closely related to the training data of the models under consideration, namely, the Toxic Comment Classification Challenge and ToxiGen <ref type="bibr" target="#b64">[65,</ref><ref type="bibr" target="#b25">26]</ref>, in addition to a third dataset unrelated to our specific models from Surge AI <ref type="foot" target="#foot_2">7</ref> . 
These datasets served as the foundational basis for our analysis, enabling us to assess the models' effectiveness in recognizing various aspects of toxicity, as well as to evaluate whether the Perspective API could predict implicit toxicity and whether ToxiGen HateBERT could predict explicit toxicity.</p><p>From each of the first two datasets we collected 30,000 comments, ensuring a balanced presence of toxicity. We then employed the Perspective API and ToxiGen HateBERT models for predictions, creating datasets that contained information about the capabilities of each model. The resulting dataset was partitioned into 80% training data and 20% test data. With this information in hand, our objective was to develop a function capable of leveraging the strengths of these models, namely implicit and explicit toxicity detection. To accomplish this, we trained three machine learning models to ensemble the outputs of the Perspective API and ToxiGen HateBERT models. These machine learning models were selected for their ability to represent learned patterns as rules or functions without adding computational overhead. This makes them well suited for ensemble modeling and thus for deciding when to choose the label of the Perspective API or of ToxiGen HateBERT.</p><p>In Figure <ref type="figure">3</ref>, we outline the methodology for combining the models. Initially, the responses are analyzed by the Perspective API and ToxiGen HateBERT, whose outputs are then ensembled into a binary label. If this label is deemed toxic, it is used as the final output; if not, the ToxDialogDefender model is employed to assess the comment considering the conversation history. We divide the process into two stages, as each focuses on specific cases where the other models are less effective. Although various types of toxicity have been mentioned, we only account for overall toxicity, as providing distinct scalar values for every case could mislead the model in the RL training process. 
Hence, the output of the reward function is either -1 or +1, with -1 assigned to toxic comments and +1 assigned to non-toxic comments. </p></div>
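The two-stage scoring just described can be sketched as follows. This is a minimal illustration, not the production pipeline: the classifier arguments are hypothetical callables standing in for the Perspective API, ToxiGen HateBERT, the ensemble, and ToxDialogDefender, each returning a toxicity probability in [0, 1].

```python
def reward(history, response, perspective, toxigen, ensemble, tox_dialog_defender,
           threshold=0.5):
    """Two-stage reward: -1 for toxic responses, +1 for non-toxic ones.

    All classifier arguments are hypothetical stand-ins for the real
    models, returning a toxicity probability in [0, 1].
    """
    # Stage 1: context-free analysis of the response in isolation.
    p_explicit = perspective(response)   # explicit toxicity score
    p_implicit = toxigen(response)       # implicit toxicity score
    if ensemble(p_explicit, p_implicit) >= threshold:
        return -1
    # Stage 2: contextual analysis over the conversation history.
    if tox_dialog_defender(history, response) >= threshold:
        return -1
    return +1
```

In the actual system the ensemble is a learned model (a logistic regression over the two scores) rather than a fixed rule, and the threshold corresponds to its decision boundary.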
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data Acquisition</head><p>The most critical aspect of model tuning lies in the data, which, in this case, extends beyond naive data acquisition. We require adversarial examples with the potential to elicit toxicity from our models. This is not straightforward due to the complexity of the models, their lack of interpretability, and the vast amount of data they are trained on, which makes it unfeasible to accurately predict the learned response distribution to toxic and non-toxic inputs. Our analysis focuses on adversarial examples that induce toxicity in the model, particularly entries that are themselves non-toxic. Bearing this in mind, we have chosen to adhere to the guidelines outlined in <ref type="bibr" target="#b21">[22]</ref>, which successfully identified adversarial examples for the BlenderBot 1 90M model. Given that the dataset was not publicly accessible during the course of our investigation, we undertook the task of replicating their entire process, albeit with some modifications as indicated in Figure <ref type="figure" target="#fig_3">4</ref>. The dataset was retrieved from an internet forum known as 4chan, specifically the Politically Incorrect board <ref type="bibr" target="#b65">[66]</ref>. The adjustments made are listed next:</p><p>1. Instead of relying solely on the Perspective API to acquire adversarial examples, we opted to employ our custom reward function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.</head><p>In line with the conclusions drawn in the original article, we expanded the list of terms that are likely to trigger toxicity, including -but not limited to -groups and identities such as Hindus, Buddhists, LGBTQ+ individuals, the disabled, religious denominations like Mormons and Jehovah's Witnesses, and news organizations like Fox, CNBC, MSNBC, BBC and Sky News.</p><p>In the original work <ref type="bibr" target="#b21">[22]</ref>, the authors meticulously curated a sample of one million entries from the dataset. In our study, we closely followed their methodology and incorporated our refinements to initially narrow down the extensive dataset comprising 139 million comments to a more computationally manageable subset of 12 million comments. This initial reduction was done to conserve computational resources and to focus on acquiring the potential adversarial example subset, given that less than 9% of the data analyzed by Si et al. <ref type="bibr" target="#b21">[22]</ref> were deemed capable of generating toxicity. Subsequently, with this streamlined dataset at our disposal, we partitioned it into discrete chunks, each containing half a million comments, for in-depth analysis employing our proposed reward function.</p><p>The selection of these comment chunks was methodically guided by the empirical observation in Si et al. <ref type="bibr" target="#b21">[22]</ref> that comments exhibiting scores below the 0.3 threshold displayed an elevated propensity to incite toxic interactions. In addition, that study observed that comments scoring between 0.6 and 1.0 also harbored the potential for generating toxicity. These selected comments were input into both our BlenderBot 1 base model and our model fine-tuned via SL. In both instances, we utilized a greedy search algorithm for the generation of response text. 
Ultimately, we conducted a thorough analysis of 1.5 million comments twice in our proposed pipeline: first when employing the base BlenderBot 1 model, and subsequently when using the BlenderBot model fine-tuned in Step 1.</p></div>
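The chunked filtering described above can be sketched as follows; `score_fn` is a hypothetical stand-in for our reward-function scorer, and the thresholds follow the empirical observations of Si et al. [22] cited in the text.

```python
from itertools import islice

def candidate_filter(comments, score_fn, low=0.3, mid=0.6, high=1.0):
    """Keep comments whose toxicity score suggests adversarial potential:
    below `low`, or between `mid` and `high`, per the thresholds reported
    by Si et al. `score_fn` is a hypothetical scorer returning [0, 1]."""
    for text in comments:
        s = score_fn(text)
        if s < low or mid <= s <= high:
            yield text

def chunked(iterable, size=500_000):
    """Split the candidate stream into fixed-size chunks for batched
    analysis by the reward function."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk
```

With half-million-comment chunks, the 12 million pre-filtered comments map to the incremental analysis described in the text.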
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">RL Fine-Tuning</head><p>Once all the components were meticulously crafted, the only task remaining was to select a framework for implementing RL in LLMs. Throughout 2022 and 2023, several frameworks emerged from the literature, primarily in response to the tremendous success of ChatGPT, such as RL4LMs <ref type="foot" target="#foot_3">8</ref> or TRL <ref type="foot" target="#foot_4">9</ref>, among others. Given the continued development of these tools, we specifically opted for approaches that aligned with our criteria:</p><p>1. We aimed for projects with at least one year of development history, coupled with ongoing contributions from developers.</p><p>2. The chosen tools needed to be focused on LLMs rather than replicating specific instances like GPT 3.5.</p><p>3. These tools should be developed by individuals with a research-oriented mindset, either to demonstrate the viability of this approach for various tasks or to create research tools for public use.</p><p>With these criteria in mind, we selected TRL and RL4LMs. TRL, developed by Hugging Face, offers the PPO algorithm as its main RL method. TRL operates at the sentence level and is compatible with various architectures. However, one drawback is the computation of the KL divergence, which has been reported as an unsolved issue <ref type="foot" target="#foot_5">10</ref>. During model training, the KL value tended to become increasingly negative over time, resulting in undesirable outcomes. This issue was particularly pronounced for Seq2Seq models.</p><p>Considering these concerns, we turned to RL4LMs, which is characterized by a more RL-focused design. In this framework, as elucidated in Section 2.4, actions refer to the vocabulary, and the state is generated in each iteration (next-token prediction) by the policy, which is the natural language model. 
Rewards can be computed at the token or sentence level, with the latter being our preferred choice. One limitation is that data is processed one item at a time; even though parallel processing is possible, this approach is computationally expensive for large datasets, as is the case here. Conversely, RL4LMs offers multiple algorithmic options, including PPO among others.</p></div>
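The sentence-level, KL-constrained reward used throughout this setup (see Figure 1) can be illustrated with a simplified sketch. We assume access to the per-token next-token distributions of the policy and the frozen reference model; real frameworks such as TRL or RL4LMs typically approximate the KL term from the log-probabilities of the sampled tokens instead.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shaped_reward(task_reward, policy_dists, ref_dists, kl_coef=0.2):
    """Sentence-level reward penalized by the policy's divergence from the
    frozen reference model, as in KL-regularized PPO fine-tuning.
    `policy_dists` / `ref_dists` are per-token next-token distributions;
    `kl_coef` matches the KL coefficient of 0.2 used in our experiments."""
    kl = sum(kl_divergence(p, q) for p, q in zip(policy_dists, ref_dists))
    return task_reward - kl_coef * kl
```

When the policy has not moved from the reference, the penalty vanishes and the shaped reward equals the raw reward; as the policy drifts, the penalty grows and discourages large deviations from the model's initial capabilities.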
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup and Evaluation Protocol</head><p>A set of experiments was designed to rigorously assess the efficacy of different methodologies in mitigating toxicity within textual data. We carefully formulated several experiment cases with the primary objective of analyzing their impact on the final results of toxicity mitigation. In our exploration we focused on various data distributions (toxic/non-toxic ratios) and text generation methods, considering the computational demands inherent to RL-based training. To manage computational resources effectively, we opted to work with a subset of our dataset, containing between 100,000 and 140,000 items. Each dataset was partitioned into an 80% training subset and a 20% test subset. For consistency, the same test set was employed across all experiments to ensure a fair evaluation of the different methodologies. Within the training dataset, we categorized instances into toxic and non-toxic data.</p><p>The training set was sampled with different non-toxic/toxic distribution ratios: an 80/20 split, preserving the observed toxicity distribution in our dataset, and a balanced 50/50 split. Additionally, we further split the toxic data into the three forms of toxicity, also preserving the distribution observed in our adversarial examples dataset: 35% implicit toxicity, 32% explicit toxicity, and 33% toxicity given a context. These splits were chosen because they are balanced enough to obtain not only a good representation of the original dataset but also a diverse number of examples. Regarding decoding strategies, we chose two approaches: deterministic decoding, also known as greedy search decoding, and a probabilistic decoding strategy, multinomial sampling. These selections were made to observe the differences in toxicity mitigation between a deterministic technique and a more diverse one. 
The experiments are outlined as follows, along with their respective objectives:</p><p>1. Greedy search decoding with the 80/20 distribution: This experiment investigates the effect of utilizing a subset of data that mirrors the distribution of toxicity in our dataset. Specifically, we assess the utility of the greedy search decoding strategy in the training process under this distribution.</p><p>2. Multinomial sampling decoding with the 80/20 distribution: This experiment examines how the training process is influenced by employing a probabilistic technique on the previously analyzed text set, shedding light on the effectiveness of multinomial sampling in mitigating toxicity within the dataset.</p><p>3. Greedy search decoding with the 50/50 distribution: This experiment analyzes the impact on toxicity mitigation of using a balanced dataset in combination with the greedy search decoding strategy, providing insights into the effectiveness of different distribution ratios in achieving toxicity balance.</p><p>In the experiments, several parameters remained fixed, falling within the ranges used in Ramamurthy et al. <ref type="bibr" target="#b56">[57]</ref>. These parameters include setting the number of epochs per rollout to 4, configuring the number of steps per epoch to 12,800, and maintaining a learning rate of 10⁻⁶; the learning rate was set following the guidelines of the BlenderBot article <ref type="bibr" target="#b0">[1]</ref>. In terms of the KL divergence parameters, we set the KL coefficient to a fixed value of 0.2 and the KL target to 0.5; these settings were obtained empirically. Additionally, we employed a batch size of 16 and conducted 100 epochs of the training process, allowing the model to learn and adapt over multiple training cycles. 
The batch size was chosen due to computational constraints, and 100 epochs proved empirically sufficient, as beyond that point the model deviated too far from its initial probability distribution.</p><p>For the greedy search approach, we adopted the default parameters provided by Hugging Face, as these parameters consistently yielded the best outcomes. Conversely, for the multinomial sampling strategy, we configured the parameters with a top-k value of 20, a temperature of 0.7, and a single beam. These parameter choices were made based on an empirical analysis of the responses generated by BlenderBot, and partly following the experimental setup presented in <ref type="bibr" target="#b20">[21]</ref>, where the best parameters to mitigate the prevalence of toxicity were identified.</p></div>
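The multinomial sampling strategy with the parameters above (top-k of 20, temperature 0.7, a single beam) can be sketched over raw logits as follows; this is an illustrative re-implementation of the standard decoding recipe, not the Hugging Face code path.

```python
import math
import random

def sample_next_token(logits, top_k=20, temperature=0.7, rng=random):
    """Multinomial sampling with the parameters used in our experiments:
    restrict to the `top_k` most likely tokens, sharpen with `temperature`,
    then draw from the renormalized distribution. Greedy decoding is the
    limiting case that always takes the argmax."""
    # Keep the top-k (token_id, logit) pairs, highest logit first.
    top = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax over the kept logits (max-subtracted
    # for numerical stability).
    scaled = [(i, l / temperature) for i, l in top]
    m = max(l for _, l in scaled)
    weights = [math.exp(l - m) for _, l in scaled]
    ids = [i for i, _ in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]
```

A temperature below 1 sharpens the distribution toward the most likely tokens, while the top-k cutoff removes the long low-probability tail, which keeps responses diverse without drifting into unlikely continuations.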
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Evaluation</head><p>Evaluating conversational agents is challenging, typically involving human annotators and evaluators to obtain reliable insights. Recent advancements in LLMs in 2023 have led to the exploration of automatic metrics <ref type="bibr" target="#b66">[67]</ref>. Traditional metrics like METEOR, BLEU, and ROUGE lack depth in capturing meaning, word order, output correctness, and coherence <ref type="bibr" target="#b66">[67]</ref>.</p><p>When assessing our experiments, we considered toxicity as well as the model's ability to produce coherent, grammatically correct, and non-redundant outputs. Since training alters the model's word probabilities within a contextual framework, verifying that it could still generate accurate sentences was crucial. Post-training evaluations leveraged metrics unrelated to toxicity to provide additional context; these metrics were not integrated during the training phase due to their context-specific nature. This approach was adopted to gain a more comprehensive and contextual understanding of the results, particularly in aspects not directly tied to toxicity. We provide an overview of the two metrics utilized:</p><p>• DEAM <ref type="bibr" target="#b67">[68]</ref> assesses response coherence at the conversation level using abstract meaning representation. Trained to classify coherence, its score ranges from 0 (not coherent) to 1 (coherent).</p><p>• GRUEN <ref type="bibr" target="#b68">[69]</ref> evaluates grammaticality, non-redundancy, and topic maintenance. Its techniques include sentence likelihood, grammatical acceptability, and Word Mover similarity. GRUEN's total score ranges from 0 to 1: a score of 0 indicates a sentence that is grammatically incorrect and redundant, while a score of 1 indicates a grammatically correct and coherent sentence.</p><p>The median of each metric, DEAM and GRUEN, was computed over the generated text in the test subset to mitigate the impact of outliers.</p></div>
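Aggregating the per-response scores then reduces to a median over the test subset; in this sketch, the `deam` and `gruen` keys are hypothetical field names for the two metric outputs.

```python
from statistics import median

def aggregate_metrics(scored_outputs):
    """Summarize per-response DEAM and GRUEN scores over the test set.
    The median is used instead of the mean to damp the effect of outlier
    generations. `scored_outputs` is a list of dicts with hypothetical
    'deam' and 'gruen' keys holding scores in [0, 1]."""
    return {
        "deam": median(item["deam"] for item in scored_outputs),
        "gruen": median(item["gruen"] for item in scored_outputs),
    }
```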
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>In this section, we present the main results obtained in every step, following the experimental design described in Section 4 and outlining its connections to the research questions. We focus on the most important outcomes and also provide examples of how the model changed its interaction with the same adversarial examples before and after each training session. In Section 5.1, we show the performance of each model that constitutes the reward model. Subsequently, in Section 5.2, we describe the adversarial example dataset obtained. Finally, in Section 5.3, we showcase the results of the proposed methodology in mitigating toxicity on BlenderBot 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Reward Function</head><p>As mentioned in Section 3.1, the reward function comprises three models, one of which, named ToxDialogDefender, was specifically tailored to identify toxicity within a given context. Throughout the development of this model, we tested several base models to determine which one excelled at this task, addressing research question 𝑅𝑄 2 . The models assessed included DeBERTa, RoBERTa, and DistilBERT.</p><p>As shown in Table <ref type="table" target="#tab_1">1</ref>, DeBERTa demonstrated superior performance across both the validation and test sets. Consequently, it became the foundation of our ToxDialogDefender model. In general, transformer-based models consistently proved effective in detecting toxicity within a dialogue context. We conducted an analysis to understand why the model could not accurately predict some examples, aiming to uncover patterns and explanations for its inaccuracies in discerning toxicity in such comments. We conducted topic extraction to gain insights into the topics the model struggled to predict accurately, such as its difficulty in detecting affirmations of a toxic comment as a form of toxic response, exemplified by "being at one with, let's say, males shouldn't exist". Additionally, we evaluated sentence length to understand whether the model faced challenges with long or short sentences, potentially due to a lack of context understanding or misleading context. Lastly, we employed sentence embeddings to identify patterns by grouping sentences into clusters. However, none of the mentioned methods resulted in significant information gain due to the diversity of the dataset. After reviewing the outcomes of our toxicity detector, we assessed how effectively the Perspective API and ToxiGen HateBERT models adapted to distinct predictive contexts. 
Table <ref type="table" target="#tab_3">2</ref> reveals different predictive capabilities arising from the different natures of the datasets. We used the outputs of these models as inputs to an ensemble model, thereby harnessing the complementary knowledge embedded in the combined models. In this context, the test dataset comprises a combination of ToxiGen and Jigsaw entries, while the validation dataset consists exclusively of the Surgei dataset, as it was not used during the training phase of our ensemble model.</p><p>The outcomes of this ensemble model are shown in Table <ref type="table" target="#tab_4">3</ref>. Logistic regression surpassed the performance of the base models on each of the datasets, and performed notably better than the other ensemble models on the validation set, whose domain differs from that of the ensemble training data. From these results, we derived the function representing the logistic regression model:</p><formula xml:id="formula_1">log (𝑃 / (1 − 𝑃)) = −0.99 + 5.46 𝜉_PERS + 1.09 𝑂_T − 2.08 𝑂_NT</formula><p>where 𝑂_NT and 𝑂_T denote the outputs of the ToxiGen HateBERT model for the Non-Toxic and Toxic classes, respectively; 𝜉_PERS denotes the output of the Perspective API model; and 𝑃 denotes the probability of a text being toxic, given the values of 𝑂_NT, 𝑂_T, and 𝜉_PERS.</p></div>
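Plugging the fitted coefficients (intercept −0.99, Perspective weight 5.46, ToxiGen Toxic weight 1.09, ToxiGen Non-Toxic weight −2.08) into the logistic function gives the ensemble's toxicity probability directly; a sketch, with a 0.5 decision threshold assumed:

```python
import math

def ensemble_probability(xi_pers, o_t, o_nt):
    """Toxicity probability from the fitted logistic-regression ensemble:
    log(P / (1 - P)) = -0.99 + 5.46*xi_pers + 1.09*o_t - 2.08*o_nt,
    where xi_pers is the Perspective API score and o_t / o_nt are the
    ToxiGen HateBERT outputs for the Toxic and Non-Toxic classes."""
    z = -0.99 + 5.46 * xi_pers + 1.09 * o_t - 2.08 * o_nt
    return 1.0 / (1.0 + math.exp(-z))
```

Note how the large positive weight on the Perspective score dominates the decision when explicit toxicity is present, while a confident Non-Toxic output from ToxiGen HateBERT pulls the probability down.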
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Data Acquisition</head><p>In this section, we gather queries that prompt BlenderBot to generate toxic responses, addressing research question 𝑅𝑄 1 . To achieve this, we curated a dataset consisting of 1.5 million comments capable of eliciting toxicity from both our base model and our model trained through SL. The data collection process was carried out in increments of half a million comments, guided by our reward function. In total, 1.5 million comments were collected and assessed for toxicity for each of the models: the base BlenderBot model and the fine-tuned BlenderBot model. As shown in Figure <ref type="figure" target="#fig_4">5</ref>, 15.4% of the comments exhibited the ability to provoke toxicity in the base model. Within this subset, only 0.04% were flagged by the Perspective API, while 6.85% were identified by ToxiGen HateBERT. Remarkably, 8.53% were detected by our proposed ToxDialogDefender toxicity detector. In the case of the model that underwent SL, 14.13% of the comments were recognized as toxic according to our reward function. Among these, 4.45% were detected by the Perspective API, 5.03% by ToxiGen HateBERT, and 4.64% by ToxDialogDefender.</p><p>Upon closer examination, it becomes evident that the initial phase of our methodology effectively reduced the proportion of toxicity-eliciting comments, from 15.4% for the base model to 14.13% for the SL-trained model. This preparatory process generated an ample number of adversarial examples, totaling approximately 210,000 elements. These examples laid the groundwork for the subsequent RL task, which was designed to address and mitigate toxicity in the BlenderBot 1 90M model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">RL Fine-Tuning</head><p>In this section, we present our findings regarding 𝑅𝑄 3 , which focuses on mitigating toxicity in dialogue agents without altering the internal model structure. During the training phase, evaluations were conducted every 10 epochs for each experiment. In these evaluations, we generated a new test dataset using the updated model parameters, and employed greedy search as our decoding strategy to observe changes in probabilities. Once the data was prepared, we applied the metrics described in Section 4. This systematic approach enabled us to track the model's progress throughout training. Over the course of 100 training epochs, the model showed significant improvement in learning the optimal policy, with most of the gains occurring between epochs 10 and 20. Beyond that point, however, it began to display signs of overfitting, replicating patterns from the training data and paying less attention to the input. One notably repetitive pattern was the following: "I'm not sure what you're talking about. What do you mean by ... ?". Nevertheless, as shown in Table <ref type="table" target="#tab_5">4</ref>, toxicity was considerably reduced when compared to the initial values in our test dataset. In Figure <ref type="figure" target="#fig_6">6</ref>, we present both positive and negative examples of our model's output at different training stages. We term them 'positive examples' to showcase how the model successfully addressed toxicity issues after training; by 'negative examples', we mean that the model still exhibited some level of toxicity even after the two training processes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Research</head><p>This article introduces a methodology inspired by recent advancements in RL and LLMs aimed at mitigating toxicity in dialogue agents. To achieve this goal, we leverage three toxicity detectors, each specialized in identifying one of the three forms of toxicity that can manifest in dialogue settings: implicit toxicity, explicit toxicity, and toxicity given a context. These toxicity detectors constitute the reward model, tasked with evaluating the sentences generated by the LLM. Experiments were conducted using BlenderBot 1, recognized for its proficiency in crafting toxic comments <ref type="bibr" target="#b21">[22]</ref>. The model underwent training via SL on datasets from <ref type="bibr" target="#b53">[54,</ref><ref type="bibr" target="#b55">56]</ref>, resulting in a reduction in toxicity and the promotion of prosocial responses to toxic comments. Additionally, adversarial examples from 4chan were collected for both the base and SL models, totaling around 210,000 entries. In the RL training process, the LLM generates a response to the adversarial data using two decoding strategies: deterministic and probabilistic. Once the response is generated, it is evaluated by the reward model and constrained by the Kullback-Leibler divergence to prevent significant deviation from the model's initial capabilities. Finally, the composite reward is used to update the model weights using the PPO algorithm.</p><p>Findings: Our exploration of decoding strategies and data distributions yielded several insights. Firstly, adopting a non-deterministic sampling approach was crucial for creating a less toxic model while maintaining diversity in responses, in contrast to deterministic sampling. Another significant finding concerned the setting of the KL coefficient, a key factor in controlling model divergence. 
When assessing the generated text, we utilized a range of metrics covering coherence, grammatical correctness, informativeness, and engagement, going beyond traditional toxicity-related measures. Our methodology achieved a substantial reduction in toxicity, from 24% to 5%, while preserving the initial coherence and grammatical correctness as assessed by the DEAM and GRUEN metrics, leading to outputs that are more aligned and user-friendly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations</head><p>Our training relies on toxicity detection models, which are susceptible to false positives and bias. The Perspective API, as highlighted in Table <ref type="table" target="#tab_3">2</ref>, particularly struggles with implicit toxicity, giving more weight to toxic words than to the surrounding context. Although we generally encountered few false negatives from ToxiGen HateBERT and our toxicity detector, occasional mispredictions emphasize the need for further refinement. An increase in false positives might not significantly impact RL training unless they substantially exceed correct predictions; however, caution is essential, as the model could adjust its behavior to game the classifiers or our reward function, potentially resulting in unclear or nonsensical text.</p><p>Another limitation of our research is the evolving definition of toxicity, which poses a challenge, especially with the application of LLMs in diverse cultural contexts. Lastly, the ongoing evaluation of dialogue agents remains challenging, with human annotation struggling to keep pace. Automatic metrics, such as toxicity detectors and evaluations based on artificial models, are limited and can introduce biases and erroneous assessments that may impact the quality of the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Future Work</head><p>We plan to expand the approach presented herein by enhancing several key components: toxicity detectors, evaluation metrics, and adversarial examples. Regarding toxicity detection models, our aim is to conduct a more comprehensive evaluation to understand their strengths and biases for further improvement. For evaluation metrics, we are actively working on developing a comprehensive evaluation framework for dialogue agents, addressing critical aspects to enhance reliability. As adversarial examples were found to be crucial in the development of this research, we plan to expand and enhance the quality of our dataset, supporting follow-up studies. Finally, we will broaden our experimentation with different decoding strategies and their outcomes to bolster training robustness and, ultimately, to improve and better align dialogue agents.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Fine-Tuning Language Models using RL: The fine-tuning process begins with two identical models exposed to an adversarial prompt dataset. The trained model 𝜋 PPO , or policy, generates an output evaluated using a reward function and contributes to the calculation of the KL divergence in comparison to a frozen reference copy, 𝜋 base . The KL divergence is a measure of how one probability distribution diverges from a second; in this context it is weighted by 𝜆 KL to control the maximum divergence. The final reward is determined as the aggregate of these components. This resulting reward then informs the adjustment of the RL algorithm (in our case, Proximal Policy Optimization, PPO <ref type="bibr" target="#b50">[51]</ref>) to refine the policy's parameters <ref type="bibr" target="#b51">[52]</ref>.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Methodology of the training process. In Step 1, the model undergoes a fine-tuning process on the Prosocialdialog dataset [54] and the Bot-adversarial dialogue dataset [55]. In Step 2, the formulation and training process of the reward model are conducted. Finally, in Step 3, adversarial examples and the reward model are utilized in the RL training.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Adversarial examples gathering process. The diagram shows the adversarial examples gathering process; the model used was the BlenderBot model fine-tuned in Step 1.</figDesc><graphic coords="9,249.54,179.72,68.15,56.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Percentage of toxic comments analysed by each component of the reward function in the 1.5 million dataset of possible adversarial examples.</figDesc><graphic coords="14,161.01,224.14,270.78,171.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Response examples of each model after Step 1 and Step 3 training. (Left) A positive example in which the base model's response has been improved after the RL training process; (Right): a negative example.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="tab_0"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Reward model prediction formulation. The toxicity score is obtained by first analyzing the comment in isolation using the Perspective API and ToxiGen HateBERT, ensuring that obvious cases of toxicity are caught early. The output of each model is then ensembled and analyzed, potentially reducing false positives that might arise from relying on a single model. If the comment is not deemed toxic, our contextual toxicity detector, ToxDialogDefender, predicts the final score. ToxDialogDefender can analyze the broader conversational context, understanding nuances and detecting toxicity that might not be apparent in isolated comments.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Training results of different base models in the toxicity detection task given a conversation context.</figDesc><table><row><cell>Base Model</cell><cell>Dataset</cell><cell>F1-Score</cell><cell>Accuracy</cell></row><row><cell>DistilBERT</cell><cell>Test</cell><cell>0.759</cell><cell>0.853</cell></row><row><cell></cell><cell>Validation</cell><cell>0.764</cell><cell>0.856</cell></row><row><cell>RoBERTa</cell><cell>Test</cell><cell>0.753</cell><cell>0.856</cell></row><row><cell></cell><cell>Validation</cell><cell>0.750</cell><cell>0.856</cell></row><row><cell>DeBERTa</cell><cell>Test</cell><cell>0.794</cell><cell>0.869</cell></row><row><cell></cell><cell>Validation</cell><cell>0.795</cell><cell>0.871</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Performance metrics for Perspective API and ToxiGen HateBERT on different toxicity datasets.</figDesc><table><row><cell>Base Model</cell><cell>Dataset</cell><cell>F1-Score</cell><cell>Accuracy</cell></row><row><cell>Perspective API</cell><cell>Jigsaw</cell><cell>0.92</cell><cell>0.91</cell></row><row><cell></cell><cell>ToxiGen</cell><cell>0.58</cell><cell>0.62</cell></row><row><cell></cell><cell>Surge AI</cell><cell>0.83</cell><cell>0.82</cell></row><row><cell>ToxiGen HateBERT</cell><cell>Jigsaw</cell><cell>0.82</cell><cell>0.82</cell></row><row><cell></cell><cell>ToxiGen</cell><cell>0.88</cell><cell>0.87</cell></row><row><cell></cell><cell>Surge AI</cell><cell>0.80</cell><cell>0.80</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Performance metrics of the machine learning algorithms as ensemble in comparison with the Perspective API and ToxiGen HateBERT models.</figDesc><table><row><cell>Base Model</cell><cell>Dataset</cell><cell>F1-Score</cell><cell>Accuracy</cell></row><row><cell>Perspective API</cell><cell>Test</cell><cell>0.75</cell><cell>0.765</cell></row><row><cell></cell><cell>Validation</cell><cell>0.83</cell><cell>0.82</cell></row><row><cell>ToxiGen HateBERT</cell><cell>Test</cell><cell>0.85</cell><cell>0.845</cell></row><row><cell></cell><cell>Validation</cell><cell>0.80</cell><cell>0.80</cell></row><row><cell>Decision Tree</cell><cell>Test</cell><cell>0.88</cell><cell>0.875</cell></row><row><cell></cell><cell>Validation</cell><cell>0.859</cell><cell>0.86</cell></row><row><cell>Logistic Regression</cell><cell>Test</cell><cell>0.875</cell><cell>0.88</cell></row><row><cell></cell><cell>Validation</cell><cell>0.865</cell><cell>0.86</cell></row><row><cell>Ripper</cell><cell>Test</cell><cell>0.523</cell><cell>0.38</cell></row><row><cell></cell><cell>Validation</cell><cell>0.506</cell><cell>0.37</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Comparison of BlenderBot fine-tuned model with the model after each experiment conducted with RL. The toxicity scores displayed are from the test set, along with DEAM and GRUEN metrics, assessing the models' capacities in coherence and grammatical correctness.</figDesc><table><row><cell>Experiments</cell><cell cols="3">Toxicity (%) DEAM GRUEN</cell></row><row><cell>BlenderBot Finetuned</cell><cell>23.91</cell><cell>0.9876</cell><cell>0.8161</cell></row><row><cell>80/20 Greedy search</cell><cell>10.89</cell><cell>0.9895</cell><cell>0.8247</cell></row><row><cell>80/20 Multinomial search</cell><cell>4.9</cell><cell>0.9908</cell><cell>0.8392</cell></row><row><cell>50/50 Greedy search</cell><cell>5.87</cell><cell>0.9911</cell><cell>0.8465</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">Hugging Face: https://huggingface.co/TheMrguiller/ToxDialogDefender</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">SuperGLUE benchmark, https://super.gluebenchmark.com/leaderboard, accessed on May 3rd, 2024.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">Surge AI, https://www.surgehq.ai/, accessed on May 3rd, 2024.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3">RL4LMs: https://rl4lms.apps.allenai.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4">TRL: https://huggingface.co/docs/trl/index</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5">Negative KL divergence issue in Hugging Face, https://github.com/huggingface/trl/issues/256, accessed on April 30th, 2024.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Acknowledgments</head><p>This work has been partially supported by the Basque Government (ICL4LANG project, grant no. KK-2023/00094). J. Del Ser also receives support from this institution through the research group MATHMODE (IT1456-22).</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"> <ref type="bibr" target="#b3">4</ref> <p>Perspective API, About the API - Attributes and Languages, https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages?language=en_US, accessed on April 30th, 2024.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Recipes for Building an Open-Domain Chatbot</title>
		<author>
			<persName><forename type="first">S</forename><surname>Roller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ju</surname></persName>
		</author>
		<author>
			<persName><surname>Williamson</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.eacl-main.24</idno>
		<ptr target="https://aclanthology.org/2021.eacl-main.24" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Tsarfaty</surname></persName>
		</editor>
		<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="300" to="325" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><surname>Galley</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-demos.30</idno>
		<ptr target="https://aclanthology.org/2020.acl-demos.30" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T.-H</forename><surname>Wen</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="270" to="278" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-L</forename><surname>Boureau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1605.07683</idno>
		<title level="m">Learning End-to-End Goal-Oriented Dialog</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Chatbots meet eHealth: Automatizing Healthcare</title>
		<author>
			<persName><forename type="first">F</forename><surname>Amato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Marrone</surname></persName>
		</author>
		<author>
			<persName><surname>Moscato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WAIAH@AI*IA</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="40" to="49" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Building Task-Oriented Dialogue Systems for Online Shopping</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">31</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://openai.com/blog/chatgpt" />
		<title level="m">Introducing ChatGPT</title>
				<imprint>
			<date type="published" when="2022">2022, accessed on 09/18/2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><surname>Albert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open Foundation and Fine-Tuned Chat Models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Trolls turned Tay, Microsoft&apos;s fun millennial AI bot, into a genocidal maniac</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ohlheiser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Washington Post</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Use of personal information for artificial intelligence learning data under the Personal Information Protection Act: the case of Lee-Luda, an artificial-intelligence chatbot in South Korea</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Jeon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Go</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Namgung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Asia Pacific Law Review</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="55" to="72" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Detecting Cyberbullying and Cyberaggression in Social Media</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chatzakou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Leontiadis</surname></persName>
		</author>
		<author>
			<persName><surname>Blackburn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on the Web (TWEB)</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="1" to="51" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Abercrombie</surname></persName>
		</author>
		<author>
			<persName><surname>Bergman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.03451</idno>
		<title level="m">Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">"Go eat a bat, Chang!": On the Emergence of Sinophobic Behavior on Web Communities in the Face of COVID-19</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tahmasbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schild</surname></persName>
		</author>
		<author>
			<persName><surname>Ling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the web conference 2021</title>
				<meeting>the web conference 2021</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1122" to="1133" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.19852</idno>
		<title level="m">AI Alignment: A Comprehensive Survey</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Khatri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hedayatnia</surname></persName>
		</author>
		<author>
			<persName><surname>Venkatesh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1812.10757</idno>
		<title level="m">Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack</title>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Humeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chintagunta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1461</idno>
		<ptr target="https://aclanthology.org/D19-1461" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4537" to="4546" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raterink</surname></persName>
		</author>
		<author>
			<persName><surname>Araújo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.07790</idno>
		<title level="m">Mitigating harm in language models with conditional-likelihood filtration</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gururangan</surname></persName>
		</author>
		<author>
			<persName><surname>Sap</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.findings-emnlp.301</idno>
		<ptr target="https://aclanthology.org/2020.findings-emnlp.301" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3356" to="3369" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The Woman Worked as a Babysitter: On Biases in Language Generation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Natarajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Peng</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1339</idno>
		<ptr target="https://aclanthology.org/D19-1339" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3407" to="3412" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">GeDi: Generative Discriminator Guided Sequence Generation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Gotmare</surname></persName>
		</author>
		<author>
			<persName><surname>Mccann</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-emnlp.424</idno>
		<ptr target="https://aclanthology.org/2021.findings-emnlp.424" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting><address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4929" to="4952" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Detoxifying Language Models Risks Marginalizing Minority Voices</title>
		<author>
			<persName><forename type="first">A</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pathak</surname></persName>
		</author>
		<author>
			<persName><surname>Wallace</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.naacl-main.190" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rumshisky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hakkani-Tur</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</editor>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2390" to="2397" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Leashing the Inner Demons: Self-Detoxification for Language Models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mcauley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="11530" to="11537" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">M</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Backes</surname></persName>
		</author>
		<author>
			<persName><surname>Blackburn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security</title>
				<meeting>the 2022 ACM SIGSAC Conference on Computer and Communications Security</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2659" to="2673" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Defining and Detecting Toxicity on Social Media: Context and Knowledge are Key</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">L</forename><surname>Shalin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Kursuncu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">490</biblScope>
			<biblScope unit="page" from="312" to="318" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A New Generation of Perspective API: Efficient Multilingual Character-level Transformers</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Q</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><surname>Tay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3197" to="3207" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Automated Hate Speech Detection and the Problem of Offensive Language</title>
		<author>
			<persName><forename type="first">T</forename><surname>Davidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warmsley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Macy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Weber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the international AAAI conference on web and social media</title>
				<meeting>the international AAAI conference on web and social media</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="512" to="515" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hartvigsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Palangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamar</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.234</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.234" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3309" to="3326" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI</title>
		<author>
			<persName><forename type="first">L</forename><surname>Rosenblatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Piedras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wilkins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)</title>
				<meeting>the Second Workshop on NLP for Positive Impact (NLP4PI)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="15" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Detecting Unintended Social Bias in Toxic Language Datasets</title>
		<author>
			<persName><forename type="first">N</forename><surname>Sahoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhattacharyya</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.conll-1.10</idno>
		<ptr target="https://aclanthology.org/2022.conll-1.10" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Fokkens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting>the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates; Hybrid</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="132" to="143" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">A Critical Audit of Accuracy and Demographic Biases within Toxicity Detection Tools</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Dartmouth College Undergraduate Theses</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Social Biases in NLP Models as Barriers for Persons with Disabilities</title>
		<author>
			<persName><forename type="first">B</forename><surname>Hutchinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Prabhakaran</surname></persName>
		</author>
		<author>
			<persName><surname>Denton</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.487</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.487" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5491" to="5501" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Towards Equal Gender Representation in the Annotations of Toxic Language Detection</title>
		<author>
			<persName><forename type="first">E</forename><surname>Excell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">Al</forename><surname>Moubayed</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.gebnlp-1.7</idno>
		<ptr target="https://aclanthology.org/2021.gebnlp-1.7" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Costa-Jussa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Hardmeier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Webster</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="55" to="65" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI</title>
		<author>
			<persName><forename type="first">L</forename><surname>Piedras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rosenblatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wilkins</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.nlp4pi-1.2</idno>
		<ptr target="https://aclanthology.org/2022.nlp4pi-1.2" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Biester</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Demszky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Jin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sachan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Wilson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Xiao</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</editor>
		<meeting>the Second Workshop on NLP for Positive Impact (NLP4PI), Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates (Hybrid)</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="15" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><surname>Williams</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.656</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.656" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8173" to="8188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baheti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedl</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.emnlp-main.397" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4846" to="4862" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Context Sensitivity Estimation in Toxicity Detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Xenos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><surname>Androutsopoulos</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.woah-1.15" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics</title>
				<meeting>the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="140" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sorensen</surname></persName>
		</author>
		<author>
			<persName><surname>Dixon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.00998</idno>
		<title level="m">Toxicity Detection: Does Context Really Matter?</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Revisiting Contextual Toxicity Detection in Conversations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Anuchitanukul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ive</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Journal of Data and Information Quality</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1" to="22" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Information Leakage in Embedding Models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raghunathan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 ACM SIGSAC conference on computer and communications security</title>
				<meeting>the 2020 ACM SIGSAC conference on computer and communications security</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="377" to="390" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Weidinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mellor</surname></persName>
		</author>
		<author>
			<persName><surname>Rauh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.04359</idno>
		<title level="m">Ethical and social risks of harm from Language Models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-L</forename><surname>Boureau</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.447</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.447" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="6462" to="6481" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Dathathri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><surname>Lan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.02164</idno>
		<title level="m">Plug and Play Language Models: A Simple Approach to Controlled Text Generation</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><surname>Swayamdipta</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.acl-long.522" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</title>
		<title level="s">Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</editor>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="6691" to="6706" />
		</imprint>
	</monogr>
	<note>Long Papers</note>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.08295</idno>
		<title level="m">Detoxify Language Model Step-by-Step</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Gou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><surname>Gong</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.11738</idno>
		<title level="m">Critic: Large Language Models Can Self-Correct with Tool-Interactive Critiquing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Reward Modeling for Mitigating Toxicity in Transformer-Based Language Models</title>
		<author>
			<persName><forename type="first">F</forename><surname>Faal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Schmitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="8421" to="8435" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Quark: Controllable Text Generation with Reinforced [Un]learning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><surname>Hessel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">36th Conference on Neural Information Processing Systems (NeurIPS 2022)</title>
				<meeting><address><addrLine>NeurIPS</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Generating Sequences by Learning to Self-Correct</title>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><surname>West</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2211.00053</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1809.04113</idno>
		<title level="m">Detecting egregious responses in neural sequence-to-sequence models</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Red Teaming Language Models with Language Models</title>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><surname>Cai</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.emnlp-main.225" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3419" to="3448" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Probing Toxic Content in Large Pre-Trained Language Models</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ousidhoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><surname>Fang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</title>
		<title level="s">Long Papers</title>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4262" to="4274" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<monogr>
		<title level="m" type="main">Training language models to follow instructions with human feedback</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><surname>Jiang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2203.02155" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Werra</surname></persName>
		</author>
		<title level="m">Illustrating Reinforcement Learning from Human Feedback (RLHF), Hugging Face Blog</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.2000</idno>
		<title level="m">Notes on Kullback-Leibler Divergence and Likelihood</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b53">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><surname>Jiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.12688</idno>
		<title level="m">ProsocialDialog: A Prosocial Backbone for Conversational Agents</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b54">
	<analytic>
		<title level="a" type="main">Beyond Goldfish Memory: Long-Term Open-Domain Conversation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Szlam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.356</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.356" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5180" to="5197" />
		</imprint>
	</monogr>
	<note>(Volume 1: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b55">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><surname>Kundu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.08073</idno>
		<title level="m">Constitutional AI: Harmlessness from AI Feedback</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Ramamurthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ammanabrolu</surname></persName>
		</author>
		<author>
			<persName><surname>Brantley</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.01241</idno>
		<title level="m">Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">A Comparative Study of Using Pre-trained Language Models for Toxic Comment Classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hopfgartner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the Web Conference 2021</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="500" to="507" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution</title>
		<author>
			<persName><forename type="first">G</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">205</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), IEEE</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1597" to="1600" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">Bidirectional recurrent neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Paliwal</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:18375389" />
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Signal Process</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="2673" to="2681" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b62">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<title level="m">RoBERTa: A Robustly Optimized BERT Pretraining Approach</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b63">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.03654</idno>
		<title level="m">DeBERTa: Decoding-enhanced BERT with Disentangled Attention</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b64">
	<monogr>
		<title level="m" type="main">Toxic Comment Classification Challenge</title>
		<author>
			<persName><surname>Cjadams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sorensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Elliott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dixon</surname></persName>
		</author>
		<ptr target="https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<analytic>
		<title level="a" type="main">Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board</title>
		<author>
			<persName><forename type="first">A</forename><surname>Papasavva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zannettou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">De</forename><surname>Cristofaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Stringhini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the international AAAI conference on web and social media</title>
		<meeting>the international AAAI conference on web and social media</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="885" to="894" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.03109</idno>
		<title level="m">A Survey on Evaluation of Large Language Models</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b67">
	<analytic>
		<title level="a" type="main">DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ghazarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galstyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Peng</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.57</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.57" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="771" to="785" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b68">
	<analytic>
		<title level="a" type="main">GRUEN for Evaluating Linguistic Quality of Generated Text</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhat</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.findings-emnlp.9</idno>
		<ptr target="https://aclanthology.org/2020.findings-emnlp.9" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020</title>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="94" to="108" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
