<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Mitigating Toxicity in Dialogue Agents through Adversarial Reinforcement Learning</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Guillermo</forename><surname>Villate-Castillo</surname></persName>
							<email>guillermo.villate@tecnalia.com</email>
							<affiliation key="aff0">
								<orgName type="department">TECNALIA</orgName>
								<orgName type="institution">Basque Research and Technology Alliance (BRTA)</orgName>
								<address>
									<postCode>48160</postCode>
									<settlement>Derio, Bizkaia</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Borja</forename><surname>Sanz</surname></persName>
							<email>borja.sanz@deusto.es</email>
							<affiliation key="aff1">
								<orgName type="department">Faculty of Engineering</orgName>
								<orgName type="institution">University of Deusto</orgName>
								<address>
									<addrLine>Avenida de las Universidades 24</addrLine>
									<postCode>48007</postCode>
									<settlement>Bilbao</settlement>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Javier</forename><forename type="middle">Del</forename><surname>Ser</surname></persName>
							<email>javier.delser@tecnalia.com</email>
							<affiliation key="aff0">
								<orgName type="department">TECNALIA</orgName>
								<orgName type="institution">Basque Research and Technology Alliance (BRTA)</orgName>
								<address>
									<postCode>48160</postCode>
									<settlement>Derio, Bizkaia</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
							<affiliation key="aff2">
								<orgName type="institution">University of the Basque Country (UPV/EHU)</orgName>
								<address>
									<postCode>48013</postCode>
									<settlement>Bilbao</settlement>
									<country key="ES">Spain</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Mitigating Toxicity in Dialogue Agents through Adversarial Reinforcement Learning</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">4E96A8682BEDF9DA4FAE481762C53FC8</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T16:37+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Toxicity</term>
					<term>Alignment</term>
					<term>Large Language Models</term>
					<term>Reinforcement Learning</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Large Language Models (LLMs) have revolutionized dialogue agents, but they still suffer from biases, inconsistencies, and factual inaccuracies. This paper focuses on addressing toxicity, a critical aspect of the "Diversity, non-discrimination, and fairness" pillar of Trustworthy AI, in dialogue agents. We propose a methodology inspired by InstructGPT and ChatGPT to mitigate toxicity in chatbots by incorporating toxicity detection tools from industry leaders, such as Microsoft and Google Jigsaw, into a reward model. The reward model was extended with ToxDialogDefender, a context-aware toxic language identification model developed in this work. To evaluate our approach, we curate a dataset of 1.5 million comments, with 14.13% serving as successful adversarial examples, to induce toxicity in the BlenderBot 1 90M model. While our primary focus is on BlenderBot 1, our approach is applicable to models with similar Seq2Seq architectures. Experimental results demonstrate a substantial reduction in toxicity levels from 24% to 5%, as validated by a subset analysis. This research highlights the potential for integrating toxicity mitigation techniques into the training paradigm of dialogue agents, paving the way for more aligned and unbiased conversational AI systems.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Dialogue agents driven by open-domain chatbots <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b1">2]</ref> play a pivotal role in applications like restaurant reservations <ref type="bibr" target="#b2">[3]</ref>, healthcare <ref type="bibr" target="#b3">[4]</ref> and online shopping <ref type="bibr" target="#b4">[5]</ref>. More recent examples of general-purpose dialogue agents include ChatGPT <ref type="bibr" target="#b5">[6]</ref> and Llama 2 <ref type="bibr" target="#b6">[7]</ref>, which have been trained to follow societal norms. These models undergo training with extensive datasets from platforms like Reddit 1 , Twitter (currently X 2 ), and 4chan 3 , with examples including BlenderBot 1 <ref type="bibr" target="#b0">[1]</ref>, TwitterBot Tay <ref type="bibr" target="#b7">[8]</ref>, and Luda <ref type="bibr" target="#b8">[9]</ref>. However, these data sources are known for producing toxic content <ref type="bibr" target="#b9">[10,</ref><ref type="bibr" target="#b10">11,</ref><ref type="bibr" target="#b11">12]</ref>, leading to undesirable behaviors observed in the output of these models. Toxicity mitigation is a key task at a time when the research community is fervently engaged in AI alignment and in ensuring that AI adopts human principles such as respect, fairness, and non-discrimination <ref type="bibr" target="#b12">[13]</ref>.</p><p>This research focuses on mitigating toxic speech in dialogue agents, which has been defined repeatedly as rude, disrespectful, or unreasonable comments likely to disrupt conversations 4 , often related to gender, politics, race, or culture <ref type="bibr" target="#b13">[14]</ref>. 
Previous efforts aimed at reducing toxicity in dialogue agents include continuous curation of datasets <ref type="bibr" target="#b14">[15,</ref><ref type="bibr" target="#b15">16]</ref>, toxic behavior detection during text generation <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>, and safety layers <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19]</ref>. While effective, these approaches have limitations (𝐿):</p><p>• 𝐿 1 : The continuous curation of datasets is expensive, requiring human annotators at every stage. In addition, removing toxic comments may lead to the model generating unrealistic responses due to the consequent shortage of training data.</p><p>• 𝐿 2 : Current toxic content detectors do not take into account the conversation history at training time, thus lacking contextualization.</p><p>• 𝐿 3 : Safety layers mitigate toxicity during inference, but do not entirely eliminate it from the model's internal knowledge base. Moreover, toxicity detectors, known for their biases, can introduce unwanted biases <ref type="bibr" target="#b19">[20]</ref>. Additionally, since next token probabilities are conditioned on the toxic detector, this may lead to incoherent responses <ref type="bibr" target="#b20">[21]</ref>.</p><p>The aforementioned limitations of current methodologies lie in three distinctive areas: adversarial training data gathering (𝐿 1 ); contextualization of comments to mitigate false positives and negatives in context-sensitive comments (𝐿 2 ); and intrinsic removal of toxicity from model weights (𝐿 3 ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Motivation and Research Questions</head><p>To address chatbot toxicity limitations, we explore the following research questions (𝑅𝑄) in corresponding order of the limitations exposed above:</p><p>• 𝑅𝑄 1 : Are there any existing queries that drive chatbots to respond in a toxic manner?</p><p>• 𝑅𝑄 2 : How well do toxicity detectors perform in dialog contexts?</p><p>• 𝑅𝑄 3 : Can we eliminate toxic traits within the model without adding complexity to its architecture?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Main contributions</head><p>In this work we propose a novel methodology for mitigating toxicity in dialog agent-based models. This methodology addresses the three forms in which toxicity can manifest in a discussion: implicit toxicity, explicit toxicity, and toxicity detected within the dialog context. To the best of our knowledge, this work is the first to utilize such a unique approach in dialog settings based on our literature review. Additionally, our proposed dialog context-based toxicity detector is designed to assist in situations where isolated comments are insufficient for assessing the toxicity level, particularly in cases where the model responds affirmatively to toxic questions or statements. Furthermore, we expand the pool of adversarial examples introduced in <ref type="bibr" target="#b21">[22]</ref> for BlenderBot 1 by analyzing an additional 1.5 million examples. Finally, by leveraging Reinforcement Learning (RL), we are able to mitigate toxicity within the inner model weights.</p><p>Paper structure The article is organized as follows: Section 2 introduces fundamental concepts that are essential for understanding the terminology used in this research. Furthermore, it provides an overview of prior research of relevance to understand the contribution to the state of the art. Section 3 details the proposed methodology, whereas Section 4 describes the experimental setup and evaluation protocol used to assess its performance. Section 5 summarizes the main experimental outcomes, and Section 6 discusses key findings from our investigation and outlines future research directions.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Foundations and Background</head><p>Before detailing the proposed methodology, this section starts with some fundamental concepts concerning toxicity (Section 2.1), followed by a discussion of existing methods for detecting toxicity (Section 2.2) and the accompanying challenges. Subsequently, historical perspectives on toxicity within LLMs are examined in Section 2.3, together with an analysis of RL and its utilization in training chatbots (Section 2.4).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Toxicity Definition</head><p>Toxicity is a multifaceted term that continually evolves, shaped by the cultural contexts within which it develops. Although Perspective API (the leading toxicity detector, developed by Google Jigsaw) defines toxicity as stated in Section 1, this definition remains far from exhaustive. The work by Sheth et al. <ref type="bibr" target="#b22">[23]</ref> categorizes toxicity into groups including threats, obscenities, insults, identity-based hate, harassment, misinformation, radicalization, and gender-based violence. Assessing toxicity in dialogue contexts requires systems capable of identifying its diverse forms: explicit, implicit, and contextualized:</p><p>• Explicit toxicity: Conspicuously harmful content, including hate speech, profanity, threats, or direct insults, which requires no additional interpretation for recognizing its negative nature.</p><p>• Implicit toxicity: Content that lacks overtly harmful elements but may carry negative connotations, biases, or concealed meanings. It is characterized by the absence of explicit toxic language, like insults and slurs. The detection of this type of toxicity demands deeper analysis or cultural familiarity for recognition. This category may encompass subtle forms of discrimination, microaggressions, or insinuations.</p><p>• Toxicity within a context: The concept of toxicity in context refers to evaluating whether content is toxic or harmful based on the specific situation or circumstances in which it is presented. This assessment involves considering both the intent behind the content and the intended audience. It acknowledges that the same words or actions may have varying impacts depending on the context in which they are produced. This notion is crucial for dialogue agents, where the conversation history is needed for the analysis of toxic content.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Toxicity Detectors</head><p>The development of toxicity detection systems often relies on human annotations and machine learning techniques. In research, Google's Perspective API <ref type="bibr" target="#b23">[24]</ref> stands out for its ability to recognize characteristics beyond toxicity, including identity-based hate, profanity, and threats, among others <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b21">22]</ref>. Other examples include HateSonar <ref type="bibr" target="#b24">[25]</ref> and ToxiGen HateBERT <ref type="bibr" target="#b25">[26]</ref>, the latter being specialized in detecting implicit toxicity. A significant challenge in toxicity detectors is the existence of biases and their limited applicability in diverse contexts. Research on detectors, particularly that spearheaded by Perspective API, has revealed substantial biases, including gender bias <ref type="bibr" target="#b26">[27,</ref><ref type="bibr" target="#b27">28]</ref> and biases against minority groups <ref type="bibr" target="#b28">[29,</ref><ref type="bibr" target="#b29">30]</ref>. Biases often emerge from tainted datasets during annotation, exacerbated by a lack of heterogeneous participants <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b28">29]</ref> and the nature of the content inside the dataset at hand <ref type="bibr" target="#b31">[32]</ref>. Importantly, such biases tend to amplify when deployed in real-world applications, from data preprocessing to web content moderation <ref type="bibr" target="#b32">[33]</ref>.</p><p>Among predictive models that operate within a contextual framework, one line of work stands out for analyzing toxicity within a specific context by incorporating stance detection. 
This is particularly relevant for the demanding task of detecting implicit contextual toxicity in questions related to the model's stance <ref type="bibr" target="#b33">[34]</ref>. Much of the existing work in detecting toxicity within a given context revolves around assessing the necessity and appropriateness of such an analysis, termed context sensitivity estimation <ref type="bibr" target="#b34">[35]</ref>. However, even when annotations change with the observed context, the changes are not substantial enough to significantly affect the analyzed data <ref type="bibr" target="#b35">[36]</ref>. This idea is supported by another study, which suggests that depending on the type of data, context can be beneficial, but could potentially lead to an increase in false positives and false negatives overall <ref type="bibr" target="#b36">[37]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Toxicity in Language Models</head><p>Despite the versatility of LLMs, a primary concern remains the proliferation of toxicity, including the dissemination of harmful information <ref type="bibr" target="#b37">[38]</ref>, propagation of misinformation <ref type="bibr" target="#b38">[39]</ref>, and the generation of toxic comments <ref type="bibr" target="#b7">[8,</ref><ref type="bibr" target="#b8">9]</ref>. Dialogue agents based on generative open-domain chatbots, as mentioned in the introduction, prominently exhibit toxicity issues.</p><p>Previous research addresses toxicity in dialogue agents through various approaches, from i) creating iterative environments for stress-testing and improving chatbot responses through Supervised Learning (SL) <ref type="bibr" target="#b14">[15]</ref>, to ii) the incorporation of classifiers for identifying and filtering toxic content in chatbot-generated responses <ref type="bibr" target="#b0">[1]</ref>; and iii) the introduction of safety layers to prevent inappropriate queries <ref type="bibr" target="#b0">[1,</ref><ref type="bibr" target="#b18">19]</ref>. Toxic comments are also addressed through specially crafted datasets designed to elicit positive responses to toxic comments from both models and users <ref type="bibr" target="#b39">[40]</ref>. Other methods include attribute conditioning (ATCON) <ref type="bibr" target="#b16">[17]</ref>, a data-based method that further pretrains an LLM by prepending a toxicity attribute token (toxic or not toxic). By using the prepended token, the model learns the characteristics of toxic and non-toxic sentences, which allows toxicity to be reduced during decoding by conditioning on these tokens.</p><p>In decoding-based strategies, recent efforts have been focused on addressing toxicity during the next token prediction phase. 
We can divide the research activity into two general groups depending on the methodology under consideration: i) at generation time and ii) at training time. At generation time, Plug-and-Play LM <ref type="bibr" target="#b40">[41]</ref> is a decoding-based strategy that utilizes a simple discriminator to direct the generation process. Additionally, DExperts <ref type="bibr" target="#b41">[42]</ref> utilizes expert models (trained on non-toxic data) and anti-expert models (trained on toxic data) to guide the base LLM's generation process. This guidance aims to make the produced content closer to that generated by the expert LLM and further from that produced by the anti-expert, thereby minimizing the likelihood of producing toxic sentences as determined by the anti-expert. Other decoding strategies leverage the in-context learning and multitask learning capabilities of LLMs to steer the model away from generating toxic comments. One representative example of such methodologies is Detox-Chain <ref type="bibr" target="#b42">[43]</ref>, which uses a toxicity span detector to locate the toxic part of the comment. Once located, Detox-Chain masks the toxic part and generates a new comment using the mask-filling capabilities. This can be extended to use a foundational model instead of the same model. Another case is CRITIC <ref type="bibr" target="#b43">[44]</ref>, which resorts to external tools (e.g., Perspective API) to assess the toxicity of the comment, and then uses the in-context learning capabilities of the model to correct and generate a new comment.</p><p>At training time, methods primarily hinge on RL <ref type="bibr" target="#b44">[45]</ref> and quantization with controllable tokens <ref type="bibr" target="#b45">[46]</ref> to expose the model to fewer toxic comments and improve the quality of the generated content. 
In this line, the so-called SELF-CORRECT method <ref type="bibr" target="#b46">[47]</ref> uses a generator and a corrector to improve the generation by training the corrector to generate less toxic comments given a hypothesis and an input to be corrected.</p><p>An emerging research area involves the creation and utilization of adversarial examples to evaluate language model toxicity. Various methods, such as scrutinizing datasets to assess their comments' capacity to induce toxic attributes in models <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b21">22]</ref>, reveal that not only can toxic comments engender toxicity, but non-toxic comments can also exert a similar influence. An alternative approach focuses on generating adversarial prompts using search and optimization algorithms, guided by predefined malevolent word sets <ref type="bibr" target="#b47">[48]</ref>. Researchers leverage LLMs and prompt engineering to generate adversarial prompts, investigating the utility of training models through RL or SL to generate sentences leading to toxic content <ref type="bibr" target="#b48">[49]</ref>. Additional efforts consider using explicit names of social groups followed by benign actions to induce toxicity in masked language models <ref type="bibr" target="#b49">[50]</ref>.</p></div>
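As a brief illustration of the DExperts-style decoding mentioned above, the next-token logits of the base model can be steered toward an expert and away from an anti-expert by a simple linear combination. The following is a toy sketch with made-up logits and a 4-token vocabulary, not the authors' implementation:

```python
import numpy as np

def dexperts_logits(base, expert, anti_expert, alpha=2.0):
    """DExperts-style ensembling: steer next-token logits toward the
    non-toxic expert and away from the toxic anti-expert."""
    return base + alpha * (expert - anti_expert)

def softmax(z):
    z = z - z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy 4-token vocabulary; index 3 stands in for a toxic token.
base = np.array([1.0, 0.5, 0.2, 1.2])
expert = np.array([1.0, 0.6, 0.3, -2.0])   # expert assigns the toxic token low mass
anti = np.array([0.2, 0.1, 0.0, 3.0])      # anti-expert assigns it high mass

steered = softmax(dexperts_logits(base, expert, anti))
# The toxic token's probability drops relative to unsteered decoding.
assert steered[3] < softmax(base)[3]
```

The weight `alpha` controls how strongly generation is pushed away from the anti-expert's preferences; larger values reduce toxicity further at the cost of drifting from the base distribution.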
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.4.">Reinforcement Learning</head><p>Unlike the SL and unsupervised learning paradigms, RL involves an agent interacting with an environment, receiving rewards and observations as the result of its actions. In the context of RL, an environment is characterized by its Markovian property, meaning that its learning dynamics rely solely on the present state, disregarding past states or historical information. The primary aim within this framework is to achieve the highest attainable reward over an episode, with the focus on optimizing actions based on the current state.</p><p>As illustrated in Figure <ref type="figure" target="#fig_0">1</ref>, the fine-tuning process begins with two identical models exposed to an adversarial prompt dataset. The trained model 𝜋 PPO , or policy, generates an output evaluated using a reward function and contributes to the calculation of KL divergence in comparison to a reference model copy, 𝜋 base . The KL divergence is a measure of how one probability distribution diverges from a second, and is used in this context with 𝜆 KL to control the maximum divergence. The final reward is determined as the aggregate of these components. This resulting reward then informs the adjustment of the RL algorithm (in our case, Proximal Policy Optimization, PPO <ref type="bibr" target="#b50">[51]</ref>) to refine the policy's parameters <ref type="bibr" target="#b51">[52]</ref>.</p><p>RL applied in the domain of Natural Language Processing (NLP) is a relatively recent technique that has gained adoption. Within this framework, RL is customized to align with the unique components of NLP systems. Here, the environment dynamically adapts to the task at hand, which may involve representing a target model for attacking or utilizing a dataset for initializing observations to enhance a task's performance. 
Initially, both the base model 𝜋 base and the trained model or policy 𝜋 PPO receive the initial input, which could be a sentence requiring a response or a discriminative task to execute. The state reaches its finalization when the model selects the action to execute, such as predicting the next token or formulating the subsequent sentence in the case of generation tasks. If the reward is dense, the reward model generates a scalar indicative of the state's quality (i.e., the current sentence). This scalar is then incorporated into a policy constraint metric to ensure that the model remains within a reasonable deviation from its initial capabilities. Upon obtaining the reward, the policy update occurs based on the chosen algorithm. In the context of text generation, an episode typically refers to the process of generating a set of tokens or sentences until reaching the end-of-sequence token. Figure <ref type="figure" target="#fig_0">1</ref> illustrates the main RL process. For further clarity, the subsequent key terms are defined:</p><p>• In RL with NLP, the State Space 𝑆 depends on the generation process, which can involve either next sentence prediction or next token prediction given a sentence or a set of words. The dimensionality of the state space is equivalent to the size of the vocabulary raised to the power of the number of outputs.</p><p>• The Action Space 𝐴 comprises all tokens that can be used for next token prediction, corresponding to the vocabulary of the natural language model under consideration, or the next possible sentence.</p><p>• The Reward 𝑅 is typically a composite of a reward model and a policy change constraint. In the literature, the Kullback-Leibler (KL) <ref type="bibr" target="#b52">[53]</ref> divergence metric has been widely used as an asymmetric measure of similarity between two probability distributions. 
The reward signal can exhibit either sparsity, when provided at the sentence level, or density, when furnished at the token level, and can be formulated as:</p><formula xml:id="formula_0">𝑅 𝑡+1 = 𝑅 0 − 𝜆 • 𝑅 𝐾𝐿</formula><p>where 𝑅 0 is the immediate reward, 𝜆 is the KL penalty coefficient, a positive value (between 0 and 1) that weights the impact of the KL divergence in the training process, and 𝑅 𝐾𝐿 is the KL divergence value.</p></div>
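The shaped reward above can be sketched numerically. In this sketch the KL term is approximated per token by the log-probability ratio between the policy and the base model, a common practical estimator used here as an assumption rather than the paper's exact computation:

```python
import math

def kl_penalized_reward(r0, logp_policy, logp_base, lam=0.2):
    """R_{t+1} = R_0 - lambda * R_KL, with R_KL approximated by the sum of
    per-token log-probability ratios log pi_PPO(a|s) - log pi_base(a|s)."""
    r_kl = sum(lp - lb for lp, lb in zip(logp_policy, logp_base))
    return r0 - lam * r_kl

# Toy example: the policy assigns its generated tokens higher probability
# than the base model does, so the KL estimate is positive and the shaped
# reward ends up below the raw reward R_0 = 1.0.
logp_policy = [math.log(0.5), math.log(0.4)]
logp_base = [math.log(0.2), math.log(0.2)]
shaped = kl_penalized_reward(1.0, logp_policy, logp_base, lam=0.2)
assert shaped < 1.0
```

When the policy matches the base model exactly, the penalty vanishes and the shaped reward equals 𝑅 0 , which is why the coefficient 𝜆 effectively bounds how far fine-tuning can pull the policy from its starting point.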
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.5.">Summary and Contribution</head><p>Efforts to address toxicity in dialogue agents have been commendable, utilizing strategies like SL, content filtering classifiers, and safety layers. Decoding-oriented approaches have used word filtering, control tokens, two-dimensional representations, in-context learning combined with external tools, RL, and less toxic expert models to steer generation. A promising approach involves using adversarial examples to evaluate and counteract language model toxicity, providing valuable insights into the impact of both toxic and non-toxic comments. Contextual similarities between adversarial datasets and real-world social media content are noteworthy.</p><p>Contribution Our research tackles toxicity using a diverse array of experts within an RL environment.</p><p>Each expert focuses on one of the various forms toxicity can take: implicit toxicity, explicit toxicity, and contextualized toxicity. The latter is an innovative approach to toxicity mitigation, considering the current lack of robust detection models, despite ongoing efforts in dataset collections <ref type="bibr" target="#b10">[11]</ref>. As detailed in Section 3.1, we integrate all three forms of toxicity into a single model capable of assessing toxicity.</p><p>In addition, we curated a set of adversarial examples following the procedure of Si et al. <ref type="bibr" target="#b21">[22]</ref>, who observed notable contextual similarities between adversarial datasets and real-world social media content. Even though similar approaches have been used in the literature <ref type="bibr" target="#b44">[45]</ref>, the particularities of dialogue agents, such as the conversation history, substantially shape the methodological approach to the problem. This approach involves iterative changes guided by the policy, optimized through exposure to adversarial examples. 
The process effectively mitigates toxicity, with the reward function serving as an expert, enabling a more automated and efficient approach to addressing toxicity in dialogue agents. To the best of our knowledge, this research constitutes a pioneering effort and a new technical approach to this particular challenge.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Methods</head><p>In this section, we elucidate the methodology employed to mitigate toxicity by framing it as an RL problem, encompassing the entire process from data acquisition to model evaluation. The methodology is divided into a three-step process, as depicted in Figure <ref type="figure" target="#fig_1">2</ref>, the first of which is considered optional:</p><p>• Enhancement of the base model via SL: In this initial phase, the base model is trained using supplementary datasets through an SL paradigm. The objective here is to augment the model's capabilities, thereby enhancing its proficiency in handling specific tasks or adapting it to novel contexts. In our particular context, this stage serves the purpose of bolstering the model's capacity to generate contextually relevant responses within extended dialogue histories <ref type="bibr" target="#b54">[55]</ref>. Furthermore, it is employed to improve the model's aptitude for generating less toxic responses, while concurrently reducing the prevalence of generic responses, particularly in response to toxic comments <ref type="bibr" target="#b53">[54,</ref><ref type="bibr" target="#b55">56]</ref>.</p><p>• Formulation of the reward function or reward model: Within the realm of RL, it is essential to devise a reward function that facilitates the target task, as this function serves as the primary metric for evaluating the agent's performance. In the intersection of NLP and RL, it is conventional to construct a reward model capable of assessing various facets or a singular aspect that requires enhancement within the model <ref type="bibr" target="#b50">[51]</ref>. 
In our specific case, we have devised a composite reward model comprising three distinct sub-models, as described in Section 3.1.</p><p>• RL fine-tuning: In this third phase, we leverage a specialized framework tailored for training LLMs through RL, namely RL4LMs <ref type="bibr" target="#b56">[57]</ref>. This open-source software, developed by the Allen Institute for AI, is employed for our purposes. Within this framework, we apply the PPO algorithm, recognized as the state-of-the-art approach for such applications.</p><p>In Section 3.1, we delve into the genesis and functionality of the reward function. Moving on to Section 3.2, we elaborate on the process of gathering adversarial attack examples. Finally, in Section 3.3, we elucidate the criteria employed in selecting the RL training tool.</p></div>
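For intuition on the PPO update used in the third phase, the clipped surrogate objective at the core of the algorithm can be written as a scalar function of the policy-probability ratio and the advantage. This is a didactic sketch of the standard PPO objective, not the RL4LMs implementation:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate: take the more pessimistic of the raw and
    clipped policy-ratio terms, so overly large policy updates earn no
    extra reward."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)  # clip(ratio, 1-eps, 1+eps)
    return min(ratio * advantage, clipped * advantage)

# If the new policy raises an action's probability too much (ratio > 1+eps),
# the extra gain is clipped away ...
assert abs(ppo_clip_objective(1.5, 1.0) - 1.2) < 1e-9
# ... while the pessimistic min never hides a harmful update.
assert abs(ppo_clip_objective(0.5, -1.0) - (-0.8)) < 1e-9
```

In practice the policy gradient maximizes the expectation of this quantity over sampled generations; the clipping range `eps` plays a role analogous to the KL penalty of Section 2.4 in keeping 𝜋 PPO close to 𝜋 base .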
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Reward Function</head><p>As illustrated in Figure <ref type="figure">3</ref>, the reward function in this study is constructed upon three distinct models, each assigned a specific objective related to identifying various forms of toxicity. Google's Perspective API is tasked with detecting explicit toxicity, while implicit toxicity is discerned using Microsoft's ToxiGen HateBERT. Additionally, a model designed to evaluate toxicity within a conversational context, ToxDialogDefender <ref type="foot" target="#foot_0">5</ref> , is developed within the scope of this research to address specific knowledge gaps.</p><p>These knowledge gaps primarily encompass identifying non-toxic responses to toxic inputs, such as countering a toxic question or statement, as well as recognizing sarcasm or irony in responses to toxic inputs.</p><p>In the development of our toxicity detection model, capable of assessing toxicity within a given context, we have aligned our approach with the prevailing trends in the literature <ref type="bibr" target="#b57">[58,</ref><ref type="bibr" target="#b58">59]</ref>. Notably, we have leveraged state-of-the-art architectures including GRU <ref type="bibr" target="#b59">[60]</ref>, BiLSTM <ref type="bibr" target="#b60">[61]</ref>, BERT <ref type="bibr" target="#b61">[62]</ref>, and RoBERTa <ref type="bibr" target="#b62">[63]</ref>. Given our limited corpus of toxic instances, we have adopted the use of language representation models such as DistilBERT and RoBERTa due to their remarkable adaptability to new tasks <ref type="bibr" target="#b57">[58,</ref><ref type="bibr" target="#b58">59]</ref>. 
In addition to the aforementioned models, we conducted training with DeBERTa <ref type="bibr" target="#b63">[64]</ref>, which has exhibited superior performance in the SuperGLUE benchmark <ref type="foot" target="#foot_1">6</ref> and demonstrated enhanced contextual comprehension through its enhanced mask decoder and disentangled attention mechanism. The training datasets employed for these models encompassed the Dialogue Safety dataset <ref type="bibr" target="#b14">[15]</ref> and the Bot Adversarial <ref type="bibr" target="#b0">[1]</ref> dataset. During the training process, the dialogue context was incorporated as part of the input using special tokens, following the input schema "…".</p><p>During the formulation of the reward function, numerous uncertainties arose regarding the capabilities of different models, particularly those not developed as part of this research project. As mentioned earlier, the Perspective API has been reported to exhibit biases and challenges, especially concerning the underrepresentation of minority groups. Additionally, ToxiGen HateBERT has primarily been assessed for its performance in implicit toxicity detection. However, it has not been analyzed for explicit toxicity, despite being built upon another toxic detection model designed for detecting explicit toxicity.</p><p>To address and mitigate these uncertainties, we systematically collected two datasets closely related to the training data of the models under consideration, namely, the Toxic Comment Classification Challenge and ToxiGen <ref type="bibr" target="#b64">[65,</ref><ref type="bibr" target="#b25">26]</ref>, in addition to a third dataset unrelated to our specific models from Surge AI <ref type="foot" target="#foot_2">7</ref> . 
These datasets served as the foundational basis for our analysis, enabling us to assess the models' effectiveness in recognizing various aspects of toxicity, as well as to evaluate whether the Perspective API could predict implicit toxicity and whether ToxiGen HateBERT could predict explicit toxicity.</p><p>From each of the first two datasets we collected 30,000 comments, ensuring a balanced presence of toxicity. We then employed the Perspective API and ToxiGen HateBERT models for predictions, creating datasets that contained information about the capabilities of each model. The resulting dataset was partitioned into 80% training data and 20% test data. With this information in hand, our objective was to develop a function capable of leveraging the strengths of these models, namely implicit and explicit toxicity detection. To accomplish this, we trained three machine learning models to ensemble the outputs of the Perspective API and ToxiGen HateBERT models. These machine learning models were selected for their ability to represent learned patterns as rules or functions without adding computational overhead. This makes them well suited for ensemble modeling and thus for deciding when to choose the label of the Perspective API or of ToxiGen HateBERT.</p><p>In Figure <ref type="figure">3</ref>, we outline the methodology for combining the models. Initially, the responses are analyzed by the Perspective API and ToxiGen HateBERT, whose outputs are then ensembled into a binary label. If this label is deemed toxic, it is used as the final output; if not, the ToxDialogDefender model is employed to assess the comment considering the conversation history. We divide the process into two stages, as each focuses on specific cases where the other models are less effective. Although various types of toxicity have been mentioned, we only account for overall toxicity, as providing distinct scalar values for every case could mislead the model in the RL training process. 
Hence, the output of the reward function is either -1 or +1, with -1 assigned to toxic comments and +1 assigned to non-toxic comments. </p></div>
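The two-stage scoring just described can be sketched as follows. This is a minimal illustration, not the production pipeline: the classifier arguments are hypothetical callables standing in for the Perspective API, ToxiGen HateBERT, the ensemble, and ToxDialogDefender, each returning a toxicity probability in [0, 1].

```python
def reward(history, response, perspective, toxigen, ensemble, tox_dialog_defender,
           threshold=0.5):
    """Two-stage reward: -1 for toxic responses, +1 for non-toxic ones.

    All classifier arguments are hypothetical stand-ins for the real
    models, returning a toxicity probability in [0, 1].
    """
    # Stage 1: context-free analysis of the response in isolation.
    p_explicit = perspective(response)   # explicit toxicity score
    p_implicit = toxigen(response)       # implicit toxicity score
    if ensemble(p_explicit, p_implicit) >= threshold:
        return -1
    # Stage 2: contextual analysis over the conversation history.
    if tox_dialog_defender(history, response) >= threshold:
        return -1
    return +1
```

In the actual system the ensemble is a learned model (a logistic regression over the two scores) rather than a fixed rule, and the threshold corresponds to its decision boundary.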
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">Data Acquisition</head><p>The most critical aspect of model tuning lies in the data, which, in this case, extends beyond naive data acquisition. We require adversarial examples with the potential to elicit toxicity from our models. This is not straightforward due to the complexity of the models, their lack of interpretability, and the vast amount of data they are trained on, which makes it unfeasible to accurately predict the learned response distribution to toxic and non-toxic inputs. Our analysis focuses on adversarial examples that induce toxicity in the model, particularly entries that are themselves non-toxic. Bearing this in mind, we have chosen to adhere to the guidelines outlined in <ref type="bibr" target="#b21">[22]</ref>, which successfully identified adversarial examples for the BlenderBot 1 90M model. Given that the dataset was not publicly accessible during the course of our investigation, we undertook the task of replicating their entire process, albeit with some modifications as indicated in Figure <ref type="figure" target="#fig_3">4</ref>. The dataset was retrieved from an internet forum known as 4chan, specifically the Politically Incorrect board <ref type="bibr" target="#b65">[66]</ref>. The adjustments made are listed next:</p><p>1. Instead of relying solely on the Perspective API to acquire adversarial examples, we opted to employ our custom reward function.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>2.</head><p>In line with the conclusions drawn in the original article, we expanded the list of terms that are likely to trigger toxicity, including -but not limited to -groups and identities such as Hindus, Buddhists, LGBTQ+ individuals, the disabled, religious denominations like Mormons and Jehovah's Witnesses, and news organizations like Fox, CNBC, MSNBC, BBC and Sky News.</p><p>In the original work <ref type="bibr" target="#b21">[22]</ref>, the authors meticulously curated a sample of one million entries from the dataset. In our study, we closely followed their methodology and incorporated our refinements to initially narrow down the extensive dataset comprising 139 million comments to a more computationally manageable subset of 12 million comments. This initial reduction was done to conserve computational resources and to focus on acquiring the potential adversarial example subset, given that less than 9% of the data analyzed by Si et al. <ref type="bibr" target="#b21">[22]</ref> were deemed capable of generating toxicity. Subsequently, with this streamlined dataset at our disposal, we partitioned it into discrete chunks, each containing half a million comments, for in-depth analysis employing our proposed reward function.</p><p>The selection of these comment chunks was methodically guided by the empirical observation in Si et al. <ref type="bibr" target="#b21">[22]</ref> that comments exhibiting scores below the 0.3 threshold displayed an elevated propensity to incite toxic interactions. In addition, that study observed that comments scoring between 0.6 and 1.0 also harbored the potential for generating toxicity. These selected comments were input into both our BlenderBot 1 base model and our model fine-tuned via SL. In both instances, we utilized a greedy search algorithm for the generation of response text. 
Ultimately, we conducted a thorough analysis of 1.5 million comments twice in our proposed pipeline: first when employing the base BlenderBot 1 model, and subsequently when using the BlenderBot model fine-tuned in Step 1.</p></div>
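The chunked filtering described above can be sketched as follows; `score_fn` is a hypothetical stand-in for our reward-function scorer, and the thresholds follow the empirical observations of Si et al. [22] cited in the text.

```python
from itertools import islice

def candidate_filter(comments, score_fn, low=0.3, mid=0.6, high=1.0):
    """Keep comments whose toxicity score suggests adversarial potential:
    below `low`, or between `mid` and `high`, per the thresholds reported
    by Si et al. `score_fn` is a hypothetical scorer returning [0, 1]."""
    for text in comments:
        s = score_fn(text)
        if s < low or mid <= s <= high:
            yield text

def chunked(iterable, size=500_000):
    """Split the candidate stream into fixed-size chunks for batched
    analysis by the reward function."""
    it = iter(iterable)
    while chunk := list(islice(it, size)):
        yield chunk
```

With half-million-comment chunks, the 12 million pre-filtered comments map to the incremental analysis described in the text.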
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">RL Fine-Tuning</head><p>Once all the components were meticulously crafted, the only task remaining was to select a framework for implementing RL in LLMs. Throughout 2022 and 2023, several frameworks emerged from the literature, primarily in response to the tremendous success of ChatGPT, such as RL4LMs <ref type="foot" target="#foot_3">8</ref> or TRL <ref type="foot" target="#foot_4">9</ref>, among others. Given the continued development of these tools, we specifically opted for approaches that aligned with our criteria:</p><p>1. We aimed for projects with at least one year of development history, coupled with ongoing contributions from developers.</p><p>2. The chosen tools needed to be focused on LLMs rather than replicating specific instances like GPT 3.5.</p><p>3. These tools should be developed by individuals with a research-oriented mindset, either to demonstrate the viability of this approach for various tasks or to create research tools for public use.</p><p>With these criteria in mind, we selected TRL and RL4LMs. TRL, developed by Hugging Face, offers the PPO algorithm as its main RL method. TRL operates at the sentence level and is compatible with various architectures. However, one drawback is the computation of the KL divergence, which has been reported as an unsolved issue <ref type="foot" target="#foot_5">10</ref>. During model training, the KL value tended to become increasingly negative over time, resulting in undesirable outcomes. This issue was particularly pronounced for Seq2Seq models.</p><p>Considering these concerns, we turned to RL4LMs, which is characterized by a more RL-focused design. In this framework, as elucidated in Section 2.4, actions refer to the vocabulary, and the state is generated in each iteration (next-token prediction) by the policy, which is the natural language model. 
Rewards can be computed at the token or sentence level, with the latter being our preferred choice. One limitation is that data is processed one item at a time; even though parallel processing is possible, this approach is computationally expensive for large datasets, as is the case here. Conversely, RL4LMs offers multiple algorithmic options, including PPO among others.</p></div>
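The sentence-level, KL-constrained reward used throughout this setup (see Figure 1) can be illustrated with a simplified sketch. We assume access to the per-token next-token distributions of the policy and the frozen reference model; real frameworks such as TRL or RL4LMs typically approximate the KL term from the log-probabilities of the sampled tokens instead.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) between two discrete distributions over the vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def shaped_reward(task_reward, policy_dists, ref_dists, kl_coef=0.2):
    """Sentence-level reward penalized by the policy's divergence from the
    frozen reference model, as in KL-regularized PPO fine-tuning.
    `policy_dists` / `ref_dists` are per-token next-token distributions;
    `kl_coef` matches the KL coefficient of 0.2 used in our experiments."""
    kl = sum(kl_divergence(p, q) for p, q in zip(policy_dists, ref_dists))
    return task_reward - kl_coef * kl
```

When the policy has not moved from the reference, the penalty vanishes and the shaped reward equals the raw reward; as the policy drifts, the penalty grows and discourages large deviations from the model's initial capabilities.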
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Experimental Setup and Evaluation Protocol</head><p>A set of experiments was designed to rigorously assess the efficacy of different methodologies in mitigating toxicity within textual data. We carefully formulated several experiment cases with the primary objective of analyzing their impact on the final results of toxicity mitigation. In our exploration we focused on various data distributions (toxic/non-toxic ratios) and text generation methods, considering the computational demands inherent to RL-based training. To manage computational resources effectively, we opted to work with a subset of our dataset, containing between 100,000 and 140,000 items. Each dataset was partitioned into an 80% training subset and a 20% test subset. For consistency, the same test set was employed across all experiments to ensure a fair evaluation of the different methodologies. Within the training dataset, we categorized instances into toxic and non-toxic data.</p><p>The training set was sampled with different non-toxic/toxic distribution ratios: an 80/20 split, preserving the observed toxicity distribution in our dataset, and a balanced 50/50 split. Additionally, we further split the toxic data into the three forms of toxicity, also preserving the distribution observed in our adversarial examples dataset: 35% implicit toxicity, 32% explicit toxicity, and 33% toxicity given a context. These splits were chosen because they are balanced enough to obtain not only a good representation of the original dataset but also a diverse number of examples. Regarding decoding strategies, we chose two approaches: deterministic decoding, also known as greedy search decoding, and a probabilistic decoding strategy, multinomial sampling. These selections were made to observe the differences in toxicity mitigation between a deterministic technique and a more diverse one. 
The experiments are outlined as follows, along with their respective objectives:</p><p>1. Greedy search decoding with the 80/20 distribution: This experiment investigates the effect of utilizing a subset of data that mirrors the distribution of toxicity in our dataset. Specifically, we assess the utility of the greedy search decoding strategy in the training process under this distribution.</p><p>2. Multinomial sampling decoding with the 80/20 distribution: This experiment examines how the training process is influenced by employing a probabilistic technique on the previously analyzed text set, shedding light on the effectiveness of multinomial sampling in mitigating toxicity within the dataset.</p><p>3. Greedy search decoding with the 50/50 distribution: This experiment analyzes the impact on toxicity mitigation of using a balanced dataset in combination with the greedy search decoding strategy, providing insights into the effectiveness of different distribution ratios in achieving toxicity balance.</p><p>In the experiments, several parameters remained fixed, falling within the ranges used in Ramamurthy et al. <ref type="bibr" target="#b56">[57]</ref>. These parameters include setting the number of epochs per rollout to 4, configuring the number of steps per epoch to 12,800, and maintaining a learning rate of 10⁻⁶; the learning rate was set following the guidelines of the BlenderBot article <ref type="bibr" target="#b0">[1]</ref>. In terms of the KL divergence parameters, we set the KL coefficient to a fixed value of 0.2 and the KL target to 0.5; these settings were obtained empirically. Additionally, we employed a batch size of 16 and conducted 100 epochs of the training process, allowing the model to learn and adapt over multiple training cycles. 
The batch size was chosen due to computational constraints, and 100 epochs proved empirically sufficient, as beyond that point the model deviated too far from its initial probability distribution.</p><p>For the greedy search approach, we adopted the default parameters provided by Hugging Face, as these parameters consistently yielded the best outcomes. Conversely, for the multinomial sampling strategy, we configured the parameters with a top-k value of 20, a temperature of 0.7, and a single beam. These parameter choices were made based on an empirical analysis of the responses generated by BlenderBot, and partly following the experimental setup presented in <ref type="bibr" target="#b20">[21]</ref>, where the best parameters to mitigate the prevalence of toxicity were identified.</p></div>
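The multinomial sampling strategy with the parameters above (top-k of 20, temperature 0.7, a single beam) can be sketched over raw logits as follows; this is an illustrative re-implementation of the standard decoding recipe, not the Hugging Face code path.

```python
import math
import random

def sample_next_token(logits, top_k=20, temperature=0.7, rng=random):
    """Multinomial sampling with the parameters used in our experiments:
    restrict to the `top_k` most likely tokens, sharpen with `temperature`,
    then draw from the renormalized distribution. Greedy decoding is the
    limiting case that always takes the argmax."""
    # Keep the top-k (token_id, logit) pairs, highest logit first.
    top = sorted(enumerate(logits), key=lambda kv: kv[1], reverse=True)[:top_k]
    # Temperature-scaled softmax over the kept logits (max-subtracted
    # for numerical stability).
    scaled = [(i, l / temperature) for i, l in top]
    m = max(l for _, l in scaled)
    weights = [math.exp(l - m) for _, l in scaled]
    ids = [i for i, _ in scaled]
    return rng.choices(ids, weights=weights, k=1)[0]
```

A temperature below 1 sharpens the distribution toward the most likely tokens, while the top-k cutoff removes the long low-probability tail, which keeps responses diverse without drifting into unlikely continuations.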
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1.">Evaluation</head><p>Evaluating conversational agents is challenging, typically involving human annotators and evaluators to obtain reliable insights. Recent advancements in LLMs in 2023 have led to the exploration of automatic metrics <ref type="bibr" target="#b66">[67]</ref>. Traditional metrics like METEOR, BLEU, and ROUGE lack depth in capturing meaning, word order, output correctness, and coherence <ref type="bibr" target="#b66">[67]</ref>.</p><p>When assessing our experiments, we considered toxicity as well as the model's ability to produce coherent, grammatically correct, and non-redundant outputs. Since training alters the model's word probabilities within a contextual framework, verifying that it could still generate accurate sentences was crucial. Post-training evaluations leveraged metrics unrelated to toxicity to provide additional context; these metrics were not integrated during the training phase due to their context-specific nature. This approach was adopted to gain a more comprehensive and contextual understanding of the results, particularly in aspects not directly tied to toxicity. We provide an overview of the two metrics utilized:</p><p>• DEAM <ref type="bibr" target="#b67">[68]</ref> assesses response coherence at the conversation level using abstract meaning representation. Trained to classify coherence, its score ranges from 0 (not coherent) to 1 (coherent).</p><p>• GRUEN <ref type="bibr" target="#b68">[69]</ref> evaluates grammaticality, non-redundancy, and topic maintenance. Its techniques include sentence likelihood, grammatical acceptability, and Word Mover similarity. GRUEN's total score ranges from 0 to 1: a score of 0 indicates a sentence that is grammatically incorrect and redundant, while a score of 1 indicates a grammatically correct and coherent sentence.</p><p>The median of each metric, DEAM and GRUEN, was computed over the generated text in the test subset to mitigate the impact of outliers.</p></div>
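Aggregating the per-response scores then reduces to a median over the test subset; in this sketch, the `deam` and `gruen` keys are hypothetical field names for the two metric outputs.

```python
from statistics import median

def aggregate_metrics(scored_outputs):
    """Summarize per-response DEAM and GRUEN scores over the test set.
    The median is used instead of the mean to damp the effect of outlier
    generations. `scored_outputs` is a list of dicts with hypothetical
    'deam' and 'gruen' keys holding scores in [0, 1]."""
    return {
        "deam": median(item["deam"] for item in scored_outputs),
        "gruen": median(item["gruen"] for item in scored_outputs),
    }
```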
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Results</head><p>In this section, we present the main results obtained in every step, following the experimental design described in Section 4 and outlining its connections to the research questions. We focus on the most important outcomes and also provide examples of how the model changed its interaction with the same adversarial examples before and after each training session. In Section 5.1, we show the performance of each model that constitutes the reward model. Subsequently, in Section 5.2, we describe the adversarial example dataset obtained. Finally, in Section 5.3, we showcase the results of the proposed methodology in mitigating toxicity on BlenderBot 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Reward Function</head><p>As mentioned in Section 3.1, the reward function comprises three models, one of which, named ToxDialogDefender, was specifically tailored to identify toxicity within a given context. Throughout the development of this model, we tested several base models to determine which one excelled at this task, addressing research question 𝑅𝑄 2 . The models assessed included DeBERTa, RoBERTa, and DistilBERT.</p><p>As shown in Table <ref type="table" target="#tab_1">1</ref>, DeBERTa demonstrated superior performance across both the validation and test sets. Consequently, it became the foundation of our ToxDialogDefender model. In general, transformer-based models consistently proved effective in detecting toxicity within a dialogue context. We conducted an analysis to understand why the model could not accurately predict some examples, aiming to uncover patterns and explanations for its inaccuracies in discerning toxicity in such comments. We conducted topic extraction to gain insights into the topics the model struggled to predict accurately, such as its difficulty in detecting affirmations of a toxic comment as a form of toxic response, exemplified by "being at one with, let's say, males shouldn't exist". Additionally, we evaluated sentence length to understand whether the model faced challenges with long or short sentences, potentially due to a lack of context understanding or misleading context. Lastly, we employed sentence embeddings to identify patterns by grouping sentences into clusters. However, none of the mentioned methods resulted in significant information gain due to the diversity of the dataset. After reviewing the outcomes of our toxicity detector, we assessed how effectively the Perspective API and ToxiGen HateBERT models adapted to distinct predictive contexts. 
Table <ref type="table" target="#tab_3">2</ref> reveals different predictive capabilities arising from the different natures of the datasets. We used the outputs of these models as inputs to an ensemble model, thereby harnessing the complementary knowledge embedded in the combined models. In this context, the test dataset comprises a combination of ToxiGen and Jigsaw entries, while the validation dataset consists exclusively of the Surgei dataset, as it was not used during the training phase of our ensemble model.</p><p>The outcomes of this ensemble model are shown in Table <ref type="table" target="#tab_4">3</ref>. Logistic regression surpassed the performance of the base models on each of the datasets, and performed notably better than the other ensemble models on the validation set, whose domain differs from that of the ensemble training data. From these results, we derived the function representing the logistic regression model:</p><formula xml:id="formula_1">log (𝑃 / (1 − 𝑃)) = −0.99 + 5.46 𝜉_PERS + 1.09 𝑂_T − 2.08 𝑂_NT</formula><p>where 𝑂_NT and 𝑂_T denote the outputs of the ToxiGen HateBERT model for the Non-Toxic and Toxic classes, respectively; 𝜉_PERS denotes the output of the Perspective API model; and 𝑃 denotes the probability of a text being toxic, given the values of 𝑂_NT, 𝑂_T, and 𝜉_PERS.</p></div>
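Plugging the fitted coefficients (intercept −0.99, Perspective weight 5.46, ToxiGen Toxic weight 1.09, ToxiGen Non-Toxic weight −2.08) into the logistic function gives the ensemble's toxicity probability directly; a sketch, with a 0.5 decision threshold assumed:

```python
import math

def ensemble_probability(xi_pers, o_t, o_nt):
    """Toxicity probability from the fitted logistic-regression ensemble:
    log(P / (1 - P)) = -0.99 + 5.46*xi_pers + 1.09*o_t - 2.08*o_nt,
    where xi_pers is the Perspective API score and o_t / o_nt are the
    ToxiGen HateBERT outputs for the Toxic and Non-Toxic classes."""
    z = -0.99 + 5.46 * xi_pers + 1.09 * o_t - 2.08 * o_nt
    return 1.0 / (1.0 + math.exp(-z))
```

Note how the large positive weight on the Perspective score dominates the decision when explicit toxicity is present, while a confident Non-Toxic output from ToxiGen HateBERT pulls the probability down.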
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Data Acquisition</head><p>In this section, we gather queries that prompt BlenderBot to generate toxic responses, addressing research question 𝑅𝑄 1 . To achieve this, we curated a dataset consisting of 1.5 million comments capable of eliciting toxicity from both our base model and our model trained through SL. The data collection process was carried out in increments of half a million comments, guided by our reward function. In total, 1.5 million comments were collected and assessed for toxicity for each of the models: the base BlenderBot model and the fine-tuned BlenderBot model. As shown in Figure <ref type="figure" target="#fig_4">5</ref>, 15.4% of the comments exhibited the ability to provoke toxicity in the base model. Within this subset, only 0.04% were flagged by the Perspective API, while 6.85% were identified by ToxiGen HateBERT. Remarkably, 8.53% were detected by our proposed ToxDialogDefender toxicity detector. In the case of the model that underwent SL, 14.13% of the comments were recognized as toxic according to our reward function. Among these, 4.45% were detected by the Perspective API, 5.03% by ToxiGen HateBERT, and 4.64% by ToxDialogDefender.</p><p>Upon closer examination, it becomes evident that the initial phase of our methodology effectively reduced the proportion of toxicity-eliciting comments, from 15.4% for the base model to 14.13% for the SL-trained model. This preparatory process generated an ample number of adversarial examples, totaling approximately 210,000 elements. These examples laid the groundwork for the subsequent RL task, which was designed to address and mitigate toxicity in the BlenderBot 1 90M model. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">RL Fine-Tuning</head><p>In this section, we present our findings regarding 𝑅𝑄 3 , which focuses on mitigating toxicity in dialogue agents without altering the internal model structure. During the training phase, evaluations were conducted every 10 epochs for each experiment. In these evaluations, we generated a new test dataset using the updated model parameters, and employed greedy search as our decoding strategy to observe changes in probabilities. Once the data was prepared, we applied the metrics described in Section 4. This systematic approach enabled us to track the model's progress throughout training. Over the course of 100 training epochs, the model showed significant improvement in learning the optimal policy, with most of the gains occurring between epochs 10 and 20. Beyond that point, however, it began to display signs of overfitting, replicating patterns from the training data and paying less attention to the input. One notably repetitive pattern was the following: "I'm not sure what you're talking about. What do you mean by ... ?". Nevertheless, as shown in Table <ref type="table" target="#tab_5">4</ref>, toxicity was considerably reduced when compared to the initial values in our test dataset. In Figure <ref type="figure" target="#fig_6">6</ref>, we present both positive and negative examples of our model's output at different training stages. We term them 'positive examples' to showcase how the model successfully addressed toxicity issues after training; by 'negative examples', we mean that the model still exhibited some level of toxicity even after the two training processes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusions and Future Research</head><p>This article introduces a methodology inspired by recent advancements in RL and LLMs aimed at mitigating toxicity in dialogue agents. To achieve this goal, we leverage three toxicity detectors, each specialized in identifying one of the three forms of toxicity that can manifest in dialogue settings: implicit toxicity, explicit toxicity, and toxicity given a context. These toxicity detectors constitute the reward model, tasked with evaluating the sentences generated by the LLM. Experiments were conducted using BlenderBot 1, recognized for its proficiency in crafting toxic comments <ref type="bibr" target="#b21">[22]</ref>. The model underwent training via SL on datasets from <ref type="bibr" target="#b53">[54,</ref><ref type="bibr" target="#b55">56]</ref>, resulting in a reduction in toxicity and the promotion of prosocial responses to toxic comments. Additionally, adversarial examples from 4chan were collected for both the base and SL models, totaling around 210,000 entries. In the RL training process, the LLM generates a response to the adversarial data using two decoding strategies: deterministic and probabilistic. Once the response is generated, it is evaluated by the reward model and constrained by the Kullback-Leibler divergence to prevent significant deviation from the model's initial capabilities. Finally, the composite reward is used to update the model weights using the PPO algorithm.</p><p>Findings: Our exploration of decoding strategies and data distributions yielded several insights. Firstly, adopting a non-deterministic sampling approach was crucial for creating a less toxic model while maintaining diversity in responses, in contrast to deterministic sampling. Another significant finding concerned the setting of the KL coefficient, a key factor in controlling model divergence. 
When assessing the generated text, we utilized a range of metrics covering coherence, grammatical correctness, informativeness, and engagement, going beyond traditional toxicity-related measures. Our methodology achieved a substantial reduction in toxicity, from 24% to 5%, while preserving the initial coherence and grammatical correctness as assessed by the DEAM and GRUEN metrics, leading to outputs that are more aligned and user-friendly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Limitations</head><p>Our training relies on toxicity detection models, which are susceptible to false positives and bias. The Perspective API, as highlighted in Table <ref type="table" target="#tab_3">2</ref>, particularly struggles with implicit toxicity, giving more weight to toxic words than to the surrounding context. Although we generally encountered few false negatives from ToxiGen HateBERT and our toxicity detector, occasional mispredictions emphasize the need for further refinement. An increase in false positives might not significantly impact RL training unless they substantially exceed correct predictions; however, caution is essential, as the model could adjust its behavior to game the classifiers or our reward function, potentially resulting in unclear or nonsensical text.</p><p>Another limitation of our research is the evolving definition of toxicity, which poses a challenge, especially with the application of LLMs in diverse cultural contexts. Lastly, the ongoing evaluation of dialogue agents remains challenging, with human annotation struggling to keep pace. Automatic metrics, such as toxicity detectors and evaluations based on artificial models, are limited and can introduce biases and erroneous assessments that may impact the quality of the results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Future Work</head><p>We plan to expand the approach presented herein by enhancing several key components: toxicity detectors, evaluation metrics, and adversarial examples. Regarding toxicity detection models, our aim is to conduct a more comprehensive evaluation to understand their strengths and biases for further improvement. For evaluation metrics, we are actively working on developing a comprehensive evaluation framework for dialogue agents, addressing critical aspects to enhance reliability. As adversarial examples were found to be crucial in the development of this research, we plan to expand and enhance the quality of our dataset, supporting follow-up studies. Finally, we will broaden our experimentation with different decoding strategies and their outcomes to bolster training robustness and, ultimately, to improve and better align dialogue agents.</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 1 :</head><label>1</label><figDesc>Figure 1: Fine-Tuning Language Models using RL: The fine-tuning process begins with two identical models exposed to an adversarial prompt dataset. The trained model 𝜋 PPO , or policy, generates an output evaluated using a reward function and contributes to the calculation of the KL divergence in comparison to a frozen reference copy, 𝜋 base . The KL divergence is a measure of how one probability distribution diverges from a second; in this context it is weighted by 𝜆 KL to control the maximum divergence. The final reward is determined as the aggregate of these components. This resulting reward then informs the adjustment of the RL algorithm (in our case, Proximal Policy Optimization, PPO <ref type="bibr" target="#b50">[51]</ref>) to refine the policy's parameters <ref type="bibr" target="#b51">[52]</ref>.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_1"><head>Figure 2 :</head><label>2</label><figDesc>Figure 2: Methodology of the training process. In Step 1, the model undergoes a fine-tuning process on the Prosocialdialog dataset [54] and the Bot-adversarial dialogue dataset [55]. In Step 2, the formulation and training process of the reward model are conducted. Finally, in Step 3, adversarial examples and the reward model are utilized in the RL training.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_3"><head>Figure 4 :</head><label>4</label><figDesc>Figure 4: Adversarial examples gathering process. The diagram shows the adversarial examples gathering process; the model used was the BlenderBot model fine-tuned in Step 1.</figDesc><graphic coords="9,249.54,179.72,68.15,56.79" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_4"><head>Figure 5 :</head><label>5</label><figDesc>Figure 5: Percentage of toxic comments analysed by each component of the reward function in the 1.5 million dataset of possible adversarial examples.</figDesc><graphic coords="14,161.01,224.14,270.78,171.66" type="bitmap" /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_6"><head>Figure 6 :</head><label>6</label><figDesc>Figure 6: Response examples of each model after Step 1 and Step 3 training. (Left) A positive example in which the base model's response has been improved after the RL training process; (Right): a negative example.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="tab_0"><head>Figure 3 :</head><label>3</label><figDesc>Figure 3: Reward model prediction formulation. The toxicity score is obtained by first analyzing the comment in isolation using the Perspective API and ToxiGen HateBERT, ensuring that obvious cases of toxicity are caught early. The output of each model is then ensembled and analyzed, potentially reducing false positives that might arise from relying on a single model. If the comment is not deemed toxic, our contextual toxicity detector, ToxDialogDefender, predicts the final score. ToxDialogDefender can analyze the broader conversational context, understanding nuances and detecting toxicity that might not be apparent in isolated comments.</figDesc></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_1"><head>Table 1</head><label>1</label><figDesc>Training results of different base models in the toxicity detection task given a conversation context.</figDesc><table><row><cell>Base Model</cell><cell>Dataset</cell><cell>F1-Score</cell><cell>Accuracy</cell></row><row><cell>DistilBERT</cell><cell>Test</cell><cell>0.759</cell><cell>0.853</cell></row><row><cell></cell><cell>Validation</cell><cell>0.764</cell><cell>0.856</cell></row><row><cell>RoBERTa</cell><cell>Test</cell><cell>0.753</cell><cell>0.856</cell></row><row><cell></cell><cell>Validation</cell><cell>0.750</cell><cell>0.856</cell></row><row><cell>DeBERTa</cell><cell>Test</cell><cell>0.794</cell><cell>0.869</cell></row><row><cell></cell><cell>Validation</cell><cell>0.795</cell><cell>0.871</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 2</head><label>2</label><figDesc>Performance metrics for Perspective API and ToxiGen HateBERT on different toxicity datasets.</figDesc><table><row><cell>Base Model</cell><cell>Dataset</cell><cell>F1-Score</cell><cell>Accuracy</cell></row><row><cell>Perspective API</cell><cell>Jigsaw</cell><cell>0.92</cell><cell>0.91</cell></row><row><cell></cell><cell>ToxiGen</cell><cell>0.58</cell><cell>0.62</cell></row><row><cell></cell><cell>Surge AI</cell><cell>0.83</cell><cell>0.82</cell></row><row><cell>ToxiGen HateBERT</cell><cell>Jigsaw</cell><cell>0.82</cell><cell>0.82</cell></row><row><cell></cell><cell>ToxiGen</cell><cell>0.88</cell><cell>0.87</cell></row><row><cell></cell><cell>Surge AI</cell><cell>0.80</cell><cell>0.80</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_4"><head>Table 3</head><label>3</label><figDesc>Performance metrics of the machine learning algorithms as ensemble in comparison with the Perspective API and ToxiGen HateBERT models.</figDesc><table><row><cell>Base Model</cell><cell>Dataset</cell><cell>F1-Score</cell><cell>Accuracy</cell></row><row><cell>Perspective API</cell><cell>Test</cell><cell>0.75</cell><cell>0.765</cell></row><row><cell></cell><cell>Validation</cell><cell>0.83</cell><cell>0.82</cell></row><row><cell>ToxiGen HateBERT</cell><cell>Test</cell><cell>0.85</cell><cell>0.845</cell></row><row><cell></cell><cell>Validation</cell><cell>0.80</cell><cell>0.80</cell></row><row><cell>Decision Tree</cell><cell>Test</cell><cell>0.88</cell><cell>0.875</cell></row><row><cell></cell><cell>Validation</cell><cell>0.859</cell><cell>0.86</cell></row><row><cell>Logistic Regression</cell><cell>Test</cell><cell>0.875</cell><cell>0.88</cell></row><row><cell></cell><cell>Validation</cell><cell>0.865</cell><cell>0.86</cell></row><row><cell>Ripper</cell><cell>Test</cell><cell>0.523</cell><cell>0.38</cell></row><row><cell></cell><cell>Validation</cell><cell>0.506</cell><cell>0.37</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_5"><head>Table 4</head><label>4</label><figDesc>Comparison of BlenderBot fine-tuned model with the model after each experiment conducted with RL. The toxicity scores displayed are from the test set, along with DEAM and GRUEN metrics, assessing the models' capacities in coherence and grammatical correctness.</figDesc><table><row><cell>Experiments</cell><cell cols="3">Toxicity (%) DEAM GRUEN</cell></row><row><cell>BlenderBot Finetuned</cell><cell>23.91</cell><cell>0.9876</cell><cell>0.8161</cell></row><row><cell>80/20 Greedy search</cell><cell>10.89</cell><cell>0.9895</cell><cell>0.8247</cell></row><row><cell>80/20 Multinomial search</cell><cell>4.9</cell><cell>0.9908</cell><cell>0.8392</cell></row><row><cell>50/50 Greedy search</cell><cell>5.87</cell><cell>0.9911</cell><cell>0.8465</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_0">Hugging Face: https://huggingface.co/TheMrguiller/ToxDialogDefender</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1">SuperGLUE benchmark, https://super.gluebenchmark.com/leaderboard, accessed on May 3rd, 2024.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_2">Surge AI, https://www.surgehq.ai/, accessed on May 3rd, 2024.</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="8" xml:id="foot_3">RL4LMs: https://rl4lms.apps.allenai.org/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="9" xml:id="foot_4">TRL: https://huggingface.co/docs/trl/index</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="10" xml:id="foot_5">Negative KL divergence issue in Hugging Face, https://github.com/huggingface/trl/issues/256, accessed on April 30th, 2024.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="7.">Acknowledgments</head><p>This work has been partially supported by the Basque Government (ICL4LANG project, grant no. KK-2023/00094). J. Del Ser also receives support from this institution through the research group MATHMODE (IT1456-22).</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"> <ref type="bibr" target="#b3">4</ref> <p>Perspective API, About the API - Attributes and Languages, https://developers.perspectiveapi.com/s/about-the-api-attributes-and-languages?language=en_US, accessed on April 30th, 2024.</p></div>
			</div>

			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<analytic>
		<title level="a" type="main">Recipes for Building an Open-Domain Chatbot</title>
		<author>
			<persName><forename type="first">S</forename><surname>Roller</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ju</surname></persName>
		</author>
		<author>
			<persName><surname>Williamson</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.eacl-main.24</idno>
		<ptr target="https://aclanthology.org/2021.eacl-main.24" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">P</forename><surname>Merlo</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tiedemann</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Tsarfaty</surname></persName>
		</editor>
		<meeting>the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="300" to="325" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">DIALOGPT: Large-Scale Generative Pre-training for Conversational Response Generation</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Sun</surname></persName>
		</author>
		<author>
			<persName><surname>Galley</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-demos.30</idno>
		<ptr target="https://aclanthology.org/2020.acl-demos.30" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T.-H</forename><surname>Wen</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="270" to="278" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Bordes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-L</forename><surname>Boureau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1605.07683</idno>
		<title level="m">Learning End-to-End Goal-Oriented Dialog</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b3">
	<analytic>
		<title level="a" type="main">Chatbots meet eHealth: Automatizing Healthcare</title>
		<author>
			<persName><forename type="first">F</forename><surname>Amato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Marrone</surname></persName>
		</author>
		<author>
			<persName><surname>Moscato</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">WAIAH@AI*IA</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="40" to="49" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Building Task-Oriented Dialogue Systems for Online Shopping</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Yan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Duan</surname></persName>
		</author>
		<author>
			<persName><surname>Chen</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">31</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<monogr>
		<author>
			<persName><surname>OpenAI</surname></persName>
		</author>
		<ptr target="https://openai.com/blog/chatgpt" />
		<title level="m">Introducing ChatGPT</title>
				<imprint>
			<date type="published" when="2022">2022, accessed on 09/18/2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><surname>Albert</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open Foundation and Fine-Tuned Chat Models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Trolls turned Tay, Microsoft&apos;s fun millennial AI bot, into a genocidal maniac</title>
		<author>
			<persName><forename type="first">A</forename><surname>Ohlheiser</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">The Washington Post</title>
		<imprint>
			<biblScope unit="volume">25</biblScope>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Use of personal information for artificial intelligence learning data under the Personal Information Protection Act: the case of Lee-Luda, an artificial-intelligence chatbot in South Korea</title>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">J</forename><surname>Jeon</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Go</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">H</forename><surname>Namgung</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Asia Pacific Law Review</title>
		<imprint>
			<biblScope unit="volume">31</biblScope>
			<biblScope unit="page" from="55" to="72" />
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Detecting Cyberbullying and Cyberaggression in Social Media</title>
		<author>
			<persName><forename type="first">D</forename><surname>Chatzakou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Leontiadis</surname></persName>
		</author>
		<author>
			<persName><surname>Blackburn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Transactions on the Web (TWEB)</title>
		<imprint>
			<biblScope unit="volume">13</biblScope>
			<biblScope unit="page" from="1" to="51" />
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<monogr>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Abercrombie</surname></persName>
		</author>
		<author>
			<persName><surname>Bergman</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2107.03451</idno>
		<title level="m">Anticipating Safety Issues in E2E Conversational AI: Framework and Tooling</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">"Go eat a bat, Chang!": On the Emergence of Sinophobic Behavior on Web Communities in the Face of COVID-19</title>
		<author>
			<persName><forename type="first">F</forename><surname>Tahmasbi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Schild</surname></persName>
		</author>
		<author>
			<persName><surname>Ling</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the web conference 2021</title>
				<meeting>the web conference 2021</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="1122" to="1133" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Ji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Qiu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><surname>Zhang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2310.19852</idno>
		<title level="m">AI Alignment: A Comprehensive Survey</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Khatri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Hedayatnia</surname></persName>
		</author>
		<author>
			<persName><surname>Venkatesh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1812.10757</idno>
		<title level="m">Advancing the State of the Art in Open Domain Dialog Systems through the Alexa Prize</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b14">
	<analytic>
		<title level="a" type="main">Build it Break it Fix it for Dialogue Safety: Robustness from Adversarial Human Attack</title>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Humeau</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Chintagunta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1461</idno>
		<ptr target="https://aclanthology.org/D19-1461" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="4537" to="4546" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Ngo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raterink</surname></persName>
		</author>
		<author>
			<persName><surname>Araújo</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2108.07790</idno>
		<title level="m">Mitigating harm in language models with conditional-likelihood filtration</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models</title>
		<author>
			<persName><forename type="first">S</forename><surname>Gehman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gururangan</surname></persName>
		</author>
		<author>
			<persName><surname>Sap</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.findings-emnlp.301</idno>
		<ptr target="https://aclanthology.org/2020.findings-emnlp.301" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="3356" to="3369" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">The Woman Worked as a Babysitter: On Biases in Language Generation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Sheng</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Natarajan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Peng</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D19-1339</idno>
		<ptr target="https://aclanthology.org/D19-1339" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Inui</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Ng</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Wan</surname></persName>
		</editor>
		<meeting>the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Association for Computational Linguistics<address><addrLine>Hong Kong, China</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="3407" to="3412" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">GeDi: Generative Discriminator Guided Sequence Generation</title>
		<author>
			<persName><forename type="first">B</forename><surname>Krause</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">D</forename><surname>Gotmare</surname></persName>
		</author>
		<author>
			<persName><surname>Mccann</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.findings-emnlp.424</idno>
		<ptr target="https://aclanthology.org/2021.findings-emnlp.424" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2021, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting><address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4929" to="4952" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Detoxifying Language Models Risks Marginalizing Minority Voices</title>
		<author>
			<persName><forename type="first">A</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Pathak</surname></persName>
		</author>
		<author>
			<persName><surname>Wallace</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.naacl-main.190" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">K</forename><surname>Toutanova</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Rumshisky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Zettlemoyer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Hakkani-Tur</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">I</forename><surname>Beltagy</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Bethard</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Cotterell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Chakraborty</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhou</surname></persName>
		</editor>
		<meeting>the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="2390" to="2397" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Leashing the Inner Demons: Self-Detoxification for Language Models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mcauley</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">36</biblScope>
			<biblScope unit="page" from="11530" to="11537" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Why So Toxic?: Measuring and Triggering Toxic Behavior in Open-Domain Chatbots</title>
		<author>
			<persName><forename type="first">W</forename><forename type="middle">M</forename><surname>Si</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Backes</surname></persName>
		</author>
		<author>
			<persName><surname>Blackburn</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security</title>
				<meeting>the 2022 ACM SIGSAC Conference on Computer and Communications Security</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="2659" to="2673" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<analytic>
		<title level="a" type="main">Defining and Detecting Toxicity on Social Media: Context and Knowledge are Key</title>
		<author>
			<persName><forename type="first">A</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">L</forename><surname>Shalin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">U</forename><surname>Kursuncu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Neurocomputing</title>
		<imprint>
			<biblScope unit="volume">490</biblScope>
			<biblScope unit="page" from="312" to="318" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b23">
	<analytic>
		<title level="a" type="main">A New Generation of Perspective API: Efficient Multilingual Character-level Transformers</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lees</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><forename type="middle">Q</forename><surname>Tran</surname></persName>
		</author>
		<author>
			<persName><surname>Tay</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</title>
				<meeting>the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3197" to="3207" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Automated Hate Speech Detection and the Problem of Offensive Language</title>
		<author>
			<persName><forename type="first">T</forename><surname>Davidson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Warmsley</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Macy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Weber</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the international AAAI conference on web and social media</title>
				<meeting>the international AAAI conference on web and social media</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">11</biblScope>
			<biblScope unit="page" from="512" to="515" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<analytic>
		<title level="a" type="main">ToxiGen: A Large-Scale Machine-Generated Dataset for Adversarial and Implicit Hate Speech Detection</title>
		<author>
			<persName><forename type="first">T</forename><surname>Hartvigsen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Palangi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Ray</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Kamar</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.234</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.234" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="3309" to="3326" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b26">
	<analytic>
		<title level="a" type="main">Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI</title>
		<author>
			<persName><forename type="first">L</forename><surname>Rosenblatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Piedras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wilkins</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI)</title>
				<meeting>the Second Workshop on NLP for Positive Impact (NLP4PI)</meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="15" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Detecting Unintended Social Bias in Toxic Language Datasets</title>
		<author>
			<persName><forename type="first">N</forename><surname>Sahoo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Gupta</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhattacharyya</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.conll-1.10</idno>
		<ptr target="https://aclanthology.org/2022.conll-1.10" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Fokkens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">V</forename><surname>Srikumar</surname></persName>
		</editor>
		<meeting>the 26th Conference on Computational Natural Language Learning (CoNLL), Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates; Hybrid</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="132" to="143" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">A Critical Audit of Accuracy and Demographic Biases within Toxicity Detection Tools</title>
		<author>
			<persName><forename type="first">J</forename><surname>Jiang</surname></persName>
		</author>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
		<respStmt>
			<orgName>Dartmouth College Undergraduate Theses</orgName>
		</respStmt>
	</monogr>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">Social Biases in NLP Models as Barriers for Persons with Disabilities</title>
		<author>
			<persName><forename type="first">B</forename><surname>Hutchinson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Prabhakaran</surname></persName>
		</author>
		<author>
			<persName><surname>Denton</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.acl-main.487</idno>
		<ptr target="https://aclanthology.org/2020.acl-main.487" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Chai</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Schluter</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<meeting>the 58th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="5491" to="5501" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Towards Equal Gender Representation in the Annotations of Toxic Language Detection</title>
		<author>
			<persName><forename type="first">E</forename><surname>Excell</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">Al</forename><surname>Moubayed</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.gebnlp-1.7</idno>
		<ptr target="https://aclanthology.org/2021.gebnlp-1.7" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 3rd Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Costa-Jussa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Gonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Hardmeier</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Webster</surname></persName>
		</editor>
		<meeting>the 3rd Workshop on Gender Bias in Natural Language Processing, Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="55" to="65" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">Critical Perspectives: A Benchmark Revealing Pitfalls in PerspectiveAPI</title>
		<author>
			<persName><forename type="first">L</forename><surname>Piedras</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Rosenblatt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wilkins</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.nlp4pi-1.2</idno>
		<ptr target="https://aclanthology.org/2022.nlp4pi-1.2" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Workshop on NLP for Positive Impact (NLP4PI), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">L</forename><surname>Biester</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Demszky</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Jin</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Sachan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Tetreault</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Wilson</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Xiao</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Zhao</surname></persName>
		</editor>
		<meeting>the Second Workshop on NLP for Positive Impact (NLP4PI), Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates (Hybrid)</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="15" to="24" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Queens are Powerful too: Mitigating Gender Bias in Dialogue Generation</title>
		<author>
			<persName><forename type="first">E</forename><surname>Dinan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Fan</surname></persName>
		</author>
		<author>
			<persName><surname>Williams</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.emnlp-main.656</idno>
		<ptr target="https://aclanthology.org/2020.emnlp-main.656" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">B</forename><surname>Webber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<meeting>the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="8173" to="8188" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">Just Say No: Analyzing the Stance of Neural Dialogue Generation in Offensive Contexts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Baheti</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Ritter</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Riedl</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.emnlp-main.397" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and<address><addrLine>Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="4846" to="4862" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<analytic>
		<title level="a" type="main">Context Sensitivity Estimation in Toxicity Detection</title>
		<author>
			<persName><forename type="first">A</forename><surname>Xenos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><surname>Androutsopoulos</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.woah-1.15" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics</title>
				<meeting>the 5th Workshop on Online Abuse and Harms (WOAH 2021), Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="140" to="145" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b35">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Pavlopoulos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sorensen</surname></persName>
		</author>
		<author>
			<persName><surname>Dixon</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.00998</idno>
		<title level="m">Toxicity Detection: Does Context Really Matter?</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">Revisiting Contextual Toxicity Detection in Conversations</title>
		<author>
			<persName><forename type="first">A</forename><surname>Anuchitanukul</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Ive</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Journal of Data and Information Quality</title>
		<imprint>
			<biblScope unit="volume">15</biblScope>
			<biblScope unit="page" from="1" to="22" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<analytic>
		<title level="a" type="main">Information Leakage in Embedding Models</title>
		<author>
			<persName><forename type="first">C</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Raghunathan</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2020 ACM SIGSAC conference on computer and communications security</title>
				<meeting>the 2020 ACM SIGSAC conference on computer and communications security</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="377" to="390" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<monogr>
		<author>
			<persName><forename type="first">L</forename><surname>Weidinger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mellor</surname></persName>
		</author>
		<author>
			<persName><surname>Rauh</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2112.04359</idno>
		<title level="m">Ethical and social risks of harm from Language Models</title>
				<imprint>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">SaFeRDialogues: Taking Feedback Gracefully after Conversational Safety Failures</title>
		<author>
			<persName><forename type="first">M</forename><surname>Ung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y.-L</forename><surname>Boureau</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.447</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.447" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="6462" to="6481" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b40">
	<monogr>
		<author>
			<persName><forename type="first">S</forename><surname>Dathathri</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Madotto</surname></persName>
		</author>
		<author>
			<persName><surname>Lan</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1912.02164</idno>
		<title level="m">Plug and Play Language Models: A Simple Approach to Controlled Text Generation</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">DExperts: Decoding-Time Controlled Text Generation with Experts and Anti-Experts</title>
		<author>
			<persName><forename type="first">A</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><surname>Swayamdipta</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2021.acl-long.522" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</title>
		<title level="s">Association for Computational Linguistics</title>
		<editor>
			<persName><forename type="first">C</forename><surname>Zong</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">W</forename><surname>Li</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Navigli</surname></persName>
		</editor>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="6691" to="6706" />
		</imprint>
	</monogr>
	<note>Long Papers</note>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Zhou</surname></persName>
		</author>
		<author>
			<persName><surname>Wang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.08295</idno>
		<title level="m">Detoxify Language Model Step-by-Step</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b43">
	<monogr>
		<author>
			<persName><forename type="first">Z</forename><surname>Gou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Shao</surname></persName>
		</author>
		<author>
			<persName><surname>Gong</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.11738</idno>
		<title level="m">Critic: Large Language Models Can Self-Correct with Tool-Interactive Critiquing</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b44">
	<analytic>
		<title level="a" type="main">Reward Modeling for Mitigating Toxicity in Transformer-Based Language Models</title>
		<author>
			<persName><forename type="first">F</forename><surname>Faal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Schmitt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">Y</forename><surname>Yu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Applied Intelligence</title>
		<imprint>
			<biblScope unit="volume">53</biblScope>
			<biblScope unit="page" from="8421" to="8435" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b45">
	<analytic>
		<title level="a" type="main">Quark: Controllable Text Generation with Reinforced [Un]learning</title>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><surname>Hessel</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">36th Conference on Neural Information Processing Systems (NeurIPS 2022)</title>
				<meeting><address><addrLine>NeurIPS</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b46">
	<monogr>
		<title level="m" type="main">Generating Sequences by Learning to Self-Correct</title>
		<author>
			<persName><forename type="first">S</forename><surname>Welleck</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Lu</surname></persName>
		</author>
		<author>
			<persName><surname>West</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2211.00053</idno>
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b47">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Glass</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1809.04113</idno>
		<title level="m">Detecting egregious responses in neural sequence-to-sequence models</title>
				<imprint>
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b48">
	<analytic>
		<title level="a" type="main">Red Teaming Language Models with Language Models</title>
		<author>
			<persName><forename type="first">E</forename><surname>Perez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><surname>Cai</surname></persName>
		</author>
		<ptr target="https://aclanthology.org/2022.emnlp-main.225" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">Y</forename><surname>Goldberg</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Z</forename><surname>Kozareva</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Zhang</surname></persName>
		</editor>
		<meeting>the 2022 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Abu Dhabi, United Arab Emirates</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="page" from="3419" to="3448" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b49">
	<analytic>
		<title level="a" type="main">Probing Toxic Content in Large Pre-Trained Language Models</title>
		<author>
			<persName><forename type="first">N</forename><surname>Ousidhoum</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><surname>Fang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</title>
		<title level="s">Long Papers</title>
		<meeting>the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4262" to="4274" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b50">
	<monogr>
		<title level="m" type="main">Training language models to follow instructions with human feedback</title>
		<author>
			<persName><forename type="first">L</forename><surname>Ouyang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><surname>Jiang</surname></persName>
		</author>
		<ptr target="https://arxiv.org/abs/2203.02155" />
		<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b51">
	<monogr>
		<author>
			<persName><forename type="first">N</forename><surname>Lambert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Werra</surname></persName>
		</author>
		<title level="m">Illustrating Reinforcement Learning from Human Feedback (RLHF), Hugging Face Blog</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b52">
	<monogr>
		<author>
			<persName><forename type="first">J</forename><surname>Shlens</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1404.2000</idno>
		<title level="m">Notes on Kullback-Leibler Divergence and Likelihood</title>
				<imprint>
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b53">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Yu</surname></persName>
		</author>
		<author>
			<persName><surname>Jiang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2205.12688</idno>
		<title level="m">ProsocialDialog: A Prosocial Backbone for Conversational Agents</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b54">
	<analytic>
		<title level="a" type="main">Beyond Goldfish Memory: Long-Term Open-Domain Conversation</title>
		<author>
			<persName><forename type="first">J</forename><surname>Xu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Szlam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Weston</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.356</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.356" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="5180" to="5197" />
		</imprint>
	</monogr>
	<note>(Volume 1: Long Papers), Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b55">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Bai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Kadavath</surname></persName>
		</author>
		<author>
			<persName><surname>Kundu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2212.08073</idno>
		<title level="m">Constitutional AI: Harmlessness from AI Feedback</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b56">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Ramamurthy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Ammanabrolu</surname></persName>
		</author>
		<author>
			<persName><surname>Brantley</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.01241</idno>
		<title level="m">Is Reinforcement Learning (Not) for Natural Language Processing?: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b57">
	<analytic>
		<title level="a" type="main">A Comparative Study of Using Pre-trained Language Models for Toxic Comment Classification</title>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Hopfgartner</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the Web Conference 2021</title>
				<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="500" to="507" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b58">
	<analytic>
		<title level="a" type="main">A Study of Multilingual Toxic Text Detection Approaches under Imbalanced Sample Distribution</title>
		<author>
			<persName><forename type="first">G</forename><surname>Song</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Xiao</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Information</title>
		<imprint>
			<biblScope unit="volume">12</biblScope>
			<biblScope unit="page">205</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b59">
	<analytic>
		<title level="a" type="main">Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks</title>
		<author>
			<persName><forename type="first">R</forename><surname>Dey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Salem</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">IEEE 60th International Midwest Symposium on Circuits and Systems (MWSCAS), IEEE</title>
				<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="1597" to="1600" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b60">
	<analytic>
		<title level="a" type="main">Bidirectional recurrent neural networks</title>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">K</forename><surname>Paliwal</surname></persName>
		</author>
		<ptr target="https://api.semanticscholar.org/CorpusID:18375389" />
	</analytic>
	<monogr>
		<title level="j">IEEE Trans. Signal Process</title>
		<imprint>
			<biblScope unit="volume">45</biblScope>
			<biblScope unit="page" from="2673" to="2681" />
			<date type="published" when="1997">1997</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b61">
	<analytic>
		<title level="a" type="main">BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</title>
		<author>
			<persName><forename type="first">J</forename><surname>Devlin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-W</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><surname>Lee</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/N19-1423</idno>
		<ptr target="https://aclanthology.org/N19-1423" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies</title>
		<title level="s">Long and Short Papers</title>
		<editor>
			<persName><forename type="first">J</forename><surname>Burstein</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">C</forename><surname>Doran</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">T</forename><surname>Solorio</surname></persName>
		</editor>
		<meeting>the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies<address><addrLine>Minneapolis, Minnesota</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="4171" to="4186" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b62">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Ott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Du</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1907.11692</idno>
		<title level="m">RoBERTa: A Robustly Optimized BERT Pretraining Approach</title>
		<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b63">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>He</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Chen</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.03654</idno>
		<title level="m">DeBERTa: Decoding-enhanced BERT with Disentangled Attention</title>
		<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b64">
	<monogr>
		<title level="m" type="main">Toxic Comment Classification Challenge</title>
		<author>
			<persName><surname>Cjadams</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Sorensen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Elliott</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Dixon</surname></persName>
		</author>
		<ptr target="https://kaggle.com/competitions/jigsaw-toxic-comment-classification-challenge" />
		<imprint>
			<date type="published" when="2017">2017</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b65">
	<analytic>
		<title level="a" type="main">Raiders of the Lost Kek: 3.5 Years of Augmented 4chan Posts from the Politically Incorrect Board</title>
		<author>
			<persName><forename type="first">A</forename><surname>Papasavva</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Zannettou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><forename type="middle">De</forename><surname>Cristofaro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Stringhini</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the international AAAI conference on web and social media</title>
		<meeting>the international AAAI conference on web and social media</meeting>
		<imprint>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">14</biblScope>
			<biblScope unit="page" from="885" to="894" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b66">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Chang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.03109</idno>
		<title level="m">A Survey on Evaluation of Large Language Models</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b67">
	<analytic>
		<title level="a" type="main">DEAM: Dialogue Coherence Evaluation using AMR-based Semantic Manipulations</title>
		<author>
			<persName><forename type="first">S</forename><surname>Ghazarian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Wen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Galstyan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Peng</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2022.acl-long.57</idno>
		<ptr target="https://aclanthology.org/2022.acl-long.57" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics</title>
		<title level="s">Long Papers</title>
		<editor>
			<persName><forename type="first">S</forename><surname>Muresan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">P</forename><surname>Nakov</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">A</forename><surname>Villavicencio</surname></persName>
		</editor>
		<meeting>the 60th Annual Meeting of the Association for Computational Linguistics<address><addrLine>Dublin, Ireland</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2022">2022</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="771" to="785" />
		</imprint>
	</monogr>
	<note>Association for Computational Linguistics</note>
</biblStruct>

<biblStruct xml:id="b68">
	<analytic>
		<title level="a" type="main">GRUEN for Evaluating Linguistic Quality of Generated Text</title>
		<author>
			<persName><forename type="first">W</forename><surname>Zhu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhat</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2020.findings-emnlp.9</idno>
		<ptr target="https://aclanthology.org/2020.findings-emnlp.9" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2020</title>
		<editor>
			<persName><forename type="first">T</forename><surname>Cohn</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>He</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">Y</forename><surname>Liu</surname></persName>
		</editor>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="page" from="94" to="108" />
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
