<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chiara</forename><forename type="middle">Di</forename><surname>Bonaventura</surname></persName>
							<email>chiara.di_bonaventura@kcl.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Imperial College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Siciliani</surname></persName>
							<email>lucia.siciliani@uniba.it</email>
							<affiliation key="aff2">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>pierpaolo.basile@uniba.it</email>
							<affiliation key="aff2">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Albert</forename><surname>Meroño-Peñuela</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Barbara</forename><surname>McGillivray</surname></persName>
							<email>barbara.mcgillivray@kcl.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">79F0AA8F3C9FC726F5E7380167E53110</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Hate Speech Detection</term>
					<term>Explanation Generation</term>
					<term>Human Evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) difficulty in reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Explainability is a crucial open challenge in Natural Language Processing (NLP) research on abusive language <ref type="bibr" target="#b0">[1]</ref>, as increasing model complexity <ref type="bibr" target="#b1">[2]</ref>, models' intrinsic bias <ref type="bibr" target="#b2">[3]</ref>, and international regulations <ref type="bibr" target="#b3">[4]</ref> call for a shift in perspective from performance-based models to more transparent models. Moreover, recent studies have shown the benefits of explanations for users <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> and content moderators <ref type="bibr" target="#b6">[7]</ref> on social media platforms. The former can benefit from receiving an explanation for why a certain post has been flagged or removed, whereas the latter are shown to annotate toxic posts faster and resolve doubtful annotations thanks to explanations.</p><p>Several efforts have moved towards explainable abusive language detection in recent years, like the development of datasets containing rationales (i.e., the tokens in the text that suggest why the text is hateful) <ref type="bibr" target="#b7">[8]</ref> or implied statements (i.e., descriptions of the implied meaning of the text) <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>, and shared tasks on explainable hate speech detection <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>, inter alia.
With Large Language Models (LLMs) like FLAN-T5 <ref type="bibr" target="#b12">[13]</ref> showing remarkable performance across tasks and human-like text generation <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>, recent studies have explored LLMs for explainable hate speech detection, wherein classification predictions are described through natural language explanations <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. For instance, <ref type="bibr" target="#b18">[19]</ref> used chain-of-thought prompting <ref type="bibr" target="#b19">[20]</ref> of LLMs to generate explanations for implicit hate speech detection.</p><p>However, most of these studies rely on empirical metrics like BLEU <ref type="bibr" target="#b20">[21]</ref> to evaluate the generated explanations automatically. Consequently, the human perception and implications of these explanations remain understudied, as well as the extent to which empirical metrics approximate human judgements. <ref type="bibr" target="#b21">[22]</ref> recruited crowdworkers to evaluate the level of hatefulness in tweets and the quality of explanations generated by GPT-3. Instead, we conduct an expert survey investigating four LLMs and five learning strategies across multi-class abusive language detection tasks to answer the following questions: RQ1: How well do LLM-generated explanations for abusive language detection match human expectations? RQ2: How well do empirical metrics align with human judgements? RQ3: What makes LLM-generated explanations good, according to experts?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Experimental Setup</head><p>To answer these research questions, we design a before-and-after study, surveying participants about their prior expectations about LLM-generated explanations and then showing them examples generated by several LLMs with diverse learning strategies<ref type="foot" target="#foot_0">1</ref>, followed by further interviews. To ensure the robustness of our results, we recruited experts in the field, i.e., AI researchers, as described below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data</head><p>For our experiments, we use HateXplain <ref type="bibr" target="#b7">[8]</ref> and the Implicit Hate Corpus <ref type="bibr" target="#b8">[9]</ref> as they encompass different levels of offensiveness (i.e., hate speech, offensive, neutral), expressiveness (i.e., explicit hate, implicit hate, neutral), multiple targeted groups, and explanations for the hateful label (Table <ref type="table" target="#tab_0">1</ref>). These datasets contain unstructured explanations of the words that constitute abuse (in HateXplain) and the user's intent (in Implicit Hate). In view of previous research arguing the need for structured explanations in hateful content moderation <ref type="bibr" target="#b0">[1]</ref>, we use the following template to create structured explanations, which we use as ground truth: "Explanation: it contains the following hateful words (implied statement):" for abusive content in HateXplain (Implicit Hate Corpus), and "The text does not contain abusive content." for neutral content.</p></div>
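The template above maps each dataset's raw annotation onto a structured ground-truth explanation. A minimal sketch of this mapping, assuming illustrative function and parameter names (these are not the datasets' actual field names):

```python
def structured_explanation(label, hateful_words=None, implied_statement=None):
    """Build a structured ground-truth explanation following the paper's template.

    `label`, `hateful_words` and `implied_statement` are illustrative names,
    not the actual fields of HateXplain or the Implicit Hate Corpus.
    """
    if label == "neutral":
        return "The text does not contain abusive content."
    if hateful_words is not None:
        # HateXplain: token-level rationales
        return ("Explanation: it contains the following hateful words: "
                + ", ".join(hateful_words))
    # Implicit Hate Corpus: free-text implied statement
    return ("Explanation: it contains the following implied statement: "
            + implied_statement)
```

The same function covers both corpora: token-level rationales yield a word list, while Implicit Hate entries yield the implied statement.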
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Methodology</head><p>We extensively investigate four popular LLMs across five learning strategies on their ability to detect multi-class offensiveness and expressiveness of abusive language and to generate explanations for the classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models.</head><p>We use different open-source LLMs (Table <ref type="table">2</ref>): the base versions of FLAN-Alpaca <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24]</ref>, FLAN-T5 <ref type="bibr" target="#b12">[13]</ref>, mT0 <ref type="bibr" target="#b24">[25]</ref>, and the 7B foundational model Llama 2 <ref type="bibr" target="#b25">[26]</ref>, which is an updated version of LLaMA <ref type="bibr" target="#b26">[27]</ref>.</p><formula xml:id="formula_0">Model | Instruction Fine-tuned | Toxicity Fine-tuned
FLAN-Alpaca | ✓ | ✓
FLAN-T5 | ✓ | ✓
mT0 | ✓ | -
Llama-2 | - | -</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Summary of models used.</p><p>Learning strategies. As different prompting strategies might yield different results, we test five distinct learning strategies using the established Stanford Alpaca template<ref type="foot" target="#foot_1">2</ref> (cf. Appendix A for prompt details):</p><p>(1) zero-shot learning (zsl): we pass "Classify the input text as list_of_labels, and provide an explanation" in the instruction field of the template. The list_of_labels changes according to the dataset used;</p><p>(2) few-shot learning (fsl): we pass three additional examples to the aforementioned template, which are randomly sampled with equal probability among the labels to account for class imbalance in the datasets. We experimented with different numbers of examples (i.e., passing one, three or five examples), and chose three as it was the best strategy;</p><p>(3) knowledge-guided zero-shot learning (kg): instead of passing additional examples in the prompts, we add external knowledge retrieved by means of an entity linker <ref type="foot" target="#foot_2">3</ref> , which first detects entities mentioned in the input text, and then retrieves the relevant information from the external knowledge base. We use Wikidata <ref type="bibr" target="#b27">[28]</ref> for encyclopedic knowledge, KnowledJe <ref type="bibr" target="#b28">[29]</ref> for hate speech temporal linguistic knowledge and ConceptNet <ref type="bibr" target="#b29">[30]</ref> for commonsense knowledge. We modify the prompt template with an additional field called 'context' to account for this external knowledge;</p><p>(4) instruction fine-tuning (ft): we use the same prompts used in (1) to instruction fine-tune Llama-2;</p><p>(5) knowledge-guided instruction fine-tuning (kg_ft): we use the knowledge-guided prompts developed in (3) to instruction fine-tune Llama-2.</p><p>Empirical eval metrics. 
We evaluate how closely the LLM-generated explanations match the ground-truth across eight empirical similarity metrics due to the challenge of simultaneously assessing a wide set of criteria <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33]</ref>. Following established NLG research <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b34">35]</ref>, we choose BERTScore <ref type="bibr" target="#b35">[36]</ref> and METEOR <ref type="bibr" target="#b36">[37]</ref> for semantic similarity. For syntactic similarity, we select BLEU <ref type="bibr" target="#b20">[21]</ref>, GBLEU <ref type="bibr" target="#b37">[38]</ref>, ROUGE <ref type="bibr" target="#b38">[39]</ref>, and ChrF <ref type="bibr" target="#b39">[40]</ref> with its derivatives ChrF+ and ChrF++ <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b41">42]</ref>. Additionally, we present an expert evaluation following our survey.</p></div>
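To make the syntactic-similarity family concrete, clipped unigram precision — the building block of BLEU-style metrics — can be computed in a few lines. This is a toy stand-in for illustration, not the full metrics used in our experiments:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate explanation against a
    reference: each candidate token counts at most as often as it appears
    in the reference (the core idea behind BLEU-style n-gram metrics)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return clipped / len(cand)
```

For example, `unigram_precision("a b b c", "a b c d")` is 0.75: the duplicated "b" is clipped to a single match, so three of the four candidate tokens are credited.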
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Survey Design</head><p>To evaluate how well LLMs align with human expectations and judgements in explanation generation, we design a before-and-after study as follows.</p><p>Before treatment. We ask for participants' background information, e.g., gender identity, native language, and how they would rate the usefulness and trustworthiness of a language model for explanation generation. Specifically, we ask "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" and "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" on a 1-5 Likert scale.</p><p>Treatment. As for the treatment, we show participants a sample of 70 texts from the datasets, paired with up to four different explanations. Specifically, given a text and ground-truth explanation, participants are asked if the text is correctly explained. If yes, they are asked to rate three different LLM-generated explanations with respect to the ground-truth on a 1-3 scale. These explanations are randomly sampled among the four LLMs and five learning strategies discussed in Section 2.2.</p><p>After treatment. Finally, we ask participants' opinions on the usefulness and trustworthiness of explanation generation, having seen the LLM-generated explanations. In addition, we ask which types of errors they observed most frequently, and what a good explanation would look like.</p><p>The full list of questions is in Appendix B. The institutional ethics board of the first author's university approved our study design. We distributed the survey through channels that allow us to target individuals working in AI who are familiar with the field of language models and/or AI Ethics, including NLP reading groups and AI Ethics interest groups.
To ensure the reliability of our before-and-after study, participants were given 1 hour to complete as many answers as they could. We collected answers from 15 participants, of which 33% (67%) identify as female (male), and 33% (67%) are (non-)native English speakers. The average level of participants' expertise in abusive language research is 2.47 out of 5 (self-described)<ref type="foot" target="#foot_3">4</ref>, and their continents of origin include Europe (60%), Asia (26.67%), Africa (6.67%), and Latin America (6.67%).</p></div>
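The before/after comparison reported in Section 3 amounts to the percentage decrease of mean Likert ratings between the two phases. A sketch of that computation, using hypothetical ratings rather than our survey data (the paper's exact aggregation, e.g. per-participant vs. pooled, is an assumption here):

```python
def pct_decrease(before, after):
    """Percentage decrease of the mean Likert rating from before to after
    treatment. Pools all ratings; the paper's exact aggregation scheme
    (e.g., averaging per-participant changes) may differ."""
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return 100.0 * (mean_before - mean_after) / mean_before
```

With hypothetical ratings `[5, 4, 5, 4]` before and `[2, 3, 2, 3]` after, the mean drops from 4.5 to 2.5, a decrease of roughly 44.4%.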
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results and Discussion</head><p>Our 15 participants reach a fair agreement, with Krippendorff's alpha <ref type="bibr" target="#b42">[43]</ref> equal to 38.43%.</p><p>Fig. <ref type="figure">1</ref> shows changes in the relative frequencies of participant scores on the usefulness and trustworthiness of explanations before and after treatment. Before treatment, participants expect textual explanations for classifications to be "highly useful" (above 50%; highest possible score) in terms of usefulness, and "moderately trustworthy" or "neutral" (above 40%; second- and third-best possible scores) in terms of trustworthiness. After treatment, however, participants shift their usefulness scores towards "moderately unuseful" (40-50%; second-worst possible score) and their trustworthiness scores to "highly untrustworthy" (above 30%; worst possible score). Agreement differs between the two categories: usefulness is much more consensual, whereas trustworthiness is judged with higher variance. In general, LLM-generated explanations do not meet human expectations in terms of usefulness and trustworthiness. Specifically, exposing participants to these explanations leads to an average percentage decrease of 47.78% and 64.32% in the perceived usefulness and trustworthiness of explanations, respectively. Fig. <ref type="figure" target="#fig_0">2</ref> shows the scores of all empirical metrics and the expert evaluation for all models on explanation generation. Overall, the similarity metrics are highly volatile with respect to each other. For instance, FLAN-Alpaca prompted with zero-shot learning (i.e., 'alpaca_zsl' in the figure) generates explanations that are more than 70% semantically similar to the ground-truth explanations according to BERTScore, but less than 20% semantically similar according to METEOR.
Similarly for syntax: BLEU and GBLEU similarity scores are below 3%, whereas ROUGE and chrF/+/++ lie in the range 9%-21%. Moreover, we observe that BERTScore tends to over-score explanations compared to human evaluation scores, whereas METEOR, BLEU, GBLEU, ROUGE and chrF/+/++ tend to under-score them. Instruction fine-tuning helped all metrics approximate expert evaluations better, especially when tuned on knowledge-guided prompts. We use Spearman's rank correlation coefficient to compare human scores with those provided by the other metrics. In detail, we rank the models under each metric, and then compute the Spearman correlation between the ranking obtained from human scores and the rankings obtained from the other metrics. Table <ref type="table" target="#tab_2">3</ref> reports all the correlation scores. We observe that BERTScore is the most correlated with humans in both tasks. Also, the chrF/+/++ metrics are highly correlated with humans, while all the other metrics based on syntactic matches are only slightly correlated. These results show that semantic metrics are closer to how humans evaluate the quality of the explanations generated by LLMs. Only one metric (ROUGE) behaves differently between the two tasks.</p><p>Since 38.55% of the ground-truth explanations were not rated as good explanations by participants, we further investigated the most common errors and what makes an explanation good. Table <ref type="table" target="#tab_3">4</ref> reports the most common error categories reported by participants.
Most of them relate to logical fallacies (e.g., contradictory statements, hallucination), especially in the context of sarcasm and self-deprecating humour, rather than linguistic errors (e.g., grammar, misspellings). It is worth noting that 13.33% of the participants reported that LLM-generated explanations contain cultural bias (e.g., stereotypes), with the implication of potentially perpetuating harms against the targeted victims of abusive language. As for desiderata, 73.33% of participants would like to receive textual explanations that are coherent with human reasoning and understanding, i.e., that are relevant and exhaustive with respect to the text they refer to while being logically and linguistically correct. A remaining 20% think that a good explanation must instead be coherent with model reasoning. In other words, participants are much more concerned with what the explanation looks like than with how faithfully it reflects the model's inner reasoning. To quote one participant, "I would want the explanation to be helpful to me and guide my own reasoning".</p></div>
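The rank-based comparison described above can be reproduced with the classic closed form of Spearman's coefficient for tie-free rankings. A minimal sketch (for rankings with ties, a library routine such as `scipy.stats.spearmanr` should be used instead):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same
    models: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    rank difference of model i between the two rankings."""
    assert len(rank_a) == len(rank_b)
    n = len(rank_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))
```

Identical rankings give 1.0, fully reversed rankings give -1.0, matching the interpretation of the correlations in Table 3.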
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>In this paper, we conducted a before-and-after study to understand human expectations and judgements of LLM-generated explanations for multi-class abusive language detection tasks. Contrary to previous research <ref type="bibr" target="#b21">[22]</ref>, we investigated multiple LLMs and learning techniques, and we surveyed AI experts who are familiar with abusive language research instead of crowdworkers. We found that human expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met: after seeing these explanations, the usefulness and trustworthiness ratings decrease by 47.78% and 64.32%, respectively. Secondly, our results show that empirical metrics commonly used to evaluate textual explanations are highly volatile with respect to each other, even when they measure the same type of similarity (i.e., semantic vs. syntactic), pointing to the need for more reliable metrics for the empirical evaluation of textual explanations. In general, BERTScore and METEOR exhibit the strongest correlation with human judgements. Lastly, our study provides evidence of the desiderata for LLM-generated explanations, suggesting that explanations should be coherent with human reasoning rather than model reasoning. Participants most value textual explanations that are relevant and exhaustive with respect to the text they refer to, while being logically and linguistically correct. Justifications for this preference lie in the fact that abusive language detection heavily relies on additional context and knowledge about slang and slurs, for which receiving an explanation is helpful to participants' understanding of the text. Future work should investigate whether this preference holds for other domains as well.
In light of our findings, we conclude with three recommendations for using LLMs responsibly for explainable abusive language detection: (1) be aware of the cultural bias these models might exhibit when generating free-text explanations, which can further harm targeted groups;</p><p>(2) if possible, instruction fine-tune LLMs for explanation generation in abusive language detection. Not only could this ensure the generation of structured explanations, as advised by previous research <ref type="bibr" target="#b0">[1]</ref>, but it also returns the highest evaluation scores, both empirically and expert-wise, when using knowledge-guided prompts;</p><p>(3) opt for a combination of empirical metrics to evaluate textual explanations when no human evaluation is possible: since no single empirical metric seems to generalise across learning techniques, models and datasets, the ground truth lies somewhere between BERTScore (upper bound) and BLEU (lower bound).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Evaluation of explanation generation by LLMs across empirical metrics and human eval.</figDesc></figure>
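Recommendation (3) can be operationalised as reporting a band rather than a single score: with no human evaluation available, quote the most pessimistic (BLEU-like) and most optimistic (BERTScore-like) estimates together. A minimal sketch, where the metric names and values are illustrative:

```python
def metric_band(scores):
    """Given per-metric scores for one system's explanations, return the
    (lower, upper) band within which the true quality is assumed to lie,
    following the recommendation to combine empirical metrics rather than
    trust any single one."""
    values = scores.values()
    return min(values), max(values)
```

For instance, `metric_band({"bleu": 0.02, "rouge": 0.15, "bertscore": 0.71})` returns `(0.02, 0.71)`, reporting BLEU as the lower bound and BERTScore as the upper bound.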
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Summary of datasets used.</figDesc><table><row><cell>Dataset</cell><cell>Labels</cell><cell>Target</cell><cell>Explanation</cell></row><row><cell>HateXplain</cell><cell>hate speech, offensive, neutral</cell><cell>women, black, ...</cell><cell>Token-level</cell></row><row><cell>Implicit Hate</cell><cell>implicit hate, explicit hate, neutral</cell><cell>Jews, whites, ...</cell><cell>Implied statement</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>The Spearman coefficient between each metric and experts' scores.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Percentage of error categories reported by participants.</figDesc><table><row><cell>Error Category</cell><cell>Relative Frequency</cell></row><row><cell>Logical Errors</cell><cell>26.67%</cell></row><row><cell>Vagueness</cell><cell>20.00%</cell></row><row><cell>Cultural Bias</cell><cell>13.33%</cell></row><row><cell>Hallucination</cell><cell>13.33%</cell></row><row><cell>Irrelevant Info</cell><cell>13.33%</cell></row><row><cell>Other</cell><cell>6.67%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The data containing the LLM-generated explanations are publicly available at https://github.com/ChiaraDiBonaventura/ is-explanation-all-you-need</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file# data-release</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">If available, we use the API provided by the knowledge source, spaCy otherwise. https://spacy.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The list of levels to choose from was: 1=Novice, 2=Advanced beginner, 3=Competent, 4=Proficient, 5=Expert.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was supported by the UK Research and Innovation [grant number EP/S023356/1] in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org); by the Trustworthy AI Research award by The Alan Turing Institute, supported by the British Embassy Rome and the UK Science &amp; Innovation Network; and by the PNRR project FAIR -Future AI Research (PE00000013), Spoke 6 -Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>† Work partially funded by the Trustworthy AI Research award received by The Alan Turing Institute and the Italian Future AI Research Foundation (FAIR).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prompt Details</head><p>Table <ref type="table">5</ref> shows the two types of prompts we used in our experiments, following the template of the Stanford Alpaca project. The two categories differ in the 'context' field that is passed in the knowledge-guided version, which contains the information extracted from the knowledge sources linked to the text. As described in Section 2.2, we used the vanilla prompts for zero-shot learning, few-shot learning, and instruction fine-tuning, whereas we used the knowledge-guided prompts for knowledge-guided zero-shot learning and knowledge-guided instruction fine-tuning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Survey Questions</head><p>Participants were presented with the questions shown in Table <ref type="table">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Vanilla</head><p>Below is an instruction that describes a task, paired with input text. Write a response that appropriately completes the instruction.</p><p>Instruction: Classify the input text as list_of_labels, and provide an explanation. Input text: text_to_classify. Response:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Knowledge-guided</head><p>Below is an instruction that describes a task, paired with context and input text. Write a response that appropriately completes the instruction based on the context. Instruction: Classify the input text as list_of_labels, and provide an explanation. Context: knowledge_source_linked. Input text: text_to_classify. Response:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Details of vanilla prompts and knowledge-guided prompts passed to the LLMs in our experiments.</p></div>
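The two templates in Table 5 can be assembled programmatically. A sketch under the assumption that fields are joined on separate lines (the actual experiments' whitespace and field layout may differ):

```python
def build_prompt(instruction, input_text, context=None):
    """Assemble a vanilla or knowledge-guided Alpaca-style prompt following
    Table 5. Passing `context` switches to the knowledge-guided variant."""
    if context is None:
        return (
            "Below is an instruction that describes a task, paired with input text. "
            "Write a response that appropriately completes the instruction.\n"
            f"Instruction: {instruction}\n"
            f"Input text: {input_text}\n"
            "Response:"
        )
    return (
        "Below is an instruction that describes a task, paired with context and input text. "
        "Write a response that appropriately completes the instruction based on the context.\n"
        f"Instruction: {instruction}\n"
        f"Context: {context}\n"
        f"Input text: {input_text}\n"
        "Response:"
    )
```

For zero-shot learning the instruction would be, e.g., "Classify the input text as hate speech, offensive, neutral, and provide an explanation"; for the knowledge-guided variants the linked knowledge-base snippets are passed via `context`.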
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Part Questions</head><p>Before Treatment "Which gender do you identify as?" "Are you an English native-speaker?" "What is your country of origin?" "What is your level of expertise on language models or abusive language?" "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?"</p><p>Treatment "Do you think explanation 1 provides a good explanation given the text?" "If your answer was yes, does explanation 2 mean the same thing as explanation 1?" "If your answer was yes, does explanation 3 mean the same thing as explanation 1?" "If your answer was yes, does explanation 4 mean the same thing as explanation 1?"</p><p>After Treatment "Having seen these explanations, how useful would you rate a system that provides you a textual explanation for its classification?" "Having seen these explanations, how trustworthy would you rate a system that provides you a textual explanation for its classification?" "What was the main error you noticed in these explanations?" "What do you think makes a textual explanation good?" "Do you have any comment you would like to share?"</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>List of questions asked to participants in our expert survey.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yannakoudakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shutova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.06024</idno>
		<title level="m">Tackling online abuse: A survey of automated abuse detection methods</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Model interpretability through the lens of computational complexity</title>
		<author>
			<persName><forename type="first">P</forename><surname>Barceló</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Monet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pérez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Subercaseaux</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/b1adda14824f50ef24ff1c05bb66faf3-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="15487" to="15498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The risk of racial bias in hate speech detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Card</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1163</idno>
		<ptr target="https://aclanthology.org/P19-1163" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Traum</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1668" to="1678" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m">The European Parliament and the Council of the European Union, EU Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)</title>
				<imprint>
			<publisher>Official Journal of the European Union</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Disproportionate removals and differing content moderation experiences for conservative, transgender, and black social media users: Marginalization and moderation gray areas</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">L</forename><surname>Haimson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Delmonaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wegner</surname></persName>
		</author>
		<idno type="DOI">10.1145/3479610</idno>
		<ptr target="https://doi.org/10.1145/3479610" />
	</analytic>
	<monogr>
		<title level="j">Proc. ACM Hum.-Comput. Interact</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Effect of transparency and trust on acceptance of automatic online comment moderation systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Brunk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mattern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Riehle</surname></persName>
		</author>
		<idno type="DOI">10.1109/CBI.2019.00056</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 21st Conference on Business Informatics (CBI)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">01</biblScope>
			<biblScope unit="page" from="429" to="435" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Calabrese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Neves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Bos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Barbieri</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.04106</idno>
		<title level="m">Explainability and hate speech: Structured explanations make social media moderators faster</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Hatexplain: A benchmark dataset for explainable hate speech detection</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mathew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Yimam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mukherjee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI conference on artificial intelligence</title>
				<meeting>the AAAI conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="14867" to="14875" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Latent hatred: A benchmark for understanding implicit hate speech</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elsherief</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ziems</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Muchlinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Anupindi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Seybolt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">De</forename><surname>Choudhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="345" to="363" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Social bias frames: Reasoning about social and power implications of language</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">HODI at EVALITA 2023: Overview of the first shared task on homotransphobia detection in Italian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Nozza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Cignarella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Damo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2023</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>CEUR-WS. org</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Semeval-2023 task 10: Explainable detection of online sexism</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kirk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Vidgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Röttger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</title>
				<meeting>the 17th International Workshop on Semantic Evaluation (SemEval-2023)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2193" to="2210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brahma</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.11416</idno>
		<title level="m">Scaling instruction-finetuned language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ladhak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mckeown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Hashimoto</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.13848</idno>
		<title level="m">Benchmarking large language models for news summarization</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Ziems</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Held</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Shaikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.03514</idno>
		<title level="m">Can large language models transform computational social science?</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Probing LLMs for hate speech detection: strengths and vulnerabilities</title>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harshvardhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Saha</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-emnlp.407</idno>
		<ptr target="https://aclanthology.org/2023.findings-emnlp.407" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="6116" to="6128" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">HARE: Explainable hate speech detection with step-by-step reasoning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Yun</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-emnlp.365</idno>
		<ptr target="https://aclanthology.org/2023.findings-emnlp.365" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="5490" to="5505" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Chain of explanation: New prompting method to generate quality natural language explanation for implicit hate speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>An</surname></persName>
		</author>
		<idno type="DOI">10.1145/3543873.3587320</idno>
		<ptr target="https://doi.org/10.1145/3543873.3587320" />
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the ACM Web Conference 2023, WWW &apos;23 Companion</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="90" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th annual meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Evaluating gpt-3 generated explanations for hateful content moderation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Hee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Awal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">T W</forename><surname>Choo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K.-W</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence</title>
				<meeting>the Thirty-Second International Joint Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="6255" to="6263" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Bhardwaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.09662</idno>
		<title level="m">Red-teaming large language models using chain of utterances for safety alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Taori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gulrajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dubois</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Hashimoto</surname></persName>
		</author>
		<ptr target="https://github.com/tatsu-lab/stanford_alpaca" />
		<title level="m">Stanford alpaca: An instruction-following llama model</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Crosslingual generalization through multitask finetuning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sutawika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">Le</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Bari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">X</forename><surname>Yong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schoelkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Radev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Aji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Almubarak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Albanie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alyafeai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Webson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Raff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.891</idno>
		<ptr target="https://aclanthology.org/2023.acl-long.891" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="15991" to="16111" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<title level="m">Llama: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Wikidata: a free collaborative knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">A group-specific approach to NLP for hate speech detection</title>
		<author>
			<persName><forename type="first">K</forename><surname>Halevy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.11223</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">ConceptNet 5.5: An open multilingual graph of general knowledge</title>
		<author>
			<persName><forename type="first">R</forename><surname>Speer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Havasi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">31</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Perturbation CheckLists for evaluating NLG evaluation metrics</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Sai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dixit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Khapra</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.575</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.575" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Online and Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7219" to="7234" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">A structured review of the validity of BLEU</title>
		<author>
			<persName><forename type="first">E</forename><surname>Reiter</surname></persName>
		</author>
		<idno type="DOI">10.1162/coli_a_00322</idno>
		<ptr target="https://aclanthology.org/J18-3002" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="393" to="401" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Why we need new evaluation metrics for NLG</title>
		<author>
			<persName><forename type="first">J</forename><surname>Novikova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Dušek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Curry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rieser</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D17-1238</idno>
		<ptr target="https://aclanthology.org/D17-1238" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Palmer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hwa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</editor>
		<meeting>the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2241" to="2252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">A survey of evaluation metrics used for NLG systems</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Sai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Mohankumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Khapra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="39" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.14799</idno>
		<title level="m">Evaluation of text generation: A survey</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">BERTScore: Evaluating text generation with BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">The METEOR metric for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Denkowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Translation</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="105" to="115" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Klingner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Łukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gouws</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kudo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kazawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stevens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kurian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Patil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Riesa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rudnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hughes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.08144</idno>
		<title level="m">Google&apos;s neural machine translation system: Bridging the gap between human and machine translation</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">chrF: character n-gram F-score for automatic MT evaluation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W15-3049</idno>
		<ptr target="https://aclanthology.org/W15-3049" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics</title>
				<meeting>the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="392" to="395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">chrF++: words helping character ngrams</title>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W17-4770</idno>
		<ptr target="https://aclanthology.org/W17-4770" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics</title>
				<meeting>the Second Conference on Machine Translation, Association for Computational Linguistics<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="612" to="618" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">A call for clarity in reporting BLEU scores</title>
		<author>
			<persName><forename type="first">M</forename><surname>Post</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W18-6319" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics</title>
				<meeting>the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics<address><addrLine>Belgium, Brussels</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="186" to="191" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Krippendorff</surname></persName>
		</author>
		<title level="m">Computing Krippendorff&apos;s alpha-reliability</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
