<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Expert Survey on LLM-generated Explanations for Abusive Language Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barbara McGillivray</string-name>
          <email>barbara.mcgillivray@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Di Bonaventura</string-name>
          <email>chiara.di_bonaventura@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <email>lucia.siciliani@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>pierpaolo.basile@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Albert Meroño-Peñuela</string-name>
          <email>albert.merono@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Hate Speech Detection, Explanation Generation, Human Evaluation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Imperial College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>King's College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) the difficulty of reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Hate Speech Detection</kwd>
        <kwd>Explanation Generation</kwd>
        <kwd>Human Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR</p>
      <p>ceur-ws.org
1. Introduction
guage Processing (NLP) research on abusive language
[1] as increasing models’ complexity [2], models’
intrinsic bias [3], and international regulations [4] call for a
shift in perspective from performance-based models to
more transparent models. Moreover, recent studies have
shown the benefits of explanations for users [ 5, 6] and
content moderators [7] on social media platforms. The
former can benefit from receiving an explanation for why
a certain post has been flagged or removed whereas the
latter are shown to annotate toxic posts faster and solve
doubtful annotations thanks to explanations.</p>
    </sec>
    <sec id="sec-2">
      <title>Several eforts have moved towards explainable abu</title>
      <p>opment of datasets containing rationales (i.e., the tokens
sive language detection in the past years, like the devel- implications of these explanations remain understudied,
†Work partially funded by the Trustworthy AI Research award
received by The Alan Turing Institute and the the Italian Future AI
Research Foundation (FAIR).
nEvelop-O</p>
    </sec>
    <sec id="sec-3">
      <title>Language Models (LLMs) like FLAN-T5 [13] showing</title>
      <p>remarkable performance across tasks and human-like
text generation [14, 15, 16], recent studies have explored</p>
    </sec>
    <sec id="sec-4">
      <title>LLMs for explainable hate speech detection, wherein</title>
      <p>classification predictions are described through natural
language explanations [17, 18]. For instance, [19] used
chain-of-thought prompting [20] of LLMs to generate
explanations for implicit hate speech detection.</p>
      <p>However, most of these studies rely on empirical
metrics like BLEU [21] to evaluate the generated explanations
automatically. Consequently, the human perception and
as well as the extent to which empirical metrics
approximate human judgements. [22] recruited crowdworkers to
evaluate the level of hatefulness in tweets and the quality
of explanations generated by GPT-3. Instead, we
conduct an expert survey investigating four LLMs and five
learning strategies across multi-class abusive language
detection tasks to answer the following questions: RQ1:</p>
    </sec>
    <sec id="sec-5">
      <title>How well do LLM-generated explanations for abusive</title>
      <p>language detection match human expectations? RQ2:</p>
    </sec>
    <sec id="sec-6">
      <title>How well do empirical metrics align with human judge</title>
      <p>ments? RQ3: What makes LLM-generated explanations</p>
      <sec id="sec-6-1">
        <title>2. Experimental Setup</title>
        <sec id="sec-6-1-1">
          <title>Model</title>
        </sec>
        <sec id="sec-6-1-2">
          <title>Instruction</title>
        </sec>
        <sec id="sec-6-1-3">
          <title>Fine-tuned</title>
        </sec>
        <sec id="sec-6-1-4">
          <title>Toxicity</title>
        </sec>
        <sec id="sec-6-1-5">
          <title>Fine-tuned</title>
          <p>To answer these research questions, we design a
beforeand-after study, surveying participants about their prior
expectations about LLM-generated explanations and then
showing them examples generated by several LLMs with
diverse learning strategies1, followed by further inter- Table 2
views. To ensure robustness of our results, we recruited Summary of models used.
experts in the field, i.e., AI researchers, as described
below.</p>
          <p>FLAN-Alpaca
FLAN-T5
mT0
Llama-2</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Learning strategies. As diferent prompting strate</title>
      <p>2.1. Data gies might yield diferent results, we test five distinct
learning strategies using the established Stanford Alpaca
For our experiments, we use the HateXplain [8] and the template2 (cf. Appendix A for prompt details):
Implicit Hate Corpus [9] as they encompass diferent lev- (1) zero-shot learning (zsl): we pass “Classify the
els of ofensiveness (i.e., hate speech, ofensive, neutral), input text as list_of_labels, and provide an
explaexpressiveness (i.e., explicit hate, implicit hate, neutral), nation” in the instruction field of the template. The
multiple targeted groups, and explanations for the hate- list_of_labels changes according to the dataset used;
ful label (Table 1). These datasets contain unstructured (2) few-shot learning (fsl): we pass three additional
explanations of the words that constitute abuse (in Hat- examples to the aforementioned template, which are
raneXplain) and the user’s intent (in Implicit Hate). In view domly sampled with equal probability among the labels
of previous research arguing the need for structured ex- to account for class imbalance in the datasets. We
experiplanations in hateful content moderation [1], we use the mented with diferent numbers of examples (i.e., passing
following template to create structured explanations, that one, three or five examples), and chose three as it was
we will use as ground-truth: “Explanation: it contains the the best strategy;
following hateful words (implied statement):” for abusive (3) knowledge-guided zero-shot learning (kg):
incontent in HateXplain (Implicit Hate Corpus), and “The stead of passing additional examples in the prompts, we
text does not contain abusive content.” for neutral content. add external knowledge retrieved by means of an entity
linker3, which first detects entities mentioned in the
inDataset Labels Target Explanation put text, and then retrieves the relevant information from
HateXplain onhfeeanutestrisvapele,ech, .bw..loamcke,n, lTeovkeeln- tehnecyecxltoeprendailckknnoowwlleeddggee,bKanseo.wWleedJues[e2W9] ifkoirdhaatate[s2p8e]efcohr
IHmaptelicit ienmxepupltliirccaiitlthhaattee,, .Jw.e.hwiste,s, Ismtaptelimedent
tcpeolmamtepmoworaintlhsleiannngseuaidksdntiioctiwoknlneaodlwgfieelled.dWcgaeellamendod‘dcCiofonynttechxeetp’tptNrooeamtc[cp3ot0u]tenfmtorfor this external knowledge;
TSuabmlmea1ry of datasets used. pro(4m)pitnssutsreudctinio(n1)fintoe-itnustnriuncgtio(nft) fine:-tuwneeuLsleamthae-2s;ame
(5) knowledge-guided instruction fine-tuning
2.2. Methodology (kg_ft) : we use the knowledge-guided prompts
developed in (3) to instruction fine-tune Llama-2.</p>
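        <p>To make the templating concrete, the following minimal Python sketch assembles ground-truth explanations from token-level rationales (HateXplain) or implied statements (Implicit Hate Corpus). The function name and placeholder values are illustrative assumptions, not the authors’ code.</p>
        <preformat>
# Minimal sketch (illustrative, not the authors' code): building the
# structured ground-truth explanations described in Section 2.1.

def build_ground_truth(label, rationale_tokens=None, implied_statement=None):
    """Assemble the structured ground-truth explanation for one example."""
    if label == "neutral":
        return "The text does not contain abusive content."
    if rationale_tokens is not None:
        # HateXplain: token-level rationales (the hateful words)
        return ("Explanation: it contains the following hateful words: "
                + ", ".join(rationale_tokens))
    # Implicit Hate Corpus: free-text implied statement
    return ("Explanation: it contains the following implied statement: "
            + implied_statement)

# Placeholder inputs (hypothetical):
print(build_ground_truth("hate speech", rationale_tokens=["SLUR_1", "SLUR_2"]))
print(build_ground_truth("implicit hate", implied_statement="IMPLIED_STEREOTYPE"))
print(build_ground_truth("neutral"))
        </preformat>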
    </sec>
    <sec id="sec-8">
      <title>We extensively investigate four popular LLMs across five learning strategies on their ability to detect multi-class ofensiveness and expressiveness of abusive language and to generate explanations for the classification.</title>
      <p>Models. We use diferent open-source LLMs (Table 2):
the base versions of FLAN-Alpaca [23, 24], FLAN-T5
[13], mT0 [25], and the 7B foundational model Llama 2
[26], which is an updated version of LlaMA [27].</p>
    </sec>
    <sec id="sec-9">
      <title>1The data containing the LLM-generated explanations are</title>
      <p>publicly available at https://github.com/ChiaraDiBonaventura/
is-explanation-all-you-need</p>
    </sec>
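        <p>As an illustration of strategies (1) and (3), the following Python sketch fills the Stanford Alpaca template. The “### Instruction/Input/Response” field markers follow the public Stanford Alpaca release; the exact placement of the ‘Context’ field and the helper names are assumptions, not the paper’s actual implementation.</p>
        <preformat>
# Illustrative sketch of strategies (1) zsl and (3) kg from Section 2.2,
# using the Stanford Alpaca prompt template (cf. Appendix A).

ALPACA_VANILLA = (
    "Below is an instruction that describes a task, paired with input text. "
    "Write a response that appropriately completes the instruction.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

# 'Context' field placement is an assumption for this sketch.
ALPACA_KG = (
    "Below is an instruction that describes a task, paired with context and "
    "input text. Write a response that appropriately completes the "
    "instruction based on the context.\n\n"
    "### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n"
    "### Input:\n{input}\n\n### Response:\n"
)

def zsl_prompt(text, labels):
    # The list_of_labels changes according to the dataset used.
    instruction = ("Classify the input text as " + ", ".join(labels)
                   + ", and provide an explanation")
    return ALPACA_VANILLA.format(instruction=instruction, input=text)

def kg_prompt(text, labels, context):
    instruction = ("Classify the input text as " + ", ".join(labels)
                   + ", and provide an explanation")
    return ALPACA_KG.format(instruction=instruction, context=context, input=text)

# HateXplain label set; the Implicit Hate Corpus uses its own labels.
print(zsl_prompt("example post", ["hate speech", "offensive", "neutral"]))
print(kg_prompt("example post", ["hate speech", "offensive", "neutral"],
                context="Entity X: description retrieved from Wikidata/ConceptNet."))
        </preformat>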
    <sec id="sec-10">
      <title>Empirical eval metrics. We evaluate how closely the</title>
      <p>LLM-generated explanations match the ground-truth
across eight empirical similarity metrics due to the
challenge of simultaneously assessing a wide set of criteria
[31, 32, 33]. Following established NLG research [34, 35],
we choose BERTScore [36] and METEOR [37] for
semantic similarity. For syntactic similarity, we select
BLEU [21], GBLEU [38], ROUGE [39], ChrF [40] with
2https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#
data-release
3If available, we use the API provided by the knowledge source,
spaCy otherwise. https://spacy.io/
its derivates ChrF+ and ChrF++ [41, 42]. Additionally, of origin include Europe (60%), Asia (26.67%), Africa
we present an expert evaluation following our survey. (6.67%), and Latin America (6.67%).
2.3. Survey Design</p>
      <sec id="sec-10-1">
        <title>3. Results and Discussion</title>
      </sec>
    </sec>
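        <p>A minimal sketch of such scoring, assuming the Hugging Face evaluate package (which wraps BERTScore, METEOR, BLEU, and sacreBLEU’s chrF); this is an illustration, not the paper’s evaluation code, and the prediction/reference strings are placeholders.</p>
        <preformat>
# Illustrative scoring of an LLM-generated explanation against the
# ground-truth, assuming the Hugging Face `evaluate` package.
import evaluate

preds = ["Explanation: it contains the following hateful words: SLUR_1"]
refs = ["Explanation: it contains the following hateful words: SLUR_1, SLUR_2"]

bertscore = evaluate.load("bertscore")  # semantic similarity
meteor = evaluate.load("meteor")        # semantic similarity
bleu = evaluate.load("bleu")            # syntactic similarity
chrf = evaluate.load("chrf")            # character n-gram F-score

results = {
    "bertscore_f1": bertscore.compute(predictions=preds, references=refs,
                                      lang="en")["f1"][0],
    "meteor": meteor.compute(predictions=preds, references=refs)["meteor"],
    "bleu": bleu.compute(predictions=preds,
                         references=[[r] for r in refs])["bleu"],
    # word_order=2 corresponds to chrF++; the default (0) is plain chrF
    "chrf++": chrf.compute(predictions=preds,
                           references=[[r] for r in refs],
                           word_order=2)["score"],
}
print(results)
        </preformat>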
    <sec id="sec-11">
      <title>To evaluate how well LLMs align with human expec</title>
      <p>tations and judgements in explanation generation, we
design a before-and-after study as follows.</p>
    </sec>
    <sec id="sec-12">
      <title>Our 15 participants reach a fair agreement, with Krip</title>
      <p>pendorf’s alpha [ 43] equal to 38.43%.</p>
      <p>Fig. 1 shows changes in the relative frequencies of
participant scores in the usefulness and trustworthiness
Before treatment. We ask for participant’s back- of explanations before and after treatment. Participants’
ground information, e.g., gender identity, native language responses before treatment have expectations of textual
and how they would rate the usefulness and trustwor- explanations for classifications of being “highly useful”
thiness of a language model for explanation generation. (above 50%; highest possible score) in terms of usefulness,
Specifically, we ask “How useful would you rate a system and “moderately trustworthy” or “neutral” (above 40%;
that provides you a textual explanation for its classifica- second and third best possible score) in terms of
trusttion with respect to receiving only its classification?” and worthiness. However, scores for after treatment show
“How trustworthy would you rate a system that provides participants changing their usefulness scores towards
you a textual explanation for its classification with re- “moderately unuseful” (40-50%; second worst possible
spect to receiving only its classification?” on a 1-5 Likert score) and their trustworthiness scores to “highly
untrustscale. worthy” (above 30%; worst possible score). Agreement
difers in each category: usefulness is much more
conTreatment. As for the treatment, we show participants sensual, whereas trustworthiness is judged with higher
a sample of 70 texts from the datasets, paired with up to variance. In general, LLM-generated explanations do not
four diferent explanations. Specifically, given a text and meet human expectations in terms of usefulness and
trustground-truth explanation, participants are asked if the worthiness. Specifically, exposing participants to these
text is correctly explained. If yes, they are asked to rate explanations leads to an average percentage decrease of
three diferent LLM-generated explanations with respect 47.78% and 64.32% in the perception of the usefulness and
to the ground-truth on a 1-3 scale. These explanations trustworthiness of explanations, respectively.
are randomly sampled among the four LLMs and five Fig. 2 shows the scores of all empirical metrics and
learning strategies discussed in Section 2.2. expert evaluation for all models on explanation
generation. Overall, similarity metrics tend to be highly volatile
After treatment. Finally, we ask participants’ opinion with respect to each other. For instance, FLAN-Alpaca
on the usefulness and trustworthiness of explanation prompted with zero-shot learning (i.e., ‘alpaca_zsl’ in
generation, having seen the LLM-generated explanations. the figure) generates explanations that are more than
In addition, we ask general opinions related to what 70% semantically similar to the ground-truth
explanatype of errors they observed most frequently, and what a tions according to BERTScore while being less than 20%
good explanation would look like. semantically similar according to METEOR. Similarly
for syntax: BLEU and GBLEU similarity scores are less
The full list of questions is in the Appendix B. than 3% whereas ROUGE and chrF/+/++ are in the range
The institutional ethical board of the first author’s 9%-21%. Moreover, we observe that BERTScore has a
university approved our study design. We distributed tendency to over-score explanations compared to human
the survey through channels that allow us to target evaluation scores. Contrarily, METEOR, BLEU, GBLEU,
individuals working in AI who are familiar with the field ROUGE and chrF/+/++ have a tendency to under-score
of language models and/or AI Ethics, including NLP explanations. Instruction fine-tuning helped all metrics
reading groups and AI Ethics interest groups. To ensure to approximate expert evaluations better, especially when
the reliability of our before-and-after study, participants tuned on knowledge-guided prompts. We use the
Spearwere given 1 hour to complete as many answers as they man’s rank correlation coeficient to compare the
corcould. We collected answers from 15 participants, of relation between human scores and those provided by
which 33% (67%) identify as female (male), and 33% (67%) all the other metrics. In detail, we rank the models for
are (non) English native-speakers. The average level each type of metric, and then we compute the Spearman
of participants’ expertise in abusive language research correlation between the rank obtained by human scores
is 2.47 out of 5 (self-described)4, and their continents and those obtained by other metrics. Table 3 reports all
the correlation scores. We observe that BERTScore is
4The list of levels to choose from was: 1=Novice, 2=Advanced be- the most correlated with humans in both tasks. Also,
ginner, 3=Competent, 4=Proficient, 5=Expert.
chrF/+/++ metrics are highly correlated with humans
while all the other metrics based on syntactic matches
are slightly correlated with humans. Results show that
semantic metrics are more similar to how humans
evaluate the quality of the explanation generated by LLMs.</p>
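      <p>A minimal sketch of this ranking comparison, assuming scipy; the score arrays are hypothetical per-configuration averages, not the paper’s data.</p>
      <preformat>
# Illustrative Spearman rank correlation between expert scores and an
# empirical metric, with one score per model/strategy configuration.
from scipy.stats import spearmanr

# Hypothetical per-configuration averages (e.g., alpaca_zsl, alpaca_fsl, ...)
human_scores = [2.1, 1.8, 2.6, 1.5, 2.9]
bertscore_f1 = [0.78, 0.74, 0.83, 0.71, 0.86]

# spearmanr ranks both lists internally, so correlating the raw scores is
# equivalent to ranking the models by each metric and correlating the ranks.
rho, p_value = spearmanr(human_scores, bertscore_f1)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
      </preformat>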
      <p>Only one metric (ROUGE) shows a different behaviour between the two tasks.</p>
      <p>Since 38.55% of the ground-truth explanations were not rated as good explanations by participants, we further investigated what the most common errors are and what makes an explanation good. Table 4 reports the most common error categories reported by participants. Most of them are related to logical fallacies (e.g., contradictory statements, hallucination), especially in the context of sarcasm and self-deprecating humour, rather than linguistic errors (e.g., grammar, misspellings). It is worth noticing that 13.33% of the participants reported that LLM-generated explanations contain cultural bias (e.g., stereotypes), with the implication of potentially perpetuating harms against the targeted victims of abusive language. As for desiderata, 73.33% of participants would like to receive textual explanations that are coherent with human reasoning and understanding, i.e., that are relevant and exhaustive with respect to the text they refer to while being logically and linguistically correct. A remaining 20% thinks that a good explanation must be coherent with model reasoning instead. In other words, participants are much more concerned about how the explanation looks than about its reflection of the inner mechanism of the model’s reasoning. To quote a participant’s perspective, “I would want the explanation to be helpful to me and guide my own reasoning”.</p>
      <sec id="sec-12-1">
        <title>4. Conclusion</title>
        <p>vs. syntactic), and therefore pointing at the need of more
reliable metrics for the empirical evaluation of textual
exIn this paper, we conducted a before-and-after study to planations. In general, BERTScore and METEOR metrics
understand human expectations and judgements of LLM- exhibit the strongest correlation with human judgements.
generated explanations for multi-class abusive language Lastly, our study provides evidence of the desiderata for
detection tasks. Contrarily to previous research [22], we LLM-generated explanations, suggesting that
explanainvestigated multiple LLMs and learning techniques, and tions should be coherent with human reasoning rather
we surveyed AI experts who are familiar with abusive than model reasoning. Participants value the most
texlanguage research instead of crowdworkers. We found tual explanations that are relevant and exhaustive to the
that human expectations in terms of usefulness and trust- text they refer to, while being logically and
linguistiworthiness of LLM-generated explanations are not met: cally correct. Justifications for this preference lie on the
after seeing these explanations, the usefulness and trust- fact that abusive language detection heavily relies on
worthiness ratings decrease by 47.78% and 64.32%, re- additional context and knowledge about slang and slurs,
spectively. Secondly, our results show that empirical for which receiving an explanation is helpful to
particmetrics commonly used to evaluate textual explanations ipants’ understanding of the text. Future work should
are highly volatile with respect to each other, even when investigate whether this preference holds for other
dothey measure the same type of similarity (i.e., semantic mains as well. In light of our findings, we conclude with</p>
      </sec>
      <sec id="sec-12-2">
        <title>Acknowledgments</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>This work was supported by the UK Research and Inno</title>
      <p>vation [grant number EP/S023356/1] in the UKRI Centre
for Doctoral Training in Safe and Trusted Artificial
Intelligence (www.safeandtrustedai.org); by the Trustworthy
AI Research award by The Alan Turing Institute,
supported by the British Embassy Rome and the UK Science
&amp; Innovation Network; and by the PNRR project FAIR
Future AI Research (PE00000013), Spoke 6 - Symbiotic AI
(CUP H97G22000210007) under the NRRP MUR program
funded by the NextGenerationEU.
three recommendations to use LLMs responsibly for
explainable abusive language detection: (1) be aware of the
cultural bias these models might exhibit when generating
free-text explanations, which can further harm targeted
groups; (2) if possible, instruction fine-tune LLMs for
explanation generation of abusive language detection.</p>
      <p>This not only could ensure the generation of structured
explanations as advised by previous research [1] but it
also returns the highest evaluation scores, both
empirically and expert-wise, when using knowledge-guided
prompts; (3) opt for a combination of empirical metrics to
evaluate textual explanations when no human evaluation
is possible, since no particular empirical metric seems to
generalise across diferent learning techniques, models
and datasets, making the ground-truth lie somewhere
in between BERTScore (upper bound) and BLEU (lower
bound).
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka- guage models using chain of utterances for
safetyplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- alignment, arXiv preprint arXiv:2308.09662 (2023).
try, A. Askell, et al., Language models are few-shot [24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
learners, Advances in neural information process- C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
aling systems 33 (2020) 1877–1901. paca: An instruction-following llama model, https:
[15] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKe- //github.com/tatsu-lab/stanford_alpaca, 2023.
own, T. B. Hashimoto, Benchmarking large lan- [25] N. Muennighof, T. Wang, L. Sutawika, A. Roberts,
guage models for news summarization, arXiv S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X.
preprint arXiv:2301.13848 (2023). Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji,
[16] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, K. Almubarak, S. Albanie, Z. Alyafeai, A.
WebD. Yang, Can large language models transform son, E. Raf, C. Rafel, Crosslingual
generalizacomputational social science?, arXiv preprint tion through multitask finetuning, in: A. Rogers,
arXiv:2305.03514 (2023). J. Boyd-Graber, N. Okazaki (Eds.), Proceedings
[17] S. Roy, A. Harshvardhan, A. Mukherjee, P. Saha, of the 61st Annual Meeting of the Association
Probing LLMs for hate speech detection: strengths for Computational Linguistics (Volume 1: Long
and vulnerabilities, in: H. Bouamor, J. Pino, Papers), Association for Computational
LinguisK. Bali (Eds.), Findings of the Association for tics, Toronto, Canada, 2023, pp. 15991–16111. URL:
Computational Linguistics: EMNLP 2023, Asso- https://aclanthology.org/2023.acl-long.891. doi:10.
ciation for Computational Linguistics, Singapore, 18653/v1/2023.acl- long.891.
2023, pp. 6116–6128. URL: https://aclanthology.org/ [26] H. Touvron, L. Martin, K. Stone, P. Albert, A.
Alma2023.findings-emnlp.407. doi:10.18653/v1/2023. hairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bharfindings- emnlp.407. gava, S. Bhosale, et al., Llama 2: Open
founda[18] Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. tion and fine-tuned chat models, arXiv preprint
Yun, HARE: Explainable hate speech detection arXiv:2307.09288 (2023).
with step-by-step reasoning, in: H. Bouamor, [27] H. Touvron, T. Lavril, G. Izacard, X. Martinet,
J. Pino, K. Bali (Eds.), Findings of the Association M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
for Computational Linguistics: EMNLP 2023, Asso- E. Hambro, F. Azhar, et al., Llama: Open and
eficiation for Computational Linguistics, Singapore, cient foundation language models, arXiv preprint
2023, pp. 5490–5505. URL: https://aclanthology.org/ arXiv:2302.13971 (2023).
2023.findings-emnlp.365. doi:10.18653/v1/2023. [28] D. Vrandečić, M. Krötzsch, Wikidata: a free
colfindings- emnlp.365. laborative knowledgebase, Communications of the
[19] F. Huang, H. Kwak, J. An, Chain of explana- ACM 57 (2014) 78–85.</p>
      <p>tion: New prompting method to generate qual- [29] K. Halevy, A group-specific approach to nlp for hate
ity natural language explanation for implicit hate speech detection, arXiv preprint arXiv:2304.11223
speech, in: Companion Proceedings of the ACM (2023).</p>
      <p>Web Conference 2023, WWW ’23 Companion, As- [30] R. Speer, J. Chin, C. Havasi, Conceptnet 5.5: An
sociation for Computing Machinery, New York, NY, open multilingual graph of general knowledge, in:
USA, 2023, p. 90–93. URL: https://doi.org/10.1145/ Proceedings of the AAAI conference on artificial
3543873.3587320. doi:10.1145/3543873.3587320. intelligence, volume 31, 2017.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, [31] A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, M. M.</p>
      <p>E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought Khapra, Perturbation CheckLists for evaluating
prompting elicits reasoning in large language mod- NLG evaluation metrics, in: M.-F. Moens, X. Huang,
els, Advances in neural information processing L. Specia, S. W.-t. Yih (Eds.), Proceedings of the
systems 35 (2022) 24824–24837. 2021 Conference on Empirical Methods in
Natu[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a ral Language Processing, Association for
Compumethod for automatic evaluation of machine trans- tational Linguistics, Online and Punta Cana,
Dolation, in: Proceedings of the 40th annual meeting minican Republic, 2021, pp. 7219–7234. URL: https:
of the Association for Computational Linguistics, //aclanthology.org/2021.emnlp-main.575. doi:10.
2002, pp. 311–318. 18653/v1/2021.emnlp- main.575.
[22] H. Wang, M. S. Hee, M. R. Awal, K. T. W. Choo, R. K.- [32] E. Reiter, A structured review of the validity
W. Lee, Evaluating gpt-3 generated explanations of BLEU, Computational Linguistics 44 (2018)
for hateful content moderation, in: Proceedings of 393–401. URL: https://aclanthology.org/J18-3002.
the Thirty-Second International Joint Conference doi:10.1162/coli_a_00322.</p>
      <p>on Artificial Intelligence, 2023, pp. 6255–6263. [33] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser,
[23] R. Bhardwaj, S. Poria, Red-teaming large lan- Why we need new evaluation metrics for NLG,</p>
      <sec id="sec-13-1">
        <title>A. Prompt Details</title>
      <p>Vanilla template: “Below is an instruction that describes a task, paired with input text. Write a response that appropriately completes the instruction.”</p>
      <p>Knowledge-guided template: “Below is an instruction that describes a task, paired with context and input text. Write a response that appropriately completes the instruction based on the context.”</p>
        <sec id="sec-13-1-1">
          <title>Questions</title>
      <p>Before Treatment:</p>
      <list list-type="bullet">
        <list-item><p>“Which gender do you identify as?”</p></list-item>
        <list-item><p>“Are you an English native-speaker?”</p></list-item>
        <list-item><p>“What is your country of origin?”</p></list-item>
        <list-item><p>“What is your level of expertise on language models or abusive language?”</p></list-item>
        <list-item><p>“How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?”</p></list-item>
        <list-item><p>“How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?”</p></list-item>
      </list>
      <p>Treatment:</p>
      <list list-type="bullet">
        <list-item><p>“Do you think explanation 1 provides a good explanation given the text?”</p></list-item>
        <list-item><p>“If your answer was yes, does explanation 2 mean the same thing as explanation 1?”</p></list-item>
        <list-item><p>“If your answer was yes, does explanation 3 mean the same thing as explanation 1?”</p></list-item>
        <list-item><p>“If your answer was yes, does explanation 4 mean the same thing as explanation 1?”</p></list-item>
      </list>
      <p>After Treatment:</p>
      <list list-type="bullet">
        <list-item><p>“Having seen these explanations, how useful would you rate a system that provides you a textual explanation for its classification?”</p></list-item>
        <list-item><p>“Having seen these explanations, how trustworthy would you rate a system that provides you a textual explanation for its classification?”</p></list-item>
        <list-item><p>“What was the main error you noticed in these explanations?”</p></list-item>
        <list-item><p>“What do you think makes a textual explanation good?”</p></list-item>
        <list-item><p>“Do you have any comment you would like to share?”</p></list-item>
      </list>
        </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref14"><mixed-citation>[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large language models for news summarization, arXiv preprint arXiv:2301.13848 (2023).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, D. Yang, Can large language models transform computational social science?, arXiv preprint arXiv:2305.03514 (2023).</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] S. Roy, A. Harshvardhan, A. Mukherjee, P. Saha, Probing LLMs for hate speech detection: strengths and vulnerabilities, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 6116–6128. URL: https://aclanthology.org/2023.findings-emnlp.407. doi:10.18653/v1/2023.findings-emnlp.407.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. Yun, HARE: Explainable hate speech detection with step-by-step reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 5490–5505. URL: https://aclanthology.org/2023.findings-emnlp.365. doi:10.18653/v1/2023.findings-emnlp.365.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] F. Huang, H. Kwak, J. An, Chain of explanation: New prompting method to generate quality natural language explanation for implicit hate speech, in: Companion Proceedings of the ACM Web Conference 2023, WWW ’23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, pp. 90–93. URL: https://doi.org/10.1145/3543873.3587320. doi:10.1145/3543873.3587320.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] H. Wang, M. S. Hee, M. R. Awal, K. T. W. Choo, R. K.-W. Lee, Evaluating GPT-3 generated explanations for hateful content moderation, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6255–6263.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] R. Bhardwaj, S. Poria, Red-teaming large language models using chain of utterances for safety-alignment, arXiv preprint arXiv:2308.09662 (2023).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 15991–16111. URL: https://aclanthology.org/2023.acl-long.891. doi:10.18653/v1/2023.acl-long.891.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] K. Halevy, A group-specific approach to NLP for hate speech detection, arXiv preprint arXiv:2304.11223 (2023).</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, M. M. Khapra, Perturbation CheckLists for evaluating NLG evaluation metrics, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7219–7234. URL: https://aclanthology.org/2021.emnlp-main.575. doi:10.18653/v1/2021.emnlp-main.575.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] E. Reiter, A structured review of the validity of BLEU, Computational Linguistics 44 (2018) 393–401. URL: https://aclanthology.org/J18-3002. doi:10.1162/coli_a_00322.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser, Why we need new evaluation metrics for NLG, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.</mixed-citation></ref>
    </ref-list>
  </back>
</article>