Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection

Chiara Di Bonaventura 1,2,∗,†, Lucia Siciliani 3, Pierpaolo Basile 3, Albert Meroño-Peñuela 1 and Barbara McGillivray 1

1 King's College London, London, United Kingdom
2 Imperial College London, London, United Kingdom
3 Department of Computer Science, University of Bari Aldo Moro, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† Work partially funded by the Trustworthy AI Research award received by The Alan Turing Institute and the Italian Future AI Research Foundation (FAIR).
chiara.di_bonaventura@kcl.ac.uk (C. Di Bonaventura); lucia.siciliani@uniba.it (L. Siciliani); pierpaolo.basile@uniba.it (P. Basile); albert.merono@kcl.ac.uk (A. Meroño-Peñuela); barbara.mcgillivray@kcl.ac.uk (B. McGillivray)
ORCID: 0000-0002-1438-280X (L. Siciliani); 0000-0002-0545-1105 (P. Basile); 0000-0003-4646-5842 (A. Meroño-Peñuela)

Abstract
Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) difficulty in reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.

Keywords
Large Language Models, Hate Speech Detection, Explanation Generation, Human Evaluation

1. Introduction

Explainability is a crucial open challenge in Natural Language Processing (NLP) research on abusive language [1], as increasing models' complexity [2], models' intrinsic bias [3], and international regulations [4] call for a shift in perspective from performance-based models to more transparent models. Moreover, recent studies have shown the benefits of explanations for users [5, 6] and content moderators [7] on social media platforms. The former can benefit from receiving an explanation for why a certain post has been flagged or removed, whereas the latter are shown to annotate toxic posts faster and solve doubtful annotations thanks to explanations.

Several efforts have moved towards explainable abusive language detection in the past years, like the development of datasets containing rationales (i.e., the tokens in the text that suggest why the text is hateful) [8] or implied statements (i.e., descriptions of the implied meaning of the text) [9, 10], and shared tasks on explainable hate speech detection [11, 12], inter alia. With Large Language Models (LLMs) like FLAN-T5 [13] showing remarkable performance across tasks and human-like text generation [14, 15, 16], recent studies have explored LLMs for explainable hate speech detection, wherein classification predictions are described through natural language explanations [17, 18]. For instance, [19] used chain-of-thought prompting [20] of LLMs to generate explanations for implicit hate speech detection.

However, most of these studies rely on empirical metrics like BLEU [21] to evaluate the generated explanations automatically. Consequently, the human perception and implications of these explanations remain understudied, as well as the extent to which empirical metrics approximate human judgements.
[22] recruited crowdworkers to evaluate the level of hatefulness in tweets and the quality of explanations generated by GPT-3. Instead, we conduct an expert survey investigating four LLMs and five learning strategies across multi-class abusive language detection tasks to answer the following questions: RQ1: How well do LLM-generated explanations for abusive language detection match human expectations? RQ2: How well do empirical metrics align with human judgements? RQ3: What makes LLM-generated explanations good, according to experts?

2. Experimental Setup

To answer these research questions, we design a before-and-after study, surveying participants about their prior expectations about LLM-generated explanations, then showing them examples generated by several LLMs with diverse learning strategies (the data containing the LLM-generated explanations are publicly available at https://github.com/ChiaraDiBonaventura/is-explanation-all-you-need), followed by further interviews. To ensure robustness of our results, we recruited experts in the field, i.e., AI researchers, as described below.

2.1. Data

For our experiments, we use HateXplain [8] and the Implicit Hate Corpus [9] as they encompass different levels of offensiveness (i.e., hate speech, offensive, neutral), expressiveness (i.e., explicit hate, implicit hate, neutral), multiple targeted groups, and explanations for the hateful label (Table 1). These datasets contain unstructured explanations of the words that constitute abuse (in HateXplain) and of the user's intent (in Implicit Hate). In view of previous research arguing the need for structured explanations in hateful content moderation [1], we use the following template to create structured explanations, which we will use as ground truth: "Explanation: it contains the following hateful words (implied statement):" for abusive content in HateXplain (Implicit Hate Corpus), and "The text does not contain abusive content." for neutral content.

Table 1: Summary of datasets used.
Dataset       | Labels                                | Target            | Explanation
HateXplain    | hate speech, offensive, neutral       | women, black, ... | Token-level
Implicit Hate | implicit hate, explicit hate, neutral | Jews, whites, ... | Implied statement
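For concreteness, the construction of these ground-truth explanations can be pictured as a small preprocessing step. The following is a minimal sketch, not the authors' released code; it assumes each annotated instance exposes a label plus either rationale tokens (HateXplain) or an implied statement (Implicit Hate Corpus), and the field names are illustrative assumptions.

```python
# Minimal sketch of building the structured ground-truth explanations of Section 2.1.
# Field names ("label", "rationale_tokens", "implied_statement") are assumptions made
# for illustration, not the datasets' actual schemas.

def build_ground_truth(example: dict, dataset: str) -> str:
    """Return the structured explanation for one annotated instance."""
    if example["label"] == "neutral":
        return "The text does not contain abusive content."
    if dataset == "hatexplain":
        # HateXplain provides token-level rationales for the hateful/offensive label.
        hateful_words = ", ".join(example["rationale_tokens"])
        return f"Explanation: it contains the following hateful words: {hateful_words}"
    # Implicit Hate Corpus provides a free-text implied statement instead.
    return f"Explanation: it contains the following implied statement: {example['implied_statement']}"


print(build_ground_truth({"label": "hate speech", "rationale_tokens": ["<slur>", "<slur>"]},
                         "hatexplain"))
```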
2.2. Methodology

We extensively investigate four popular LLMs across five learning strategies on their ability to detect multi-class offensiveness and expressiveness of abusive language and to generate explanations for the classification.

Models. We use different open-source LLMs (Table 2): the base versions of FLAN-Alpaca [23, 24], FLAN-T5 [13], mT0 [25], and the 7B foundational model Llama 2 [26], which is an updated version of LLaMA [27].

Table 2: Summary of models used.
Model       | Instruction Fine-tuned | Toxicity Fine-tuned
FLAN-Alpaca | ✓                      | ✓
FLAN-T5     | ✓                      | ✓
mT0         | ✓                      | -
Llama-2     | -                      | -

Learning strategies. As different prompting strategies might yield different results, we test five distinct learning strategies using the established Stanford Alpaca template (https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release; cf. Appendix A for prompt details). A sketch of how these prompts can be assembled follows the list below:

(1) zero-shot learning (zsl): we pass "Classify the input text as list_of_labels, and provide an explanation" in the instruction field of the template. The list_of_labels changes according to the dataset used;

(2) few-shot learning (fsl): we pass three additional examples to the aforementioned template, which are randomly sampled with equal probability among the labels to account for class imbalance in the datasets. We experimented with different numbers of examples (i.e., passing one, three or five examples), and chose three as it was the best strategy;

(3) knowledge-guided zero-shot learning (kg): instead of passing additional examples in the prompts, we add external knowledge retrieved by means of an entity linker (if available, we use the API provided by the knowledge source, spaCy otherwise: https://spacy.io/), which first detects entities mentioned in the input text, and then retrieves the relevant information from the external knowledge base. We use Wikidata [28] for encyclopedic knowledge, KnowledJe [29] for hate speech temporal linguistic knowledge, and ConceptNet [30] for commonsense knowledge. We modify the prompt template with an additional field called 'context' to account for this external knowledge;

(4) instruction fine-tuning (ft): we use the same prompts used in (1) to instruction fine-tune Llama-2;

(5) knowledge-guided instruction fine-tuning (kg_ft): we use the knowledge-guided prompts developed in (3) to instruction fine-tune Llama-2.
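To make the five strategies concrete, the sketch below shows how the vanilla and knowledge-guided Alpaca-style prompts (reported verbatim in Table 5, Appendix A) could be assembled. The helper function and the way the context is passed in are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the Alpaca-style prompts used by the learning strategies above.
# The template strings mirror Table 5 (Appendix A); the helper itself is illustrative.

VANILLA = (
    "Below is an instruction that describes a task, paired with input text. "
    "Write a response that appropriately completes the instruction.\n"
    "Instruction: Classify the input text as {labels}, and provide an explanation.\n"
    "Input text: {text}.\n"
    "Response:"
)

KNOWLEDGE_GUIDED = (
    "Below is an instruction that describes a task, paired with context and input text. "
    "Write a response that appropriately completes the instruction based on the context.\n"
    "Instruction: Classify the input text as {labels}, and provide an explanation.\n"
    "Context: {context}.\n"
    "Input text: {text}.\n"
    "Response:"
)


def build_prompt(text: str, labels: list[str], context: str | None = None) -> str:
    """Fill the vanilla template (zsl, fsl, ft) or the knowledge-guided one (kg, kg_ft)
    when external knowledge retrieved via an entity linker is available."""
    label_str = ", ".join(labels)
    if context is None:
        return VANILLA.format(labels=label_str, text=text)
    return KNOWLEDGE_GUIDED.format(labels=label_str, text=text, context=context)


# Example: a zero-shot prompt with HateXplain's label set.
print(build_prompt("example input post", ["hate speech", "offensive", "neutral"]))
```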
Empirical eval metrics. We evaluate how closely the LLM-generated explanations match the ground truth across eight empirical similarity metrics, due to the challenge of simultaneously assessing a wide set of criteria [31, 32, 33]. Following established NLG research [34, 35], we choose BERTScore [36] and METEOR [37] for semantic similarity. For syntactic similarity, we select BLEU [21], GBLEU [38], ROUGE [39], and chrF [40] with its derivatives chrF+ and chrF++ [41, 42]. Additionally, we present an expert evaluation following our survey.
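The paper does not state which implementations back these metrics. As a hedged illustration only, all eight scores could be computed with the Hugging Face evaluate library (an assumption about tooling, not the authors' setup):

```python
# Illustrative scoring of generated explanations against the structured ground truth,
# using the Hugging Face `evaluate` library (assumed tooling, not the authors' setup).
import evaluate

preds = ["Explanation: it contains the following hateful words: ..."]
refs = ["Explanation: it contains the following hateful words: ..."]

bertscore = evaluate.load("bertscore").compute(predictions=preds, references=refs, lang="en")
meteor = evaluate.load("meteor").compute(predictions=preds, references=refs)
bleu = evaluate.load("bleu").compute(predictions=preds, references=refs)
gbleu = evaluate.load("google_bleu").compute(predictions=preds, references=refs)  # GBLEU
rouge = evaluate.load("rouge").compute(predictions=preds, references=refs)
chrf = evaluate.load("chrf").compute(predictions=preds, references=refs)  # chrF
chrf_pp = evaluate.load("chrf").compute(predictions=preds, references=refs, word_order=2)  # chrF++

print(bertscore["f1"][0], meteor["meteor"], bleu["bleu"], rouge["rougeL"], chrf["score"])
```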
2.3. Survey Design

To evaluate how well LLMs align with human expectations and judgements in explanation generation, we design a before-and-after study as follows.

Before treatment. We ask for participants' background information, e.g., gender identity, native language, and how they would rate the usefulness and trustworthiness of a language model for explanation generation. Specifically, we ask "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" and "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" on a 1-5 Likert scale.

Treatment. As for the treatment, we show participants a sample of 70 texts from the datasets, paired with up to four different explanations. Specifically, given a text and its ground-truth explanation, participants are asked if the text is correctly explained. If yes, they are asked to rate three different LLM-generated explanations with respect to the ground truth on a 1-3 scale. These explanations are randomly sampled among the four LLMs and five learning strategies discussed in Section 2.2.

After treatment. Finally, we ask participants' opinion on the usefulness and trustworthiness of explanation generation, having seen the LLM-generated explanations. In addition, we ask general opinions related to what type of errors they observed most frequently, and what a good explanation would look like.

The full list of questions is in Appendix B.

The institutional ethical board of the first author's university approved our study design. We distributed the survey through channels that allow us to target individuals working in AI who are familiar with the field of language models and/or AI Ethics, including NLP reading groups and AI Ethics interest groups. To ensure the reliability of our before-and-after study, participants were given 1 hour to complete as many answers as they could. We collected answers from 15 participants, of which 33% (67%) identify as female (male), and 33% (67%) are (non) English native-speakers. The average level of participants' expertise in abusive language research is 2.47 out of 5 (self-described; levels: 1=Novice, 2=Advanced beginner, 3=Competent, 4=Proficient, 5=Expert), and their continents of origin include Europe (60%), Asia (26.67%), Africa (6.67%), and Latin America (6.67%).

3. Results and Discussion

Our 15 participants reach a fair agreement, with Krippendorff's alpha [43] equal to 38.43%.

Fig. 1 shows changes in the relative frequencies of participant scores in the usefulness and trustworthiness of explanations before and after treatment. Before treatment, participants expect textual explanations for classifications to be "highly useful" (above 50%; highest possible score) in terms of usefulness, and "moderately trustworthy" or "neutral" (above 40%; second and third best possible scores) in terms of trustworthiness. However, scores after treatment show participants changing their usefulness scores towards "moderately unuseful" (40-50%; second worst possible score) and their trustworthiness scores to "highly untrustworthy" (above 30%; worst possible score). Agreement differs in each category: usefulness is much more consensual, whereas trustworthiness is judged with higher variance. In general, LLM-generated explanations do not meet human expectations in terms of usefulness and trustworthiness. Specifically, exposing participants to these explanations leads to an average percentage decrease of 47.78% and 64.32% in the perceived usefulness and trustworthiness of explanations, respectively.

Figure 1: Relative frequencies of Likert scores before and after treatment on usefulness and trustworthiness of LLMs for explanation generation in abusive language detection.

Fig. 2 shows the scores of all empirical metrics and of the expert evaluation for all models on explanation generation. Overall, similarity metrics tend to be highly volatile with respect to each other. For instance, FLAN-Alpaca prompted with zero-shot learning (i.e., 'alpaca_zsl' in the figure) generates explanations that are more than 70% semantically similar to the ground-truth explanations according to BERTScore, while being less than 20% semantically similar according to METEOR. Similarly for syntax: BLEU and GBLEU similarity scores are less than 3%, whereas ROUGE and chrF/+/++ are in the range 9%-21%. Moreover, we observe that BERTScore has a tendency to over-score explanations compared to human evaluation scores. Conversely, METEOR, BLEU, GBLEU, ROUGE and chrF/+/++ have a tendency to under-score explanations. Instruction fine-tuning helped all metrics to approximate expert evaluations better, especially when tuned on knowledge-guided prompts.

Figure 2: Evaluation of explanation generation by LLMs across empirical metrics and human eval.

We use Spearman's rank correlation coefficient to compare the correlation between human scores and those provided by all the other metrics. In detail, we rank the models for each type of metric, and then we compute the Spearman correlation between the ranking obtained by human scores and those obtained by the other metrics. Table 3 reports all the correlation scores. We observe that BERTScore is the most correlated with humans in both tasks. Also, the chrF/+/++ metrics are highly correlated with humans, while all the other metrics based on syntactic matches are only slightly correlated with humans. Results show that semantic metrics are more similar to how humans evaluate the quality of the explanations generated by LLMs. Only one metric (ROUGE) shows a different behaviour between the two tasks.

Table 3: The Spearman coefficient between each metric and experts' scores.
Metric    | Implicit Hate | HateXplain
bertscore | 0.80          | 0.91
meteor    | 0.64          | 0.89
chrf1     | 0.60          | 0.83
chrf2     | 0.60          | 0.81
chrf      | 0.57          | 0.83
gbleu     | 0.53          | 0.25
rouge     | 0.50          | 0.86
bleu      | 0.27          | 0.11
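As a hedged illustration of the analysis behind Table 3, model configurations can be ranked by expert score and by a given metric and the two rankings compared with Spearman's rho (scipy performs the ranking internally). The numbers below are placeholders, not the paper's data.

```python
# Sketch of the rank-correlation analysis behind Table 3. Values are placeholders.
from scipy.stats import spearmanr

# One entry per model/learning-strategy configuration (e.g., alpaca_zsl, llama2_kg_ft, ...).
expert_scores = [2.1, 1.4, 2.6, 1.9, 2.3]          # hypothetical mean expert ratings (1-3 scale)
bertscore_scores = [0.78, 0.66, 0.84, 0.71, 0.80]  # hypothetical BERTScore values

# spearmanr ranks both score lists and correlates the resulting rankings.
rho, p_value = spearmanr(expert_scores, bertscore_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```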
Since 38.55% of the ground-truth explanations were not rated as good explanations by participants, we further investigated what the most common errors are and what makes an explanation good. Table 4 reports the most common error categories reported by participants. Most of them are related to logical fallacies (e.g., contradictory statements, hallucination), especially in the context of sarcasm and self-deprecating humour, rather than linguistic errors (e.g., grammar, misspellings). It is worth noting that 13.33% of the participants reported that LLM-generated explanations contain cultural bias (e.g., stereotypes), with the implication of potentially perpetuating harms against the targeted victims of abusive language. As for desiderata, 73.33% of participants would like to receive textual explanations that are coherent with human reasoning and understanding, i.e., that are relevant and exhaustive to the text they refer to while being logically and linguistically correct. A remaining 20% thinks that a good explanation must be coherent with model reasoning instead. In other words, participants are much more concerned about what the explanation looks like than about its reflection of the inner mechanism of the model reasoning. To quote a participant's perspective, "I would want the explanation to be helpful to me and guide my own reasoning".

Table 4: Percentage of error categories reported by participants.
Error Category  | Relative Frequency
Logical Errors  | 26.67%
Vagueness       | 20.00%
Cultural Bias   | 13.33%
Hallucination   | 13.33%
Irrelevant Info | 13.33%
Other           | 6.67%

4. Conclusion

In this paper, we conducted a before-and-after study to understand human expectations and judgements of LLM-generated explanations for multi-class abusive language detection tasks. In contrast to previous research [22], we investigated multiple LLMs and learning techniques, and we surveyed AI experts who are familiar with abusive language research instead of crowdworkers. We found that human expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met: after seeing these explanations, the usefulness and trustworthiness ratings decrease by 47.78% and 64.32%, respectively. Secondly, our results show that empirical metrics commonly used to evaluate textual explanations are highly volatile with respect to each other, even when they measure the same type of similarity (i.e., semantic vs. syntactic), therefore pointing at the need for more reliable metrics for the empirical evaluation of textual explanations. In general, BERTScore and METEOR exhibit the strongest correlation with human judgements. Lastly, our study provides evidence of the desiderata for LLM-generated explanations, suggesting that explanations should be coherent with human reasoning rather than model reasoning. Participants value most those textual explanations that are relevant and exhaustive to the text they refer to, while being logically and linguistically correct. Justifications for this preference lie in the fact that abusive language detection heavily relies on additional context and knowledge about slang and slurs, for which receiving an explanation is helpful to participants' understanding of the text. Future work should investigate whether this preference holds for other domains as well.

In light of our findings, we conclude with three recommendations to use LLMs responsibly for explainable abusive language detection: (1) be aware of the cultural bias these models might exhibit when generating free-text explanations, which can further harm targeted groups; (2) if possible, instruction fine-tune LLMs for explanation generation of abusive language detection: this not only could ensure the generation of structured explanations as advised by previous research [1], but it also returns the highest evaluation scores, both empirically and expert-wise, when using knowledge-guided prompts; (3) opt for a combination of empirical metrics to evaluate textual explanations when no human evaluation is possible, since no particular empirical metric seems to generalise across different learning techniques, models and datasets, making the ground truth lie somewhere in between BERTScore (upper bound) and BLEU (lower bound).
Acknowledgments

This work was supported by UK Research and Innovation [grant number EP/S023356/1] in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org); by the Trustworthy AI Research award by The Alan Turing Institute, supported by the British Embassy Rome and the UK Science & Innovation Network; and by the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.

References

[1] P. Mishra, H. Yannakoudakis, E. Shutova, Tackling online abuse: A survey of automated abuse detection methods, arXiv preprint arXiv:1908.06024 (2019).
[2] P. Barceló, M. Monet, J. Pérez, B. Subercaseaux, Model interpretability through the lens of computational complexity, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 15487–15498. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/b1adda14824f50ef24ff1c05bb66faf3-Paper.pdf.
[3] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The risk of racial bias in hate speech detection, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1668–1678. URL: https://aclanthology.org/P19-1163. doi:10.18653/v1/P19-1163.
[4] The European Parliament and The Council of the European Union, EU Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Official Journal of the European Union (2016).
[5] O. L. Haimson, D. Delmonaco, P. Nie, A. Wegner, Disproportionate removals and differing content moderation experiences for conservative, transgender, and black social media users: Marginalization and moderation gray areas, Proc. ACM Hum.-Comput. Interact. 5 (2021). URL: https://doi.org/10.1145/3479610. doi:10.1145/3479610.
[6] J. Brunk, J. Mattern, D. M. Riehle, Effect of transparency and trust on acceptance of automatic online comment moderation systems, in: 2019 IEEE 21st Conference on Business Informatics (CBI), volume 01, 2019, pp. 429–435. doi:10.1109/CBI.2019.00056.
[7] A. Calabrese, L. Neves, N. Shah, M. W. Bos, B. Ross, M. Lapata, F. Barbieri, Explainability and hate speech: Structured explanations make social media moderators faster, arXiv preprint arXiv:2406.04106 (2024).
[8] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, A. Mukherjee, HateXplain: A benchmark dataset for explainable hate speech detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 14867–14875.
[9] M. ElSherief, C. Ziems, D. Muchlinski, V. Anupindi, J. Seybolt, M. De Choudhury, D. Yang, Latent hatred: A benchmark for understanding implicit hate speech, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 345–363.
[10] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, Y. Choi, Social bias frames: Reasoning about social and power implications of language, in: ACL, 2020.
[11] D. Nozza, A. T. Cignarella, G. Damo, T. Caselli, V. Patti, HODI at EVALITA 2023: Overview of the first shared task on homotransphobia detection in Italian, in: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2023, CEUR Workshop Proceedings (CEUR-WS.org), 2023.
[12] H. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 task 10: Explainable detection of online sexism, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 2193–2210.
[13] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[15] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large language models for news summarization, arXiv preprint arXiv:2301.13848 (2023).
[16] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, D. Yang, Can large language models transform computational social science?, arXiv preprint arXiv:2305.03514 (2023).
[17] S. Roy, A. Harshvardhan, A. Mukherjee, P. Saha, Probing LLMs for hate speech detection: strengths and vulnerabilities, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 6116–6128. URL: https://aclanthology.org/2023.findings-emnlp.407. doi:10.18653/v1/2023.findings-emnlp.407.
[18] Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. Yun, HARE: Explainable hate speech detection with step-by-step reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 5490–5505. URL: https://aclanthology.org/2023.findings-emnlp.365. doi:10.18653/v1/2023.findings-emnlp.365.
[19] F. Huang, H. Kwak, J. An, Chain of explanation: New prompting method to generate quality natural language explanation for implicit hate speech, in: Companion Proceedings of the ACM Web Conference 2023, WWW '23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, pp. 90–93. URL: https://doi.org/10.1145/3543873.3587320. doi:10.1145/3543873.3587320.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.
[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[22] H. Wang, M. S. Hee, M. R. Awal, K. T. W. Choo, R. K.-W. Lee, Evaluating GPT-3 generated explanations for hateful content moderation, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6255–6263.
[23] R. Bhardwaj, S. Poria, Red-teaming large language models using chain of utterances for safety-alignment, arXiv preprint arXiv:2308.09662 (2023).
[24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.
[25] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 15991–16111. URL: https://aclanthology.org/2023.acl-long.891. doi:10.18653/v1/2023.acl-long.891.
[26] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[27] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[28] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.
[29] K. Halevy, A group-specific approach to NLP for hate speech detection, arXiv preprint arXiv:2304.11223 (2023).
[30] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[31] A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, M. M. Khapra, Perturbation CheckLists for evaluating NLG evaluation metrics, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7219–7234. URL: https://aclanthology.org/2021.emnlp-main.575. doi:10.18653/v1/2021.emnlp-main.575.
[32] E. Reiter, A structured review of the validity of BLEU, Computational Linguistics 44 (2018) 393–401. URL: https://aclanthology.org/J18-3002. doi:10.1162/coli_a_00322.
[33] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser, Why we need new evaluation metrics for NLG, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2241–2252. URL: https://aclanthology.org/D17-1238. doi:10.18653/v1/D17-1238.
[34] A. B. Sai, A. K. Mohankumar, M. M. Khapra, A survey of evaluation metrics used for NLG systems, ACM Computing Surveys (CSUR) 55 (2022) 1–39.
[35] A. Celikyilmaz, E. Clark, J. Gao, Evaluation of text generation: A survey, arXiv preprint arXiv:2006.14799 (2020).
[36] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2019.
[37] A. Lavie, M. J. Denkowski, The METEOR metric for automatic evaluation of machine translation, Machine Translation 23 (2009) 105–115.
[38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google's neural machine translation system: Bridging the gap between human and machine translation, 2016. arXiv:1609.08144.
[39] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.
[40] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 392–395. URL: https://aclanthology.org/W15-3049. doi:10.18653/v1/W15-3049.
[41] M. Popović, chrF++: words helping character n-grams, in: Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 612–618. URL: https://aclanthology.org/W17-4770. doi:10.18653/v1/W17-4770.
[42] M. Post, A call for clarity in reporting BLEU scores, in: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, Belgium, Brussels, 2018, pp. 186–191. URL: https://www.aclweb.org/anthology/W18-6319.
[43] K. Krippendorff, Computing Krippendorff's alpha-reliability, 2011.

A. Prompt Details

Table 5 shows the two types of prompts we used in our experiments, following the template of the Stanford Alpaca project. The two categories differ in the 'context' that is passed in the knowledge-guided version, which contains the information extracted from the knowledge sources linked to the text. As described in Section 2.2 of the paper, we used the vanilla prompts for zero-shot learning, few-shot learning, and instruction fine-tuning, whereas we used the knowledge-guided prompts for knowledge-guided zero-shot learning and knowledge-guided instruction fine-tuning.

Table 5: Details of vanilla prompts and knowledge-guided prompts passed to the LLMs in our experiments.

Vanilla:
"Below is an instruction that describes a task, paired with input text. Write a response that appropriately completes the instruction.
Instruction: Classify the input text as list_of_labels, and provide an explanation.
Input text: text_to_classify.
Response:"

Knowledge-guided:
"Below is an instruction that describes a task, paired with context and input text. Write a response that appropriately completes the instruction based on the context.
Instruction: Classify the input text as list_of_labels, and provide an explanation.
Context: knowledge_source_linked.
Input text: text_to_classify.
Response:"

B. Survey Questions

Participants were presented with the questions shown in Table 6.

Table 6: List of questions asked to participants in our expert survey.

Before Treatment:
- "Which gender do you identify as?"
- "Are you an English native-speaker?"
- "What is your country of origin?"
- "What is your level of expertise on language models or abusive language?"
- "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?"
- "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?"

Treatment:
- "Do you think explanation 1 provides a good explanation given the text?"
- "If your answer was yes, does explanation 2 mean the same thing as explanation 1?"
- "If your answer was yes, does explanation 3 mean the same thing as explanation 1?"
- "If your answer was yes, does explanation 4 mean the same thing as explanation 1?"

After Treatment:
- "Having seen these explanations, how useful would you rate a system that provides you a textual explanation for its classification?"
- "Having seen these explanations, how trustworthy would you rate a system that provides you a textual explanation for its classification?"
- "What was the main error you noticed in these explanations?"
- "What do you think makes a textual explanation good?"
- "Do you have any comment you would like to share?"