<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <issn pub-type="ppub">1613-0073</issn>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Expert Survey on LLM-generated Explanations for Abusive Language Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Barbara McGillivray</string-name>
          <email>barbara.mcgillivray@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chiara Di Bonaventura</string-name>
          <email>chiara.di_bonaventura@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Lucia Siciliani</string-name>
          <email>lucia.siciliani@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Pierpaolo Basile</string-name>
          <email>pierpaolo.basile@uniba.it</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Albert Meroño-Peñuela</string-name>
          <email>albert.merono@kcl.ac.uk</email>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="editor">
          <string-name>Large Language Models, Hate Speech Detection, Explanation Generation, Human Evaluation</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science, University of Bari Aldo Moro</institution>
          ,
          <country country="IT">Italy</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Imperial College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>King's College London</institution>
          ,
          <addr-line>London</addr-line>
          ,
          <country country="UK">United Kingdom</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) the difficulty of reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Language Models</kwd>
        <kwd>Hate Speech Detection</kwd>
        <kwd>Explanation Generation</kwd>
        <kwd>Human Evaluation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>CEUR</p>
      <p>ceur-ws.org
1. Introduction
guage Processing (NLP) research on abusive language
[1] as increasing models’ complexity [2], models’
intrinsic bias [3], and international regulations [4] call for a
shift in perspective from performance-based models to
more transparent models. Moreover, recent studies have
shown the benefits of explanations for users [ 5, 6] and
content moderators [7] on social media platforms. The
former can benefit from receiving an explanation for why
a certain post has been flagged or removed whereas the
latter are shown to annotate toxic posts faster and solve
doubtful annotations thanks to explanations.</p>
    </sec>
    <sec id="sec-2">
      <title>Several eforts have moved towards explainable abu</title>
      <p>opment of datasets containing rationales (i.e., the tokens
sive language detection in the past years, like the devel- implications of these explanations remain understudied,
†Work partially funded by the Trustworthy AI Research award
received by The Alan Turing Institute and the the Italian Future AI
Research Foundation (FAIR).
nEvelop-O</p>
    </sec>
    <sec id="sec-3">
      <title>Language Models (LLMs) like FLAN-T5 [13] showing</title>
      <p>remarkable performance across tasks and human-like
text generation [14, 15, 16], recent studies have explored</p>
    </sec>
    <sec id="sec-4">
      <title>LLMs for explainable hate speech detection, wherein</title>
      <p>classification predictions are described through natural
language explanations [17, 18]. For instance, [19] used
chain-of-thought prompting [20] of LLMs to generate
explanations for implicit hate speech detection.</p>
      <p>However, most of these studies rely on empirical
metrics like BLEU [21] to evaluate the generated explanations
automatically. Consequently, the human perception and
as well as the extent to which empirical metrics
approximate human judgements. [22] recruited crowdworkers to
evaluate the level of hatefulness in tweets and the quality
of explanations generated by GPT-3. Instead, we
conduct an expert survey investigating four LLMs and five
learning strategies across multi-class abusive language
detection tasks to answer the following questions: RQ1:</p>
    </sec>
    <sec id="sec-5">
      <title>How well do LLM-generated explanations for abusive</title>
      <p>language detection match human expectations? RQ2:</p>
    </sec>
    <sec id="sec-6">
      <title>How well do empirical metrics align with human judge</title>
      <p>ments? RQ3: What makes LLM-generated explanations</p>
      <sec id="sec-6-1">
        <title>2. Experimental Setup</title>
        <sec id="sec-6-1-1">
          <title>Model</title>
        </sec>
        <sec id="sec-6-1-2">
          <title>Instruction</title>
        </sec>
        <sec id="sec-6-1-3">
          <title>Fine-tuned</title>
        </sec>
        <sec id="sec-6-1-4">
          <title>Toxicity</title>
        </sec>
        <sec id="sec-6-1-5">
          <title>Fine-tuned</title>
          <p>To answer these research questions, we design a
beforeand-after study, surveying participants about their prior
expectations about LLM-generated explanations and then
showing them examples generated by several LLMs with
diverse learning strategies1, followed by further inter- Table 2
views. To ensure robustness of our results, we recruited Summary of models used.
experts in the field, i.e., AI researchers, as described
below.</p>
          <p>FLAN-Alpaca
FLAN-T5
mT0
Llama-2</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>Learning strategies. As diferent prompting strate</title>
      <p>2.1. Data gies might yield diferent results, we test five distinct
learning strategies using the established Stanford Alpaca
For our experiments, we use the HateXplain [8] and the template2 (cf. Appendix A for prompt details):
Implicit Hate Corpus [9] as they encompass diferent lev- (1) zero-shot learning (zsl): we pass “Classify the
els of ofensiveness (i.e., hate speech, ofensive, neutral), input text as list_of_labels, and provide an
explaexpressiveness (i.e., explicit hate, implicit hate, neutral), nation” in the instruction field of the template. The
multiple targeted groups, and explanations for the hate- list_of_labels changes according to the dataset used;
ful label (Table 1). These datasets contain unstructured (2) few-shot learning (fsl): we pass three additional
explanations of the words that constitute abuse (in Hat- examples to the aforementioned template, which are
raneXplain) and the user’s intent (in Implicit Hate). In view domly sampled with equal probability among the labels
of previous research arguing the need for structured ex- to account for class imbalance in the datasets. We
experiplanations in hateful content moderation [1], we use the mented with diferent numbers of examples (i.e., passing
following template to create structured explanations, that one, three or five examples), and chose three as it was
we will use as ground-truth: “Explanation: it contains the the best strategy;
following hateful words (implied statement):” for abusive (3) knowledge-guided zero-shot learning (kg):
incontent in HateXplain (Implicit Hate Corpus), and “The stead of passing additional examples in the prompts, we
text does not contain abusive content.” for neutral content. add external knowledge retrieved by means of an entity
linker3, which first detects entities mentioned in the
inDataset Labels Target Explanation put text, and then retrieves the relevant information from
HateXplain onhfeeanutestrisvapele,ech, .bw..loamcke,n, lTeovkeeln- tehnecyecxltoeprendailckknnoowwlleeddggee,bKanseo.wWleedJues[e2W9] ifkoirdhaatate[s2p8e]efcohr
IHmaptelicit ienmxepupltliirccaiitlthhaattee,, .Jw.e.hwiste,s, Ismtaptelimedent
tcpeolmamtepmoworaintlhsleiannngseuaidksdntiioctiwoknlneaodlwgfieelled.dWcgaeellamendod‘dcCiofonynttechxeetp’tptNrooeamtc[cp3ot0u]tenfmtorfor this external knowledge;
TSuabmlmea1ry of datasets used. pro(4m)pitnssutsreudctinio(n1)fintoe-itnustnriuncgtio(nft) fine:-tuwneeuLsleamthae-2s;ame
(5) knowledge-guided instruction fine-tuning
2.2. Methodology (kg_ft) : we use the knowledge-guided prompts
developed in (3) to instruction fine-tune Llama-2.</p>
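        <p>To make the templating concrete, the following minimal Python sketch assembles ground-truth explanations from token-level rationales (HateXplain) or implied statements (Implicit Hate Corpus). The function name and placeholder values are illustrative assumptions, not the authors’ code.</p>
        <preformat>
# Minimal sketch (illustrative, not the authors' code): building the
# structured ground-truth explanations described in Section 2.1.

def build_ground_truth(label, rationale_tokens=None, implied_statement=None):
    """Assemble the structured ground-truth explanation for one example."""
    if label == "neutral":
        return "The text does not contain abusive content."
    if rationale_tokens is not None:
        # HateXplain: token-level rationales (the hateful words)
        return ("Explanation: it contains the following hateful words: "
                + ", ".join(rationale_tokens))
    # Implicit Hate Corpus: free-text implied statement
    return ("Explanation: it contains the following implied statement: "
            + implied_statement)

# Placeholder inputs (hypothetical):
print(build_ground_truth("hate speech", rationale_tokens=["SLUR_1", "SLUR_2"]))
print(build_ground_truth("implicit hate", implied_statement="IMPLIED_STEREOTYPE"))
print(build_ground_truth("neutral"))
        </preformat>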
    </sec>
    <sec id="sec-8">
      <title>We extensively investigate four popular LLMs across five learning strategies on their ability to detect multi-class ofensiveness and expressiveness of abusive language and to generate explanations for the classification.</title>
      <p>Models. We use diferent open-source LLMs (Table 2):
the base versions of FLAN-Alpaca [23, 24], FLAN-T5
[13], mT0 [25], and the 7B foundational model Llama 2
[26], which is an updated version of LlaMA [27].</p>
    </sec>
    <sec id="sec-9">
      <title>1The data containing the LLM-generated explanations are</title>
      <p>publicly available at https://github.com/ChiaraDiBonaventura/
is-explanation-all-you-need</p>
    </sec>
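        <p>As an illustration of strategies (1) and (3), the following Python sketch fills the Stanford Alpaca template. The “### Instruction/Input/Response” field markers follow the public Stanford Alpaca release; the exact placement of the ‘Context’ field and the helper names are assumptions, not the paper’s actual implementation.</p>
        <preformat>
# Illustrative sketch of strategies (1) zsl and (3) kg from Section 2.2,
# using the Stanford Alpaca prompt template (cf. Appendix A).

ALPACA_VANILLA = (
    "Below is an instruction that describes a task, paired with input text. "
    "Write a response that appropriately completes the instruction.\n\n"
    "### Instruction:\n{instruction}\n\n### Input:\n{input}\n\n### Response:\n"
)

# 'Context' field placement is an assumption for this sketch.
ALPACA_KG = (
    "Below is an instruction that describes a task, paired with context and "
    "input text. Write a response that appropriately completes the "
    "instruction based on the context.\n\n"
    "### Instruction:\n{instruction}\n\n### Context:\n{context}\n\n"
    "### Input:\n{input}\n\n### Response:\n"
)

def zsl_prompt(text, labels):
    # The list_of_labels changes according to the dataset used.
    instruction = ("Classify the input text as " + ", ".join(labels)
                   + ", and provide an explanation")
    return ALPACA_VANILLA.format(instruction=instruction, input=text)

def kg_prompt(text, labels, context):
    instruction = ("Classify the input text as " + ", ".join(labels)
                   + ", and provide an explanation")
    return ALPACA_KG.format(instruction=instruction, context=context, input=text)

# HateXplain label set; the Implicit Hate Corpus uses its own labels.
print(zsl_prompt("example post", ["hate speech", "offensive", "neutral"]))
print(kg_prompt("example post", ["hate speech", "offensive", "neutral"],
                context="Entity X: description retrieved from Wikidata/ConceptNet."))
        </preformat>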
    <sec id="sec-10">
      <title>Empirical eval metrics. We evaluate how closely the</title>
      <p>LLM-generated explanations match the ground-truth
across eight empirical similarity metrics due to the
challenge of simultaneously assessing a wide set of criteria
[31, 32, 33]. Following established NLG research [34, 35],
we choose BERTScore [36] and METEOR [37] for
semantic similarity. For syntactic similarity, we select
BLEU [21], GBLEU [38], ROUGE [39], ChrF [40] with
2https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#
data-release
3If available, we use the API provided by the knowledge source,
spaCy otherwise. https://spacy.io/
its derivates ChrF+ and ChrF++ [41, 42]. Additionally, of origin include Europe (60%), Asia (26.67%), Africa
we present an expert evaluation following our survey. (6.67%), and Latin America (6.67%).
2.3. Survey Design</p>
      <sec id="sec-10-1">
        <title>3. Results and Discussion</title>
      </sec>
    </sec>
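        <p>A minimal sketch of such scoring, assuming the Hugging Face evaluate package (which wraps BERTScore, METEOR, BLEU, and sacreBLEU’s chrF); this is an illustration, not the paper’s evaluation code, and the prediction/reference strings are placeholders.</p>
        <preformat>
# Illustrative scoring of an LLM-generated explanation against the
# ground-truth, assuming the Hugging Face `evaluate` package.
import evaluate

preds = ["Explanation: it contains the following hateful words: SLUR_1"]
refs = ["Explanation: it contains the following hateful words: SLUR_1, SLUR_2"]

bertscore = evaluate.load("bertscore")  # semantic similarity
meteor = evaluate.load("meteor")        # semantic similarity
bleu = evaluate.load("bleu")            # syntactic similarity
chrf = evaluate.load("chrf")            # character n-gram F-score

results = {
    "bertscore_f1": bertscore.compute(predictions=preds, references=refs,
                                      lang="en")["f1"][0],
    "meteor": meteor.compute(predictions=preds, references=refs)["meteor"],
    "bleu": bleu.compute(predictions=preds,
                         references=[[r] for r in refs])["bleu"],
    # word_order=2 corresponds to chrF++; the default (0) is plain chrF
    "chrf++": chrf.compute(predictions=preds,
                           references=[[r] for r in refs],
                           word_order=2)["score"],
}
print(results)
        </preformat>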
    <sec id="sec-11">
      <title>To evaluate how well LLMs align with human expec</title>
      <p>tations and judgements in explanation generation, we
design a before-and-after study as follows.</p>
    </sec>
    <sec id="sec-12">
      <title>Our 15 participants reach a fair agreement, with Krip</title>
      <p>pendorf’s alpha [ 43] equal to 38.43%.</p>
      <p>Fig. 1 shows changes in the relative frequencies of
participant scores in the usefulness and trustworthiness
Before treatment. We ask for participant’s back- of explanations before and after treatment. Participants’
ground information, e.g., gender identity, native language responses before treatment have expectations of textual
and how they would rate the usefulness and trustwor- explanations for classifications of being “highly useful”
thiness of a language model for explanation generation. (above 50%; highest possible score) in terms of usefulness,
Specifically, we ask “How useful would you rate a system and “moderately trustworthy” or “neutral” (above 40%;
that provides you a textual explanation for its classifica- second and third best possible score) in terms of
trusttion with respect to receiving only its classification?” and worthiness. However, scores for after treatment show
“How trustworthy would you rate a system that provides participants changing their usefulness scores towards
you a textual explanation for its classification with re- “moderately unuseful” (40-50%; second worst possible
spect to receiving only its classification?” on a 1-5 Likert score) and their trustworthiness scores to “highly
untrustscale. worthy” (above 30%; worst possible score). Agreement
difers in each category: usefulness is much more
conTreatment. As for the treatment, we show participants sensual, whereas trustworthiness is judged with higher
a sample of 70 texts from the datasets, paired with up to variance. In general, LLM-generated explanations do not
four diferent explanations. Specifically, given a text and meet human expectations in terms of usefulness and
trustground-truth explanation, participants are asked if the worthiness. Specifically, exposing participants to these
text is correctly explained. If yes, they are asked to rate explanations leads to an average percentage decrease of
three diferent LLM-generated explanations with respect 47.78% and 64.32% in the perception of the usefulness and
to the ground-truth on a 1-3 scale. These explanations trustworthiness of explanations, respectively.
are randomly sampled among the four LLMs and five Fig. 2 shows the scores of all empirical metrics and
learning strategies discussed in Section 2.2. expert evaluation for all models on explanation
generation. Overall, similarity metrics tend to be highly volatile
After treatment. Finally, we ask participants’ opinion with respect to each other. For instance, FLAN-Alpaca
on the usefulness and trustworthiness of explanation prompted with zero-shot learning (i.e., ‘alpaca_zsl’ in
generation, having seen the LLM-generated explanations. the figure) generates explanations that are more than
In addition, we ask general opinions related to what 70% semantically similar to the ground-truth
explanatype of errors they observed most frequently, and what a tions according to BERTScore while being less than 20%
good explanation would look like. semantically similar according to METEOR. Similarly
for syntax: BLEU and GBLEU similarity scores are less
The full list of questions is in the Appendix B. than 3% whereas ROUGE and chrF/+/++ are in the range
The institutional ethical board of the first author’s 9%-21%. Moreover, we observe that BERTScore has a
university approved our study design. We distributed tendency to over-score explanations compared to human
the survey through channels that allow us to target evaluation scores. Contrarily, METEOR, BLEU, GBLEU,
individuals working in AI who are familiar with the field ROUGE and chrF/+/++ have a tendency to under-score
of language models and/or AI Ethics, including NLP explanations. Instruction fine-tuning helped all metrics
reading groups and AI Ethics interest groups. To ensure to approximate expert evaluations better, especially when
the reliability of our before-and-after study, participants tuned on knowledge-guided prompts. We use the
Spearwere given 1 hour to complete as many answers as they man’s rank correlation coeficient to compare the
corcould. We collected answers from 15 participants, of relation between human scores and those provided by
which 33% (67%) identify as female (male), and 33% (67%) all the other metrics. In detail, we rank the models for
are (non) English native-speakers. The average level each type of metric, and then we compute the Spearman
of participants’ expertise in abusive language research correlation between the rank obtained by human scores
is 2.47 out of 5 (self-described)4, and their continents and those obtained by other metrics. Table 3 reports all
the correlation scores. We observe that BERTScore is
4The list of levels to choose from was: 1=Novice, 2=Advanced be- the most correlated with humans in both tasks. Also,
ginner, 3=Competent, 4=Proficient, 5=Expert.
chrF/+/++ metrics are highly correlated with humans
while all the other metrics based on syntactic matches
are slightly correlated with humans. Results show that
semantic metrics are more similar to how humans
evaluate the quality of the explanation generated by LLMs.</p>
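      <p>A minimal sketch of this ranking comparison, assuming scipy; the score arrays are hypothetical per-configuration averages, not the paper’s data.</p>
      <preformat>
# Illustrative Spearman rank correlation between expert scores and an
# empirical metric, with one score per model/strategy configuration.
from scipy.stats import spearmanr

# Hypothetical per-configuration averages (e.g., alpaca_zsl, alpaca_fsl, ...)
human_scores = [2.1, 1.8, 2.6, 1.5, 2.9]
bertscore_f1 = [0.78, 0.74, 0.83, 0.71, 0.86]

# spearmanr ranks both lists internally, so correlating the raw scores is
# equivalent to ranking the models by each metric and correlating the ranks.
rho, p_value = spearmanr(human_scores, bertscore_f1)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
      </preformat>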
      <p>Only one metric (ROUGE) shows a different behaviour between the two tasks.</p>
      <p>Since 38.55% of the ground-truth explanations were not rated as good explanations by participants, we further investigated what the most common errors are and what makes an explanation good. Table 4 reports the most common error categories reported by participants. Most of them are related to logical fallacies (e.g., contradictory statements, hallucination), especially in the context of sarcasm and self-deprecating humour, rather than linguistic errors (e.g., grammar, misspellings). It is worth noticing that 13.33% of the participants reported that LLM-generated explanations contain cultural bias (e.g., stereotypes), with the implication of potentially perpetuating harms against the targeted victims of abusive language. As for desiderata, 73.33% of participants would like to receive textual explanations that are coherent with human reasoning and understanding, i.e., that are relevant and exhaustive with respect to the text they refer to while being logically and linguistically correct. A remaining 20% thinks that a good explanation must be coherent with model reasoning instead. In other words, participants are much more concerned about how the explanation looks than about its reflection of the inner mechanism of the model’s reasoning. To quote a participant’s perspective, “I would want the explanation to be helpful to me and guide my own reasoning”.</p>
      <sec id="sec-12-1">
        <title>4. Conclusion</title>
        <p>vs. syntactic), and therefore pointing at the need of more
reliable metrics for the empirical evaluation of textual
exIn this paper, we conducted a before-and-after study to planations. In general, BERTScore and METEOR metrics
understand human expectations and judgements of LLM- exhibit the strongest correlation with human judgements.
generated explanations for multi-class abusive language Lastly, our study provides evidence of the desiderata for
detection tasks. Contrarily to previous research [22], we LLM-generated explanations, suggesting that
explanainvestigated multiple LLMs and learning techniques, and tions should be coherent with human reasoning rather
we surveyed AI experts who are familiar with abusive than model reasoning. Participants value the most
texlanguage research instead of crowdworkers. We found tual explanations that are relevant and exhaustive to the
that human expectations in terms of usefulness and trust- text they refer to, while being logically and
linguistiworthiness of LLM-generated explanations are not met: cally correct. Justifications for this preference lie on the
after seeing these explanations, the usefulness and trust- fact that abusive language detection heavily relies on
worthiness ratings decrease by 47.78% and 64.32%, re- additional context and knowledge about slang and slurs,
spectively. Secondly, our results show that empirical for which receiving an explanation is helpful to
particmetrics commonly used to evaluate textual explanations ipants’ understanding of the text. Future work should
are highly volatile with respect to each other, even when investigate whether this preference holds for other
dothey measure the same type of similarity (i.e., semantic mains as well. In light of our findings, we conclude with</p>
      </sec>
      <sec id="sec-12-2">
        <title>Acknowledgments</title>
      </sec>
    </sec>
    <sec id="sec-13">
      <title>This work was supported by the UK Research and Inno</title>
      <p>vation [grant number EP/S023356/1] in the UKRI Centre
for Doctoral Training in Safe and Trusted Artificial
Intelligence (www.safeandtrustedai.org); by the Trustworthy
AI Research award by The Alan Turing Institute,
supported by the British Embassy Rome and the UK Science
&amp; Innovation Network; and by the PNRR project FAIR
Future AI Research (PE00000013), Spoke 6 - Symbiotic AI
(CUP H97G22000210007) under the NRRP MUR program
funded by the NextGenerationEU.
three recommendations to use LLMs responsibly for
explainable abusive language detection: (1) be aware of the
cultural bias these models might exhibit when generating
free-text explanations, which can further harm targeted
groups; (2) if possible, instruction fine-tune LLMs for
explanation generation of abusive language detection.</p>
      <p>This not only could ensure the generation of structured
explanations as advised by previous research [1] but it
also returns the highest evaluation scores, both
empirically and expert-wise, when using knowledge-guided
prompts; (3) opt for a combination of empirical metrics to
evaluate textual explanations when no human evaluation
is possible, since no particular empirical metric seems to
generalise across diferent learning techniques, models
and datasets, making the ground-truth lie somewhere
in between BERTScore (upper bound) and BLEU (lower
bound).
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Ka- guage models using chain of utterances for
safetyplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sas- alignment, arXiv preprint arXiv:2308.09662 (2023).
try, A. Askell, et al., Language models are few-shot [24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li,
learners, Advances in neural information process- C. Guestrin, P. Liang, T. B. Hashimoto, Stanford
aling systems 33 (2020) 1877–1901. paca: An instruction-following llama model, https:
[15] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKe- //github.com/tatsu-lab/stanford_alpaca, 2023.
own, T. B. Hashimoto, Benchmarking large lan- [25] N. Muennighof, T. Wang, L. Sutawika, A. Roberts,
guage models for news summarization, arXiv S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X.
preprint arXiv:2301.13848 (2023). Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji,
[16] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, K. Almubarak, S. Albanie, Z. Alyafeai, A.
WebD. Yang, Can large language models transform son, E. Raf, C. Rafel, Crosslingual
generalizacomputational social science?, arXiv preprint tion through multitask finetuning, in: A. Rogers,
arXiv:2305.03514 (2023). J. Boyd-Graber, N. Okazaki (Eds.), Proceedings
[17] S. Roy, A. Harshvardhan, A. Mukherjee, P. Saha, of the 61st Annual Meeting of the Association
Probing LLMs for hate speech detection: strengths for Computational Linguistics (Volume 1: Long
and vulnerabilities, in: H. Bouamor, J. Pino, Papers), Association for Computational
LinguisK. Bali (Eds.), Findings of the Association for tics, Toronto, Canada, 2023, pp. 15991–16111. URL:
Computational Linguistics: EMNLP 2023, Asso- https://aclanthology.org/2023.acl-long.891. doi:10.
ciation for Computational Linguistics, Singapore, 18653/v1/2023.acl- long.891.
2023, pp. 6116–6128. URL: https://aclanthology.org/ [26] H. Touvron, L. Martin, K. Stone, P. Albert, A.
Alma2023.findings-emnlp.407. doi:10.18653/v1/2023. hairi, Y. Babaei, N. Bashlykov, S. Batra, P.
Bharfindings- emnlp.407. gava, S. Bhosale, et al., Llama 2: Open
founda[18] Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. tion and fine-tuned chat models, arXiv preprint
Yun, HARE: Explainable hate speech detection arXiv:2307.09288 (2023).
with step-by-step reasoning, in: H. Bouamor, [27] H. Touvron, T. Lavril, G. Izacard, X. Martinet,
J. Pino, K. Bali (Eds.), Findings of the Association M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal,
for Computational Linguistics: EMNLP 2023, Asso- E. Hambro, F. Azhar, et al., Llama: Open and
eficiation for Computational Linguistics, Singapore, cient foundation language models, arXiv preprint
2023, pp. 5490–5505. URL: https://aclanthology.org/ arXiv:2302.13971 (2023).
2023.findings-emnlp.365. doi:10.18653/v1/2023. [28] D. Vrandečić, M. Krötzsch, Wikidata: a free
colfindings- emnlp.365. laborative knowledgebase, Communications of the
[19] F. Huang, H. Kwak, J. An, Chain of explana- ACM 57 (2014) 78–85.</p>
      <p>tion: New prompting method to generate qual- [29] K. Halevy, A group-specific approach to nlp for hate
ity natural language explanation for implicit hate speech detection, arXiv preprint arXiv:2304.11223
speech, in: Companion Proceedings of the ACM (2023).</p>
      <p>Web Conference 2023, WWW ’23 Companion, As- [30] R. Speer, J. Chin, C. Havasi, Conceptnet 5.5: An
sociation for Computing Machinery, New York, NY, open multilingual graph of general knowledge, in:
USA, 2023, p. 90–93. URL: https://doi.org/10.1145/ Proceedings of the AAAI conference on artificial
3543873.3587320. doi:10.1145/3543873.3587320. intelligence, volume 31, 2017.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, [31] A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, M. M.</p>
      <p>E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought Khapra, Perturbation CheckLists for evaluating
prompting elicits reasoning in large language mod- NLG evaluation metrics, in: M.-F. Moens, X. Huang,
els, Advances in neural information processing L. Specia, S. W.-t. Yih (Eds.), Proceedings of the
systems 35 (2022) 24824–24837. 2021 Conference on Empirical Methods in
Natu[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a ral Language Processing, Association for
Compumethod for automatic evaluation of machine trans- tational Linguistics, Online and Punta Cana,
Dolation, in: Proceedings of the 40th annual meeting minican Republic, 2021, pp. 7219–7234. URL: https:
of the Association for Computational Linguistics, //aclanthology.org/2021.emnlp-main.575. doi:10.
2002, pp. 311–318. 18653/v1/2021.emnlp- main.575.
[22] H. Wang, M. S. Hee, M. R. Awal, K. T. W. Choo, R. K.- [32] E. Reiter, A structured review of the validity
W. Lee, Evaluating gpt-3 generated explanations of BLEU, Computational Linguistics 44 (2018)
for hateful content moderation, in: Proceedings of 393–401. URL: https://aclanthology.org/J18-3002.
the Thirty-Second International Joint Conference doi:10.1162/coli_a_00322.</p>
      <p>on Artificial Intelligence, 2023, pp. 6255–6263. [33] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser,
[23] R. Bhardwaj, S. Poria, Red-teaming large lan- Why we need new evaluation metrics for NLG,</p>
      <sec id="sec-13-1">
        <title>A. Prompt Details</title>
      <p>Vanilla template: “Below is an instruction that describes a task, paired with input text. Write a response that appropriately completes the instruction.”</p>
      <p>Knowledge-guided template: “Below is an instruction that describes a task, paired with context and input text. Write a response that appropriately completes the instruction based on the context.”</p>
        <sec id="sec-13-1-1">
          <title>Questions</title>
      <p>Before Treatment:</p>
      <list list-type="bullet">
        <list-item><p>“Which gender do you identify as?”</p></list-item>
        <list-item><p>“Are you an English native-speaker?”</p></list-item>
        <list-item><p>“What is your country of origin?”</p></list-item>
        <list-item><p>“What is your level of expertise on language models or abusive language?”</p></list-item>
        <list-item><p>“How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?”</p></list-item>
        <list-item><p>“How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?”</p></list-item>
      </list>
      <p>Treatment:</p>
      <list list-type="bullet">
        <list-item><p>“Do you think explanation 1 provides a good explanation given the text?”</p></list-item>
        <list-item><p>“If your answer was yes, does explanation 2 mean the same thing as explanation 1?”</p></list-item>
        <list-item><p>“If your answer was yes, does explanation 3 mean the same thing as explanation 1?”</p></list-item>
        <list-item><p>“If your answer was yes, does explanation 4 mean the same thing as explanation 1?”</p></list-item>
      </list>
      <p>After Treatment:</p>
      <list list-type="bullet">
        <list-item><p>“Having seen these explanations, how useful would you rate a system that provides you a textual explanation for its classification?”</p></list-item>
        <list-item><p>“Having seen these explanations, how trustworthy would you rate a system that provides you a textual explanation for its classification?”</p></list-item>
        <list-item><p>“What was the main error you noticed in these explanations?”</p></list-item>
        <list-item><p>“What do you think makes a textual explanation good?”</p></list-item>
        <list-item><p>“Do you have any comment you would like to share?”</p></list-item>
      </list>
        </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref14"><mixed-citation>[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.</mixed-citation></ref>
      <ref id="ref15"><mixed-citation>[15] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large language models for news summarization, arXiv preprint arXiv:2301.13848 (2023).</mixed-citation></ref>
      <ref id="ref16"><mixed-citation>[16] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, D. Yang, Can large language models transform computational social science?, arXiv preprint arXiv:2305.03514 (2023).</mixed-citation></ref>
      <ref id="ref17"><mixed-citation>[17] S. Roy, A. Harshvardhan, A. Mukherjee, P. Saha, Probing LLMs for hate speech detection: strengths and vulnerabilities, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 6116–6128. URL: https://aclanthology.org/2023.findings-emnlp.407. doi:10.18653/v1/2023.findings-emnlp.407.</mixed-citation></ref>
      <ref id="ref18"><mixed-citation>[18] Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. Yun, HARE: Explainable hate speech detection with step-by-step reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 5490–5505. URL: https://aclanthology.org/2023.findings-emnlp.365. doi:10.18653/v1/2023.findings-emnlp.365.</mixed-citation></ref>
      <ref id="ref19"><mixed-citation>[19] F. Huang, H. Kwak, J. An, Chain of explanation: New prompting method to generate quality natural language explanation for implicit hate speech, in: Companion Proceedings of the ACM Web Conference 2023, WWW ’23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, pp. 90–93. URL: https://doi.org/10.1145/3543873.3587320. doi:10.1145/3543873.3587320.</mixed-citation></ref>
      <ref id="ref20"><mixed-citation>[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.</mixed-citation></ref>
      <ref id="ref21"><mixed-citation>[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.</mixed-citation></ref>
      <ref id="ref22"><mixed-citation>[22] H. Wang, M. S. Hee, M. R. Awal, K. T. W. Choo, R. K.-W. Lee, Evaluating GPT-3 generated explanations for hateful content moderation, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6255–6263.</mixed-citation></ref>
      <ref id="ref23"><mixed-citation>[23] R. Bhardwaj, S. Poria, Red-teaming large language models using chain of utterances for safety-alignment, arXiv preprint arXiv:2308.09662 (2023).</mixed-citation></ref>
      <ref id="ref24"><mixed-citation>[24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.</mixed-citation></ref>
      <ref id="ref25"><mixed-citation>[25] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 15991–16111. URL: https://aclanthology.org/2023.acl-long.891. doi:10.18653/v1/2023.acl-long.891.</mixed-citation></ref>
      <ref id="ref26"><mixed-citation>[26] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).</mixed-citation></ref>
      <ref id="ref27"><mixed-citation>[27] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).</mixed-citation></ref>
      <ref id="ref28"><mixed-citation>[28] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.</mixed-citation></ref>
      <ref id="ref29"><mixed-citation>[29] K. Halevy, A group-specific approach to NLP for hate speech detection, arXiv preprint arXiv:2304.11223 (2023).</mixed-citation></ref>
      <ref id="ref30"><mixed-citation>[30] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.</mixed-citation></ref>
      <ref id="ref31"><mixed-citation>[31] A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, M. M. Khapra, Perturbation CheckLists for evaluating NLG evaluation metrics, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7219–7234. URL: https://aclanthology.org/2021.emnlp-main.575. doi:10.18653/v1/2021.emnlp-main.575.</mixed-citation></ref>
      <ref id="ref32"><mixed-citation>[32] E. Reiter, A structured review of the validity of BLEU, Computational Linguistics 44 (2018) 393–401. URL: https://aclanthology.org/J18-3002. doi:10.1162/coli_a_00322.</mixed-citation></ref>
      <ref id="ref33"><mixed-citation>[33] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser, Why we need new evaluation metrics for NLG, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017.</mixed-citation></ref>
    </ref-list>
  </back>
</article>