Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection

Chiara Di Bonaventura 1,2,∗,†, Lucia Siciliani 3, Pierpaolo Basile 3, Albert Meroño-Peñuela 1 and Barbara McGillivray 1

1 King's College London, London, United Kingdom
2 Imperial College London, London, United Kingdom
3 Department of Computer Science, University of Bari Aldo Moro, Italy

CLiC-it 2024: Tenth Italian Conference on Computational Linguistics, Dec 04-06, 2024, Pisa, Italy
∗ Corresponding author.
† Work partially funded by the Trustworthy AI Research award received by The Alan Turing Institute and the Italian Future AI Research Foundation (FAIR).
chiara.di_bonaventura@kcl.ac.uk (C. Di Bonaventura); lucia.siciliani@uniba.it (L. Siciliani); pierpaolo.basile@uniba.it (P. Basile); albert.merono@kcl.ac.uk (A. Meroño-Peñuela); barbara.mcgillivray@kcl.ac.uk (B. McGillivray)
ORCID: 0000-0002-1438-280X (L. Siciliani); 0000-0002-0545-1105 (P. Basile); 0000-0003-4646-5842 (A. Meroño-Peñuela)

Abstract
Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) difficulty in reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.

Keywords
Large Language Models, Hate Speech Detection, Explanation Generation, Human Evaluation

1. Introduction

Explainability is a crucial open challenge in Natural Language Processing (NLP) research on abusive language [1], as increasing models' complexity [2], models' intrinsic bias [3], and international regulations [4] call for a shift in perspective from performance-based models to more transparent models. Moreover, recent studies have shown the benefits of explanations for users [5, 6] and content moderators [7] on social media platforms. The former can benefit from receiving an explanation for why a certain post has been flagged or removed, whereas the latter are shown to annotate toxic posts faster and solve doubtful annotations thanks to explanations.

Several efforts have moved towards explainable abusive language detection in the past years, like the development of datasets containing rationales (i.e., the tokens in the text that suggest why the text is hateful) [8] or implied statements (i.e., descriptions of the implied meaning of the text) [9, 10], and shared tasks on explainable hate speech detection [11, 12], inter alia. With Large Language Models (LLMs) like FLAN-T5 [13] showing remarkable performance across tasks and human-like text generation [14, 15, 16], recent studies have explored LLMs for explainable hate speech detection, wherein classification predictions are described through natural language explanations [17, 18]. For instance, [19] used chain-of-thought prompting [20] of LLMs to generate explanations for implicit hate speech detection.

However, most of these studies rely on empirical metrics like BLEU [21] to evaluate the generated explanations automatically. Consequently, the human perception and implications of these explanations remain understudied, as well as the extent to which empirical metrics approximate human judgements.
[22] recruited crowdworkers to evaluate the level of hatefulness in tweets and the quality of explanations generated by GPT-3. Instead, we conduct an expert survey investigating four LLMs and five learning strategies across multi-class abusive language detection tasks to answer the following questions: RQ1: How well do LLM-generated explanations for abusive language detection match human expectations? RQ2: How well do empirical metrics align with human judgements? RQ3: What makes LLM-generated explanations good, according to experts?

2. Experimental Setup

To answer these research questions, we design a before-and-after study, surveying participants about their prior expectations about LLM-generated explanations, then showing them examples generated by several LLMs with diverse learning strategies (the data containing the LLM-generated explanations are publicly available at https://github.com/ChiaraDiBonaventura/is-explanation-all-you-need), followed by further interviews. To ensure robustness of our results, we recruited experts in the field, i.e., AI researchers, as described below.

2.1. Data

For our experiments, we use HateXplain [8] and the Implicit Hate Corpus [9] as they encompass different levels of offensiveness (i.e., hate speech, offensive, neutral), expressiveness (i.e., explicit hate, implicit hate, neutral), multiple targeted groups, and explanations for the hateful label (Table 1). These datasets contain unstructured explanations of the words that constitute abuse (in HateXplain) and of the user's intent (in Implicit Hate). In view of previous research arguing the need for structured explanations in hateful content moderation [1], we use the following template to create structured explanations, which we will use as ground truth: "Explanation: it contains the following hateful words (implied statement):" for abusive content in HateXplain (Implicit Hate Corpus), and "The text does not contain abusive content." for neutral content.

Table 1: Summary of datasets used.
Dataset       | Labels                                | Target            | Explanation
HateXplain    | hate speech, offensive, neutral       | women, black, ... | Token-level
Implicit Hate | implicit hate, explicit hate, neutral | Jews, whites, ... | Implied statement
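For concreteness, the construction of these ground-truth explanations can be pictured as a small preprocessing step. The following is a minimal sketch, not the authors' released code; it assumes each annotated instance exposes a label plus either rationale tokens (HateXplain) or an implied statement (Implicit Hate Corpus), and the field names are illustrative assumptions.

```python
# Minimal sketch of building the structured ground-truth explanations of Section 2.1.
# Field names ("label", "rationale_tokens", "implied_statement") are assumptions made
# for illustration, not the datasets' actual schemas.

def build_ground_truth(example: dict, dataset: str) -> str:
    """Return the structured explanation for one annotated instance."""
    if example["label"] == "neutral":
        return "The text does not contain abusive content."
    if dataset == "hatexplain":
        # HateXplain provides token-level rationales for the hateful/offensive label.
        hateful_words = ", ".join(example["rationale_tokens"])
        return f"Explanation: it contains the following hateful words: {hateful_words}"
    # Implicit Hate Corpus provides a free-text implied statement instead.
    return f"Explanation: it contains the following implied statement: {example['implied_statement']}"


print(build_ground_truth({"label": "hate speech", "rationale_tokens": ["<slur>", "<slur>"]},
                         "hatexplain"))
```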
2.2. Methodology

We extensively investigate four popular LLMs across five learning strategies on their ability to detect multi-class offensiveness and expressiveness of abusive language and to generate explanations for the classification.

Models. We use different open-source LLMs (Table 2): the base versions of FLAN-Alpaca [23, 24], FLAN-T5 [13], mT0 [25], and the 7B foundational model Llama 2 [26], which is an updated version of LLaMA [27].

Table 2: Summary of models used.
Model       | Instruction Fine-tuned | Toxicity Fine-tuned
FLAN-Alpaca | ✓                      | ✓
FLAN-T5     | ✓                      | ✓
mT0         | ✓                      | -
Llama-2     | -                      | -

Learning strategies. As different prompting strategies might yield different results, we test five distinct learning strategies using the established Stanford Alpaca template (https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file#data-release; cf. Appendix A for prompt details). A sketch of how these prompts can be assembled follows the list below:

(1) zero-shot learning (zsl): we pass "Classify the input text as list_of_labels, and provide an explanation" in the instruction field of the template. The list_of_labels changes according to the dataset used;

(2) few-shot learning (fsl): we pass three additional examples to the aforementioned template, which are randomly sampled with equal probability among the labels to account for class imbalance in the datasets. We experimented with different numbers of examples (i.e., passing one, three or five examples), and chose three as it was the best strategy;

(3) knowledge-guided zero-shot learning (kg): instead of passing additional examples in the prompts, we add external knowledge retrieved by means of an entity linker (if available, we use the API provided by the knowledge source, spaCy otherwise: https://spacy.io/), which first detects entities mentioned in the input text, and then retrieves the relevant information from the external knowledge base. We use Wikidata [28] for encyclopedic knowledge, KnowledJe [29] for hate speech temporal linguistic knowledge, and ConceptNet [30] for commonsense knowledge. We modify the prompt template with an additional field called 'context' to account for this external knowledge;

(4) instruction fine-tuning (ft): we use the same prompts used in (1) to instruction fine-tune Llama-2;

(5) knowledge-guided instruction fine-tuning (kg_ft): we use the knowledge-guided prompts developed in (3) to instruction fine-tune Llama-2.
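To make the five strategies concrete, the sketch below shows how the vanilla and knowledge-guided Alpaca-style prompts (reported verbatim in Table 5, Appendix A) could be assembled. The helper function and the way the context is passed in are illustrative assumptions, not the authors' implementation.

```python
# Sketch of the Alpaca-style prompts used by the learning strategies above.
# The template strings mirror Table 5 (Appendix A); the helper itself is illustrative.

VANILLA = (
    "Below is an instruction that describes a task, paired with input text. "
    "Write a response that appropriately completes the instruction.\n"
    "Instruction: Classify the input text as {labels}, and provide an explanation.\n"
    "Input text: {text}.\n"
    "Response:"
)

KNOWLEDGE_GUIDED = (
    "Below is an instruction that describes a task, paired with context and input text. "
    "Write a response that appropriately completes the instruction based on the context.\n"
    "Instruction: Classify the input text as {labels}, and provide an explanation.\n"
    "Context: {context}.\n"
    "Input text: {text}.\n"
    "Response:"
)


def build_prompt(text: str, labels: list[str], context: str | None = None) -> str:
    """Fill the vanilla template (zsl, fsl, ft) or the knowledge-guided one (kg, kg_ft)
    when external knowledge retrieved via an entity linker is available."""
    label_str = ", ".join(labels)
    if context is None:
        return VANILLA.format(labels=label_str, text=text)
    return KNOWLEDGE_GUIDED.format(labels=label_str, text=text, context=context)


# Example: a zero-shot prompt with HateXplain's label set.
print(build_prompt("example input post", ["hate speech", "offensive", "neutral"]))
```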
Empirical eval metrics. We evaluate how closely the LLM-generated explanations match the ground truth across eight empirical similarity metrics, due to the challenge of simultaneously assessing a wide set of criteria [31, 32, 33]. Following established NLG research [34, 35], we choose BERTScore [36] and METEOR [37] for semantic similarity. For syntactic similarity, we select BLEU [21], GBLEU [38], ROUGE [39], and chrF [40] with its derivatives chrF+ and chrF++ [41, 42]. Additionally, we present an expert evaluation following our survey.
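The paper does not state which implementations back these metrics. As a hedged illustration only, all eight scores could be computed with the Hugging Face evaluate library (an assumption about tooling, not the authors' setup):

```python
# Illustrative scoring of generated explanations against the structured ground truth,
# using the Hugging Face `evaluate` library (assumed tooling, not the authors' setup).
import evaluate

preds = ["Explanation: it contains the following hateful words: ..."]
refs = ["Explanation: it contains the following hateful words: ..."]

bertscore = evaluate.load("bertscore").compute(predictions=preds, references=refs, lang="en")
meteor = evaluate.load("meteor").compute(predictions=preds, references=refs)
bleu = evaluate.load("bleu").compute(predictions=preds, references=refs)
gbleu = evaluate.load("google_bleu").compute(predictions=preds, references=refs)  # GBLEU
rouge = evaluate.load("rouge").compute(predictions=preds, references=refs)
chrf = evaluate.load("chrf").compute(predictions=preds, references=refs)  # chrF
chrf_pp = evaluate.load("chrf").compute(predictions=preds, references=refs, word_order=2)  # chrF++

print(bertscore["f1"][0], meteor["meteor"], bleu["bleu"], rouge["rougeL"], chrf["score"])
```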
2.3. Survey Design

To evaluate how well LLMs align with human expectations and judgements in explanation generation, we design a before-and-after study as follows.

Before treatment. We ask for participants' background information, e.g., gender identity, native language, and how they would rate the usefulness and trustworthiness of a language model for explanation generation. Specifically, we ask "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" and "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" on a 1-5 Likert scale.

Treatment. As for the treatment, we show participants a sample of 70 texts from the datasets, paired with up to four different explanations. Specifically, given a text and its ground-truth explanation, participants are asked if the text is correctly explained. If yes, they are asked to rate three different LLM-generated explanations with respect to the ground truth on a 1-3 scale. These explanations are randomly sampled among the four LLMs and five learning strategies discussed in Section 2.2.

After treatment. Finally, we ask participants' opinion on the usefulness and trustworthiness of explanation generation, having seen the LLM-generated explanations. In addition, we ask general opinions related to what type of errors they observed most frequently, and what a good explanation would look like.

The full list of questions is in Appendix B.

The institutional ethical board of the first author's university approved our study design. We distributed the survey through channels that allow us to target individuals working in AI who are familiar with the field of language models and/or AI Ethics, including NLP reading groups and AI Ethics interest groups. To ensure the reliability of our before-and-after study, participants were given 1 hour to complete as many answers as they could. We collected answers from 15 participants, of which 33% (67%) identify as female (male), and 33% (67%) are (non) English native-speakers. The average level of participants' expertise in abusive language research is 2.47 out of 5 (self-described; levels: 1=Novice, 2=Advanced beginner, 3=Competent, 4=Proficient, 5=Expert), and their continents of origin include Europe (60%), Asia (26.67%), Africa (6.67%), and Latin America (6.67%).

3. Results and Discussion

Our 15 participants reach a fair agreement, with Krippendorff's alpha [43] equal to 38.43%.

Fig. 1 shows changes in the relative frequencies of participant scores in the usefulness and trustworthiness of explanations before and after treatment. Before treatment, participants expect textual explanations for classifications to be "highly useful" (above 50%; highest possible score) in terms of usefulness, and "moderately trustworthy" or "neutral" (above 40%; second and third best possible scores) in terms of trustworthiness. However, scores after treatment show participants changing their usefulness scores towards "moderately unuseful" (40-50%; second worst possible score) and their trustworthiness scores to "highly untrustworthy" (above 30%; worst possible score). Agreement differs in each category: usefulness is much more consensual, whereas trustworthiness is judged with higher variance. In general, LLM-generated explanations do not meet human expectations in terms of usefulness and trustworthiness. Specifically, exposing participants to these explanations leads to an average percentage decrease of 47.78% and 64.32% in the perceived usefulness and trustworthiness of explanations, respectively.

Figure 1: Relative frequencies of Likert scores before and after treatment on usefulness and trustworthiness of LLMs for explanation generation in abusive language detection.

Fig. 2 shows the scores of all empirical metrics and of the expert evaluation for all models on explanation generation. Overall, similarity metrics tend to be highly volatile with respect to each other. For instance, FLAN-Alpaca prompted with zero-shot learning (i.e., 'alpaca_zsl' in the figure) generates explanations that are more than 70% semantically similar to the ground-truth explanations according to BERTScore, while being less than 20% semantically similar according to METEOR. Similarly for syntax: BLEU and GBLEU similarity scores are less than 3%, whereas ROUGE and chrF/+/++ are in the range 9%-21%. Moreover, we observe that BERTScore has a tendency to over-score explanations compared to human evaluation scores. Conversely, METEOR, BLEU, GBLEU, ROUGE and chrF/+/++ have a tendency to under-score explanations. Instruction fine-tuning helped all metrics to approximate expert evaluations better, especially when tuned on knowledge-guided prompts.

Figure 2: Evaluation of explanation generation by LLMs across empirical metrics and human eval.

We use Spearman's rank correlation coefficient to compare the correlation between human scores and those provided by all the other metrics. In detail, we rank the models for each type of metric, and then we compute the Spearman correlation between the ranking obtained by human scores and those obtained by the other metrics. Table 3 reports all the correlation scores. We observe that BERTScore is the most correlated with humans in both tasks. Also, the chrF/+/++ metrics are highly correlated with humans, while all the other metrics based on syntactic matches are only slightly correlated with humans. Results show that semantic metrics are more similar to how humans evaluate the quality of the explanations generated by LLMs. Only one metric (ROUGE) shows a different behaviour between the two tasks.

Table 3: The Spearman coefficient between each metric and experts' scores.
Metric    | Implicit Hate | HateXplain
bertscore | 0.80          | 0.91
meteor    | 0.64          | 0.89
chrf1     | 0.60          | 0.83
chrf2     | 0.60          | 0.81
chrf      | 0.57          | 0.83
gbleu     | 0.53          | 0.25
rouge     | 0.50          | 0.86
bleu      | 0.27          | 0.11
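As a hedged illustration of the analysis behind Table 3, model configurations can be ranked by expert score and by a given metric and the two rankings compared with Spearman's rho (scipy performs the ranking internally). The numbers below are placeholders, not the paper's data.

```python
# Sketch of the rank-correlation analysis behind Table 3. Values are placeholders.
from scipy.stats import spearmanr

# One entry per model/learning-strategy configuration (e.g., alpaca_zsl, llama2_kg_ft, ...).
expert_scores = [2.1, 1.4, 2.6, 1.9, 2.3]          # hypothetical mean expert ratings (1-3 scale)
bertscore_scores = [0.78, 0.66, 0.84, 0.71, 0.80]  # hypothetical BERTScore values

# spearmanr ranks both score lists and correlates the resulting rankings.
rho, p_value = spearmanr(expert_scores, bertscore_scores)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```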
Since 38.55% of the ground-truth explanations were not rated as good explanations by participants, we further investigated what the most common errors are and what makes an explanation good. Table 4 reports the most common error categories reported by participants. Most of them are related to logical fallacies (e.g., contradictory statements, hallucination), especially in the context of sarcasm and self-deprecating humour, rather than linguistic errors (e.g., grammar, misspellings). It is worth noting that 13.33% of the participants reported that LLM-generated explanations contain cultural bias (e.g., stereotypes), with the implication of potentially perpetuating harms against the targeted victims of abusive language. As for desiderata, 73.33% of participants would like to receive textual explanations that are coherent with human reasoning and understanding, i.e., that are relevant and exhaustive to the text they refer to while being logically and linguistically correct. A remaining 20% thinks that a good explanation must be coherent with model reasoning instead. In other words, participants are much more concerned about what the explanation looks like than about its reflection of the inner mechanism of the model reasoning. To quote a participant's perspective, "I would want the explanation to be helpful to me and guide my own reasoning".

Table 4: Percentage of error categories reported by participants.
Error Category  | Relative Frequency
Logical Errors  | 26.67%
Vagueness       | 20.00%
Cultural Bias   | 13.33%
Hallucination   | 13.33%
Irrelevant Info | 13.33%
Other           | 6.67%

4. Conclusion

In this paper, we conducted a before-and-after study to understand human expectations and judgements of LLM-generated explanations for multi-class abusive language detection tasks. In contrast to previous research [22], we investigated multiple LLMs and learning techniques, and we surveyed AI experts who are familiar with abusive language research instead of crowdworkers. We found that human expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met: after seeing these explanations, the usefulness and trustworthiness ratings decrease by 47.78% and 64.32%, respectively. Secondly, our results show that empirical metrics commonly used to evaluate textual explanations are highly volatile with respect to each other, even when they measure the same type of similarity (i.e., semantic vs. syntactic), therefore pointing at the need for more reliable metrics for the empirical evaluation of textual explanations. In general, BERTScore and METEOR exhibit the strongest correlation with human judgements. Lastly, our study provides evidence of the desiderata for LLM-generated explanations, suggesting that explanations should be coherent with human reasoning rather than model reasoning. Participants value most those textual explanations that are relevant and exhaustive to the text they refer to, while being logically and linguistically correct. Justifications for this preference lie in the fact that abusive language detection heavily relies on additional context and knowledge about slang and slurs, for which receiving an explanation is helpful to participants' understanding of the text. Future work should investigate whether this preference holds for other domains as well.

In light of our findings, we conclude with three recommendations to use LLMs responsibly for explainable abusive language detection: (1) be aware of the cultural bias these models might exhibit when generating free-text explanations, which can further harm targeted groups; (2) if possible, instruction fine-tune LLMs for explanation generation of abusive language detection: this not only could ensure the generation of structured explanations as advised by previous research [1], but it also returns the highest evaluation scores, both empirically and expert-wise, when using knowledge-guided prompts; (3) opt for a combination of empirical metrics to evaluate textual explanations when no human evaluation is possible, since no particular empirical metric seems to generalise across different learning techniques, models and datasets, making the ground truth lie somewhere in between BERTScore (upper bound) and BLEU (lower bound).
Acknowledgments

This work was supported by UK Research and Innovation [grant number EP/S023356/1] in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org); by the Trustworthy AI Research award by The Alan Turing Institute, supported by the British Embassy Rome and the UK Science & Innovation Network; and by the PNRR project FAIR - Future AI Research (PE00000013), Spoke 6 - Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.

References

[1] P. Mishra, H. Yannakoudakis, E. Shutova, Tackling online abuse: A survey of automated abuse detection methods, arXiv preprint arXiv:1908.06024 (2019).
[2] P. Barceló, M. Monet, J. Pérez, B. Subercaseaux, Model interpretability through the lens of computational complexity, in: H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, H. Lin (Eds.), Advances in Neural Information Processing Systems, volume 33, Curran Associates, Inc., 2020, pp. 15487–15498. URL: https://proceedings.neurips.cc/paper_files/paper/2020/file/b1adda14824f50ef24ff1c05bb66faf3-Paper.pdf.
[3] M. Sap, D. Card, S. Gabriel, Y. Choi, N. A. Smith, The risk of racial bias in hate speech detection, in: A. Korhonen, D. Traum, L. Màrquez (Eds.), Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Florence, Italy, 2019, pp. 1668–1678. URL: https://aclanthology.org/P19-1163. doi:10.18653/v1/P19-1163.
[4] The European Parliament and The Council of the European Union, EU Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation), Official Journal of the European Union (2016).
[5] O. L. Haimson, D. Delmonaco, P. Nie, A. Wegner, Disproportionate removals and differing content moderation experiences for conservative, transgender, and black social media users: Marginalization and moderation gray areas, Proc. ACM Hum.-Comput. Interact. 5 (2021). URL: https://doi.org/10.1145/3479610. doi:10.1145/3479610.
[6] J. Brunk, J. Mattern, D. M. Riehle, Effect of transparency and trust on acceptance of automatic online comment moderation systems, in: 2019 IEEE 21st Conference on Business Informatics (CBI), volume 01, 2019, pp. 429–435. doi:10.1109/CBI.2019.00056.
[7] A. Calabrese, L. Neves, N. Shah, M. W. Bos, B. Ross, M. Lapata, F. Barbieri, Explainability and hate speech: Structured explanations make social media moderators faster, arXiv preprint arXiv:2406.04106 (2024).
[8] B. Mathew, P. Saha, S. M. Yimam, C. Biemann, P. Goyal, A. Mukherjee, HateXplain: A benchmark dataset for explainable hate speech detection, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, 2021, pp. 14867–14875.
[9] M. ElSherief, C. Ziems, D. Muchlinski, V. Anupindi, J. Seybolt, M. De Choudhury, D. Yang, Latent hatred: A benchmark for understanding implicit hate speech, in: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021, pp. 345–363.
[10] M. Sap, S. Gabriel, L. Qin, D. Jurafsky, N. A. Smith, Y. Choi, Social bias frames: Reasoning about social and power implications of language, in: ACL, 2020.
[11] D. Nozza, A. T. Cignarella, G. Damo, T. Caselli, V. Patti, HODI at EVALITA 2023: Overview of the first shared task on homotransphobia detection in Italian, in: 8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2023, CEUR Workshop Proceedings (CEUR-WS.org), 2023.
[12] H. Kirk, W. Yin, B. Vidgen, P. Röttger, SemEval-2023 task 10: Explainable detection of online sexism, in: Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023), 2023, pp. 2193–2210.
[13] H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma, et al., Scaling instruction-finetuned language models, arXiv preprint arXiv:2210.11416 (2022).
[14] T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al., Language models are few-shot learners, Advances in Neural Information Processing Systems 33 (2020) 1877–1901.
[15] T. Zhang, F. Ladhak, E. Durmus, P. Liang, K. McKeown, T. B. Hashimoto, Benchmarking large language models for news summarization, arXiv preprint arXiv:2301.13848 (2023).
[16] C. Ziems, W. Held, O. Shaikh, J. Chen, Z. Zhang, D. Yang, Can large language models transform computational social science?, arXiv preprint arXiv:2305.03514 (2023).
[17] S. Roy, A. Harshvardhan, A. Mukherjee, P. Saha, Probing LLMs for hate speech detection: strengths and vulnerabilities, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 6116–6128. URL: https://aclanthology.org/2023.findings-emnlp.407. doi:10.18653/v1/2023.findings-emnlp.407.
[18] Y. Yang, J. Kim, Y. Kim, N. Ho, J. Thorne, S.-Y. Yun, HARE: Explainable hate speech detection with step-by-step reasoning, in: H. Bouamor, J. Pino, K. Bali (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics, Singapore, 2023, pp. 5490–5505. URL: https://aclanthology.org/2023.findings-emnlp.365. doi:10.18653/v1/2023.findings-emnlp.365.
[19] F. Huang, H. Kwak, J. An, Chain of explanation: New prompting method to generate quality natural language explanation for implicit hate speech, in: Companion Proceedings of the ACM Web Conference 2023, WWW '23 Companion, Association for Computing Machinery, New York, NY, USA, 2023, pp. 90–93. URL: https://doi.org/10.1145/3543873.3587320. doi:10.1145/3543873.3587320.
[20] J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou, et al., Chain-of-thought prompting elicits reasoning in large language models, Advances in Neural Information Processing Systems 35 (2022) 24824–24837.
[21] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, 2002, pp. 311–318.
[22] H. Wang, M. S. Hee, M. R. Awal, K. T. W. Choo, R. K.-W. Lee, Evaluating GPT-3 generated explanations for hateful content moderation, in: Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, 2023, pp. 6255–6263.
[23] R. Bhardwaj, S. Poria, Red-teaming large language models using chain of utterances for safety-alignment, arXiv preprint arXiv:2308.09662 (2023).
[24] R. Taori, I. Gulrajani, T. Zhang, Y. Dubois, X. Li, C. Guestrin, P. Liang, T. B. Hashimoto, Stanford Alpaca: An instruction-following LLaMA model, https://github.com/tatsu-lab/stanford_alpaca, 2023.
[25] N. Muennighoff, T. Wang, L. Sutawika, A. Roberts, S. Biderman, T. Le Scao, M. S. Bari, S. Shen, Z. X. Yong, H. Schoelkopf, X. Tang, D. Radev, A. F. Aji, K. Almubarak, S. Albanie, Z. Alyafeai, A. Webson, E. Raff, C. Raffel, Crosslingual generalization through multitask finetuning, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Toronto, Canada, 2023, pp. 15991–16111. URL: https://aclanthology.org/2023.acl-long.891. doi:10.18653/v1/2023.acl-long.891.
[26] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, et al., Llama 2: Open foundation and fine-tuned chat models, arXiv preprint arXiv:2307.09288 (2023).
[27] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M.-A. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, et al., LLaMA: Open and efficient foundation language models, arXiv preprint arXiv:2302.13971 (2023).
[28] D. Vrandečić, M. Krötzsch, Wikidata: a free collaborative knowledgebase, Communications of the ACM 57 (2014) 78–85.
[29] K. Halevy, A group-specific approach to NLP for hate speech detection, arXiv preprint arXiv:2304.11223 (2023).
[30] R. Speer, J. Chin, C. Havasi, ConceptNet 5.5: An open multilingual graph of general knowledge, in: Proceedings of the AAAI Conference on Artificial Intelligence, volume 31, 2017.
[31] A. B. Sai, T. Dixit, D. Y. Sheth, S. Mohan, M. M. Khapra, Perturbation CheckLists for evaluating NLG evaluation metrics, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 7219–7234. URL: https://aclanthology.org/2021.emnlp-main.575. doi:10.18653/v1/2021.emnlp-main.575.
[32] E. Reiter, A structured review of the validity of BLEU, Computational Linguistics 44 (2018) 393–401. URL: https://aclanthology.org/J18-3002. doi:10.1162/coli_a_00322.
[33] J. Novikova, O. Dušek, A. Cercas Curry, V. Rieser, Why we need new evaluation metrics for NLG, in: M. Palmer, R. Hwa, S. Riedel (Eds.), Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 2241–2252. URL: https://aclanthology.org/D17-1238. doi:10.18653/v1/D17-1238.
[34] A. B. Sai, A. K. Mohankumar, M. M. Khapra, A survey of evaluation metrics used for NLG systems, ACM Computing Surveys (CSUR) 55 (2022) 1–39.
[35] A. Celikyilmaz, E. Clark, J. Gao, Evaluation of text generation: A survey, arXiv preprint arXiv:2006.14799 (2020).
[36] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation with BERT, in: International Conference on Learning Representations, 2019.
[37] A. Lavie, M. J. Denkowski, The METEOR metric for automatic evaluation of machine translation, Machine Translation 23 (2009) 105–115.
[38] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. Johnson, X. Liu, Łukasz Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, J. Dean, Google's neural machine translation system: Bridging the gap between human and machine translation, 2016. arXiv:1609.08144.
[39] C.-Y. Lin, ROUGE: A package for automatic evaluation of summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://www.aclweb.org/anthology/W04-1013.
[40] M. Popović, chrF: character n-gram F-score for automatic MT evaluation, in: Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics, Lisbon, Portugal, 2015, pp. 392–395. URL: https://aclanthology.org/W15-3049. doi:10.18653/v1/W15-3049.
[41] M. Popović, chrF++: words helping character n-grams, in: Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics, Copenhagen, Denmark, 2017, pp. 612–618. URL: https://aclanthology.org/W17-4770. doi:10.18653/v1/W17-4770.
[42] M. Post, A call for clarity in reporting BLEU scores, in: Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics, Belgium, Brussels, 2018, pp. 186–191. URL: https://www.aclweb.org/anthology/W18-6319.
[43] K. Krippendorff, Computing Krippendorff's alpha-reliability, 2011.

A. Prompt Details

Table 5 shows the two types of prompts we used in our experiments, following the template of the Stanford Alpaca project. The two categories differ in the 'context' that is passed in the knowledge-guided version, which contains the information extracted from the knowledge sources linked to the text. As described in Section 2.2 of the paper, we used the vanilla prompts for zero-shot learning, few-shot learning, and instruction fine-tuning, whereas we used the knowledge-guided prompts for knowledge-guided zero-shot learning and knowledge-guided instruction fine-tuning.

Table 5: Details of vanilla prompts and knowledge-guided prompts passed to the LLMs in our experiments.

Vanilla:
"Below is an instruction that describes a task, paired with input text. Write a response that appropriately completes the instruction.
Instruction: Classify the input text as list_of_labels, and provide an explanation.
Input text: text_to_classify.
Response:"

Knowledge-guided:
"Below is an instruction that describes a task, paired with context and input text. Write a response that appropriately completes the instruction based on the context.
Instruction: Classify the input text as list_of_labels, and provide an explanation.
Context: knowledge_source_linked.
Input text: text_to_classify.
Response:"

B. Survey Questions

Participants were presented with the questions shown in Table 6.

Table 6: List of questions asked to participants in our expert survey.

Before Treatment:
- "Which gender do you identify as?"
- "Are you an English native-speaker?"
- "What is your country of origin?"
- "What is your level of expertise on language models or abusive language?"
- "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?"
- "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?"

Treatment:
- "Do you think explanation 1 provides a good explanation given the text?"
- "If your answer was yes, does explanation 2 mean the same thing as explanation 1?"
- "If your answer was yes, does explanation 3 mean the same thing as explanation 1?"
- "If your answer was yes, does explanation 4 mean the same thing as explanation 1?"

After Treatment:
- "Having seen these explanations, how useful would you rate a system that provides you a textual explanation for its classification?"
- "Having seen these explanations, how trustworthy would you rate a system that provides you a textual explanation for its classification?"
- "What was the main error you noticed in these explanations?"
- "What do you think makes a textual explanation good?"
- "Do you have any comment you would like to share?"