<?xml version="1.0" encoding="UTF-8"?>
<TEI xml:space="preserve" xmlns="http://www.tei-c.org/ns/1.0" 
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xsi:schemaLocation="http://www.tei-c.org/ns/1.0 https://raw.githubusercontent.com/kermitt2/grobid/master/grobid-home/schemas/xsd/Grobid.xsd"
 xmlns:xlink="http://www.w3.org/1999/xlink">
	<teiHeader xml:lang="en">
		<fileDesc>
			<titleStmt>
				<title level="a" type="main">Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection</title>
			</titleStmt>
			<publicationStmt>
				<publisher/>
				<availability status="unknown"><licence/></availability>
			</publicationStmt>
			<sourceDesc>
				<biblStruct>
					<analytic>
						<author>
							<persName><forename type="first">Chiara</forename><forename type="middle">Di</forename><surname>Bonaventura</surname></persName>
							<email>chiara.di_bonaventura@kcl.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
							<affiliation key="aff1">
								<orgName type="institution">Imperial College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Lucia</forename><surname>Siciliani</surname></persName>
							<email>lucia.siciliani@uniba.it</email>
							<affiliation key="aff2">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Pierpaolo</forename><surname>Basile</surname></persName>
							<email>pierpaolo.basile@uniba.it</email>
							<affiliation key="aff2">
								<orgName type="department">Department of Computer Science</orgName>
								<orgName type="institution">University of Bari Aldo Moro</orgName>
								<address>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Albert</forename><surname>Meroño-Peñuela</surname></persName>
							<affiliation key="aff0">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<persName><forename type="first">Barbara</forename><surname>McGillivray</surname></persName>
							<email>barbara.mcgillivray@kcl.ac.uk</email>
							<affiliation key="aff0">
								<orgName type="institution">King&apos;s College London</orgName>
								<address>
									<settlement>London</settlement>
									<country key="GB">United Kingdom</country>
								</address>
							</affiliation>
						</author>
						<author>
							<affiliation key="aff3">
								<orgName type="department">Tenth Italian Conference on Computational Linguistics</orgName>
								<address>
									<addrLine>Dec 04 -06</addrLine>
									<postCode>2024</postCode>
									<settlement>Pisa</settlement>
									<country key="IT">Italy</country>
								</address>
							</affiliation>
						</author>
						<title level="a" type="main">Is Explanation All You Need? An Expert Survey on LLM-generated Explanations for Abusive Language Detection</title>
					</analytic>
					<monogr>
						<idno type="ISSN">1613-0073</idno>
					</monogr>
					<idno type="MD5">79F0AA8F3C9FC726F5E7380167E53110</idno>
				</biblStruct>
			</sourceDesc>
		</fileDesc>
		<encodingDesc>
			<appInfo>
				<application version="0.7.2" ident="GROBID" when="2025-04-23T17:34+0000">
					<desc>GROBID - A machine learning software for extracting information from scholarly documents</desc>
					<ref target="https://github.com/kermitt2/grobid"/>
				</application>
			</appInfo>
		</encodingDesc>
		<profileDesc>
			<textClass>
				<keywords>
					<term>Large Language Models</term>
					<term>Hate Speech Detection</term>
					<term>Explanation Generation</term>
					<term>Human Evaluation</term>
				</keywords>
			</textClass>
			<abstract>
<div xmlns="http://www.tei-c.org/ns/1.0"><p>Explainable abusive language detection has proven to help both users and content moderators, and recent research has focused on prompting LLMs to generate explanations for why a specific text is hateful. Yet, understanding the alignment of these generated explanations with human expectations and judgements is far from being solved. In this paper, we design a before-and-after study recruiting AI experts to evaluate the usefulness and trustworthiness of LLM-generated explanations for abusive language detection tasks, investigating multiple LLMs and learning strategies. Our experiments show that expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met, as their ratings decrease by 47.78% and 64.32%, respectively, after treatment. Further, our results suggest caution in using LLMs for explanation generation of abusive language detection due to (i) their cultural bias, and (ii) difficulty in reliably evaluating them with empirical metrics. In light of our results, we provide three recommendations to use LLMs responsibly for explainable abusive language detection.</p></div>
			</abstract>
		</profileDesc>
	</teiHeader>
	<text xml:lang="en">
		<body>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>Explainability is a crucial open challenge in Natural Language Processing (NLP) research on abusive language <ref type="bibr" target="#b0">[1]</ref>, as increasing model complexity <ref type="bibr" target="#b1">[2]</ref>, models' intrinsic bias <ref type="bibr" target="#b2">[3]</ref>, and international regulations <ref type="bibr" target="#b3">[4]</ref> call for a shift in perspective from performance-based models to more transparent models. Moreover, recent studies have shown the benefits of explanations for users <ref type="bibr" target="#b4">[5,</ref><ref type="bibr" target="#b5">6]</ref> and content moderators <ref type="bibr" target="#b6">[7]</ref> on social media platforms. The former can benefit from receiving an explanation for why a certain post has been flagged or removed, whereas the latter are shown to annotate toxic posts faster and resolve doubtful annotations thanks to explanations.</p><p>Several efforts have moved towards explainable abusive language detection in recent years, like the development of datasets containing rationales (i.e., the tokens in the text that suggest why the text is hateful) <ref type="bibr" target="#b7">[8]</ref> or implied statements (i.e., descriptions of the implied meaning of the text) <ref type="bibr" target="#b8">[9,</ref><ref type="bibr" target="#b9">10]</ref>, and shared tasks on explainable hate speech detection <ref type="bibr" target="#b10">[11,</ref><ref type="bibr" target="#b11">12]</ref>, inter alia.
With Large Language Models (LLMs) like FLAN-T5 <ref type="bibr" target="#b12">[13]</ref> showing remarkable performance across tasks and human-like text generation <ref type="bibr" target="#b13">[14,</ref><ref type="bibr" target="#b14">15,</ref><ref type="bibr" target="#b15">16]</ref>, recent studies have explored LLMs for explainable hate speech detection, wherein classification predictions are described through natural language explanations <ref type="bibr" target="#b16">[17,</ref><ref type="bibr" target="#b17">18]</ref>. For instance, <ref type="bibr" target="#b18">[19]</ref> used chain-of-thought prompting <ref type="bibr" target="#b19">[20]</ref> of LLMs to generate explanations for implicit hate speech detection.</p><p>However, most of these studies rely on empirical metrics like BLEU <ref type="bibr" target="#b20">[21]</ref> to evaluate the generated explanations automatically. Consequently, the human perception and implications of these explanations remain understudied, as well as the extent to which empirical metrics approximate human judgements. <ref type="bibr" target="#b21">[22]</ref> recruited crowdworkers to evaluate the level of hatefulness in tweets and the quality of explanations generated by GPT-3. Instead, we conduct an expert survey investigating four LLMs and five learning strategies across multi-class abusive language detection tasks to answer the following questions: RQ1: How well do LLM-generated explanations for abusive language detection match human expectations? RQ2: How well do empirical metrics align with human judgements? RQ3: What makes LLM-generated explanations good, according to experts?</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Experimental Setup</head><p>To answer these research questions, we design a before-and-after study, surveying participants about their prior expectations about LLM-generated explanations and then showing them examples generated by several LLMs with diverse learning strategies<ref type="foot" target="#foot_0">1</ref>, followed by further interviews. To ensure the robustness of our results, we recruited experts in the field, i.e., AI researchers, as described below.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.1.">Data</head><p>For our experiments, we use HateXplain <ref type="bibr" target="#b7">[8]</ref> and the Implicit Hate Corpus <ref type="bibr" target="#b8">[9]</ref> as they encompass different levels of offensiveness (i.e., hate speech, offensive, neutral), expressiveness (i.e., explicit hate, implicit hate, neutral), multiple targeted groups, and explanations for the hateful label (Table <ref type="table" target="#tab_0">1</ref>). These datasets contain unstructured explanations of the words that constitute abuse (in HateXplain) and the user's intent (in Implicit Hate). In view of previous research arguing the need for structured explanations in hateful content moderation <ref type="bibr" target="#b0">[1]</ref>, we use the following template to create structured explanations, which we use as ground truth: "Explanation: it contains the following hateful words (implied statement):" for abusive content in HateXplain (Implicit Hate Corpus), and "The text does not contain abusive content." for neutral content.</p></div>
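The template above maps each dataset's raw annotation onto a structured ground-truth explanation. A minimal sketch of this mapping, assuming illustrative function and parameter names (these are not the datasets' actual field names):

```python
def structured_explanation(label, hateful_words=None, implied_statement=None):
    """Build a structured ground-truth explanation following the paper's template.

    `label`, `hateful_words` and `implied_statement` are illustrative names,
    not the actual fields of HateXplain or the Implicit Hate Corpus.
    """
    if label == "neutral":
        return "The text does not contain abusive content."
    if hateful_words is not None:
        # HateXplain: token-level rationales
        return ("Explanation: it contains the following hateful words: "
                + ", ".join(hateful_words))
    # Implicit Hate Corpus: free-text implied statement
    return ("Explanation: it contains the following implied statement: "
            + implied_statement)
```

The same function covers both corpora: token-level rationales yield a word list, while Implicit Hate entries yield the implied statement.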
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2.">Methodology</head><p>We extensively investigate four popular LLMs across five learning strategies on their ability to detect multi-class offensiveness and expressiveness of abusive language and to generate explanations for the classification.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Models.</head><p>We use different open-source LLMs (Table <ref type="table">2</ref>): the base versions of FLAN-Alpaca <ref type="bibr" target="#b22">[23,</ref><ref type="bibr" target="#b23">24]</ref>, FLAN-T5 <ref type="bibr" target="#b12">[13]</ref>, mT0 <ref type="bibr" target="#b24">[25]</ref>, and the 7B foundational model Llama 2 <ref type="bibr" target="#b25">[26]</ref>, which is an updated version of LLaMA <ref type="bibr" target="#b26">[27]</ref>.</p><formula xml:id="formula_0">Model | Instruction Fine-tuned | Toxicity Fine-tuned
FLAN-Alpaca | ✓ | ✓
FLAN-T5 | ✓ | ✓
mT0 | ✓ | -
Llama-2 | - | -</formula></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 2</head><p>Summary of models used.</p><p>Learning strategies. As different prompting strategies might yield different results, we test five distinct learning strategies using the established Stanford Alpaca template<ref type="foot" target="#foot_1">2</ref> (cf. Appendix A for prompt details):</p><p>(1) zero-shot learning (zsl): we pass "Classify the input text as list_of_labels, and provide an explanation" in the instruction field of the template. The list_of_labels changes according to the dataset used;</p><p>(2) few-shot learning (fsl): we pass three additional examples to the aforementioned template, which are randomly sampled with equal probability among the labels to account for class imbalance in the datasets. We experimented with different numbers of examples (i.e., passing one, three or five examples), and chose three as it was the best strategy;</p><p>(3) knowledge-guided zero-shot learning (kg): instead of passing additional examples in the prompts, we add external knowledge retrieved by means of an entity linker <ref type="foot" target="#foot_2">3</ref> , which first detects entities mentioned in the input text, and then retrieves the relevant information from the external knowledge base. We use Wikidata <ref type="bibr" target="#b27">[28]</ref> for encyclopedic knowledge, KnowledJe <ref type="bibr" target="#b28">[29]</ref> for hate speech temporal linguistic knowledge and ConceptNet <ref type="bibr" target="#b29">[30]</ref> for commonsense knowledge. We modify the prompt template with an additional field called 'context' to account for this external knowledge;</p><p>(4) instruction fine-tuning (ft): we use the same prompts used in (1) to instruction fine-tune Llama-2;</p><p>(5) knowledge-guided instruction fine-tuning (kg_ft): we use the knowledge-guided prompts developed in (3) to instruction fine-tune Llama-2.</p><p>Empirical eval metrics. 
We evaluate how closely the LLM-generated explanations match the ground-truth across eight empirical similarity metrics due to the challenge of simultaneously assessing a wide set of criteria <ref type="bibr" target="#b30">[31,</ref><ref type="bibr" target="#b31">32,</ref><ref type="bibr" target="#b32">33]</ref>. Following established NLG research <ref type="bibr" target="#b33">[34,</ref><ref type="bibr" target="#b34">35]</ref>, we choose BERTScore <ref type="bibr" target="#b35">[36]</ref> and METEOR <ref type="bibr" target="#b36">[37]</ref> for semantic similarity. For syntactic similarity, we select BLEU <ref type="bibr" target="#b20">[21]</ref>, GBLEU <ref type="bibr" target="#b37">[38]</ref>, ROUGE <ref type="bibr" target="#b38">[39]</ref>, and ChrF <ref type="bibr" target="#b39">[40]</ref> with its derivatives ChrF+ and ChrF++ <ref type="bibr" target="#b40">[41,</ref><ref type="bibr" target="#b41">42]</ref>. Additionally, we present an expert evaluation following our survey.</p></div>
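To make the syntactic-similarity family concrete, clipped unigram precision — the building block of BLEU-style metrics — can be computed in a few lines. This is a toy stand-in for illustration, not the full metrics used in our experiments:

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision of a candidate explanation against a
    reference: each candidate token counts at most as often as it appears
    in the reference (the core idea behind BLEU-style n-gram metrics)."""
    cand = candidate.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    return clipped / len(cand)
```

For example, `unigram_precision("a b b c", "a b c d")` is 0.75: the duplicated "b" is clipped to a single match, so three of the four candidate tokens are credited.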
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.3.">Survey Design</head><p>To evaluate how well LLMs align with human expectations and judgements in explanation generation, we design a before-and-after study as follows.</p><p>Before treatment. We ask for participants' background information, e.g., gender identity, native language, and how they would rate the usefulness and trustworthiness of a language model for explanation generation. Specifically, we ask "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" and "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" on a 1-5 Likert scale.</p><p>Treatment. As for the treatment, we show participants a sample of 70 texts from the datasets, paired with up to four different explanations. Specifically, given a text and ground-truth explanation, participants are asked if the text is correctly explained. If yes, they are asked to rate three different LLM-generated explanations with respect to the ground-truth on a 1-3 scale. These explanations are randomly sampled among the four LLMs and five learning strategies discussed in Section 2.2.</p><p>After treatment. Finally, we ask participants' opinions on the usefulness and trustworthiness of explanation generation, having seen the LLM-generated explanations. In addition, we ask which types of errors they observed most frequently, and what a good explanation would look like.</p><p>The full list of questions is in Appendix B. The institutional ethics board of the first author's university approved our study design. We distributed the survey through channels that allow us to target individuals working in AI who are familiar with the field of language models and/or AI Ethics, including NLP reading groups and AI Ethics interest groups.
To ensure the reliability of our before-and-after study, participants were given 1 hour to complete as many answers as they could. We collected answers from 15 participants, of which 33% (67%) identify as female (male), and 33% (67%) are (non-)native English speakers. The average level of participants' expertise in abusive language research is 2.47 out of 5 (self-described)<ref type="foot" target="#foot_3">4</ref>, and their continents of origin include Europe (60%), Asia (26.67%), Africa (6.67%), and Latin America (6.67%).</p></div>
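The before/after comparison reported in Section 3 amounts to the percentage decrease of mean Likert ratings between the two phases. A sketch of that computation, using hypothetical ratings rather than our survey data (the paper's exact aggregation, e.g. per-participant vs. pooled, is an assumption here):

```python
def pct_decrease(before, after):
    """Percentage decrease of the mean Likert rating from before to after
    treatment. Pools all ratings; the paper's exact aggregation scheme
    (e.g., averaging per-participant changes) may differ."""
    mean_before = sum(before) / len(before)
    mean_after = sum(after) / len(after)
    return 100.0 * (mean_before - mean_after) / mean_before
```

With hypothetical ratings `[5, 4, 5, 4]` before and `[2, 3, 2, 3]` after, the mean drops from 4.5 to 2.5, a decrease of roughly 44.4%.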
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Results and Discussion</head><p>Our 15 participants reach a fair agreement, with Krippendorff's alpha <ref type="bibr" target="#b42">[43]</ref> equal to 38.43%.</p><p>Fig. <ref type="figure">1</ref> shows changes in the relative frequencies of participant scores on the usefulness and trustworthiness of explanations before and after treatment. Before treatment, participants expect textual explanations for classifications to be "highly useful" (above 50%; highest possible score) in terms of usefulness, and "moderately trustworthy" or "neutral" (above 40%; second- and third-best possible scores) in terms of trustworthiness. After treatment, however, participants shift their usefulness scores towards "moderately unuseful" (40-50%; second-worst possible score) and their trustworthiness scores to "highly untrustworthy" (above 30%; worst possible score). Agreement differs between the two categories: usefulness is much more consensual, whereas trustworthiness is judged with higher variance. In general, LLM-generated explanations do not meet human expectations in terms of usefulness and trustworthiness. Specifically, exposing participants to these explanations leads to an average percentage decrease of 47.78% and 64.32% in the perceived usefulness and trustworthiness of explanations, respectively. Fig. <ref type="figure" target="#fig_0">2</ref> shows the scores of all empirical metrics and the expert evaluation for all models on explanation generation. Overall, the similarity metrics are highly volatile with respect to each other. For instance, FLAN-Alpaca prompted with zero-shot learning (i.e., 'alpaca_zsl' in the figure) generates explanations that are more than 70% semantically similar to the ground-truth explanations according to BERTScore, but less than 20% semantically similar according to METEOR.
Similarly for syntax: BLEU and GBLEU similarity scores are below 3%, whereas ROUGE and chrF/+/++ lie in the range 9%-21%. Moreover, we observe that BERTScore tends to over-score explanations compared to human evaluation scores, whereas METEOR, BLEU, GBLEU, ROUGE and chrF/+/++ tend to under-score them. Instruction fine-tuning helped all metrics approximate expert evaluations better, especially when tuned on knowledge-guided prompts. We use Spearman's rank correlation coefficient to compare human scores with those provided by the other metrics. In detail, we rank the models under each metric, and then compute the Spearman correlation between the ranking obtained from human scores and the rankings obtained from the other metrics. Table <ref type="table" target="#tab_2">3</ref> reports all the correlation scores. We observe that BERTScore is the most correlated with humans in both tasks. Also, the chrF/+/++ metrics are highly correlated with humans, while all the other metrics based on syntactic matches are only slightly correlated. These results show that semantic metrics are closer to how humans evaluate the quality of the explanations generated by LLMs. Only one metric (ROUGE) behaves differently between the two tasks.</p><p>Since 38.55% of the ground-truth explanations were not rated as good explanations by participants, we further investigated the most common errors and what makes an explanation good. Table <ref type="table" target="#tab_3">4</ref> reports the most common error categories reported by participants.
Most of them relate to logical fallacies (e.g., contradictory statements, hallucination), especially in the context of sarcasm and self-deprecating humour, rather than linguistic errors (e.g., grammar, misspellings). It is worth noting that 13.33% of the participants reported that LLM-generated explanations contain cultural bias (e.g., stereotypes), with the implication of potentially perpetuating harms against the targeted victims of abusive language. As for desiderata, 73.33% of participants would like to receive textual explanations that are coherent with human reasoning and understanding, i.e., that are relevant and exhaustive with respect to the text they refer to while being logically and linguistically correct. A remaining 20% think that a good explanation must instead be coherent with model reasoning. In other words, participants are much more concerned with what the explanation looks like than with how faithfully it reflects the model's inner reasoning. To quote one participant, "I would want the explanation to be helpful to me and guide my own reasoning".</p></div>
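The rank-based comparison described above can be reproduced with the classic closed form of Spearman's coefficient for tie-free rankings. A minimal sketch (for rankings with ties, a library routine such as `scipy.stats.spearmanr` should be used instead):

```python
def spearman_rho(rank_a, rank_b):
    """Spearman's rank correlation for two tie-free rankings of the same
    models: rho = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), where d_i is the
    rank difference of model i between the two rankings."""
    assert len(rank_a) == len(rank_b)
    n = len(rank_a)
    d_sq = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_sq / (n * (n ** 2 - 1))
```

Identical rankings give 1.0, fully reversed rankings give -1.0, matching the interpretation of the correlations in Table 3.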
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Conclusion</head><p>In this paper, we conducted a before-and-after study to understand human expectations and judgements of LLM-generated explanations for multi-class abusive language detection tasks. Contrary to previous research <ref type="bibr" target="#b21">[22]</ref>, we investigated multiple LLMs and learning techniques, and we surveyed AI experts who are familiar with abusive language research instead of crowdworkers. We found that human expectations in terms of usefulness and trustworthiness of LLM-generated explanations are not met: after seeing these explanations, the usefulness and trustworthiness ratings decrease by 47.78% and 64.32%, respectively. Secondly, our results show that empirical metrics commonly used to evaluate textual explanations are highly volatile with respect to each other, even when they measure the same type of similarity (i.e., semantic vs. syntactic), pointing to the need for more reliable metrics for the empirical evaluation of textual explanations. In general, BERTScore and METEOR exhibit the strongest correlation with human judgements. Lastly, our study provides evidence of the desiderata for LLM-generated explanations, suggesting that explanations should be coherent with human reasoning rather than model reasoning. Participants most value textual explanations that are relevant and exhaustive with respect to the text they refer to, while being logically and linguistically correct. Justifications for this preference lie in the fact that abusive language detection heavily relies on additional context and knowledge about slang and slurs, for which receiving an explanation is helpful to participants' understanding of the text. Future work should investigate whether this preference holds for other domains as well.
In light of our findings, we conclude with three recommendations for using LLMs responsibly for explainable abusive language detection: (1) be aware of the cultural bias these models might exhibit when generating free-text explanations, which can further harm targeted groups;</p><p>(2) if possible, instruction fine-tune LLMs for explanation generation in abusive language detection. Not only could this ensure the generation of structured explanations, as advised by previous research <ref type="bibr" target="#b0">[1]</ref>, but it also returns the highest evaluation scores, both empirically and expert-wise, when using knowledge-guided prompts;</p><p>(3) opt for a combination of empirical metrics to evaluate textual explanations when no human evaluation is possible: since no single empirical metric seems to generalise across learning techniques, models and datasets, the ground truth lies somewhere between BERTScore (upper bound) and BLEU (lower bound).</p></div><figure xmlns="http://www.tei-c.org/ns/1.0" xml:id="fig_0"><head>Figure 2:</head><label>2</label><figDesc>Figure 2: Evaluation of explanation generation by LLMs across empirical metrics and human eval.</figDesc></figure>
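Recommendation (3) can be operationalised as reporting a band rather than a single score: with no human evaluation available, quote the most pessimistic (BLEU-like) and most optimistic (BERTScore-like) estimates together. A minimal sketch, where the metric names and values are illustrative:

```python
def metric_band(scores):
    """Given per-metric scores for one system's explanations, return the
    (lower, upper) band within which the true quality is assumed to lie,
    following the recommendation to combine empirical metrics rather than
    trust any single one."""
    values = scores.values()
    return min(values), max(values)
```

For instance, `metric_band({"bleu": 0.02, "rouge": 0.15, "bertscore": 0.71})` returns `(0.02, 0.71)`, reporting BLEU as the lower bound and BERTScore as the upper bound.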
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_0"><head>Table 1</head><label>1</label><figDesc>Summary of datasets used.</figDesc><table><row><cell>Dataset</cell><cell>Labels</cell><cell>Target</cell><cell>Explanation</cell></row><row><cell>HateXplain</cell><cell>hate speech, offensive, neutral</cell><cell>women, black, ...</cell><cell>Token-level</cell></row><row><cell>Implicit Hate</cell><cell>implicit hate, explicit hate, neutral</cell><cell>Jews, whites, ...</cell><cell>Implied statement</cell></row></table></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_2"><head>Table 3</head><label>3</label><figDesc>The Spearman coefficient between each metric and experts' scores.</figDesc><table /></figure>
<figure xmlns="http://www.tei-c.org/ns/1.0" type="table" xml:id="tab_3"><head>Table 4</head><label>4</label><figDesc>Percentage of error categories reported by participants.</figDesc><table><row><cell>Error Category</cell><cell>Relative Frequency</cell></row><row><cell>Logical Errors</cell><cell>26.67%</cell></row><row><cell>Vagueness</cell><cell>20.00%</cell></row><row><cell>Cultural Bias</cell><cell>13.33%</cell></row><row><cell>Hallucination</cell><cell>13.33%</cell></row><row><cell>Irrelevant Info</cell><cell>13.33%</cell></row><row><cell>Other</cell><cell>6.67%</cell></row></table></figure>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0">The data containing the LLM-generated explanations are publicly available at https://github.com/ChiaraDiBonaventura/ is-explanation-all-you-need</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_1">https://github.com/tatsu-lab/stanford_alpaca?tab=readme-ov-file# data-release</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_2">If available, we use the API provided by the knowledge source, spaCy otherwise. https://spacy.io/</note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_3">The list of levels to choose from was: 1=Novice, 2=Advanced beginner, 3=Competent, 4=Proficient, 5=Expert.</note>
		</body>
		<back>

			<div type="acknowledgement">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Acknowledgments</head><p>This work was supported by the UK Research and Innovation [grant number EP/S023356/1] in the UKRI Centre for Doctoral Training in Safe and Trusted Artificial Intelligence (www.safeandtrustedai.org); by the Trustworthy AI Research award by The Alan Turing Institute, supported by the British Embassy Rome and the UK Science &amp; Innovation Network; and by the PNRR project FAIR -Future AI Research (PE00000013), Spoke 6 -Symbiotic AI (CUP H97G22000210007) under the NRRP MUR program funded by the NextGenerationEU.</p></div>
			</div>


			<div type="funding">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>† Work partially funded by the Trustworthy AI Research award received by The Alan Turing Institute and the Italian Future AI Research Foundation (FAIR).</p></div>
			</div>

			<div type="annex">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Prompt Details</head><p>Table <ref type="table">5</ref> shows the two types of prompts we used in our experiments, following the template of the Stanford Alpaca project. The two categories differ in the 'context' field that is passed in the knowledge-guided version, which contains the information extracted from the knowledge sources linked to the text. As described in Section 2.2, we used the vanilla prompts for zero-shot learning, few-shot learning, and instruction fine-tuning, whereas we used the knowledge-guided prompts for knowledge-guided zero-shot learning and knowledge-guided instruction fine-tuning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Survey Questions</head><p>Participants were presented with the questions shown in Table <ref type="table">6</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Vanilla</head><p>Below is an instruction that describes a task, paired with input text. Write a response that appropriately completes the instruction.</p><p>Instruction: Classify the input text as list_of_labels, and provide an explanation. Input text: text_to_classify. Response:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Knowledge-guided</head><p>Below is an instruction that describes a task, paired with context and input text. Write a response that appropriately completes the instruction based on the context. Instruction: Classify the input text as list_of_labels, and provide an explanation. Context: knowledge_source_linked. Input text: text_to_classify. Response:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 5</head><p>Details of vanilla prompts and knowledge-guided prompts passed to the LLMs in our experiments.</p></div>
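The two templates in Table 5 can be assembled programmatically. A sketch under the assumption that fields are joined on separate lines (the actual experiments' whitespace and field layout may differ):

```python
def build_prompt(instruction, input_text, context=None):
    """Assemble a vanilla or knowledge-guided Alpaca-style prompt following
    Table 5. Passing `context` switches to the knowledge-guided variant."""
    if context is None:
        return (
            "Below is an instruction that describes a task, paired with input text. "
            "Write a response that appropriately completes the instruction.\n"
            f"Instruction: {instruction}\n"
            f"Input text: {input_text}\n"
            "Response:"
        )
    return (
        "Below is an instruction that describes a task, paired with context and input text. "
        "Write a response that appropriately completes the instruction based on the context.\n"
        f"Instruction: {instruction}\n"
        f"Context: {context}\n"
        f"Input text: {input_text}\n"
        "Response:"
    )
```

For zero-shot learning the instruction would be, e.g., "Classify the input text as hate speech, offensive, neutral, and provide an explanation"; for the knowledge-guided variants the linked knowledge-base snippets are passed via `context`.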
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Part Questions</head><p>Before Treatment "Which gender do you identify as?" "Are you an English native-speaker?" "What is your country of origin?" "What is your level of expertise on language models or abusive language?" "How useful would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?" "How trustworthy would you rate a system that provides you a textual explanation for its classification with respect to receiving only its classification?"</p><p>Treatment "Do you think explanation 1 provides a good explanation given the text?" "If your answer was yes, does explanation 2 mean the same thing as explanation 1?" "If your answer was yes, does explanation 3 mean the same thing as explanation 1?" "If your answer was yes, does explanation 4 mean the same thing as explanation 1?"</p><p>After Treatment "Having seen these explanations, how useful would you rate a system that provides you a textual explanation for its classification?" "Having seen these explanations, how trustworthy would you rate a system that provides you a textual explanation for its classification?" "What was the main error you noticed in these explanations?" "What do you think makes a textual explanation good?" "Do you have any comment you would like to share?"</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Table 6</head><p>List of questions asked to participants in our expert survey.</p></div>			</div>
			<div type="references">

				<listBibl>

<biblStruct xml:id="b0">
	<monogr>
		<author>
			<persName><forename type="first">P</forename><surname>Mishra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Yannakoudakis</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Shutova</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1908.06024</idno>
		<title level="m">Tackling online abuse: A survey of automated abuse detection methods</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b1">
	<analytic>
		<title level="a" type="main">Model interpretability through the lens of computational complexity</title>
		<author>
			<persName><forename type="first">P</forename><surname>Barceló</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Monet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Pérez</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Subercaseaux</surname></persName>
		</author>
		<ptr target="https://proceedings.neurips.cc/paper_files/paper/2020/file/b1adda14824f50ef24ff1c05bb66faf3-Paper.pdf" />
	</analytic>
	<monogr>
		<title level="m">Advances in Neural Information Processing Systems</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Larochelle</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Ranzato</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hadsell</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">M</forename><surname>Balcan</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">H</forename><surname>Lin</surname></persName>
		</editor>
		<imprint>
			<publisher>Curran Associates, Inc</publisher>
			<date type="published" when="2020">2020</date>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="15487" to="15498" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b2">
	<analytic>
		<title level="a" type="main">The risk of racial bias in hate speech detection</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Card</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/P19-1163</idno>
		<ptr target="https://aclanthology.org/P19-1163" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Korhonen</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">D</forename><surname>Traum</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Màrquez</surname></persName>
		</editor>
		<meeting>the 57th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics<address><addrLine>Florence, Italy</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="page" from="1668" to="1678" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b3">
	<monogr>
		<title level="m">The European Parliament and the Council of the European Union, EU Regulation 2016/679 on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing Directive 95/46/EC (General Data Protection Regulation)</title>
				<imprint>
			<publisher>Official Journal of the European Union</publisher>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b4">
	<analytic>
		<title level="a" type="main">Disproportionate removals and differing content moderation experiences for conservative, transgender, and black social media users: Marginalization and moderation gray areas</title>
		<author>
			<persName><forename type="first">O</forename><forename type="middle">L</forename><surname>Haimson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Delmonaco</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Nie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Wegner</surname></persName>
		</author>
		<idno type="DOI">10.1145/3479610</idno>
		<ptr target="https://doi.org/10.1145/3479610" />
	</analytic>
	<monogr>
		<title level="j">Proc. ACM Hum.-Comput. Interact</title>
		<imprint>
			<biblScope unit="volume">5</biblScope>
			<date type="published" when="2021">2021</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b5">
	<analytic>
		<title level="a" type="main">Effect of transparency and trust on acceptance of automatic online comment moderation systems</title>
		<author>
			<persName><forename type="first">J</forename><surname>Brunk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Mattern</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">M</forename><surname>Riehle</surname></persName>
		</author>
		<idno type="DOI">10.1109/CBI.2019.00056</idno>
	</analytic>
	<monogr>
		<title level="m">IEEE 21st Conference on Business Informatics (CBI)</title>
				<imprint>
			<date type="published" when="2019">2019</date>
			<biblScope unit="volume">01</biblScope>
			<biblScope unit="page" from="429" to="435" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b6">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Calabrese</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Neves</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">W</forename><surname>Bos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Ross</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Lapata</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Barbieri</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2406.04106</idno>
		<title level="m">Explainability and hate speech: Structured explanations make social media moderators faster</title>
				<imprint>
			<date type="published" when="2024">2024</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b7">
	<analytic>
		<title level="a" type="main">Hatexplain: A benchmark dataset for explainable hate speech detection</title>
		<author>
			<persName><forename type="first">B</forename><surname>Mathew</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Saha</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><forename type="middle">M</forename><surname>Yimam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Biemann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mukherjee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI conference on artificial intelligence</title>
				<meeting>the AAAI conference on artificial intelligence</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="14867" to="14875" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b8">
	<analytic>
		<title level="a" type="main">Latent hatred: A benchmark for understanding implicit hate speech</title>
		<author>
			<persName><forename type="first">M</forename><surname>Elsherief</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Ziems</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Muchlinski</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Anupindi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Seybolt</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">De</forename><surname>Choudhury</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing</title>
				<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing</meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="345" to="363" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b9">
	<analytic>
		<title level="a" type="main">Social bias frames: Reasoning about social and power implications of language</title>
		<author>
			<persName><forename type="first">M</forename><surname>Sap</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gabriel</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Qin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Jurafsky</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><forename type="middle">A</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Choi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">ACL</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b10">
	<analytic>
		<title level="a" type="main">HODI at EVALITA 2023: Overview of the first shared task on homotransphobia detection in Italian</title>
		<author>
			<persName><forename type="first">D</forename><surname>Nozza</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">T</forename><surname>Cignarella</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Damo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Caselli</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Patti</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">8th Evaluation Campaign of Natural Language Processing and Speech Tools for Italian. Final Workshop, EVALITA 2023</title>
		<title level="s">CEUR Workshop Proceedings</title>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note>CEUR-WS. org</note>
</biblStruct>

<biblStruct xml:id="b11">
	<analytic>
		<title level="a" type="main">Semeval-2023 task 10: Explainable detection of online sexism</title>
		<author>
			<persName><forename type="first">H</forename><surname>Kirk</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Yin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Vidgen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Röttger</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)</title>
				<meeting>the 17th International Workshop on Semantic Evaluation (SemEval-2023)</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="2193" to="2210" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b12">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><forename type="middle">W</forename><surname>Chung</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Hou</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Longpre</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Zoph</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Tay</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Fedus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Dehghani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Brahma</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2210.11416</idno>
		<title level="m">Scaling instruction-finetuned language models</title>
				<imprint>
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b13">
	<analytic>
		<title level="a" type="main">Language models are few-shot learners</title>
		<author>
			<persName><forename type="first">T</forename><surname>Brown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Mann</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ryder</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Subbiah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><forename type="middle">D</forename><surname>Kaplan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Dhariwal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Neelakantan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Shyam</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Sastry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Askell</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">33</biblScope>
			<biblScope unit="page" from="1877" to="1901" />
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b14">
	<monogr>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Ladhak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Durmus</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Mckeown</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Hashimoto</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2301.13848</idno>
		<title level="m">Benchmarking large language models for news summarization</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b15">
	<monogr>
		<author>
			<persName><forename type="first">C</forename><surname>Ziems</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Held</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Shaikh</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Yang</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2305.03514</idno>
		<title level="m">Can large language models transform computational social science?</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b16">
	<analytic>
		<title level="a" type="main">Probing LLMs for hate speech detection: strengths and vulnerabilities</title>
		<author>
			<persName><forename type="first">S</forename><surname>Roy</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Harshvardhan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Mukherjee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Saha</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-emnlp.407</idno>
		<ptr target="https://aclanthology.org/2023.findings-emnlp.407" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="6116" to="6128" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b17">
	<analytic>
		<title level="a" type="main">HARE: Explainable hate speech detection with step-by-step reasoning</title>
		<author>
			<persName><forename type="first">Y</forename><surname>Yang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kim</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Ho</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Thorne</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S.-Y</forename><surname>Yun</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.findings-emnlp.365</idno>
		<ptr target="https://aclanthology.org/2023.findings-emnlp.365" />
	</analytic>
	<monogr>
		<title level="m">Findings of the Association for Computational Linguistics: EMNLP 2023, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">H</forename><surname>Bouamor</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Pino</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">K</forename><surname>Bali</surname></persName>
		</editor>
		<meeting><address><addrLine>Singapore</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="5490" to="5505" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b18">
	<analytic>
		<title level="a" type="main">Chain of explanation: New prompting method to generate quality natural language explanation for implicit hate speech</title>
		<author>
			<persName><forename type="first">F</forename><surname>Huang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kwak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>An</surname></persName>
		</author>
		<idno type="DOI">10.1145/3543873.3587320</idno>
		<ptr target="https://doi.org/10.1145/3543873.3587320" />
	</analytic>
	<monogr>
		<title level="m">Companion Proceedings of the ACM Web Conference 2023, WWW &apos;23 Companion</title>
				<meeting><address><addrLine>New York, NY, USA</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computing Machinery</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="90" to="93" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b19">
	<analytic>
		<title level="a" type="main">Chain-of-thought prompting elicits reasoning in large language models</title>
		<author>
			<persName><forename type="first">J</forename><surname>Wei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Schuurmans</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Bosma</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Xia</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Chi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Zhou</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Advances in neural information processing systems</title>
		<imprint>
			<biblScope unit="volume">35</biblScope>
			<biblScope unit="page" from="24824" to="24837" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b20">
	<analytic>
		<title level="a" type="main">Bleu: a method for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">K</forename><surname>Papineni</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Roukos</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Ward</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W.-J</forename><surname>Zhu</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 40th annual meeting of the Association for Computational Linguistics</title>
				<meeting>the 40th annual meeting of the Association for Computational Linguistics</meeting>
		<imprint>
			<date type="published" when="2002">2002</date>
			<biblScope unit="page" from="311" to="318" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b21">
	<analytic>
		<title level="a" type="main">Evaluating gpt-3 generated explanations for hateful content moderation</title>
		<author>
			<persName><forename type="first">H</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Hee</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">R</forename><surname>Awal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">T W</forename><surname>Choo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">R</forename><forename type="middle">K.-W</forename><surname>Lee</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence</title>
				<meeting>the Thirty-Second International Joint Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2023">2023</date>
			<biblScope unit="page" from="6255" to="6263" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b22">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Bhardwaj</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Poria</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2308.09662</idno>
		<title level="m">Red-teaming large language models using chain of utterances for safety alignment</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b23">
	<monogr>
		<author>
			<persName><forename type="first">R</forename><surname>Taori</surname></persName>
		</author>
		<author>
			<persName><forename type="first">I</forename><surname>Gulrajani</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Dubois</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Li</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Guestrin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Liang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">B</forename><surname>Hashimoto</surname></persName>
		</author>
		<ptr target="https://github.com/tatsu-lab/stanford_alpaca" />
		<title level="m">Stanford alpaca: An instruction-following llama model</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b24">
	<analytic>
		<title level="a" type="main">Crosslingual generalization through multitask finetuning</title>
		<author>
			<persName><forename type="first">N</forename><surname>Muennighoff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Sutawika</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Roberts</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Biderman</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><forename type="middle">Le</forename><surname>Scao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">S</forename><surname>Bari</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Shen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><forename type="middle">X</forename><surname>Yong</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Schoelkopf</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Tang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><surname>Radev</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">F</forename><surname>Aji</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Almubarak</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Albanie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Alyafeai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Webson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Raff</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Raffel</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2023.acl-long.891</idno>
		<ptr target="https://aclanthology.org/2023.acl-long.891" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)</title>
				<editor>
			<persName><forename type="first">A</forename><surname>Rogers</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">J</forename><surname>Boyd-Graber</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">N</forename><surname>Okazaki</surname></persName>
		</editor>
		<meeting>the 61st Annual Meeting of the Association for Computational Linguistics<address><addrLine>Toronto, Canada</addrLine></address></meeting>
		<imprint>
			<publisher>Association for Computational Linguistics</publisher>
			<date type="published" when="2023">2023</date>
			<biblScope unit="volume">1</biblScope>
			<biblScope unit="page" from="15991" to="16111" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b25">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">L</forename><surname>Martin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stone</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Albert</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Almahairi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Babaei</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Bashlykov</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Batra</surname></persName>
		</author>
		<author>
			<persName><forename type="first">P</forename><surname>Bhargava</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Bhosale</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2307.09288</idno>
		<title level="m">Llama 2: Open foundation and fine-tuned chat models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b26">
	<monogr>
		<author>
			<persName><forename type="first">H</forename><surname>Touvron</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lavril</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Izacard</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Martinet</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M.-A</forename><surname>Lachaux</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Lacroix</surname></persName>
		</author>
		<author>
			<persName><forename type="first">B</forename><surname>Rozière</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Goyal</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Hambro</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Azhar</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2302.13971</idno>
		<title level="m">Llama: Open and efficient foundation language models</title>
				<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b27">
	<analytic>
		<title level="a" type="main">Wikidata: a free collaborative knowledgebase</title>
		<author>
			<persName><forename type="first">D</forename><surname>Vrandečić</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krötzsch</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Communications of the ACM</title>
		<imprint>
			<biblScope unit="volume">57</biblScope>
			<biblScope unit="page" from="78" to="85" />
			<date type="published" when="2014">2014</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b28">
	<monogr>
		<title level="m" type="main">A group-specific approach to NLP for hate speech detection</title>
		<author>
			<persName><forename type="first">K</forename><surname>Halevy</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2304.11223</idno>
		<imprint>
			<date type="published" when="2023">2023</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b29">
	<analytic>
		<title level="a" type="main">ConceptNet 5.5: An open multilingual graph of general knowledge</title>
		<author>
			<persName><forename type="first">R</forename><surname>Speer</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Chin</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Havasi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">Proceedings of the AAAI Conference on Artificial Intelligence</title>
				<meeting>the AAAI Conference on Artificial Intelligence</meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="volume">31</biblScope>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b30">
	<analytic>
		<title level="a" type="main">Perturbation CheckLists for evaluating NLG evaluation metrics</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Sai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Dixit</surname></persName>
		</author>
		<author>
			<persName><forename type="first">D</forename><forename type="middle">Y</forename><surname>Sheth</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Mohan</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Khapra</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/2021.emnlp-main.575</idno>
		<ptr target="https://aclanthology.org/2021.emnlp-main.575" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M.-F</forename><surname>Moens</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">X</forename><surname>Huang</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">L</forename><surname>Specia</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><forename type="middle">W.-T</forename><surname>Yih</surname></persName>
		</editor>
		<meeting>the 2021 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Online and Punta Cana, Dominican Republic</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2021">2021</date>
			<biblScope unit="page" from="7219" to="7234" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b31">
	<analytic>
		<title level="a" type="main">A structured review of the validity of BLEU</title>
		<author>
			<persName><forename type="first">E</forename><surname>Reiter</surname></persName>
		</author>
		<idno type="DOI">10.1162/coli_a_00322</idno>
		<ptr target="https://aclanthology.org/J18-3002" />
	</analytic>
	<monogr>
		<title level="j">Computational Linguistics</title>
		<imprint>
			<biblScope unit="volume">44</biblScope>
			<biblScope unit="page" from="393" to="401" />
			<date type="published" when="2018">2018</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b32">
	<analytic>
		<title level="a" type="main">Why we need new evaluation metrics for NLG</title>
		<author>
			<persName><forename type="first">J</forename><surname>Novikova</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Dušek</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Curry</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Rieser</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/D17-1238</idno>
		<ptr target="https://aclanthology.org/D17-1238" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics</title>
				<editor>
			<persName><forename type="first">M</forename><surname>Palmer</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">R</forename><surname>Hwa</surname></persName>
		</editor>
		<editor>
			<persName><forename type="first">S</forename><surname>Riedel</surname></persName>
		</editor>
		<meeting>the 2017 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="2241" to="2252" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b33">
	<analytic>
		<title level="a" type="main">A survey of evaluation metrics used for NLG systems</title>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">B</forename><surname>Sai</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><forename type="middle">K</forename><surname>Mohankumar</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">M</forename><surname>Khapra</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">ACM Computing Surveys (CSUR)</title>
		<imprint>
			<biblScope unit="volume">55</biblScope>
			<biblScope unit="page" from="1" to="39" />
			<date type="published" when="2022">2022</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b34">
	<monogr>
		<author>
			<persName><forename type="first">A</forename><surname>Celikyilmaz</surname></persName>
		</author>
		<author>
			<persName><forename type="first">E</forename><surname>Clark</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Gao</surname></persName>
		</author>
		<idno type="arXiv">arXiv:2006.14799</idno>
		<title level="m">Evaluation of text generation: A survey</title>
				<imprint>
			<date type="published" when="2020">2020</date>
		</imprint>
	</monogr>
	<note type="report_type">arXiv preprint</note>
</biblStruct>

<biblStruct xml:id="b35">
	<analytic>
		<title level="a" type="main">BERTScore: Evaluating text generation with BERT</title>
		<author>
			<persName><forename type="first">T</forename><surname>Zhang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">V</forename><surname>Kishore</surname></persName>
		</author>
		<author>
			<persName><forename type="first">F</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><forename type="middle">Q</forename><surname>Weinberger</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Artzi</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="m">International Conference on Learning Representations</title>
				<imprint>
			<date type="published" when="2019">2019</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b36">
	<analytic>
		<title level="a" type="main">The METEOR metric for automatic evaluation of machine translation</title>
		<author>
			<persName><forename type="first">A</forename><surname>Lavie</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><forename type="middle">J</forename><surname>Denkowski</surname></persName>
		</author>
	</analytic>
	<monogr>
		<title level="j">Machine Translation</title>
		<imprint>
			<biblScope unit="volume">23</biblScope>
			<biblScope unit="page" from="105" to="115" />
			<date type="published" when="2009">2009</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b37">
	<monogr>
		<author>
			<persName><forename type="first">Y</forename><surname>Wu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Schuster</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Z</forename><surname>Chen</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><forename type="middle">V</forename><surname>Le</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Norouzi</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Krikun</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Cao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Q</forename><surname>Gao</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Macherey</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Klingner</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Shah</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Johnson</surname></persName>
		</author>
		<author>
			<persName><forename type="first">X</forename><surname>Liu</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Łukasz</forename><surname>Kaiser</surname></persName>
		</author>
		<author>
			<persName><forename type="first">S</forename><surname>Gouws</surname></persName>
		</author>
		<author>
			<persName><forename type="first">Y</forename><surname>Kato</surname></persName>
		</author>
		<author>
			<persName><forename type="first">T</forename><surname>Kudo</surname></persName>
		</author>
		<author>
			<persName><forename type="first">H</forename><surname>Kazawa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">K</forename><surname>Stevens</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Kurian</surname></persName>
		</author>
		<author>
			<persName><forename type="first">N</forename><surname>Patil</surname></persName>
		</author>
		<author>
			<persName><forename type="first">W</forename><surname>Wang</surname></persName>
		</author>
		<author>
			<persName><forename type="first">C</forename><surname>Young</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Smith</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Riesa</surname></persName>
		</author>
		<author>
			<persName><forename type="first">A</forename><surname>Rudnick</surname></persName>
		</author>
		<author>
			<persName><forename type="first">O</forename><surname>Vinyals</surname></persName>
		</author>
		<author>
			<persName><forename type="first">G</forename><surname>Corrado</surname></persName>
		</author>
		<author>
			<persName><forename type="first">M</forename><surname>Hughes</surname></persName>
		</author>
		<author>
			<persName><forename type="first">J</forename><surname>Dean</surname></persName>
		</author>
		<idno type="arXiv">arXiv:1609.08144</idno>
		<title level="m">Google&apos;s neural machine translation system: Bridging the gap between human and machine translation</title>
				<imprint>
			<date type="published" when="2016">2016</date>
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b38">
	<analytic>
		<title level="a" type="main">ROUGE: A package for automatic evaluation of summaries</title>
		<author>
			<persName><forename type="first">C.-Y</forename><surname>Lin</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W04-1013" />
	</analytic>
	<monogr>
		<title level="m">Text Summarization Branches Out, Association for Computational Linguistics</title>
				<meeting><address><addrLine>Barcelona, Spain</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2004">2004</date>
			<biblScope unit="page" from="74" to="81" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b39">
	<analytic>
		<title level="a" type="main">chrF: character n-gram F-score for automatic MT evaluation</title>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W15-3049</idno>
		<ptr target="https://aclanthology.org/W15-3049" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics</title>
				<meeting>the Tenth Workshop on Statistical Machine Translation, Association for Computational Linguistics<address><addrLine>Lisbon, Portugal</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2015">2015</date>
			<biblScope unit="page" from="392" to="395" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b40">
	<analytic>
		<title level="a" type="main">chrF++: words helping character ngrams</title>
		<author>
			<persName><forename type="first">M</forename><surname>Popović</surname></persName>
		</author>
		<idno type="DOI">10.18653/v1/W17-4770</idno>
		<ptr target="https://aclanthology.org/W17-4770" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Second Conference on Machine Translation, Association for Computational Linguistics</title>
				<meeting>the Second Conference on Machine Translation, Association for Computational Linguistics<address><addrLine>Copenhagen, Denmark</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2017">2017</date>
			<biblScope unit="page" from="612" to="618" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b41">
	<analytic>
		<title level="a" type="main">A call for clarity in reporting BLEU scores</title>
		<author>
			<persName><forename type="first">M</forename><surname>Post</surname></persName>
		</author>
		<ptr target="https://www.aclweb.org/anthology/W18-6319" />
	</analytic>
	<monogr>
		<title level="m">Proceedings of the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics</title>
				<meeting>the Third Conference on Machine Translation: Research Papers, Association for Computational Linguistics<address><addrLine>Belgium, Brussels</addrLine></address></meeting>
		<imprint>
			<date type="published" when="2018">2018</date>
			<biblScope unit="page" from="186" to="191" />
		</imprint>
	</monogr>
</biblStruct>

<biblStruct xml:id="b42">
	<monogr>
		<author>
			<persName><forename type="first">K</forename><surname>Krippendorff</surname></persName>
		</author>
		<title level="m">Computing Krippendorff&apos;s alpha-reliability</title>
				<imprint>
			<date type="published" when="2011">2011</date>
		</imprint>
	</monogr>
</biblStruct>

				</listBibl>
			</div>
		</back>
	</text>
</TEI>
