How Biased Is Your NLG Evaluation?

Pavlos Vougiouklis1†, Eddy Maddalena2†, Jonathon Hare1, and Elena Simperl2

School of Electronics and Computer Science, University of Southampton, Southampton, United Kingdom
1 {pv1e13, jsh2}@ecs.soton.ac.uk
2 {e.maddalena, e.simperl}@soton.ac.uk

† The authors contributed equally to this work.

Abstract. Human assessments by either experts or crowdworkers are used extensively for the evaluation of systems employed on a variety of text-generation tasks. In this paper, we focus on the human evaluation of textual summaries generated from knowledge base triple-facts. More specifically, we investigate possible similarities between the evaluations performed by experts and by crowdworkers. We generate a set of summaries from DBpedia triples using a state-of-the-art neural network architecture. These summaries are evaluated against a set of criteria by both experts and crowdworkers. Our results highlight significant differences between the scores provided by the two groups.

Keywords: Natural Language Generation · Human Evaluation · Crowdsourcing

1 Introduction

In the last decade, crowdsourcing has gained increasing interest, since it offers the means to reach large numbers of online contributors who are capable of completing large amounts of short human intelligence tasks in a small amount of time. In particular, it has served evaluation purposes in different areas of computer science, such as information retrieval [1], machine learning [6], and Natural Language Processing [8]. Human judgements are used for the evaluation of many systems employed on a variety of text-generation tasks, ranging from Machine Translation [2] and conversational agents [12,13] to the generation of summaries [5,3,14] and questions [9,4] in natural language over knowledge graphs. Depending on the task and the evaluation criteria, these judgements are collected either by a small group of "experts" or, at a larger scale, by crowdworkers recruited through a crowdsourcing platform. Especially in the case of Natural Language Generation (NLG) over knowledge graphs, human evaluation is crucial. This is attributed to the inadequacy of automatic text similarity metrics, such as BLEU [10] or ROUGE [7], to objectively evaluate the generated text [11].

In this paper, we focus on the human evaluation of textual summaries generated from knowledge base triple-facts [3,14]. More specifically, we wish to investigate whether there is any similarity between the way that experts and crowdworkers perform on the same evaluation tasks. We compile a list of three criteria that are usually employed for the human evaluation of automatically generated texts [5,14]: (i) fluency, (ii) coverage, and (iii) contradictions. We use the neural network approach recently proposed by Vougiouklis et al. [14] in order to generate textual summaries from DBpedia triples. The summaries are evaluated against the selected criteria by both experts and crowdworkers using the same task interface. Our experiments have shown that there are significant differences between the scores provided by the experts and the crowdworkers. Our future work will focus on the methods with which crowdworkers should be trained in order to perform more accurately on similar tasks.

2 Experimental Design

We run a crowdsourcing task in which we evaluate 20 summaries generated with the Triples2GRU system proposed by Vougiouklis et al. [14]. We regard each summary as a concise representation in natural language of an input set of triple-facts. Each summary is generated by Triples2GRU given a set of 8 to 18 triples and is evaluated by 10 workers; the pre-trained version of Triples2GRU that we used (https://github.com/pvougiou/Neural-Wikipedian) accepts up to 22 triples as input. Before starting the task, the workers are presented with general instructions. They are also informed about the ethics approval that we received for carrying out this experiment. The task consists of three phases through which workers were required to evaluate a given summary: (i) text fluency, rated with an integer score between 1 and 6; (ii) information coverage, by classifying each triple-fact from a given list as "Present" or "Absent"; and (iii) contradictions, by classifying each of the aforementioned facts as "Direct Contradiction" or "Not a Contradiction". At the beginning of each phase, the workers are presented with definitions, suggestions, examples and counter-examples. Each worker was rewarded with $0.20. After the experiment was completed, the same 20 summaries were also evaluated under the same setup by two experts.
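To make the structure of the collected judgements concrete, the following sketch shows one way a single worker's (or expert's) assessment of a summary could be recorded. The Judgement class, its field names and the example values are illustrative assumptions of ours and are not part of the original task implementation.

```python
from dataclasses import dataclass, field
from typing import Dict

# Label sets used in the second and third phase of the task.
COVERAGE_LABELS = {"Present", "Absent"}
CONTRADICTION_LABELS = {"Direct Contradiction", "Not a Contradiction"}


@dataclass
class Judgement:
    """One worker's (or expert's) assessment of a single summary."""
    summary_id: int
    worker_id: str
    fluency: int  # phase (i): integer score between 1 and 6
    # phase (ii): triple-fact -> "Present" / "Absent"
    coverage: Dict[str, str] = field(default_factory=dict)
    # phase (iii): triple-fact -> "Direct Contradiction" / "Not a Contradiction"
    contradictions: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        assert 1 <= self.fluency <= 6
        assert all(label in COVERAGE_LABELS for label in self.coverage.values())
        assert all(label in CONTRADICTION_LABELS for label in self.contradictions.values())


# Hypothetical example: worker "w_07" judging summary 3 on a single triple-fact.
example = Judgement(
    summary_id=3,
    worker_id="w_07",
    fluency=5,
    coverage={"(dbr:John_Doe, dbo:birthPlace, dbr:London)": "Present"},
    contradictions={"(dbr:John_Doe, dbo:birthPlace, dbr:London)": "Not a Contradiction"},
)
```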
3 Results

Fluency. For each summary, (i) we computed the average of the fluency scores assigned by the 10 workers. Then, (ii) we computed the average of all the values obtained in (i), resulting in an overall average of 4.8 out of 6. The average fluency with which the experts evaluated the 20 summaries was 5.28. An ANOVA test computed on the two fluency score series produced p < 0.05. Consequently, we can claim that, compared to the experts, crowdworkers tend to systematically underestimate the summaries' fluency by 0.5 out of 6.
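As a rough illustration of this aggregation and significance test, the sketch below assumes the worker scores are available as a 20 x 10 array (20 summaries, 10 workers each) and the expert scores as one value per summary; the arrays are random placeholders rather than the actual study data.

```python
import numpy as np
from scipy import stats

# worker_fluency[i, j]: fluency score (1-6) given by worker j to summary i.
# expert_fluency[i]:    the experts' fluency score for summary i.
# Random placeholders below stand in for the real judgements.
rng = np.random.default_rng(0)
worker_fluency = rng.integers(1, 7, size=(20, 10))
expert_fluency = rng.uniform(4, 6, size=20)

# (i) average the 10 worker scores per summary, (ii) average over all summaries.
worker_per_summary = worker_fluency.mean(axis=1)
print("crowd average:", worker_per_summary.mean())
print("expert average:", expert_fluency.mean())

# One-way ANOVA on the two per-summary fluency score series.
f_stat, p_value = stats.f_oneway(worker_per_summary, expert_fluency)
print(f"ANOVA: F = {f_stat:.2f}, p = {p_value:.4f}")
```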
Coverage. Workers evaluated the coverage of each summary with respect to the set of triple-facts that generated it. Each summary is aligned with 8 to 18 facts. The assessments were made by choosing between two labels: (i) "Present" for facts that are either implicitly or explicitly mentioned in the summary, and (ii) "Absent" for the rest. We compute the percentage of "Present" facts for each summary. Then, similarly to fluency, we first compute the average coverage of each summary across the workers, and then the average across all the summaries. The average coverage over all 20 summaries was 26.85%. In our second experiment, the two experts repeated the same evaluation together, resulting in an average of 39.71% of facts covered by the summaries. As a result, workers tend to undercount the presence of facts in the generated summaries (confirmed by an ANOVA test, p < 0.05). Finally, a significant positive correlation (Pearson's r = 0.64) indicates that workers nevertheless evaluate coverage in a manner that is consistent with the experts.

Fig. 1. Task interface showing the page that both the experts and crowdworkers used to identify facts whose information is contradicted in the summary.

Contradictions. Workers were required to evaluate possible contradictions between the information in a given summary and the respective facts that generated it. Workers were required to mark facts that contradict the summary as "Direct Contradiction", and the rest as "Not a Contradiction". For each summary, we compute the percentage of facts that are labelled as contradictions by each single worker. Similarly to coverage, (i) for each summary, we computed the average of the workers' contradiction percentages, and (ii) we averaged the contradiction scores across all the summaries. In a preliminary version of our experiments, each fact had to be marked as either "Contradiction" or "Not a Contradiction". However, this proved inadequate, since workers were marking facts that were not covered in the summary as contradicting, resulting in an average of ∼50% of facts whose information is contradicted in the summaries. In order to minimise this effect, besides changing the available labels for each triple-fact, we explicitly noted in the contradiction instructions (shown before the third phase of the task) that contradictions should be rare and that we expected many summaries without any. As shown in Fig. 1, we advise workers to identify as contradictions only "Direct Contradictions", whose information is explicitly negated in the corresponding summary. Our final result of 30% represents the average percentage of contradicting facts per summary. The same evaluation was performed by the two experts, for whom the average percentage of triple-facts contradicted in the summaries was 0.7%. Consequently, workers tend (ANOVA test, p < 0.05) to significantly overestimate the presence of facts that are contradicted in the generated summaries.
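The per-summary percentages and the worker-expert comparison for coverage (and, analogously, for contradictions) could be computed along the following lines; the label dictionaries are hypothetical placeholders rather than the collected data, and the helper function is our own illustration.

```python
import numpy as np
from scipy import stats


def present_rate(labels: dict) -> float:
    """Percentage of triple-facts labelled 'Present' in one assessment."""
    values = list(labels.values())
    return 100.0 * sum(v == "Present" for v in values) / len(values)


# coverage_by_workers[i][j]: triple-fact -> label chosen by worker j for summary i.
# coverage_by_experts[i]:    the experts' joint assessment of summary i.
# Random placeholders below stand in for the real judgements.
rng = np.random.default_rng(1)
coverage_by_workers = [
    [{f"fact_{k}": rng.choice(["Present", "Absent"]) for k in range(10)} for _ in range(10)]
    for _ in range(20)
]
coverage_by_experts = [
    {f"fact_{k}": rng.choice(["Present", "Absent"]) for k in range(10)} for _ in range(20)
]

# Average the workers' coverage per summary, then compare against the experts.
worker_coverage = np.array(
    [np.mean([present_rate(w) for w in workers]) for workers in coverage_by_workers]
)
expert_coverage = np.array([present_rate(e) for e in coverage_by_experts])

print("crowd average:", worker_coverage.mean(), "expert average:", expert_coverage.mean())
print("ANOVA:", stats.f_oneway(worker_coverage, expert_coverage))
print("Pearson:", stats.pearsonr(worker_coverage, expert_coverage))
```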
4 Conclusion

In this paper, we presented preliminary results of work aimed at exploring the use of crowdsourcing for the evaluation of NLG systems. In particular, we focused on the evaluation of textual summaries that are generated from triple-facts. We compared the results of two studies, one performed by experts and one by crowdworkers. The evaluations were conducted in three phases, covering: (i) the fluency of the summary, (ii) its coverage, and (iii) its contradictions; the latter two are assessed with respect to the given triple-facts. Our preliminary analysis shows that crowdworkers tend to underestimate the fluency of the summaries by 0.5 out of 6. While coverage is judged consistently by both experts and crowdworkers, it is significantly underestimated by the latter. Lastly, despite the fact that we emphasised the low number of expected contradicting facts, workers strongly overestimated their presence. A natural extension of this work is to identify the types of facts (i.e. predicates) that negatively influence the workers' judgement. Further studies will focus on minimising this bias by both training workers on how to identify only direct contradictions and increasing the quality control of the experiment.

Acknowledgements

This research is partially supported by the Answering Questions using Web Data (WDAqua) and QROWD projects, both of which are part of the Horizon 2020 programme, under grant agreement Nos. 642795 and 723088, respectively.

References

1. Alonso, O., Mizzaro, S.: Using crowdsourcing for TREC relevance assessment. Information Processing & Management 48(6), 1053–1066 (2012)
2. Bojar, O., Chatterjee, R., Federmann, C., Graham, Y., Haddow, B., Huang, S., Huck, M., Koehn, P., Liu, Q., Logacheva, V., Monz, C., Negri, M., Post, M., Rubino, R., Specia, L., Turchi, M.: Findings of the 2017 Conference on Machine Translation (WMT17). In: Proceedings of the Second Conference on Machine Translation, Volume 2: Shared Task Papers. pp. 169–214. Association for Computational Linguistics, Copenhagen, Denmark (September 2017), http://www.aclweb.org/anthology/W17-4717
3. Chisholm, A., Radford, W., Hachey, B.: Learning to generate one-sentence biographies from Wikidata. In: Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers. pp. 633–642. Association for Computational Linguistics, Valencia, Spain (April 2017), http://www.aclweb.org/anthology/E17-1060
4. Du, X., Shao, J., Cardie, C.: Learning to ask: Neural question generation for reading comprehension. In: Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 1342–1352. Association for Computational Linguistics, Vancouver, Canada (July 2017), http://aclweb.org/anthology/P17-1123
5. Ell, B., Harth, A.: A language-independent method for the extraction of RDF verbalization templates. In: Proceedings of the 8th International Natural Language Generation Conference (INLG). pp. 26–34. Association for Computational Linguistics, Philadelphia, Pennsylvania, U.S.A. (June 2014), http://www.aclweb.org/anthology/W14-4405
6. Lease, M.: On quality control and machine learning in crowdsourcing. Human Computation 11(11) (2011)
7. Lin, C.Y.: ROUGE: A package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-04 Workshop. pp. 74–81. Association for Computational Linguistics, Barcelona, Spain (July 2004)
8. Marujo, L., Gershman, A., Carbonell, J., Frederking, R., Neto, J.P.: Supervised topical key phrase extraction of news stories using crowdsourcing, light filtering and co-reference normalization. arXiv preprint arXiv:1306.4886 (2013)
9. Ngonga Ngomo, A.C., Bühmann, L., Unger, C., Lehmann, J., Gerber, D.: Sorry, I don't speak SPARQL: Translating SPARQL queries into natural language. In: Proceedings of the 22nd International Conference on World Wide Web. pp. 977–988. WWW '13, ACM, New York, NY, USA (2013). https://doi.org/10.1145/2488388.2488473
10. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: A method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. pp. 311–318. ACL '02, Association for Computational Linguistics, Stroudsburg, PA, USA (2002). https://doi.org/10.3115/1073083.1073135
11. Reiter, E.: Natural Language Generation, chap. 20, pp. 574–598. Wiley-Blackwell (2010). https://doi.org/10.1002/9781444324044.ch20
12. Ritter, A., Cherry, C., Dolan, W.B.: Data-driven response generation in social media. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. pp. 583–593. EMNLP '11, Association for Computational Linguistics, Stroudsburg, PA, USA (2011)
13. Sordoni, A., Galley, M., Auli, M., Brockett, C., Ji, Y., Mitchell, M., Nie, J.Y., Gao, J., Dolan, B.: A neural network approach to context-sensitive generation of conversational responses. In: Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. pp. 196–205. Association for Computational Linguistics, Denver, Colorado (May–June 2015)
14. Vougiouklis, P., Elsahar, H., Kaffee, L.A., Gravier, C., Laforest, F., Hare, J., Simperl, E.: Neural Wikipedian: Generating textual summaries from knowledge base triples. Journal of Web Semantics 52–53, 1–15 (2018). https://doi.org/10.1016/j.websem.2018.07.002