Evaluation of Interest and Coherence in Machine Generated Stories

Dominic Callan and Jennifer Foster
School of Computing, Dublin City University, Ireland
dominic.callan24@mail.dcu.ie, jennifer.foster@dcu.ie

Abstract. Evaluation of the narrative text generated by machines has traditionally been a challenge, particularly when attempting to evaluate subjective elements such as interest or believability. Recent improvements in narrative machine text generation have been largely driven by the emergence of transformer-based language models, trained on massive quantities of data. In this study, a corpus of stories is generated using the pre-trained GPT-Neo transformer model, with human-written prompts. The stories generated through this process are subsequently evaluated through both human evaluation and two automated metrics: BERTScore and BERT Next-Sentence-Prediction. The results show discrepancies between the human judgements and the automated metrics, suggesting that further work is required to train automated metrics to identify text that humans define as interesting.

Keywords: NLP · NLG · Machine-Generated Text · Transformer · Evaluation

Copyright 2021 for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0).

1 Introduction

Many challenges exist in the evaluation of machine generated text. With the improvement in text generation quality brought by modern transformer models [22], fluency has greatly increased; however, evaluating other elements of the text through automated metrics continues to prove difficult. Story generation differs significantly from other machine text generation challenges; rather than focusing on word overlap with an input or reference text as the metric for success, developing a believable narrative requires composing coherent natural language texts that describe a sensible sequence of events [23]. A 'good' or successfully generated story is a subjective idea; there are many criteria that should be considered, with the result that the evaluation of stories is a difficult problem that is relatively understudied [15].

In other text generation tasks, such as Machine Translation, 'gold-standard' reference texts exist as a benchmark for comparison. No equivalent baseline reference texts exist against which to compare machine-generated stories when evaluating subjective concepts such as creativity or interestingness; creative language cannot easily be defined in this way, as evaluating in this manner does not allow for the possibility of correct but novel generation [19]. Clark et al. [6] attempted a comparison of human-authored and machine-generated text across several domains, including story and news generation. They observe that human evaluators focused on form and structure rather than content when deciding whether a text was written by a machine or a human. This allowed for the conclusion that machines write fluently, but it did not address other narrative strengths of the text produced. As noted by Akoury et al. [1] and Roemmele et al. [20], a potentially infinite number of human-written stories could be produced that have attributes human evaluators may consider interesting.
To use a small sample of these as a reference for evaluating interest is therefore unreliable; benchmarking a model against a single human-generated reference could bias the model towards a certain style of writing or type of vocabulary, when the goal is to evaluate how interesting the story is.

This paper takes the dual approach of obtaining human judgements for a set of machine-generated stories, focusing on criteria around story interest, coupled with automated evaluations of the same texts. The automated metrics implemented focus on semantic similarity estimation rather than n-gram overlap. A goal of the study is to evaluate the success of such automated metrics on narrative text generated by a large-scale transformer model and to review these results in comparison to human evaluation of the same stories.

2 Background

2.1 Defining Story Interest

To evaluate the extent to which subjective attributes like interest, creativity or believability are applicable in machine generated text, certain criteria must be defined as metrics. In their in-depth study of human evaluations of automatically generated text, van der Lee et al. [14] reported that the most used metrics in these types of studies were fluency, naturalness, quality, and meaning-preservation, but ultimately they note that the criteria chosen should depend on the specific task. Gatt & Krahmer [10] produce a similar list, also including 'human-likeness' and 'genre compatibility'. Celikyilmaz et al. [4] discuss certain criteria and attributes that should be present, including the overall style, formality, or tone of the generated text. They add that there should be a 'typicality' to the generated text, meaning that it should be the type of text that we often see. Accuracy is of less concern for story-ending generation, as the output cannot usually be judged by fidelity to an identifiable, external input [13]. Grammaticality and fluency are not significant problems with modern transformer-based systems in comparison with older systems: the errors are instead often semantic or narrative [23]; humans can easily recognise non-sequitur sequences of events or conclusions, even when they are grammatical [19]. The difference between well-written, coherent text and interesting text is difficult to define. Generating text that simply describes a sequence of events is not, by itself, enough for the text to be considered interesting and coherent [16].

2.2 Human Evaluation of Machine Generated Text

NLG evaluation has long been identified as a difficult and complex area to measure accurately [12]. Human evaluation is still considered the benchmark for evaluating machine generated outputs [11, 13]. Chaganty et al. [5] note that the many problems with automated evaluation metrics motivate the need for human evaluation. A goal of natural language generation is to produce fluent outputs that can be read by laypeople [23]; it is therefore fitting that this same group of 'laypeople' review the output where possible. We lack a good way of encoding aspects of what constitutes human-quality output, so we must rely on human evaluation of our models [6]. However, undertaking human evaluation of machine generated text systems also involves many challenges. Human evaluation can be slow, expensive and often hard to scale up [15, 5]; Purdy et al. [19] observe that the cost of human evaluation presents a bottleneck to AI research on story generation.
Training evaluators on what to expect, and setting context and expectations, can help them to focus on specific features of the text; this can be necessary given the tendency of humans to focus on form and fluency ahead of content [6].

2.3 Crowdsourcing

When crowdsourcing human evaluations, Celikyilmaz et al. [4] highlight issues with using sources like Amazon Mechanical Turk, especially when the task is to evaluate longer text sequences. These workers are typically more used to evaluating microtasks and may be less experienced with evaluating stories. Strong, clear guidelines and instructions need to be issued to maximise the effectiveness of these evaluations. Lowe et al. [15], however, warn that there must be a balance, as too much instruction can introduce bias. Van der Lee et al. [13] caution that there is a risk of inadvertently recruiting bots or participants who want to get paid for as little work as possible.

2.4 Automatic Evaluation of Machine Generated Text

The many challenges around reliable and scalable human evaluation have driven the development of automated evaluation systems. However, this has traditionally proven difficult in NLG; text generation can go wrong in different ways while still receiving the same scores on automated metrics [13]. Many automated metrics currently exist. BLEU [17] has traditionally been used in NLG systems to evaluate word overlap; however, it is not a suitable metric for measuring the success of narrative text generation. Chaganty et al. [5] note that while BLEU is cheap to run, it correlates poorly with human judgement. By rewarding word overlap, BLEU assigns a positive value to repetition, an element of machine text generation that is to be avoided in story generation. As a metric, BLEU breaks down when the space of allowable outputs is large, as in open-ended generation from prompts to stories [23]. Other metrics have emerged. BLEURT [21] is a BERT-based evaluation metric that is first fine-tuned on synthetically generated sentence pairs using automatic evaluation scores such as BLEU; it is then further fine-tuned on machine-generated texts and human-written references, using human evaluation scores and automatic metrics as labels. The Automatic Dialogue Evaluation Model (ADEM) proposed by Lowe et al. [15] is a model-based evaluation that is learned from human judgements. It is mainly used for evaluating dialogue generation and is shown to correlate well with human judgement. Hashimoto et al. [11] propose Human Unified with Statistical Evaluation (HUSE), focusing on open-ended text generation tasks such as story generation. This model combines statistical evaluation and human evaluation metrics in a single model and differs from ADEM in this way.

3 Methodology

3.1 Input Data

The dataset used is a set of prompts taken from the reddit.com 'writing prompts' dataset introduced by Fan et al. [8]. The themes of the prompts vary, although they are often centred around fantasy or sci-fi. The average prompt length is 147 characters or 27 words; the shortest is 8 words and the longest is 56.

3.2 Transformer Language Models

Transformer language models make use of an 'attention' function which identifies, for each word, how relevant the other words in the sequence are. The transformer architecture is used in the BERT system developed by Google [7], and in GPT-2 and GPT-3, developed by OpenAI [3].
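To make the attention mechanism concrete, the following is a minimal sketch (not taken from the paper) of scaled dot-product self-attention over a toy sequence; the function name and toy dimensions are illustrative assumptions.

```python
# Minimal sketch of scaled dot-product self-attention (illustrative, not from the paper).
# Each token's query is compared with every token's key; the resulting weights
# decide how much of each token's value contributes to the new representation.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d). Returns attended values of shape (seq_len, d)."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                        # token-to-token relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # softmax over the sequence
    return weights @ V                                   # weighted mix of value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                              # 4 tokens, 8-dimensional embeddings
print(scaled_dot_product_attention(x, x, x).shape)       # (4, 8)
```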
For this study, the GPT-style architecture is implemented for the text generation process, and BERT is used to underpin the automated evaluation of the machine-generated text. Although both are transformers, there are fundamental differences in how the two systems operate: BERT is trained to predict a masked token given the tokens on its left and right, and to predict whether two sequences follow on from each other. GPT is trained to predict the next token in a sequence, where every token can only attend to context to its left.

Licensing costs prevented the use of the GPT-3 model for this study. The GPT-Neo 2.7B parameter transformer model is used instead. Developed by EleutherAI, it is designed to be an open-source replication of OpenAI's GPT-3 architecture [2]. The GPT-Neo model is trained on 'The Pile', an 825GB diverse open-source English text corpus targeted at training large-scale language models [9]. The Pile is made up of 22 smaller datasets, including BookCorpus2, YouTube closed captions, Project Gutenberg, and English Wikipedia.

800 stories were generated for this study. Given that the focus of this analysis is on narrative-style text, prompts or stories of a non-narrative nature were excluded from the final corpus. From the remaining corpus of narrative-style stories, 100 prompt-story pairs were chosen at random for evaluation. The average story length is 77 words; the longest has 96 words and the shortest has 54. A cap of 400 characters was used, and the average character count is 381 characters. This cap was implemented both as a method of maintaining coherence and as a consideration to the survey participants who would be reviewing each story.
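As an illustration of this generation step, the following is a minimal sketch, not the authors' exact setup, of producing one capped-length story continuation with the publicly available GPT-Neo 2.7B checkpoint via the HuggingFace transformers library; the sampling parameters and token limit are assumptions, as the paper does not report them.

```python
# Hedged sketch: generating a story continuation for one writing prompt with GPT-Neo 2.7B.
from transformers import pipeline

# The 2.7B checkpoint needs a large amount of memory; "EleutherAI/gpt-neo-125M"
# can be substituted for a quick, lower-quality test.
generator = pipeline("text-generation", model="EleutherAI/gpt-neo-2.7B")

prompt = ("In the future, Disney purchases an entire planet and makes "
          "the whole thing into one giant amusement park.")

result = generator(
    prompt,
    do_sample=True,        # sample rather than greedy decoding
    max_new_tokens=120,    # rough proxy for the ~400-character story cap used here
    temperature=0.9,
    top_p=0.95,
)[0]["generated_text"]

story = result[len(prompt):].strip()[:400]   # enforce the 400-character cap
print(story)
```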
Automated Evaluation Metrics. Whilst it is clearly identified in the literature that good automatic evaluation metrics are still hard to come by [13, 18, 19], we chose the following two metrics for this study: BERTScore and BERT Next Sentence Prediction. Upon review of the available automated metrics, these were chosen since both focus on semantic similarity and thus have the potential to capture some notion of story coherence.

BERTScore is a language generation evaluation metric based on BERT [24]. It calculates a similarity score for two sentences as a sum of cosine similarities between the contextualised word embeddings produced by a pretrained BERT model for each word in each sentence. By assigning different embeddings to words depending on their surrounding context, BERTScore attempts to reward semantic relationships between an input and an output, a core element of successful story generation. Example 1 shows a prompt/story sentence pair from the data that achieves a high rescaled BERTScore of 0.287 and a low cumulative 1-gram BLEU score of 0.1226:

Example 1.
Prompt: You are born with the ability to stop time, but one day you see something else is moving when you have already stopped time.
Story sentence: Your brain takes over and tells you to move, but you can't.

Whilst it was developed for image captioning and machine translation tasks, BERTScore is designed to be task-agnostic; it is unclear how well it performs on open-ended tasks.

BERT Next Sentence Prediction. BERT is trained on two tasks, Masked LM and Next Sentence Prediction (NSP) [7]. NSP is the task of predicting the probability that a sentence logically succeeds the previous sentence and is designed to learn the relationships between sentences. For this study, this BERT-NSP model is implemented as the second automated metric to evaluate stories. Each sentence pair is tokenised, and the BERT model processes the sentences and outputs 0 to indicate that Sentence Two does follow Sentence One, and 1 when it believes it does not.

3.3 Implementation of Evaluation Metrics

BERTScore. The BERTScore metric tokenises the two selections of text that are to be compared and, using contextual embeddings, derives a semantic similarity metric by calculating cosine similarities between the embeddings. Zhang et al. [25] announced an optional improvement to BERTScore after the release of their original paper, to address the relatively small range observed between high and low scores: the cosine similarity score is rescaled through a linear transformation, which they note does not negatively impact correlation with human judgement. This rescaling is implemented in the BERTScore calculations in this paper. Two approaches were undertaken to obtain two BERTScore metrics for each story. In the first approach, the BERTScore is calculated between each sentence in the story and the prompt, and the resulting scores are averaged; this score is identified as BERTScore-1 in the results. The second approach compares each sentence to the previous sentence, rather than comparing each sentence to the prompt; these scores were again aggregated and are captured as BERTScore-2. By taking this approach, it can be observed both whether individual segments of the story are semantically linked back to the prompt and whether each segment is semantically linked to the previous segment.

BERT Next Sentence Prediction. Similarly to BERTScore, BERT-NSP evaluates the prompt/story pairs in two different ways. BERT-NSP-1 predicts whether each sentence in the story logically follows on from the prompt, whereas BERT-NSP-2 compares the prompt to the first sentence of the story, and then each subsequent sentence to the previous sentence.
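A minimal sketch of how these four scores could be computed for a single prompt/story pair is shown below, using the bert_score and transformers libraries. The example sentences (the second story sentence is invented for the sketch), the aggregation into per-story values, and the treatment of the prompt as the first element of the sentence chain are assumptions made for illustration rather than the authors' exact implementation.

```python
# Hedged sketch of BERTScore-1/2 and BERT-NSP-1/2 for one prompt/story pair.
import torch
from bert_score import score as bertscore
from transformers import BertForNextSentencePrediction, BertTokenizer

prompt = "You are born with the ability to stop time, but one day you see something else is moving."
story_sentences = [
    "Your brain takes over and tells you to move, but you can't.",
    "The figure keeps walking towards you.",   # invented second sentence for illustration
]

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
nsp_model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
nsp_model.eval()

def follows(first, second):
    """True if BERT-NSP predicts that `second` is a continuation of `first` (label 0)."""
    enc = tokenizer(first, second, return_tensors="pt")
    with torch.no_grad():
        logits = nsp_model(**enc).logits
    return logits.argmax(dim=-1).item() == 0

def mean_bertscore(cands, refs):
    """Average rescaled BERTScore F1 over candidate/reference pairs."""
    _, _, f1 = bertscore(cands, refs, lang="en", rescale_with_baseline=True)
    return f1.mean().item()

# BERTScore-1 / BERT-NSP-1: every story sentence scored against the prompt.
bs1 = mean_bertscore(story_sentences, [prompt] * len(story_sentences))
nsp1 = sum(follows(prompt, s) for s in story_sentences)

# BERTScore-2 / BERT-NSP-2: each sentence scored against the preceding one
# (treating the prompt as the sentence preceding the first story sentence).
chain = [prompt] + story_sentences
bs2 = mean_bertscore(chain[1:], chain[:-1])
nsp2 = sum(follows(a, b) for a, b in zip(chain[:-1], chain[1:]))

print(f"BERTScore-1={bs1:.3f}  BERTScore-2={bs2:.3f}  "
      f"NSP-1={nsp1}/{len(story_sentences)}  NSP-2={nsp2}/{len(chain) - 1}")
```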
Human Evaluation by Survey. Human evaluation was undertaken using anonymous surveying, where participants were first advised that the stories were written by machines. For each pair, the evaluators were shown the prompt and the subsequent story generated by the GPT-Neo model and asked to assess it on a Likert scale of 1 to 7 for the following four questions:

1. How related do you think the story is to the prompt?
2. How much sense does the story make to you?
3. How interesting is the PROMPT to you?
4. How interesting is the STORY to you? (Would you read more?)

There was also a further optional free-text question at the end of each survey, for evaluators to leave general comments or impressions. The 100 prompt/story pairs were split into five sets of 20 pairs to reduce the chances of evaluators tiring or growing bored and abandoning the survey. The wording of these questions is designed to ask in plain-English terms about the coherence and interestingness of the stories generated by the machines. It was important to record the perceived semantic connection between the prompt and story; an interesting story could be produced by the system, but if it did not relate to the prompt then the objective of the task has not been achieved. The question on the story making sense to the evaluator was a proxy for story coherence. This was introduced to observe whether a story needs to be coherent to be interesting to a reader, or conversely whether an incoherent story is likely to be deemed uninteresting. Separate to the interest level of the story, evaluators were asked if they found the prompt interesting, as their level of interest in the prompt may impact their interest in the resulting story. Two sample prompt/story pairs were included in the instructions of the survey, to provide context on the type of text that the evaluator would be reading and to set their expectations. Each prompt/story pair was reviewed by a minimum of 6 unique reviewers, although the majority were reviewed by 7 or more.

4 Results and Discussion

Table 1. Correlation between human judgements and automated metrics.

Metric  Q1    Q2    Q3    Q4    BS1   BS2   NSP1  NSP2
Q1      1.00  0.72  0.24  0.62  0.41  0.29  0.25  0.25
Q2      0.72  1.00  0.19  0.80  0.25  0.21  0.12  0.15
Q3      0.24  0.19  1.00  0.34  0.09  0.12  0.02  0.06
Q4      0.62  0.80  0.34  1.00  0.28  0.23  0.10  0.25
BS1     0.41  0.25  0.09  0.28  1.00  0.66  0.32  0.26
BS2     0.29  0.21  0.12  0.23  0.66  1.00  0.24  0.27
NSP1    0.25  0.12  0.02  0.10  0.32  0.24  1.00  0.61
NSP2    0.25  0.15  0.06  0.25  0.26  0.27  0.61  1.00

Table 1 shows a matrix illustrating the correlation between the different survey questions and the four automated metrics implemented.

4.1 Human Evaluation

Within the human survey results, the strongest correlation of 0.80 is between story coherence (Q2) and story interest (Q4), suggesting that evaluators were most interested in the stories that they found to be the most coherent. A strong relationship (0.72) is also observed between story coherence (Q2) and the story-prompt relationship (Q1), indicating that evaluators potentially factored in the connection between prompt and story when considering overall coherence; a story that was otherwise coherent would be deemed less so if it did not follow on logically from the prompt. Looking at the average scores received for each question, illustrated in Table 2, prompt interest received the highest average rating of 4.45 out of 7; the prompts were seen as more interesting than the stories that were generated. This suggests that, in general, the model was unable to produce stories of greater interest than the prompts. It should be noted that only a low positive correlation of 0.34 was reported between prompt interest and story interest, showing little connection between the two. There was no preference observed for either longer prompts or longer stories. The lowest average score for any question was for story interest (Q4), at 3.26/7.

Focusing on average ratings across the four questions, the highest rated prompt/story pair can be seen in Example 2. The story remains coherent and relevant to the prompt throughout, not deviating even at its latter stages: both coherence and prompt-relationship are shown to correlate with story interest. Introducing the term 'corporate overlords' puts a somewhat dystopian slant on the narrative that wasn't necessarily part of the prompt, showing progression of the story in a defined and novel direction, which the evaluators seemed to have enjoyed. The lowest rated prompt/story pair, shown in Example 3, displayed few of the above-mentioned positive characteristics: the story does not relate back to the prompt, and the context of the story contradicts the prompt semantically.

Table 2. Human Evaluation mean scores from survey results.

Question                        Average Score
Q1. Story/prompt relationship   3.74/7
Q2. Story Coherence             3.42/7
Q3. Prompt Interest             4.45/7
Q4. Story Interest              3.26/7
Example 2.
Prompt: In the future, Disney purchases an entire planet and makes the whole thing into one giant amusement park.
Story: The problem is that a large portion of the population hates the park, hates its corporate overlords, hates its theme park, and is starting to revolt. At the park itself, the people are divided. Some people are happy, some people are not. One thing that makes this story different from movies like the Matrix is that the people who live at the park have no control over the things that are happening there.

Example 3.
Prompt: Everyone has disappeared but you, you think that everyone is just holding surprise party for you.
Story: Then, the story starts. The film follows the adventures of jinx, a former pro wrestling manager who is now a private chef. His wife. jenna leigh, had just found out that he was dating a woman named dana bjorn Stresemann. A few months later, jinx and dana get back together after a break-up. Now, when jinx is going away for work, dana contacts the police with a missing person notice.

4.2 BERT Next Sentence Prediction

The BERT-NSP scores are relatively high; in most cases, both BERT-NSP-1, where each sentence is related back to the prompt, and BERT-NSP-2, where each sentence is compared to the previous sentence, predict that the second sentence in the pair logically follows the first. Within the 100 prompts/stories assessed, a total of 531 sentence pairs were reviewed for next sentence prediction. For BERT-NSP-1, 433 of the comparisons were deemed to be logical next sentences and only 98 were not. For BERT-NSP-2, an even higher number of sentences were predicted to follow on from the previous one: 497 of the 531. This is somewhat expected, given the similarities in training objectives (predicting the next token versus next sentence prediction) and training data between GPT-Neo and BERT.

4.3 BERTScore

For both BERTScore metrics, cosine similarity was calculated for each sentence pair and then averaged for an overall score for a given prompt/story. The BERTScore-1 results ranged from -0.187 to 0.365 with a mean of 0.074, and the BERTScore-2 results ranged from -0.168 to 0.439 with a mean of 0.138. Some of the highest BERTScore results were for stories that demonstrated a notable amount of repetition, e.g. Example 4, which received the highest BERTScore-1:

Example 4.
Prompt: A dozen small alien ships enter the solar system, they ignore us. A few years later other ships show up, destroy the first visitors and leave. Ten years later two fleets arrive.
Story: A decade later the aliens come again, this time with a fleet of ships, and destroy the visitors and leave. One thousand years later, a new alien ship arrives, a vessel similar to the first. One hundred years later the alien ships finally come again, this time with over 500 ships, destroy the 100 ships that came the previous year, then use the surviving alien vessels to create their base

The re-use of the phrase 'years later' helped to increase the cosine similarity F1 score. Despite the intention of focusing on rewarding semantic similarity, this shows that repetition is still rewarded when implementing this metric. This same prompt/story pair was the 15th highest rated by humans out of the 100 evaluated. The results in Table 1 show weak correlation between human judgement scores and automated metric scores.
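The paper does not state how the correlations in Table 1 were computed; a plausible reconstruction is sketched below, assuming Pearson correlation over per-story values, with a hypothetical results file and column names.

```python
# Hedged sketch: building a Table 1 style correlation matrix from per-story scores.
# "story_scores.csv" and its column names are hypothetical; Pearson correlation is
# an assumption, as the paper does not specify the correlation coefficient used.
import pandas as pd

scores = pd.read_csv("story_scores.csv")   # one row per prompt/story pair (100 rows)
columns = ["Q1", "Q2", "Q3", "Q4", "BS1", "BS2", "NSP1", "NSP2"]

corr = scores[columns].corr(method="pearson")   # 8 x 8 matrix, as in Table 1
print(corr.round(2))
```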
The highest correlation between an automated and a human metric is 0.41, between story-prompt relatedness (Q1) and BERTScore-1. Regarding story interest as defined by humans, there was a very low correlation of 0.28 with BERTScore-1 and 0.23 with BERTScore-2. There was almost no correlation with the BERT-NSP scores; this metric found in most cases that sentences logically followed each other, but it did not provide the more granular level of analysis that human surveying and BERTScore provided.

4.4 Discussion

The machine-generated stories tend to match the theme or genre of the prompts. If this theme is not of any interest to the evaluator, they may give the story a low interest score. However, a strong correlation between prompt interest and story interest was not observed. In general, the prompts were quite specific: they set a certain tone or introduced a theme that defined a direction that a story 'should' take. Whilst this was still an open-ended task, vaguer, less specific prompts may provide more leeway for the model to produce stories that humans would deem relevant. While some generated stories could be considered related to the prompt, the story may have built upon a secondary semantic element of the prompt rather than its predominant or primary theme or idea.

The last question in the survey invited participants to comment on the stories. The following is a summary of their comments:

– They found the themes somewhat unsettling.
– Some stories did not make sense.
– Some stories came across as poetic, but this may have been a coincidence or fluke.
– The stories came across like 'blurbs that would be seen on the back of a book cover'.
– The stories sometimes went in a direction that human stories would not, which generated interest.
– The stories fail to continue expanding on the most interesting part of the prompt.
– There are some funny generations; there is also a surreal aspect to some of them, and even some that are profound (e.g., "I gave you all you gave me").
– The style seemed different to human writing, although this wasn't necessarily bad.
– The stories written by the computer were sometimes more abstract (than the prompts).

The BERT-NSP results suggest that the sentences generated follow a logical order. BERTScore and BERT-NSP scoring is undertaken at a sentence level and aggregated for each story, whereas the human evaluators were asked to judge the story in its entirety. This is relevant, as BERTScore results may be lowered by one or two weak sentence-pair scores within an otherwise strong story. Regarding the comparison of human evaluation and automated metrics, the survey question that BERTScore-1 correlates with most closely, albeit with a low positive correlation of 0.41, is story-prompt relatedness, which aligns with what BERTScore-1 is trying to achieve: semantic relatedness between the prompt and each story sentence.

5 Conclusion

Whilst it is established that modern transformer models generate significantly more fluent text than their predecessors, evaluation of the narrative elements of their output continues to be a challenge. Many standard automated evaluation metrics for text generation reward repetition of the input; this is not a success metric in narrative text generation. Our survey results show a strong correlation between story coherence and story interest.
Given that the average interest scores were low, this suggests that the GPT-Neo model does not always output coherent stories. There is a fine balance to suspending disbelief in storytelling, and the machine-generated text in this study is shown to often lack this level of nuance. When the text has close to human fluency, this raises the human evaluator's expectations: the machine-generated stories must be as interesting and coherent as everyday human-written stories to satisfy human evaluators.

Future work in the area could involve identifying story prompts for a specific genre (crime, for example) and generating stories consistent with this genre. If evaluators with an interest in this genre were recruited, this may reduce the occurrence of low scores due simply to evaluators' lack of interest in the topic, regardless of the quality of the output. There is also scope, given a sufficiently high volume of human judgements, to train a new evaluation system, allowing for the development of an automated metric fit for evaluating narrative text.

From a narrative perspective, the stories generated by GPT-Neo leave us somewhat short in terms of consistently providing interest; their success is somewhat hit-and-miss. More immediate and consistent success for these systems may be achieved through generating non-narrative-style text, or by employing a hybrid machine-human approach.

References

1. Akoury, N., Wang, S., Whiting, J., Hood, S., Peng, N., Iyyer, M.: STORIUM: A Dataset and Evaluation Platform for Machine-in-the-Loop Story Generation. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 6470–6484. Association for Computational Linguistics, Online (Nov 2020). https://doi.org/10.18653/v1/2020.emnlp-main.525, https://aclanthology.org/2020.emnlp-main.525
2. Black, S., Gao, L., Wang, P., Leahy, C., Biderman, S.: GPT-Neo: Large scale autoregressive language modeling with mesh-tensorflow (2021), http://github.com/eleutherai/gpt-neo
3. Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T.J., Child, R., Ramesh, A., Ziegler, D.M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., Amodei, D.: Language models are few-shot learners. ArXiv abs/2005.14165 (2020)
4. Celikyilmaz, A., Clark, E., Gao, J.: Evaluation of text generation: A survey. arXiv preprint arXiv:2006.14799 (2020)
5. Chaganty, A.T., Mussman, S., Liang, P.: The price of debiasing automatic metrics in natural language evaluation. arXiv preprint arXiv:1807.02202 (2018)
6. Clark, E., August, T., Serrano, S., Haduong, N., Gururangan, S., Smith, N.A.: All that's 'human' is not gold: Evaluating human evaluation of generated text. arXiv preprint arXiv:2107.00061 (2021)
7. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
8. Fan, A., Lewis, M., Dauphin, Y.: Hierarchical neural story generation. arXiv preprint arXiv:1805.04833 (2018)
9. Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., et al.: The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027 (2020)
10. Gatt, A., Krahmer, E.: Survey of the state of the art in natural language generation: Core tasks, applications and evaluation. Journal of Artificial Intelligence Research 61, 65–170 (2018)
11. Hashimoto, T.B., Zhang, H., Liang, P.: Unifying human and statistical evaluation for natural language generation. arXiv preprint arXiv:1904.02792 (2019)
12. Howcroft, D.M., Belz, A., Clinciu, M.A., Gkatzia, D., Hasan, S.A., Mahamood, S., Mille, S., van Miltenburg, E., Santhanam, S., Rieser, V.: Twenty years of confusion in human evaluation: NLG needs evaluation sheets and standardised definitions. In: Proceedings of the 13th International Conference on Natural Language Generation. pp. 169–182 (2020)
13. van der Lee, C., Gatt, A., van Miltenburg, E., Krahmer, E.J.: Human evaluation of automatically generated text: Current trends and best practice guidelines. Computer Speech & Language 67, 101151 (2021)
14. van der Lee, C., Gatt, A., van Miltenburg, E., Wubben, S., Krahmer, E.J.: Best practices for the human evaluation of automatically generated text. In: INLG (2019)
15. Lowe, R., Noseworthy, M., Serban, I.V., Angelard-Gontier, N., Bengio, Y., Pineau, J.: Towards an automatic Turing test: Learning to evaluate dialogue responses. arXiv preprint arXiv:1708.07149 (2017)
16. McIntyre, N., Lapata, M.: Learning to tell tales: A data-driven approach to story generation. In: Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. pp. 217–225 (2009)
17. Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics. pp. 311–318 (2002)
18. Pillutla, K., Swayamdipta, S., Zellers, R., Thickstun, J., Welleck, S., Choi, Y., Harchaoui, Z.: MAUVE: Measuring the gap between neural text and human text using divergence frontiers. Advances in Neural Information Processing Systems 34 (2021)
19. Purdy, C., Wang, X., He, L., Riedl, M.: Predicting generated story quality with quantitative measures. In: Fourteenth Artificial Intelligence and Interactive Digital Entertainment Conference (2018)
20. Roemmele, M., Gordon, A.S., Swanson, R.: Evaluating story generation systems using automated linguistic analyses. In: SIGKDD 2017 Workshop on Machine Learning for Creativity. pp. 13–17 (2017)
21. Sellam, T., Das, D., Parikh, A.P.: BLEURT: Learning robust metrics for text generation. In: ACL (2020)
22. Vaswani, A., Shazeer, N.M., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, L., Polosukhin, I.: Attention is all you need. ArXiv abs/1706.03762 (2017)
23. Yao, L., Peng, N., Weischedel, R.M., Knight, K., Zhao, D., Yan, R.: Plan-and-write: Towards better automatic storytelling. ArXiv abs/1811.05701 (2019)
24. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: BERTScore: Evaluating text generation with BERT. ArXiv abs/1904.09675 (2020)
25. Zhang, T., Kishore, V., Wu, F., Weinberger, K.Q., Artzi, Y.: Rescaling BERTScore with baselines. https://github.com/Tiiiger/bert_score/blob/master/journal/rescale_baseline.md (2020)