The Tomato Festival: Towards using ChatGPT for Long-Form Discourse Generation of Plan-Based Narratives?

Maryam Dueifi¹, Markus Eger²,*
¹ Cal Poly Pomona, Department of Computer Science
² UC Santa Cruz, Department of Computational Media

Abstract

With Generative AI at the forefront of public conversation, the capability of AI systems to tell stories has seen a surge of interest. OpenAI's ChatGPT provides a user-friendly interface, as well as well-documented API access, and is used widely for generative purposes. In this paper we investigate how well it can actually produce narrative text. We present an approach to take a story plan produced by the Glaive narrative planner and turn it into a novella-length text. We then present a preliminary evaluation of the text output and discuss the challenges and limitations of having it actually be read by humans. Crucially, we show that the text is not comparable to human-written text in terms of grammatical complexity, which we posit to be one possible reason for it not being very enjoyable to read. As part of our work we also encountered several particular challenges that led to misspun tales, which we also discuss in detail.

AIIDE Workshop on Intelligent Narrative Technologies, November 18, 2024, University of Kentucky, Lexington, KY, USA
* Corresponding author.
mdueifi@cpp.edu (M. Dueifi); meger@ucsc.edu (M. Eger)
© 2024 Copyright for this paper by its authors. Use permitted under Creative Commons License Attribution 4.0 International (CC BY 4.0). CEUR Workshop Proceedings (ceur-ws.org), ISSN 1613-0073.

1. Introduction

Narrative generation has been a topic of interest for AI research for many decades. Meehan's TaleSpin [1] is often cited as the first "story generator", although other approaches have preceded it [2]. Nevertheless, TaleSpin, with its use of character goals and plans, has served as the inspiration for a wide variety of plan-based narrative generators. Often, these generators distinguish between the story/fabula part of a narrative, consisting of the events as they happened in the story world, and the discourse, i.e. the way that story is actually told, such as the text that is produced [3].

More recently, Large Language Models (LLMs) have seen a surge in popularity, including through OpenAI's ChatGPT [4]. These are neural network model architectures termed transformers [5], which can learn correlations between the occurrences of words in a text corpus. This can then be used to answer user queries, by letting the model predict the most likely continuation that follows the question text, producing a response text. However, LLMs can produce all kinds of text, including narrative, when prompted to do so.

While LLMs can produce narrative text, the transformer mechanism that they are based on has one significant limitation: When predicting the continuation of a text, the attention head mechanism employed by the model can only look at a limited context preceding the continuation. This context window thus limits how much information the LLM can even see. Essentially, if one tried to generate an entire novel using an LLM with a limited context window, the model would not be able to take the contents of the first few chapters into account when generating the end of the narrative, which may result in events that contradict earlier changes to the world state. Plan-based narratives, on the other hand, use an explicit (logic-based) world model to represent the state of the story world at each time step, and can ensure that only actions that are actually possible to occur are taken by the characters. However, generating the discourse for such a plan-based story often involves templates or other short story fragments, often resulting in repetitive or terse discourse output.

In this paper we present an approach that utilizes OpenAI's ChatGPT to generate the discourse for a story generated by a narrative planner. In particular, we focus on generating long-form narratives that, at present, reach the length of a novella (about 25,000 words), and follow the story as produced by the planner. Our contribution is three-fold: First, we present a novel approach to prompt an LLM using the planner output to expand the narrative to the desired length. Second, we have generated several narratives using our approach and show an evaluation of some of their qualities. Third, and arguably most importantly, during the course of our work we have discovered several limitations of this application of LLMs, and we will discuss them in detail.

2. Related Work

Our work builds on previous work in narrative generation, combining a logic-based story structure generated by a planner with text generated by an LLM. We will therefore discuss prior work in both of these areas.

2.1. Plan-Based Narrative Generation

When a story is viewed as a (partially ordered) sequence of actions taken by the characters, it has striking similarities to a plan in the AI planning sense: a sequence of actions applied to an initial state to achieve a goal condition [6]. Indeed, there is a large body of work that utilizes planners to generate such stories [7]. The main challenge is how to define the "goal" of the story: Often, this is described as a state the story world ought to be in at the conclusion of the story, as desired by an author. However, a centralized planner assigning actions to characters to reach this global goal might lead to the characters acting contrary to what common sense would dictate. In a bank heist story, an "efficient" plan might be that the bank teller just delivers the money to the robber's house, but it would likely not make for a compelling story. In TaleSpin, the individual characters have their own goals, preventing them from acting against their own interests [1], but this may not always lead to a story with an actual plot. Another approach, originating with a branch of research by Riedl and Young [8], instead uses a centralized planner that allows the author to define an overall goal for the story, while also ascribing intentions to each character, preventing them from taking actions that do not further their own character goals. Ware and Young [9] build on this work and also give characters the capability to form plans that they do not end up executing due to some conflict that arises. Other research has also investigated the use of landmarks to guide the generation process [10], incorporating character beliefs [11, 12], and failed actions [13]. Many of these planning-based approaches use the standardized Planning Domain Definition Language, PDDL [14], and although some use extensions to handle intentions and beliefs, these can be compiled away if needed [15, 16]. For our present work, we use the Glaive narrative planner, which implements intentions and beliefs using an extended PDDL variant [17], and which comes with a library of standard narrative planning problems, which we will discuss in more detail below.

With a story in hand, the next problem is how to convey this story to an audience, i.e. how to generate the discourse [3]. The discourse generation problem actually consists of several parts, including selecting which actions of the story to tell and which to omit, the ordering of the telling in order to convey the necessary information to the audience, and then determining the actual realization of the discourse in the form of text or other media. Research has focused significantly on the first parts, by planning which discourse actions convey the "right" information to the audience [18], how to model suspense [19], flashbacks and flashforwards [20, 21], or focalization [22]. The actual text generation is then typically handled through templates, as was e.g. the case for the CPOCL experiments, or with a simple text planner [23]. Another option is to manually translate the internal representation to text, as was done for TaleSpin [24]. In a more refined model proposed by Barot et al. [25], the authors draw a distinction between the discourse, which incorporates the decisions for what to tell, and the narration, which is more concerned with how to tell the narrative. Our work is best characterized as focusing on the narration, i.e. the surface text realization, based on full story plans. As we will discuss below, this is not a strong limitation for the data set we were working with, but a more sophisticated discourse model could be incorporated in the future. To generate the output text, we make use of Large Language Models, which we will briefly discuss next.

2.2. Large Language Models

Large Language Models (LLMs) are based on a neural network architecture called transformers [5], which add an attention head mechanism to recurrent neural networks, essentially allowing them to learn a probability distribution of words conditioned on the context these words occur in, with varying weights for the context. In recent years, LLMs have seen a surge in popularity, in part due to the availability of OpenAI's ChatGPT, which presents an LLM using a chat-like interface. The inference capabilities of the LLM are used to predict the most likely continuation of a user query. In practice, if the user enters a question, the most likely continuation, as learned by the model, is an answer to that question, whereas if the user enters instructions, the most likely continuation follows these instructions. While the basic premise is enticing, at present LLMs suffer from a variety of issues, including hallucinations, where they make up people, events, citations, or court cases, and just plain factual inaccuracies. As text models, they are ill-equipped to reason or perform calculations. For our purposes, though, these issues are not necessarily problems, as we want the model to generate novel content. Indeed, this potential for creativity has been used to generate NPC responses in a murder mystery [26], control game play in interactive RPGs [27], or even to generate entire games using VGDL [28].

3. Our Approach

Our system is able to generate a long-form narrative consisting of multiple chapters. As input we utilize a plan, as obtained from the Glaive narrative planner, as well as an optional genre descriptor. Our discourse generator works in three steps:

1. Convert the Glaive plan into chapter descriptions.
2. Generate chapters from the descriptions.
3. Summarize and regenerate each chapter.

In each step, our approach utilizes the different roles one can provide when prompting ChatGPT, as shown below. To summarize: The system role provides the model with high-level guidance, the assistant role gives the model context (typically previous model output, but it can also be used to demonstrate desirable output to the model), while the user role contains the actual prompt the model should respond to.

3.1. Chapter Descriptions

The first challenge when generating discourse from story plans is that the planner output is in the form of a plan; in the case of Glaive this comes in a PDDL-like syntax. The first step is therefore to convert these formal representations into descriptions of a story chapter, where each step in the plan corresponds to one chapter (we will discuss the implications of this below). Rather than having the domain author come up with a mapping of plan actions to a description, we utilize ChatGPT itself to make this mapping. Given a plan step s, e.g. "(hatch-plan robbie six-shooter brown-horse bank mother-lode)", we construct the following prompt:

• System: You are rephrasing a string of words.
• User: Take the following phrase and make it into a coherent sentence: s. Provide only the resulting sentence.
• Assistant: In a statement like (accept talia rory village), the meaning is: "Talia accepts Rory's proposal in the village"

The example mapping provides the model with guidance on how to interpret the plan syntax. For the example step above, the model produces (variations of) the description "Robbie hatched a plan to rob the bank for the mother lode with his six-shooter and brown horse." In the next step, we use these descriptions to generate the individual chapters of the story.

3.2. Chapter Generation

Once we have a textual description of each step of the plan, we have the model generate a chapter of the story based on the description. However, when using only the description of the current step, chapters become disconnected, with characters changing frequently between them, needless repetitions, or plain inconsistencies. On the other hand, the context the model can use is also limited, so we cannot provide it with the entire story so far. Instead, we construct the following prompt, incorporating only the text of the immediately preceding chapter:

• System: You are a story teller continuing a story.
• User: Take the following chapter and make the next chapter, and include dialogue and natural progression: <previous chapter>.
• Assistant: The current chapter is: <previous chapter>
• Assistant: The current chapter is about: <chapter description>
• Assistant: The genre is: <genre>

For the first chapter, the system prompt is changed to "beginning a story", and no previous chapter text is included. For the last chapter, the model is also explicitly told that it is "ending the story" in the system prompt.

With these first two steps, the system will already produce a discourse following the events of the plan produced by Glaive. However, there will still be noticeable disconnects, as the generation process does not take into account what may follow. Chapters often end with "To be continued...", or even "The end.", even though the narrative has not reached its conclusion. We therefore added another processing step to refine the flow of the narrative.

3.3. Chapter Summarization and Regeneration

One key limitation of LLMs for our work is their limited context. However, even with a larger context size, providing the model with more input does not guarantee more desirable output. Nevertheless, in order to improve the consistency of the narrative, we take each story chapter and have the model rewrite it in context. For each chapter, we first ask the model to provide a summary of the events that happen in it, with the simple prompt "Create a short summary of the following chapter: <chapter>". We then ask the model to rewrite each existing chapter with the following prompt:

• System: You are a story teller remaking chapters.
• User: Rewrite the following chapter: <chapter>
• Assistant: Keep in mind that the <preceding/following> chapter is: <chapter summary>

where the assistant phrase is present up to four times: We take the context of the current chapter to be the two immediately preceding and two immediately following chapters, and include their summaries in the prompt. This causes the model to take the events of these chapters into account when rewriting the current one.

4. Results and Discussion

4.1. Input

We use the Glaive narrative planner to produce story plans for use with our system [17]. Glaive comes with seven standard narrative planning problems, of which we use four for our experiments:

• Fantasy: A story world set in a magical kingdom, with two lovers, Talia and Rory, and a monster, Gargax, guarding a treasure.
• Heist: A story world set in the American old west, set in a town with a bank, a saloon, and options for the characters to rob the bank, cheat at poker, steal valuables and exchange them for money, and for the sheriff to arrest criminals.
• Raiders of the Lost Ark: A story world based on the movie "Indiana Jones: Raiders of the Lost Ark", set in 1936, with a powerful artifact, the Ark of the Covenant, being chased after by Indiana Jones (at the behest of the US Army) and the Nazis.
• Western: Another story world set in the American old west, featuring snakes that can bite characters, and anti-venom that must be obtained to heal the snake bite.
• (Not used) Aladdin: A story world based on the tale "Aladdin" from 1001 Nights, with a king, a woman called Jasmine, a genie that can fulfill wishes, and a hero character. We did not use this domain because "Jasmine" and "Genie" were specific enough for ChatGPT to recall the story from its training set. We will discuss why this did not consistently happen for Raiders of the Lost Ark below.
• (Not used) Best Laid Plans: A story world consisting of a goblin minion that must obtain hair tonic for their warlock overlord. We did not use this domain, as the vast majority of actions performed in valid plans are move-actions, which did not lead to very interesting narratives.
• (Not used) Space: A domain ostensibly set in space with volatile planets and aliens. However, this domain is underdeveloped, with the sample problem leading to a solution of only two steps, far shy of an interesting narrative, and we therefore did not use this domain either.

While we did not use the last three domains for our main experiments, we still attempted to use them, and we will elaborate more on some of the problems we encountered below.
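The three prompting steps described in Section 3 can be sketched in code as follows. This is a minimal illustration rather than our exact implementation: the function names are hypothetical, the message lists follow the chat-completions format of the `openai` Python package, and the model name in the commented call is an assumption.

```python
# Sketch of the three prompting steps (Sections 3.1-3.3).
# The helper names are illustrative, not our actual code.

def description_messages(step):
    """Step 1: turn a Glaive plan step into a chapter description."""
    return [
        {"role": "system", "content": "You are rephrasing a string of words."},
        {"role": "assistant",
         "content": 'In a statement like (accept talia rory village), the meaning is: '
                    '"Talia accepts Rory\'s proposal in the village"'},
        {"role": "user",
         "content": f"Take the following phrase and make it into a coherent sentence: "
                    f"{step}. Provide only the resulting sentence."},
    ]

def chapter_messages(description, previous_chapter, genre, position="middle"):
    """Step 2: generate one chapter from its description and the preceding chapter."""
    system_text = {
        "first": "You are a story teller beginning a story.",
        "middle": "You are a story teller continuing a story.",
        "last": "You are a story teller ending the story.",
    }[position]
    messages = [{"role": "system", "content": system_text}]
    if previous_chapter:  # omitted for the first chapter
        messages.append({"role": "assistant",
                         "content": f"The current chapter is: {previous_chapter}"})
    messages.append({"role": "assistant",
                     "content": f"The current chapter is about: {description}"})
    messages.append({"role": "assistant", "content": f"The genre is: {genre}"})
    messages.append({"role": "user",
                     "content": "Take the following chapter and make the next chapter, "
                                "and include dialogue and natural progression: "
                                + (previous_chapter or "")})
    return messages

def rewrite_messages(chapter, context_summaries):
    """Step 3: rewrite a chapter given summaries of up to four surrounding chapters."""
    messages = [{"role": "system", "content": "You are a story teller remaking chapters."}]
    for relation, summary in context_summaries:  # e.g. ("preceding", "...")
        messages.append({"role": "assistant",
                         "content": f"Keep in mind that the {relation} chapter is: {summary}"})
    messages.append({"role": "user",
                     "content": f"Rewrite the following chapter: {chapter}"})
    return messages

# A call would then look like (not executed here; model name is an assumption):
# from openai import OpenAI
# client = OpenAI()
# response = client.chat.completions.create(
#     model="gpt-4o", messages=chapter_messages(desc, prev, "western"))
```

Each helper only builds the message list; keeping prompt construction separate from the API call makes it easy to swap model versions, as we did between ChatGPT 3.5 and 4o.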
Results Depending on the input plan, our approach is able to pro- Based on the narrative domains discussed above, we gener- duce a narrative telling that may reach the length of a short ated 10 text outputs for every one of the four domains, using novella. This makes evaluating the output challenging, as the solution plan included with Glaive (it is, of course, pos- it requires reading through hundreds of pages of text and sible, to generate more/different solution plans, to produce determine its quality. Before we go into more detail about more varied stories). For our purposes, we were interested these challenges, and the evaluation we performed, we will in how well the translation of plan steps to chapters would first discuss the input data we used and then show some work, what the chapters produced by ChatGPT would look sample output to demonstrate what our approach is capa- like, and how consistent they are across the narrative. Due ble of. However, we would be remiss to not also discuss to the recent release of ChatGPT 4o we performed our ex- limitations and challenges that remain, which we will do to periments twice, once using ChatGPT 3.5, which has much conclude this section. faster processing times, and once with ChatGPT 4o, to make use of the newest version of the model. First, despite only being given a single example step, the model is able to translate plan steps into chapter descriptions Table 1 Translation of the first 10 plan steps of the heist story into chapter descriptions as generated by ChatGPT 3.5 and 4o. In most cases, the descriptions were identical; rows marked with * indicate where the 4o version of ChatGPT produced significantly different output, which is listed there. Plan step Chapter description (hatch-plan robbie six-shooter brown-horse bank mother-lode) Robbie hatched a plan to rob the bank with his trusty six-shooter and brown horse to strike the mother lode. (open jill general-store) Jill opened the general store. 
(move-once sally main-street bank) Sally moved once on Main Street near the bank. (withdraw-money sally bank dress-money) Sally withdrew money from the bank to pay for her dress. *(withdraw-money sally bank dress-money) Sally withdrew money from the bank in her dress. (move-once sally bank main-street) Sally moves once on Main Street. (sell sally anne tomatoes tomato-money main-street) Sally Anne sells tomatoes for money on Main Street. (move-once robbie main-street dark-alley) Robbie moved once from Main Street to a dark alley. (hide-in-dark-alley robbie dark-alley) Robbie hides in a dark alley. (move-once barney barneys-room saloon) Barney moves once in Barney’s room at the saloon. *(move-once barney barneys-room saloon) Barney moved once from Barney’s room to the saloon. (pickpocket robbie sally dress-money main-street dark-alley) Robbie and Sally pickpocket dress money on Main Street in a dark alley. *(pickpocket robbie sally dress-money main-street dark-alley) Pickpocket Robbie stole Sally’s money from her dress in a dark alley on Main Street. very well, as shown in Table 1. In most instances, the de- The same story, as told by ChatGPT 4o starts like this: scriptions produced by the two model versions were almost identical, with only three steps being described differently. The sun dipped low over the dusty town of Notably, the models sometime struggle with distinguishing Red Rock, casting long shadows across the subject and object of an action, but the resulting descriptions wooden storefronts and dirt-packed streets. are still serviceable to turn them into chapters. The air was thick with the scent of sagebrush The second step is to take the chapter descriptions and and the distant promise of rain that never having the model produce narrative text that expands the seemed to come. Robbie McGraw sat on the description into an entire chapter. At first glance, this pro- creaky porch of the Rusty Spur Saloon, his duces output that reads like an actual story. 
For example, six-shooter resting comfortably in its holster one of our produced texts for the heist story starts as follows and his brown horse, Whiskey, tethered to (as told by ChatGPT 3.5): the hitching post nearby. Of particular note is that while the input only mentioned In the dusty town of Redwood Creek, the a character “Robbie” by name, both versions of the model sun beat down mercilessly on the wooden fill in additional names for the town, other characters, or buildings that lined the main street. Robbie even the horse, as needed. Overall, for the heist story in stood outside the saloon, wiping the sweat particular, the models produce text that is very compelling from his brow as he watched the townsfolk in parts, but falls short of actually being interesting, as we go about their business. He had been plan- will discuss in the next section. ning this heist for months, and today was the day he would finally put his plan into action. 4.3. Evaluation “Hey Robbie, you lookin’ mighty serious One key point of our approach is that while we generate there,” a voice called out from behind him. chapters one by one, we then have the model rewrite them Robbie turned to see his old friend, Jake, ap- in the context of the surrounding chapters. The purpose of proaching with a grin on his face. this is to ensure a higher level of consistency. Even though “Hey Jake, just thinking about our little job the model is given the previous chapter when writing the today,” Robbie replied, trying to keep his next one, characters are not used continuously, and instead voice casual. new characters are introduced, or the role of a character is changed between chapters. The rewrite attempts to ad- Jake’s eyes lit up with excitement. “You mean dress this issue by putting the chapter in context of what the bank job? I thought you were just jokin’ happens before and after it. 
One way to show this effect about that.” is to determine how often individual characters show up, Robbie shook his head. “No joke, Jake. and how the rewrite affects the output. Table 2 shows some We’re gonna hit that bank and make off with basic statistics of the generated narratives. We used spaCy1 enough gold to set us up for life.” to perform named entity recognition for each chapter, and Jake’s grin widened. “Count me in, partner. tracked the different character’s occurrences across chap- When do we make our move?” ters. We call characters that appear in more than 30% of chapters “main characters”, as plans often include actions Robbie glanced around to make sure no one performed by other characters, and characters that only was listening in. “Tonight, after the sun goes show up in one or two chapters “incidental” characters. The down. We’ll ride in, guns blazing, and take what’s rightfully ours.” 1 https://spacy.io/ Table 2 Some basic statistics of the generated narratives, averaged over 10 outputs for each narrative: Number of words, the number of main (appear in more than 30% of chapters), incidental (appear in only one or two chapters) and other characters, for the initial version of the narrative as well as the regenerated one (indicated with RW). Narrative ChatGPT Words Main Other Inc. Words (RW) Main (RW) Other (RW) Inc. (RW) Ark 3.5 4051.9 3.6 1 4.9 3303.3 3.2 1 4.3 Ark 4o 7646.7 4.8 1 6.4 7006.4 4.9 1 6.2 Fantasy 3.5 3193.4 3.3 0.9 1.3 2612.2 4.1 0.9 0.8 Fantasy 4o 5619.6 4.7 1 5.4 5085.4 6.3 1 3.6 Heist 3.5 15601.1 3.3 5.7 10.9 13060.8 3.2 5.5 12.7 Heist 4o 27526.5 6.9 5.2 15.5 27456.5 6.8 5.1 15.2 Western 3.5 4219.2 4.5 1 2.7 3628.4 4.4 1 2.8 Western 4o 6840.5 6.3 1 5.1 6507.4 6.1 1 5.8 table shows several effects of the rewrite, as well as some individual narrative, together with the mean and 95% confi- differences between the two model versions: In every in- dence intervals. 
It can be seen that human authors tend to stance, the rewritten narrative is shorter than the original, use a rather even mix of more and less complex sentences, as the model is able to remove some redundancy. Charac- while the models tend to eschew simpler constructs in favor ters are also utilized slightly differently, as shown by the of sentences consisting of more nested clauses. Note that changing number of main and incidental characters, but a neither higher nor lower Frazier scores are inherently “bet- more detailed analysis of this effect is still an open problem. ter”, and this evaluation only serves to provide a comparison Also noteworthy is that the newer model is significantly with some samples of “typical” (good) human writing. In more loquacious than the older version, producing almost our main experiments we avoided instructing the model to double the output given the same input story steps. imitate human authors due to some ethical concerns, but for As our approach is able to generate narratives consisting comparison reasons we added the instruction to write in the of thousands of words, a more detailed evaluation is very style of each of the authors of the four books, and performed challenging. Perhaps the most desirable form of evaluating the same analysis. Based purely on grammatical complexity, the produced narratives would be by gathering feedback the model does not seem to capture the same writing style, from human readers, but we encountered two obstacles and instead further increases sentence nesting. to this: First, the sheer quantity of text the readers would have to go through is beyond any reasonable compensation 4.4. Misspun Tales we could offer, particularly, because second, upon closer inspection, the writing is not very good. 
Rather than sub- While we believe that the results we show above already ject volunteer participants to what we do not believe to constitute a novel contribution, various problem cases we be good literature, we set out to quantify why the writing uncovered may also be of interest to future researchers. First, does not seem to be enjoyable. Below we will detail some while our approach often results in narratives that follow perhaps more anecdotal evidence, but we also attempted the given structure, this is not always the case. Since the to quantify some properties of the writing itself. Subjec- model is given the previous chapter input, as well as the tively, the rhythm of the writing seems artificial (which desired next step, it has to perform a trade-off in how much it is), and we believe this is due to the repetitive sentence attention to give to each, and at times the “most likely” con- structure. The Frazier score is a measure for syntactical tinuation ignores the plan steps entirely. In one particularly complexity [29], and measures, broadly speaking, the depth noteworthy instance, ChatGPT 4o took the Heist narrative, of the parse tree of a sentence. Higher scores therefore indi- had the bank robbery take place in chapter 3, followed by cate higher grammatical complexity, while simpler sentence an escape by sea onto a pirate ship, which then turned into structures are scored lower. We used nltk[30] and Stanford a fantasy story to chase after a powerful artifact, concluding CoreNLP [31] to compute the parse tree, and computed the with (as summarized by the model): Frazier score for each sentence of the generated narratives. 
For comparison purposes, we also computed the Frazier In Chapter Thirty: The Convergence, the scores for books written by human authors: Mary Shelley’s town of Port Meridian buzzes with anticipa- Frankenstein [32]), Jane Austen’s Pride and Prejudice [33]), tion as the day of the Great Unveiling ap- Victor Hugo’s Les Miserables (English translation by Isabel proaches. Robbie and Talia, along with the Florence Hapgood) [34] and Sir Arthur Conan Doyle’s The committee, work tirelessly to decipher the Adventures of Sherlock Holmes [35]. We believe that these Heart of the Ancients’ inscriptions. They books are a good representation of non-trivial literature, uncover a crucial passage about the "Conver- as they are considered classics and literary achievements, gence of Realms," a moment when the bound- yet still readable by a dedicated reader. While the style and aries between their world and the Ancients’ length of the books may vary, the average Frazier score of will blur, unlocking immense knowledge but all four books fell between 7 and 8. For comparison, the av- also posing significant dangers. As the com- erage Frazier score of the narratives generated by ChatGPT mittee intensifies their efforts, Robbie and 3.5 fell between 14 and 18 for the different narratives, and Talia resolve to ensure that this newfound while ChatGPT 4o did produce less convoluted sentences, wisdom is used ethically and for the greater its average Frazier score still ranged between 10 and 12. good, heralding a new era of unity and po- Figure 1 shows the distribution of Frazier scores across each tential. Figure 1: Distribution of grammatical complexity across sentences with mean values and 95% confidence interval, as measured by the Frazier score. In the first row, we provide distributions of classical, human-written books as comparison. The second row shows the distribution in narratives produced by ChatGPT 3.5, while the third row shows the same for narratives produced by ChatGPT 4o. 
Neither higher nor lower complexity scores are inherently "better", but the distribution of more and less complex sentences is a property of an author's writing.

A minor point that this conclusion also demonstrates is the tendency of the model to produce happy endings that reassure the reader that whatever power or treasure is obtained will be used ethically. While not "wrong", the persistent mention of this is not well-placed in all story contexts. We suspect that this is due to some of the "safeguards" OpenAI has integrated into their system to prevent (some) unethical output. We do not disagree with this choice, but it also shows that controlling LLM output to make it suitable for all applications is challenging.

Generally, control is a major issue when using an LLM. Our approach of rewriting chapters to make characters behave more consistently is not the only possible remedy. We also attempted to tell the model outright which characters exist and what their roles are. However, this led to two problems: First, the characters have to come from somewhere. If the domain author is tasked with providing a character list, they will need to foresee how large a cast of characters the model might need, which may also lead to higher repetitiveness of the generated stories. Our solution was to let the model generate an (ideally) varied list of characters to use. However, the model favors certain names for different genres (Robbie's partner in the heist story is usually called "Hank" or "Jake", despite neither showing up in the input). The second problem is that the model does not seem to have enough context to work with any character list that is given to it. When instructed to use specific names, the text might initially use the provided names, but it often changes back to the names favored by the model for each particular narrative.

Even when the model follows the provided trajectory and uses characters consistently, though, it struggles with keeping a cohesive tone. Step 7 in the heist plan is "(sell sally anne tomatoes tomato-money main-street)", which consists of a single transaction. In one instance, the model took this idea and just "ran with it", turning one character selling tomatoes into the three protagonists hosting a tomato festival where they sell produce together with the local farmers in order to finance their travels:

    With renewed purpose, the trio embarked on a mission to gather more tomatoes and secure a venue for the tomato festival. They approached local farmers, explaining their plan and offering a fair share of the profits. To their surprise, the farmers were intrigued by the idea and agreed to contribute their tomato harvest.

By itself, the tomato festival is a reasonable interpretation of the given story step, but it stood out in context, as the preceding chapter is titled "The Enigmatic Stranger" and the following chapter "The Mysterious Stranger". Overall, the model favored a darker, grittier narrative, which made the tomato festival seem even more out of place. On the other hand, as these two chapter titles already indicate, encounters with strangers, risk, or danger are all narrative devices the model employs frequently. The narrative in question contains three consecutive chapters titled "A Risky Proposition", which are then followed by "The Perilous Journey".
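Plan steps such as the one above follow a simple s-expression format of an operator followed by its arguments. As an illustration of how such a step can be turned into a chapter-level instruction, the sketch below splits a step into its parts and embeds them into a prompt; the prompt template itself is a hypothetical reconstruction, not the prompt actually used in our pipeline.

```python
# Sketch: splitting a Glaive-style plan step into operator and arguments
# and embedding it into a chapter-description prompt. The step format is
# taken from the example in the text; the prompt wording is illustrative.

def parse_step(step):
    """'(sell sally anne tomatoes tomato-money main-street)'
    -> ('sell', ['sally', 'anne', 'tomatoes', 'tomato-money', 'main-street'])"""
    tokens = step.strip().strip("()").split()
    return tokens[0], tokens[1:]

def chapter_prompt(step, chapter_number):
    operator, args = parse_step(step)
    # Explicitly restricting the chapter to the single plan action is one
    # conceivable way to discourage embellishments like the tomato festival.
    return (
        f"Write a description of chapter {chapter_number}. "
        f"In this chapter, exactly one thing happens: the action "
        f"'{operator}' involving {', '.join(args)}. "
        "Do not introduce additional events."
    )

op, args = parse_step("(sell sally anne tomatoes tomato-money main-street)")
print(op)  # sell
```

Whether such scope restrictions actually prevent the model from "running with" a step is, as discussed above, not guaranteed.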
It is already known that ChatGPT often produces inaccurate information [36], but this also raises issues even when it produces fiction. It may, for example, suggest ordering whiskey to sober up:

    Robbie led Barney to a table in the corner, away from prying eyes. He signaled the bartender for two whiskeys, hoping the strong drink would sober Barney up enough to have a coherent conversation.

On the flip side, as some of the examples are based on existing media, which OpenAI may have included in the data used to train the model, the resulting narrative may simply reproduce this existing data rather than produce a new one. This was particularly challenging with the Aladdin story, but for the Indiana Jones-based story another interesting phenomenon occurred: Rather than interpreting "Indiana" as a name (even in context with the Ark and the location of Tanis), ChatGPT would take it as the US state of Indiana, and then either name a town there "Tanis" or turn "Tanis" into a character, as in this example:

    The sun was just beginning to rise over the horizon, casting a golden hue across the small town of Maplewood, Indiana. The streets were still quiet, with only the occasional chirping of birds breaking the silence. Tanis stood at the edge of her driveway, her backpack slung over one shoulder and a map of Indiana clutched in her hand. She took a deep breath, feeling the crisp morning air fill her lungs. Today was the day she had been waiting for.
    "Are you sure about this, Tanis?" Her best friend, Mia, asked as she approached. Mia's eyes were filled with concern, but there was also a hint of excitement. "It's a long way to Indianapolis, and you know how unpredictable things can get."

Overall, using ChatGPT for narrative generation seems to produce reasonable output on the surface, but once one looks at the text more closely, problems keep arising almost fractally: as one digs deeper into a problem, another one shows up. The model has a general notion of what a "good" narrative would look like, but no understanding of flow, composition, coherence, common sense, or purpose. Attempts to rectify this with better prompts are only partially successful, as providing too much instruction makes the model more likely to ignore parts of it, while providing too little guidance results in rambling. We will conclude with the entirety of chapter 1 for one iteration of the Ark story (using ChatGPT 4o):

    The genre is action, adventure, and mystery. (sic!)

This happened exactly twice in all of our experiments; all other attempts produced actual chapter text.

5. Conclusion and Future Work

We have presented an approach to using ChatGPT to produce long-form text for discourse generation – or, more precisely, surface text realization. Our approach takes story steps produced by a narrative planner and tasks the model with first translating the abstract step output into descriptions of individual chapters, and then turning these descriptions into actual chapter text. We perform another pass over the chapters where we ask the model to summarize each chapter, and then rewrite it using the summaries of the preceding and following chapters as additional context. We show several example outputs of our model, and discuss the challenges of evaluating rather long texts that are not particularly well written. Crucially, we also investigate why the texts do not appear to read well, and show an analysis of the grammatical complexity of the generated narratives, which tend to be more complex than comparable human-written literature. Finally, we discuss a myriad of other problems we encountered that led to narratives that were ill-formed, illogical or incongruous.

While our work presents a somewhat bleak outlook on the current state of narrative generation using LLMs, we believe these insights are crucial to understanding what makes text seem artificial. The alien-ness of AI text may have been intuitively understood, but our work attempts to quantify it, which may in turn lead to improvements in output, or, perhaps more importantly, highlight the importance of human authorship. Our approach also focused on surface text realization without taking larger questions of discourse generation into account. Some repetitiveness may also be alleviated by not having the model generate an entire chapter for rather trivial move actions.

References

[1] J. R. Meehan, TALE-SPIN, an interactive program that writes stories, in: IJCAI, volume 77, 1977, pp. 91–98.
[2] J. Ryan, Grimes' fairy tales: a 1960s story generator, in: Interactive Storytelling: 10th International Conference on Interactive Digital Storytelling, ICIDS 2017, Funchal, Madeira, Portugal, November 14–17, 2017, Proceedings 10, Springer, 2017, pp. 89–103.
[3] R. M. Young, Story and discourse: A bipartite model of narrative generation in virtual worlds, Interaction Studies 8 (2007) 177–208.
[4] OpenAI, GPT-4 technical report, 2024. arXiv:2303.08774.
[5] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, Advances in Neural Information Processing Systems 30 (2017).
[6] M. Lebowitz, Story-telling as planning and learning, Poetics 14 (1985) 483–502.
[7] R. M. Young, S. G. Ware, B. A. Cassell, J. Robertson, Plans and planning in narrative generation: a review of plan-based approaches to the generation of story, discourse and interactivity in narratives, Sprache und Datenverarbeitung, Special Issue on Formal and Computational Models of Narrative 37 (2013) 41–64.
[8] M. O. Riedl, R. M. Young, Narrative planning: Balancing plot and character, Journal of Artificial Intelligence Research 39 (2010) 217–268.
[9] S. Ware, R. Young, CPOCL: A narrative planner supporting conflict, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 7, 2011, pp. 97–102.
[10] J. Porteous, M. Cavazza, Controlling narrative generation with planning trajectories: the role of constraints, in: Interactive Storytelling: Second Joint International Conference on Interactive Digital Storytelling, ICIDS 2009, Guimarães, Portugal, December 9-11, 2009, Proceedings 2, Springer, 2009, pp. 234–245.
[11] S. G. Ware, C. Siler, The Sabre narrative planner: multi-agent coordination with intentions and beliefs, in: AAMAS Conference Proceedings, 2021.
[12] H. Mohr, M. Eger, C. Martens, Eliminating the impossible: A procedurally generated murder mystery, in: AIIDE Workshops, 2018.
[13] R. Sanghrajka, R. M. Young, B. Thorne, HeadSpace: incorporating action failure and character beliefs into narrative planning, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 18, 2022, pp. 171–178.
[14] C. Aeronautiques, A. Howe, C. Knoblock, I. D. McDermott, A. Ram, M. Veloso, D. Weld, D. W. Sri, A. Barrett, D. Christianson, et al., PDDL – the planning domain definition language, Technical Report, 1998.
[15] P. Haslum, Narrative planning: Compilations to classical planning, Journal of Artificial Intelligence Research 44 (2012) 383–395.
[16] M. Christensen, J. Nelson, R. Cardona-Rivera, Using domain compilation to add belief to narrative planners, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 16, 2020, pp. 38–44.
[17] S. Ware, R. M. Young, Glaive: a state-space narrative planner supporting intentionality and conflict, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 10, 2014, pp. 80–86.
[18] J. Niehaus, R. M. Young, A method for generating narrative discourse to prompt inferences, in: Proceedings of the Intelligent Narrative Technologies III Workshop, 2010, pp. 1–8.
[19] Y.-G. Cheong, R. M. Young, A computational model of narrative generation for suspense, in: AAAI, 2006, pp. 1906–1907.
[20] B.-C. Bae, R. M. Young, A computational model of narrative generation for surprise arousal, IEEE Transactions on Computational Intelligence and AI in Games 6 (2013) 131–143.
[21] H.-Y. Wu, M. Young, M. Christie, A cognitive-based model of flashbacks for computational narratives, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 12, 2016, pp. 239–245.
[22] M. Eger, C. Barot, R. Young, Merits of a temporal modal logic for narrative discourse generation, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 11, 2015, pp. 23–29.
[23] D. K. Elson, Modeling narrative discourse, Columbia University, 2012.
[24] N. Wardrip-Fruin, Reading digital literature: Surface, data, interaction, and expressive processing, A Companion to Digital Literary Studies (2013) 161–182.
[25] C. Barot, C. Potts, R. M. Young, A tripartite plan-based model of narrative for narrative discourse generation, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 11, 2015, pp. 2–8.
[26] S. R. Cox, W. T. Ooi, Conversational interactions with NPCs in LLM-driven gaming: Guidelines from a content analysis of player feedback, in: International Workshop on Chatbot Research and Design, Springer, 2023, pp. 167–184.
[27] X. Peng, J. Quaye, W. Xu, C. Brockett, B. Dolan, N. Jojic, G. DesGarennes, K. Lobb, M. Xu, J. Leandro, et al., Player-driven emergence in LLM-driven game narrative, arXiv preprint arXiv:2404.17027 (2024).
[28] C. Hu, Y. Zhao, J. Liu, Generating games via LLMs: An investigation with video game description language, arXiv preprint arXiv:2404.08706 (2024).
[29] L. Frazier, Syntactic complexity, Natural Language Parsing: Psychological, Computational, and Theoretical Perspectives (1985) 129–189.
[30] S. Bird, E. Klein, E. Loper, Natural Language Processing with Python: analyzing text with the natural language toolkit, O'Reilly Media, Inc., 2009.
[31] C. D. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. J. Bethard, D. McClosky, The Stanford CoreNLP natural language processing toolkit, in: Association for Computational Linguistics (ACL) System Demonstrations, 2014, pp. 55–60. URL: http://www.aclweb.org/anthology/P/P14/P14-5010.
[32] M. Shelley, Frankenstein; Or, The Modern Prometheus, Penguin, 1818.
[33] J. Austen, Pride and Prejudice, T. Egerton, Whitehall, 1813.
[34] V. Hugo, Les Misérables, Thomas Y. Crowell & Co., 1887. Translation by Isabel Florence Hapgood.
[35] A. C. Doyle, The Adventures of Sherlock Holmes, George Newnes, 1892.
[36] M. T. Hicks, J. Humphries, J. Slater, ChatGPT is bullshit, Ethics and Information Technology 26 (2024) 38.