<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
<article-title>Leveraging LLM-Constructed Graphs for Effective Goal-Driven Storytelling</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Taewoo Yoo</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yun-Gyung Cheong</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
<institution>Sungkyunkwan University (SKKU), Suwon</institution>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <abstract>
<p>While advanced language models, such as Large Language Models (LLMs), have demonstrated potential in generating various types of text, including narratives, they often struggle to maintain semantic consistency. In narrative theory, skeleton selection refers to deriving a story's backbone by choosing only the pivotal events, or nucleus, from the comprehensive story world (fabula), ensuring a focused and coherent narrative structure. To address the challenges faced by LLMs, we utilize Story Plan Graphs (SPGs), a form of Knowledge Graphs, to ensure logical soundness for skeleton construction. When evaluated against GPT-3.5 using the ROCStories dataset, our approach demonstrates enhanced skeleton selection capabilities, offering an efficient solution for storytelling.</p>
      </abstract>
      <kwd-group>
<kwd>LLMs (Large Language Models)</kwd>
        <kwd>SPGs (Story Plan Graphs)</kwd>
        <kwd>KGs (Knowledge Graphs)</kwd>
        <kwd>Narrative generation</kwd>
        <kwd>Goal-driven storytelling</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
Stories are an essential element that permeates human culture and history. They are expressed in
various forms, including literature, film, and entertainment such as games, providing enjoyment to people.
A story refers to a series of events linked by causality, experienced or enacted by actors [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. For
instance, “Mary woke up late. She missed the bus to work. Her boss was unhappy.” is considered a
story, whereas “Mary woke up late. She wore a blue dress to work. The coffee machine was broken.” is
non-narrative.
      </p>
      <p>
        A coherent and engaging story demands that each sentence logically follows the previous one. This
means that the events, actions, and dialogue in the story must be linked by cause and effect, ensuring
that the overall narrative makes sense for the reader. Furthermore, crafting a story that engages and
entertains the reader presents a notable challenge. Consequently, narrative generation has captured the
interest of researchers for decades and has become a topic of intensive investigation with the advent of
LLMs enabled by Transformer-based language models [
        <xref ref-type="bibr" rid="ref4 ref5">4, 5</xref>
        ].
      </p>
      <p>
While LLMs offer significant improvements in narrative generation, they still face challenges in
maintaining deep semantic coherence, avoiding repetition, and producing highly specific and creative
responses [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Moreover, LLMs sometimes exhibit reasoning errors and inconsistent responses due
to the lack of an underlying belief system and their reliance on probabilistic patterns in their training
data [
        <xref ref-type="bibr" rid="ref7">7, 8</xref>
        ].
      </p>
      <p>To address these limitations, utilizing knowledge graphs (KGs) and reasoning frameworks can be a
solution. Traditional approaches to story generation, such as symbolic planning [9, 10, 11, 12, 13], can infer
causal relationships between events. Additionally, they offer a means to model semantic dependencies in
the form of a graph. This is conceptually similar to KGs, which represent information in a structured
format. Thus, this paper presents Story Plan Graphs (SPGs) as a form of Knowledge Graph, specifically
tailored for narrative story generation.</p>
      <p>A story can be analyzed via a tripartite model, which comprises the notions of fabula, syuzhet, and
discourse [14, 15, 16, 17]. The fabula provides the raw content of a story, the syuzhet selects and
organizes that content, and the discourse presents it to the audience.</p>
      <p>Drawing upon this narrative analysis theory, we aim to construct a narrative as a story plan graph
(SPG) at the fabula layer and select core events as a skeleton at the syuzhet layer. Specifically, this
study investigates which events should be chosen from the modeled SPG to construct the most effective
skeleton in terms of coherency, logicality, and interestingness.</p>
      <p>The key contributions of this research are enumerated as follows:
• We propose a new method leveraging story plan knowledge graphs to construct a coherent story.
• We propose an efficient content selection procedure based on the well-established significance
metrics TF-IDF [18] and the PageRank [19] algorithm.
• We conducted an automated evaluation utilizing GPT-3.5 and the ROCStories dataset [20]. The
results indicate that our approach effectively constructs the skeleton by accurately identifying
and prioritizing key events within the narrative, ensuring both relevance and coherence in the
given story.</p>
      <p>In this study, we examine how symbolic knowledge, specifically Story Plan Graphs (SPGs), and
algorithms can enhance narrative generation using Large Language Models (LLMs).</p>
      <p>The structure of the paper is as follows: Section 2 reviews related works; Section 3 describes our
skeleton selection approach; Section 4 presents the experiment and discusses the results; and finally,
Section 5 concludes with future work.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Narrative Analysis Theory</title>
        <p>The employment of the bipartite model—story and discourse—in analyzing narrative has a long history
in narratology [14]. In this model, story refers to the content plane of narrative whereas discourse
represents its expression plane.</p>
        <p>Some narrative theorists [15, 16, 17] maintain that the emergence of different stories from the same
story material is rooted in the existence of an abstract entity called the narrator, who decides what to
tell and when to tell it. To distinguish the narrator’s role from the discourse, they propose a three-tiered model
of narrative consisting of the fabula, the sjuzhet, and the narrative discourse. The ‘fabula’ refers to the
comprehensive story world, encompassing all events, characters, and circumstances.</p>
        <p>In this paper, the event sentence list from the SPG was utilized as the fabula. All events within the
fabula are feasible, distinguishing it from the ‘possible world’ [21], wherein not all of the events it contains
can occur concurrently. The ‘skeleton’ is derived by selecting only the pivotal events from the fabula,
essentially constituting the backbone or the primary events of the story, termed the nucleus [14]. The
‘syuzhet’ is responsible for ordering the nucleus of the skeleton to instill elements such as suspense,
thereby captivating the audience; it may also incorporate ‘satellites’, events that might not be crucial
to the storyline but are pivotal for narration [14]. The ‘discourse’ represents the syuzhet as expressed
through mediums like text or film. Our research focuses on skeleton selection, grounded in the
aforementioned theories and definitions.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Computational Approaches to Story Generation</title>
        <p>Traditional story generation systems leverage symbolic approaches such as inference and planning
algorithms. These systems are divided into author-centric and character-centric approaches. Tale-Spin
[9] generates stories by modeling the goals and actions of characters and constructing narratives
through their interactions. Universe [22] is a system focused on the creative aspects of storytelling,
designed as an aid for writers. It synthesizes various story elements into a plot through interaction
with humans.</p>
        <p>Minstrel [23] is a knowledge-based system for storytelling, emphasizing character and plot
development. It simulates creative problem-solving in story generation, employing methods to use existing
knowledge in novel ways. Mexica [24] aims to model the creative process of story writing, specifically
creating narratives about the lives of early Mexican natives. Its approach emphasizes creativity and
emotional connections to deepen the narrative generated.</p>
        <p>Fabulist [25] is a story generation architecture that models story structure and character intentions,
considering the causes and consequences of events to create narratives. Virtual StoryTeller [26] employs
a multi-agent approach to generate stories. Each agent, with its independent knowledge and goals,
interacts in the story development process, selecting actions that contribute to story creation. Our work
references these methodologies to study skeleton selection methods.</p>
        <p>Existing story generation models often struggle to maintain consistency. Various approaches have
been researched to address this issue. For instance, Xie et al. [27] investigated whether large pre-trained
language models could learn storytelling with few examples. Additionally, Peng et al. [28] proposed a
method to improve the consistency and thematic coherence of neural network-based story generation
using reader models. Furthermore, Wang et al. [29] conducted a comprehensive survey on open-world
story generation with structured knowledge enhancement, exploring ways to improve the logical
coherence of generated stories. Xu et al. [30] proposed the MEGATRON-CNTRL framework, which
integrates an external knowledge base to enable controllable story generation. These studies illustrate
additional methods for LLMs to maintain coherent and logical story structures beyond simple text
generation capabilities. In this research, we utilized SPGs—a form of KGs—to enhance the consistent
storytelling abilities of LLMs.</p>
        <p>Neural Story Planning [31] addresses the manual schema-related challenges of traditional story
generation methods, such as symbolic planning, by utilizing LLMs. By drawing upon common-sense
knowledge extracted from these expansive language models, it is possible to recursively expand the SPG
using a backward chaining approach from the goal event sentence, thus generating a consistent SPG. In
this paper, we leverage these SPGs as a form of KGs, integrating the structured, logical representation
of symbolic planning with the language-based knowledge generated by an LLM.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Skeleton Selection</title>
      <sec id="sec-3-1">
        <title>3.1. Overview</title>
        <p>First, we construct the SPGs employing the Neural Story Planning method [31], setting the last sentence
of selected stories from the ROCStories dataset as the goal event sentence.</p>
        <p>Figure 3 depicts our skeleton selection algorithm, which computes the selection score for each event
sentence e_i (where 1 ≤ i ≤ n) within the fabula F = {e_1, e_2, ..., e_n} as follows:
S(e_i) = α E(e_i) + β G(e_i)
(1)
where α represents the weight of the event-based score E, and β denotes the weight of the graph-based
score G, with the constraint β = 1 − α and 0 ≤ α, β ≤ 1. We aim to adjust α to blend the two scores.</p>
        <p>The overall process of computing the selection score follows the steps shown in Algorithm 1:
1. Initialize a fabula F: This step initializes the entire story plot.
2. Initialize a goal g: The last sentence of the story is set as the goal event.
3. Vectorize Events: All events in the plot are vectorized to evaluate their importance.
4. Compute Event-Based Score: The event-based score for each event sentence is calculated using
tf-idf.
5. Compute Graph-Based Score: The graph-based score is calculated using the PageRank
algorithm and the distance from the goal event.
6. Combine Scores: The event-based and graph-based scores are combined to compute the final
selection score.</p>
        <p>7. Select Top-k Events: The top-k event sentences are selected based on their final selection scores.
This algorithm ensures logical coherence and an interesting story composition by evaluating the causal
relationships between event sentences (e_i) generated through backward chaining from the goal event
sentence (g) in the SPG.</p>
        <p>Finally, the top-k event sentences are selected based on their selection scores. Please note that the
selected event sentences may not be directly linked within the graph. For instance, in Figure 1, while
“Ludo’s work was taking a toll on his health” (denoted as e_1) and “Ludo got a prescription for the
medicine from his doctor” (denoted as e_3) are selected, “Ludo drove himself to hospital” (denoted as e_2)
may not be. Although e_1 and e_3 are not directly linked within the graph, readers can infer e_2 on their
own. Thus, readers can comprehend the story without e_2 being selected.</p>
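        <p>For illustration, Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not our exact implementation: the two scoring functions are placeholders for the definitions in Sections 3.2 and 3.3, and the data structures (the fabula as a list of event sentences, plus an adjacency list and a goal sentence) are simplifying assumptions.</p>
        <preformat>
# Minimal sketch of Algorithm 1 (skeleton selection).

def compute_event_score(event, fabula):
    """Placeholder for the tf-idf based score E (Section 3.2)."""
    return 0.0

def compute_graph_score(event, adjacency, goal):
    """Placeholder for the PageRank-based score G (Section 3.3)."""
    return 0.0

def select_skeleton(fabula, adjacency, goal, alpha=0.5, k=10):
    """Score every event sentence with S = alpha*E + beta*G (Equation 1)
    and return k events including the goal, reported in fabula order."""
    beta = 1.0 - alpha                            # constraint: beta = 1 - alpha
    scores = {e: alpha * compute_event_score(e, fabula)
                 + beta * compute_graph_score(e, adjacency, goal)
              for e in fabula}
    ranked = [e for e in sorted(scores, key=scores.get, reverse=True)
              if e != goal]
    chosen = set(ranked[:k - 1]) | {goal}         # goal event is always kept
    return [e for e in fabula if e in chosen]     # preserve fabula order
        </preformat>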
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Event-Based Score</title>
        <p>The event-based score E is calculated based on the importance derived from the tf-idf of the event
sentences within the fabula. Condition sentences appear only in association with the causal links between
event sentences, which are accounted for during the computation of the graph-based score; therefore,
only the event sentences were utilized when calculating E. The computation of the event-based score E
is as follows:
E(e_i) = Σ_{w ∈ e_i} idf(w, D) * tf(w, e_i) * idf(w, F)
(2)
where w denotes the events present in the event sentence, as highlighted in bold in Figure 4.</p>
        <p>The first term, idf(w, D), references the inverse document frequency over the ROCStories dataset
D. The general inverse document frequency aids in filtering out the events that occur frequently
throughout the dataset. The second term, tf(w, e_i), represents the term frequency and is employed to
identify pivotal events within each event sentence. The final term, idf(w, F), assists in filtering events
that are frequently used locally within the fabula F. For instance, in the story shown in Figure 4, both
the words ‘drive’ and ‘get’ appear frequently. The final term helps reduce the probability of selecting
these commonly occurring words.</p>
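        <p>A small sketch of Equation 2 is given below, assuming whitespace tokenization and the standard tf and idf definitions; the paper restricts the sum to the event words highlighted in Figure 4, whereas the sketch sums over all tokens for simplicity, and the +1 idf smoothing is an assumption.</p>
        <preformat>
import math
from collections import Counter

def idf(word, documents):
    """Inverse document frequency of `word` over a list of token lists."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (1 + containing))  # +1 smoothing (assumption)

def event_score(event, fabula_sentences, dataset_sentences):
    """Equation (2): sum of idf(w, D) * tf(w, e_i) * idf(w, F) over words w."""
    tokens = event.lower().split()
    tf = Counter(tokens)
    F = [s.lower().split() for s in fabula_sentences]    # fabula as documents
    D = [s.lower().split() for s in dataset_sentences]   # ROCStories as documents
    return sum(idf(w, D) * (tf[w] / len(tokens)) * idf(w, F)
               for w in set(tokens))
        </preformat>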
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Graph-Based Score</title>
        <p>Leveraging the information derived from the SPG, we assess the significance of each event sentence
node. To incorporate the importance of the causal relationships between event sentences and condition
sentences, we employ the PageRank method to determine the graph-based score of each event sentence.</p>
        <p>We also consider the distance d(g, e_i) from the goal event sentence g to each node as a weight,
emphasizing events surrounding the goal. This approach was chosen to align with our focus on
selecting a skeleton for a goal-driven story. The graph-based score G is computed using the following
equation:
G(e_i) = PageRank(adj_list, d(g, e_i))
(3)
where adj_list represents the adjacency list of the SPG, and d(g, e_i) is defined as the shortest path
between g and e_i when at least one path exists between them, as described in Harary [32]. Notably,
since every node in the SPG is generated through backward chaining from g, there are no instances
where g and e_i are not connected.</p>
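        <p>Equation 3 leaves open precisely how the goal distance enters the PageRank computation; one plausible reading, sketched below with networkx, uses the inverse distance 1/(1+d) as a personalization weight so that probability mass concentrates on nodes near the goal. Both that weighting and the direction-agnostic distance are illustrative assumptions.</p>
        <preformat>
import networkx as nx

def graph_scores(adj_list, goal):
    """Sketch of Equation (3): goal-biased PageRank over the SPG."""
    G = nx.DiGraph(adj_list)                     # SPG from its adjacency list
    # d(g, e_i): shortest-path distance from the goal event sentence g,
    # ignoring edge direction; every node is reachable by construction
    # (backward chaining from g), so the distance is always defined.
    dist = nx.shortest_path_length(G.to_undirected(), source=goal)
    bias = {node: 1.0 / (1 + d) for node, d in dist.items()}
    return nx.pagerank(G, personalization=bias)  # node -> graph-based score
        </preformat>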
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiment</title>
      <sec id="sec-4-1">
        <title>4.1. Dataset</title>
        <p>In this section, we evaluate our skeleton selection method using the SPGs generated from the stories
in the ROCStories dataset. We describe the dataset, baseline, and evaluation methodology. Subsequently,
we present the results in comparison with the baseline and ablation study, discussing the implications
of these findings.</p>
        <p>We employed the recently-introduced story planning method, Neural Story Planning, to generate the
SPGs. For the goal event sentence, we utilized the final sentence from the stories in the ROCStories
dataset. The ROCStories dataset, introduced by Mostafazadeh et al. [20], is a collection of 100,000 short
commonsense stories designed for research in commonsense reasoning and story understanding.</p>
        <p>Each story in the ROCStories dataset consists of five sentences that describe everyday scenarios,
providing a rich source of diverse narrative structures. This makes it particularly suitable for evaluating
story generation and skeleton selection methods.</p>
        <p>From the generated plan graphs, we conducted experiments using 135 SPGs that adhered to the
criteria of a fabula rather than a possible world. Each fabula comprises more than 15 event sentences.
The selection criteria ensured that the stories used in our experiments maintained a level of complexity
suitable for testing our skeleton selection algorithm.</p>
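        <p>The size criterion can be expressed as a simple filter; the sketch below assumes each generated SPG exposes its event sentence list, and it does not model the separate fabula-versus-possible-world check.</p>
        <preformat>
def satisfies_size_criterion(event_sentences):
    """Size criterion used for the experiments: a fabula with more than
    15 event sentences (135 of the generated SPGs qualified)."""
    return len(event_sentences) > 15

# hypothetical usage, assuming each SPG object carries its event sentences:
# experiment_spgs = [g for g in spgs if satisfies_size_criterion(g.event_sentences)]
        </preformat>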
        <p>Additionally, the ROCStories dataset allows for the testing of narrative coherence and logical
progression, as the stories inherently contain causal links and event dependencies. This characteristic of the
dataset was crucial for evaluating the effectiveness of our Story Plan Graph-based approach to skeleton
selection.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Baseline</title>
        <p>We used GPT-3.5 (the ‘gpt-3.5-turbo-16k-0613’ version, accessed through the OpenAI API; we opted
for a specific version rather than the latest to ensure consistency in our experiments) to generate a
skeleton for our baseline. We provided the adjacency list of the SPG and the fabula through prompting,
instructing it to select k event sentences, including the goal event sentence. Given that our study focuses
on goal-driven storytelling, the final event sentence represents the goal event. We noted that GPT-3.5
not only performed skeleton selection but also undertook ordering. Since we need to compare only the
skeleton selection performance, we rearranged the skeleton produced by GPT-3.5 to match the order of
the fabula; a sketch of this reordering step follows the prompt examples below. Examples of prompts
designed to guide GPT-3.5 in selecting skeletons from the fabula are as follows:</p>
        <sec id="sec-4-2-1">
          <title>Role:</title>
          <p>You create a skeleton story by selecting events from the tree-structured story planner. You
have to look at the story planner given an adjacency list and choose 9 events in event list.
The criteria for selecting events can be freely defined. Please select an appropriate event
considering the fun of the event, causal link, goal sentence, etc.</p>
        </sec>
        <sec id="sec-4-2-2">
          <title>Content:</title>
          <p>goal: Ludo watched a lot of movies on the subscription during the next week.
adjacency list:
Ludo watched a lot of movies on the subscription during the next week.:set()
I; A subscription for watching movies:Ludo watched a lot of movies on the subscription
during the next week.</p>
          <p>Ludo purchased a subscription online using his credit card:I; A subscription for watching
movies
...
event list:
Ludo drove to his workplace in his car
Ludo has completed a new project that needs to be completed urgently
Ludo got the laptop from his company for work purposes
Ludo was trying to impress his boss by working hard
...</p>
          <p>Question: Choose 9 events in event list.</p>
          <p>Answer:</p>
          <p>We additionally conducted skeleton selection using ChatGPT. By comparing skeleton selection
performance with ChatGPT, known for its high proficiency in a wide range of linguistic tasks, we aimed
to assess the effectiveness of our selection algorithm. We utilized ChatGPT 4 with the same prompts
used for GPT-3.5 for interactive tasks.</p>
        </sec>
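        <p>The reordering step mentioned above can be sketched as follows, assuming the model returns event sentences verbatim so that they can be matched against fabula entries.</p>
        <preformat>
def reorder_to_fabula(selected_sentences, fabula):
    """Undo any ordering imposed by the model: sort the selected event
    sentences by their position in the original fabula, so that only
    selection (not ordering) performance is compared."""
    position = {sentence: i for i, sentence in enumerate(fabula)}
    return sorted(selected_sentences, key=lambda s: position[s])
        </preformat>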
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Evaluation Method</title>
        <p>To evaluate whether the selected skeletons are 1) intriguing, 2) logical, and 3) cohesive towards the
goal, we compared the skeleton produced by GPT-3.5 (A) and the skeleton selected using our method
(B) using the following three questions:
• Interestingness: Which story was more interesting?
• Logic Coherency: Which story had coherent flow between sentences?
• Topic Coherency: Which story had overall consistency in theme?</p>
        <p>For each of the three questions, we collected responses 10 times to evaluate which skeleton, A or B,
was selected more effectively. The responses were gathered using the same GPT-3.5 version that served
as our baseline. For this evaluation, we set α = 0.5 and k = 10.</p>
        <p>Role: You are the story evaluator. You just have to look at Story A and Story B, and answer
the questions only with "A" or "B".</p>
        <p>Content: Story A:
1. Horace was transported to a location where he can freely move around by walking
2. Horace hailed a taxi on the street
3. Horace took a taxi to the car dealership
4. Horace bought a car from a dealership
5. Horace drove his car to the hardware store
6. Horace had been using the lightbulb in his bathroom for a long time until it burned out
7. The old lightbulb burned out after being used for a long time
8. Horace asked a store employee for assistance
9. Horace bought a new lightbulb from a hardware store
10. Horace is glad the lightbulb in his bathroom is no longer dead.</p>
        <p>Story B:
1. Horace didn’t have a choice in inheriting his functional legs
2. Horace inherited his pair of functional legs from his parents
3. Horace has had the ability to walk since he was born
4. Horace walked to the street where he hailed the taxi
5. Horace hailed a taxi on the street
6. Horace had been using the lightbulb in his bathroom for a long time until it burned out
7. The old lightbulb burned out after being used for a long time
8. Horace had a doubt
9. Horace bought a new lightbulb from a hardware store
10. Horace is glad the lightbulb in his bathroom is no longer dead.</p>
        <p>Question: Which story was more interesting?</p>
        <p>Answer:</p>
        <p>In this example, Story A is the skeleton selected by GPT-3.5, and Story B is the skeleton selected with
our method. The order of Story A and Story B is determined randomly. The rationale for randomizing the
order in our evaluations stems from the positional bias found in large language models, as identified in
recent research [33]. To mitigate this bias, we randomized the story sequence and structured our prompts
accordingly.</p>
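        <p>The collection procedure can be sketched as follows; here ask_llm is a hypothetical wrapper that sends the evaluator prompt to GPT-3.5 and returns "A" or "B", and the prompt template is abbreviated.</p>
        <preformat>
import random
from collections import Counter

QUESTIONS = [
    "Which story was more interesting?",
    "Which story had coherent flow between sentences?",
    "Which story had overall consistency in theme?",
]

def evaluate_pair(baseline, ours, ask_llm, n_trials=10):
    """Collect n_trials A/B judgments per question, randomizing which method
    appears as Story A to counter positional bias [33]. `ask_llm` is a
    hypothetical callable: prompt string -> "A" or "B"."""
    votes = {q: Counter() for q in QUESTIONS}
    for q in QUESTIONS:
        for _ in range(n_trials):
            ours_is_a = random.random() >= 0.5       # randomize story order
            story_a, story_b = (ours, baseline) if ours_is_a else (baseline, ours)
            prompt = (f"Story A:\n{story_a}\n\nStory B:\n{story_b}\n\n"
                      f"Question: {q}\nAnswer:")
            answer = ask_llm(prompt).strip()
            ours_won = (answer == "A") == ours_is_a  # map A/B back to methods
            votes[q]["ours" if ours_won else "baseline"] += 1
    return votes
        </preformat>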
      </sec>
      <sec id="sec-4-4">
        <title>4.4. Results and Discussion</title>
        <p>As presented in Table 1, the skeleton selected using our method was favored over the skeleton generated
by GPT-3.5 across all three question types. Although these findings are based on evaluations by an LLM
rather than human judgments, numerous prior studies [31, 33] have utilized LLMs for auto-evaluation.
Hence, it can be inferred that our algorithm performed a more effective skeleton selection.</p>
        <p>Table 2 reports the results from the experiments comparing our skeleton selection method with that of
ChatGPT. The preference for our algorithm, though marginally higher, indicates that our algorithm can
exhibit performance comparable to commercial large-scale language models on the skeleton selection
task.</p>
        <p>To validate the efficacy of our proposed event-based and graph-based approaches, we assessed
skeletons generated by adjusting the value of α. According to Equation 1, when α = 1, the skeleton is
selected solely based on the event-based method, and when α = 0, it is based entirely on the graph-based
method.</p>
        <p>The results are presented in Table 3. As we hypothesized, the graph-based-only selection method more
adeptly chose skeletons that were logical and coherent towards the goal. Additionally, the event-based
approach seemed to aid in selecting more engaging skeletons. To further discern the utility of our
proposed methods, we conducted an ablation study, as detailed in Section 4.5.</p>
        <p>Table 1 compares, for each question type (Interestingness, Logic Coherency, Topic Coherency, and their average), the percentage of evaluations favoring the GPT-3.5 baseline versus our method.</p>
      </sec>
      <sec id="sec-4-5">
        <title>4.5. Ablation Study</title>
        <p>To determine the impact of our proposed event-based score E and graph-based score G on the quality
of skeleton selection, we conducted evaluations using a plain tf-idf and an unweighted PageRank,
respectively. The results are displayed in Table 4.</p>
        <p>Across all question types, the skeleton selection method we proposed demonstrates superior
performance. This suggests that both E and G have been effectively applied in the skeleton selection
process.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this paper, we propose an algorithm to generate a narrative story skeleton by selecting important
events from the fabula using a Story Plan Graph (SPG), which emphasizes the logical coherence of
event sentences within the story’s structure. Our approach also considers an event-based scheme to
include pivotal events based on their occurrences in the fabula. Collectively, these methods ensure the
inclusion of overarching event sentences throughout the fabula.</p>
      <p>We employ GPT-3.5 to automatically evaluate the interest, logical coherence, and unity of the
skeleton. The results demonstrate that our skeleton selection algorithm outperforms GPT-3.5 and shows
comparable performance to ChatGPT, while offering greater efficiency in terms of API usage fees and
physical resources.</p>
      <p>We plan to conduct a comprehensive formal evaluation using state-of-the-art LLMs to validate the
efficacy of our proposed approach. By integrating SPGs, a form of Knowledge Graph, with LLMs,
we believe that this paper contributes to the computational storytelling community by combining the
strengths of symbolic and neural methods for reliable knowledge processing.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work was partly supported by the National Research Foundation of Korea (NRF) grant funded by
the Korea government (MSIT) (No.RS-2024-00357849), the Korea Planning &amp; Evaluation Institute of
Industrial Technology (KEIT) grant funded by the Korea government (MOTIE) (No.RS-2024-00413839),
and Institute of Information &amp; communications Technology Planning &amp; Evaluation (IITP) grant funded
by the Korea government (MSIT) (No.2019-0-00421, Artificial Intelligence Graduate School Program
(Sungkyunkwan University)).</p>
    </sec>
    <sec id="sec-7">
      <title>References</title>
      <p>[9] J. R. Meehan, Tale-Spin, an interactive program that writes stories, in: IJCAI, volume 77, 1977, pp. 91–98.</p>
      <p>[10] M. Lebowitz, Story-telling as planning and learning, Poetics 14 (1985) 483–502.</p>
      <p>[11] J. Porteous, M. Cavazza, Controlling narrative generation with planning trajectories: the role of constraints, in: Interactive Storytelling: Second Joint International Conference on Interactive Digital Storytelling, ICIDS 2009, Guimarães, Portugal, December 9-11, 2009, Proceedings 2, Springer, 2009, pp. 234–245.</p>
      <p>[12] M. O. Riedl, R. M. Young, Narrative planning: Balancing plot and character, Journal of Artificial Intelligence Research 39 (2010) 217–268.</p>
      <p>[13] S. Ware, R. Young, CPOCL: A narrative planner supporting conflict, in: Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 7, 2011, pp. 97–102.</p>
      <p>[14] S. B. Chatman, Story and Discourse: Narrative Structure in Fiction and Film, Cornell University Press, 1978.</p>
      <p>[15] G. Genette, Narrative Discourse: An Essay in Method, volume 3, Cornell University Press, 1983.</p>
      <p>[16] S. Rimmon-Kenan, Towards... afterthoughts, almost twenty years later, in: Narrative Fiction: Contemporary Poetics, 2nd ed., Routledge, London and New York, 2002, pp. 134–149.</p>
      <p>[17] R. Walsh, Fabula and fictionality in narrative theory, Style 35 (2001) 592–606.</p>
      <p>[18] K. Sparck Jones, A statistical interpretation of term specificity and its application in retrieval, Journal of Documentation 28 (1972) 11–21.</p>
      <p>[19] L. Page, S. Brin, R. Motwani, T. Winograd, The PageRank citation ranking: Bringing order to the web, Technical Report, Stanford University, 1998.</p>
      <p>[20] N. Mostafazadeh, N. Chambers, X. He, D. Parikh, D. Batra, L. Vanderwende, P. Kohli, J. Allen, A corpus and evaluation framework for deeper understanding of commonsense stories, arXiv preprint arXiv:1604.01696 (2016).</p>
      <p>[21] M.-L. Ryan, Possible Worlds, Artificial Intelligence, and Narrative Theory, Indiana University Press, 1991.</p>
      <p>[22] M. Lebowitz, Creating a story-telling universe, 1983.</p>
      <p>[23] S. R. Turner, Minstrel: A Computer Model of Creativity and Storytelling, University of California, Los Angeles, 1993.</p>
      <p>[24] R. Pérez y Pérez, MEXICA: A computer model of creativity in writing, 1999.</p>
      <p>[25] M. O. Riedl, R. M. Young, A planning approach to story generation for history education, in: Proceedings of the 3rd International Conference on Narrative and Interactive Learning Environments, 2004, pp. 41–48.</p>
      <p>[26] M. Theune, S. Faas, A. Nijholt, D. Heylen, The virtual storyteller: Story creation by intelligent agents, in: Proceedings of the Technologies for Interactive Digital Storytelling and Entertainment (TIDSE) Conference, volume 204215, 2003, p. 116.</p>
      <p>[27] Z. Xie, T. Cohn, J. H. Lau, The next chapter: A study of large language models in storytelling, 2023. arXiv:2301.09790.</p>
      <p>[28] X. Peng, K. Xie, A. Alabdulkarim, H. Kayam, S. Dani, M. O. Riedl, Guiding neural story generation with reader models, 2022. arXiv:2112.08596.</p>
      <p>[29] Y. Wang, J. Lin, Z. Yu, W. Hu, B. F. Karlsson, Open-world story generation with structured knowledge enhancement: A comprehensive survey, 2023. arXiv:2212.04634.</p>
      <p>[30] P. Xu, M. Patwary, M. Shoeybi, R. Puri, P. Fung, A. Anandkumar, B. Catanzaro, MEGATRON-CNTRL: Controllable story generation with external knowledge using large-scale language models, 2020. arXiv:2010.00840.</p>
      <p>[31] A. Ye, C. Cui, T. Shi, M. O. Riedl, Neural story planning, arXiv preprint arXiv:2212.08718 (2022).</p>
      <p>[32] F. Harary, Graph Theory, Addison-Wesley, Reading, MA, USA, 1969.</p>
      <p>[33] L. Zheng, W.-L. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. Xing, et al., Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, arXiv preprint arXiv:2306.05685 (2023).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>D.</given-names>
            <surname>Bordwell</surname>
          </string-name>
          ,
          <article-title>The musical analogy</article-title>
          ,
          <source>Yale French Studies</source>
          (
          <year>1980</year>
          )
          <fpage>141</fpage>
          -
          <lpage>156</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Bal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Van Boheemen</surname>
          </string-name>
          ,
          <article-title>Narratology: Introduction to the theory of narrative</article-title>
          , University of Toronto Press,
          <year>2009</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>H. P.</given-names>
            <surname>Abbott</surname>
          </string-name>
          , Narrative and life, The Cambridge Introduction to Narrative (
          <year>2002</year>
          )
          <fpage>1</fpage>
          -
          <lpage>12</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Buys</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>The curious case of neural text degeneration</article-title>
          , arXiv preprint arXiv:1904.09751 (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Baktash</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dawodi</surname>
          </string-name>
          ,
          <article-title>Gpt-4: A review on advancements and opportunities in natural language processing</article-title>
          ,
          <year>2023</year>
          . arXiv:2305.03195.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>