<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>FlintstonesSV++ : Improving Story Narration using Visual Scene Graph</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Janak Kapuriya</string-name>
          <email>janakkumar.kapuriya@insight-centre.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paul Buitelaar</string-name>
          <email>paul.buitelaar@universityofgalway.ie</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Insight Research Ireland Centre for Data Analytics, Data Science Institute, University of Galway</institution>
          ,
          <country country="IE">Ireland</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Recent advancements in text-to-image, text-to-video, and large language models have significantly enhanced the performance of various downstream tasks. In the field of Story Visualization, models have been developed to generate coherent image sequences from storylines composed of multiple scenes. These innovations have largely relied on benchmark datasets such as FlintstonesSV and PororoSV, which provide essential resources for tasks like Story Visualization and Story Continuation. However, our analysis identifies several limitations in the FlintstonesSV dataset that restrict the performance of models trained on it. To address these limitations, we introduce FlintstonesSV++, an enhanced version of the FlintstonesSV dataset. FlintstonesSV++ leverages visual Scene Graphs and Large Language Models to enrich storylines with factual details, further validated by human reviewers. By fine-tuning text-to-story generation models on FlintstonesSV++, we demonstrate substantial improvements, achieving a 5.2% average increase in alignment scores and a 5.72% boost in image generation quality compared to models trained on the original dataset. Moreover, a qualitative comparative analysis highlights the superior performance of FlintstonesSV++ compared to the original dataset. The FlintstonesSV++ dataset marks a significant advancement in enabling tasks such as Story Visualization and Story Continuation. To support further research in story-based visual content generation, we make the code and dataset publicly available.</p>
      </abstract>
      <kwd-group>
        <kwd>Story Narrative Generation</kwd>
        <kwd>Visual Scene Graphs</kwd>
        <kwd>Dataset Improvement</kwd>
        <kwd>Storyline Visualization</kwd>
        <kwd>Narrative Resources</kwd>
        <kwd>Large Language Models</kwd>
        <kwd>Large Multimodal Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advancements in text-to-image generation have been driven by high-quality models such as
DALL-E 3 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], SDXL [2], Imagen-3 [3], and Stable Diffusion-3 [4]. These models excel in
generating photorealistic images, which has advanced the development of text-to-video generation models
like Sora [5], SVD [6], and Veo2 [7]. Real-world applications of these models include personalized
product advertisements, story visualization, educational content creation, and social media video
creation. Among these applications, story visualization transforms textual narratives into coherent visual
representations, bridging the gap between language and visual understanding. A related task, story
continuation, generates coherent image sequences based on a given storyline and an initial image.
Unlike traditional text-to-image or text-to-video generation, which typically focus on isolated descriptions,
story visualization and continuation require generated images to align with textual sentences while
maintaining consistency in characters and scenes. These tasks are further complicated by challenges
such as ensuring frame-to-frame coherence, capturing context, and addressing factual gaps in scene
descriptions.
      </p>
      <p>Recent progress in the story visualization domain highlights promising approaches. Grimm [8]
employs an auto-regressive method to establish temporal dependencies between image-sentence pairs.
Story-LDM [9] uses a visual memory module to maintain consistency across generations, while ARLDM
[10] applies text-to-image diffusion models for coherent image generation. Temporal-Story [11]
integrates a flow adapter and spatio-temporal attention to capture character movements across scenes.
These models have been evaluated on benchmark datasets like FlintstonesSV [12] and PororoSV [13],
designed for animated story visualization tasks. Despite their utility, the FlintstonesSV dataset has
limitations that hinder model performance in generating coherent and consistent scenes.</p>
      <p>Each sample in FlintstonesSV includes an image and a corresponding scene description, typically
covering basic details like the character’s name, activity, and setting. However, key elements essential
for a comprehensive understanding of the scene are often missing. Specifically, it lacks essential details,
such as character attributes, detailed background descriptions, precise character positioning, and
high-level objects along with their relationships to other objects and characters. Addressing these limitations
is vital for advancing story visualization and continuation tasks.</p>
      <p>To address the limitations of the FlintstonesSV dataset, we propose a Visual Scene Graph
(VSG)-based approach to enhance the factual accuracy of its scene descriptions. A VSG [14] extracts key
information from images, such as objects, attributes, and relationships, which serves as a foundation
for generating enriched scene descriptions using large language models. This process results in the
enhanced FlintstonesSV++ dataset, which significantly improves the performance of text-to-image
generation models in terms of alignment and generative quality. We validate our improvements through
expert evaluations of the visual scene graphs and provide qualitative comparisons between the original
FlintstonesSV and FlintstonesSV++ descriptions. Additionally, results from text-to-story generation
models fine-tuned to generate scene images from scene narratives further highlight the effectiveness of
our approach.</p>
      <p>Our key contributions in this paper are:
1. Visual Scene Graph Integration: We introduce a novel approach that leverages Visual Scene
Graphs to address the limitations of the FlintstonesSV dataset, enriching it with detailed factual
information, which is subsequently validated by human reviewers.
2. FlintstonesSV++ Dataset Creation: We develop an improved version of the FlintstonesSV
dataset, named FlintstonesSV++. This enhanced dataset combines Visual Scene Graphs with large
language model outputs to create more comprehensive and accurate scene descriptions.
3. Performance Improvements: Our experiments with FlintstonesSV++ demonstrate significant
enhancements in model performance. We observe an average 5.2% increase in scene description
alignment CLIP score and an average 5.72% improvement in text-to-story generation results across
various pretrained diffusion models.
4. Qualitative Comparative Analysis: We provide an in-depth qualitative comparison between
FlintstonesSV++ and the original FlintstonesSV dataset. This analysis highlights the superior
quality of our improved dataset, showcasing enhanced scene descriptions and more accurate
prediction results.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <sec id="sec-2-1">
        <title>2.1. Story Visualization</title>
        <p>The Story Visualization task focuses on generating sequences of visually coherent scenes from
multi-scene storylines. Earlier approaches predominantly relied on Generative Adversarial Networks (GANs)
to generate scene sequences [15, 16, 17, 18, 19, 20]. More recently, diffusion models have been
introduced for this task, showing promising results [9, 21, 10, 22, 11]. These methods often utilize common
benchmark datasets such as FlintstonesSV [12] and PororoSV [13]. However, we identified inherent
issues with the FlintstonesSV dataset. It lacks comprehensive background information, precise character
attributes, accurate positioning of characters, and detailed object descriptions, including their
relationships with other objects and characters. These limitations hinder the performance of downstream tasks,
such as story visualization, particularly when fine-tuning story visualization models.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Visual Scene Graphs</title>
        <p>The Visual Genome framework [14] extracts structured factual information from images by identifying
objects, attributes, and relationships within distinct image regions. These region-specific graphs are
combined into a unified scene graph, where objects are interconnected through specific relationships and
annotated with relevant attributes, effectively integrating factual information from images. Scene graphs
have shown substantial utility in various downstream tasks, including Visual Question Answering
[23, 24, 25], Visual Scene Reasoning [26, 27, 28], and Image Captioning [29, 30], leading to notable
improvements in task accuracy. In our work, we adopt a Visual Scene Graph (VSG) based approach to
address inherent issues in the original FlintstonesSV dataset. By utilizing VSGs, we aim to improve the
dataset’s quality by addressing gaps such as incomplete background information, imprecise character
attributes, and inadequate object relationships.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <sec id="sec-3-1">
        <title>3.1. FlintstonesSV Dataset and its Limitations</title>
        <p>The FlintstonesSV dataset is curated from the animated sitcom The Flintstones, featuring scenes
centered around seven main characters, each contributing to diverse interactions and scenarios. These
characters play a pivotal role in capturing the dynamics and relationships within the series. The
dataset is composed of 24,512 samples, divided into training, validation, and test sets comprising 20,132,
2,071, and 2,309 samples, respectively. Each sample consists of an image paired with a concise scene
description. The FlintstonesSV dataset has been widely used for tasks such as story visualization and
story continuation. Its alignment of images with corresponding scene descriptions makes it a valuable
resource for developing and evaluating models that integrate visual content with narrative generation.</p>
        <p>Example 1 (Figure 1). FlintstonesSV: Red color dino is in the yard looking at a stick. FlintstonesSV++: A red cartoon dinosaur with a long neck, tail, and standing on a grey stone path gazes at a brown pointed stick held by Fred near a tall tropical palm tree, while a grey stone wall stands behind it.</p>
        <p>Example 2 (Figure 1). FlintstonesSV: Betty and Wilma are in the kitchen. Betty is talking to Wilma. Wilma is cooking. FlintstonesSV++: In the primitive cave kitchen, Betty stands near Wilma who is cooking a large turkey in a blue stone pot on the stove. They are engaged in conversation.</p>
        <p>Example 3 (Figure 1). FlintstonesSV: Fred and Barney are standing on a sidewalk. Barney is speaking to Fred, while Fred listens silently with his hands on his hips. FlintstonesSV++: Fred, an orange adult male with his hands on his hips, stands near Barney who is speaking while wearing a scarf, both men are standing on the gray flat horizontal sidewalk next to a rough vertical stone wall.</p>
        <p>Despite its utility, the FlintstonesSV dataset has notable limitations. As illustrated in Figure 1, Example
1 highlights that FlintstonesSV captions fail to provide information about Dino and background elements,
such as the wall and palm trees. In Example 2, the original captions omit crucial details, including
the type of food being cooked, the utensil used, and its color. Similarly, in Example 3, the captions
neglect to describe character apparel, such as a scarf, and background elements like the wall, along
with their spatial relationships with the character. These gaps limit the dataset’s ability to capture the
complete essence of a story scene. Consequently, models trained on it often struggle with generating
or continuing stories that are contextually rich and detailed. The lack of critical visual information in
scene descriptions further hampers performance in tasks such as story generation and continuation.</p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Narrative Improvement Using Visual Scene Graphs</title>
        <p>Inspired by the Visual Genome framework [14], we adopt a similar methodology to enhance
text-to-story generation for the FlintstonesSV dataset. Using a visual genome-based approach, detailed
information about story scenes can be extracted from the FlintstonesSV dataset; capturing these key factual
details enriches the scene representation and addresses the limitations of the FlintstonesSV dataset. This
added detail is pivotal for story visualization and text-to-story generation, as it provides a deeper
understanding of the narrative context. Consequently, models can generate more coherent, contextually
accurate, and visually grounded stories, significantly improving the quality of their outputs.</p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Visual Scene Graph Human Evaluation</title>
        <p>To validate the accuracy of the Visual Scene Graphs (VSGs) extracted from story scenes, we conducted
a human evaluation with 7 annotators, all of whom are researchers in NLP: 3 PhD students, 3 research
staff, and 1 academic staff member. Each evaluator reviewed 10 randomly selected VSG
samples. The evaluation focused on three primary components of the generated VSGs: objects, attributes,
and relationships. Objects were assessed for accuracy and completeness in detection, attributes were
evaluated for their relevance and precision in describing the detected objects, and relationships were
analyzed for their contextual appropriateness and validity in representing object connections within
the scene. Each component was rated on a scale of 1 to 5, where 1 is the lowest and 5 is the highest.
The scoring guidelines were as follows: a score of 5 indicated perfection with no corrections required, 4
signified minor issues requiring some tweaking, 3 corresponded to major issues necessitating further
improvement, 2 reflected significant issues requiring major revisions, and 1 represented an invalid
component deemed beyond repair. The evaluation confirmed the quality of the generated VSGs,
which provide a more detailed and nuanced representation of the story scenes. These findings highlight the
potential of VSGs to improve scene understanding and reasoning in narrative contexts.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>This section outlines our three-stage methodology for improving the story scene narratives. The
subsequent sub-sections provide detailed descriptions of each stage. Section 4.1 explains the process
of VSG generation from scene images and descriptions using the Gemini multi-modal model. Section 4.2
discusses narrative scene generation using factual knowledge derived from the generated VSG. Finally,
Section 4.3 presents the fine-tuning of diffusion models for the text-to-story generation task.</p>
      <p>Figure 2. Overview of the three-stage pipeline. 1) Visual Scene Graph Generation: a scene image and its caption are converted into a visual scene graph, e.g., Wilma (female, red hair, speaking) sitting on a chair (teal, wooden), holding and using a knitting needle (long, thin, purple) to knit a cloth (beige), in front of a wall-mounted window (round, blue). 2) Scene Narrative Generation: via zero-shot prompting, Mistral 7B Instruct converts the graph and the original scene caption ("Wilma sits in a chair in the living room. She is in front of a window. She speaks to herself while holding a cloth she is knitting.") into the improved scene caption "A red-haired woman named Wilma sits on an old teal wooden chair with a stone-like finish, speaking as she knits a beige piece of cloth with a purple knitting needle in front of a round blue wall-mounted window." 3) Story Scene Generation: a Stable Diffusion encoder-decoder is fine-tuned with LoRA to predict the scene image from the scene caption.</p>
      <sec id="sec-4-1">
        <title>4.1. Visual Scene Graph (VSG) Generation</title>
        <p>To extract factual information in the form of a Visual Scene Graph of story scene images, as described in
Stage 1 of Figure 2, we utilize the generative capabilities of the pre-trained Large Vision-Language
Model (LVLM) Gemini-Flash [31]. This model generates a Visual Scene Graph by processing the scene
image along with its corresponding caption from the original FlintstonesSV dataset. Let I denote the
set of input scene images and C denote the set of corresponding captions, where each caption C ∈ C
is paired with an image I ∈ I. Let M represent Gemini-Flash as the pre-trained LVLM, and P(I, C)
indicate the handcrafted zero-shot prompt formed by combining an image I with its corresponding
caption C. For each pair (I, C) from the dataset, the visual scene graph is generated as follows:</p>
        <p>G_I = M(P(I, C)),
where G_I = (O_I, A_I, R_I) represents the visual scene graph for image I. Here, O_I denotes the set of
detected objects, A_I = {a_1, a_2, . . . , a_n} represents the attributes associated with each object (a_i for
o_i), and R_I = {r_ij} captures the relationships between objects o_i and o_j.</p>
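        <p>The resulting triple G_I = (O_I, A_I, R_I) can be held in a simple in-memory structure. The sketch below is illustrative only; the dataclass layout and the example values (taken loosely from Figure 1, Example 1) are assumptions, not the dataset's actual storage format.</p>

```python
# Illustrative sketch of the visual scene graph G_I = (O_I, A_I, R_I).
# The structure and example values are assumed for illustration only.
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    objects: list[str]                                 # O_I: detected objects o_i
    attributes: dict[str, list[str]]                   # A_I: attributes a_i for each o_i
    relationships: list[tuple[str, str, str]] = field(  # R_I: (o_i, relation, o_j)
        default_factory=list
    )


g = SceneGraph(
    objects=["dino", "stick", "palm tree", "stone wall"],
    attributes={"dino": ["red", "cartoon"], "stick": ["brown", "pointed"]},
    relationships=[("dino", "looking at", "stick"), ("stick", "near", "palm tree")],
)
```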
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Scene Narrative Generation</title>
        <p>To generate enriched scene narratives for the images in the FlintstonesSV dataset, as described in
Stage 2 of Figure 2, for each image I the corresponding VSG G_I extracted in Stage 1 is converted
into a natural language prompt P = F(G_I), where F(·) is a formatting function that structures the
VSG into an instruction that can be effectively processed by a large language model (LLM).</p>
        <p>The prompt P is then processed by the Mistral-7B Instruct [32] model M to generate the enriched
scene narrative E = M(P). These enriched narratives E enhance the original dataset by incorporating
additional details including backgrounds, character attributes, precise character positioning,
inter-object relationships, and the presence of other high-level objects. As a result, the enriched narratives
provide a more comprehensive context, improving the dataset’s utility for downstream tasks like story
visualization.</p>
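        <p>The formatting function F(·) is not specified in detail, so the sketch below assumes one plausible linearization: each object with its attributes, and each relationship, becomes one line of the LLM instruction. The function name and prompt wording are hypothetical.</p>

```python
# Hypothetical formatting function F(G_I): linearizes a visual scene graph
# into an instruction string for the LLM. The prompt wording is an assumption.
def format_vsg_prompt(objects, attributes, relationships):
    lines = ["Rewrite the following scene facts as one fluent scene description."]
    for obj in objects:
        attrs = ", ".join(attributes.get(obj, [])) or "no attributes"
        lines.append(f"Object: {obj} ({attrs})")
    for subj, rel, obj in relationships:
        lines.append(f"Relation: {subj} {rel} {obj}")
    return "\n".join(lines)


prompt = format_vsg_prompt(
    ["dino", "stick"],
    {"dino": ["red"], "stick": ["brown", "pointed"]},
    [("dino", "looking at", "stick")],
)
```

        <p>The returned string would then be passed to Mistral-7B Instruct as the prompt P.</p>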
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Story Scene Generation</title>
        <p>For story scene generation from narrative prompts, as shown in Stage 3 of Figure 2, we
leverage open-source pre-trained text-to-image diffusion models to generate images corresponding
to the provided story scene narratives. To tailor the model to our specific dataset, we fine-tune the
model using the LoRA [33] parameter-efficient fine-tuning technique. Let M represent the pre-trained
text-to-image diffusion model. The model takes as input a prompt P, which represents the story scene
narrative, and outputs a generated image I_pred, where:</p>
        <p>I_pred = M(P).</p>
        <p>During fine-tuning, the model M is optimized to minimize the Stable Diffusion loss L, which is
computed based on the denoising process within the diffusion framework. Specifically, for a latent
representation Z_0, a time step t, and noise ε, the Stable Diffusion loss is defined as:</p>
        <p>L = E_{Z_0, t, ε} [ ‖ε − ε_θ(Z_t, t)‖² ].</p>
        <p>In this equation, Z_t refers to the noisy latent variable at time t, and ε denotes the noise added during
the forward process. The term ε_θ(Z_t, t) represents the predicted noise at time t, estimated by the model
M, which is parameterized by θ, and ‖·‖² indicates the squared Euclidean norm.</p>
        <p>Once the model is fine-tuned, it is tested on unseen story narratives to generate the
corresponding visual scenes, demonstrating its capability to produce relevant story scenes.</p>
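        <p>The denoising objective above can be illustrated numerically. The stand-in below computes the squared-error term for toy noise vectors; it is a sketch of the loss only, not of Stable Diffusion itself, and the predicted values are arbitrary.</p>

```python
# Toy illustration of the denoising loss ||eps - eps_theta(Z_t, t)||^2,
# averaged over the latent dimensions. The "prediction" is a hard-coded stub.
def diffusion_loss(eps, eps_pred):
    assert len(eps) == len(eps_pred)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_pred)) / len(eps)


true_noise = [0.5, -1.0, 0.25]  # eps sampled in the forward process
predicted = [0.4, -0.9, 0.25]   # stub for the model output eps_theta(Z_t, t)
loss = diffusion_loss(true_noise, predicted)
```

        <p>During fine-tuning, gradients of this quantity with respect to θ drive ε_θ toward the true noise.</p>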
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Experiments and Results</title>
      <sec id="sec-5-1">
        <title>5.1. Experimental Setup</title>
        <p>We employed three open-source pre-trained text-to-image models—SDXL Base-1.0 [2], CompVis Stable
Diffusion V4, and Stable Diffusion-2 [34]—to evaluate the effectiveness of the FlintstonesSV++ dataset.
These models were fine-tuned using parameter-efficient methods, specifically the Low-Rank Adaptation
(LoRA) technique [33], for the scene generation task based on scene narratives. During fine-tuning, we
used a batch size of 8, trained the models for 10 epochs, and applied a cosine learning rate scheduler.
All other hyper-parameters were kept at their default values to ensure consistency in the evaluation
process.</p>
        <p>Prompts for VSG Extraction and Story Narrative Generation are given in the GitHub Repository - FlintstonesSV++.</p>
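        <p>The setup can be summarized as a small configuration sketch. Only the batch size, epoch count, and scheduler come from the text above; the remaining keys are illustrative assumptions.</p>

```python
# Hypothetical summary of the fine-tuning configuration; only batch_size,
# num_epochs, and lr_scheduler reflect values stated in the paper.
MODELS = ["SDXL Base-1.0", "CompVis Stable Diffusion V4", "Stable Diffusion-2"]

TRAIN_CONFIG = {
    "method": "LoRA",          # parameter-efficient fine-tuning
    "batch_size": 8,
    "num_epochs": 10,
    "lr_scheduler": "cosine",  # cosine learning-rate schedule
    # all other hyper-parameters kept at library defaults
}
```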
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Evaluation Metrics</title>
        <p>To evaluate the effectiveness of visual scene generation on the improved FlintstonesSV++ dataset compared
to the original FlintstonesSV dataset, we used two metrics: FID Score [35] and CLIP Score [36]. The FID
(Fréchet Inception Distance) measures the quality of generated images by comparing their feature
distributions with those of real images, using the Inception-V3 model. A lower FID score indicates
better image quality. The CLIP Score assesses how well the generated scene aligns with the story
narrative. It is computed by passing the generated image and the story narrative through the pre-trained
OpenAI CLIP model’s image and text encoders, respectively. The embeddings are compared using
cosine similarity, with a higher CLIP score indicating better alignment. We report the average FID and
CLIP scores over all samples in the test set.</p>
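        <p>The CLIP score reduces to a cosine similarity between two embeddings. The sketch below uses toy vectors in place of real CLIP encoder outputs.</p>

```python
import math


# Cosine similarity between an image embedding and a text embedding, as used
# for the CLIP score. The vectors here are toy stand-ins for CLIP outputs.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


image_emb = [0.2, 0.9, 0.1]  # toy image-encoder output
text_emb = [0.1, 0.8, 0.2]   # toy text-encoder output
clip_score = cosine_similarity(image_emb, text_emb)
```

        <p>Identical embeddings give a score of 1 and orthogonal embeddings a score of 0; the reported metric averages this value over the test set.</p>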
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Quantitative Results</title>
        <p>This section presents the fine-tuning results on the FlintstonesSV++ and FlintstonesSV datasets using
three different text-to-image generation models. As shown in Table 2, the CLIP score improves by
6.23%, 4.85%, and 4.78% across the three models, indicating better alignment between the scene
captions and the generated images. On average, a 5.2% improvement in the CLIP score demonstrates the
effectiveness of the FlintstonesSV++ captions in enhancing downstream tasks. These captions enable
the generated scene images to align more accurately with their corresponding captions. Moreover, the
FID score decreases by 14%, 3%, and approximately 1% for the text-to-image generation task when using
FlintstonesSV++ compared to FlintstonesSV. While some models show only minor improvements in
the average FID score across the full test set, owing to their strong pre-training, their alignment
between generated images and scene narratives still improves significantly with FlintstonesSV++.
These enhancements in alignment and image generation make FlintstonesSV++ a superior choice over
FlintstonesSV for downstream tasks such as story visualization and story continuation.</p>
        <p>FlintstonesSV++ offers detailed scene narratives that capture not only the spatial positions of
characters and objects but also the relationships between them, which are crucial for accurate scene transitions.
These improvements address the limitations of FlintstonesSV, as evidenced by the quantitative results.</p>
      </sec>
      <sec id="sec-5-4">
        <title>5.4. Qualitative Results Analysis</title>
        <p>To evaluate the impact of the FlintstonesSV++ dataset with enhanced scene captions, we present the
results of scene images generated by fine-tuned Stable Diffusion models on a text-to-image generation
task. The model was trained to generate story scenes based on input narratives. As shown in Figure 3,
we compare three samples, each with its ground-truth image and the predicted images generated using
scene descriptions from both the FlintstonesSV and FlintstonesSV++ datasets. The FlintstonesSV results
reveal several limitations. In example (a), duplicate characters appear, and while multiple windows
are present, curtains are missing. In example (b), the background home and the characters’ emotional
expressions differ from the ground truth. In example (c), Betty and Wilma are not depicted on the
beach, and the umbrella is missing. The FlintstonesSV++ dataset addresses these shortcomings through its
VSG-enhanced captions (factual information added by the visual scene graph is highlighted in blue),
which provide more accurate and detailed cues. As a result, the generated scenes align better with the
input narratives. These improvements demonstrate the effectiveness of the FlintstonesSV++ dataset in
enhancing both story visualization and story-continuation tasks.</p>
        <p>Figure 3. Qualitative comparison of ground-truth images with scene images predicted from FlintstonesSV and FlintstonesSV++ captions for three examples. (a) FlintstonesSV: "Fred is laying on a couch in a room while Barney talks to Fred through a window." FlintstonesSV++: "Fred, a large orange and relaxed person, is lying on a small stone-like couch in a white cave-like room. Barney, a small blonde man, stands nearby talking to Fred while looking through an oval stone window with long orange curtains." (b) FlintstonesSV: "Fred is walking down the sidewalk, talking to someone off camera right." FlintstonesSV++: "Fred, a brown, medium-sized man with an orange tunic, walks sadly along a rough grey stone wall in the Stone Age. He passes by a small light green house with a purple decorative plant near it, while a short grey stone wall is next to the sidewalk he's walking on." (c) FlintstonesSV: "Betty and Wilma are sitting on the beach. Wilma is talking to Betty. Betty is listening to what Wilma is saying." FlintstonesSV++: "Two women, Wilma with red hair and Betty with black hair, are sitting on the sandy beach under the umbrella, engaged in conversation. They enjoy the sea view, with their beach mats beneath their folded chairs."</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>In this paper, we introduced FlintstonesSV++, an improved version of the FlintstonesSV dataset,
enhanced using Visual Scene Graphs (VSGs) and Large Language Models (LLMs). Our approach enriches
the dataset by incorporating factual information that was previously absent, making it more suitable
for benchmarking downstream tasks such as story visualization and story continuation. Experimental
results demonstrate that FlintstonesSV++ achieves superior performance in the text-to-story generation
task, highlighting the effectiveness of our enhancements. These findings establish FlintstonesSV++ as
a successor to FlintstonesSV, offering richer and more detailed scene descriptions through VSGs and LLMs,
thereby improving its utility in narrative-based AI applications.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Limitations</title>
      <p>Our dataset construction leverages pre-trained models, specifically the Gemini-Flash Large
Vision-Language Model for extracting visual scene graphs from images and the Mistral LLM for generating
scene stories based on these graphs. The accuracy of the extracted scene graphs and generated stories
is inherently constrained by the pre-trained capabilities of these models. Since these models are utilized
in a zero-shot prompting setting, their outputs may exhibit biases present in their training data. Despite
these limitations, our experimental results demonstrate significant improvements over the original
dataset. Additionally, in this paper we experiment with diffusion models for per-frame story scene visualization
to showcase the effectiveness of the enriched scene narratives.</p>
    </sec>
    <sec id="sec-8">
      <title>Acknowledgement</title>
      <p>We thank the anonymous reviewers for their insights on this work. This publication has
emanated from research conducted with the financial support of Research Ireland under Grant Number
SFI/12/RC/2289_P2 (Insight), co-funded by the European Regional Development Fund.
[2] D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, R. Rombach,
SDXL: Improving latent diffusion models for high-resolution image synthesis, arXiv preprint
arXiv:2307.01952 (2023).
[3] J. Baldridge, J. Bauer, M. Bhutani, N. Brichtova, A. Bunner, K. Chan, Y. Chen, S. Dieleman, Y. Du,</p>
      <p>Z. Eaton-Rosen, et al., Imagen 3, arXiv preprint arXiv:2408.07009 (2024).
[4] P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. Müller, H. Saini, Y. Levi, D. Lorenz, A. Sauer, F. Boesel,
et al., Scaling rectified flow transformers for high-resolution image synthesis, in: Forty-first
International Conference on Machine Learning, 2024.
[5] Y. Qin, Z. Shi, J. Yu, X. Wang, E. Zhou, L. Li, Z. Yin, X. Liu, L. Sheng, J. Shao, et al., Worldsimbench:</p>
      <p>Towards video generation models as world simulators, arXiv preprint arXiv:2410.18072 (2024).
[6] A. Blattmann, T. Dockhorn, S. Kulal, D. Mendelevitch, M. Kilian, D. Lorenz, Y. Levi, Z. English,
V. Voleti, A. Letts, et al., Stable video diffusion: Scaling latent video diffusion models to large
datasets, arXiv preprint arXiv:2311.15127 (2023).
[7] A. Sharma, A. Yu, A. Razavi, A. Toor, A. Pierson, A. Gupta, A. Waters, A. van den Oord, D. Tanis,
D. Erhan, E. Lau, E. Shaw, G. Barth-Maron, G. Shaw, H. Zhang, H. Nandwani, H. Moraldo, H. Kim,
I. Blok, J. Bauer, J. Donahue, J. Chung, K. Mathewson, K. David, L. Espeholt, M. van Zee, M. McGill,
M. Narasimhan, M. Wang, M. Bińkowski, M. Babaeizadeh, M. T. Saffar, N. de Freitas, N. Pezzotti,
P.-J. Kindermans, P. Rane, R. Hornung, R. Riachi, R. Villegas, R. Qian, S. Dieleman, S. Zhang, S. Cabi,
S. Luo, S. Fruchter, S. Nørly, S. Srinivasan, T. Pfaff, T. Hume, V. Verma, W. Hua, W. Zhu, X. Yan,
X. Wang, Y. Kim, Y. Du, Y. Chen, Veo (2024). URL: https://deepmind.google/technologies/veo/.
[8] C. Liu, H. Wu, Y. Zhong, X. Zhang, Y. Wang, W. Xie, Intelligent grimm - open-ended visual
storytelling via latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2024, pp. 6190–6200.
[9] T. Rahman, H.-Y. Lee, J. Ren, S. Tulyakov, S. Mahajan, L. Sigal, Make-a-story: Visual memory
conditioned consistent story generation, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition, 2023, pp. 2493–2502.
[10] X. Pan, P. Qin, Y. Li, H. Xue, W. Chen, Synthesizing coherent story with auto-regressive latent
diffusion models, in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer
Vision (WACV), 2024, pp. 2920–2930.
[11] S. Zheng, Y. Fu, Temporalstory: Enhancing consistency in story visualization using spatial-temporal
attention, arXiv e-prints (2024) arXiv–2407.
[12] T. Gupta, D. Schwenk, A. Farhadi, D. Hoiem, A. Kembhavi, Imagine this! scripts to compositions
to videos, 2018. URL: https://arxiv.org/abs/1804.03608. arXiv:1804.03608.
[13] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, J. Gao, Storygan: A sequential
conditional gan for story visualization, in: Proceedings of the IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), 2019.
[14] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A.
Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image
annotations, International journal of computer vision 123 (2017) 32–73.
[15] B. Li, Word-level fine-grained story visualization, in: European Conference on Computer Vision,
Springer, 2022, pp. 347–362.
[16] Y. Li, Z. Gan, Y. Shen, J. Liu, Y. Cheng, Y. Wu, L. Carin, D. Carlson, J. Gao, Storygan: A sequential
conditional gan for story visualization, in: Proceedings of the IEEE/CVF conference on computer
vision and pattern recognition, 2019, pp. 6329–6338.
[17] A. Maharana, M. Bansal, Integrating visuospatial, linguistic, and commonsense structure into
story visualization, in: M.-F. Moens, X. Huang, L. Specia, S. W.-t. Yih (Eds.), Proceedings of
the 2021 Conference on Empirical Methods in Natural Language Processing, Association for
Computational Linguistics, Online and Punta Cana, Dominican Republic, 2021, pp. 6772–6786. URL:
https://aclanthology.org/2021.emnlp-main.543/. doi:10.18653/v1/2021.emnlp-main.543.
[18] A. Maharana, D. Hannan, M. Bansal, Improving generation and evaluation of visual stories via
semantic consistency, in: K. Toutanova, A. Rumshisky, L. Zettlemoyer, D. Hakkani-Tur, I. Beltagy,
S. Bethard, R. Cotterell, T. Chakraborty, Y. Zhou (Eds.), Proceedings of the 2021 Conference of
the North American Chapter of the Association for Computational Linguistics: Human Language
Technologies, Association for Computational Linguistics, Online, 2021, pp. 2427–2442. URL: https:
//aclanthology.org/2021.naacl-main.194/. doi:10.18653/v1/2021.naacl-main.194.
[19] Y.-Z. Song, Z. Rui Tam, H.-J. Chen, H.-H. Lu, H.-H. Shuai, Character-preserving coherent story
visualization, in: European Conference on Computer Vision, Springer, 2020, pp. 18–33.
[20] A. Maharana, D. Hannan, M. Bansal, Storydall-e: Adapting pretrained text-to-image transformers
for story continuation, in: European Conference on Computer Vision, Springer, 2022, pp. 70–87.
[21] X. Shen, M. Elhoseiny, Large language models as consistent story visualizers, arXiv preprint
arXiv:2312.02252 (2023).
[22] W. Wang, C. Zhao, H. Chen, Z. Chen, K. Zheng, C. Shen, Autostory: Generating diverse storytelling
images with minimal human efforts, International Journal of Computer Vision (2024) 1–22.
[23] T. Qian, J. Chen, S. Chen, B. Wu, Y.-G. Jiang, Scene graph refinement network for visual question
answering, IEEE Transactions on Multimedia 25 (2022) 3950–3961.
[24] V. Damodaran, S. Chakravarthy, A. Kumar, A. Umapathy, T. Mitamura, Y. Nakashima, N. Garcia,
C. Chu, Understanding the role of scene graphs in visual question answering, arXiv preprint
arXiv:2101.05479 (2021).
[25] M. Hildebrandt, H. Li, R. Koner, V. Tresp, S. Günnemann, Scene graph reasoning for visual question
answering, arXiv preprint arXiv:2007.01072 (2020).
[26] J. Shi, H. Zhang, J. Li, Explainable and explicit visual reasoning over scene graphs, in: Proceedings
of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 8376–8384.
[27] H. Tian, N. Xu, A.-A. Liu, C. Yan, Z. Mao, Q. Zhang, Y. Zhang, Mask and predict: Multi-step
reasoning for scene graph generation, in: Proceedings of the 29th ACM International Conference
on Multimedia, 2021, pp. 4128–4136.
[28] Z. Wang, H. You, L. H. Li, A. Zareian, S. Park, Y. Liang, K.-W. Chang, S.-F. Chang, Sgeitl: Scene
graph enhanced image-text learning for visual commonsense reasoning, in: Proceedings of the
AAAI conference on artificial intelligence, volume 36, 2022, pp. 5914–5922.
[29] X. Li, S. Jiang, Know more say less: Image captioning based on scene graphs, IEEE Transactions
on Multimedia 21 (2019) 2117–2130.
[30] Y. Zhong, L. Wang, J. Chen, D. Yu, Y. Li, Comprehensive image captioning via scene graph
decomposition, in: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK,
August 23–28, 2020, Proceedings, Part XIV 16, Springer, 2020, pp. 211–229.
[31] G. Team, P. Georgiev, V. I. Lei, R. Burnell, L. Bai, A. Gulati, G. Tanzer, D. Vincent, Z. Pan, S. Wang,
et al., Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context, arXiv
preprint arXiv:2403.05530 (2024).
[32] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. d. l. Casas, F. Bressand,
G. Lengyel, G. Lample, L. Saulnier, et al., Mistral 7b, arXiv preprint arXiv:2310.06825 (2023).
[33] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, W. Chen, LoRA: Low-rank
adaptation of large language models, in: International Conference on Learning Representations,
2022. URL: https://openreview.net/forum?id=nZeVKeeFYf9.
[34] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-resolution image synthesis with
latent diffusion models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition (CVPR), 2022, pp. 10684–10695.
[35] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, S. Hochreiter, Gans trained by a two time-scale
update rule converge to a local nash equilibrium, Advances in neural information processing
systems 30 (2017).
[36] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin,
J. Clark, et al., Learning transferable visual models from natural language supervision, in:
International conference on machine learning, PMLR, 2021, pp. 8748–8763.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>J.</given-names>
            <surname>Betker</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ouyang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Guo</surname>
          </string-name>
          , et al.,
          <article-title>Improving image generation with better captions, Computer Science</article-title>
          . https://cdn.openai.com/papers/dalle-3
          <source>.pdf 2</source>
          (
          <year>2023</year>
          )
          <article-title>8</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>