<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Findings of the Shared Task on Multilingual Story Illustration: Bridging Cultures through AI Artistry (MUSIA)</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Krishna Tewari</string-name>
          <email>krishnatewari.rs.cse24@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anshita Malviya</string-name>
          <email>anshitamalviya.rs.cse23@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Supriya Chanda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Arjun Mukherjee</string-name>
          <email>arjunmukherjee.rs.cse23@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sukomal Pal</string-name>
          <email>spal.cse@iitbhu.ac.in</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Bennett University</institution>
          ,
          <addr-line>Greater Noida</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Indian Institute of Technology (BHU)</institution>
          ,
          <addr-line>Varanasi</addr-line>
          ,
          <country country="IN">INDIA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2026</year>
      </pub-date>
      <abstract>
<p>The Multilingual Story Illustration Shared Task (MUSIA), held under FIRE-2025, addresses the challenge of creating culturally grounded visual narratives for short stories in English and Hindi. As multimodal generative AI becomes more important in education, digital storytelling, and creative content creation, MUSIA offers the first benchmark focused on culturally accurate, multilingual story visualization. Eight teams signed up for the task; five submitted valid system runs, and four provided camera-ready papers. The systems used different strategies such as narrative segmentation, translation, summarization, prompt design, retrieval-augmented methods, and diffusion-based generation. Human evaluation assessed three areas: visual quality, relevance, and consistency. The results showed that pipelines combining LLM-based story understanding with diffusion models achieved the best outcomes, especially in creating visually coherent images. However, most systems had trouble maintaining narrative fidelity and consistency across panels. None used the culturally rich training illustrations offered in MUSIA, which led to a noticeable Westernization of the generated images. These results reveal ongoing limitations in current text-to-image models, particularly their struggle to accurately reflect Indian cultural elements such as regional clothing, landscapes, and folk designs. This paper discusses the MUSIA dataset, task formulation, team methods, and comparative results, laying the groundwork for creating multilingual, culturally aware story-illustration systems and guiding future research in inclusive multimodal generation.</p>
      </abstract>
      <kwd-group>
        <kwd>Multimodal storytelling</kwd>
        <kwd>Text-to-image generation</kwd>
        <kwd>Cultural representation</kwd>
        <kwd>Narrative illustration</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Recent advances in multimodal Artificial Intelligence (AI), driven by large language models (LLMs)
and diffusion-based text-to-image (T2I) generators, have greatly expanded the range of applications
needing joint reasoning over language and vision. Among these applications, story illustration has
become an important but underexplored task. It involves turning narrative passages into a series of
images that accurately reflect the story’s events, characters, and setting. While leading models perform
reasonably well on English story datasets from Western media [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ], they often struggle with narratives
that differ linguistically or culturally from their training data.
      </p>
      <p>This limitation is especially clear in Indian storytelling, which is multilingual and culturally rich.
Stories written in Hindi, English, Bengali, and other Indian languages often contain detailed cultural
cues, including traditional clothing, local architecture, regional landscapes, folk motifs, and idiomatic
expressions. However, diffusion models mainly trained on Western visuals frequently misrepresent
these elements. As a result, illustrations for Indian stories often show incorrect clothing, non-Indian
characters, or Western-style backgrounds, breaking cultural and narrative trust.</p>
      <p>
        Existing resources like VIST [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], PororoSV [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], FlintstonesSV [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], and OpenStory++ [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have
advanced story visualization, but they are limited to English narratives and Western imagery.
Consequently, models based on these datasets are not well-suited for the multilingual and culturally diverse
nature of Indian stories.
      </p>
      <p>To fill this gap, we introduce the FIRE-2025 Shared Task on Multilingual Story Illustration (MUSIA, https://cse-iitbhu.github.io/MUSIA/),
a new benchmark designed to evaluate systems that generate culturally appropriate illustrations for
short stories in Indian languages. This edition focuses on Hindi and English. Participants receive
training stories with example illustrations and must produce a specific number of images for each test
story indicated in a mapping file.</p>
      <p>The task presents several challenges. Systems must capture key moments in the narrative while
ensuring that characters, backgrounds, and visual styles remain consistent across multiple panels. At
the same time, they must incorporate culturally relevant features that reflect the linguistic and
regional context of the story. A fully fixed character template would make these requirements trivial to satisfy,
while unconstrained generation risks style drift and inconsistency. The MUSIA dataset offers diverse,
attribution-compliant illustrations to help guide models toward culturally faithful generation.</p>
      <p>Evaluation is entirely based on human assessment across three areas: relevance, consistency, and
visual quality. Reviewers rate each area using a three-level scale (Good, Moderate, Fair), allowing for
a detailed comparison of different modeling strategies in multilingual and multicultural settings.</p>
      <p>MUSIA thus establishes the first standardized framework for multilingual, culturally grounded story
illustration in the Indian context. By emphasizing both linguistic diversity and cultural authenticity,
this shared task aims to encourage the development of generative models that can better support
applications in education, children’s media, digital storytelling, and creative content production.</p>
      <p>The rest of the paper is structured as follows: Section 2 discusses related work; Section 3 describes
the dataset; Section 4 summarizes the participating systems' methodologies; Section 5 reports results and analysis; and
Section 6 concludes with key findings.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>Recent advances in vision language models have led to highly capable systems that can generate images
and interpret narratives from text. Earlier work mainly targeted single-image generation from short
descriptions, but newer approaches move toward visualizing entire stories from longer passages. This
section highlights key methods and datasets that underpin the MUSIA multilingual story illustration
track.</p>
      <sec id="sec-2-1">
        <title>2.1. Text-to-Image Generation</title>
        <p>
          Early work on text-conditioned image generation largely relied on GAN-based architectures, with an
emphasis on improving visual realism and semantic alignment for single images described by short
captions. DM-GAN, for example, augments the generator with a dynamic memory module that
repeatedly updates visual features using textual cues, leading to sharper images and better coverage of
fine-grained details in complex descriptions [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ]. Follow-up approaches move toward more structured
forms of conditioning: Make-A-Scene lets users provide a high-level scene layout or semantic map,
which is then fused with the input text so that the generated images adhere more closely to spatial
structure and human-specified priors [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>
          In parallel, another thread of research scales text-to-image models using large datasets and
autoregressive formulations. Zero-shot systems such as DALL·E cast image synthesis as predicting sequences
of discrete visual tokens conditioned on a textual prompt, and show strong compositional
generalization without any task-specific fine-tuning [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. Autoregressive transformers trained on web-scale
image–text corpora further increase diversity and richness of generated content, illustrating that sheer
scale can bridge much of the performance gap relative to supervised models [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. These advances are
underpinned by stronger visual and multimodal encoders, including Vision Transformers [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and
contrastive vision–language pretraining frameworks like CLIP [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ], which supply robust priors for
aligning images and text.
        </p>
        <p>
          More recently, diffusion-based methods have become the predominant choice for high-quality
image synthesis. Foundational work on unconditional diffusion models shows that, with appropriate
parameterization, they can outperform GANs both in sample quality and in coverage of the data
distribution [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Building on this, GLIDE demonstrates that text-guided diffusion can produce
photorealistic images and supports flexible, prompt-driven editing via classifier-free guidance [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ]. Latent
diffusion further improves computational efficiency by running the diffusion process in a learned
latent space, enabling high-resolution generation at manageable cost [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ]. Large-scale systems such as
Imagen and related models extend these ideas with powerful language encoders and massive training
corpora, achieving strong zero-shot performance across standard text-to-image benchmarks [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ].
Complementary work on vector-quantized diffusion [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ] explores discrete latent representations, offering
alternative trade-offs in training stability and sampling speed.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Story Visualization and Visual Storytelling</title>
        <p>
          While the models discussed above are mainly designed for generating a single image from text, story
visualization has the additional requirement of preserving coherence across a series of images driven
by a longer narrative. StoryGAN casts this task as sequential conditional generation, using a story
encoder together with a recurrent generator so that each panel reflects not only the current sentence but
also prior context across the story [
          <xref ref-type="bibr" rid="ref1 ref18">18, 1</xref>
          ]. PororoGAN extends this idea by strengthening
character-specific representations and improving temporal consistency on the Pororo-SV benchmark [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
Moving beyond these early GAN-based solutions, StoryDALL-E leverages large pretrained text-to-image
transformers for story continuation: each frame is conditioned on the current sentence as well as a
compact summary of earlier frames, which substantially enhances character and style consistency over
longer sequences [
          <xref ref-type="bibr" rid="ref19 ref20">19, 20</xref>
          ]. More recent character-preserving methods explicitly model visual plans and
token-level alignments to keep track of entities across panels [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ].
        </p>
        <p>
          Advances in story visualization are tightly coupled with the availability of dedicated datasets and
benchmarks. Pororo-SV and FlintstonesSV provide short, cartoon-like narratives with aligned frame
sequences that are well suited for studying story-level generation [
          <xref ref-type="bibr" rid="ref18 ref22 ref4">18, 4, 22</xref>
          ]. FlintstonesSV++
further enriches this setting with additional annotations and visual scene graphs, enabling more detailed
reasoning about objects and their interactions [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]. Complementary resources such as the Visual
Storytelling Dataset (VIST) pair real-world photo streams with human-written stories, supporting research
on aligning visual and textual narratives [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. OpenStory++ extends this direction to large-scale,
open-domain, instance-aware visual storytelling with more diverse story types and visual content [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Taken
together, these datasets underscore the need to evaluate both image quality and narrative coherence,
but they largely focus on English and on relatively narrow visual domains (for example, specific
cartoons or TV shows).
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Multimodal Models for Narrative Understanding and Generation</title>
        <p>
          Beyond pure image synthesis, a line of work studies multimodal models that tell stories from visual
inputs and jointly reason over images and text. Narrative generation frameworks for image sequences
aim to produce flowing, coherent stories that unfold over time, instead of treating each frame as an
isolated captioning problem [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ]. To better enforce temporal consistency, some approaches introduce
visual coherence losses for image-based story generation, discouraging sudden shifts in meaning or style
between adjacent sentences [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ]. At a larger scale, multimodal language models such as Flamingo [26]
bring few-shot generalization to image–text tasks by conditioning a strong language model on
interleaved visual and textual tokens. On the purely textual side, GPT-3 shows that large language
models can handle a wide range of narrative and generative tasks with minimal examples [27], and later
work adapts such models to storytelling by instruction-tuning and extending them to multimodal
settings [28].
        </p>
        <p>
          These advances rely heavily on general-purpose visual and multimodal backbones. Architectures
like Vision Transformers [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] and contrastively trained vision–language models such as CLIP [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] offer
robust shared representations for images and text, which many narrative generation systems reuse as
frozen encoders or as building blocks within larger pipelines. Still, most existing multimodal narrative
systems are designed for a single language or a limited set of visual domains, and they rarely tackle
the challenge of maintaining consistent characters and style across long story arcs.
        </p>
        <p>
          Taken together, prior work provides strong building blocks for text-to-image generation, story
visualization, and multimodal narrative modeling. However, there is still a lack of benchmarks that
explicitly target multilingual, culturally diverse story illustration, where models must generate a sequence
of panels that are both narratively appropriate and visually consistent. Datasets such as Pororo-SV,
FlintstonesSV, VIST, and OpenStory++ [
          <xref ref-type="bibr" rid="ref18 ref22 ref23 ref3 ref4 ref6">18, 4, 22, 3, 6, 23</xref>
          ] tend to focus on one language, a specific
visual style, or do not place strong emphasis on fine-grained character consistency in illustrated stories.
The MUSIA track addresses this gap by casting story illustration as a multilingual, story-level
text-to-image task rooted in children’s storybooks, with evaluation criteria that deliberately stress relevance,
coherence, and visual quality across the full narrative.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>The MUSIA dataset is constructed from openly licensed storybooks and digital archives that provide
narrative text and illustrations under permissive terms (CC-BY or Public Domain). Our sources span
educational repositories, community-driven children’s literature platforms, and curated folk-story
collections. This diversity enables the dataset to capture a wide spectrum of Indian storytelling traditions,
including classic folktales, moral narratives, fables, and contemporary short fiction. Each candidate
story undergoes manual verification to confirm that its licensing permits redistribution and that both
its textual and visual content are appropriate for child-oriented contexts.</p>
      <p>For every story that satisfies these criteria, we compile the complete narrative alongside all
corresponding illustrations. The training split is offered in two languages, English and Hindi. It is organized
into language-specific directories with separate Stories and Images folders. Story files follow a
consistent naming pattern:
• eng_story_XXXX for English stories, and
• hin_story_XXXX for Hindi stories,
where XXXX denotes a four-digit, zero-padded identifier (e.g., eng_story_0001). Images retain the
same numeric base and include an additional two-digit index indicating their order within the story
(e.g., eng_story_0001_01). All images are stored in standard formats such as .jpg, .jpeg, or .png.</p>
      <p>The final training set consists of 360 English stories and 185 Hindi stories, each linked with its full
set of illustrations. The test set contains 40 English and 30 Hindi stories. However, only the story text
and a mapping file specifying the required number of illustrations per story are provided, ensuring
that evaluation proceeds in a completely generative manner without reference images.</p>
      <p>To maintain textual quality and consistency, all stories were normalized to UTF-8 encoding, cleaned
to remove stray symbols or formatting artifacts, and filtered to eliminate non-narrative content such as
author notes, advertisements, and extraneous metadata. Paragraph structure was preserved to
maintain the narrative flow and to facilitate downstream tasks such as scene segmentation and panel-level
generation.</p>
      <p>For the visual component, illustrations were kept at their original resolution and aspect ratio to
preserve artistic fidelity. Images with severe noise, compression artifacts, low resolution, or intrusive
watermarks were removed through a mix of automatic checks and manual review. Each remaining
image was then cross-verified for relevance and narrative alignment to ensure that it accurately reflected
the corresponding portion of the story.</p>
      <p>(Table 1 excerpt, English: the tale of Gopal Bhand, court jester to Raja Krishna Chandra of Bengal, who tricks the learned Mahagyani Pandit into revealing his mother tongue by tripping him so that he cries out in it.)</p>
      <p>Quality control involved three bilingual annotators who independently assessed every text–image
pair for linguistic accuracy, cultural appropriateness, and narrative correspondence. Only pairs that
received unanimous approval were included in the final dataset, ensuring high standards of multimodal
coherence and cultural integrity.</p>
      <p>(Table 2 excerpt, Hindi, translated: a young girl's first day of school; she walks to the gate holding her mother's hand, hesitates among unfamiliar faces, and finally goes in on her own after her mother promises to be waiting when school lets out.)</p>
      <p>Table 1 and Table 2 present representative examples from the MUSIA-2025 English and Hindi datasets,
respectively, illustrating the alignment between narrative passages and their associated illustrations.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>Eight teams registered to participate in the MUSIA-2025 shared task, indicating strong engagement
with the challenge of multilingual story illustration. As the evaluation stage drew near, five of these
teams were able to submit valid system runs for review, and four continued all the way through to
prepare and submit camera-ready papers that carefully presented their approaches and results.</p>
      <p>The team NandiniDivya [29] builds on the One-Prompt-One-Story (1Prompt1Story) framework,
using it as a training-free backbone to generate consistent illustrations for each story. Their pipeline
begins by putting all the text into a common form: English stories are used as they are, while Hindi
stories are first translated into English with a large language model so that the rest of the process runs
in a single language. The MUSIA mapping file is then consulted to find out how many images are
needed per story, and the LLM is asked to rewrite the narrative into exactly that many short,
scene-wise descriptions. Instead of treating these scenes separately, the team stitches all of them together
into one long, comma-separated prompt, so that character details and overall plot context are visible
to the model at once rather than in isolation. This combined prompt is fed into the 1Prompt1Story
framework, which augments a diffusion-based generator (Playground AI) with components such as
Singular-Value Reweighting and Identity-Preserving Cross-Attention to stabilise character identity and
important visual cues across the full sequence, while still allowing natural changes in viewpoint and
background. From this shared prompt, the system produces one image for each scene, resulting in a
run of illustrations that stay aligned with the story and look coherent as a set, all without any extra
fine-tuning on the MUSIA dataset.</p>
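      <p>A minimal sketch of the prompt-stitching step is given below; the scene descriptions and the downstream call into the 1Prompt1Story generator are placeholders rather than the team's actual prompts or configuration.</p>
      <preformat>
def build_combined_prompt(scene_descriptions):
    """Join per-scene descriptions into one comma-separated prompt so that character
    details and plot context are visible to the generator all at once."""
    return ", ".join(desc.strip().rstrip(".") for desc in scene_descriptions)

# Hypothetical scene descriptions produced by the LLM rewriting step
scenes = [
    "a witty court jester stands before a king in a Bengali royal court",
    "a proud scholar in silk clothes answers questions from amazed courtiers",
    "the jester trips the scholar near a garden hedge on freshly watered ground",
]
combined_prompt = build_combined_prompt(scenes)
# combined_prompt would then be passed once to the 1Prompt1Story framework,
# which renders one image per scene from this single shared prompt.
print(combined_prompt)
      </preformat>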
      <p>The NLPFusion team [30] designs a hybrid pipeline that combines story understanding with
diffusion-based image generation to address the MUSIA shared task. Their system handles each English or Hindi
story in three broad steps: preparing the text, summarising the narrative, and finally creating the
illustrations. To begin with, the raw story is cleaned to remove unwanted symbols and formatting, then
broken into sentences and grouped into consecutive chunks so that the number of “mini-stories” lines
up with the number of images required for that story. Each chunk is meant to capture one scene that
deserves its own illustration. These chunks are then fed into a T5-large abstractive summarisation model,
run with fixed length constraints and deterministic decoding, to produce short scene descriptions that
focus on the main actions and characters while avoiding unnecessary repetition. These summaries
serve as the core of the image prompts. In the last stage, the team uses the Stable Diffusion XL (SDXL
1.0) model, running in 16-bit precision on a GPU, to generate the images. A fixed cartoon-style prefix
is added to every summary to keep the visual look consistent, and the combined prompt is passed to
the SDXL pipeline to produce one PNG image per scene, stored in language-specific folders with basic
logging and error handling for smooth batch processing. Overall, this step-by-step setup converts raw
multilingual stories into a set of visually consistent, story-faithful illustrations tailored to the MUSIA
evaluation.</p>
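      <p>A condensed sketch of this pipeline is shown below, assuming the standard transformers and diffusers interfaces; the chunking heuristic, decoding settings, and cartoon-style prefix are illustrative rather than the team's exact configuration.</p>
      <preformat>
import torch
from transformers import pipeline
from diffusers import StableDiffusionXLPipeline

# Abstractive summariser used to turn each story chunk into a scene description.
summarizer = pipeline("summarization", model="t5-large")

# SDXL generator loaded in 16-bit precision.
sdxl = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

STYLE_PREFIX = "children's storybook cartoon illustration, "  # illustrative style prefix

def illustrate_story(sentences, num_images):
    """Group sentences into num_images chunks, summarise each chunk deterministically,
    and render one image per chunk with a fixed style prefix."""
    chunk_size = max(1, len(sentences) // num_images)
    chunks = [" ".join(sentences[i:i + chunk_size])
              for i in range(0, len(sentences), chunk_size)][:num_images]
    images = []
    for chunk in chunks:
        summary = summarizer(chunk, max_length=60, min_length=15, do_sample=False)[0]["summary_text"]
        images.append(sdxl(prompt=STYLE_PREFIX + summary).images[0])
    return images
      </preformat>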
      <p>The team Retriever [31] approaches the MUSIA shared task with a purely zero-shot strategy, relying
on large foundation models and prompt engineering instead of training any custom models. They work
directly on the official MUSIA test set, which contains 39 English and 30 Hindi stories, each annotated
with the exact number of images to be generated. The same pipeline is applied to both languages
by making use of Gemini-2.5-Pro’s multilingual capabilities. For each story, Gemini is first asked to
produce a single high-level “system prompt” that defines a child-friendly visual style for the entire
story, and then to write one detailed image prompt for every required illustration, each capturing a
specific scene or turning point in the narrative. A carefully crafted meta-prompt guides Gemini using
structured tags for aspects such as style, colour, shading, texture, character persistence, and framing,
and instructs it to return the outputs in a Python dictionary format for easy downstream processing.
In the generation phase, the global system prompt is prepended to each scene-level prompt, so that
all images share a common artistic look and consistent character depiction while still reflecting the
unique content of each scene. These combined prompts are then passed to Google’s Imagen-4.0-Ultra
model to create the final images, yielding story-wise sequences that aim to stay faithful to the text and
visually coherent across all frames.</p>
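      <p>The structure of this two-level prompting scheme can be sketched roughly as follows. The tag set and meta-prompt wording are illustrative, and the Gemini and Imagen calls are represented by placeholder function names (call_gemini, call_imagen) rather than real API code.</p>
      <preformat>
# Illustrative meta-prompt skeleton: ask the LLM for one story-level style prompt plus
# one detailed prompt per required illustration, returned as a Python dict.
META_PROMPT = """You are illustrating a children's story.
Return a Python dict with the keys 'system_prompt' and 'scene_prompts'.
[style] child-friendly storybook illustration [/style]
[colour] warm, saturated palette [/colour]
[character_persistence] keep every named character's appearance identical across scenes [/character_persistence]
The story below needs {n_images} illustrations, one per key scene or turning point.
{story_text}
"""

def build_image_prompts(llm_output):
    """Prepend the story-level system prompt to every scene-level prompt so all
    images share a common artistic look."""
    system_prompt = llm_output["system_prompt"]
    return [f"{system_prompt} {scene}" for scene in llm_output["scene_prompts"]]

# llm_output = call_gemini(META_PROMPT.format(n_images=5, story_text=story))   # placeholder call
# for prompt in build_image_prompts(llm_output):
#     image = call_imagen(prompt)   # placeholder for the Imagen-4.0-Ultra request
      </preformat>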
      <p>The team_meoooo [32] designs a multilingual story-to-image pipeline that treats Hindi and English
stories within a single, unified framework by combining translation, text segmentation,
summarisation, and diffusion-based image generation. Hindi stories are first translated into English using the
pretrained facebook/nllb-200-distilled-600M model, while English stories are used directly. The
resulting texts are cleaned to remove extra spaces, line breaks, and other formatting artefacts. The
cleaned story is then split into sentences and organised into smaller “sub-stories”. To do this, each
sentence is represented with a TF–IDF vector, cosine similarity scores are computed between
sentences, and sentences with high similarity are grouped together, with low similarity marking natural
breaks in the narrative. Because summarisation models have input length limits, these initial groups
are further adjusted based on token counts: each chunk is tokenised using NLTK’s word_tokenize,
tokens are redistributed so that segment lengths remain reasonably balanced, and the final segments
are checked against a 512-token threshold using the facebook/bart-large-cnn tokenizer, with only
occasional truncation of very long inputs. Each of these balanced segments is then summarised
using the facebook/bart-large-cnn abstractive model to produce short descriptions that highlight the
central events and characters in that part of the story. These summaries serve as prompts for the
stable-diffusion-xl-base-1.0 model, which generates one image per segment, aiming to produce a
sequence of illustrations that both follow the narrative flow and maintain a coherent visual style across
the entire story.</p>
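      <p>A condensed sketch of the similarity-based segmentation step is shown below, using NLTK and scikit-learn; the similarity threshold is illustrative, and the subsequent token-balancing, BART summarisation, and SDXL generation stages are only indicated in comments.</p>
      <preformat>
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

nltk.download("punkt", quiet=True)

def segment_story(text, threshold=0.15):
    """Split a story into sub-stories by grouping adjacent sentences whose TF-IDF
    cosine similarity stays above a threshold; low similarity marks a scene break."""
    sentences = nltk.sent_tokenize(text)
    if len(sentences) in (0, 1):
        return [text]
    tfidf = TfidfVectorizer().fit_transform(sentences)
    similarities = cosine_similarity(tfidf)
    segments, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if similarities[i - 1, i] >= threshold:
            current.append(sentences[i])
        else:  # natural break in the narrative
            segments.append(" ".join(current))
            current = [sentences[i]]
    segments.append(" ".join(current))
    return segments

# Each segment would then be length-balanced, summarised with facebook/bart-large-cnn,
# and the summary used as the prompt for one SDXL image.
      </preformat>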
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>The evaluation process followed a structured human-annotation protocol designed to capture both
visual and narrative fidelity. Each system’s outputs were examined by trained annotators who assessed
every image independently along three dimensions. Visual Quality measured clarity, realism, absence
of distortions, and overall aesthetic coherence. Relevance captured how well the image reflected the
textual description of the specific story segment, focusing on attributes such as depicted actions,
objects, characters, and contextual cues. Consistency evaluated whether images belonging to the same
story maintained stable character appearances, settings, and thematic progression. All judgments were
made using a uniform three-point scale (Good, Moderate, and Fair), providing interpretable and
comparable cross-team performance indicators. This evaluation framework ensured that systems were not
rewarded solely for generating attractive images but were also assessed on their ability to respect
narrative structure, linguistic cues, and multi-panel continuity.</p>
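      <p>For illustration, per-criterion tallies of the kind reported in Tables 3 and 4 can be computed from raw annotations with a simple counter; the annotation record format below is an assumption, not the organisers' actual data structure.</p>
      <preformat>
from collections import Counter

RATINGS = ("Good", "Moderate", "Fair")
CRITERIA = ("visual_quality", "relevance", "consistency")

def tally_ratings(annotations):
    """Count Good/Moderate/Fair judgments per criterion for one system run.
    Each annotation is assumed to look like
    {'visual_quality': 'Good', 'relevance': 'Moderate', 'consistency': 'Fair'}."""
    counts = {criterion: Counter() for criterion in CRITERIA}
    for annotation in annotations:
        for criterion in CRITERIA:
            counts[criterion][annotation[criterion]] += 1
    return {criterion: [counts[criterion][r] for r in RATINGS] for criterion in CRITERIA}

# Example:
# tally_ratings([{"visual_quality": "Good", "relevance": "Good", "consistency": "Moderate"}])
# returns {'visual_quality': [1, 0, 0], 'relevance': [1, 0, 0], 'consistency': [0, 1, 0]}
      </preformat>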
      <p>Human evaluation scores for all systems submitted on English and Hindi stories are shown in Table 3
and Table 4. In English, Retriever-run1 clearly excels. All 39 outputs receive a Good rating for Visual
Quality. Its Relevance scores are also strong, with 34 Good and 5 Moderate ratings. Consistency is a
bit more challenging; 24 outputs are rated Good, 14 Moderate, and just 1 Fair. This indicates that this
system generally preserves story flow but occasionally loses track between panels. Team
NandiniDivya-run1 also achieves high Visual Quality with 36 Good and 2 Moderate ratings. However, its Relevance
and Consistency scores are more evenly spread between Moderate and Fair. This suggests that while
the images are visually appealing, they do not always connect tightly with the narrative.
Team_meoooo-run1 is in the middle: most images are rated Moderate rather than Good for Visual Quality (22 vs.
15). Relevance and Consistency are mostly rated Fair, indicating challenges in capturing nuances of
the story. The remaining systems, including NLPFusion runs and JU Team MCSE-1_run1, primarily
receive Fair ratings across all three criteria, suggesting they struggle with both basic visual quality and
narrative alignment in English stories.</p>
      <p>The Hindi evaluations show a similar pattern. Retriever-run1 again performs best, with all 30
outputs rated Good for both Visual Quality and Relevance. Consistency scores are slightly lower but
still strong, with 27 Good and 3 Moderate ratings. Team NandiniDivya-run1 matches Retriever-run1
in Visual Quality with 30 Good ratings. However, its other two criteria have a more varied profile:
Relevance is split across Good, Moderate, and Fair (10/15/5), and Consistency shows a 15/10/5
distribution. This reflects the English evaluations, where images are visually appealing, but sometimes
loosely connect to specific story sections. For Team_meoooo-run1, Visual Quality is mostly Good or
Moderate (19 and 11 ratings, respectively). Yet, Relevance and Consistency are mostly rated Fair. This
means the system often produces decent-looking images that do not closely match the Hindi text. The
NLPFusion systems again rank lower, with most outputs receiving Fair ratings across all metrics.</p>
      <p>Looking across teams and languages, three key trends emerge. First, visual quality appears to be
the easiest dimension to satisfy. Even comparatively weaker systems are often capable of producing
or retrieving images that are sharp, visually coherent, and free from obvious artifacts such as
blurriness, distortions, or malformed objects. In contrast, Relevance and especially Consistency are much
harder to meet since they require the model to focus on specific entities, actions, and story progression.
Second, Retriever-run1 stands out as the top-performing system overall and is also the most consistent
across both English and Hindi. Third, systems like NLPFusion and Team_meoooo-run1 reveal an
important gap: they can create visually appealing outputs but struggle to maintain a strong connection
to the text and ensure characters and scenes stay consistent across multiple panels. This gap is what
MUSIA aims to highlight, distinguishing systems that produce merely “nice” images from those that
effectively follow and understand the story’s context.</p>
    </sec>
    <sec id="sec-6">
      <title>6. Conclusion</title>
      <p>The MUSIA-2025 shared task highlights the growing need for generative systems that go beyond
creating visually appealing images. These systems should focus on narrative accuracy and cultural
relevance. Eight teams registered at first, but only five submitted valid system runs, and just four completed
their final papers. Overall, the systems showcased impressive visual quality, mainly due to strong
diffusion models and LLM-guided prompt engineering. However, they often fell short in narrative relevance
and consistency across panels, revealing ongoing issues in today’s multimodal generation setups. A
significant finding is that none of the teams used the culturally rich training illustrations from the
dataset. As a result, most generated images reflected Western stylistic biases. This points to a broader
issue with current text-to-image models, which are mainly trained on Western-centric image
collections and often struggle to depict Indian cultural elements, including traditional attire, regional settings,
indigenous motifs, and local storytelling practices. These observations reinforce MUSIA’s main goal:
to promote systems that not only understand multilingual story narratives but also create visuals that
truly represent cultural contexts. Future methods should incorporate the provided illustrations using
techniques like retrieval-augmented prompting, cultural style transfer, or prototype-based learning to
reduce bias. Expanding MUSIA to include more Indian languages and diverse narrative traditions could
further enhance story structures and visual diversity. Additionally, developing automated metrics for
cultural accuracy would support human evaluations of relevance, consistency, and quality. Promising
paths include multimodal fine-tuning using Indian illustration datasets and memory-enhanced
structures to maintain character identity across panels. Overall, MUSIA-2025 shows both the potential and
current limitations of multimodal generation systems, stressing the need for models that are more
culturally aware and context-sensitive in creative and educational settings.</p>
    </sec>
    <sec id="sec-7">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the authors used ChatGPT and Grammarly for grammar and
spelling checking and for paraphrasing and rewording. After using these tools, the authors reviewed and
edited the content as needed and take full responsibility for the publication’s content.</p>
      <p>of the 2022 Conference on Neural Information Processing Systems, 2022, pp. 2345–2356.
[26] J.-B. Alayrac, J. Donahue, P. Luc, A. Miech, I. Barr, Y. Hasson, K. Lenc, A. Mensch, K.
Millican, M. Reynolds, R. Ring, E. Rutherford, S. Cabi, T. Han, Z. Gong, S. Samangooei, M.
Monteiro, J. Menick, S. Borgeaud, A. Brock, A. Nematzadeh, S. Sharifzadeh, M. Binkowski, R. Barreira,
O. Vinyals, A. Zisserman, K. Simonyan, Flamingo: a visual language model for few-shot learning,
Advances in Neural Information Processing Systems 35 (2022).
[27] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam,
G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh,
D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, P. Nakkiran, J. Clark,
C. Berner, S. McCandlish, A. Radford, I. Sutskever, D. Amodei, Language models are few-shot
learners, Advances in Neural Information Processing Systems 33 (2020).
[28] Y. Zhang, J. Li, Y. Chen, W. Xu, X. Wang, Instruction-tuned multimodal language models for
storytelling, in: Proceedings of the 2023 International Conference on Learning Representations,
2023, pp. 3456–3467.
[29] N. S. Sharma, Divya, Multilingual Story Illustration for MUSIA 2025 using One-Prompt-One-Story
Image Generation, in: FIRE 2025 Working Notes, CEUR Workshop Proceedings, 2025.
[30] S. Mannan, A. Hegde, S. Coelho, Bridging Cultures through AI: The Art of Multilingual
Storytelling, in: FIRE 2025 Working Notes, CEUR Workshop Proceedings, 2025.
[31] K. Kachhadiya, P. Patel, Leveraging Large Language Model(LLM) and V-LLM for Zero-Shot
Multilingual Story Illustration, in: FIRE 2025 Working Notes, CEUR Workshop Proceedings, 2025.
[32] M. Sadhukhan, I. Bhattacharya, P. Dutta, NarrArt: Multilingual Story Illustration with AI for
English and Hindi Narratives, in: FIRE 2025 Working Notes, CEUR Workshop Proceedings, 2025.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Y. Cheng, Y. Wu,
          <string-name>
            <given-names>L.</given-names>
            <surname>Carin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carlson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>Storygan: A sequential conditional gan for story visualization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>993</fpage>
          -
          <lpage>1002</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maharana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>Improving Generation and Evaluation of Visual Stories via Semantic Consistency</article-title>
          , in: NAACL,
          <year>2021</year>
          , pp.
          <fpage>2427</fpage>
          -
          <lpage>2442</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hu</surname>
          </string-name>
          , Y. Cheng,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , G. Neubig,
          <article-title>Visual storytelling dataset (vist</article-title>
          ), https://service.tib.eu/ldmservice/dataset/visual-storytelling-dataset--vist-,
          <year>2024</year>
          . Accessed:
          <fpage>2025</fpage>
          -10-25.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>G.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , Pororogan:
          <article-title>An improved story visualization model on pororo-sv dataset</article-title>
          ,
          <source>Proceedings of the 3rd International Conference on Computer Science and Artificial Intelligence</source>
          (
          <year>2019</year>
          )
          <fpage>1</fpage>
          -
          <lpage>5</lpage>
          . URL: https://dl.acm.org/doi/10.1145/3374587.3374649. doi:10.1145/3374587.3374649.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Farhadi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hoiem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kembhavi</surname>
          </string-name>
          , Imagine This! Scripts to Compositions to Videos, in
          <source>: Proceedings of the European Conference on Computer Vision (ECCV)</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>610</fpage>
          -
          <lpage>626</lpage>
          . URL: https://link.springer.com/chapter/10.1007/978-3-030-01237-3_37. doi:10.1007/978-3-030-01237-3_37.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Ye</surname>
          </string-name>
          , J. Liu,
          <string-name>
            <given-names>R.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.- J.</given-names>
            <surname>Qi</surname>
          </string-name>
          , Openstory++
          <article-title>: A large-scale dataset and benchmark for instance-aware open-domain visual storytelling</article-title>
          ,
          <source>arXiv preprint arXiv:2408.03695</source>
          (
          <year>2024</year>
          ). URL: https://arxiv.org/abs/2408.03695.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          , Dm-gan:
          <article-title>Dynamic memory generative adversarial networks for text-to-image synthesis</article-title>
          ,
          <year>2019</year>
          . URL: https://arxiv.org/abs/1904.01310. arXiv:1904.01310.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>O.</given-names>
            <surname>Gafni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Polyak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Ashual</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sheynin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Taigman</surname>
          </string-name>
          ,
          <article-title>Make-a-scene: Scenebased text-to-image generation with human priors, 2022</article-title>
          . URL: https://arxiv.org/abs/2203.13131. arXiv:
          <volume>2203</volume>
          .
          <fpage>13131</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pavlov</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gray</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Voss</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Zero-shot text-to-image generation</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2102.12092. arXiv:
          <volume>2102</volume>
          .
          <fpage>12092</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Y.</given-names>
            <surname>Koh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Luong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Baid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vasudevan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ku</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Hutchinson</surname>
          </string-name>
          , W. Han,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Parekh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , J. Baldridge,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>Scaling autoregressive models for content-rich text-to-image generation</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2206.10789. arXiv:
          <volume>2206</volume>
          .
          <fpage>10789</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          , G. Heigold,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          , G. Goh,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning</source>
          ,
          <year>2021</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <article-title>Diffusion models beat GANs on image synthesis</article-title>
          ,
          <year>2021</year>
          . URL: https://arxiv.org/abs/2105.05233. arXiv:2105.05233.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>McGrew</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2112.10741. arXiv:2112.10741.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2112.10752. arXiv:2112.10752.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>C.</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K. S.</given-names>
            <surname>Ghasemipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <article-title>Photorealistic text-to-image diffusion models with deep language understanding</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2205.11487. arXiv:2205.11487.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>S.</given-names>
            <surname>Gu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Vector quantized diffusion model for text-to-image synthesis</article-title>
          ,
          <year>2022</year>
          . URL: https://arxiv.org/abs/2111.14822. arXiv:2111.14822.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Carin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Carlson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          ,
          <article-title>StoryGAN: A sequential conditional GAN for story visualization</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>9</fpage>
          -
          <lpage>17</lpage>
          . Uses the Pororo-SV dataset for evaluation.
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maharana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>StoryDALL-E: Adapting pretrained text-to-image transformers for story continuation</article-title>
          ,
          <source>in: Proceedings of the European Conference on Computer Vision (ECCV)</source>
          , Springer,
          <year>2022</year>
          , pp.
          <fpage>70</fpage>
          -
          <lpage>87</lpage>
          . URL: https://arxiv.org/abs/2209.06192. doi:10.1007/978-3-031-19826-4_5.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Maharana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Hannan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>StoryDALL-E: Adapting pretrained text-to-image transformers for story continuation</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>13697</fpage>
          -
          <lpage>13706</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Song</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Character preserving coherent story visualization via visual planning and token alignment</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>1234</fpage>
          -
          <lpage>1243</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>T.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Gupta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <article-title>Imagine this! Scripts to compositions to videos</article-title>
          ,
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2018</year>
          )
          <fpage>1</fpage>
          -
          <lpage>10</lpage>
          . URL: https://arxiv.org/abs/1804.03608. doi:10.1109/CVPR.2018.00001.
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.</given-names>
            <surname>Kapuriya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Buitelaar</surname>
          </string-name>
          ,
          <article-title>FlintstonesSV++: Improving story narration using visual scene graph</article-title>
          ,
          <source>in: Proceedings of the 8th Workshop on Narrative Extraction From Texts (Text2Story 2025), volume 3964 of CEUR Workshop Proceedings</source>
          ,
          <year>2025</year>
          . URL: https://ceur-ws.org/Vol-3964/paper3.pdf, accessed: 2025-10-25.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>R.</given-names>
            <surname>Oliveira</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Santos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Silva</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Lima</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <article-title>Narrative generation from visual inputs: A framework for storytelling from images</article-title>
          ,
          <source>in: Proceedings of the 2021 Annual Meeting of the Association for Computational Linguistics</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>567</fpage>
          -
          <lpage>577</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual coherence losses for story generation from images</article-title>
          , in: Proceedings
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>