<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Forum for Information Retrieval Evaluation, December</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>NarrArt: Multilingual Story Illustration with AI for English and Hindi Narratives</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mrinmoy Sadhukhan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Indrajit Bhattacharya</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Paramartha Dutta</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer &amp; System Sciences</institution>
          ,
          <addr-line>Visva-Bharati, Santiniketan, Birbhum, 731235, West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Kalyani Government Engineering College</institution>
          ,
          <addr-line>Kalyani, Nadia, 741235, West Bengal</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2025</year>
      </pub-date>
      <volume>1</volume>
      <fpage>7</fpage>
      <lpage>20</lpage>
      <abstract>
        <p>This work addresses the challenge of multilingual story illustration, with an emphasis on generating visual depictions for narratives in Hindi and English. The task holds substantial value in domains such as education and the entertainment industry, where visual storytelling enriches children's literature by adding character visualizations that foster engagement and imagination, and supports the creation of comics and animated illustrations that enrich the learning experience. A central problem in this domain is balancing character diversity with narrative consistency, where unconstrained character generation can result in inconsistency across illustrations. To overcome this, a framework is proposed that leverages a curated dataset of publicly available stories and illustrations combined with strategies for maintaining both cultural diversity and visual coherence. The proposed approach can produce illustrations for multilingual stories that are not only semantically aligned with narrative segments but also consistent across an entire story, paving the way for richer, culturally rooted applications in storytelling and creative media.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Storytelling</kwd>
        <kwd>Diffusion Model</kwd>
        <kwd>TF-IDF</kwd>
        <kwd>Summarization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>Storytelling has long played a vital role in a child’s cognitive and imaginative development. In earlier
times, children often heard stories narrated by their grandparents, where imagination was fueled
by verbal illustrations. Through these narratives, children envisioned characters and scenes in their
minds, guided by the expressive storytelling of their elders. However, with the shift towards nuclear
family structures in modern society, many children are deprived of this rich storytelling experience.
Artificial Intelligence (AI) now offers new opportunities to recreate this experience by providing visual
illustrations of stories. Such AI-generated illustrations can spark imagination in children, support
teachers in classrooms, and assist illustrators—whose numbers are steadily decreasing—in producing
visuals for stories. These illustrations can also enhance comprehension, helping students to better
understand narrative content. Despite these advantages, current AI-based visual illustration models
face significant limitations. Most state-of-the-art text-to-image generation models are predominantly
trained on English-language text in Roman script, creating barriers for multilingual storytelling. For example,
stories written in Hindi or other Indian languages often cannot be directly processed, restricting the
accessibility of such tools. A straightforward solution might involve translating stories into English
before feeding them into diffusion-based image generation models. However, this approach encounters
challenges when handling long narratives, where translation quality and consistency degrade.</p>
      <p>To address these challenges, we propose a pipeline that leverages multiple pretrained models. First,
stories are preprocessed and, depending on whether they are in Hindi or English, passed through an
appropriate translation model. The translated or original text is then segmented into semantically
coherent sub-stories. Each segment is summarized using abstractive summarization techniques to
produce concise yet meaningful representations. Finally, these summarized sub-stories are used as
input to diffusion models, enabling the generation of consistent and contextually accurate illustrations
for multilingual storytelling.</p>
      <p>The remainder of this paper is organized as follows. Section 2 reviews the existing work in this
domain and highlights the key innovations of previous approaches. The dataset is described in
Section 3, and the proposed methodology in Section 4. Section 5 presents the
experimental results along with key evaluation metrics. A discussion of the findings and potential
directions for future improvement is provided in Section 6. Finally, the paper is concluded in Section 7.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In the field of text-to-image generation, three primary categories of models have been explored, namely
Generative Adversarial Network (GAN)-based models, diffusion-based models, and transformer-based
models. In GAN-based models, the generator produced images conditioned on noise vectors and text
embeddings, while the discriminator attempted to determine whether an image was real or synthetic,
thereby improving the generator through adversarial learning. In contrast, diffusion-based models
always begin with pure noise and iteratively denoise it step by step, guided by text embeddings, to
synthesize an image. Transformer-based models treated images as sequences of tokens and jointly
modeled image tokens with text tokens in an autoregressive sampling framework, enabling image
generation conditioned on textual input. Xu et al. [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] proposed AttnGAN (Attentional Generative
Adversarial Network), which generated images from text descriptions by focusing on relevant words
corresponding to different regions of the image. High-resolution images were synthesized progressively
across multiple stages, where attention over word-level embeddings refined the image resolution
step by step. Qiao et al. [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] presented a GAN-based text-to-image model consisting of three major
components: the Semantic Text Embedding Module (STEM), the Global–Local Collaborative Attentive
Module (GLAM), and the Semantic Text Regeneration and Alignment Module (STREAM). The STEM
module employed a recurrent neural network to obtain word- and sentence-level embeddings. The
GLAM module was responsible for generating images at multiple scales, while the STREAM module,
implemented with an LSTM, attempted to regenerate the text description from the generated images to
ensure semantic alignment with the input text. Saharia et al. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] introduced a diffusion-based model,
Imagen, for text-to-image generation. The model incorporated a pretrained T5-based transformer to
obtain text embeddings, which were then injected as conditions into an efficient U-Net backbone for
image synthesis. The U-Net progressively upscaled images to higher resolutions by transferring learned
weights across different scales. This approach achieved an impressive FID score of 7.27. Ramesh et al. [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
proposed the DALL-E 2 model, a decoder-based diffusion framework. The model followed a two-stage
pipeline: in the first stage, text was mapped into CLIP (Contrastive Language–Image Pretraining)
embeddings, and in the second stage, a diffusion-based decoder generated images conditioned on these
embeddings. Rombach et al. [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] proposed the Stable Diffusion model. In this framework, images were
first compressed into a latent representation using a trained variational autoencoder (VAE). Diffusion-based
denoising was then applied within this latent space, where CLIP-based text embeddings were
injected into a U-Net to guide the denoising process. Finally, the VAE decoder reconstructed the
high-resolution image from the denoised latent representation. In the category of transformer-based models,
Ramesh et al. [6] presented a zero-shot text-to-image generation framework, known as DALL-E. The
approach first compressed images into a 32 × 32 latent space using a discrete VAE. The encoded text was
then concatenated with the latent representation, and an autoregressive transformer was trained to
model the joint distribution. Due to its zero-shot nature, this model demonstrated the ability to handle
unusual and imaginative prompts effectively.
      </p>
      <p>In multilingual text-to-image generation, translation plays a crucial role. A simple text encoder alone
is not sufficient, since additional pretrained models are required to handle translation from multilingual
text into English. Furthermore, summarization techniques are often applied to compress sentences,
ensuring they are suitable as input for the generative model. Ramesh et al. [7] from the AI4Bharat
team proposed the IndicTrans model for multilingual text-to-English conversion. It supports 11 Indian
languages and was trained on the publicly available Samanantar corpus. Their implementation is
based on the fairseq framework, using a Transformer architecture with six encoder and six decoder
layers, an input embedding dimension of 1536, and 16 attention heads. Subsequent improvements led
to IndicTrans2 [8], which extends support to 22 scheduled Indian languages spanning diverse scripts.
IndicTrans2 employs 18 encoder layers and 18 decoder layers, with an input embedding dimension of
1024, a feed-forward dimension of 8192, and 16 attention heads, resulting in a total of 1.1B parameters.
The model achieved an average score of 62.8 on the IT2 benchmark and currently serves as the backend
for the Government of India’s Bhashini application. Costa-jussà et al. [9] from the NLLB (No Language
Left Behind) team introduced the NLLB model, which supports translation across approximately 200
languages, with a particular focus on low-resource languages. It is
a Transformer-based sequence-to-sequence model with 54B parameters, although smaller variants are
also available. The training incorporates a Mixture-of-Experts (MoE) strategy to improve efficiency.
The reported performance reaches 44.8 on the ChrF++ metric. In addition to these research models,
several commercial large language models (LLMs), such as Google’s Gemini, OpenAI’s GPT-4o, and
Anthropic’s Claude, can also be leveraged for translation tasks.</p>
      <p>In the summarization domain, two primary approaches exist: extractive and abstractive
summarization. For story generation tasks, abstractive summarization is particularly important, as it enables the
creation of abstracted narratives that can subsequently be fed into image generation models. Lewis
et al. [10] from Facebook proposed BART, a denoising sequence-to-sequence model. BART integrates
two powerful pretrained components: a bidirectional Transformer encoder (similar to BERT) and
an autoregressive Transformer decoder (similar to GPT). It has been widely applied to tasks such as
summarization, text reconstruction, and sentence generation. A fine-tuned variant, BART-Large-CNN,
demonstrates strong performance on abstractive summarization, achieving robust ROUGE scores on the
CNN/DailyMail dataset. Zhang et al. [11] from Google introduced PEGASUS, a Transformer-based
encoder–decoder architecture explicitly designed for summarization tasks. The model is pretrained using
two key strategies: gap-sentence generation and masked language modeling. Pretraining is performed
on large-scale corpora, including the HugeNews dataset and the C4 dataset, followed by fine-tuning
on summarization benchmarks such as CNN/DailyMail and others. Raffel et al. [12] from Google
proposed the T5 (Text-to-Text Transfer Transformer) model, a versatile encoder–decoder Transformer
framework. T5 is capable of handling diverse downstream tasks, including translation, question
answering, summarization, and classification, by casting all tasks into a unified text-to-text format. The
largest variant, with 11B parameters, was trained on the C4 corpus and achieved a score of 83.28
on the GLUE benchmark. Beltagy et al. [13] from AllenAI presented the Longformer model, which
introduces an efficient attention mechanism based on sparse attention patterns to reduce the quadratic
time complexity of standard self-attention in Transformers. Trained in an autoregressive manner,
Longformer is particularly well-suited for long-document summarization, such as scientific papers and
legal documents. On summarization benchmarks, the model achieved a ROUGE score of 44.4.</p>
      <p>From the above discussion, it can be observed that, apart from encoder–decoder based models, a wide
range of large language models (LLMs) exist, available in both open-source and proprietary forms. These
models often demonstrate remarkable performance; however, their deployment on consumer-grade
GPUs is generally infeasible due to high computational requirements. Some open-source variants can
be executed locally through projects such as llama.cpp or Ollama, but this comes at the cost of significantly
slower inference speeds. Alternatively, several LLMs can be accessed via API calls. While this approach
simplifies usage, it introduces practical limitations, such as restricted request quotas and potential
latency, making it unsuitable for continuous large-scale tasks. Furthermore, the direct use of pretrained
models for summarization or translation is constrained by strict token-length limits. As a result, handling
long-form narratives requires specialized methods for segmenting, preprocessing, and reassembling the
text before feeding it into such models; these methods are described in the sections below.</p>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset Description</title>
      <p>The dataset provided by the MUSIA Shared Task [14, 15] comprises stories in both English and Hindi.
For the English training dataset, there are two primary directories: one containing the story text files
and another containing the corresponding images. The story directory for English includes 360 text
files, each containing a multi-paragraph story. The image directory holds the related images, named
using the pattern eng_story_XXX_01, where eng_story_XXX corresponds to the respective story file
in the stories directory. The Hindi training dataset follows the same structure, containing 185 story
text files along with their corresponding images. For the testing datasets, the English set includes 40
stories, while the Hindi set includes 30 stories. Unlike the training data, the test sets do not contain any
images. Instead, each test set includes a JSON file — EN_story_image_counts.json for English and its
equivalent for Hindi — which specifies the number of images that should be generated for each story.
This setup enables the evaluation of a model’s ability to generate a visually consistent and contextually
appropriate number of images corresponding to each story during testing.</p>
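      <p>As a concrete illustration of how the test split can be consumed, the minimal Python sketch below pairs each English test story with its required image count. The directory layout, file extension, and JSON key names here are assumptions inferred from the naming pattern described above, not the official loader.</p>
      <preformat>
import json
from pathlib import Path

# Hypothetical layout: stories/ holds eng_story_XXX.txt files, and the JSON
# maps each story identifier to its required number of illustrations.
test_dir = Path("MUSIA/test/english")
counts = json.loads(
    (test_dir / "EN_story_image_counts.json").read_text(encoding="utf-8"))

for story_id, n_images in counts.items():  # e.g. "eng_story_001" -> 4
    story = (test_dir / "stories" / f"{story_id}.txt").read_text(encoding="utf-8")
    print(story_id, n_images, len(story.split()), "words")
      </preformat>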
    </sec>
    <sec id="sec-4">
      <title>4. Methodology</title>
      <p>
        In this section, we present our approach for generating visual illustrations from multilingual story data.
To ensure clarity and maintain scope, our work focuses on two languages: Hindi and English. For Hindi
stories, we first employ the pretrained translation model facebook/nllb-200-distilled-600M
[9] to translate text from Hindi (Devanagari script) into English (Latin script). Both the translated
stories and the original English texts are then standardized through a preprocessing stage, which
removes redundant spaces, newline characters, and other formatting inconsistencies. Next, each story
is segmented into semantically coherent sub-stories, corresponding to the number of images to be
generated per story. To achieve this segmentation, sentences are first encoded using a TF–IDF vectorizer
[16], and cosine similarity scores are computed between them. Sentences exhibiting higher similarity
values are grouped together, while low-similarity sentences delineate the boundaries between segments,
ensuring that each segment captures a coherent narrative idea. Since abstractive summarization models
impose sequence length constraints, these segments are subsequently rebalanced to achieve uniformity.
Each segment is initially tokenized using NLTK’s word_tokenize function, and token counts are
adjusted to ensure near-equal distribution across segments. The tokens are then reconstructed into
sentences, and each resulting chunk is validated against a maximum
sequence length of 512 tokens using the facebook/bart-large-cnn tokenizer [10]. In rare cases,
particularly long sentences are truncated, which may cause minor degradation in summarization
quality. After balancing, each chunk is passed through the facebook/bart-large-cnn abstractive
summarization model [10] to generate concise summaries that encapsulate the essence of each story
segment. These summaries serve as textual prompts for the stable-diffusion-xl-base-1.0 model
[
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], which produces the corresponding visual illustrations. This pipeline effectively bridges the gap between
multilingual storytelling and automated visual generation. A detailed workflow of the proposed method
is illustrated in Figure 1.
      </p>
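      <p>To make the pipeline concrete, the following Python sketch wires the named pretrained models together. It is a minimal sketch under stated assumptions: the segment-rebalancing step is simplified to rely on the BART tokenizer's truncation rather than the NLTK-based token redistribution described above, and prompt-length settings are illustrative rather than tuned values.</p>
      <preformat>
import numpy as np
import torch
from nltk.tokenize import sent_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import pipeline
from diffusers import DiffusionPipeline

# Pretrained components named in the text, used off the shelf.
translator = pipeline("translation", model="facebook/nllb-200-distilled-600M",
                      src_lang="hin_Deva", tgt_lang="eng_Latn")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
sdxl = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16).to("cuda")

def illustrate(story: str, n_images: int, is_hindi: bool):
    # 1) Translate Hindi sentence by sentence to respect model length limits.
    if is_hindi:
        story = " ".join(t["translation_text"]
                         for t in translator(sent_tokenize(story)))
    # 2) Normalize whitespace and split into sentences.
    sents = sent_tokenize(" ".join(story.split()))
    # 3) TF-IDF + cosine similarity between adjacent sentences; cut the
    #    story at the (n_images - 1) least-similar adjacent pairs so that
    #    each segment keeps a coherent narrative idea.
    tfidf = TfidfVectorizer().fit_transform(sents)
    sim = cosine_similarity(tfidf[:-1], tfidf[1:]).diagonal()
    cuts = sorted(np.argsort(sim)[: n_images - 1] + 1)
    bounds = [0, *cuts, len(sents)]
    segments = [" ".join(sents[a:b]) for a, b in zip(bounds, bounds[1:])]
    # 4) Abstractive summary per segment; BART truncates inputs at 512 tokens.
    prompts = [summarizer(seg, max_length=60, min_length=15,
                          truncation=True)[0]["summary_text"]
               for seg in segments]
    # 5) One illustration per summarized segment.
    return [sdxl(prompt=p).images[0] for p in prompts]
      </preformat>
      <p>Cutting at the least-similar adjacent sentence pairs is one direct way to realize the boundary heuristic described above while guaranteeing exactly the requested number of segments.</p>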
    </sec>
    <sec id="sec-5">
      <title>5. Results</title>
      <p>During the evaluation of the proposed methodology, our focus was not on assessing the quality of the
pretrained models themselves, as these models are already well-established and widely validated. Instead,
the primary objective was to evaluate the overall effectiveness of the proposed framework—specifically,
its ability to generate images from multilingual stories with varying numbers of narrative segments. For
this purpose, we utilized the test dataset provided by the MUSIA Shared Task. It is important to note
that we did not fine-tune any of the pretrained models used in this framework due to the significant
GPU resource requirements associated with such training. Instead, all models were employed in their
off-the-shelf pretrained configurations, ensuring that the evaluation focuses purely on the integration
and performance of the proposed pipeline rather than the optimization of individual components.</p>
      <table-wrap id="tab1">
        <label>Table 1</label>
        <caption>
          <p>Human evaluation of English story illustrations: counts of Good, Moderate, and Fair ratings for Relevance, Visual Quality, and Consistency, comparing our method with the Retriever, Nandini_Divya, NLPFusion, and JU Team submissions.</p>
        </caption>
      </table-wrap>
      <table-wrap id="tab2">
        <label>Table 2</label>
        <caption>
          <p>Human evaluation of Hindi story illustrations: counts of Good, Moderate, and Fair ratings for Relevance, Visual Quality, and Consistency, comparing our method with the Retriever, Nandini_Divya, and NLPFusion submissions.</p>
        </caption>
      </table-wrap>
      <p>
Several automatic evaluation metrics are commonly employed to assess the performance of generative
models, including Fréchet Inception Distance (FID), Inception Score (IS), Learned Perceptual Image Patch
Similarity (LPIPS), CLIPScore, and Diversity Score. These metrics evaluate different aspects of generative
performance, such as visual coherence, semantic alignment, and output diversity. FID and IS are
traditionally used to assess the visual realism and fidelity of generated images by comparing them with
real images, and are particularly valuable during the training phase of models such as diffusion-based
generators. For text-to-image generation, however, CLIPScore is more suitable, as it directly measures
the semantic consistency between the input text and the generated image. The score ranges from 0 to
1, with higher values indicating better alignment and higher-quality outputs. Additionally, diversity
can be evaluated using an LPIPS-based Diversity Score, which computes perceptual distances between
multiple images generated from the same text prompt. This reflects the model’s capacity to produce
varied yet semantically related outputs—where a lower score indicates reduced diversity and a higher
score suggests greater variability. Since our framework employs a pretrained text-to-image generation
model, metrics such as FID and IS are less informative, as they primarily assess image realism rather
than text–image correspondence. Therefore, we report CLIPScore as our primary evaluation metric. On
the MUSIA test dataset, our approach achieved an average CLIPScore of 0.31, indicating a moderate
level of semantic alignment between the generated images and their corresponding story segments.
Nonetheless, we recognize that CLIPScore is still a developing and limited metric, underscoring the
need for more comprehensive evaluation approaches tailored to multilingual text-to-image generation.</p>
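      <p>As a point of reference for how such a score can be computed, the following minimal Python sketch evaluates a raw-cosine CLIPScore between a prompt and a generated image. The CLIP checkpoint used here is an assumption on our part, and the sketch omits the 2.5x rescaling applied in some CLIPScore variants, which is consistent with the 0-to-1 range reported above.</p>
      <preformat>
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; the shared task's evaluation CLIP variant is not specified.
MODEL_ID = "openai/clip-vit-base-patch32"
model = CLIPModel.from_pretrained(MODEL_ID)
processor = CLIPProcessor.from_pretrained(MODEL_ID)

def clip_score(prompt: str, image: Image.Image) -> float:
    """Cosine similarity between CLIP text and image embeddings."""
    inputs = processor(text=[prompt], images=image,
                       return_tensors="pt", truncation=True)
    with torch.no_grad():
        out = model(**inputs)
    t = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    v = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return float((t * v).sum())  # typically around 0.2-0.35 for aligned pairs
      </preformat>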
      <p>Hence, we relied on human evaluation conducted by a team of experts from the MUSIA Shared
Task. The evaluators assessed the generated illustrations based on three key criteria: Relevance,
Visual Quality, and Consistency. Relevance measures how effectively the illustrations capture the key
moments and semantics of the story; Visual Quality assesses the originality, aesthetic appeal, and overall
artistic presentation; and Consistency evaluates whether the sequential images from different story
segments form a coherent and continuous narrative. Among these, Consistency is given the highest
priority during rating, followed by Relevance and Visual Quality. To quantify inter-rater reliability,
Cohen’s Kappa coefficient was computed, and the final ratings were categorized into three levels—Good,
Moderate, and Fair. Table 1 presents the aggregated evaluation results for English stories, while
Table 2 reports the corresponding results for Hindi stories. Both tables include a comparative analysis
between our proposed framework and other participating teams whose submissions were shortlisted
in the MUSIA Shared Task. From our results, it can be observed that our proposed method performs
strongly in terms of Visual Quality for both English and Hindi stories. In terms of Relevance, the
English stories achieve comparatively better performance, while the Hindi stories exhibit slightly lower
scores. Regarding Consistency, both language categories achieve moderate performance, which can be
considered satisfactory given the experimental constraints; this is expected, as pretrained models without
domain-specific fine-tuning often struggle to maintain strong relevance and narrative consistency across
all generated outputs. Figure 2 illustrates the application of the proposed methodology on an English
story, where the narrative is divided into multiple segments, and for each segment, a corresponding
image is generated to visually depict the key events and preserve the story’s progression. Similarly,
Figure 3 demonstrates the application of the method to a Hindi story, showcasing the framework’s
multilingual capability and its effectiveness in producing coherent visual illustrations across different
languages.</p>
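      <p>For completeness, inter-rater agreement of the kind reported by the task organizers can be computed with scikit-learn; a minimal sketch follows, where the two rating vectors are purely illustrative and not shared-task data.</p>
      <preformat>
from sklearn.metrics import cohen_kappa_score

# Hypothetical Good/Moderate/Fair labels from two raters over five stories.
rater_a = ["Good", "Moderate", "Fair", "Good", "Moderate"]
rater_b = ["Good", "Moderate", "Moderate", "Good", "Fair"]

# Kappa = 1 means perfect agreement; 0 means chance-level agreement.
print(cohen_kappa_score(rater_a, rater_b))
      </preformat>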
    </sec>
    <sec id="sec-6">
      <title>6. Discussion</title>
      <p>Multilingual story illustration is an important and challenging problem, yet there is currently no
proven method that can produce perfect illustrations directly from text. The accuracy of story-to-image
generation largely depends on how the story is provided to the model, as well as the length and
complexity of the narrative being fed. From the methodology and results discussed above, it can be
observed that different pretrained models can be integrated at various stages of the proposed pipeline.
However, the final quality of illustrations ultimately depends on diffusion-based image generation
models and the way in which the script is structured to guide the model in understanding the intended
scene. For multilingual stories, it is our assumption that the IndicTrans model developed by AI4Bharat
could potentially yield better results in translation tasks compared to the pretrained models we have
used, since it is specifically trained and validated on native Indian languages. One limitation we have
observed is that diffusion-based models are computationally heavy, and their training requires complex
fine-tuning procedures, which are sometimes infeasible on consumer-grade GPUs. In this work, we
have employed a pretrained diffusion model without additional fine-tuning on the given training and
validation datasets, which occasionally led to inconsistencies and mismatches in visualization. The
development of lightweight diffusion models tailored specifically for this type of illustration task could
significantly reduce the dependency on heavy, server-grade GPUs. Such models would also allow easier
fine-tuning, thereby facilitating domain adaptation with minimal effort and further enhancing the
quality and effectiveness of the generated illustrations.</p>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>This work explored the problem of multilingual story illustration, with a focus on English and Hindi
narratives. By combining pretrained language and vision models within a story segmentation based
framework, we demonstrated that it is possible to generate illustrations that are semantically aligned
with story segments while maintaining reasonable narrative coherence. Human evaluation showed that
the method performs well in terms of visual quality for both languages, achieves strong Relevance in
English but weaker in Hindi, and provides moderate consistency across stories. Despite these promising
results, the reliance on heavy diffusion models without fine-tuning introduced occasional mismatches
in visualization and consistency. Future directions include developing lightweight diffusion models
for easier adaptation and leveraging stronger multilingual translation systems, such as IndicTrans, to
improve performance on native languages. Overall, this study lays the groundwork for culturally
rooted, coherent, and scalable multilingual story illustration systems with applications in education,
entertainment, and creative media.</p>
    </sec>
    <sec id="sec-8">
      <title>Declaration on Generative AI</title>
      <p>During the preparation of this work, the author(s) used ChatGPT-4 and Grammarly to check grammar
and spelling. After using these tool(s)/service(s), the author(s) reviewed and edited the content as
needed and take(s) full responsibility for the publication’s content.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>T.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <surname>X. He,</surname>
          </string-name>
          <article-title>AttnGAN: Fine-Grained Text to Image Generation with Attentional Generative Adversarial Networks</article-title>
          ,
          <year>2017</year>
          . URL: http://arxiv.org/ abs/1711.10485. doi:
          <volume>10</volume>
          .48550/arXiv.1711.10485, arXiv:
          <fpage>1711</fpage>
          .10485 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Qiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tao</surname>
          </string-name>
          ,
          <article-title>MirrorGAN: Learning Text-to-image Generation by Redescription</article-title>
          ,
          <year>2019</year>
          . URL: http://arxiv.org/abs/
          <year>1903</year>
          .05854. doi:
          <volume>10</volume>
          .48550/arXiv.
          <year>1903</year>
          .
          <volume>05854</volume>
          , arXiv:
          <year>1903</year>
          .05854 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>C.</given-names>
            <surname>Saharia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Saxena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Whang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Denton</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. K. S.</given-names>
            <surname>Ghasemipour</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. K.</given-names>
            <surname>Ayan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. S.</given-names>
            <surname>Mahdavi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Lopes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Salimans</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ho</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. J.</given-names>
            <surname>Fleet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Norouzi</surname>
          </string-name>
          ,
          <article-title>Photorealistic Text-to-Image Difusion Models with Deep Language Understanding</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2205.11487. doi:
          <volume>10</volume>
          .48550/arXiv.2205.11487, arXiv:
          <fpage>2205</fpage>
          .11487 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Nichol</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Chu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>Hierarchical Text-Conditional Image Generation with CLIP Latents</article-title>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2204.06125. doi:
          <volume>10</volume>
          .48550/arXiv.2204. 06125, arXiv:
          <fpage>2204</fpage>
          .06125 [cs].
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <string-name>
            <surname>High-Resolution Image Synthesis with Latent Difusion Models</surname>
          </string-name>
          ,
          <year>2022</year>
          . URL: http://arxiv.org/abs/2112.10752. doi:
          <volume>10</volume>
          .48550/arXiv.2112. 10752, arXiv:
          <fpage>2112</fpage>
          .10752 [cs].
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>