<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>On the Challenges in Evaluating Visually Grounded Stories</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Aditya K Surikuchi</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raquel Fernández</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sandro Pezzelle</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Institute for Logic, Language and Computation University of Amsterdam</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Producing stories grounded in visual content is an inherent trait of human intelligence and an integral aspect of interpersonal communication. With the surge of advanced vision-to-language models, there has been increased interest in developing and understanding the capabilities of models to generate visually grounded narratives. However, recent research has highlighted the challenges in evaluating model-generated stories. In this work, we study these evaluation limitations in the visually grounded story generation task by focusing on the recently released Visual Writing Prompts dataset and shared task. Through this study, we also explore the capabilities of several general-purpose vision-to-language foundation models for generating stories grounded in sequences of images. We observe that some recent models, such as Qwen2.5-VL, can generate stories that are coherent, consistent, and well-grounded in the visual data. Nevertheless, in line with the recent studies in this area, we find that the existing automatic evaluation metrics and methods are insufficient to fully capture all the aspects essential for assessing model-generated stories. We believe our findings reinforce the evidence and arguments emphasizing the need for improvements to automatic approaches that can comprehensively evaluate and understand models for visual storytelling.</p>
      </abstract>
      <kwd-group>
        <kwd>visual storytelling</kwd>
        <kwd>visually-grounded story generation</kwd>
        <kwd>vision-to-language models</kwd>
        <kwd>evaluation</kwd>
        <kwd>NLG</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <sec id="sec-1-1">
        <p>Given a sequence of multiple temporally ordered images as input, the visual storytelling or visually
grounded story generation task requires models to generate plausible and compelling textual stories.</p>
      </sec>
      <sec id="sec-1-2">
        <p>
          Huang et al. [1] proposed this task and released the VIST dataset to facilitate the development of models
that can generate stories based on the causal structure of the visual input sequence. Leveraging the
VIST dataset, various modeling approaches have been proposed over the years [
          <xref ref-type="bibr" rid="ref2 ref3 ref4 ref5 ref6 ref7">2, 3, 4, 5, 6, 7</xref>
          ]. For
evaluating stories generated by models, most work has predominantly resorted to using automatic
reference-based n-gram metrics such as BLEU [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] and METEOR [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. These metrics neither consider
the visual input when assessing the generated stories nor do they account for the fact that several
creative stories are plausible for a given image sequence. The research community has underlined this
problem and proposed reference-free automatic evaluation metrics to measure the quality of stories
along different dimensions such as visual grounding, coherence, and repetition [10, 11]. However, the
latest work in this direction has shown that evaluating model-generated outputs in visual storytelling
requires consideration of more aspects besides measuring the degree of visual grounding, coherence,
and repetition [12].
        </p>
        <p>To verify this argument, in this work, we focus on the recently released Visual Writing Prompts (VWP)
[13] dataset from the Visually Grounded Story Generation challenge [14] and explore different aspects
related to modeling and evaluation. First, with the VWP dataset, we train and generate stories using
models that are shown to perform well on the VIST dataset. We then consider several vision-to-language
foundation models designed for general-purpose tasks and use them to generate stories for the VWP
dataset in a zero-shot manner. Using the evaluation framework proposed in Surikuchi et al. [12], we
compare all the models and find that general-purpose VLMs achieve better results quantitatively in
terms of the three different dimensions considered for assessment—visual grounding, coherence, and
repetition. However, through qualitative verification, we identify that the existing metrics do not fully
capture all the aspects relevant for evaluating the quality of a story. These results are in line with the
findings of recent studies, and we support the arguments for exploring and considering more nuanced
dimensions, such as consistency of emotions and differentiating creative expressions from hallucinations,
for evaluating visual storytelling. Our code is available at: akskuchi/vwp-visual-storytelling.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. VGSG Challenge</title>
      <sec id="sec-2-1">
        <p>The Visually Grounded Story Generation (VGSG) shared task [14] was proposed to test the capabilities
of AI models to generate coherent, grounded, and diverse short stories for sequences of images. It
comprised three different tracks—closed, open, and grounding—and used the Visual Writing Prompts
(VWP) [13] dataset, which we describe below. Further details pertaining to the challenge tracks, our
approaches, and experiments are discussed in Section 3.</p>
        <p>
          Data. Hong et al. [13] introduced the VWP dataset to overcome the various limitations present in
other visual storytelling datasets such as VIST [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Primarily, image sequences in VWP are constructed
to be semantically well-connected and centered around recurring characters such that they serve
as meaningful writing prompts for human annotators or AI models. VWP contains a total of 13213
sequences, each comprising 5 to 10 images from a curated set of frames obtained from the MovieNet
dataset [15]. For the selected image sequences, stories were provided by Amazon Mechanical Turk
crowd workers. The obtained text was then processed to anonymize the recognized named locations
and characters using placeholders (e.g., [female0], ..., [femaleN]). The overall dataset was split into
11778 training, 849 validation, and 586 test samples. VWP is shown to have more events and characters
per story compared to the other visual storytelling datasets such as VIST. An example &lt;image sequence,
story&gt; pair from the VWP dataset is shown in Figure 2.
        </p>
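<p>The anonymization step described above can be illustrated with a minimal sketch. This is not the VWP pipeline: the helper name, the (name, gender) input format, and the simple string replacement are all hypothetical, for illustration only.</p>

```python
# Illustrative sketch (not the actual VWP preprocessing): replace recognized
# character names with the gendered, indexed placeholders used in VWP stories.
def anonymize(story, characters):
    """characters: list of (name, gender) pairs in order of appearance."""
    counts = {}
    for name, gender in characters:
        idx = counts.get(gender, 0)       # next free index for this gender
        counts[gender] = idx + 1
        story = story.replace(name, f"[{gender}{idx}]")
    return story
```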
        <p>
          Evaluation. Stories submitted to the challenge were evaluated using both automatic and human
evaluation methods. Reference-based metrics such as BLEU [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], METEOR [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], ROUGE [16], and CIDEr
[17] were used for comparing model-generated candidate stories with reference stories provided by
humans. Stories were also evaluated in a reference-free manner along three dimensions—coherence,
character grounding, and diversity. In the context of the shared task, coherence was defined in terms
of entity transitions throughout the text and the Generative Entity Grid [18] metric was used for
computing it. Character grounding scores were computed using the Character Matching [19] metric
which measures the degree of match between the ‘appearance’ matrices of characters present in the
image sequence and the generated text. Diversity of stories was measured as an average of various aspects,
such as the number of unique verbs, the verb-to-vocabulary ratio, the verb-token ratio, and the percentage of diverse
verbs not in the top-5 most frequent verbs. An evaluation dashboard with reference-based metrics was
made available during the shared task training phase to verify the stories generated by models for the
VWP validation data split.<sup>1</sup> Furthermore, the organizers also conducted human evaluation on the submitted
stories and discussed their qualitative findings in the overall shared task summary report [20].
        </p>
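<p>A minimal sketch of the diversity score described above, under stated assumptions: the shared task report does not give the exact formula here, so the normalization of each aspect and the equal-weight averaging are our assumptions for illustration.</p>

```python
# Hedged sketch (our reading, not the official VGSG implementation) of the
# verb-based diversity score: an average of four aspects described in the text.
def diversity(verbs, vocab_size, num_tokens, top5_verbs):
    """verbs: all verb tokens of a story; top5_verbs: the 5 most frequent verbs."""
    unique = set(verbs)
    aspects = [
        len(unique) / max(len(verbs), 1),      # unique-verb ratio (assumed normalization)
        len(unique) / max(vocab_size, 1),      # verb-to-vocabulary ratio
        len(verbs) / max(num_tokens, 1),       # verb-token ratio
        sum(v not in top5_verbs for v in unique) / max(len(unique), 1),
    ]                                          # share of verbs outside the top-5
    return sum(aspects) / len(aspects)
```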
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Our Approach</title>
      <p>In this section, we discuss the various modeling approaches we used for the VGSG task both during
and after the completion of the challenge. Specifically, we participated in the open and closed tracks
of the challenge and leveraged two models introduced in Surikuchi et al. [12]. We also consider two
additional state-of-the-art vision-language models (VLMs) to shed light on their zero-shot capabilities
for the visual story generation task. To comprehensively assess the quality of model-generated stories
in terms of their closeness to corresponding human-written ones, we make use of the recently proposed
human-centric evaluation method—dHM [12]. The remainder of this section describes the shared task
tracks, the models we used, and the dHM evaluation method.</p>
      <p><sup>1</sup> https://huggingface.co/spaces/VGSG/TestVGSG</p>
      <sec id="sec-3-2">
        <title>3.1. Open Track</title>
        <p>
          The objective of the open track was to test the current state of the art of the VGSG task, and participants were allowed
to use any pre-trained visual encoders and textual decoders. Therefore, for this track, we
used the TAPM (+Llama 2) and LLaVA (visual context) models proposed in Surikuchi et al. [12]. Similar
to LLaVA, we consider two additional state-of-the-art VLMs and use them off-the-shelf for this task.
TAPM (+Llama 2). Transitional Adaptation of Pretrained Models (TAPM) is an approach originally
proposed by Yu et al. [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] for the visual storytelling task. It follows the visual encoder to language
decoder architecture commonly used in models for visual storytelling. First, image-level and object-level
features of the input image sequence are obtained using pre-trained ResNet [21] and FasterRCNN [22]
respectively. These features are passed through the visual encoder, and the representations of the
image at each temporal position are pooled together with the features of images at the neighboring
positions for improved context. The context from the encoder is passed on to a pre-trained GPT-2 [
          <xref ref-type="bibr" rid="ref10">23</xref>
          ]
for story generation. Prior to this downstream task-specific fine-tuning, TAPM comprises an adaptation
step, run for a pre-determined number of epochs, in which the language decoder is frozen and the visual
encoder parameters are adapted based on the outputs of the frozen decoder. The authors argue that this
step harmonizes the various pre-trained components of the model and facilitates semantic alignment
between visual and textual representations. Recently, it has been shown that replacing GPT-2 with
the Llama 2 [
          <xref ref-type="bibr" rid="ref11">24</xref>
          ] language model—TAPM (+Llama 2)—improves model performance across different
datasets, including VWP [12].
        </p>
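<p>The neighbor-pooling step of TAPM's visual encoder can be illustrated with a minimal sketch. The pooling operator is not specified in this summary, so mean pooling over a one-position window is an assumption for illustration, and the function name is hypothetical.</p>

```python
# Minimal sketch of pooling each image representation together with the
# features of its temporal neighbors for added context (mean pooling assumed).
def pool_with_neighbors(features, window=1):
    """features: list of equal-length vectors, one per temporal position."""
    pooled = []
    for i in range(len(features)):
        lo, hi = max(0, i - window), min(len(features), i + window + 1)
        group = features[lo:hi]                     # the position and its neighbors
        pooled.append([sum(vals) / len(group) for vals in zip(*group)])
    return pooled
```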
        <sec id="sec-3-2-1">
          <p>We note that Surikuchi et al. [12] used the VWP dataset version v1.0.0 to train and test their models.<sup>2</sup>
However, through the VGSG challenge, the authors released VWP v2.1, which provided anonymized stories
and included details pertaining to the training, validation, and test splits.<sup>3</sup> Therefore, we trained the
TAPM (+Llama 2) model from scratch using VWP v2.1 by following the procedure described in
Surikuchi et al. [12].</p>
          <p><sup>2</sup> https://github.com/vwprompt/vwp/releases/tag/v1.0.0</p>
          <p><sup>3</sup> https://huggingface.co/datasets/tonyhong/vwp</p>
        </sec>
      </sec>
      <sec id="sec-3-5">
        <p>
          Off-the-shelf VLMs. Large Language and Vision Assistant (LLaVA) [
          <xref ref-type="bibr" rid="ref12">25</xref>
          ] is a large vision-language
foundation model pre-trained for various general-purpose tasks such as image captioning and visual
question answering. We use LLaVAv1.6 in a zero-shot manner by prompting it under the visual context
setting proposed in Surikuchi et al. [12]. Specifically, we provide the model with the entire image
sequence—sequence of images combined horizontally into a single composite image—as input and
prompt it to generate a story with [num-images-in-the-sequence] sentences. To ensure that the
generated stories are not sensitive to the prompt, we use three variations of the prompts and report the
average of the resulting scores during evaluation. Besides LLaVAv1.6, we use two recent similarly
sized general-purpose VLMs—Qwen2.5-VL [
          <xref ref-type="bibr" rid="ref13">26</xref>
          ] and DeepSeek-VL [
          <xref ref-type="bibr" rid="ref14">27</xref>
          ]—that have demonstrated strong
performance on various vision-language benchmarks. Using the technical reports and the open-source
information, we verified that the VLMs were not directly pre-trained using any of the visual storytelling
datasets, including VWP. Additional details concerning the models are provided in Appendix A and the
inference procedure, including the prompts, are provided in Appendix B.
        </p>
      </sec>
      <sec id="sec-3-6">
        <title>3.2. Closed Track</title>
        <p>
          The closed track was a controlled setting in which visual features for VWP images were extracted using
the pre-trained SwinTransformer [
          <xref ref-type="bibr" rid="ref15">28</xref>
          ] and provided as part of the VGSG challenge data. Participants
of this track were instructed not to use any additional visual feature extractors and to focus on the
components that map the vision and language modalities. For this track, we modified the TAPM (+Llama 2)
approach outlined in Section 3.1 to leverage the image-level and object-level SwinTransformer features.
In the subsequent text, we refer to this model as TAPM<sub>C</sub> and the one used for the open track as TAPM<sub>O</sub>.
        </p>
        <p>[Figure 1: Overall dHM distances and metric-level deviations (coherence, visual grounding, repetition) for Qwen2.5-VL, DeepSeek-VL, LLaVA v1.6, TAPM<sub>C</sub>, and TAPM<sub>O</sub>.]</p>
      </sec>
      <sec id="sec-3-9">
        <title>3.3. Evaluation</title>
        <p>In this work, we aim to understand the degree to which model-generated stories comply with stories
produced by humans regarding three different aspects essential for visual story generation—coherence,
visual grounding, and repetition. For these aspects, we leverage the definitions and metrics
operationalized by Surikuchi et al. [12]. Specifically, visual grounding is assessed using the GROOViST [11]
metric, which measures the degree of alignment between noun phrases in the story and the bounding
boxes in the images of the sequence. Coherence is operationalized using the RoViST-C [10] metric,
which measures the average probability with which each sentence follows the preceding sentences.
For repetition, the RoViST-NR [10] metric is used, which measures ‘non-redundancy’ in terms of the
number of inter- and intra-sentence co-occurring words. We note that, despite being referred to using
similar terms, these three aspects are distinct in terms of their definitions from those considered for the
VGSG shared task evaluation (described in Section 2).</p>
        <p>Using the three metrics, we first obtain coherence, visual grounding, and repetition scores for both
model-generated stories and corresponding human-annotated stories, independently. We then compute
the absolute differences between the human and model scores to measure the metric-level deviations
(d<sup>C</sup><sub>HM</sub>, d<sup>G</sup><sub>HM</sub>, d<sup>R</sup><sub>HM</sub>). Finally, to quantify the degree of ‘closeness’ between model- and
human-stories, we compute the aggregate distance dHM as the average of the metric-level deviations:</p>
        <p>d<sup>C</sup><sub>HM</sub> = |C<sub>H</sub> − C<sub>M</sub>|,  d<sup>G</sup><sub>HM</sub> = |G<sub>H</sub> − G<sub>M</sub>|,  d<sup>R</sup><sub>HM</sub> = |R<sub>H</sub> − R<sub>M</sub>|,  d<sub>HM</sub> = (d<sup>C</sup><sub>HM</sub> + d<sup>G</sup><sub>HM</sub> + d<sup>R</sup><sub>HM</sub>)/3  (1)</p>
      </sec>
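<p>The dHM distance defined above can be computed directly; a minimal sketch (the dictionary-based interface is our own, for illustration):</p>

```python
# Sketch of the dHM aggregate distance of Surikuchi et al. [12]: the mean
# absolute human-model difference over the three metric scores (Equation 1).
def d_hm(human, model):
    """human, model: dicts with 'coherence', 'grounding', 'repetition' scores."""
    keys = ("coherence", "grounding", "repetition")
    deviations = {k: abs(human[k] - model[k]) for k in keys}  # dC_HM, dG_HM, dR_HM
    return sum(deviations.values()) / len(keys), deviations
```

Lower values indicate model stories closer to human ones along all three dimensions.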
    </sec>
    <sec id="sec-4">
      <title>4. Results</title>
      <p>Figure 1 shows the overall dHM distances along with individual metric-level deviations for all the
models. Model-generated stories with lower dHM values are expected to be closer to human levels of
storytelling in terms of the three dimensions—coherence, visual grounding, and repetition. First, we
observe that, compared to the TAPM models, all off-the-shelf VLMs achieve better overall distances.
      <sec id="sec-4-1">
        <p>This pattern differs from the results reported in recent work [12] for the earlier version of the VWP
dataset (v1.0.0). However, the TAPM models achieve the lowest distance in terms of repetition, which
aligns with the findings in Surikuchi et al. [12]. Among the foundational VLMs, Qwen2.5-VL achieves
the lowest overall distance, and the difference compared to the other models is particularly significant along
the dimension of visual grounding. We hypothesize that Qwen2.5-VL—which is a relatively recent
VLM—benefits from its extensive pre-training on long video understanding datasets and interleaved
image-text data crawled from the web, which might enable the model to accurately ground stories
across the temporal positions in image sequences.</p>
      </sec>
      <sec id="sec-4-2">
        <p>[Figure 2: An example image sequence from the VWP dataset with the corresponding human-written story and the stories generated by Qwen2.5-VL, LLaVA v1.6, DeepSeek-VL, TAPM<sub>O</sub>, and TAPM<sub>C</sub>. The story texts from the figure follow.]</p>
        <p>It was a normal evening for [female0] when she entered her room. She saw her friend [female1] sitting in the bed. [female0]
didn’t know what was happening. [female1] looks like she has been crying. [female0] was confused by [female1]’s state. She
asked [female1] what was happening as she didn’t know why [female1] is crying like that. Then [female1] explained that
she finally got a job after all the struggles she faced. Eve started to open her bag. [female0] was so happy for her friend as
she finally able to live her life as she wishes. Eve took few money from her bag saying that it is her advanced payment for the
job. They went to eat out to celebrate their happiness.</p>
        <p>In the cozy bedroom, Sarah noticed her friend, Emily, looking distressed. Emily sat on the bed, her head in her hands,
clearly upset. Sarah approached her, concerned, and asked what was wrong. Emily, still in shock, told Sarah about a recent
betrayal she had discovered. Sarah listened intently, offering comfort and support. As they talked, Sarah noticed a stack
of money in Emily’s handbag. Curious, she asked about it. Emily explained that the money belonged to someone else but
had been left behind. Sarah suggested they return it. They decided to go to Pink’s Chili Dogs, a local favorite.</p>
        <p>1. In a bustling city, a young woman finds herself lost in a maze of narrow streets. 2. She stops to ask for directions, her
eyes wide with confusion. 3. The man she approaches is kind and offers to guide her. 4. As they walk, he tells her
stories of the city’s history and culture. 5. They come across a small, hidden bookstore, its shelves filled with ancient tomes.
6. The woman is captivated by the sight of a rare, first edition book. 7. The man helps her purchase it, and they continue
their journey together. 8. As they walk, they share their dreams and aspirations, finding a connection in their shared love of
literature. 9. They reach a beautiful garden, where they sit and read together under the stars. 10. As the night wears on, they
realize they have found something special in each other’s company.</p>
        <p>In a cozy library, a young woman in a vibrant red sweater engrossed in a book, her fingers gently turning the pages. Nearby,
an older man in a warm, brown coat, his eyes scanning the shelves, searching for a specific title. The soft glow of the library
lights illuminates their focused expressions, as they delve into the world of knowledge. Outside, a child’s laughter pierces
the air, drawing the attention of the patrons to a colorful scene of joy and play. The sequence continues, capturing the
essence of a day in the life of this library, where stories unfold in the quiet corners and on the open shelves.</p>
        <p>He walked into his office, but it was dark and empty. The woman in pink asks the woman in brown where the earring might
be. The woman in brown says that the man probably hid the earring inside a book. The woman in pink thinks about this.
Then the woman in pink asks the woman in brown if she is certain about this. The woman in brown insists that the man
would have hidden the earring inside a book. But she was feeling bad about it. [female0] cried, but there was nothing she
could do. In one of the presents it is full of stacks of money. They were ready to take off in their helicopters and go to battle.
Her husband [male0] was looking at her while she was on the floor. The woman in pink asks the woman in brown
where the earring might be. The woman in brown says that the man probably hid the earring inside a book. The woman in
pink thinks about this. Then the woman in pink asks the woman in brown if she is certain about this. The woman in brown
insists that the man would have hidden the earring inside a book. The woman checks the book by the computer, and sure
enough, she finds her earring. She called her friend just to check on her. She kept thinking about how she would get out of
this job. It was a large and grand house with grand gardens.</p>
        <p>To understand whether the dHM method and the individual metrics reflect the quality of stories generated
by models, we qualitatively inspected a random selection of the generated stories. Along the three
dimensions considered, we observed that the current evaluation metrics reflect the overall quality
of model-generated stories. However, we find that these metrics do not fully capture all the aspects
relevant to visual storytelling. For instance, we find that the inconsistencies in stories regarding the
overarching topic, the characters, and their emotions (see Figure 2) are not completely accounted for by
the evaluation metrics.<sup>4</sup> Moreover, with the current set of evaluation approaches, it is unclear how to
accurately differentiate creative expressions that are visually grounded from implausible hallucinations.</p>
      </sec>
      <sec id="sec-4-3">
        <p>We believe these findings add evidence to the claims made by Surikuchi et al. [12] and emphasize the need for improving evaluation methods.</p>
        <p><sup>4</sup> Appendix C provides more examples.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this work, we studied the visual storytelling task using the VWP dataset and underlined the various
challenges concerning the evaluation of model-generated stories. We compared various models using
existing evaluation frameworks and showed that a general-purpose foundation model, Qwen2.5-VL,
achieves the best overall scores along the dimensions of visual grounding and coherence. Qualitatively,
we verified the generated stories and found that, along the three dimensions considered, the evaluation
methods reflect the quality of the generated stories. However, in line with the latest studies, we also observed
that the current automatic evaluation methods do not fully capture all the aspects essential for visual
storytelling. Our findings support the need for research efforts toward automatic evaluation methods
that approach the problem in a comprehensive manner for accurately assessing the quality of stories.</p>
      <sec id="sec-5-1">
        <title>References</title>
        <p>[9] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments, in: Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–72. URL: https://aclanthology.org/W05-0909.</p>
        <p>[10] E. Wang, C. Han, J. Poon, RoViST: Learning Robust Metrics for Visual Storytelling, in: M. Carpuat, M.-C. de Marneffe, I. V. Meza Ruiz (Eds.), Findings of the Association for Computational Linguistics: NAACL 2022, Association for Computational Linguistics, Seattle, United States, 2022, pp. 2691–2702. URL: https://aclanthology.org/2022.findings-naacl.206. doi:10.18653/v1/2022.findings-naacl.206.</p>
        <p>[11] A. K. Surikuchi, S. Pezzelle, R. Fernández, GROOViST: A metric for grounding objects in visual storytelling, in: H. Bouamor, J. Pino, K. Bali (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Singapore, 2023, pp. 3331–3339. URL: https://aclanthology.org/2023.emnlp-main.202/. doi:10.18653/v1/2023.emnlp-main.202.</p>
        <p>[12] A. K. Surikuchi, R. Fernández, S. Pezzelle, Not (yet) the whole story: Evaluating visual storytelling requires more than measuring coherence, grounding, and repetition, in: Y. Al-Onaizan, M. Bansal, Y.-N. Chen (Eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Association for Computational Linguistics, Miami, Florida, USA, 2024, pp. 11597–11611. URL: https://aclanthology.org/2024.findings-emnlp.679/. doi:10.18653/v1/2024.findings-emnlp.679.</p>
        <p>[13] X. Hong, A. Sayeed, K. Mehra, V. Demberg, B. Schiele, Visual Writing Prompts: Character-Grounded Story Generation with Curated Image Sequences, Transactions of the Association for Computational Linguistics 11 (2023) 565–581. URL: https://aclanthology.org/2023.tacl-1.33. doi:10.1162/tacl_a_00553.</p>
        <p>[14] X. Hong, K. Mehra, A. Sayeed, V. Demberg, Visually Grounded Story Generation Challenge, in: S. Mille (Ed.), Proceedings of the 16th International Natural Language Generation Conference: Generation Challenges, Association for Computational Linguistics, Prague, Czechia, 2023, pp. 17–22. URL: https://aclanthology.org/2023.inlg-genchal.3.</p>
        <p>[15] Q. Huang, Y. Xiong, A. Rao, J. Wang, D. Lin, MovieNet: A Holistic Dataset for Movie Understanding, in: A. Vedaldi, H. Bischof, T. Brox, J.-M. Frahm (Eds.), Computer Vision – ECCV 2020, Springer International Publishing, Cham, 2020, pp. 709–727.</p>
        <p>[16] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in: Text Summarization Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL: https://aclanthology.org/W04-1013.</p>
        <p>[17] R. Vedantam, C. Lawrence Zitnick, D. Parikh, CIDEr: Consensus-based Image Description Evaluation, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4566–4575.</p>
        <p>[18] K. S. Smith, W. Aziz, L. Specia, Cohere: A Toolkit for Local Coherence, in: N. Calzolari, K. Choukri, T. Declerck, S. Goggi, M. Grobelnik, B. Maegaard, J. Mariani, H. Mazo, A. Moreno, J. Odijk, S. Piperidis (Eds.), Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC‘16), European Language Resources Association (ELRA), Portorož, Slovenia, 2016, pp. 4111–4114. URL: https://aclanthology.org/L16-1649/.</p>
        <p>[19] X. Hong, V. Demberg, A. Sayeed, Q. Zheng, B. Schiele, Visual Coherence Loss for Coherent and Visually Grounded Story Generation, in: A. Rogers, J. Boyd-Graber, N. Okazaki (Eds.), Findings of the Association for Computational Linguistics: ACL 2023, Association for Computational Linguistics, Toronto, Canada, 2023, pp. 9456–9470. URL: https://aclanthology.org/2023.findings-acl.603/. doi:10.18653/v1/2023.findings-acl.603.</p>
        <p>[20] X. Hong, A. Sayeed, V. Demberg, Summary of the Visually Grounded Story Generation Challenge, in: S. Mille, M.-A. Clinciu (Eds.), Proceedings of the 17th International Natural Language Generation Conference: Generation Challenges, Association for Computational Linguistics, Tokyo, Japan, 2024, pp. 39–46. URL: https://aclanthology.org/2024.inlg-genchal.3/.</p>
        <p>[21] K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.</p>
        <p>[22] S. Ren, K. He, R. Girshick, J. Sun, Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, in: C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, R. Garnett (Eds.), Advances in Neural Information Processing Systems, volume 28, 2015.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>A. Models</title>
      <sec id="sec-6-1">
        <p>
          For off-the-shelf VLMs, we use the LLaVA version 1.6, Qwen2.5-VL-7B-Instruct, and DeepSeek-vl-7b-chat
models, and access them using the HuggingFace transformers library.<sup>5</sup> We note that the VLMs follow
the model architecture presented in Surikuchi et al. [
          <xref ref-type="bibr" rid="ref16">29</xref>
          ]. The specific vision encoder and language
decoder components of these VLMs are outlined in Table 1.
        </p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>B. Training and Inference</title>
      <sec id="sec-7-1">
        <title>For generating stories using o#-the-shelf VLMs, we used the following prompts:</title>
      </sec>
      <sec id="sec-7-2">
        <title>5https://huggingface.co/docs/transformers/</title>
        <p>P1 = 'Write a story using exactly [num-images-in-the-sequence] sentences for
this image sequence. Do not use more than [num-images-in-the-sequence]
sentences.'
P2 = 'Generate a story consisting of [num-images-in-the-sequence] sentences for
this image sequence. Use only [num-images-in-the-sequence] sentences and not
more.'
P3 = 'Output a story about this sequence of images using only
[num-images-in-the-sequence] sentences. Make sure the story does not include
more than [num-images-in-the-sequence] sentences.'</p>
      </sec>
      <sec id="sec-7-3">
        <title>The number of images in the input sequences of the VWP dataset ranged between 5 and 10, and the prompts were adapted accordingly during inference.</title>
      </sec>
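<p>The prompt instantiation described above can be sketched as follows; the template ids P1–P3 follow the appendix, while the helper function and its name are our own illustration.</p>

```python
# Hypothetical sketch of filling the [num-images-in-the-sequence] slot
# in the appendix prompts P1-P3 for a given VWP image sequence.
PROMPTS = {
    "P1": ("Write a story using exactly {n} sentences for this image "
           "sequence. Do not use more than {n} sentences."),
    "P2": ("Generate a story consisting of {n} sentences for this image "
           "sequence. Use only {n} sentences and not more."),
    "P3": ("Output a story about this sequence of images using only "
           "{n} sentences. Make sure the story does not include more "
           "than {n} sentences."),
}

def build_prompt(template_id: str, num_images: int) -> str:
    """Instantiate a prompt template for a sequence of num_images images."""
    # VWP sequences contain between 5 and 10 images.
    if not 5 <= num_images <= 10:
        raise ValueError("VWP sequences contain between 5 and 10 images")
    return PROMPTS[template_id].format(n=num_images)
```
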
    </sec>
    <sec id="sec-8">
      <title>C. Generated Stories</title>
      <p>Human Story A man ducks for cover. There are bullet holes in the wall near him, and he covers his face in fear. He tries to get away, but
another person grabs him from behind. He holds him in place as the man struggles to get away. Suddenly, the bullets start
flying. Both men duck or dive for safety. Food and bits of plaster go flying as the bullets fill the room. Two men are hit, and
they both go flying back.</p>
      <p>Qwen2.5-VL In the opulent room, a man in a tuxedo was playing the accordion, oblivious to the chaos unfolding around him. Suddenly,
another man burst into the room, his face contorted with anger as he lunged at the first man. The second man’s attack was
met with a swift counter from the first, who managed to push him back. As they struggled, a cloud of dust filled the air,
obscuring their movements. In the midst of the commotion, a third man appeared, his expression one of shock and fear as he
watched the two men grapple.</p>
      <p>Human Story [female0] ran towards [male0] after she saw him fall down. She asked him if he was okay and if he could hear her. [male0]
laid on the ground as the pool of blood surrounded him. He was fading fast due to the loss of blood. Finally the paramedics
arrived to attend to him.</p>
      <p>Qwen2.5-VL The emergency responders hurriedly attended to the injured man, their faces reflecting the urgency of the situation. Nearby,
a woman watched with a mix of fear and determination, her eyes wide as she tried to understand what was happening. The
man lay unconscious on the ground, his body showing signs of severe trauma. As the paramedics worked quickly, another
person knelt beside him, offering silent support and comfort. Despite the chaos, there was a sense of coordinated effort as
they prepared to transport the injured man to safety.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1] T.-H. K. Huang, F. Ferraro, N. Mostafazadeh, I. Misra, A. Agrawal, J. Devlin, R. Girshick, X. He, P. Kohli, D. Batra, C. L. Zitnick, D. Parikh, L. Vanderwende, M. Galley, M. Mitchell, Visual Storytelling, in: K. Knight, A. Nenkova, O. Rambow (Eds.), Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Association for Computational Linguistics, San Diego, California, 2016, pp. 1233–1239. URL: https://aclanthology.org/N16-1147. doi:10.18653/v1/N16-1147.
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2] T. Kim, M.-O. Heo, S. Son, K.-W. Park, B.-T. Zhang, GLAC Net: GLocal Attention Cascading Networks for Multi-image Cued Story Generation, CoRR abs/1805.10973 (2018).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3] X. Wang, W. Chen, Y.-F. Wang, W. Y. Wang, No Metrics Are Perfect: Adversarial Reward Learning for Visual Storytelling, in: I. Gurevych, Y. Miyao (Eds.), Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 899–909. URL: https://aclanthology.org/P18-1083. doi:10.18653/v1/P18-1083.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4] C.-C. Hsu, Z.-Y. Chen, C.-Y. Hsu, C.-C. Li, T.-Y. Lin, T.-H. Huang, L.-W. Ku, Knowledge-Enriched Visual Storytelling, Proceedings of the AAAI Conference on Artificial Intelligence 34 (2020) 7952–7960. URL: https://ojs.aaai.org/index.php/AAAI/article/view/6303. doi:10.1609/aaai.v34i05.6303.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5] C.-y. Hsu, Y.-W. Chu, T.-H. Huang, L.-W. Ku, Plot and Rework: Modeling Storylines for Visual Storytelling, in: C. Zong, F. Xia, W. Li, R. Navigli (Eds.), Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 4443–4453. URL: https://aclanthology.org/2021.findings-acl.390. doi:10.18653/v1/2021.findings-acl.390.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6] H. Chen, Y. Huang, H. Takamura, H. Nakayama, Commonsense Knowledge Aware Concept Selection For Diverse and Informative Visual Storytelling, Proceedings of the AAAI Conference on Artificial Intelligence 35 (2021) 999–1008. URL: https://ojs.aaai.org/index.php/AAAI/article/view/16184. doi:10.1609/aaai.v35i2.16184.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7] Y. Yu, J. Chung, H. Yun, J. Kim, G. Kim, Transitional Adaptation of Pretrained Models for Visual Storytelling, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2021, pp. 12658–12668.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040. doi:10.3115/1073083.1073135.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9] S. Banerjee, A. Lavie, METEOR: An automatic metric for MT evaluation with improved correlation with human judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, Ann Arbor, Michigan, 2005, pp. 65–72.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [23] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language Models are Unsupervised Multitask Learners, OpenAI blog (2019). URL: https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [24] H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, T. Scialom, Llama 2: Open Foundation and Fine-Tuned Chat Models, 2023. arXiv:2307.09288.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [25] H. Liu, C. Li, Q. Wu, Y. J. Lee, Visual instruction tuning, in: A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, S. Levine (Eds.), Advances in Neural Information Processing Systems, volume 36, Curran Associates, Inc., 2023, pp. 34892–34916. URL: https://proceedings.neurips.cc/paper_files/paper/2023/file/6dcf277ea32ce3288914faf369fe6de0-Paper-Conference.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [26] Q. Team, Qwen2.5-VL, 2025. URL: https://qwenlm.github.io/blog/qwen2.5-vl/.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [27] H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, Y. Sun, C. Deng, H. Xu, Z. Xie, C. Ruan, DeepSeek-VL: Towards real-world vision-language understanding, 2024. arXiv:2403.05525.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <article-title>Swin transformer: Hierarchical vision transformer using shifted windows</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10012</fpage>
          -
          <lpage>10022</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>A. K.</given-names>
            <surname>Surikuchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernández</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pezzelle</surname>
          </string-name>
          ,
          <article-title>Natural language generation from visual sequences: Challenges and future directions</article-title>
          ,
          <year>2025</year>
          . URL: https://arxiv.org/abs/2502.13034. arXiv:
          <volume>2502</volume>
          .
          <fpage>13034</fpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>