<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Novel Evaluation Framework for Image2Text Generation</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Jia-Hong Huang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Hongyi Zhu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yixian Shen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Stevan Rudinac</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Alessio M. Pacces</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Evangelos Kanoulas</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Amsterdam</institution>
          ,
          <country country="NL">Netherlands</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Evaluating the quality of automatically generated image descriptions is challenging, requiring metrics that capture various aspects such as grammaticality, coverage, correctness, and truthfulness. While human evaluation offers valuable insights, its cost and time-consuming nature pose limitations. Existing automated metrics like BLEU, ROUGE, METEOR, and CIDEr aim to bridge this gap but often show weak correlations with human judgment. We address this challenge by introducing a novel evaluation framework rooted in a modern large language model (LLM), such as GPT-4 or Gemini, capable of image generation. In our proposed framework, we begin by feeding an input image into a designated image captioning model, chosen for evaluation, to generate a textual description. Using this description, an LLM then creates a new image. By extracting features from both the original and LLM-created images, we measure their similarity using a designated similarity metric. A high similarity score suggests that the image captioning model has generated an accurate textual description, while a low similarity score indicates discrepancies, revealing potential shortcomings in the model's performance. Human-annotated reference captions are not required in our proposed evaluation framework, which serves as a valuable tool for evaluating the effectiveness of image captioning models. Its efficacy is confirmed through human evaluation.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Captioning</kwd>
        <kwd>Metrics for Automated Evaluation</kwd>
        <kwd>Large Language Models</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The evaluation of sentences generated through automated methods remains a formidable challenge in
the realm of image captioning. Current metrics for evaluating image descriptions aim to gauge multiple
desirable attributes, such as grammaticality, coverage of crucial aspects, correctness, truthfulness, and
more. Human evaluation plays a pivotal role in quantifying these properties, utilizing separate Likert
scales or pairwise scales [
        <xref ref-type="bibr" rid="ref1 ref2 ref3 ref4 ref5 ref6">1, 2, 3, 4, 5, 6</xref>
        ]. However, due to the expensive, challenging-to-reproduce, and
time-consuming nature of human studies, there is a growing need for automated evaluation measures.
For practical utility, these automated metrics should align closely with human judgment. Therefore,
the challenge in designing such an automatic metric lies in integrating the aforementioned diverse
evaluation attributes into a unified measure of sentence quality.
      </p>
      <p>
        Several automated metrics, including BLEU [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], ROUGE [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], METEOR [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], CIDEr [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and more,
have been introduced to assess image descriptions generated by automated approaches. BLEU, initially
designed for machine translation, relies on precision, while ROUGE, originating from the summarization
community, is a recall-based metric. METEOR is tailored for assessing the overall quality of image
descriptions. Nonetheless, research has indicated a weak correlation between these metrics and human
judgment [
        <xref ref-type="bibr" rid="ref10 ref11 ref12 ref4">10, 4, 11, 12</xref>
        ]. In contrast, the consensus-based metric CIDEr measures the similarity between
a generated sentence and a set of ground truth sentences authored by humans, demonstrating high
agreement with human consensus. However, preparing a set of ground truth sentences in advance
is a prerequisite for CIDEr. If the quantity of human-authored ground truth sentences is insufficient,
CIDEr may struggle to effectively evaluate image descriptions [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. A similar limitation is observed in
the CLAIR method [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and other aforementioned approaches. Some metrics involve caption ranking
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ] but are limited in evaluating novel image descriptions.
      </p>
      <p>[Figure 1. A typical image captioning pipeline: an input image is encoded by a pre-trained CNN (image encoder) into feature vectors/maps, and a transformer-based language generator produces a caption such as “Three dogs are in a sunny park.”]</p>
      <p>
        In addressing the above challenge, we present a novel framework for evaluating image descriptions.
This framework is rooted in the utilization of a modern LLM approach, e.g., GPT-4 [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] or Gemini [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ],
capable of generating images. The advancement of LLMs [
        <xref ref-type="bibr" rid="ref16 ref17">16, 17</xref>
        ], exemplified by models like GPT-4,
empowers us to provide textual descriptions, i.e., prompts, for generating images that closely correspond
and align with the semantic meaning conveyed in the given text. The underlying design philosophy of
the proposed framework hinges on the idea that if an image captioning model is effective,
the description it generates should be sufficiently accurate for an LLM to reconstruct the same or
a highly similar image compared to the original input image. The ongoing evolution
of LLM technology forms the bedrock of the proposed framework.
      </p>
      <p>Starting with the definition of the image captioning task, as illustrated in Figure 1, our proposed
framework begins by taking an image as input. Subsequently, this input undergoes processing through
a given image captioning model, generating a textual description for the initial image. Following this, a
given LLM, such as GPT-4, is employed to generate an image based on the textual description. Then,
we extract the image features from both the original input image and the LLM-generated image, and
assess their similarity using the cosine similarity metric. It is worth noting that human-annotated
reference captions are not needed in our proposed evaluation framework. In the proposed evaluation
framework, a high cosine similarity score is anticipated if the generated text-based description is of
sufficient quality, signifying that the LLM can accurately reproduce an image highly similar to the
original input. Conversely, if the generated text-based description lacks accuracy, the image produced
by the LLM will deviate from the original input image, leading to a low cosine similarity score. This
incongruity suggests suboptimal performance of the image captioning model. Consequently, the
proposed framework proves valuable for evaluating the efficacy of a given image captioning model.
The main contributions of this work are summarized as follows:
• Innovative Framework for Image Captioning Model Evaluation: We present a novel
framework that relies on the utilization of an LLM, such as GPT-4 or Gemini, to evaluate the
quality of image descriptions generated by an image captioning model. The proposed evaluation
framework does not necessitate human-annotated reference captions.
• Human Evaluation of the Framework: To verify the effectiveness of our evaluation framework,
we introduce a human-annotated dataset and conduct human evaluations.
• Comprehensive Experiments on Established Datasets: We perform extensive experiments
to demonstrate the efficacy of the proposed evaluation framework using widely-used image
captioning datasets.</p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>In this section, we review related literature, covering existing
image captioning methods, the evolution of automated metrics, and the latest advancements in LLM
technology.</p>
      <sec id="sec-2-1">
        <title>2.1. Image Captioning Methods</title>
        <p>
          The encoder-decoder network architecture has become a cornerstone in the field of image captioning, as
evidenced by various studies [
          <xref ref-type="bibr" rid="ref18 ref19 ref20 ref21 ref22 ref23 ref24 ref25 ref26 ref27 ref28 ref29 ref30 ref31 ref32 ref33 ref34">18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34</xref>
          ]. Typically, these
networks employ a CNN as the encoder for extracting global image features, and an RNN as the decoder
for generating word sequences. [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] introduces a method for generating referring expressions, which
are descriptions for specific objects or regions within an image. In [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ], the bidirectional LSTM-based
method for image captioning takes advantage of both past and future information to learn long-term
visual-language interactions. Attention mechanisms have significantly enhanced the performance of
image captioning models. [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] introduces an area-based attention model that predicts the next word
and the corresponding image regions at each RNN timestep. While these advancements represent
significant strides, they predominantly focus on single-image based description generation. However,
certain abstract concepts or descriptions might not be fully captured using only image data [
          <xref ref-type="bibr" rid="ref38 ref39">38, 39</xref>
          ].
[
          <xref ref-type="bibr" rid="ref27 ref28">28, 27</xref>
          ] have explored the use of expert-defined keyword sequences to augment model capabilities
in generating more accurate and contextually relevant descriptions. Recent advancements have also
explored transformer-based architectures, such as Vision Transformers (ViT), which have shown
promise in capturing finer details and global context in images for caption generation [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ]. Furthermore,
the integration of multimodal learning approaches, where models are trained on both visual and textual
data, has led to significant improvements in generating contextually richer and more nuanced image
descriptions [
          <xref ref-type="bibr" rid="ref41">41</xref>
          ].
        </p>
        <p>
          The domain of medical image captioning has witnessed significant advancements, particularly
through methods that meld human expertise with algorithmic prowess. [
          <xref ref-type="bibr" rid="ref42">42</xref>
          ] has developed a Hybrid
Retrieval-Generation Reinforced Agent, which integrates human prior knowledge with AI-based caption
generation for medical images. This agent alternates between a generative module and a retrieval
mechanism that utilizes a template database reflecting human expertise, thereby producing
multifaceted, sequential sentences. [
          <xref ref-type="bibr" rid="ref39">39</xref>
          ] has contributed to this field with a multi-task learning framework
that simultaneously predicts tags and generates captions. Their method, which focuses on abnormal
areas in chest radiology images using an attention mechanism and a hierarchical LSTM, offers detailed
descriptions. These methods primarily focus on generating reports for chest radiology images, which
are structurally different in terms of object size and detail compared to retinal images [
          <xref ref-type="bibr" rid="ref27 ref38 ref43">38, 43, 27</xref>
          ].
Additionally, the color features in chest radiology and retinal images differ significantly, with the
former being predominantly grey-scale and the latter being colorful [
          <xref ref-type="bibr" rid="ref27 ref38">38, 27</xref>
          ]. Most existing methods rely
primarily on the image input for caption generation. Recent advancements also include the enhancement
of the CNN-RNN framework with the TransFuser model [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. This model adeptly combines features
from different modalities and addresses the challenge of incorporating unordered keyword sequences
with visual inputs, minimizing information loss [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ]. This development represents a significant stride in
medical image captioning, reflecting the growing complexity and capability of these methods. Further
progress in deep learning, particularly the application of ViTs, has offered promising results in medical
imaging [
          <xref ref-type="bibr" rid="ref44">44</xref>
          ]. ViTs excel in capturing intricate details and providing a broader context for more accurate
medical image analysis and caption generation.
        </p>
        <p>The evaluation framework proposed in this paper is versatile and capable of assessing any existing
image captioning approach.</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Automatic Metrics for Image Captioning</title>
        <p>
          The evolution of image captioning has been significantly influenced by the development and application
of automatic metrics for evaluating caption quality [
          <xref ref-type="bibr" rid="ref45 ref46 ref47 ref7 ref8 ref9">7, 8, 45, 9, 46, 47</xref>
          ]. These metrics guide the training
of captioning models and provide a scalable means for performance assessment. The BLEU score,
a pioneering metric by [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], gauges n-gram precision in generated text against a reference. ROUGE,
developed by [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ], emphasizes recall through the overlap of N-grams and longest common subsequences.
Subsequent innovations introduced refined approaches. METEOR, by [
          <xref ref-type="bibr" rid="ref45">45</xref>
          ], aligns more closely with
human judgment by incorporating synonym matching and stemming. In [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], the CIDEr metric,
specifically designed for image captioning, assesses the similarity of generated captions to a set of reference
captions. The SPICE metric by [
          <xref ref-type="bibr" rid="ref46">46</xref>
          ] evaluates semantic content and the depiction of objects, attributes,
and relationships. Additionally, the NLG-Eval toolkit by [
          <xref ref-type="bibr" rid="ref47">47</xref>
          ] provides a comprehensive suite of metrics
for a more holistic evaluation of natural language generation. However, these metrics have limitations.
Metrics like BLEU and ROUGE often fail to capture the contextual nuances of captions [
          <xref ref-type="bibr" rid="ref7 ref8">7, 8</xref>
          ]. The
challenge of evaluating creativity and novelty in caption generation is also evident, as automated metrics
may penalize deviations from standard references [
          <xref ref-type="bibr" rid="ref46 ref9">9, 46</xref>
          ]. Recently, advancements like BERTScore [
          <xref ref-type="bibr" rid="ref48">48</xref>
          ]
and CLIPScore [
          <xref ref-type="bibr" rid="ref49">49</xref>
          ], which utilize contextual embeddings and visual-textual alignment, respectively,
have been proposed to address these challenges.
        </p>
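        <p>To make the n-gram matching underlying BLEU-style metrics concrete, the following sketch computes clipped (modified) n-gram precision for a single candidate-reference pair; the function names are our own, and the single-reference form without a brevity penalty is a simplification of the full metric, not its official implementation.</p>

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision, the core idea of BLEU.

    Each candidate n-gram counts only up to the number of times
    it appears in the reference (simplified: one reference,
    no brevity penalty).
    """
    cand_counts = Counter(ngrams(candidate, n))
    ref_counts = Counter(ngrams(reference, n))
    overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
    total = sum(cand_counts.values())
    return overlap / total if total else 0.0
```

        <p>Such surface overlap is exactly what makes these metrics cheap to compute, and also why they can miss paraphrases that a human judge would accept.</p>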
        <p>In this study, human evaluation is employed to validate the effectiveness of the proposed evaluation
framework.</p>
        <p>[Figure 2. Overview of the proposed evaluation framework: the original input image is passed to an Image Captioning Module (e.g., InstructBLIP or an LSTM-based model) to generate text descriptions; an LLM (e.g., GPT-4 or Gemini) reconstructs an image from the descriptions; a Feature Extraction Module (e.g., ViT-g/14 or VGG-16) produces feature vectors for the original input image and the reconstructed image; and a Similarity Calculator (e.g., cosine similarity or L2-norm) compares the two vectors.]</p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Large Language Models</title>
        <p>
          The advent of LLMs has significantly reshaped the landscape of natural language processing (NLP)
and Artificial Intelligence (AI). Pioneering models such as GPT, developed by [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], and BERT by
[
          <xref ref-type="bibr" rid="ref50">50</xref>
          ], have marked critical milestones in this evolution. These models, characterized by their vast
number of parameters and advanced deep learning architectures, have enhanced the capacity to
understand and generate human language, excelling in diverse tasks like translation, summarization,
and question-answering [
          <xref ref-type="bibr" rid="ref50 ref51">50, 51</xref>
          ]. The efficacy of LLMs such as GPT, which utilizes a
transformer-based architecture, stems from their comprehensive training across a broad spectrum of internet text,
enabling the generation of coherent and contextually pertinent language [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. BERT’s introduction
of bidirectional transformers has revolutionized pre-training in language understanding, showing
remarkable efficiency in tasks requiring intricate contextual comprehension [
          <xref ref-type="bibr" rid="ref50">50</xref>
          ]. The incorporation of
attention mechanisms, as conceptualized by [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], has further refined these models’ ability for nuanced
understanding and text generation. In the realm of image captioning, the deployment of LLMs like
GPT-3 has brought transformative changes. GPT-3’s adeptness in image captioning tasks is a testament
to its sophisticated transformer-based architecture and comprehensive training on a wide array of
internet text. This extensive training enables GPT-3 to intricately understand and generate content
that accurately aligns with both textual and visual contexts, producing coherent, contextually relevant,
and detailed image descriptions [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ]. The fusion of LLMs with advanced computer vision techniques
has been a significant leap forward, leading to the development of more sophisticated systems. These
systems are now better equipped to interpret and describe complex visual data with greater accuracy
and nuance [
          <xref ref-type="bibr" rid="ref52">52</xref>
          ]. This integration highlights the evolving capability of AI to understand and convey the
subtleties of visual information, mirroring a more human-like perception and articulation of images.
This advancement in image captioning technology is pivotal in enhancing how machines process and
narrate visual data, bridging the gap between visual perception and linguistic expression. Furthermore,
the use of LLMs goes beyond generating captions to evaluating their quality. A notable method in this
regard is CLAIR [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ], which leverages zero-shot language modeling to assess caption quality. CLAIR
shows a stronger correlation with human judgment compared to traditional metrics like BLEU, ROUGE,
METEOR, and CIDEr. By soliciting an LLM to rate how likely a candidate caption accurately describes an
image relative to a set of reference captions, CLAIR outperforms language-only measures, approaching
human-level correlation. However, CLAIR requires a set of human-annotated reference captions to
function, without which it cannot be applied.
        </p>
        <p>In this work, the proposed approach leverages modern LLMs like GPT-4 for an innovative and
comprehensive evaluation. We use LLMs to reverse-engineer the image captioning process, generating
images from textual descriptions to assess caption accuracy. This method offers a unique advantage in
evaluating the semantic richness and contextual relevance of captions. By comparing the generated
images with the original, our approach provides a direct, visual assessment of caption quality, moving
beyond mere textual analysis. This novel methodology not only aligns with human perception but
also embraces the creativity and diversity inherent in image captioning, offering a more rounded and
practical evaluation framework.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The proposed evaluation framework comprises several key components: an image captioning module, an
LLM-based text-to-image generator, an image feature extraction module, and a similarity calculator, as
depicted in Figure 2. Each of these components will be introduced in detail in the following subsections.
Furthermore, to ensure the validity of the evaluation results based on our framework—specifically,
their alignment with human judgment—we introduce a human-annotated image captioning dataset to
validate the effectiveness of the proposed framework.</p>
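      <p>The interaction of the four components can be sketched in a few lines of Python; the function names and the pluggable callables below are illustrative placeholders standing in for the actual models, not a production implementation.</p>

```python
def evaluate_captioning_model(image, caption_model, llm_text_to_image,
                              feature_extractor, similarity):
    """Reference-free evaluation loop of the proposed framework (sketch).

    caption_model:     image -> text   (e.g., InstructBLIP in Section 4)
    llm_text_to_image: text  -> image  (e.g., GPT-4 with a diffusion model)
    feature_extractor: image -> vector (e.g., ViT-g/14)
    similarity:        (vector, vector) -> score (e.g., cosine similarity)
    """
    caption = caption_model(image)              # step 1: generate a description
    reconstructed = llm_text_to_image(caption)  # step 2: reconstruct an image
    i_o = feature_extractor(image)              # step 3: features of the original
    i_g = feature_extractor(reconstructed)      #         and the reconstruction
    return similarity(i_o, i_g)                 # step 4: high score = good caption
```

      <p>Note that no human-annotated reference caption appears anywhere in this loop; the original image itself serves as the ground truth.</p>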
      <sec id="sec-3-1">
        <title>3.1. Image Captioning Module</title>
        <p>
          The module incorporates an image captioning model, which will undergo evaluation using the proposed
framework. This module takes an image as input and generates a text-based description as output.
To facilitate user comprehension of the proposed evaluation framework, we utilize the InstructBLIP
model [
          <xref ref-type="bibr" rid="ref53">53</xref>
          ] as an illustrative example in Section 4. This demonstration showcases the entire process
of leveraging the proposed framework to evaluate a given image captioning model, making it easily
understandable for users.
        </p>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. LLM-based Text-to-Image Generator</title>
        <p>
          Numerous studies [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
          ] have demonstrated the proficiency of LLM-based image generators,
exemplified by models like GPT-4, in producing high-quality images that closely align with the semantic
meaning of provided text-based prompts. Specifically, DALL-E, the image generation model integrated
into GPT-4 and originally built as a 12-billion-parameter variant of GPT-3, is engineered by OpenAI
to generate images based on textual descriptions, drawing from a dataset comprising text-image
pairs. Its versatile capabilities include crafting anthropomorphized versions of animals and objects,
seamlessly combining unrelated concepts, rendering text, and applying transformations to existing
images. In the context of the proposed framework, the LLM-based image generator utilizes the
text-based image description generated by a preceding image captioning model. If the image captioning
model performs well, generating a high-quality and accurate image description, the LLM-based image
generator subsequently creates an image that is similar to the original input image. This connection
highlights the interplay between effective image captioning and the generation of corresponding images
by the LLM-based approach.
        </p>
      </sec>
      <sec id="sec-3-3">
        <title>3.3. Image Feature Extraction Module</title>
        <p>
          The image feature extraction module primarily consists of a pre-trained image encoder. This module
takes an image as input and produces a feature vector representing the input image as output. To enhance
user understanding of the proposed evaluation framework, we employ ViT-g/14 [
          <xref ref-type="bibr" rid="ref54">54</xref>
          ] as a demonstrative
example for image feature extraction in Section 4. ViT-g/14 is a vanilla ViT pre-trained for reconstructing
masked-out image-text aligned vision features conditioned on visible image patches. Through this
pretext task, the model efficiently scales up to one billion parameters, achieving notable performance
across various vision downstream tasks, including image recognition, video action recognition, object
detection, instance segmentation, and semantic segmentation, all without extensive supervised training.
This demonstration in Section 4 highlights the complete process, encompassing image feature extraction
for calculating similarity scores between the input and generated images. It illustrates how the proposed
framework can be leveraged to assess a given image captioning model, providing users with a clear
understanding. It is worth noting that the image feature extractor can be substituted with other
pre-trained CNNs, such as VGG-16 [
          <xref ref-type="bibr" rid="ref55">55</xref>
          ] or ResNet-50 [
          <xref ref-type="bibr" rid="ref56">56</xref>
          ].
        </p>
      </sec>
      <sec id="sec-3-4">
        <title>3.4. Similarity Calculator</title>
        <p>
          Cosine similarity, as defined in Equation (1), serves as a metric for quantifying the similarity between
two vectors in a multi-dimensional space. It evaluates the cosine of the angle between these vectors,
offering insight into their degree of similarity or dissimilarity. The advantage of cosine similarity lies in
its ability to assess directional similarity rather than magnitude, rendering it robust against variations
in scale and orientation. This characteristic makes it a widely adopted metric in diverse domains,
including image processing and NLP. In these fields, cosine similarity is frequently employed to assess
the similarity between images, documents, or sentences represented as vectors in high-dimensional
spaces. The cosine similarity value CosSim(·, ·) ∈ [−1, 1], where a value of 1 signifies that the vectors
are identical in direction, 0 indicates orthogonality (i.e., no similarity), and −1 indicates that the
vectors point in diametrically opposite directions.
        </p>
        <p>[Figure 3. Examples from the human-annotated dataset, each image paired with five captions, e.g., “A dog wearing a leash laying next to an orange frisbee”; “A skateboarder is jumping down a flight of stairs.”; “A group of five sheep wait outside a barn.”; “A stuffed panda is on the living room table.”]</p>
        <p>CosSim(io, ig) = (io · ig) / (‖io‖ ‖ig‖), (1)
where io · ig denotes the dot product (also known as the inner product) of the original input image
feature vector io and the LLM-generated image feature vector ig, while ‖io‖ and ‖ig‖ represent the
Euclidean norms (also known as the magnitudes or lengths) of vectors io and ig, respectively. In words,
cosine similarity measures the cosine of the angle between two vectors, capturing their similarity in
direction independently of magnitude.</p>
      </sec>
      <sec id="sec-3-5">
        <title>3.5. Human-annotated Image Captioning Dataset</title>
        <p>The Microsoft Common Objects in Context (MSCOCO) dataset is a comprehensive resource widely
used across various image recognition tasks including object detection, segmentation, and captioning.
Originally, the MSCOCO Captions dataset comprised over 330,000 images, annotated with objects
from 80 categories. Notably, both the training and validation sets feature each image accompanied
by five distinct human-generated captions. This dataset holds significant importance within the realm of
computer vision research, serving as a cornerstone for the development and evaluation of numerous
state-of-the-art object detection and segmentation models. In our study, we enhance the existing MSCOCO
Caption dataset by incorporating an additional 30,000 human-annotated image-description pairs. This
augmented dataset serves as the basis for evaluating the alignment of our proposed evaluation method
with human-annotated image descriptions. To aid in understanding the dataset, several examples from
the dataset are provided in Figure 3.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments and Analysis</title>
      <p>In this section, our goal is to evaluate the effectiveness of the proposed evaluation framework designed
for image captioning models. To achieve this, we will validate our framework using both the widely
adopted human-annotated image captioning datasets and our newly introduced dataset, the details of
which are outlined in the Section 3.5. Since all datasets have undergone human annotation, our primary
objective in this assessment is to ascertain whether the evaluation results obtained through our proposed
framework align with human consensus or judgment. To elaborate, a correct caption—matching
the human-annotated counterpart—should yield a substantial cosine similarity score between the
generated and original images, as measured by our evaluation framework. Conversely, an incorrect
caption—deviating from the human-annotated version—should result in a comparatively smaller cosine
similarity score. This approach allows us to empirically validate the effectiveness of our proposed
evaluation framework in aligning with human judgment.</p>
      <sec id="sec-4-1">
        <title>4.1. Experimental Settings</title>
        <p>
          To illustrate the application of the proposed framework for evaluating an image captioning model, we
employ the InstructBLIP [
          <xref ref-type="bibr" rid="ref57">57</xref>
          ] model in our image captioning module. This model is equipped with
the pre-trained language model Vicuna-7B [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ] to generate image descriptions. Image captions are
generated using the prompt “&lt;Image&gt; A short image caption:”, guiding the model to produce sentences
of fewer than 100 tokens, excluding special symbols. For text-to-image generation, GPT-4 with the
built-in diffusion model DALL-E 3 is employed. Notably, the diffusion model can be replaced by Stable
Diffusion models [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ], utilizing a fixed, pre-trained encoder (ViT-g/14) [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ], and the entire diffusion
model is pre-trained on the LAION-2B dataset [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ]. Human evaluation serves as the validation method
for the proposed framework. Each image in the dataset comes with five human-annotated image
captions, and performance is quantified using the average cosine similarity score, as detailed in Section
4.3. The experiments are conducted on two NVIDIA A6000 GPUs.
        </p>
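<p>As a concrete illustration of the captioning module described above, the following sketch invokes InstructBLIP with the Vicuna-7B backbone through the Hugging Face transformers library; the checkpoint identifier, prompt handling, and decoding settings here are illustrative assumptions rather than the exact pipeline used in our experiments.</p>
<p>
```python
def generate_caption(image, max_tokens=100):
    """Caption an image with InstructBLIP (Vicuna-7B backbone).

    Sketch only: the checkpoint name and decoding settings are
    illustrative assumptions, not the paper's exact configuration.
    """
    from transformers import (InstructBlipProcessor,
                              InstructBlipForConditionalGeneration)

    checkpoint = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint id
    processor = InstructBlipProcessor.from_pretrained(checkpoint)
    model = InstructBlipForConditionalGeneration.from_pretrained(checkpoint)

    # The paper's prompt is the image followed by "A short image caption:";
    # the processor injects the image tokens automatically.
    inputs = processor(images=image, text="A short image caption:",
                       return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=max_tokens)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()
```
</p>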
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Datasets</title>
        <p>MSCOCO Dataset [62]. The MSCOCO dataset comprises two primary components: the images and
their corresponding annotations. The images are organized into a directory hierarchy, with top-level
directories for the train, validation, and test sets. Annotations are provided in JSON format, with each
annotation record corresponding to a single image. Each record includes details such as the image file name,
dimensions (width and height), a list of objects with their respective class labels (e.g., “person,” “car”),
bounding box coordinates (x, y, width, height), a segmentation mask (in polygon or RLE format), keypoints
and their positions (if available), and five captions describing the scene. Additional information provided
by the MSCOCO dataset includes image super categories, license details, and COCO-Stuff annotations
(pixel-wise annotations for stuff classes in addition to the 80 object classes). The MSCOCO dataset
provides various types of annotations, including object detection with bounding box coordinates and
full segmentation masks for 80 different object categories, stuff image segmentation with pixel maps covering 91
amorphous background areas, panoptic segmentation identifying items in images based on 80 “thing”
and 91 “stuff” categories, dense pose annotations featuring over 39,000 photos with mappings between
pixels and a template for over 56,000 tagged persons, 3D model annotations and natural language
descriptions for each image, and keypoint annotations for over 250,000 persons annotated with keypoints
such as the right eye, nose, and left hip.</p>
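<p>The bounding-box convention above is worth spelling out: MSCOCO stores boxes as (x, y, width, height) rather than corner pairs. A minimal sketch of reading such an annotation record, with made-up field values rather than real dataset entries:</p>
<p>
```python
# Illustrative, simplified annotation record in the MSCOCO style;
# field names and values are made up for this example.
record = {
    "file_name": "example_000001.jpg",
    "width": 640,
    "height": 480,
    "objects": [
        {"category": "person", "bbox": [10.0, 20.0, 110.0, 230.0]},
        {"category": "car", "bbox": [300.0, 150.0, 180.0, 90.0]},
    ],
}

def bbox_to_corners(bbox):
    # MSCOCO boxes are (x, y, width, height); convert to (x1, y1, x2, y2).
    x, y, w, h = bbox
    return (x, y, x + w, y + h)

corners = [bbox_to_corners(obj["bbox"]) for obj in record["objects"]]
```
</p>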
        <p>Flickr30k Dataset [63]. The authors in [63] advocate for utilizing the visual denotations of linguistic
expressions, represented by the set of images they describe, to define new denotational similarity metrics.
These metrics, as demonstrated in [63], prove to be at least as advantageous as distributional similarities
for tasks requiring semantic inference. The computation of these denotational similarities involves the
construction of a denotation graph—a subsumption hierarchy over constituents and their denotations.
This graph is established using a substantial corpus comprising 30,000 images and 150,000 descriptive
captions. The creation of this denotation graph involves the development of an image caption corpus
by the authors in [63], consisting of 158,915 crowd-sourced captions describing 31,783 images. This
corpus serves as an extension of their previous work on the Flickr8k Dataset. The new images and
captions specifically focus on individuals engaged in everyday activities and events.</p>
      </sec>
      <sec id="sec-4-3">
        <title>4.3. Effectiveness Analysis of the Proposed Evaluation Framework</title>
        <p>Human Evaluation Using the Proposed Dataset. The dataset introduced in this work, consisting of
pairs of images and captions, has undergone human annotation. Each image is accompanied by five
distinct human-generated captions. The details of our human evaluation process are outlined below. In
Step 1, we directly utilize the human-annotated ground truth caption to generate an image through a
text-to-image LLM, such as GPT-4 or Gemini. In Step 2, we extract the image features of both the ground
truth caption’s corresponding image and the image generated by the text-to-image LLM. In Step 3, we
apply the cosine similarity formula from Section 3.4 to compute the cosine similarity scores between
these two sets of image features. Given that the caption is a human-annotated ground truth description,
accurately portraying the corresponding image, we expect the similarity score from Step 3 to be high.
Conversely, if a caption inaccurately describes a given image, the cosine similarity score from Step
3 should be low. Consistency between the experimental result and these expectations indicates the
effectiveness of the proposed evaluation framework in aligning with human consensus.</p>
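<p>The three steps above can be sketched as follows; the image generator and feature extractor are passed in as callables, standing in for the text-to-image LLM and the pre-trained image encoder (any concrete model choices here would be caller-supplied assumptions):</p>
<p>
```python
import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| ||v||), as defined in Section 3.4
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_caption(ground_truth_image, caption, generate_image, extract_features):
    """Steps 1-3 of the evaluation procedure.

    generate_image: text-to-image model (e.g., an LLM with image generation)
    extract_features: image encoder mapping an image to a feature vector
    Both are caller-supplied stand-ins for the framework's components.
    """
    generated_image = generate_image(caption)        # Step 1
    f_gt = extract_features(ground_truth_image)      # Step 2
    f_gen = extract_features(generated_image)
    return cosine_similarity(f_gt, f_gen)            # Step 3
```
</p>
<p>With stand-ins that reproduce the ground truth image exactly, the score is 1.0; unrelated images drive it toward 0, matching the expectation stated above.</p>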
        <p>
          The evaluation results depicted in Figure 4 reveal notable insights. The blue lines in Figure 4 illustrate
the impact of the provided captions on the cosine similarity scores. Specifically, when the provided
caption matches the correct human-annotated description (upper blue line), the average cosine similarity
score reaches approximately 0.67. Conversely, when the caption is incorrect (lower blue line), the
average cosine similarity score drops to around 0.47. This discrepancy results in a similarity gap of
approximately 0.2. These findings underscore the effectiveness of the proposed evaluation framework,
as it closely aligns with human judgment. It is noteworthy that the robustness of this human evaluation
method is attributed to the remarkable text-to-image generation capabilities of modern LLMs.
Widely recognized models such as GPT-4 and Gemini have been extensively acclaimed in various studies
and by the broader community [
          <xref ref-type="bibr" rid="ref14 ref15 ref16">14, 15, 16</xref>
          ].
        </p>
        <p>Assessment Using MSCOCO and Flickr30k Datasets. Figure 4 reveals consistent trends in the
evaluation results across MSCOCO, Flickr30k, and our dataset. Similar patterns are observed in MSCOCO
and Flickr30k, where there is a notable decrease in the average cosine similarity when the model-generated
image caption differs from the human-annotated ground truth caption. These findings affirm
the effectiveness and reliability of the proposed evaluation framework for assessing image captioning
models.</p>
        <p>Qualitative Analysis. To gain deeper insights into the performance of the proposed evaluation
framework, we present qualitative results in Figure 5 and Figure 6. In Figure 5, we observe that the
human-annotated ground truth captions and the model-predicted captions exhibit poor alignment
in these four examples. Given the accurate image generation capabilities of existing LLMs based on
text-based prompts, the accuracy of model-generated image descriptions is crucial. However, in these
instances, all predicted captions are incorrect, resulting in LLM-generated images that significantly
differ from the ground truth images. Consequently, this discrepancy contributes to the low cosine-based
similarity scores.</p>
        <p>In Figure 6, these two examples illustrate a strong alignment between the model-generated
descriptions and the human-generated ground truth captions. Hence, this alignment results in LLM-generated
images that closely resemble the ground truth images. As a result, when calculating cosine similarity
scores based on the image features extracted from the LLM-generated and ground truth images, the
scores are notably high. We also compute scores with the aforementioned text-based evaluation metrics
to highlight the advantage of our proposed method over them. In Figure 6, we observe
that despite the model-generated image captions closely matching the ground truth captions, the
scores based on text-based evaluation metrics are comparatively low. This observation underscores
the superiority of our proposed evaluation framework over existing text-based evaluation metrics for
image captioning models.</p>
        <p>BP = 1 if c &gt; r, and BP = exp(1 − r/c) if c ≤ r; BLEU = BP · exp(∑_{n=1}^{N} w_n log p_n). (2)</p>
        <p>Here r represents the effective length of the ground truth text, c signifies the length of the predicted
text, and BP stands for brevity penalty. The geometric mean of the adjusted n-gram precisions p_n is
calculated using n-grams up to a length of N, with positive weights w_n that sum to 1.</p>
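<p>Equation (2) can be implemented directly. The sketch below uses uniform weights w_n = 1/N and a small floor on the precisions to avoid log(0); this floor is one common smoothing choice for the sentence-level case, not necessarily the exact evaluation setup:</p>
<p>
```python
import math
from collections import Counter

def ngrams(tokens, n):
    # all contiguous n-grams of a token sequence
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU per Equation (2), uniform weights w_n = 1/N.

    Illustrative sketch; evaluation toolkits compute corpus-level BLEU
    with multiple references and proper smoothing.
    """
    c, r = len(candidate), len(reference)
    # Brevity penalty: 1 if the candidate is longer than the reference,
    # exp(1 - r/c) otherwise; min() covers both branches in one step.
    bp = min(1.0, math.exp(1.0 - r / c))
    log_p = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped counts: credit each candidate n-gram at most as often
        # as it occurs in the reference.
        clipped = sum(min(cnt, ref_counts[g]) for g, cnt in cand_counts.items())
        total = max(c - n + 1, 1)
        log_p.append(math.log(max(clipped / total, 1e-9)))  # floor avoids log(0)
    return bp * math.exp(sum(log_p) / max_n)
```
</p>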
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>In this study, we have introduced a novel framework for evaluating automatically generated image
descriptions, aiming to overcome the limitations of existing evaluation metrics like BLEU, ROUGE,
METEOR, and CIDEr. Our framework leverages advancements in LLMs such as GPT-4 or Gemini
to utilize image descriptions generated by an image captioning model for creating corresponding
images. By quantifying the cosine similarity between the representation of the original input image in
the image captioning model and the representation of the LLM-generated image, we can effectively
assess the model’s performance without relying on human-annotated reference captions. Through
extensive experiments on established datasets such as Flickr30k and MSCOCO, we have demonstrated
the effectiveness of the proposed evaluation framework. Our experimental results suggest that the
proposed framework's performance closely correlates with human judgment, offering a valuable method
for evaluating the effectiveness of image captioning models. Additionally, human evaluations conducted
on our introduced dataset validate the framework's efficacy in capturing various aspects such as
grammaticality, coverage, correctness, and truthfulness in automatically generated image descriptions.
Moving forward, the proposed framework presents new opportunities for evaluating image captioning
models, offering a more efficient and reliable alternative to traditional human evaluations and existing
automated evaluation metrics. It is designed to complement, rather than replace, human judgment. In
summary, our work contributes to the ongoing development of robust evaluation frameworks for image
captioning models, bridging the gap between automated metrics and human judgment, and driving
advancements in this field.</p>
      <p>Figure 5: Examples of incorrect model-predicted captions and the resulting low cosine-based image
similarity scores; each example shows the ground truth image and the image generated from the incorrect
caption. Example 1: ground truth “A man is sitting on a black motorcylce.”; predicted (incorrect): “A man
walks down the street next to a cow with horns.”; similarity 0.2078. Example 2: ground truth “Two smiling
women holding a big cake together”; predicted (incorrect): “A boy playing baseball waiting for a pitch”;
similarity 0.1474. Example 3: ground truth “A soccer player removes his shirt.”; predicted (incorrect):
“Men in athletic clothing stand near bicycles.”; similarity 0.1882. Example 4: ground truth “People hold a
presentation at a retirement home.”; predicted (incorrect): “A Man doing a high up jump on a bike with a
cityscape behind him”; similarity 0.1641.</p>
      <p>Figure 6: Limitations of text-based evaluation metrics in image captioning. See Equation (2) for the
calculation of the BLEU score. “Predicted caption” refers to the caption generated by the InstructBLIP
model. “Text to text similarity” indicates the cosine similarity between the human-annotated ground
truth caption and the model-generated caption using text-based CLIP embeddings. “Mean similarity”
represents the average of the five values of “Text to text similarity”.</p>
      <p>[62] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft
COCO: Common objects in context, in: Computer Vision – ECCV 2014: 13th European Conference,
Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V, Springer, 2014, pp. 740–755.
[63] P. Young, A. Lai, M. Hodosh, J. Hockenmaier, From image descriptions to visual denotations: New
similarity metrics for semantic inference over event descriptions, Transactions of the Association
for Computational Linguistics 2 (2014) 67–78.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>M.</given-names>
            <surname>Mitchell</surname>
          </string-name>
          , X. Han,
          <string-name>
            <surname>J</surname>
          </string-name>
          . Hayes, Midge:
          <article-title>Generating descriptions of images</article-title>
          ,
          <source>in: INLG 2012 Proceedings of the Seventh International Natural Language Generation Conference</source>
          ,
          <year>2012</year>
          , pp.
          <fpage>131</fpage>
          -
          <lpage>133</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>M.</given-names>
            <surname>Rohrbach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Qiu</surname>
          </string-name>
          , I. Titov,
          <string-name>
            <given-names>S.</given-names>
            <surname>Thater</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Pinkal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Schiele</surname>
          </string-name>
          ,
          <article-title>Translating video content to natural language descriptions</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>433</fpage>
          -
          <lpage>440</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Teo</surname>
          </string-name>
          , H.
          <string-name>
            <surname>Daumé</surname>
            <given-names>III</given-names>
          </string-name>
          , Y. Aloimonos,
          <article-title>Corpus-guided sentence generation of natural images</article-title>
          ,
          <source>in: Proceedings of the 2011 conference on empirical methods in natural language processing</source>
          ,
          <year>2011</year>
          , pp.
          <fpage>444</fpage>
          -
          <lpage>454</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>D.</given-names>
            <surname>Elliott</surname>
          </string-name>
          , F. Keller,
          <article-title>Image description using visual dependency representations</article-title>
          ,
          <source>in: Proceedings of the 2013 conference on empirical methods in natural language processing</source>
          ,
          <year>2013</year>
          , pp.
          <fpage>1292</fpage>
          -
          <lpage>1302</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>M.</given-names>
            <surname>Yatskar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Galley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Vanderwende</surname>
          </string-name>
          , L. Zettlemoyer,
          <article-title>See no evil, say no evil: Description generation from densely labeled images</article-title>
          ,
          <source>in: Proceedings of the Third Joint Conference on Lexical and Computational Semantics (* SEM</source>
          <year>2014</year>
          ),
          <year>2014</year>
          , pp.
          <fpage>110</fpage>
          -
          <lpage>120</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>R.</given-names>
            <surname>Likert</surname>
          </string-name>
          ,
          <article-title>A technique for the measurement of attitudes</article-title>
          ,
          <source>Archives of Psychology</source>
          <volume>140</volume>
          (
          <year>1932</year>
          )
          <fpage>40</fpage>
          -
          <lpage>55</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          , W.-J. Zhu,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>Rouge: A package for automatic evaluation of summaries</article-title>
          , in: Text summarization branches out,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Lawrence Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          , Cider:
          <article-title>Consensus-based image description evaluation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>4566</fpage>
          -
          <lpage>4575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <surname>S. by Saheel</surname>
          </string-name>
          ,
          <article-title>Baby talk: Understanding and generating image descriptions (????).</article-title>
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>C.</given-names>
            <surname>Callison-Burch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Osborne</surname>
          </string-name>
          , P. Koehn,
          <article-title>Re-evaluating the role of bleu in machine translation research, in: 11th conference of the european chapter of the association for computational linguistics</article-title>
          ,
          <year>2006</year>
          , pp.
          <fpage>249</fpage>
          -
          <lpage>256</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hodosh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Young</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Hockenmaier</surname>
          </string-name>
          ,
          <article-title>Framing image description as a ranking task: Data, models and evaluation metrics</article-title>
          ,
          <source>Journal of Artificial Intelligence Research</source>
          <volume>47</volume>
          (
          <year>2013</year>
          )
          <fpage>853</fpage>
          -
          <lpage>899</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>D.</given-names>
            <surname>Chan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Petryk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. E.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Darrell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Canny</surname>
          </string-name>
          ,
          <article-title>Clair: Evaluating image captions with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2310.12971</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>T.</given-names>
            <surname>Brown</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ryder</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Subbiah</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Kaplan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Dhariwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Neelakantan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Shyam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          , et al.,
          <article-title>Language models are few-shot learners</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>33</volume>
          (
          <year>2020</year>
          )
          <fpage>1877</fpage>
          -
          <lpage>1901</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>G.</given-names>
            <surname>Team</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Anil</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Borgeaud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <surname>J.-B. Alayrac</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          <string-name>
            <surname>Soricut</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          <string-name>
            <surname>Schalkwyk</surname>
            ,
            <given-names>A. M.</given-names>
          </string-name>
          <string-name>
            <surname>Dai</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          <string-name>
            <surname>Hauth</surname>
          </string-name>
          , et al.,
          <article-title>Gemini: a family of highly capable multimodal models</article-title>
          ,
          <source>arXiv preprint arXiv:2312.11805</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Liu</surname>
          </string-name>
          , Q.-L. Han,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <article-title>A brief overview of chatgpt: The history, status quo and potential future development</article-title>
          ,
          <source>IEEE/CAA Journal of Automatica Sinica</source>
          <volume>10</volume>
          (
          <year>2023</year>
          )
          <fpage>1122</fpage>
          -
          <lpage>1136</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>arXiv preprint arXiv:1706.03762</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>X.</given-names>
            <surname>Xiao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Xiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <article-title>Deep hierarchical encoder-decoder network for image captioning</article-title>
          ,
          <source>IEEE Transactions on Multimedia</source>
          <volume>21</volume>
          (
          <year>2019</year>
          )
          <fpage>2942</fpage>
          -
          <lpage>2956</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>O.</given-names>
            <surname>Vinyals</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <article-title>Show and tell: A neural image caption generator</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Jun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kweon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <article-title>T-net: Nested encoder-decoder architecture for the main vessel segmentation in coronary angiography</article-title>
          ,
          <source>Neural Networks</source>
          <volume>128</volume>
          (
          <year>2020</year>
          )
          <fpage>216</fpage>
          -
          <lpage>233</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alfadly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <article-title>Robustness analysis of visual qa models by basic questions</article-title>
          ,
          <source>VQA Challenge and Visual Dialog Workshop</source>
          , CVPR (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alfadly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <article-title>Vqabq: Visual question answering by basic questions</article-title>
          ,
          <source>VQA Challenge Workshop</source>
          , CVPR (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <article-title>Robustness analysis of visual question answering models by basic questions</article-title>
          , King Abdullah University of Science and Technology,
          <source>Master Thesis</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Dao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alfadly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <article-title>A novel framework for robustness analysis of visual qa models</article-title>
          ,
          <source>in: Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>8449</fpage>
          -
          <lpage>8456</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alfadly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Improving visual question answering models through robustness analysis and in-context learning with a chain of basic questions</article-title>
          ,
          <source>arXiv preprint arXiv:2304.03147</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Alfadly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ghanem</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Assessing the robustness of visual question answering</article-title>
          ,
          <source>arXiv preprint arXiv:1912.01452</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Morikawa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chang</surname>
          </string-name>
          , et al.,
          <article-title>Deepopht: medical report generation for retinal images via deep models and visual explanation</article-title>
          ,
          <source>in: WACV</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>2442</fpage>
          -
          <lpage>2452</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Shi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tegner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          , et al.,
          <article-title>Non-local attention improves description generation for retinal images</article-title>
          ,
          <source>in: WACV</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>1606</fpage>
          -
          <lpage>1615</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Contextualized keyword representations for multi-modal retinal image captioning</article-title>
          ,
          <source>in: ICMR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>645</fpage>
          -
          <lpage>652</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Deep context-encoding network for retinal image captioning</article-title>
          ,
          <source>in: ICIP</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>3762</fpage>
          -
          <lpage>3766</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-H. H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Longer version for "Deep context-encoding network for retinal image captioning"</article-title>
          ,
          <source>arXiv preprint arXiv:2105.14538</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Murn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Mrak</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Gpt2mvs: Generative pre-trained transformer-2 for multi-modal video summarization</article-title>
          ,
          <source>in: ICMR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>580</fpage>
          -
          <lpage>589</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Query-controllable video summarization</article-title>
          ,
          <source>in: ICMR</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>242</fpage>
          -
          <lpage>250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>T.-W.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-H.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Worring</surname>
          </string-name>
          ,
          <article-title>Expert-defined keywords improve interpretability of retinal image captioning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>1859</fpage>
          -
          <lpage>1868</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>J.</given-names>
            <surname>Mao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Toshev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Camburu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. L.</given-names>
            <surname>Yuille</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Murphy</surname>
          </string-name>
          ,
          <article-title>Generation and comprehension of unambiguous object descriptions</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>11</fpage>
          -
          <lpage>20</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bartz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Meinel</surname>
          </string-name>
          ,
          <article-title>Image captioning with deep bidirectional lstms</article-title>
          ,
          <source>in: Proceedings of the 24th ACM international conference on Multimedia</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>988</fpage>
          -
          <lpage>997</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>M.</given-names>
            <surname>Pedersoli</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lucas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Schmid</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Verbeek</surname>
          </string-name>
          ,
          <article-title>Areas of attention for image captioning</article-title>
          ,
          <source>in: Proceedings of the IEEE international conference on computer vision</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>1242</fpage>
          -
          <lpage>1250</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          <string-name>
            <given-names>J.</given-names>
            <surname>Laserson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. D.</given-names>
            <surname>Lantsman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cohen-Sfady</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Tamir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Goz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Brestel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Atar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Elnekave</surname>
          </string-name>
          ,
          <article-title>Textray: Mining clinical reports to gain a broad understanding of chest x-rays</article-title>
          ,
          <source>in: Medical Image Computing and Computer Assisted Intervention-MICCAI 2018: 21st International Conference, Granada, Spain, September 16-20, 2018, Proceedings, Part II 11</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>553</fpage>
          -
          <lpage>561</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39]
          <string-name>
            <given-names>B.</given-names>
            <surname>Jing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>On the automatic generation of medical imaging reports</article-title>
          ,
          <source>arXiv preprint arXiv:1711.08195</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          , et al.,
          <article-title>An image is worth 16x16 words: Transformers for image recognition at scale</article-title>
          ,
          <source>arXiv preprint arXiv:2010.11929</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Hybrid retrieval-generation reinforced agent for medical image report generation</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>31</volume>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>D. M.</given-names>
            <surname>Tierney</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. S.</given-names>
            <surname>Huelster</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Overgaard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. B.</given-names>
            <surname>Plunkett</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Boland</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>St Hill</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V. K.</given-names>
            <surname>Agboto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. F.</given-names>
            <surname>Mikel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. E.</given-names>
            <surname>Weise</surname>
          </string-name>
          , et al.,
          <article-title>Comparative performance of pulmonary ultrasound, chest radiograph, and ct among patients with acute respiratory failure</article-title>
          ,
          <source>Critical Care Medicine</source>
          <volume>48</volume>
          (
          <year>2020</year>
          )
          <fpage>151</fpage>
          -
          <lpage>157</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. C.</given-names>
            <surname>Frey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <article-title>Vit-v-net: Vision transformer for unsupervised volumetric medical image registration</article-title>
          ,
          <source>arXiv preprint arXiv:2104.06468</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>Meteor: An automatic metric for mt evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>P.</given-names>
            <surname>Anderson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Fernando</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gould</surname>
          </string-name>
          ,
          <article-title>Spice: Semantic propositional image caption evaluation</article-title>
          ,
          <source>in: Computer Vision-ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14</source>
          , Springer,
          <year>2016</year>
          , pp.
          <fpage>382</fpage>
          -
          <lpage>398</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>S.</given-names>
            <surname>Sharma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. E.</given-names>
            <surname>Asri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schulz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zumer</surname>
          </string-name>
          ,
          <article-title>Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation</article-title>
          ,
          <source>arXiv preprint arXiv:1706.09799</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>Bertscore: Evaluating text generation with bert</article-title>
          ,
          <source>arXiv preprint arXiv:1904.09675</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>Clipscore: A reference-free evaluation metric for image captioning</article-title>
          ,
          <source>arXiv preprint arXiv:2104.08718</source>
          (
          <year>2021</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>J.</given-names>
            <surname>Devlin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-W.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Toutanova</surname>
          </string-name>
          ,
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          ,
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          ,
          <source>OpenAI blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          <fpage>9</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>C.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-T.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Parekh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Pham</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Le</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.-H.</given-names>
            <surname>Sung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duerig</surname>
          </string-name>
          ,
          <article-title>Scaling up visual and vision-language representation learning with noisy text supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>4904</fpage>
          -
          <lpage>4916</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>InstructBLIP: Towards general-purpose vision-language models with instruction tuning</article-title>
          ,
          <source>arXiv preprint arXiv:2305.06500</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <article-title>Eva: Exploring the limits of masked visual representation learning at scale</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19358</fpage>
          -
          <lpage>19369</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. C. H.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>InstructBLIP: Towards general-purpose vision-language models with instruction tuning</article-title>
          ,
          <source>ArXiv abs/2305.06500</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:258615266.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Stoica</surname>
          </string-name>
          ,
          <article-title>Judging LLM-as-a-judge with MT-Bench and Chatbot Arena</article-title>
          ,
          <source>ArXiv abs/2306.05685</source>
          (
          <year>2023</year>
          ). URL: https://api.semanticscholar.org/CorpusID:259129398.
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>R.</given-names>
            <surname>Rombach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Blattmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lorenz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Esser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Ommer</surname>
          </string-name>
          ,
          <article-title>High-resolution image synthesis with latent diffusion models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>10684</fpage>
          -
          <lpage>10695</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Krueger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International Conference on Machine Learning</source>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.org/CorpusID:231591445.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>C.</given-names>
            <surname>Schuhmann</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Vencu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Beaumont</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Kaczmarczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Mullis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Katta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Coombes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jitsev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Komatsuzaki</surname>
          </string-name>
          ,
          <article-title>LAION-400M: Open dataset of CLIP-filtered 400 million image-text pairs</article-title>
          ,
          <source>ArXiv abs/2111.02114</source>
          (
          <year>2021</year>
          ). URL: https://api.semanticscholar.org/CorpusID:241033103.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>