<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Medical Image Interpretation with Large Multimodal Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Mahmudul Hoque</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Rakibul Hasan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md. Ismail Siddiqi Emon</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fahmi Khalifa</string-name>
          <email>fahmi.khalifa@morgan.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Md Mahmudur Rahman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Computer Science Department, Morgan State University</institution>
          ,
          <addr-line>1700 East Cold Spring Lane, Baltimore, Maryland 21251</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Electrical and Computer Engineering Department, School of Engineering, Morgan State University</institution>
          ,
          <addr-line>Baltimore MD 21251</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
<p>This working note documents the participation of CS_Morgan in the ImageCLEFmedical 2024 Caption subtasks, focusing on the Caption Prediction and Concept Detection challenges. The primary objectives included training, validating, and testing multimodal Artificial Intelligence (AI) models intended to automate the process of generating captions and identifying multi-concepts of radiology images. The dataset used is a subset of the Radiology Objects in COntext version 2 (ROCOv2) dataset and contains image-caption pairs and corresponding Unified Medical Language System (UMLS) concepts. To address the caption prediction challenge, different variants of the Large Language and Vision Assistant (LLaVA) models were experimented with, tailoring them for the medical domain. Additionally, a lightweight Large Multimodal Model (LMM) and MoonDream2, a small Vision Language Model (VLM), were explored. The former is the instruct variant of the Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS (IDEFICS) 9B obtained through quantization. Besides LMMs, conventional encoder-decoder models like Vision Generative Pre-trained Transformer 2 (visionGPT2) and Convolutional Neural Network-Transformer (CNN-Transformer) architectures were considered. Consequently, this enabled 10 submissions for the caption prediction task, with the first submission of LLaVA 1.6 on the Mistral 7B weights securing the 2nd position among the participants. This model was adapted using 40.1M parameters and achieved the best performance on the test data across the performance metrics of BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682), CIDEr (0.245029), and RefCLIPScore (0.815534). For the concept detection task, our single submission based on the ConvMixer architecture, a hybrid approach leveraging the advantages of CNNs and Transformers, ranked 9th with an F1-score of 0.107645. 
Overall, the evaluations on the test data for the caption prediction task submissions suggest that LMMs, quantized LMMs, and small VLMs, when adapted and selectively fine-tuned using fewer parameters, have ample potential for understanding medical concepts present in images.</p>
      </abstract>
      <kwd-group>
        <kwd>Large Multimodal Models</kwd>
        <kwd>Vision Language Models</kwd>
        <kwd>Transformer</kwd>
        <kwd>Large Language and Vision Assistant</kwd>
        <kwd>Caption Prediction</kwd>
        <kwd>Concept Detection</kwd>
        <kwd>Medical Images</kwd>
        <kwd>Low-Rank Adaptation</kwd>
        <kwd>Quantization</kwd>
        <kwd>Image-aware Decoder Enhanced à la Flamingo with Interleaved Cross-attentionS</kwd>
        <kwd>Vision Generative Pre-trained Transformer 2</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        The tasks of automatic caption generation and multi-label prediction from medical images have become
crucial for improving healthcare due to the growing availability of medical images from different
modalities like X-radiation (X-ray), Computed Tomography (CT), Positron Emission Tomography (PET),
Magnetic Resonance Imaging (MRI), and Ultrasound (US), as well as the significant advancements in
the computing power of modern graphics processing units [
        <xref ref-type="bibr" rid="ref1 ref2 ref3">1, 2, 3</xref>
        ]. The increasing need for diagnostic
radiology services and the lack of report writing expertise in many medical facilities highlight the need
for automating the mentioned tasks. As a result, extensive applications of recently developed AI models
have been found in these domains. As an active research area of AI, combining large language models
(LLMs) with vision capabilities allows users to explore emergent abilities using multimodal data, which
is being popularized as LMMs or VLMs [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. For example, LLaVA [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], Flamingo [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ], and Contrastive
Language-Image Pretraining (CLIP) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] have shown remarkable performance in various vision-text
tasks. Consequently, there is also potential for applying LLMs in the biomedical imaging field [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. These
models are trained on extensive databases of human knowledge, demonstrating remarkable capabilities
in offering valuable insights to physicians and healthcare professionals [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Utilizing knowledge from
millions to billions of training examples, VLMs can help detect minor abnormalities in low-resolution
radiology images that are difficult to spot with the naked eye [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Moreover, pre-trained LLMs like
ChatGPT-4 [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] exhibit emergent abilities on tasks they were not specifically trained for (i.e.,
the vision-language domain) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Models like BiomedCLIP [
        <xref ref-type="bibr" rid="ref13 ref14">13, 14</xref>
        ], ChatDoctor [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], and GatorTron [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], which
are pretrained on high-quality medical datasets, offer more useful applications for medical domain users.
In this working note, various multimodal models, initially pretrained on multimodal image-instruction
pairs from diverse sources, are demonstrated. This approach allowed for attaining competitive
results in this competition on analyzing medical images such as brain MRI, chest X-ray, and PET.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Objectives</title>
      <p>
        For the ImageCLEFmedical Caption 2024 [
        <xref ref-type="bibr" rid="ref17">17</xref>
] challenge, CS_Morgan, a participant in the competition,
was tasked with developing solutions to automatically predict captions and identify multi-label concepts
of radiology images from the ROCOv2 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ] dataset. Considering the tasks, the objectives include the
following:
• Concept Detection [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: This task involved identifying and locating relevant concepts in the
specified dataset. This formed the foundation for scene understanding and was essential for
context-based image and information retrieval. The evaluation process was conducted using
metrics like F1-score.
• Caption Prediction [
        <xref ref-type="bibr" rid="ref19">19</xref>
        ]: This task focused on predicting coherent captions for the entire image
test dataset using the detected concepts and their interactions within the image. This task provided
insights into the interplay of visual elements. Evaluation metrics used for this task consisted
of BERTScore (as a primary approach), ROUGE (as a secondary approach), BLEU-1, BLEURT,
METEOR, CIDEr, CLIPScore, RefCLIPScore, ClinicalBLEURT, and MedBERTScore.
      </p>
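The F1-score used as the primary concept detection metric can be computed from per-image true positives, false positives, and false negatives. A minimal sketch with hypothetical counts (not challenge data):

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall, as used for multi-label concept detection."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Hypothetical example: 2 concepts correctly predicted, 1 spurious, 3 missed
print(f1_score(tp=2, fp=1, fn=3))  # 0.5
```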
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        The dataset for both tasks included curated images from ROCOv2 [
        <xref ref-type="bibr" rid="ref18">18</xref>
        ], an updated version of the original
ROCO [
        <xref ref-type="bibr" rid="ref20">20</xref>
] dataset. The medical images were collected from biomedical articles in the PMC Open Access subset
and were accompanied by corresponding captions and concepts. The latter were also expressed using
UMLS [
        <xref ref-type="bibr" rid="ref21">21</xref>
        ] terms. The training, validation, and test sets contained 70,108, 9,972, and 17,237 radiology
images, respectively, with the average dimensions of the images being 600 × 600. As a result, for the deep
learning models implemented here, the images were resized to that average dimension, and the smaller
images were padded to have a uniform distribution of image dimensions. Furthermore, the length
of captions in words (excluding punctuation) or tokens for each image was 100 or fewer on average.
Moreover, by analyzing both training and validation image-caption pairs, 42,121 unique words (excluding
punctuation) were found and used as the set of vocabulary in the models implemented. Additionally,
there were 1,944 unique CUIs found in the concept list of the train and validation images, among which
1,934 were enlisted in the CUI mapping file.
      </p>
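The padding of smaller images up to the 600 × 600 target can be sketched as follows. This is an illustrative NumPy version; centered zero-padding is an assumption, since the exact padding scheme is not specified in the text:

```python
import numpy as np

def pad_to_target(img: np.ndarray, target: int = 600) -> np.ndarray:
    """Zero-pad an H x W x C image so both spatial dimensions reach `target`.
    Larger images would be resized to the target upstream."""
    h, w, c = img.shape
    canvas = np.zeros((target, target, c), dtype=img.dtype)
    top, left = (target - h) // 2, (target - w) // 2   # centered placement (assumed)
    canvas[top:top + h, left:left + w] = img
    return canvas

small = np.ones((512, 400, 3), dtype=np.uint8)
padded = pad_to_target(small)
print(padded.shape)  # (600, 600, 3)
```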
    </sec>
    <sec id="sec-4">
      <title>4. Large Multimodal Models (LMMs)</title>
      <p>
        LMMs, as an extended variation of LLMs, mark a major leap forward in AI by handling and comprehending
various data types, including text, images, audio, and video [
        <xref ref-type="bibr" rid="ref22 ref23 ref24">22, 23, 24</xref>
        ]. By integrating and interpreting
information from these diverse sources, LMMs achieve a holistic understanding of complex data [
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
        ].
This capability allows them to perform sophisticated tasks, such as image captioning, visual question
answering, and content recommendation, by leveraging the relationships between diferent data types
[
        <xref ref-type="bibr" rid="ref22 ref23">22, 23</xref>
]. Figure 1 demonstrates the theoretical architecture of LMMs.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Pre-training and Fine-tuning of LMMs</title>
        <p>
          During pre-training, the model is initially trained on vast and diverse datasets, enabling it to learn
general representations before being fine-tuned for specific tasks. This involves utilizing large-scale
datasets that include various modalities [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ]. For instance, models like ViLBERT [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] have been
pre-trained on extensive image-text pairs to increase their performance in downstream tasks like image
captioning and visual question answering [
          <xref ref-type="bibr" rid="ref25 ref26">25, 26</xref>
          ].
        </p>
        <p>
          Fine-tuning LMMs involves adjusting all pre-trained model parameters to enhance performance
on specific tasks, such as image captioning. This process is computationally intensive and
resource-demanding, especially for models with billions of parameters. Despite these challenges, the full
fine-tuning technique remains popular due to its potential for achieving high accuracy. For instance, models
like BLIP-2 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] and InstructBLIP [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] have demonstrated enhancements in image captioning tasks
through full fine-tuning, utilizing their extensive pre-training on large datasets to adapt to specific tasks.
However, the substantial computational and memory requirements make full fine-tuning impractical
for many applications, leading to the exploration of more efficient fine-tuning methods.
        </p>
        <p>
          As a result, Parameter-Efficient Fine-Tuning (PEFT) [
          <xref ref-type="bibr" rid="ref30 ref31">30, 31</xref>
          ] presents a more efficient approach
compared to full fine-tuning by modifying only a small portion of the model’s parameters while leaving
the majority unchanged. This strategy substantially decreases computational and memory demands,
making it suitable for a variety of applications. In the domain of image captioning, PEFT techniques have
proven effective with models such as mPLUG [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ] and LLaVA [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. Notably, approaches like Low-Rank
Adaptation (LoRA) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] have been particularly successful in fine-tuning. LoRA optimizes a matrix of
updates to the pre-trained model weights rather than directly modifying them. This update matrix
is decomposed into two smaller, lower-rank matrices, reducing the number of parameters that need
updating while preserving the original weights [
          <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
          ]. This allows different task-specific LoRAs to
be easily swapped, effectively tailoring the pre-trained model for various applications. LoRA matches
the performance of the full fine-tuning technique by updating a small number of additional weights,
preventing catastrophic forgetting, and enabling better generalization with limited data [
          <xref ref-type="bibr" rid="ref33 ref34">33, 34</xref>
          ]. Figure
2 compares the approaches of LoRA and linear projection techniques.
        </p>
        <p>
          Figure 2 indicates that the LoRA approach involves two matrices, A and B. The matrix A is the first
step in the adaptation process, projecting high-dimensional input features into a lower-dimensional
latent space. Typically, its shape includes two values: the rank and the original dimension (e.g., 32 and 4096).
The matrix B is the second component, mapping the lower-dimensional features back to the original
high-dimensional space, effectively reversing the reduction performed by the matrix A, so its shape
becomes [4096, 32]. Both the A and B matrices are trainable and updated during fine-tuning. LoRA
focuses on specific weight matrices within the model, for example, the query, key, and value matrices
in Transformer [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ] architectures. However, traditional Transformers are hindered by their slow
performance and high memory consumption, particularly with long sequences, due to the quadratic
time and memory complexity of self-attention. Flash Attention [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] addresses these issues with an
IO-aware exact attention algorithm that utilizes tiling to reduce the number of memory reads and writes
between the GPU’s high-bandwidth memory (HBM) and on-chip SRAM.
        </p>
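The update described above, a frozen weight W plus a trainable low-rank product BA, can be illustrated with a small NumPy sketch. The shapes follow the [32, 4096] and [4096, 32] example in the text; the initialization conventions (random A, zero B) are the usual LoRA defaults and are assumptions here:

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 4096, 32                          # original dimension and LoRA rank from the text

W = rng.standard_normal((d, d))          # frozen pre-trained weight matrix
A = rng.standard_normal((r, d)) * 0.01   # down-projection, shape [32, 4096]
B = np.zeros((d, r))                     # up-projection, shape [4096, 32]

def lora_forward(x):
    # W stays frozen; only A and B receive gradient updates during fine-tuning
    return x @ W.T + (x @ A.T) @ B.T

x = rng.standard_normal((1, d))
# With B initialized to zeros, the adapted layer starts identical to the original
assert np.allclose(lora_forward(x), x @ W.T)
```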
        <p>
          Visual instruction tuning [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] enhances LMMs by fine-tuning them with instructions that combine
visual and textual data. This technique uses machine-generated instruction-following data to improve
the model’s zero-shot and few-shot performance on new tasks. For example, the LLaVA [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ] model
integrates a vision encoder with an LLM for general-purpose visual and language understanding. The
process involves generating detailed, context-aware language-image instructions using a language-only
model like GPT-4 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. This data is then used to train the LMM, enabling it to perform tasks such as
image captioning, visual question answering, and detailed image descriptions.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Large Language and Vision Assistant (LLaVA)</title>
        <p>
          LLaVA [
          <xref ref-type="bibr" rid="ref37 ref38">37, 38</xref>
          ] stands as a comprehensive, end-to-end trained multimodal model that seamlessly merges
a vision encoder and an LLM to facilitate broad-ranging visual and language comprehension (see Figure
3). The vision encoder is tasked with processing input images (X_v) and transforming them into a series
of feature representations (Z_v). Situated above the vision encoder is the projection (W), functioning as
a vital conduit between the vision encoder and the language model. The projection matrix facilitates the
conversion of feature representations (Z_v) from the vision encoder into a compatible format (H_v) for
the language model. On the right side of the diagram, the Language Instruction input (X_q) represents
the textual component that the model must comprehend and respond to in conjunction with the visual
input. This input undergoes processing by the language model, generating its own set of feature
representations (H_q). The Language Model (f_φ) (e.g., Vicuna 7B [
          <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
          ] or Mistral 7B [
          <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
          ] in
this working note) ingests both the projected vision features (H_v) and the language features (H_q),
seamlessly integrating them to produce a Language Response (X_a). The resulting output constitutes a
coherent response incorporating elements from both visual and textual inputs. Figure 3 shows the basic
architecture of LLaVA and demonstrates its working principles.
        </p>
        <sec id="sec-4-2-1">
          <title>4.2.1. LLaVA-v.1.6-Vicuna-7B</title>
          <p>
            The Vicuna 7B [
            <xref ref-type="bibr" rid="ref39 ref40">39, 40</xref>
            ] language model components include (see Figure 4): (a) Embedding Layer - converts input tokens into dense vectors with an embedding dimension of 4,096; (b) Decoder Layers - consist of 32 LLaMA-based decoder layer instances, where each layer includes a self-attention mechanism, a Multi-layer Perceptron (MLP) using Sigmoid Linear Unit (SiLU) activation, and Root Mean Square (RMS) normalization layers applied before and after the attention mechanism; and (c) Final Normalization Layer - an RMS normalization layer applied to the final output of the decoder layers. The model supports input image resolutions of 672 × 672, 336 × 1344, and 1344 × 336, enhancing visual detail comprehension.
          </p>
        </sec>
        <sec id="sec-4-2-2">
          <title>4.2.2. LLaVA-v.1.6-Mistral-7B</title>
          <p>
            The LLaVA v.1.6 Mistral 7B [
            <xref ref-type="bibr" rid="ref43">43</xref>
            ] model integrates several key components for its functionality (see Figure 5). At its core is the vision encoder, utilizing a pre-trained CLIP ViT-L/14 [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ] to extract visual embeddings from high-resolution images. This encoder processes visual input, converting it into a format compatible with the language model. The language model itself is based on the Mistral-7B architecture, which inherently incorporates advanced features like Sliding Window Attention and Grouped-Query Attention, enhancing its capability to manage long sequences and improving inference efficiency [
            <xref ref-type="bibr" rid="ref41 ref42">41, 42</xref>
            ]. Additionally, a two-layer MLP projection matrix is employed to map the visual embeddings from the vision encoder into the same embedding space as the language model, ensuring seamless integration of visual and textual information. The CLIP ViT-L/14 [
            <xref ref-type="bibr" rid="ref44">44</xref>
            ], a Vision Transformer (ViT) with a 14 × 14 pixel patch size, is renowned for its ability to handle complex visual tasks, contributing to the model's overall performance.
          </p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Caption Prediction Task</title>
      <p>
        To address the caption prediction task, the CS_Morgan team fine-tuned several LMMs that were
pretrained using extensive standard datasets from the field of computer vision. These models were derived
from well-known LLMs commonly utilized in Natural Language Processing (NLP). Ten submissions
were made, and the technical details, methods, and approaches of these submissions are detailed in the
following sections. Moreover, the reproducible code relevant to the following submissions can be found
here [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ].
      </p>
      <p>Before any tasks are performed, the dataset is pre-processed to ensure that it is clean and correctly
formatted. Beyond the initial image-text pre-processing described earlier, the training, validation,
and testing datasets were structured for generating captions to meet the input requirements of the
corresponding vision-language models. Furthermore, the dataset was managed using the Hugging Face
Hub. Computational details can be found in Appendix A.</p>
      <sec id="sec-5-1">
        <title>5.1. Submission 1: Selective fine-tuning of LLaVA-v.1.6-Mistral-7B</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Model Description</title>
          <p>For this submission, the pre-trained LLaVA 1.6 on Mistral 7B weights was loaded using
Mistral-7B-Instruct-v.0.2 as the base LLM, and Flash Attention was used to optimize attention mechanism
computations. To enhance training stability, all float16 instances of the Vision Tower model were replaced with
bfloat16. Additionally, prompts were set up by combining images and texts using the "mistral_instruct"
conversation mode.</p>
          <p>For eficient fine-tuning, LoRA was applied to specific layers, configuring it with a rank r = 16, an
alpha (lora_alpha) of 32, and a dropout rate of 0.05. The query, key, and value projection layers
in the self-attention mechanisms of the Mistral Decoder Layer, as well as the projection layers in the
MLP, were specifically targeted. In the vision model, LoRA was applied to the linear projection layers
within the self-attention mechanism (CLIP attention) of the encoder layers in the CLIP encoder. This
resulted in 40,108,032 trainable parameters, about 0.527% of the model’s total parameters. The LoRA
components included lora_A, lora_B, and lora_dropout representing the low-rank projection
to a smaller dimension, projection back to the original dimension, and a parameter to prevent overfitting,
respectively.</p>
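One way to sanity-check the reported figure is to re-derive it from the layer dimensions of the adapted architectures. The hidden sizes and layer counts below are the standard published Mistral-7B and CLIP ViT-L values and are assumptions on our part; each adapted linear layer of shape d_in × d_out contributes r × (d_in + d_out) LoRA parameters:

```python
r = 16

def lora_params(d_in, d_out, rank=r):
    # LoRA adds A: [rank, d_in] and B: [d_out, rank] per adapted linear layer
    return rank * (d_in + d_out)

# Mistral-7B decoder: 32 layers, hidden 4096, grouped-query k/v dim 1024, MLP dim 14336
per_lm_layer = (lora_params(4096, 4096)       # q_proj
                + lora_params(4096, 1024)     # k_proj
                + lora_params(4096, 1024)     # v_proj
                + lora_params(4096, 14336)    # gate_proj
                + lora_params(4096, 14336)    # up_proj
                + lora_params(14336, 4096))   # down_proj

# CLIP ViT-L vision tower: 24 layers, hidden 1024; q/k/v projections adapted
per_vision_layer = 3 * lora_params(1024, 1024)

total = 32 * per_lm_layer + 24 * per_vision_layer
print(total)  # 40108032, matching the reported 40,108,032
```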
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Training Process</title>
          <p>
            The training process involved setting up a Data Loader for the dataset, ensuring images and text inputs
were properly loaded. Custom callbacks were defined for printing the best checkpoint and implementing
early stopping. Key training parameters included a learning rate of 1e-4, bfloat16 precision, and the
AdamW [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] optimizer. Each device processed batches of 4, with gradient accumulation steps of 8.
Evaluations and saves were performed every 1,095 steps, with the training capped at 21,900 steps (10
epochs). Early stopping was set with a patience of 5 steps and a threshold of 0.01, monitoring evaluation
loss (where lower values are better). Training was halted at 9,855 steps, and the best model, saved
at 4,380 steps, was reloaded at the end. For evaluation, caption generation was configured with a
temperature of 1.0, a beam width of 1, and a maximum of 512 new tokens. Figure 6 depicts the training
and validation loss over the steps.
          </p>
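The step counts above are consistent with simple arithmetic on the batch settings, assuming a single device and drop-last batching:

```python
train_size = 70_108                                # training images (Section 3)
per_device_batch, grad_accum = 4, 8

effective_batch = per_device_batch * grad_accum    # 32 samples per optimizer step
steps_per_epoch = train_size // effective_batch    # 2,190 steps per epoch
print(effective_batch, steps_per_epoch, 10 * steps_per_epoch)  # 32 2190 21900
```

This also explains the 1,095-step evaluation interval: exactly half an epoch.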
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>5.2. Submission 2: Additional fine-tuning of LLaVA-v.1.6-Mistral-7B Model</title>
        <sec id="sec-5-2-1">
          <title>5.2.1. Model Description</title>
          <p>The second submission was built upon the first one by fine-tuning a larger portion of the model
using the same pattern. This included an expanded application of LoRA to improve utilization of
the model’s capacity for more accurate and robust predictions. The fine-tuning involved additional
layers to enhance the learning and improve visual-textual alignment. Specifically, output projection
layers such as o_proj in the Mistral Decoder Layer’s self-attention mechanism and out_proj in
the vision model were included to better capture complex relationships within the data, which is
essential for tasks like image captioning. Targeting multimodal projector layers (mm_projector.0
and mm_projector.2) enhanced the alignment of visual and textual representations, which is crucial
for multimodal tasks. Despite the increased number of trainable parameters (98,467,840 compared to
40,108,032), this expansion represented only a small fraction (1.285%) of the total model parameters,
maintaining parameter efficiency while improving learning capabilities. LoRA was configured with a
rank r = 32, lora_alpha of 32, and a dropout rate of 0.05. Various layers were targeted in the Mistral
Decoder Layers, including query projection (q_proj), key projection (k_proj), value projection
(v_proj), and output projection (o_proj) in the self-attention mechanism, as well as gate projection
(gate_proj), up projection (up_proj), and down projection (down_proj) in the MLP components.
In the CLIP Vision Model, LoRA was applied to similar projection layers in the attention
mechanism and fully connected layers (fc1 and fc2) in the MLP. Additionally, the multimodal projector
layers (mm_projector.0 and mm_projector.2) were included to further enhance the model’s
capabilities. These modifications were applied to the LLaVA-v.1.6 model and its pre-trained checkpoints
on the Mistral-7B.</p>
        </sec>
        <sec id="sec-5-2-2">
          <title>5.2.2. Training Process</title>
          <p>
            The training configuration included a learning rate of 1e-5, using the AdamW [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] optimizer, bfloat16
precision, and Flash Attention enabled. Each device handled a batch size of 4, with gradient accumulation
steps set to 8. The model underwent training for a maximum of 8,760 steps (4 epochs), with checkpoints
and evaluations performed every 548 steps. Early stopping parameters were defined with a patience of 4
and a threshold of 0.01, monitoring the evaluation loss to select the best model, with lower values being
preferable. Training was halted early at 3,836 steps, and the model saved at this point was considered
the best and subsequently loaded. For evaluation, specifically for generating captions, parameters were
set with a temperature of 1.0, beam width of 1, and a maximum of 100 new tokens.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-3">
        <title>5.3. Submission 3: Hybrid fine-tuning of LLaVA-v.1.6-Mistral-7B</title>
        <p>This submission was built on the previous one, maintaining the same general pattern but altering which
layers were fine-tuned and the fine-tuning strategy itself. The fine-tuning strategy was multifaceted,
employing LoRA to adapt key components, such as attention mechanism projections, MLP
components, and multimodal projector layers. Additionally, the language model’s head (lm_head) and the
embedding tokens (embed_tokens) were explicitly set as trainable parameters to further enable
these parts of the model to learn and adapt to the task. This hybrid approach leveraged the strengths
of both LoRA adapters and traditional fine-tuning. Fine-tuning the lm_head allowed the model to
better tailor its output generation to specific tasks or datasets, which was particularly important for
generating appropriate language or captions from medical images. On the other hand, fine-tuning
the embed_tokens layer helped the model learn better representations of input tokens, improving
overall performance, especially when the input data distribution differs from the pre-training data.</p>
        <p>In this configuration, LoRA was set with a rank r = 32, and the lora_alpha was calculated as
32 × √32 to stabilize training and enhance low-rank adaptation performance. This scaling factor
normalized the learning rate for LoRA parameters based on rank, ensuring effective updates without
causing gradient explosion or vanishing gradients. A dropout rate of 0.05 was applied to prevent
overfitting and maintain generalization ability.</p>
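The stated alpha of 32 × √32 makes the standard LoRA scaling factor alpha/r equal to 32/√r, i.e., the rank-stabilized scaling convention (our interpretation of the described normalization). A quick check:

```python
import math

r = 32
lora_alpha = 32 * math.sqrt(32)      # as described in the text, about 181.02
scaling = lora_alpha / r             # standard LoRA scaling: alpha / r

# Equivalent to rank-stabilized scaling alpha' / sqrt(r) with alpha' = 32
assert math.isclose(scaling, 32 / math.sqrt(r))
print(round(scaling, 3))  # 5.657
```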
        <p>For layers explicitly set as trainable, the lm_head was a linear layer that mapped hidden states
to the vocabulary space, generating the final output logits for each token. This layer was crucial for
the model’s text generation capability. The embed_tokens layer converted input token indices into
dense vectors, providing initial representations of the input tokens essential for the model to process
the input text. Both the lm_head and embed_tokens layers had their full weights fine-tuned, in
addition to the LoRA adapters.</p>
        <p>Overall, this hybrid fine-tuning approach combined LoRA fine-tuning for attention, MLP, and
multimodal projection layers with full weight fine-tuning of the lm_head and embed_tokens layers.
The total number of trainable parameters was 350,650,368 out of 7,654,729,728 total parameters, making
up 4.581% of the parameters.</p>
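The quoted trainable fraction follows directly from the two parameter counts:

```python
trainable, total = 350_650_368, 7_654_729_728
pct = 100 * trainable / total
print(f"{pct:.3f}%")  # 4.581%, matching the reported figure
```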
        <sec id="sec-5-3-1">
          <title>5.3.1. Training Process</title>
          <p>
            The training arguments included a learning rate of 1e-5, the AdamW [
            <xref ref-type="bibr" rid="ref46">46</xref>
            ] optimizer, bfloat16 precision,
Flash Attention, per-device batch sizes of 4, and gradient accumulation steps of 8. The model was
trained for a maximum of 6,570 steps (3 epochs), with checkpoints and evaluations performed every
548 steps. Gradient checkpointing was enabled using a re-entrant approach to reduce memory usage.
Early stopping was configured with a patience of 3 and a threshold of 0.01, monitoring evaluation loss
(with lower values being better). Early stopping was triggered at 3,836 steps, at which point the best
model was saved and later loaded. For evaluation and caption generation, the parameters were set to a
temperature of 1.0, num_beams of 1, and max_new_tokens of 100.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-4a">
        <title>5.4. Submission 4: Fine-tuning of LLaVA-v.1.6-Vicuna-7B</title>
        <sec id="sec-5-4-1">
          <title>5.4.1. Model Description</title>
          <p>For this submission, the pre-trained multimodal language model on checkpoints of LLaVA v.1.6 Vicuna
7B was loaded, which used lmsys/vicuna-7b-v1.5 as its base LLM. The model preparation involved
configuring LoRA with a rank (r) of 16, a lora_alpha of 32, and a dropout rate of 0.05. The
target modules for LoRA were expanded to include the query (q_proj), key (k_proj), and value
(v_proj) projections within the self-attention mechanism of the LLaMA Decoder Layer, as well as
the gate (gate_proj), up (up_proj), and down (down_proj) projections in the MLP components
of the same layer. Additionally, in the CLIP Vision Model’s CLIP Encoder layers, the key (k_proj),
value (v_proj), and query (q_proj) projections, along with the first (fc1) and second (fc2) fully
connected layers of the CLIP MLP, were targeted. Furthermore, the multimodal projector layers
(mm_projector.0 and mm_projector.2) were included. This expanded application of LoRA
resulted in 34,422,784 trainable parameters out of a total of 7,097,329,664 parameters, constituting
approximately 0.485% of the model’s parameters.</p>
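<p>In PEFT terms, the target-module selection described above might look like the following sketch. The module names follow the text; this is not the authors' exact code, and name-based matching in peft applies each entry wherever it occurs (here, both the LLaMA decoder and the CLIP encoder).</p>
<preformat>
```python
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=[
        # Self-attention projections (LLaMA decoder and CLIP encoder)
        "q_proj", "k_proj", "v_proj",
        # LLaMA MLP projections
        "gate_proj", "up_proj", "down_proj",
        # CLIP MLP fully connected layers
        "fc1", "fc2",
        # Multimodal projector layers
        "mm_projector.0", "mm_projector.2",
    ],
)
```
</preformat>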
        </sec>
        <sec id="sec-5-3-3">
          <title>5.4.2. Training Process</title>
          <p>The training process involved setting up a Data Loader for the dataset and inspecting batches to
ensure correct loading of images and text inputs. Custom callbacks were created for printing the best
checkpoint and enabling early stopping. The training used a learning rate of 1e-4, bfloat16 precision,
Flash Attention, the AdamW optimizer, batch sizes of 4 per device, and gradient accumulation steps of 8,
with evaluation and save steps every 548 steps. The model was trained for a maximum of 10,950 steps (5
epochs), with early stopping configured with a patience of 3 and a threshold of 0.01. The evaluation loss
was monitored to select the best model, with lower values being preferable. Early stopping occurred at
4,932 steps, and the best model, saved at 4,384 steps, was loaded at the end. For generating captions
during evaluation, parameters included a temperature of 1.0, num_beams set to 1, and a maximum of
512 new tokens.</p>
        </sec>
      </sec>
      <sec id="sec-5-4">
        <title>5.5. Submission 5: Hybrid Fine-tuning of LLaVA-v.1.6-Vicuna-7B</title>
        <p>For this submission, the same approach as in the third submission was followed; the only difference was
the use of the Vicuna LLM. The total number of trainable parameters was 346,718,208 out of
7,147,481,088 total parameters (4.851% of the parameters). The training process was similar to
that of the previous submission, except that the maximum token limit was set to 150 for
this submission. The model was trained for a maximum of 10,950 steps (5 epochs), with early stopping
configured with a patience of 5 and a threshold of 0.01. Early stopping occurred at 6,576 steps, and the
best model, saved at 4,384 steps, was loaded for evaluation.</p>
      </sec>
      <sec id="sec-5-5">
        <title>5.6. Submission 6: Selective Fine-tuning of LLaVA-v.1.5-7B</title>
        <p>The LLaVA 1.5 7B shares a similar architecture with that of LLaVA-v.1.6 Vicuna-7B. LLaVA 1.5
checkpoints on 7B parameters were loaded, and the expanded use of LoRA resulted in 84,574,208 trainable
parameters out of a total of 7,147,476,992, constituting approximately 1.183% of the model’s
parameters. Precision was adjusted from float16 to bfloat16 to enhance computational efficiency, and Flash
attention was not enabled in this submission. Instead, LLaMA Scaled Dot-Product Attention (SDPA)
was utilized in the 32 layers of the LLaMA Decoder Layer. LoRA was configured with a rank of 32,
lora_alpha of 32, and a dropout rate of 0.05. Target modules for LoRA included various projections
in LLaMA Decoder Layer, MLP components, and attention mechanisms within CLIP Vision Model.
The training process involved creating a Data Loader, defining custom callbacks for early stopping
and checkpoint printing, and setting training arguments such as a learning rate of 1e-5, and AdamW
optimizer. Training was conducted for a maximum of 8760 steps with early stopping triggered at 4,672
steps, saving the best model. For evaluation, parameters included temperature = 1.0, num_beams = 1,
and max_new_tokens = 100.</p>
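<p>SDPA here refers to PyTorch's fused attention primitive, which computes softmax(QK^T/sqrt(d))V in a single call; a minimal illustration (shapes are arbitrary):</p>
<preformat>
```python
import torch
import torch.nn.functional as F

# SDPA fuses the attention computation; this is the primitive that
# replaces Flash Attention in this submission.
q = torch.randn(1, 8, 16, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 8, 16, 64])
```
</preformat>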
      </sec>
      <sec id="sec-5-6">
        <title>5.7. Submission 7: Adaptation of MoonDream2</title>
        <sec id="sec-5-6-1">
          <title>5.7.1. Model Description</title>
          <p>
Moondream2, a small vision language model designed for efficient operation on edge devices, was
evaluated on the ImageCLEF 2024 dataset using pre-trained weights from Huggingface [
            <xref ref-type="bibr" rid="ref47 ref48">47, 48</xref>
            ]. These
weights were initialized from Sigmoid Loss for Language-Image Pre-Training (SigLIP) and Phi-1.5
models. Phi-1.5 [
            <xref ref-type="bibr" rid="ref49">49</xref>
            ], developed by Microsoft Research, is a compact Transformer-based language model
with 24 layers, 32 heads (each with a dimension of 64), rotary embeddings, a rotary dimension of 32, a
context length of 2,048, and flash-attention. SigLIP [
            <xref ref-type="bibr" rid="ref50">50</xref>
            ], an enhancement of the CLIP model, replaces the
softmax loss with a pairwise sigmoid loss, operating on image-text pairs without global normalization.
SigLIP’s architecture includes a ViT [
            <xref ref-type="bibr" rid="ref51">51</xref>
            ] backbone that processes image patches through a transformer
encoder and a classification head with an MLP using Gaussian Error Linear Unit (GELU) activation for
final predictions. Moreover, the pre-processing included resizing, type conversion, and normalization.
This architecture effectively combined visual and textual processing for caption generation.
          </p>
          <p>LoRA was configured with an alpha (lora_alpha) of 32, which scales the updates from the
low-rank matrices, and a rank (lora_rank) of 64 for the adaptation process. It was applied to specific
linear layers in both the vision encoder and the text model. In the vision encoder, LoRA targeted the
projection layers (proj), and fully connected layers (fc1 and fc2) within the 27 ViTBlock components.
Additionally, LoRA was applied to the fc1 and fc2 layers in the multimodal projection layer, a custom
module integrated to adapt the projection layer for the purpose of the caption prediction challenge.
In the language model, LoRA targeted the Wqkv, out_proj, fc1, and fc2 layers within the 24 Phi
Decoder Layer components. Wqkv in the Phi Decoder Layer represents the combined weights for the
self-attention mechanism’s linear projections (query, key, and value). With LoRA applied, the model
had 74,422,272 trainable parameters, which was about 3.850% of the total parameters (1,931,904,880).</p>
        </sec>
        <sec id="sec-5-6-2">
          <title>5.7.2. Training Process</title>
          <p>The training process employed various key parameters and strategies to optimize the model’s
performance. The number of image tokens was set to 729, aligning with text tokens. Training spanned 10
epochs over 40,000 steps, using a batch size of 8 and gradient accumulation steps of 4, with evaluation
after each epoch. An early stopping mechanism with a patience of 6 epochs and a minimum delta of
0.0001 monitored validation loss to prevent overfitting. Data loading and batching utilized PyTorch’s
DataLoader with custom collation for images and text tokens, pre-processed and padded for uniform
sequence lengths. Gradient accumulation steps set to 4 simulated a larger batch size for better GPU
memory management. The Adam8bit optimizer from the bitsandbytes library, with a dynamic learning
rate adjusted via a cosine schedule, was used. Loss computation combined image and text embeddings,
processed by the Phi language model. The training loop iterated over epochs and batches, updating
parameters post-gradient accumulation and checking validation loss for early stopping. LoRA parameters
were optimized with an initial learning rate of 3e-6, scaled by a factor of 4, balancing exploration and
convergence. This approach, along with gradient accumulation, enhanced resource use and fine-tuning
efficiency.</p>
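<p>The cosine learning-rate schedule mentioned above can be sketched in plain Python. The minimum learning rate of 0 is an assumption; the text specifies only an initial rate of 3e-6 scaled by a factor of 4.</p>
<preformat>
```python
import math

def cosine_lr(step: int, total_steps: int, lr_max: float,
              lr_min: float = 0.0) -> float:
    """Cosine-annealed learning rate, decaying from lr_max to lr_min."""
    progress = step / total_steps
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

base_lr = 3e-6 * 4  # initial LoRA rate scaled by a factor of 4, per the text
print(cosine_lr(0, 40_000, base_lr))       # full rate at the start
print(cosine_lr(40_000, 40_000, base_lr))  # decays to ~0 at the end
```
</preformat>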
        </sec>
        <sec id="sec-5-6-3">
          <title>5.8.1. Model Description</title>
          <p>
            IDEFICS 9B Instruct [
            <xref ref-type="bibr" rid="ref52">52</xref>
            ] is an advanced multimodal model developed by Hugging Face for integrated
image and text processing tasks. The model combines the vision model CLIP ViT-H/14 [
            <xref ref-type="bibr" rid="ref53">53</xref>
            ] and the
language model LLaMA 7B [
            <xref ref-type="bibr" rid="ref54">54</xref>
            ], incorporating novel transformer blocks to connect these modalities.
Trained on extensive datasets, including OBELICS, Wikipedia, LAION, and PMD, the IDEFICS 9B
Instruct variant is fine-tuned on supervised and instruction datasets.
          </p>
          <p>The lightweight IDEFICS 9B Instruct variant was explored using 4-bit quantization to reduce model
size and computational requirements while maintaining performance. BitsAndBytes (BnB) quantization
assigns 4-bit precision to the model using double quantization with the normalized floating-point
format (NF4) and bfloat16 precision for computations, crucial for running large language models
on smaller devices. For fine-tuning IDEFICS 9B Instruct on the ImageCLEF dataset, the checkpoint
HuggingFaceM4/idefics-9b-instruct was specified to load the pre-trained model with 4-bit quantization
using the BitsAndBytesConfig class.</p>
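<p>The quantized loading described above corresponds to the standard BitsAndBytesConfig pattern; this sketch shows only the settings named in the text.</p>
<preformat>
```python
import torch
from transformers import BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # 4-bit weights
    bnb_4bit_quant_type="nf4",             # normalized floating-point format
    bnb_4bit_use_double_quant=True,        # double quantization
    bnb_4bit_compute_dtype=torch.bfloat16, # bfloat16 for computations
)
# Passed as quantization_config to from_pretrained for
# HuggingFaceM4/idefics-9b-instruct.
```
</preformat>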
          <p>LoRA was applied to the query projection (q_proj), key projection (k_proj), and value projection
(v_proj) in both the ViT and decoder layers, as well as the perceiver attention and gated cross-attention
layers. However, the output projection (o_proj and out_proj) in the decoder, gated cross-attention,
and perceiver attention layers did not use LoRA but remained as standard Linear4bit layers. This
selective application of LoRA allowed for efficient fine-tuning by reducing the number of trainable
parameters specifically within the attention mechanisms while leaving other projections, like the output
projection layers, unmodified.</p>
        </sec>
        <sec id="sec-5-6-4">
          <title>5.8.2. Training Process</title>
          <p>Custom callbacks were defined for printing the best checkpoint and early stopping. The training
arguments included a learning rate of 1e-4, the AdamW optimizer, batch sizes of 2 per device for
training and evaluation, gradient accumulation steps of 8, and evaluation and save steps every 500 steps.
The model was trained for a maximum of 8762 steps (2 epochs). Early stopping parameters were set
with a patience of 6 and a threshold of 0.001. Evaluation loss was monitored to select the best model,
with lower values being better. Early stopping was triggered at 8,000 steps, and the best model, saved at
8,000 steps, was loaded at the end of training.</p>
        </sec>
      </sec>
      <sec id="sec-5-7">
        <title>5.9. Submission 9: VisionGPT2</title>
        <sec id="sec-5-7-1">
          <title>5.9.1. Model Description</title>
          <p>
            The Encoder-Decoder model was designed to take an image as input and generate a descriptive caption
as output. In this model, the Encoder was a ViT [
            <xref ref-type="bibr" rid="ref51 ref55">55, 51</xref>
            ] that processed the input image and extracted
meaningful features. These features were then fed into the Decoder, which is based on GPT-2 [
            <xref ref-type="bibr" rid="ref56">56</xref>
            ],
a powerful language model that generates the corresponding textual caption. For fine-tuning the
model, the Hugging Face Seq2SeqTrainer [
            <xref ref-type="bibr" rid="ref57">57</xref>
            ] was employed. This trainer, part of the Hugging Face
transformers library, is specifically designed to handle sequence-to-sequence tasks, making it well-suited
for this image captioning model. The fine-tuning process leverages the transformers library to adapt
the pre-trained ViT and GPT-2 models.
          </p>
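<p>The ViT + GPT-2 pairing described above is commonly assembled with the transformers library's VisionEncoderDecoderModel; a sketch under assumed checkpoint names (the authors' exact checkpoints are not stated):</p>
<preformat>
```python
from transformers import VisionEncoderDecoderModel

# Assumed checkpoint ids, for illustration only.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",  # ViT encoder
    "gpt2",                               # GPT-2 decoder (cross-attention added)
)
# Fine-tuning then proceeds with Seq2SeqTrainer from the same library.
```
</preformat>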
        </sec>
        <sec id="sec-5-7-2">
          <title>5.9.2. Training Process</title>
          <p>Initially, the pre-trained layers were frozen to focus on training the cross-attention layers. In subsequent
epochs, GPT-2 was unfrozen and trained, and in the final few epochs, the ViT was also unfrozen. The
Adam optimizer and the One Cycle Learning Rate (OneCycleLR) scheduler are used for optimization.
Mixed-precision fp16 training was employed with autocast and GradScaler in PyTorch. The training
metrics are cross-entropy loss and perplexity, with both metrics aimed to be minimized. The best model
was saved based on validation perplexity and was loaded during caption generation.</p>
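<p>The staged unfreezing can be illustrated with toy stand-in modules (hypothetical, for illustration only; freezing simply toggles requires_grad):</p>
<preformat>
```python
import torch.nn as nn

# Toy stand-ins for the ViT encoder, GPT-2 decoder, and cross-attention.
model = nn.ModuleDict({
    "encoder": nn.Linear(8, 8),
    "decoder": nn.Linear(8, 8),
    "cross_attention": nn.Linear(8, 8),
})

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Stage 1: train only the cross-attention layers.
set_trainable(model["encoder"], False)
set_trainable(model["decoder"], False)
set_trainable(model["cross_attention"], True)
stage1 = sum(p.numel() for p in model.parameters() if p.requires_grad)

# Stage 2: unfreeze the decoder (GPT-2); a final stage unfreezes the encoder (ViT).
set_trainable(model["decoder"], True)
stage2 = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(stage1, stage2)  # 72 144
```
</preformat>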
        </sec>
      </sec>
      <sec id="sec-5-8">
        <title>5.10. Submission 10: CNN-Transformer Fusion Model</title>
        <p>
          The CNN-Transformer fusion model for this submission was built around three core models. First, the
pre-trained ChexNet [
          <xref ref-type="bibr" rid="ref58">58</xref>
          ] (a DenseNet121 backbone based CNN model) was used to extract features
from the input images. These features captured essential visual information and were then passed to
the second component, a Transformer Encoder [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ]. The Transformer-based encoder processed the
extracted image features to generate a new, more informative representation of the inputs. Finally, the
third component, a Transformer Decoder [
          <xref ref-type="bibr" rid="ref59">59</xref>
          ], took the output from the encoder along with the text data
(sequences). The decoder used these inputs to learn and generate the corresponding image captions,
completing the image-to-text translation process. The hyper-parameters for the model included an
embedding dimension set to 512 and an initial learning rate of 0.0001. The encoder used a single attention
head, while the decoder utilized two attention heads to process the information. For early stopping,
the patience level was set to 5, meaning the training process halted if there was no improvement in
validation loss after five epochs.
        </p>
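<p>The fusion pipeline above (CNN features, then a Transformer encoder, then a Transformer decoder) can be sketched as follows. The ChexNet feature extractor is replaced by random dummy features, and the stated hyper-parameters (embedding dimension 512, one encoder head, two decoder heads) are used; everything else is illustrative.</p>
<preformat>
```python
import torch
import torch.nn as nn

embed_dim = 512
# Stand-in for DenseNet121 features: 49 spatial positions x 1024 channels.
features = torch.randn(2, 49, 1024)
proj = nn.Linear(1024, embed_dim)          # map CNN channels to embed_dim

encoder_layer = nn.TransformerEncoderLayer(embed_dim, nhead=1, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=1)

decoder_layer = nn.TransformerDecoderLayer(embed_dim, nhead=2, batch_first=True)
decoder = nn.TransformerDecoder(decoder_layer, num_layers=1)

memory = encoder(proj(features))           # encoded image representation
captions = torch.randn(2, 20, embed_dim)   # embedded caption tokens
out = decoder(captions, memory)            # (batch, seq, embed_dim)
print(out.shape)  # torch.Size([2, 20, 512])
```
</preformat>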
      </sec>
      <sec id="sec-5-9">
        <title>5.11. Performance Measurement Metrics for the Caption Prediction Task</title>
        <p>The performance of all the submissions on the caption generation task was evaluated using the
following metrics.</p>
        <p>
          • BERTScore [
          <xref ref-type="bibr" rid="ref60">60</xref>
          ] evaluates text generation by computing the similarity between BERT embeddings
of the candidate and reference sentences, capturing semantic meaning better than traditional
metrics.
• ROUGE (Recall-Oriented Understudy for Gisting Evaluation) [
          <xref ref-type="bibr" rid="ref61">61</xref>
          ] is a set of metrics for evaluating
automatic summarization and machine translation by comparing overlap in n-grams, word
sequences, and word pairs between the candidate and reference texts.
• BLEU (Bilingual Evaluation Understudy) [
          <xref ref-type="bibr" rid="ref62">62</xref>
          ] is a precision-based metric for evaluating machine
translation quality by comparing n-grams of the candidate translation to those of the reference
translation. BLEU-1 specifically considers unigram matches.
• BLEURT (Bilingual Evaluation Understudy with Representations from Transformers) [
          <xref ref-type="bibr" rid="ref63">63</xref>
          ] is a
learned evaluation metric for natural language generation that uses pre-trained transformers
fine-tuned on a variety of supervised and unsupervised signals to predict human judgment scores.
• METEOR (Metric for Evaluation of Translation with Explicit ORdering) [
          <xref ref-type="bibr" rid="ref64">64</xref>
          ] evaluates machine
translation by considering precision, recall, stemming, synonymy, and alignment, aiming to
improve correlation with human judgment.
• CIDEr (Consensus-based Image Description Evaluation) [
          <xref ref-type="bibr" rid="ref65">65</xref>
          ] is a metric for evaluating image
captioning by comparing candidate captions to reference captions using TF-IDF weighting and
n-gram similarity, ensuring relevance and importance of the words are considered.
• CLIPScore [
          <xref ref-type="bibr" rid="ref66">66</xref>
          ] is an evaluation metric that uses the CLIP model to compare image and text
similarity. It measures the alignment between visual content and textual descriptions, providing
a score based on their embedding similarity.
• RefCLIPScore [
          <xref ref-type="bibr" rid="ref66">66</xref>
          ] is an extension of CLIPScore that includes a reference-based evaluation,
incorporating both the similarity of the generated text to a reference text and the similarity
between the image and the generated text.
• ClinicalBLEURT [
          <xref ref-type="bibr" rid="ref67">67</xref>
          ] adapts BLEURT for clinical text generation, fine-tuning it on clinical datasets
to better evaluate the quality and relevance of generated clinical text against reference clinical
text.
• MedBERTScore [
          <xref ref-type="bibr" rid="ref67">67</xref>
          ] adapts BERTScore for the medical domain, using BERT embeddings
specifically fine-tuned on medical texts to provide a more accurate evaluation of medical text generation
tasks.
        </p>
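<p>To make one of these metrics concrete, BLEU-1's clipped unigram precision can be computed by hand. The function below is a simplified single-reference illustration, not the official evaluation code.</p>
<preformat>
```python
import math
from collections import Counter

def bleu1(candidate: list, reference: list) -> float:
    """Simplified single-reference BLEU-1: clipped unigram precision x brevity penalty."""
    cand_counts, ref_counts = Counter(candidate), Counter(reference)
    # Each candidate word counts at most as often as it appears in the reference.
    clipped = sum(min(n, ref_counts[w]) for w, n in cand_counts.items())
    precision = clipped / len(candidate)
    # Brevity penalty discourages captions shorter than the reference.
    bp = 1.0 if len(candidate) > len(reference) else math.exp(
        1 - len(reference) / len(candidate))
    return bp * precision

score = bleu1("an x ray of the pelvis".split(),
              "an x ray of the pelvis region".split())
print(round(score, 4))
```
</preformat>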
      </sec>
      <sec id="sec-5-10">
        <title>5.12. Results and Discussion on Caption Prediction Submissions</title>
        <p>
          In this year’s evaluation for the ImageCLEF task, BERTScore [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] was the primary metric used to assess
the quality of the generated captions, with ROUGE [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ] as the secondary metric.
        </p>
        <p>Table 1 shows the results of the submissions in terms of the primary performance metrics. In addition to
BERTScore and ROUGE, some other performance metrics were also adopted to assess submission results.
These metrics are BLEU-1, BLEURT, METEOR, CIDEr, CLIPScore, RefCLIPScore, ClinicalBLEURT, and
MedBERTScore. Table 2 shows the results of the additional performance metrics other than the
BERTScore and ROUGE used for the caption prediction task. In both tables, the submissions are listed
according to the BERTScore (highest to lowest).</p>
        <p>Our results indicate that LMMs, when selectively fine-tuned with fewer parameters, can achieve
high performance. Additionally, LMMs obtained through quantization and smaller VLMs can maintain
competitive performance in medical image understanding and caption generation. From Tables 1 and
2, it is evident that four different submissions outperformed the others in terms of the pre-specified
performance measurement metrics. Submission 1 using the LLaVA-v1.6-Mistral-7B model with 40.1M
fine-tuned parameters using the LoRA technique achieved the highest scores across several key metrics:
BERTScore (0.628059), ROUGE (0.250801), BLEU-1 (0.209298), BLEURT (0.317385), METEOR (0.092682),
CIDEr (0.245029), and RefCLIPScore (0.815534). Submission 3, also using the LLaVA-v.1.6-Mistral-7B
model with hybrid LoRA fine-tuning approach (350.6M parameters) attained the highest CLIPScore of
0.824171, indicating an improved semantic match between the generated captions and the visual content
of the medical images. Submission 10, the CNN-Transformer fusion approach (Pre-trained CheXNet as
the encoder and Transformer as the decoder) performed better than other submissions in terms of the
ClinicalBLEURT score of 0.676905. Finally, Submission 8, which was IDEFICS-9B-Instruct quantized to
4-bit, excelled in capturing relevant biomedical concepts compared to other submissions, achieving the
highest MedBERTScore of 0.657460034. Overall, the first submission can be claimed as the top performer
because of the highest scores in the primary and secondary metrics. Figure 7 shows the comparison of
the submissions in terms of the primary and secondary metrics. The significance of these submissions
lies in their demonstration of advanced fine-tuning techniques and model performance optimization in
the context of generative models. These findings highlight the evolving landscape of model fine-tuning
strategies, advocating for resource-efficient methods that maintain or enhance performance. This is
crucial for practical and scalable AI deployments across diverse medical applications.</p>
        <p>In addition to the above-mentioned submissions, Submission 4, utilizing the LLaVA v.1.6 Vicuna 7B
with selective fine-tuning using LoRA (34.4M parameters), demonstrated well-balanced performance
and closely followed Submission 1. Moreover, submissions 3 and 2, both based on the LLaVA v.1.6
Mistral 7B model but with different optimization approaches, closely followed Submission 4 in terms
of BERTScore and ROUGE. However, the sixth submission, LLaVA v.1.5 7B, based on another variant of
LLaVA, could not outperform the LLaVA v.1.6 variants except for LLaVA v.1.6 Vicuna 7B with hybrid
fine-tuning using the LoRA technique (Submission 5). The experiment with MoonDream2, with 74.4M
fine-tuned parameters, in Submission 7 showed competitive performance on the test data relative to the larger
models across multiple metrics. Submissions 9 and 10 were based on the pre-trained Transformer-based
encoder-decoder models other than the LMMs. VisionGPT2 outperformed the conventional pre-trained
CheXNet-Transformer or CNN-Transformer based model in every metric except ClinicalBLEURT. Table
3 shows the generated captions for a test image (ID: ImageCLEFmedical_Caption_2024_test_000016)
corresponding to the submissions made for the caption prediction task.</p>
        <p>Anteroposterior radiograph of the pelvis showing a sacral fracture (yellow
arrows) and a pubic fracture (yellow arrowhead).</p>
        <p>X-ray of the pelvis showing bilateral sacroiliitis (yellow arrows) and
bilateral pubic symphysis (yellow arrowheads).</p>
        <p>X-ray of the pelvis showing a large pelvic mass (arrows).</p>
        <p>Plain radiograph of the pelvis showing a large pelvic mass (yellow arrows)
with a large right-sided pelvic hematoma.</p>
        <p>X-ray of the pelvis showing the presence of a foreign body in the bladder
(yellow arrow) and the presence of a foreign body in the rectum.</p>
        <p>X-ray of the pelvis showing the fracture of the right pubis.</p>
        <p>Anteroposterior radiograph of the pelvis showing a right-sided sacroiliitis.</p>
        <p>X-ray of the pelvis showing the fracture of the right ilium (yellow arrows).</p>
        <p>CT scan of the chest. The CT scan showed a nodule in the right upper
lobe.</p>
        <p>Bone defect detected in the axillary region.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Concept Detection Task</title>
      <p>
        This year, the CS_Morgan team submitted a single entry for the concept detection task. The submission
involved the implementation of the ConvMixer [
        <xref ref-type="bibr" rid="ref68 ref69">68, 69</xref>
        ] model, which combines the CNN and Transformer
architectures.
      </p>
      <sec id="sec-6-1">
        <title>6.1. Model Description</title>
        <p>
          ConvMixer [
          <xref ref-type="bibr" rid="ref68 ref69">68, 69</xref>
          ] closely resembles the MLP-Mixer [
          <xref ref-type="bibr" rid="ref70">70</xref>
          ] model, with key differences in its architecture.
Instead of fully-connected layers, ConvMixer employs standard convolution layers. It uses batch
normalization rather than the layer normalization technique typically used in ViT [
          <xref ref-type="bibr" rid="ref51">51</xref>
          ] and
MLP-Mixers [
          <xref ref-type="bibr" rid="ref70">70</xref>
          ]. ConvMixer utilizes two types of convolution layers: depth-wise convolutions for mixing
spatial locations of the images and point-wise convolutions, following the depth-wise convolutions, for
mixing channel-wise information across the patches. Additionally, ConvMixer uses larger kernel sizes
to achieve a larger receptive field. Figure 8 shows the corresponding architecture of the model.
        </p>
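<p>A single ConvMixer block as described (a depth-wise convolution with a large kernel for spatial mixing, then a point-wise convolution for channel mixing, each with GELU and batch normalization) can be sketched as follows. The text's implementation used TensorFlow/Keras; this PyTorch equivalent with illustrative dimensions conveys the same structure.</p>
<preformat>
```python
import torch
import torch.nn as nn

class ConvMixerBlock(nn.Module):
    def __init__(self, dim: int, kernel_size: int = 9):
        super().__init__()
        # Depth-wise convolution (groups=dim) mixes spatial locations per channel;
        # a residual connection wraps it, as in the ConvMixer paper.
        self.depthwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )
        # Point-wise (1x1) convolution mixes information across channels.
        self.pointwise = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=1),
            nn.GELU(),
            nn.BatchNorm2d(dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pointwise(x + self.depthwise(x))

block = ConvMixerBlock(dim=64)
out = block(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```
</preformat>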
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Training and Result</title>
        <p>The training process involved developing a ConvMixer model designed for classification or concept
detection task with 1,944 unique CUIs. The model was built using TensorFlow and Keras, with key
components including an initial rescaling layer, a patch extraction stem, and a series of ConvMixer
blocks. The model utilized GELU activations and batch normalization for better performance. The
architecture included a global average pooling layer followed by a dense output layer with a sigmoid
activation function. Training was conducted over 200 epochs with a batch size of 8, a learning rate
of 0.001, and a weight decay of 0.0001. The Adam optimizer was used for training, and the binary
cross-entropy loss function was chosen for the multi-label classification task. Performance metrics such
as accuracy, precision, recall, and Area Under the Curve (AUC) were tracked during training. However,
only the F1-score was reported for the submission. A model checkpoint callback was implemented to
save the best model based on validation accuracy. After training, the model was evaluated using the
best checkpointed weights.</p>
        <p>By implementing this model, an F1-score of 0.107645 was attained on the test data, placing it in
ninth position for the concept detection task among the participants. This score indicates that the
model’s performance in terms of precision and recall is relatively low, as it represents the harmonic
mean of precision and recall, providing a single metric that balances both. The score suggests that the
model is struggling to correctly identify and classify the relevant instances among the 1,944 classes,
leading to either a high number of false positives, false negatives, or both. This low score reflects
room for improvement in the model’s ability to accurately predict the target labels. For a test image
(ID: ImageCLEFmedical_Caption_2024_test_000016), the predicted concepts or CUIs based on this
ConvMixer model were C0030797, C0000726, and C1306645, whereas the ground truth concepts were
C1306645, C0030797, and C0034014 (See Figure 9).</p>
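<p>For the test-image example above, the overlap between predicted and ground-truth concept sets makes the precision/recall trade-off behind the F1-score concrete (an illustrative calculation, not the official evaluation script):</p>
<preformat>
```python
predicted = {"C0030797", "C0000726", "C1306645"}
ground_truth = {"C1306645", "C0030797", "C0034014"}

tp = len(predicted & ground_truth)   # 2 concepts correctly predicted
precision = tp / len(predicted)      # 2/3
recall = tp / len(ground_truth)      # 2/3
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 3))  # 0.667
```
</preformat>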
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Conclusion</title>
      <p>For the Caption Prediction task, submitted models included LLaVA v.1.6 with Mistral 7B and Vicuna 7B
checkpoints, as well as the LLaVA v.1.5 7B model. Additionally, a 4-bit quantized instruct variant of
the IDEFICS 9B model and MoonDream2, a compact VLM, were explored. Two fine-tuning strategies,
selective and hybrid fine-tuning, were utilized. Furthermore, traditional encoder-decoder models like
VisionGPT2 and CNN-Transformer architectures were also experimented with. Among these, the
top-performing submission was the selective training of the LoRA projectors (40.1M parameters) on the
LLaVA 1.6 model with Mistral 7B weights. For the Concept Detection subtask, a single model based on
the ConvMixer architecture was submitted, which combines the strengths of CNNs and Transformers.</p>
      <p>In future research, the primary aim will be to incorporate Explainable AI and reinforcement learning.
Explainable AI will enhance model safety and reliability by identifying potential failures and undesirable
actions in LMMs. Reinforcement learning, using context-aware reward modeling, will integrate detailed
medical image concepts to improve content understanding and performance in multimodal tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>8. Acknowledgments</title>
      <p>This work was supported by the National Science Foundation (NSF) grant (ID. 2131307) “CISE-MSI: DP:
IIS: III: Deep Learning-Based Automated Concept and Caption Generation of Medical Images Towards
Developing an Effective Decision Support."</p>
    </sec>
    <sec id="sec-9">
      <title>A. Specifications of the Computational Environment</title>
      <p>The specifications of the utilized computational resources and environment included two machines.
The details are as follows.</p>
      <p>• Machine 1
• Machine 2
– Machine Type: a2-highgpu-2g (Accelerator Optimized: 2 NVIDIA Tesla A100 GPUs, 24
vCPUs, 170GB RAM)
– GPU: NVIDIA A100-40GB x 2
– Booting Disk: 1000 GB SSD
– Data Disk: 1000 GB SSD
– Language: Python 3.12.x
– Machine Type: n1-highmem-16 (16 vCPUs, 104 GB RAM)
– GPU: NVIDIA V100 x 2
– Boot disk: 150 GB SSD
– Data disk: 1000 GB SSD
– Language: Python 3.12.x
– Frameworks: PyTorch 2.x and Tensorflow 2.16.x</p>
    </sec>
    <sec id="sec-10">
      <title>B. GitHub Repository</title>
      <p>
        [
        <xref ref-type="bibr" rid="ref45">45</xref>
        ] provides the link to the GitHub repository, which is publicly available for accessing the reproducible
code relevant to the submissions made for this competition.
      </p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>I.</given-names>
            <surname>Allaouzi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. Ben</given-names>
            <surname>Ahmed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Benamrou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Ouardouz</surname>
          </string-name>
          ,
          <article-title>Automatic caption generation for medical images</article-title>
          ,
          <source>in: Proceedings of the 3rd International Conference on Smart City Applications</source>
          ,
          <year>2018</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>T.</given-names>
            <surname>Pang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <article-title>A survey on automatic generation of medical imaging reports based on deep learning</article-title>
          ,
          <source>BioMedical Engineering OnLine</source>
          <volume>22</volume>
          (
          <year>2023</year>
          )
          <fpage>48</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>R.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <surname>L. Zhang,</surname>
          </string-name>
          <article-title>Image caption and medical report generation based on deep learning: a review and algorithm analysis</article-title>
          ,
          <source>in: 2021 International Conference on Computer Information Science and Artificial Intelligence (CISAI)</source>
          , IEEE,
          <year>2021</year>
          , pp.
          <fpage>373</fpage>
          -
          <lpage>379</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>M.-H.</given-names>
            <surname>Van</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Verma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <article-title>On Large Visual Language Models for Medical Imaging Analysis: An Empirical Study</article-title>
          , arXiv e-prints (
          <year>2024</year>
          ) arXiv-
          <fpage>2402</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Visual instruction tuning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.-B.</given-names>
            <surname>Alayrac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Donahue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Luc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Miech</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Barr</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hasson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lenc</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Millican</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Reynolds</surname>
          </string-name>
          , et al.,
          <article-title>Flamingo: a visual language model for few-shot learning</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>23716</fpage>
          -
          <lpage>23736</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. W.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Hallacy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ramesh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Goh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Sastry</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Askell</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Mishkin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Clark</surname>
          </string-name>
          , et al.,
          <article-title>Learning transferable visual models from natural language supervision</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>8748</fpage>
          -
          <lpage>8763</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>D.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <article-title>The role of large language models in medical image processing: a narrative review</article-title>
          ,
          <source>Quantitative Imaging in Medicine and Surgery</source>
          <volume>14</volume>
          (
          <year>2024</year>
          )
          <fpage>1108</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Pan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Advancing Medical Imaging with Language Models: A Journey from N-grams to ChatGPT</article-title>
          , arXiv e-prints (
          <year>2023</year>
          ) arXiv-
          <fpage>2304</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>I.</given-names>
            <surname>Hartsock</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Rasool</surname>
          </string-name>
          ,
          <article-title>Vision-language models for medical report generation and visual question answering: A review</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2403</volume>
          .
          <fpage>02469</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <surname>OpenAI</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Achiam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Adler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Agarwal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ahmad</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Akkaya</surname>
          </string-name>
          , et al.,
          <article-title>GPT-4 technical report</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>08774</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>D.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Elhoseiny</surname>
          </string-name>
          ,
          <article-title>MiniGPT-4: Enhancing vision-language understanding with advanced large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2304</volume>
          .
          <fpage>10592</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name><given-names>S.</given-names> <surname>Zhang</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Xu</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Usuyama</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Xu</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Bagga</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Tinn</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Preston</surname></string-name>
          ,
          <string-name><given-names>R.</given-names> <surname>Rao</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Wei</surname></string-name>
          ,
          <string-name><given-names>N.</given-names> <surname>Valluri</surname></string-name>
          ,
          <string-name><given-names>C.</given-names> <surname>Wong</surname></string-name>
          ,
          <string-name><given-names>A.</given-names> <surname>Tupini</surname></string-name>
          ,
          <string-name><given-names>Y.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>M.</given-names> <surname>Mazzola</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Shukla</surname></string-name>
          ,
          <string-name><given-names>L.</given-names> <surname>Liden</surname></string-name>
          ,
          <string-name><given-names>J.</given-names> <surname>Gao</surname></string-name>
          ,
          <string-name><given-names>M. P.</given-names> <surname>Lungren</surname></string-name>
          ,
          <string-name><given-names>T.</given-names> <surname>Naumann</surname></string-name>
          ,
          <string-name><given-names>S.</given-names> <surname>Wang</surname></string-name>
          ,
          <string-name><given-names>H.</given-names> <surname>Poon</surname></string-name>
          ,
          <article-title>BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs</article-title>
          ,
          <year>2024</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>00915</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Usuyama</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Bagga</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Tinn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Preston</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Rao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Wei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Valluri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wong</surname>
          </string-name>
          , et al.,
          <article-title>Large-scale domain-specific pretraining for biomedical vision-language processing</article-title>
          , arXiv e-prints (
          <year>2023</year>
          ) arXiv-
          <fpage>2303</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Dan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Zhang,</surname>
          </string-name>
          <article-title>Chatdoctor: A medical chat model fine-tuned on a Large Language Model Meta-AI (LLAMA) using medical domain knowledge</article-title>
          ,
          <source>Cureus</source>
          <volume>15</volume>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>X.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>PourNejatian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. E.</given-names>
            <surname>Smith</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Parisien</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Compas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Martin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Costa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Flores</surname>
          </string-name>
          , et al.,
          <article-title>A large language model for electronic health records</article-title>
          ,
          <source>npj Digital Medicine</source>
          <volume>5</volume>
          (
          <year>2022</year>
          )
          <fpage>194</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcıa Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</source>
          , Springer Lecture Notes in Computer Science (LNCS), Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.10004v1. arXiv:
          <volume>2405</volume>
          .
          <fpage>10004</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Ben Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2024 - Caption Prediction and Concept Detection</article-title>
          , in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Radiology Objects in Context (ROCO): a multimodal image dataset</article-title>
          ,
          <source>in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          ,
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          ,
          <source>Nucleic acids research</source>
          <volume>32</volume>
          (
          <year>2004</year>
          )
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
          <article-title>A survey of large language models</article-title>
          ,
          <year>2023</year>
          . arXiv:
          <volume>2303</volume>
          .
          <fpage>18223</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Yi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          , et al.,
          <article-title>A survey on evaluation of large language models</article-title>
          ,
          <source>ACM Transactions on Intelligent Systems and Technology</source>
          <volume>15</volume>
          (
          <year>2024</year>
          )
          <fpage>1</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Manzoor</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Albarri</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Xian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Meng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Nakov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <article-title>Multimodality representation learning: A survey on evolution, pretraining and its applications</article-title>
          ,
          <source>ACM Transactions on Multimedia Computing, Communications and Applications</source>
          <volume>20</volume>
          (
          <year>2023</year>
          )
          <fpage>1</fpage>
          -
          <lpage>34</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Alam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Qu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Chowdhury</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <article-title>Efficient large language models: A survey</article-title>
          ,
          <source>Transactions on Machine Learning Research</source>
          (
          <year>2024</year>
          ). URL: https://openreview.net/forum?id=bsCCJHbO8A. Survey Certification.
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <article-title>Instruction tuning for large language models: A survey</article-title>
          ,
          <year>2024</year>
          . arXiv:2308.10792.
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Batra</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>32</volume>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Savarese</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models</article-title>
          ,
          <source>in: International conference on machine learning, PMLR</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>19730</fpage>
          -
          <lpage>19742</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Fung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Hoi</surname>
          </string-name>
          ,
          <article-title>InstructBLIP: Towards general-purpose vision-language models with instruction tuning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Tam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Muqeeth</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mohta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Bansal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. A.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <article-title>Few-shot parameter-efficient fine-tuning is better and cheaper than in-context learning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>35</volume>
          (
          <year>2022</year>
          )
          <fpage>1950</fpage>
          -
          <lpage>1965</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.-C.</given-names>
            <surname>So</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Lam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Collier</surname>
          </string-name>
          ,
          <article-title>On the effectiveness of parameter-efficient fine-tuning</article-title>
          ,
          <source>in: Proceedings of the AAAI Conference on Artificial Intelligence</source>
          , volume
          <volume>37</volume>
          ,
          <year>2023</year>
          , pp.
          <fpage>12799</fpage>
          -
          <lpage>12807</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Tian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ye</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Cao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Si</surname>
          </string-name>
          ,
          <article-title>mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections</article-title>
          ,
          <year>2022</year>
          . arXiv:2205.12005.
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>E. J.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Shen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Wallis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Allen-Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <article-title>LoRA: Low-Rank Adaptation of Large Language Models</article-title>
          ,
          <year>2021</year>
          . arXiv:2106.09685.
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          [34]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zeng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>The Expressive Power of Low-Rank Adaptation</article-title>
          ,
          <year>2024</year>
          . arXiv:2310.17513.
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          [35]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>
          ,
          <article-title>Attention is all you need</article-title>
          , in:
          <string-name>
            <given-names>I.</given-names>
            <surname>Guyon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>U. V.</given-names>
            <surname>Luxburg</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fergus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Vishwanathan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>30</volume>
          ,
          Curran Associates, Inc.,
          <year>2017</year>
          . URL: https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          [36]
          NVIDIA,
          <source>Flash Attention</source>
          , https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/nlp/nemo_megatron/flash_attention.html,
          <year>2024</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          [37]
          <string-name>
            <given-names>H.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. J.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Improved Baselines with Visual Instruction Tuning</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2024</year>
          , pp.
          <fpage>26296</fpage>
          -
          <lpage>26306</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          [38]
          Hugging Face
          ,
          <source>Hugging Face Transformers Documentation: LLaVA</source>
          ,
          <year>2024</year>
          . URL: https://huggingface.co/docs/transformers/model_doc/llava. Hugging Face documentation.
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          [39] lmsys,
          <source>Vicuna-7B-v1.3</source>
          , https://huggingface.co/lmsys/vicuna-7b-v1.3,
          <year>2023</year>
          . Hugging Face model hub.
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          [40]
          <string-name>
            <given-names>L.</given-names>
            <surname>Zheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-L.</given-names>
            <surname>Chiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Sheng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhuang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Xing</surname>
          </string-name>
          , et al.,
          <article-title>Judging llm-as-a-judge with mt-bench and chatbot arena</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref41">
        <mixed-citation>
          [41]
          <string-name>
            <given-names>A. Q.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Mensch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Bamford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. S.</given-names>
            <surname>Chaplot</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>de las Casas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Bressand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lengyel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          , et al.,
          <article-title>Mistral 7B</article-title>
          , arXiv preprint arXiv:2310.06825
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref42">
        <mixed-citation>
          [42]
          Mistral AI,
          <source>Mistral-7B-v0.1</source>
          , https://huggingface.co/mistralai/Mistral-7B-v0.1,
          <year>2024</year>
          . Hugging Face model hub.
        </mixed-citation>
      </ref>
      <ref id="ref43">
        <mixed-citation>
          [43]
          liuhaotian,
          <source>llava-v1.6-mistral-7b</source>
          , https://huggingface.co/liuhaotian/llava-v1.6-mistral-7b,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref44">
        <mixed-citation>
          [44]
          OpenAI,
          <source>CLIP ViT-L/14 model</source>
          , https://huggingface.co/openai/clip-vit-large-patch14,
          <year>2021</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref45">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>M.</given-names>
            <surname>Hoque</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. R.</given-names>
            <surname>Hasan</surname>
          </string-name>
          ,
          <article-title>Medical image interpretation with large multimodal models</article-title>
          , https://github.com/HoqueMahmudul/Medical-Image-Interpretation-with-Large-Multimodal-Models,
          <year>2023</year>
          . Accessed: 2024-06-19.
        </mixed-citation>
      </ref>
      <ref id="ref46">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <year>2019</year>
          . arXiv:1711.05101.
        </mixed-citation>
      </ref>
      <ref id="ref47">
        <mixed-citation>
          [47] vikhyatk,
          <source>Moondream2</source>
          , https://huggingface.co/vikhyatk/moondream2,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref48">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          , et al.,
          <article-title>Transformers: State-of-the-art natural language processing</article-title>
          ,
          <source>in: Proceedings of the 2020 conference on empirical methods in natural language processing: system demonstrations</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref49">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bubeck</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Eldan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Del Giorno</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gunasekar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y. T.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <article-title>Textbooks Are All You Need II: phi-1.5 technical report</article-title>
          ,
          <year>2023</year>
          . arXiv:2309.05463.
        </mixed-citation>
      </ref>
      <ref id="ref50">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Mustafa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <article-title>Sigmoid loss for language image pre-training</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF International Conference on Computer Vision</source>
          ,
          <year>2023</year>
          , pp.
          <fpage>11975</fpage>
          -
          <lpage>11986</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref51">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Heigold</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
          ,
          <year>2020</year>
          . arXiv:2010.11929, https://huggingface.co/google/vit-base-patch16-224.
        </mixed-citation>
      </ref>
      <ref id="ref52">
        <mixed-citation>
          [52]
          <string-name>
            <given-names>H.</given-names>
            <surname>Laurençon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Saulnier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Tronchon</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Bekman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lozhkov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Karamcheti</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kiela</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <article-title>OBELICS: An open web-scale filtered dataset of interleaved image-text documents</article-title>
          ,
          <year>2023</year>
          . arXiv:2306.16527.
        </mixed-citation>
      </ref>
      <ref id="ref53">
        <mixed-citation>
          [53]
          <string-name>
            <surname>LAION</surname>
          </string-name>
          ,
          <source>CLIP-ViT-H-14-laion2B-s32B-b79K</source>
          , https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K,
          <year>2023</year>
          . Accessed: 2024-06-19.
        </mixed-citation>
      </ref>
      <ref id="ref54">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lavril</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Izacard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Martinet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.-A.</given-names>
            <surname>Lachaux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lacroix</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Rozière</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Goyal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Hambro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Azhar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rodriguez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Joulin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Grave</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Lample</surname>
          </string-name>
          ,
          <article-title>LLaMA: Open and efficient foundation language models</article-title>
          ,
          <year>2023</year>
          . arXiv:2302.13971.
        </mixed-citation>
      </ref>
      <ref id="ref55">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>B.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Wan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Tomizuka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gonzalez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Keutzer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Vajda</surname>
          </string-name>
          ,
          <article-title>Visual transformers: Token-based image representation and processing for computer vision</article-title>
          ,
          <year>2020</year>
          . arXiv:
          <year>2006</year>
          .03677.
        </mixed-citation>
      </ref>
      <ref id="ref56">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language models are unsupervised multitask learners</article-title>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref57">
        <mixed-citation>
          [57]
          <string-name>
            <surname>Hugging Face</surname>
          </string-name>
          ,
          <source>transformers.Seq2SeqTrainer</source>
          , https://huggingface.co/docs/transformers/main_classes/trainer#transformers.Seq2SeqTrainer,
          <year>2023</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref58">
        <mixed-citation>
          [58]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Irvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shpanskaya</surname>
          </string-name>
          , et al.,
          <article-title>CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05225</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref59">
        <mixed-citation>
          [59]
          <string-name>
            <given-names>F.</given-names>
            <surname>Chollet</surname>
          </string-name>
          , Image captioning, https://keras.io/examples/vision/image_captioning/,
          <year>2023</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref60">
        <mixed-citation>
          [60]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating text generation with BERT</article-title>
          ,
          <year>2020</year>
          . arXiv:1904.09675.
        </mixed-citation>
      </ref>
      <ref id="ref61">
        <mixed-citation>
          [61]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out, Association for Computational Linguistics</source>
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref62">
        <mixed-citation>
          [62]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          ,
          <source>in: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          ,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref63">
        <mixed-citation>
          [63]
          <string-name>
            <given-names>T.</given-names>
            <surname>Sellam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Das</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. P.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <source>BLEURT: Learning Robust Metrics for Text Generation</source>
          ,
          <year>2020</year>
          . arXiv:2004.04696.
        </mixed-citation>
      </ref>
      <ref id="ref64">
        <mixed-citation>
          [64]
          <string-name>
            <given-names>S.</given-names>
            <surname>Banerjee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Lavie</surname>
          </string-name>
          ,
          <article-title>METEOR: An automatic metric for MT evaluation with improved correlation with human judgments</article-title>
          ,
          <source>in: Proceedings of the ACL workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization</source>
          ,
          <year>2005</year>
          , pp.
          <fpage>65</fpage>
          -
          <lpage>72</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref65">
        <mixed-citation>
          [65]
          <string-name>
            <given-names>R.</given-names>
            <surname>Vedantam</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. L.</given-names>
            <surname>Zitnick</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Parikh</surname>
          </string-name>
          ,
          <source>CIDEr: Consensus-based Image Description Evaluation</source>
          ,
          <year>2015</year>
          . arXiv:1411.5726.
        </mixed-citation>
      </ref>
      <ref id="ref66">
        <mixed-citation>
          [66]
          <string-name>
            <given-names>J.</given-names>
            <surname>Hessel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Holtzman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Forbes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. L.</given-names>
            <surname>Bras</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <article-title>CLIPScore: A Reference-free Evaluation Metric for Image Captioning</article-title>
          ,
          <year>2022</year>
          . arXiv:2104.08718.
        </mixed-citation>
      </ref>
      <ref id="ref67">
        <mixed-citation>
          [67]
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.-w.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Michalopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <source>An Investigation of Evaluation Metrics for Automated Medical Note Generation</source>
          ,
          <year>2023</year>
          . arXiv:2305.17364.
        </mixed-citation>
      </ref>
      <ref id="ref68">
        <mixed-citation>
          [68]
          <string-name>
            <given-names>A.</given-names>
            <surname>Trockman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. Z.</given-names>
            <surname>Kolter</surname>
          </string-name>
          ,
          <source>Patches are all you need?</source>
          ,
          <year>2022</year>
          . arXiv:2201.09792.
        </mixed-citation>
      </ref>
      <ref id="ref69">
        <mixed-citation>
          [69]
          <string-name>
            <surname>Keras</surname>
          </string-name>
          , ConvMixer example, https://keras.io/examples/vision/convmixer/,
          <year>2023</year>
          . Accessed: 2024-05-28.
        </mixed-citation>
      </ref>
      <ref id="ref70">
        <mixed-citation>
          [70]
          <string-name>
            <given-names>I. O.</given-names>
            <surname>Tolstikhin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Steiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Keysers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          , et al.,
          <article-title>Mlp-mixer: An all-mlp architecture for vision</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>34</volume>
          (
          <year>2021</year>
          )
          <fpage>24261</fpage>
          -
          <lpage>24272</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>