<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UIT-DarkCow team at ImageCLEFmedical Caption 2024: Diagnostic Captioning for Radiology Images Efficiency with Transformer Models</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Quan Van Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Huy Quang Pham</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dan Quang Tran</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thang Kien-Bao Nguyen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Nhat-Hao Nguyen-Dang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Thien B. Nguyen-Tat</string-name>
          <email>thienntb@uit.edu.vn</email>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Information Technology</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Vietnam National University</institution>
          ,
          <addr-line>Ho Chi Minh City</addr-line>
          ,
          <country country="VN">Vietnam</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>Purpose: This study focuses on the development of automated text generation from radiology images, termed diagnostic captioning, to assist medical professionals in reducing clinical errors and improving productivity. The aim is to provide tools that enhance report quality and efficiency, which can significantly impact both clinical practice and deep learning research in the biomedical field. Methods: In our participation in the ImageCLEFmedical2024 Caption evaluation campaign, we explored caption prediction tasks using advanced Transformer-based models. We developed methods incorporating Transformer encoder-decoder and Query Transformer architectures. These models were trained and evaluated to generate diagnostic captions from radiology images. Results: Experimental evaluations demonstrated the effectiveness of our models, with the VisionDiagnostor-BioBART model achieving the highest BERTScore of 0.6267. This performance contributed to our team, DarkCow, achieving third place on the leaderboard. Our source code is public at this link. Conclusion: Our diagnostic captioning models show great promise in aiding medical professionals by generating high-quality reports efficiently. This approach can facilitate better data processing and performance optimization in medical imaging departments, ultimately benefiting healthcare delivery.</p>
      </abstract>
      <kwd-group>
        <kwd>ImageCLEF</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Diagnostic Captioning</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>Image Understanding</kwd>
        <kwd>Radiology Images</kwd>
        <kwd>Transformer Models</kwd>
        <kwd>Encoder-Decoder</kwd>
        <kwd>Query Transformer</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        Machine learning, especially Deep Learning, is creating breakthroughs in many different fields, and
its impact on biomedicine is remarkable. With the exponential growth of biomedical data, researchers
are exploring its potential in biomedical engineering, advanced computing, imaging systems, and
biomedical data mining algorithms based on machine learning [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. One important area is Diagnostic
Captioning. Diagnostic Captioning is the process of automatically generating diagnostic text based on a
set of medical images collected during a medical examination. It can assist less experienced physicians
by minimizing clinical errors and helping experienced physicians generate diagnostic reports faster [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        ImageCLEF is an annual multimodal machine learning campaign, part of the Cross-Language
Evaluation Forum (CLEF), which has been running since 2003. It encourages research and
development of advanced multimedia processing systems for computer vision, image
analysis, classification, and retrieval in a multilingual, multimodal context. This year, one of ImageCLEF’s
four main tasks is ImageCLEFmedical, which includes a series of challenges ranging from annotating images
to creating synthetic images and answering questions. In ImageCLEF 2024 [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], we took part in the ImageCLEFmedical Caption task [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. As in previous years, this task comprised two subtasks: concept
detection and caption prediction.
      </p>
      <p>Concept detection aims to associate biomedical images with related medical concepts, while
caption prediction focuses on automatically generating preliminary diagnostic reports that accurately
describe the medical conditions and anatomical structures shown in the images. Concept detection also
supports diagnostic notes by identifying key concepts that should be included in the preliminary report.
Additionally, it can be used to index medical images according to related concepts, facilitating more
efficient organization and retrieval.</p>
      <p>
        Caption prediction, in other words diagnostic captioning, remains a challenging research problem. Its role is to
support the diagnostic process by providing a preliminary report rather than to replace the
physicians and human factors involved [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. It is designed as a tool to assist in generating an initial
diagnostic report of a patient’s condition, helping doctors focus on important areas of the image [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] and
assisting them in making diagnoses more accurately and quickly [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. This approach can increase the
efficiency of experienced clinicians, allowing them to handle high volumes of daily medical examinations
more quickly. For less experienced clinicians, automated annotation can help reduce the
likelihood of clinical errors [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ].
      </p>
      <sec id="sec-1-1">
        <title>1.1. DarkCow Team Contributions</title>
        <p>
          In this paper, we present the experiments and systems submitted by our DarkCow
team to this year’s caption prediction task, which secured third place on the leaderboard (see
Table 1). Our approaches build on the rapid development of deep learning techniques, especially
the Transformer [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] encoder-decoder architecture and the Query Transformer [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] for Large Language
Models [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. We leveraged the Vision Transformer (ViT) to extract visual features from radiology
images. To optimize the use of information, we also used VinVL [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to extract features of objects in
the images. Our first approach uses an encoder-decoder architecture to generate image captions.
In the second approach, we leveraged the Query Transformer to help the LLM understand images. We also
conducted experiments with image pre-processing, caption length, and object features to analyze the
impact of those aspects.
        </p>
        <p>This paper is organized as follows: Section 2 presents an overview of studies related to our
research field. In Section 3, we describe the dataset and provide a detailed analysis of it.
Next, Section 4 introduces the image pre-processing techniques. Section 5 details the design of the
proposed methods and the evaluation metrics. Section 6 presents experimental results for the proposed
methods. Section 7 analyzes the impact of image pre-processing, caption length, and object features. Finally, Section 8 summarizes the research and suggests
future directions.</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Background and Related Works</title>
      <sec id="sec-2-1">
        <title>2.1. Radiology Techniques</title>
        <p>
          With the continuous advancement of imaging technology, medical imaging diagnosis has evolved from
a supplementary examination tool to the most important clinical diagnostic and differential diagnostic
method in modern medicine. Radiology techniques are used to scan images within the body, which
are then interpreted and reported by radiologists to specialists [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. With advancements in imaging
technology, various imaging diagnostic methods have been developed, each with its own advantages
and limitations. For example, X-ray imaging [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ] offers non-invasive, quick, and painless imaging, but
it involves exposure to ionizing radiation, which increases the risk of developing cancer later in life. On
the other hand, MRI [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ] uses no ionizing radiation and offers high spatial resolution, but it has
relatively low sensitivity and longer scanning times.
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Former Medical Image Captioning Datasets</title>
        <p>
          Medical imaging diagnosis today plays an incredibly important role in both the healthcare and
information technology sectors. It not only aids in diagnosis and increases understanding of diseases
but also holds immense potential in improving healthcare delivery and enhancing quality of life. The
application of deep learning in medical image captioning in an era where AI is ubiquitous is evident; it
automates the annotation process and significantly accelerates image analysis. Several datasets have
been created to facilitate the training of medical image captioning tasks such as ROCO [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ], PadChest
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], MIMIC-CXR [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ], IU X-Ray [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ], and MedICaT [
          <xref ref-type="bibr" rid="ref19">19</xref>
          ].
        </p>
      </sec>
      <sec id="sec-2-3">
        <title>2.3. Related Work Methods</title>
        <p>
          For the task of medical image captioning, various methods have been developed, with pioneering
work in applying the CNN-RNN encoder-decoder approach to generate captions from medical images
conducted by Shin et al. [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ]. They utilized either the Network-in-Network or GoogLeNet architectures
as encoding models, followed by LSTM [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] or GRU [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] as the decoding RNN to translate the encoded
images into descriptive captions. In the process of translating images into biomedical text, MDNet [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ]
made a notable advancement by incorporating an attention mechanism. This model employs ResNet
for image encoding, extending its skip connections to mitigate vanishing gradients.
        </p>
        <p>
          In recent studies by Wang et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ], Kougia et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ], and Li et al. [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ], a fusion of generative models
and retrieval systems for Medical Image Captioning (MIC) has been explored. For instance, Wang
et al. [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ] proposed an approach that alternates between template retrieval and sentence generation for
rare abnormal descriptions. This method relies on a contextual relational-topic encoder derived from
visual and textual features, facilitating semantic consistency through hybrid knowledge co-reasoning.
Additionally, Kougia et al. [
          <xref ref-type="bibr" rid="ref24">24</xref>
          ] from the AUEB NLP group presented various systems for the ImageCLEFmed
2019 Caption task. One approach utilized a retrieval-based model that leverages visual features to
retrieve the most similar images based on cosine similarity, combining their concepts to predict relevant
captions. Another system incorporated CheXNet [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] with enhanced classification labels, employing
a CNN encoder and a feed-forward neural network (FFNN) for multi-label classification. They also
suggested an ensemble model by combining these systems, computing scores for returned concepts and
merging them with image similarity scores to select the most relevant concepts.
        </p>
        <p>
          Large language models (LLMs) have catalyzed significant progress in medical question answering;
Med-PaLM [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] was the first model to exceed a “passing” score on the US Medical Licensing Examination
(USMLE). However, this and other prior work suggested significant room for improvement, especially
when models’ answers were compared to clinicians’ answers. Med-PaLM 2 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ] bridges these gaps
by leveraging a combination of base LLM improvements, medical domain finetuning, and prompting
strategies including a novel ensemble refinement approach.
        </p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Dataset</title>
      <p>
        We thank the AUEB NLP Group for the excellent dataset analysis provided in the study of Kaliosis
et al. [
        <xref ref-type="bibr" rid="ref29">29</xref>
        ]. When comparing the ImageCLEFmedical2023 data with ImageCLEFmedical2024, we found no
significant differences for the caption prediction task. Therefore, we follow their analysis of the
dataset in this section.
      </p>
      <p>
        This year’s ImageCLEFmedical Caption task provided a dataset that includes 70,108 radiology images
in the training set, each annotated with medical concepts using UMLS terms and diagnostic captions.
The organizers initially divided the dataset into training and validation subsets [
        <xref ref-type="bibr" rid="ref30">30</xref>
        ]. Building on
previous campaigns, this year’s dataset is an updated and expanded version of the Radiology Objects in
Context (ROCO) dataset, which is sourced from a variety of biomedical studies in the PubMed Central
OpenAccess (PMC OA) subset. The dataset used for the caption prediction task includes images from
different modalities, such as X-ray and Computed Tomography (CT), although specific details about
the image types were not provided. The goal of the caption prediction task is to generate open-ended
diagnostic texts for the medical images (see Figure 1).
      </p>
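      <table-wrap>
        <label>Table 2</label>
        <caption>
          <p>Most common words (excluding stop-words)</p>
        </caption>
        <table>
          <thead>
            <tr><th>left</th><th>ct</th><th>image</th><th>chest</th><th>scan</th><th>computed</th><th>tomography</th><th>shows</th></tr>
          </thead>
          <tbody>
            <tr><td>18,136</td><td>15,167</td><td>10,245</td><td>10,082</td><td>9,296</td><td>9,273</td><td>8,969</td><td>8,600</td></tr>
          </tbody>
        </table>
      </table-wrap>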
      <p>In the caption prediction sub-task, each image has a diagnostic caption describing the depicted
medical condition. There are a total of 69,743 captions in the training dataset and 9,959 captions in the
validation dataset, one for each image. Similar to last year’s campaign, the majority of captions (99.47%,
or 69,743 out of 70,108) were unique. This is a notable difference from earlier editions of the task,
where the uniqueness percentage was much lower. As a result, traditional retrieval methods based on
nearest neighbor search are less effective this year, including variants with a weighting mechanism
based on the cosine similarity of the retrieved images. Therefore, more sophisticated caption generation
methods are needed.</p>
      <p>We observed that the maximum number of words in a single caption is 848 (occurring once), while
the minimum is 1 (also occurring once). The average caption length is 20.84 words. These statistics
apply to the entire dataset (training and validation sets). The five most common captions, as well as the
ten most popular words (excluding stop-words), can be found in Tables 3 and 2, respectively. In Figure 2
and Figure 3, we present the caption length distributions of the training and validation sets, both indicating
that the majority of captions contain fewer than 100 words.</p>
    </sec>
    <sec id="sec-4">
      <title>4. Image Pre-processing</title>
      <sec id="sec-4-1">
        <title>4.1. Denoising</title>
        <p>Denoising is crucial in enhancing image quality by reducing the noise while preserving the important
details. Noise in medical images can come from various sources, such as sensor imperfections, poor
scan conditions, or inherent patient movements during image acquisition.</p>
        <p>The smoothness of images is controlled with a Gaussian filter of fixed kernel
size. The Gaussian filter smooths an image by convolution. It employs
a Gaussian kernel, a matrix sampled from the Gaussian function, to adjust pixel values. This kernel is applied
over each pixel in the image, averaging the pixel values in its vicinity, weighted by their distance from
the central pixel. The standard deviation <inline-formula><tex-math>\sigma</tex-math></inline-formula> of the Gaussian determines the amount of blurring: a larger
<inline-formula><tex-math>\sigma</tex-math></inline-formula> results in more blurring, smoothing out more details and noise. This process helps in reducing noise
and is often used as a preparatory step in image processing tasks to enhance image quality without
losing critical structural details (see Figure 4).</p>
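        <p>The 2-D Gaussian function is given by:
<disp-formula><label>(1)</label><tex-math>G(x, y) = \frac{1}{2\pi\sigma^{2}} \, e^{-\frac{x^{2} + y^{2}}{2\sigma^{2}}}</tex-math></disp-formula></p>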
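        <p>As an illustration of Eq. (1), the following minimal Python sketch builds a normalized Gaussian kernel and applies it to a grayscale image. The function names <monospace>gaussian_kernel</monospace> and <monospace>gaussian_blur</monospace>, the 3x3 default size, and the edge-replication border handling are assumptions made for this example and are not the settings used in our experiments.</p>
        <preformat>
```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized size x size Gaussian kernel as a list of rows."""
    if size % 2 == 0:
        raise ValueError("size must be odd")
    half = size // 2
    norm = 2.0 * math.pi * sigma * sigma
    kernel = [
        [math.exp(-(x * x + y * y) / (2.0 * sigma * sigma)) / norm
         for x in range(-half, half + 1)]
        for y in range(-half, half + 1)
    ]
    # Renormalize so the sampled weights sum to one and brightness is preserved.
    total = sum(sum(row) for row in kernel)
    return [[value / total for value in row] for row in kernel]


def gaussian_blur(image, size=3, sigma=1.0):
    """Smooth a 2-D grayscale image (list of rows), replicating border pixels."""
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for dy in range(-half, half + 1):
                for dx in range(-half, half + 1):
                    yy = min(max(i + dy, 0), height - 1)
                    xx = min(max(j + dx, 0), width - 1)
                    acc += kernel[dy + half][dx + half] * image[yy][xx]
            out[i][j] = acc
    return out
```
        </preformat>
        <p>Because the Gaussian kernel is symmetric, this weighted sum equals a convolution. A larger sigma spreads the weights more evenly over the window and therefore blurs more strongly.</p>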
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Image Enhancement</title>
        <p>Medical image enhancement is one of the most widely used image processing techniques in the
medical domain. Its purpose is to improve the visual effect of the image and facilitate the analysis and
understanding of the image by humans or machines. The Laplacian operator and the Sobel gradient
operator are two common tools for edge detection and image sharpening, and we combine them in
six steps to enhance the image (see Figure 5).</p>
        <p>Step 1 Laplacian Operator: Apply the Laplacian operator to enhance contrast by emphasizing areas
of rapid intensity change in the original image.</p>
        <p>Step 2 Sobel Operator: Use the Sobel operator to enhance the edges of the image. This step also
helps to smooth out noise, making the edges clearer and more cohesive.</p>
        <p>Step 3 Smoothing: Smooth the image processed by the Sobel operator using a 3x3 mean filter. This
step increases the contrast of the edges against the background.</p>
        <p>Step 4 Dot Product: Intensify the contrast by multiplying the smoothed image element-wise with
the Laplacian result from Step 1.</p>
        <p>Step 5 Addition for Final Sharpening: Enhance the sharpness and visibility of detail by adding
the result of the product back to the original image.</p>
        <p>Step 6 Histogram Equalization: Apply histogram equalization to distribute the histogram of the
image uniformly, improving the overall contrast and making fine details more visible.</p>
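        <p>The six enhancement steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not our exact implementation: the 4-neighbour Laplacian kernel, the gradient magnitude taken as the sum of absolute Sobel responses, the scaling of the smoothed gradient to the range [0, 1] before the product, and all function names are choices made for this example.</p>
        <preformat>
```python
LAPLACIAN = [[0, -1, 0], [-1, 4, -1], [0, -1, 0]]  # negated Laplacian: adding it sharpens
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
MEAN_3X3 = [[1.0 / 9.0] * 3 for _ in range(3)]


def filter3x3(image, kernel):
    """Correlate a 2-D list of floats with a 3x3 kernel, replicating border pixels."""
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in (-1, 0, 1):
                for dj in (-1, 0, 1):
                    ii = min(max(i + di, 0), height - 1)
                    jj = min(max(j + dj, 0), width - 1)
                    acc += kernel[di + 1][dj + 1] * image[ii][jj]
            out[i][j] = acc
    return out


def equalize(image):
    """Histogram-equalize an image after clipping and rounding it to 8-bit levels."""
    levels = [[int(round(min(max(v, 0.0), 255.0))) for v in row] for row in image]
    hist = [0] * 256
    for row in levels:
        for v in row:
            hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)
    total = cdf[-1]
    cdf_min = min(c for c in cdf if c > 0)
    if total == cdf_min:  # constant image: nothing to equalize
        return levels
    lut = [int(round(max(c - cdf_min, 0) / (total - cdf_min) * 255)) for c in cdf]
    return [[lut[v] for v in row] for row in levels]


def enhance(image):
    """Run the six enhancement steps on a 2-D grayscale image with values in 0..255."""
    lap = filter3x3(image, LAPLACIAN)                                   # Step 1
    gx, gy = filter3x3(image, SOBEL_X), filter3x3(image, SOBEL_Y)       # Step 2
    grad = [[abs(a) + abs(b) for a, b in zip(ra, rb)] for ra, rb in zip(gx, gy)]
    smooth = filter3x3(grad, MEAN_3X3)                                  # Step 3
    peak = max(max(row) for row in smooth) or 1.0  # assumed scaling of the mask to [0, 1]
    product = [[l * s / peak for l, s in zip(rl, rs)]                   # Step 4
               for rl, rs in zip(lap, smooth)]
    sharp = [[p + q for p, q in zip(ri, rp)]                            # Step 5
             for ri, rp in zip(image, product)]
    return equalize(sharp)                                              # Step 6
```
        </preformat>
        <p>The smoothed gradient acts as a mask, so the Laplacian detail is added back mainly near strong edges and is suppressed in flat regions where it would mostly amplify noise.</p>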
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Proposed Method</title>
      <sec id="sec-5-1">
        <title>5.1. Encoder-Decoder Approach</title>
        <sec id="sec-5-1-1">
          <title>5.1.1. Features Embedding</title>
          <p>
            We propose VisionDiagnostor, which centers on a Transformer encoder-decoder
approach, and we evaluate variants that use ClinicalT5 [
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] and BioBART [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] as the
encoder-decoder module (see Figure 6). ClinicalT5, based on the T5 [
            <xref ref-type="bibr" rid="ref33">33</xref>
            ] architecture, and BioBART, a variant
of the BART [34] architecture, have both been pre-trained on large amounts of biomedical text data. These
models are among the strongest pre-trained language models for the medical domain,
which supports the efficacy and robustness of our proposed method.
          </p>
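          <p>Object features: We used the VinVL model to extract object
features <inline-formula><tex-math>R = \{r_1, r_2, \ldots, r_n\}</tex-math></inline-formula> from an image, with each <inline-formula><tex-math>r_i</tex-math></inline-formula> being a 2048-dimensional vector. Bounding
box coordinates are normalized by the image width <inline-formula><tex-math>w</tex-math></inline-formula> and height <inline-formula><tex-math>h</tex-math></inline-formula> as <inline-formula><tex-math>b_i = \left[\frac{x_1}{w}, \frac{y_1}{h}, \frac{x_2}{w}, \frac{y_2}{h}\right]</tex-math></inline-formula>, forming <inline-formula><tex-math>B_{\mathrm{obj}} = \{b_1, b_2, \ldots, b_n\}</tex-math></inline-formula>.</p>
          <p>Final object features <inline-formula><tex-math>F_{\mathrm{obj}}</tex-math></inline-formula> are computed by projecting <inline-formula><tex-math>R</tex-math></inline-formula> and <inline-formula><tex-math>B_{\mathrm{obj}}</tex-math></inline-formula> to the language model dimension
and summing the results:
<disp-formula><label>(2)</label><tex-math>F_{\mathrm{obj}} = R' + B'_{\mathrm{obj}}</tex-math></disp-formula></p>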
          <p>We use ViT for visual feature extraction due to its ability to capture global information through
its attention mechanism. By freezing ViT and projecting the last hidden state to match the language
model’s dimension, we obtain visual features <inline-formula><tex-math>F_{v}</tex-math></inline-formula>.</p>
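          <p>[Figure 6: VisionDiagnostor architecture. Diagram labels: VinVL, ViT, Visual Features, Encoder, Decoder, Image Caption. Example image credit: CC BY [Peres et al. (2017)].]</p>
          <p>The input embedding to the encoder-decoder module is:
<disp-formula><label>(3)</label><tex-math>\mathrm{Input} = \mathrm{Concat}(F_{v}, F_{\mathrm{obj}})</tex-math></disp-formula>
where <inline-formula><tex-math>F_{v}</tex-math></inline-formula> are the visual features from ViT, and <inline-formula><tex-math>F_{\mathrm{obj}}</tex-math></inline-formula> are the VinVL region object features. The <inline-formula><tex-math>\mathrm{Concat}(\cdot)</tex-math></inline-formula>
function concatenates these features.</p>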
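          <p>To make Eq. (2) and Eq. (3) concrete, the sketch below projects region features and normalized boxes to a common model dimension, sums them, and concatenates the result with the visual tokens along the sequence axis. It is a toy illustration in plain Python: the helper names, the random initialization, and the small dimensions used when calling it are assumptions, and trained projection layers in a deep learning framework would take their place in practice.</p>
          <preformat>
```python
import random


def normalize_box(box, width, height):
    """Scale (x1, y1, x2, y2) pixel coordinates to the range [0, 1]."""
    x1, y1, x2, y2 = box
    return [x1 / width, y1 / height, x2 / width, y2 / height]


def make_linear(d_in, d_out, rng):
    """Create a randomly initialised linear layer as a (weight, bias) pair."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(d_out)] for _ in range(d_in)]
    return weight, [0.0] * d_out


def linear(vector, layer):
    """Compute vector @ weight + bias for one input vector."""
    weight, bias = layer
    return [sum(v * weight[i][j] for i, v in enumerate(vector)) + bias[j]
            for j in range(len(bias))]


def object_embeddings(regions, boxes, width, height, region_layer, box_layer):
    """Eq. (2): project region features and normalized boxes, then sum them."""
    embedded = []
    for region, box in zip(regions, boxes):
        region_proj = linear(region, region_layer)
        box_proj = linear(normalize_box(box, width, height), box_layer)
        embedded.append([a + b for a, b in zip(region_proj, box_proj)])
    return embedded


def build_input(visual_tokens, object_tokens):
    """Eq. (3): concatenate visual and object tokens along the sequence axis."""
    return list(visual_tokens) + list(object_tokens)
```
          </preformat>
          <p>Summing the two projections keeps one token per detected object, so the encoder input length is the number of visual tokens plus the number of detected objects.</p>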
        </sec>
        <sec id="sec-5-1-2">
          <title>5.1.2. Encoder-Decoder Module</title>
          <p>
            In this task, we employed the Transformer encoder-decoder architecture, which is used in ClinicalT5
[
            <xref ref-type="bibr" rid="ref31">31</xref>
            ] and BioBART [
            <xref ref-type="bibr" rid="ref32">32</xref>
            ] for the encoder-decoder module of VisionDiagnostor. The encoder receives the
input features and then passes them to the decoder to generate the output sentence. In the decoder,
attention mechanisms are employed, directing focus to both the output of the encoder and the input of
the decoder.
          </p>
        </sec>
      </sec>
      <sec id="sec-5-2">
        <title>Encoder</title>
      </sec>
      <sec id="sec-5-3">
        <title>Multi-Head Attention:</title>
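        <p><disp-formula><label>(4)</label><tex-math>\mathrm{Attention}^{(\mathrm{Enc})}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V</tex-math></disp-formula>
where <inline-formula><tex-math>Q</tex-math></inline-formula>, <inline-formula><tex-math>K</tex-math></inline-formula>, and <inline-formula><tex-math>V</tex-math></inline-formula> are the query, key, and value matrices, and <inline-formula><tex-math>d_k</tex-math></inline-formula> is the dimensionality of the key
vectors.</p>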
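        <p>The attention formula can be sketched for a single head in plain Python as follows. The function names are chosen for this example, and the learned projections that produce the queries, keys, and values, as well as the splitting into multiple heads, are omitted for brevity.</p>
        <preformat>
```python
import math


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(keys[0])
    scores = [
        [sum(q_i * k_i for q_i, k_i in zip(q, k)) / math.sqrt(d_k) for k in keys]
        for q in queries
    ]
    weights = [softmax(row) for row in scores]
    d_v = len(values[0])
    return [
        [sum(w * v[j] for w, v in zip(w_row, values)) for j in range(d_v)]
        for w_row in weights
    ]
```
        </preformat>
        <p>The same function covers self-attention, where the queries, keys, and values come from one sequence, and cross-attention, where the queries come from the decoder and the keys and values come from the encoder output.</p>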
      </sec>
      <sec id="sec-5-4">
        <title>Encoder Feed-Forward Network:</title>
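        <p><disp-formula><label>(5)</label><tex-math>\mathrm{FFN}^{(\mathrm{Enc})}(x) = \mathrm{ReLU}\left(x W_1^{(\mathrm{Enc})} + b_1^{(\mathrm{Enc})}\right) W_2^{(\mathrm{Enc})} + b_2^{(\mathrm{Enc})}</tex-math></disp-formula>
where <inline-formula><tex-math>W_1^{(\mathrm{Enc})}</tex-math></inline-formula>, <inline-formula><tex-math>W_2^{(\mathrm{Enc})}</tex-math></inline-formula>, <inline-formula><tex-math>b_1^{(\mathrm{Enc})}</tex-math></inline-formula>, and <inline-formula><tex-math>b_2^{(\mathrm{Enc})}</tex-math></inline-formula> are learnable parameters.</p>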
      </sec>
      <sec id="sec-5-5">
        <title>Encoder Layer Normalization:</title>
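        <p><disp-formula><label>(6)</label><tex-math>\mathrm{LayerNorm}^{(\mathrm{Enc})}(x) = \mathrm{LN}^{(\mathrm{Enc})}\left(x + \mathrm{Sublayer}^{(\mathrm{Enc})}(x)\right)</tex-math></disp-formula>
where <inline-formula><tex-math>\mathrm{LN}^{(\mathrm{Enc})}</tex-math></inline-formula> is the layer normalization function and <inline-formula><tex-math>\mathrm{Sublayer}^{(\mathrm{Enc})}</tex-math></inline-formula> denotes the preceding attention or feed-forward sub-layer.</p>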
      </sec>
      <sec id="sec-5-6">
        <title>Decoder Self-Attention:</title>
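        <p><disp-formula><label>(7)</label><tex-math>\mathrm{Attention}^{(\mathrm{Dec})}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V</tex-math></disp-formula>
where <inline-formula><tex-math>Q</tex-math></inline-formula>, <inline-formula><tex-math>K</tex-math></inline-formula>, and <inline-formula><tex-math>V</tex-math></inline-formula> are the query, key, and value matrices, and <inline-formula><tex-math>d_k</tex-math></inline-formula> is the dimensionality of the key
vectors.</p>
      </sec>
      <sec id="sec-5-7">
        <title>Decoder-Encoder Cross-Attention:</title>
        <p><disp-formula><label>(8)</label><tex-math>\mathrm{Attention}^{(\mathrm{Dec})}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V</tex-math></disp-formula>
where <inline-formula><tex-math>Q</tex-math></inline-formula> comes from the decoder and <inline-formula><tex-math>K</tex-math></inline-formula>, <inline-formula><tex-math>V</tex-math></inline-formula> come from the encoder.</p>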
      </sec>
      <sec id="sec-5-8">
        <title>Decoder Feed-Forward Network:</title>
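        <p><disp-formula><label>(9)</label><tex-math>\mathrm{FFN}^{(\mathrm{Dec})}(x) = \mathrm{ReLU}\left(x W_1^{(\mathrm{Dec})} + b_1^{(\mathrm{Dec})}\right) W_2^{(\mathrm{Dec})} + b_2^{(\mathrm{Dec})}</tex-math></disp-formula>
where <inline-formula><tex-math>W_1^{(\mathrm{Dec})}</tex-math></inline-formula>, <inline-formula><tex-math>W_2^{(\mathrm{Dec})}</tex-math></inline-formula>, <inline-formula><tex-math>b_1^{(\mathrm{Dec})}</tex-math></inline-formula>, and <inline-formula><tex-math>b_2^{(\mathrm{Dec})}</tex-math></inline-formula> are learnable parameters.</p>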
      </sec>
      <sec id="sec-5-9">
        <title>Decoder Layer Normalization:</title>
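        <p><disp-formula><label>(10)</label><tex-math>\mathrm{LayerNorm}^{(\mathrm{Dec})}(x) = \mathrm{LN}^{(\mathrm{Dec})}\left(x + \mathrm{Sublayer}^{(\mathrm{Dec})}(x)\right)</tex-math></disp-formula>
where <inline-formula><tex-math>\mathrm{LN}^{(\mathrm{Dec})}</tex-math></inline-formula> is the layer normalization function and <inline-formula><tex-math>\mathrm{Sublayer}^{(\mathrm{Dec})}</tex-math></inline-formula> denotes the preceding attention or feed-forward sub-layer.</p>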
      </sec>
      <sec id="sec-5-10">
        <title>5.2. Query Transformer Approach</title>
        <p>Inspired by the BLIP2 architecture [35], we leveraged the Query Transformer (Q-Former) module, which
serves as the trainable intermediary between a fixed image encoder and a fixed Large Language Model.
It extracts a consistent number of output features from the image encoder, irrespective of the input
image resolution. Q-Former comprises two transformer submodules that share self-attention layers: an
image transformer for visual feature extraction from the fixed image encoder and a text transformer
acting as both an encoder and decoder.</p>
        <p>We initialize a set number of learnable query embeddings as input to the image transformer. These
queries engage in self-attention and cross-attention interactions with each other and the frozen image
features. Additionally, they can interact with the text through self-attention layers, with diferent
attention masks applied based on the pre-training task.</p>
        <p>In our experiments, we employ 64 queries, each with a dimensionality of 768, matching the hidden
dimension of Q-Former. We utilize ViT-huge [36] as the frozen image encoder and BioMistral-7B [37] as
the frozen LLM for caption generation; we call the resulting model VisionDiagnostor-Q-BioMistral (depicted
in Figure 7). This bottleneck architecture, combined with our pre-training objectives, compels the queries
to extract visual information most pertinent to the accompanying text.</p>
        <p>[Figure 7: VisionDiagnostor-Q-BioMistral. Panel titles: Vision-and-Language Representation Learning; Vision-to-Language Generative Learning. Diagram labels: Image Encoder, Q-Former (Querying Transformer).]</p>
      </sec>
      <sec id="sec-5-11">
        <title>5.3. Evaluation Metrics</title>
        <sec id="sec-5-11-1">
          <title>5.3.1. BERTScore</title>
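          <p>BERTScore is computed as proposed by Zhang et al. [38], where the cosine similarity of each hypothesis
token <inline-formula><tex-math>p_j</tex-math></inline-formula> with each token <inline-formula><tex-math>r_i</tex-math></inline-formula> in the reference sentence is calculated using contextualized embeddings.
Instead of using a time-consuming best-case matching approach, a greedy matching strategy is employed.
Recall, precision, and the F1 measure are then calculated as follows:
<disp-formula><label>(11)</label><tex-math>R_{\mathrm{BERT}} = \frac{1}{|r|} \sum_{r_i \in r} \max_{p_j \in p} \cos(\vec{r}_i, \vec{p}_j)</tex-math></disp-formula>
<disp-formula><label>(12)</label><tex-math>P_{\mathrm{BERT}} = \frac{1}{|p|} \sum_{p_j \in p} \max_{r_i \in r} \cos(\vec{p}_j, \vec{r}_i)</tex-math></disp-formula>
<disp-formula><label>(13)</label><tex-math>\mathrm{BERTScore} = F_{\mathrm{BERT}} = 2 \cdot \frac{P_{\mathrm{BERT}} \cdot R_{\mathrm{BERT}}}{P_{\mathrm{BERT}} + R_{\mathrm{BERT}}}</tex-math></disp-formula>
where <inline-formula><tex-math>r</tex-math></inline-formula> and <inline-formula><tex-math>p</tex-math></inline-formula> denote the token sequences of the reference and the hypothesis, respectively.</p>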
          <p>Zhang et al. [38] report that BERTScore correlates better with human judgments than earlier n-gram-based metrics for the tasks of image captioning and
machine translation.</p>
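          <p>The greedy matching behind BERTScore can be sketched as follows. The example assumes that contextualized token embeddings have already been computed and are passed in as plain lists of floats; the function names are illustrative, and the official implementation additionally supports options such as importance weighting and baseline rescaling that are left out here.</p>
          <preformat>
```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def bert_score(reference_emb, hypothesis_emb):
    """Return (precision, recall, f1) from greedy token matching."""
    # Recall: every reference token is matched to its most similar hypothesis token.
    recall = sum(
        max(cosine(r, p) for p in hypothesis_emb) for r in reference_emb
    ) / len(reference_emb)
    # Precision: every hypothesis token is matched to its most similar reference token.
    precision = sum(
        max(cosine(p, r) for r in reference_emb) for p in hypothesis_emb
    ) / len(hypothesis_emb)
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```
          </preformat>
          <p>Greedy matching lets several tokens share the same best match, which avoids solving an optimal one-to-one assignment between the two sentences.</p>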
        </sec>
        <sec id="sec-5-11-2">
          <title>5.3.2. Other Metrics</title>
          <p>In addition to BERTScore, the competition also uses many other metrics such as ROUGE [39], BLEU-1
[40], BLEURT [41], METEOR [42], CIDEr [43], CLIPScore [44], RefCLIPScore [44], ClinicalBLEURT
[45] and MedBERTScore [46]. Applying a variety of metrics gives a more accurate and
general view of the performance of the participating teams’ models. Each measure has its own advantages
and provides a different perspective on text quality, which makes it relevant in a medical context. This
multi-dimensional evaluation helps identify outstanding models and gives an objective view of the
competition.</p>
        </sec>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>6. Experiment Results</title>
      <sec id="sec-6-1">
        <title>6.1. Experimental Configuration</title>
        <p>All our proposed methods were trained and fine-tuned using the Adam optimizer [47]. We utilized
an A100 GPU with 80GB of memory to train the models, taking 10 hours on average for each method.
We set the learning rate to 3e-05, the dropout to 0.2, and the batch size to 32, and training is
terminated once the validation loss has not decreased for 3 consecutive epochs.</p>
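        <p>The stopping rule can be sketched as follows. The hyperparameter values are the ones stated above; the dictionary keys, the class name <monospace>EarlyStopping</monospace>, and the helper <monospace>epochs_until_stop</monospace> are illustrative names for this example rather than part of our released code.</p>
        <preformat>
```python
# Hyperparameters as stated in the text; the key names are illustrative.
CONFIG = {
    "optimizer": "adam",
    "learning_rate": 3e-05,
    "dropout": 0.2,
    "batch_size": 32,
    "patience": 3,
}


class EarlyStopping:
    """Signal a stop once the validation loss has not improved for `patience` epochs."""

    def __init__(self, patience):
        self.patience = patience
        self.best_loss = float("inf")
        self.epochs_without_improvement = 0

    def update(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if self.best_loss > val_loss:
            self.best_loss = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience


def epochs_until_stop(val_losses, patience=CONFIG["patience"]):
    """Return how many epochs run before early stopping triggers (or all of them)."""
    stopper = EarlyStopping(patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.update(loss):
            return epoch
    return len(val_losses)
```
        </preformat>
        <p>In a real training loop, <monospace>update</monospace> would be called once per epoch with the measured validation loss, and the checkpoint with the lowest loss would typically be kept.</p>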
      </sec>
      <sec id="sec-6-2">
        <title>6.2. Main Result</title>
        <p>Table 4 presents a comprehensive summary of the results achieved on the test set by the individual models,
reporting their BERTScore and other metrics. The findings show clear differences in
performance among the models, providing insight into their respective strengths and
weaknesses. Notably, VisionDiagnostor-BioBART stands out as the top
performer, achieving a BERTScore of 0.6267 and the best scores on almost all other metrics while having the smallest
size at 227M parameters. Moreover, Table 4 demonstrates that using a large-scale pre-trained model in
VisionDiagnostor-Q-BioMistral, with its very large size (8B), does not result in a significant performance
improvement in this task.</p>
      </sec>
    </sec>
    <sec id="sec-7">
      <title>7. Result Analysis</title>
      <p>Because the number of submissions allowed in the competition was limited, we could not use the
test set for every comparison. In this section, we therefore analyze the results of our proposed methods on the
validation set instead.</p>
      <sec id="sec-7-1">
        <title>7.1. Impact of Image Pre-processing</title>
        <p>Table 5 presents the results comparing the performance of the models with and without image
pre-processing on the validation dataset, evaluated using BERTScore. Specifically, for the
VisionDiagnostor-Q-BioMistral model, BERTScore decreased from 0.6841 to 0.6740 after applying pre-processing,
corresponding to a decrease of 0.0101. Similarly, VisionDiagnostor-ClinicalT5 also saw a decrease in
performance from 0.7071 to 0.6905, a decrease of 0.0166. In contrast, VisionDiagnostor-BioBART is
the only model with an improvement with BERTScore increasing from 0.7165 to 0.7363, an increase of
0.0198.</p>
        <p>Overall, applying image pre-processing does not appear to yield significant improvement for most
models. For two of the three models (VisionDiagnostor-Q-BioMistral and
VisionDiagnostor-ClinicalT5), image pre-processing even degrades performance. A likely reason is that the input
images are already of good quality and have almost no noise. Some images also contain clear visual cues, such as
arrows pointing to the region described in the caption (see Figure 1 in Section 3), making it easy for the
model to understand and process the content without additional pre-processing.</p>
      </sec>
      <sec id="sec-7-2">
        <title>7.2. Impact of Caption Length</title>
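        <p>To study the effect of caption length, we divide the captions into four groups:</p>
        <list list-type="bullet">
          <list-item><p>Short caption: captions shorter than 21 words.</p></list-item>
          <list-item><p>Medium caption: captions from 21 to 25 words.</p></list-item>
          <list-item><p>Long caption: captions from 26 to 30 words.</p></list-item>
          <list-item><p>Very long caption: captions longer than 30 words.</p></list-item>
        </list>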
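        <p>The grouping above can be expressed as a small helper. The sketch assumes that words are separated by whitespace, and the function names <monospace>caption_group</monospace> and <monospace>mean_score_by_group</monospace> are illustrative; the per-caption scores passed in would be BERTScore values computed separately.</p>
        <preformat>
```python
def caption_group(caption):
    """Assign a caption to a length group based on its whitespace-separated word count."""
    n_words = len(caption.split())
    if n_words >= 31:
        return "very long"
    if n_words >= 26:
        return "long"
    if n_words >= 21:
        return "medium"
    return "short"


def mean_score_by_group(captions, scores):
    """Average a per-caption score (for example BERTScore) within each length group."""
    totals = {}
    for caption, score in zip(captions, scores):
        group = caption_group(caption)
        total, count = totals.get(group, (0.0, 0))
        totals[group] = (total + score, count + 1)
    return {group: total / count for group, (total, count) in totals.items()}
```
        </preformat>
        <p>Grouping by the reference caption length keeps the groups identical across models, so their per-group scores can be compared directly.</p>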
        <p>Figure 8 shows how the performance of each model varies across
different caption lengths. The results show that caption length plays an important role in
model performance.</p>
        <p>Specifically, the two encoder-decoder models, VisionDiagnostor-ClinicalT5 and VisionDiagnostor-BioBART,
follow similar trends: both show a gradual decrease in BERTScore as the
caption length increases. This may indicate a limitation of this architecture in handling longer captions.</p>
        <p>The VisionDiagnostor-Q-BioMistral model behaves differently: its
performance increases as the caption length increases. This may imply that this model
handles longer captions more effectively than the other models, possibly due to its greater complexity and
size.</p>
      </sec>
      <sec id="sec-7-3">
        <title>7.3. Impact of Object Features</title>
        <p>
          According to papers from competing teams in previous years [48], [49], [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ], the most popular image
feature extraction methods today follow two main directions: convolutional neural networks (CNN)
and Vision Transformers (ViT). Previous studies have shown that ViT often gives better
results than CNN in image captioning. ViT captures long-range and global
relationships in images more effectively, which leads to richer and more accurate captions.
To improve the quality of feature extraction further, we used the VinVL model. VinVL
detects and represents the objects in an image in detail. This gives the
model a deeper understanding of the context and elements in the image, and thereby helps it produce more
accurate captions.
        </p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>8. Conclusion and Future Works</title>
      <p>In this paper, we have proposed three different models for the task of medical image captioning (in
other words, medical image diagnosis): VisionDiagnostor-ClinicalT5 and
VisionDiagnostor-BioBART, based on the encoder-decoder architecture, and VisionDiagnostor-Q-BioMistral, based on the BLIP-2
architecture with a Query Transformer, which leverages the power of Large Language Models (LLMs).</p>
      <p>Our results show that the VisionDiagnostor-BioBART model achieved third place on the leaderboard
with a BERTScore of 0.6267, the highest among our models, despite being the smallest in size with only 227M parameters.
Additionally, we analyzed the results to gain a deeper understanding of the factors that
influence the performance of the models, including image pre-processing, caption length, and object
features. These analyses provide insight that can help shape and improve future
methods and models for this task.</p>
      <p>In future work, we aim to explore other biomedical large
language models (LLMs), such as BioMedLM [50] and BioGPT [51], focusing on enhancing their ability
to generate precise, context-sensitive captions. We will pursue this through
methods such as instruction tuning and better alignment of the models with specific user requirements. In
addition, we plan to explore the integration of dense retrieval techniques into the biomedical image
captioning process [52]. By adopting frameworks akin to Retrieval Augmented Generation, we intend
to supplement the LLMs with an external, non-parametric memory using a FAISS index [53], thereby
enriching their reasoning capabilities. Another area of interest is the interconnection
between these approaches. We also plan to evaluate the qualitative differences between the captions
generated through these different methodologies to ascertain their efficacy and practicality in real-world
applications.</p>
    </sec>
    <sec id="sec-9">
      <title>Acknowledgment</title>
      <p>This research is funded by the University of Information Technology, Vietnam National University
Ho Chi Minh City, under grant number D4-2024-01.</p>
      <p>research 21 (2020) 1–67.
[34] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L. Zettlemoyer,
Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation,
and comprehension, arXiv preprint arXiv:1910.13461 (2019).
[35] J. Li, D. Li, S. Savarese, S. Hoi, Blip-2: Bootstrapping language-image pre-training with frozen
image encoders and large language models, in: International conference on machine learning,
PMLR, 2023, pp. 19730–19742.
[36] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, R. Girshick, Masked autoencoders are scalable vision
learners, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition,
2022, pp. 16000–16009.
[37] Y. Labrak, A. Bazoge, E. Morin, P.-A. Gourraud, M. Rouvier, R. Dufour, Biomistral: A
collection of open-source pretrained large language models for medical domains, arXiv preprint
arXiv:2402.10373 (2024).
[38] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, Bertscore: Evaluating text generation with
bert, arXiv preprint arXiv:1904.09675 (2019).
[39] C.-Y. Lin, Rouge: A package for automatic evaluation of summaries, in: Text summarization
branches out, 2004, pp. 74–81.
[40] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, Bleu: a method for automatic evaluation of machine
translation, in: Proceedings of the 40th annual meeting of the Association for Computational
Linguistics, 2002, pp. 311–318.
[41] T. Sellam, D. Das, A. Parikh, Bleurt: Learning robust metrics for text generation, in: Proceedings
of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 7881–7892.
[42] S. Banerjee, A. Lavie, Meteor: An automatic metric for mt evaluation with improved correlation
with human judgments, in: Proceedings of the acl workshop on intrinsic and extrinsic evaluation
measures for machine translation and/or summarization, 2005, pp. 65–72.
[43] R. Vedantam, C. Lawrence Zitnick, D. Parikh, Cider: Consensus-based image description evaluation,
in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp.
4566–4575.
[44] J. Hessel, A. Holtzman, M. Forbes, R. Le Bras, Y. Choi, Clipscore: A reference-free evaluation
metric for image captioning, in: Proceedings of the 2021 Conference on Empirical Methods in
Natural Language Processing, 2021, pp. 7514–7528.
[45] K. Huang, J. Altosaar, R. Ranganath, Clinicalbert: Modeling clinical notes and predicting hospital
readmission, arXiv preprint arXiv:1904.05342 (2019).
[46] A. B. Abacha, W.-w. Yim, G. Michalopoulos, T. Lin, An investigation of evaluation methods in
automatic medical note generation, in: Findings of the Association for Computational Linguistics:
ACL 2023, 2023, pp. 2575–2588.
[47] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980
(2014).
[48] A. Nicolson, J. Dowling, B. Koopman, A concise model for medical image captioning, in: CLEF2023
Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2023.
[49] W. Zhou, Z. Ye, Y. Yang, S. Wang, H. Huang, R. Wang, D. Yang, Transferring pre-trained large
language-image model for medical image captioning, in: CLEF2023 Working Notes, CEUR
Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece, 2023.
[50] E. Bolton, A. Venigalla, M. Yasunaga, D. Hall, B. Xiong, T. Lee, R. Daneshjou, J. Frankle, P. Liang,
M. Carbin, et al., Biomedlm: A 2.7 b parameter language model trained on biomedical text, arXiv
preprint arXiv:2403.18421 (2024).
[51] R. Luo, L. Sun, Y. Xia, T. Qin, S. Zhang, H. Poon, T.-Y. Liu, Biogpt: generative pre-trained transformer
for biomedical text generation and mining, Briefings in bioinformatics 23 (2022) bbac409.
[52] G. Moschovis, E. Fransén, Neuraldynamicslab at imageclefmedical 2022, in: CLEF (Working
Notes), 2022, pp. 1487–1504.
[53] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with gpus, IEEE Transactions on
Big Data 7 (2019) 535–547.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>C. E.</given-names>
            <surname>Lawson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Martí</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Radivojevic</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. V. R.</given-names>
            <surname>Jonnalagadda</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gentz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. J.</given-names>
            <surname>Hillson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Peisert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. A.</given-names>
            <surname>Simmons</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. J.</given-names>
            <surname>Petzold</surname>
          </string-name>
          , et al.,
          <article-title>Machine learning for metabolic engineering: A review</article-title>
          ,
          <source>Metabolic Engineering</source>
          <volume>63</volume>
          (
          <year>2021</year>
          )
          <fpage>34</fpage>
          -
          <lpage>60</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papamichail</surname>
          </string-name>
          ,
          <article-title>Diagnostic captioning: a survey</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>1691</fpage>
          -
          <lpage>1722</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2024: Multimedia retrieval in medical applications</article-title>
          , in:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 15th International Conference of the CLEF Association (CLEF 2024)</source>
          , Springer Lecture Notes in Computer Science LNCS, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2024 - Caption Prediction and Concept Detection</article-title>
          , in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2497</fpage>
          -
          <lpage>2506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <source>Medical image captioning based on deep architectures</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>A survey on biomedical image captioning</article-title>
          ,
          <source>in: Proceedings of the second workshop on shortcomings in vision and language</source>
          ,
          <year>2019</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>A.</given-names>
            <surname>Vaswani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Parmar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Jones</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. N.</given-names>
            <surname>Gomez</surname>
          </string-name>
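          ,
          <string-name>
            <given-names>Ł.</given-names>
            <surname>Kaiser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Polosukhin</surname>
          </string-name>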
          ,
          <article-title>Attention is all you need</article-title>
          ,
          <source>Advances in neural information processing systems</source>
          <volume>30</volume>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Anchor detr: Query design for transformer-based detector</article-title>
          ,
          <source>in: Proceedings of the AAAI conference on artificial intelligence</source>
          , volume
          <volume>36</volume>
          ,
          <year>2022</year>
          , pp.
          <fpage>2567</fpage>
          -
          <lpage>2575</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          , et al.,
          <article-title>A survey of large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2303.18223</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>P.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Choi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gao</surname>
          </string-name>
          , Vinvl:
          <article-title>Revisiting visual representations in vision-language models</article-title>
          ,
          <source>in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>5579</fpage>
          -
          <lpage>5588</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Kasban</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>El-Bendary</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Salama</surname>
          </string-name>
          ,
          <article-title>A comparative study of medical imaging techniques</article-title>
          ,
          <source>International Journal of Information Science and Intelligent System</source>
          <volume>4</volume>
          (
          <year>2015</year>
          )
          <fpage>37</fpage>
          -
          <lpage>58</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>J. A.</given-names>
            <surname>Seibert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. M.</given-names>
            <surname>Boone</surname>
          </string-name>
          ,
          <article-title>X-ray imaging physics for nuclear medicine technologists. part 2: X-ray interactions and image formation</article-title>
          ,
          <source>Journal of nuclear medicine technology</source>
          <volume>33</volume>
          (
          <year>2005</year>
          )
          <fpage>3</fpage>
          -
          <lpage>18</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>M. S.</given-names>
            <surname>Atkins</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B. T.</given-names>
            <surname>Mackiewich</surname>
          </string-name>
          ,
          <article-title>Fully automatic segmentation of the brain in mri</article-title>
          ,
          <source>IEEE transactions on medical imaging</source>
          <volume>17</volume>
          (
          <year>1998</year>
          )
          <fpage>98</fpage>
          -
          <lpage>107</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Radiology objects in context (roco): a multimodal image dataset</article-title>
          ,
          <source>in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis: 7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings 3</source>
          , Springer,
          <year>2018</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>A.</given-names>
            <surname>Bustos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Pertusa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-M.</given-names>
            <surname>Salinas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>De La Iglesia-Vaya</surname>
          </string-name>
          ,
          <article-title>Padchest: A large chest x-ray image dataset with multi-label annotated reports</article-title>
          ,
          <source>Medical image analysis</source>
          <volume>66</volume>
          (
          <year>2020</year>
          )
          <fpage>101797</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>A. E.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. J.</given-names>
            <surname>Pollard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N. R.</given-names>
            <surname>Greenbaum</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.-y.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Peng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. G.</given-names>
            <surname>Mark</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Berkowitz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Horng</surname>
          </string-name>
          ,
          <article-title>MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs</article-title>
          ,
          <source>arXiv preprint arXiv:1901.07042</source>
          (
          <year>2019</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Wijerathna</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Raveen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Abeygunawardhana</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. D.</given-names>
            <surname>Ambegoda</surname>
          </string-name>
          ,
          <article-title>Chest x-ray caption generation with chexnet</article-title>
          ,
          <source>in: 2022 Moratuwa Engineering Research Conference (MERCon)</source>
          , IEEE,
          <year>2022</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>6</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S.</given-names>
            <surname>Subramanian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. L.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bogin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>van Zuylen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Parasa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Singh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Gardner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Hajishirzi</surname>
          </string-name>
          ,
          <article-title>MedICaT: A dataset of medical images, captions, and textual references</article-title>
          ,
          <source>Findings of the Association for Computational Linguistics: EMNLP</source>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>A.</given-names>
            <surname>Graves</surname>
          </string-name>
          ,
          <article-title>Long short-term memory</article-title>
          ,
          <source>Supervised Sequence Labelling with Recurrent Neural Networks</source>
          (
          <year>2012</year>
          )
          <fpage>37</fpage>
          -
          <lpage>45</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>R.</given-names>
            <surname>Dey</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F. M.</given-names>
            <surname>Salem</surname>
          </string-name>
          ,
          <article-title>Gate-variants of gated recurrent unit (gru) neural networks</article-title>
          ,
          <source>in: 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS)</source>
          , IEEE,
          <year>2017</year>
          , pp.
          <fpage>1597</fpage>
          -
          <lpage>1600</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>Z.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>McGough</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <article-title>Mdnet: A semantically and visually interpretable medical image diagnosis network</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>6428</fpage>
          -
          <lpage>6436</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [23]
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Guo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <article-title>Imagesem group at imageclefmed caption 2021 task: Exploring the clinical significance of the textual descriptions derived from medical images</article-title>
          ,
          <source>in: Conference and Labs of the Evaluation Forum</source>
          ,
          <year>2021</year>
          . URL: https://api.semanticscholar.org/CorpusID:237298727.
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [24]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>AUEB NLP group at ImageCLEFmed Caption 2019</article-title>
          , in:
          <string-name>
            <given-names>L.</given-names>
            <surname>Cappellato</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Ferro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. E.</given-names>
            <surname>Losada</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          (Eds.),
          <source>Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum, Lugano, Switzerland, September 9-12, 2019</source>
          , volume
          <volume>2380</volume>
          of CEUR Workshop Proceedings, CEUR-WS.org,
          <year>2019</year>
          . URL: https://ceur-ws.org/Vol-2380/paper_136.pdf.
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [25]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E. P.</given-names>
            <surname>Xing</surname>
          </string-name>
          ,
          <article-title>Hybrid retrieval-generation reinforced agent for medical image report generation</article-title>
          , in:
          <string-name>
            <given-names>S.</given-names>
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Wallach</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Larochelle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Grauman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Cesa-Bianchi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.),
          <source>Advances in Neural Information Processing Systems</source>
          , volume
          <volume>31</volume>
          , Curran Associates, Inc.,
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [26]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Irvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Shpanskaya</surname>
          </string-name>
          , et al.,
          <article-title>Chexnet: Radiologist-level pneumonia detection on chest x-rays with deep learning</article-title>
          ,
          <source>arXiv preprint arXiv:1711.05225</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [27]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Azizi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          , et al.,
          <article-title>Large language models encode clinical knowledge</article-title>
          ,
          <source>Nature</source>
          <volume>620</volume>
          (
          <year>2023</year>
          )
          <fpage>172</fpage>
          -
          <lpage>180</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          [28]
          <string-name>
            <given-names>K.</given-names>
            <surname>Singhal</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Gottweis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Sayres</surname>
          </string-name>
          , et al.,
          <article-title>Towards expert-level medical question answering with large language models</article-title>
          ,
          <source>arXiv preprint arXiv:2305.09617</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          [29]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaliosis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalambakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>AUEB NLP group at ImageCLEFmedical Caption 2023</article-title>
          ,
          <source>in: CLEF2023 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          [30]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. B.</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G. S.</given-names>
            <surname>de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. A.</given-names>
            <surname>Horn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>ROCOv2: Radiology Objects in COntext version 2, an updated multimodal image dataset</article-title>
          ,
          <year>2024</year>
          . URL: https://arxiv.org/abs/2405.10004v1. arXiv:2405.10004
          .
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          [31]
          <string-name>
            <given-names>E.</given-names>
            <surname>Lehman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Johnson</surname>
          </string-name>
          ,
          <article-title>Clinical-T5: Large language models built using MIMIC clinical text</article-title>
          ,
          <source>PhysioNet</source>
          (
          <year>2023</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          [32]
          <string-name>
            <given-names>H.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Yuan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Gan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Yu</surname>
          </string-name>
          ,
          <article-title>Biobart: Pretraining and evaluation of a biomedical generative language model</article-title>
          ,
          <source>arXiv preprint arXiv:2204.03905</source>
          (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          [33]
          <string-name>
            <given-names>C.</given-names>
            <surname>Raffel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Shazeer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Narang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Matena</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. J.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <article-title>Exploring the limits of transfer learning with a unified text-to-text transformer</article-title>
          ,
          <source>Journal of machine learning</source>
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>