<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CEUR Workshop Proceedings</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>AUEB NLP Group at ImageCLEFmedical Caption 2024</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Marina Samprovalaki</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anna Chatzipapadopoulou</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Moschovis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Foivos Charalampakos</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Panagiotis Kaliosis</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Pavlopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ion Androutsopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Archimedes Unit, Athena Research Center</institution>
          ,
          <addr-line>1, Artemidos Street, GR-151 25 Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Informatics, Athens University of Economics and Business</institution>
          ,
          <addr-line>76, Patission Street, GR-104 34 Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2024</year>
      </pub-date>
      <abstract>
        <p>This article describes the approaches that the AUEB NLP Group experimented with during its participation in the 8th edition of the ImageCLEFmedical Caption evaluation campaign, including both Concept Detection and Caption Prediction tasks. The objective of Concept Detection is to automatically categorize biomedical images into a set of one or more concepts. In contrast, the Caption Prediction task focuses on generating a precise and meaningful diagnostic caption that describes the medical conditions depicted in the image. Building on our prior research for the Concept Detection task, we utilized a diverse set of Convolutional Neural Network (CNN) encoders, followed by a Feed-Forward Neural Network. Additionally, we implemented two versions of the retrieval-based k-NN algorithm: a version that assigned concepts based on statistical frequency and a weighted version that took into account the order of the retrieved neighbors. Both models used the CNN image encoders to improve their retrieval capabilities. Regarding the Caption Prediction task, we fine-tuned the InstructBLIP model to generate initial captions and then enhanced it by employing rephrasing techniques with further pre-trained models. We also used synthesizing techniques that incorporated information from similar neighboring images in the training set to refine these captions. Additionally, we employed “Distance from Median Maximum Concept Similarity” (DMMCS), a novel guided-decoding approach that drives the model's behavior throughout the decoding process, aiming to integrate information from the predicted concepts of Concept Detection. We explored the application of DMMCS to all of our developed systems. Our group ranked 2nd in Concept Detection and 4th in Caption Prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Biomedical Images</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Multi-Label Classification</kwd>
        <kwd>Caption Generation</kwd>
        <kwd>Generative Models</kwd>
        <kwd>Transformers</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        ImageCLEF [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is an ongoing evaluation initiative, first run in 2003 as part of the Cross Language
Evaluation Forum (CLEF)1, that promotes the evaluation of technologies for annotation, indexing,
classification, and retrieval of multi-modal data. ImageCLEFmedical is one of the four main tasks in
this year’s ImageCLEF campaign. We participated in the ImageCLEFmedical Caption task, which was
organized for the eighth time [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. As in previous years, the task comprised two sub-tasks: Concept
Detection and Caption Prediction.
      </p>
      <p>
        The objective of Concept Detection is to accurately associate a biomedical image with one or more
relevant medical concepts (tags), while in Caption Prediction, the goal is to automatically generate a
preliminary diagnostic report that accurately describes the medical findings, as well as the anatomy
of the body structures and organs shown in the image. Diagnostic Captioning remains a challenging
research problem aimed at assisting the diagnostic process for patients by providing a preliminary
report, rather than replacing medical professionals involved in the procedure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It can thus be seen as
an assistive tool, capable of producing an initial draft diagnosis regarding the patient’s condition. Such a
document would ideally allow doctors to focus on critical areas of the image [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] and help them produce
more precise medical diagnoses at an increased speed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. Experienced clinicians could enhance their
throughput by analyzing the large volume of daily medical examinations more quickly and efficiently.
Less experienced clinicians could consider the automatically generated captions to reduce the likelihood
of clinical errors [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Concept Detection can further improve Diagnostic Captioning by identifying
key concepts that should be included in the draft report. We demonstrate the connection between the
two sub-tasks by using “Distance from Median Maximum Concept Similarity” (DMMCS)2 [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ], which
employs information derived from our Concept Detection systems in order to improve the performance
of our Caption Prediction systems.
      </p>
      <sec id="sec-1-1">
        <title>1.1. AUEB NLP Group contributions</title>
        <p>
          In this work, we present the experiments conducted and the systems submitted as part of the AUEB
NLP Group’s participation in this year’s Concept Detection and Caption Prediction tasks. We used a
number of new approaches influenced by the remarkable progress in the field of NLP and based on
instruction-tuned Large Language Models (LLMs) [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ].
        </p>
        <p>Our submissions to the Concept Detection sub-task are based on two distinct approaches. We used
a Convolutional Neural Network (CNN) encoder to extract visual features from the medical images.
In the first approach, these features were fed into a Feed-Forward Neural Network (FFNN) to classify
the images into various medical concepts. In the second approach, we implemented a separate method
using a k-nearest neighbors (k-NN) algorithm. In this approach, k neighbors are first retrieved, and the
most frequently occurring concepts among these neighbors are selected.</p>
        <p>
          Regarding the Caption Prediction sub-task, we tried five main approaches. First, we employed an
InstructBLIP model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] that was fine-tuned on the specified dataset [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] to generate an initial set of
captions, which were then also used in the other four approaches. In the second approach, we enhanced
the initial captions by drawing insights from captions of similar images and training a FLAN-T5 model
[
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] to refine them [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ]. The third approach was similar, but instead of FLAN-T5, we employed
ClinicalT5 [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], which is pre-trained on numerous medical datasets, in order to rephrase and correct
the initial captions produced by InstructBLIP. The fourth approach involved integrating the DMMCS
algorithm [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] in the language model’s decoding process in order to promote the inclusion of a given set
of keywords, which in this case were predicted by one of our Concept Detection systems. Lastly, we
also applied DMMCS decoding to ClinicalT5 in order to maximize its efficacy and improve the overall
caption quality. In all our models we used CNN encoders, since there are signs that vision transformers
[
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] still have inferior performance in visual tasks, such as classification and semantic segmentation
[
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], especially in medical image tagging [
          <xref ref-type="bibr" rid="ref17 ref5">5, 17</xref>
          ].
        </p>
        <p>
          Extending our history of successful entries [
          <xref ref-type="bibr" rid="ref18">18, 19, 20, 21, 22</xref>
          ] in the ImageCLEFmedical campaign,
our submissions ranked 2nd among 9 participating groups in the Concept Detection sub-task and 4th
among 11 participating groups in the Caption Prediction sub-task. In Section 2, we provide insight into
this year’s dataset, followed by a discussion of our approaches in Section 3. In Section 4, we present our
experimental results for each sub-task. Finally, in Section 5, we summarize our findings and suggest
directions for future research.
        </p>
        <p>All code used for our experiments is available on GitHub.3</p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        In this year’s edition of the ImageCLEFmedical Caption task, the dataset is an updated and extended
version of the Radiology Objects in Context (ROCO) dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which originates from biomedical
articles of the PubMed Open Access (PMC OA) subset.4
      </p>
      <p>
        This dataset, which is common for both sub-tasks, consists of 80,080 biomedical images along with
their respective medical concepts, in the form of UMLS [23] terms5, and diagnostic captions. The
dataset was originally split by the organizers into training and validation subsets, with 70,108 radiology
images in the former and 9,972 in the latter. After merging the provided data, we split them again,
this time into three subsets, in order to also obtain a development (private test) subset for evaluation
purposes. We used a 75%-10%-15% training-validation-development split, keeping relatively equal
concept distributions in all three subsets. Consequently, we obtained 64,928 images as our training
data, 7,179 images as our validation set, while the remaining 7,973 images constituted our held-out
development set. All of our submissions were also evaluated on the hidden official test set, which uses
Radiology Objects in COntext Version 2 (ROCOv2) [24], an updated and extended version of the ROCO dataset [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. This set includes 17,237 previously unseen images.
      </p>
      <sec id="sec-2-1">
        <title>2.1. Concept Detection</title>
        <p>Concept Detection is a multi-label classification problem covering a broad range of 1,945 distinct
biomedical concepts, originating from the Unified Medical Language System (UMLS) [ 23]. In this
sub-task, the goal is to identify (assign) the distinct medical concepts (tags) depicted in each image
(e.g., particular medical conditions). Among the available concepts (tag set), four are specific imaging
modalities: X-Ray Computed Tomography, Ultrasonography, Magnetic Resonance Imaging (MRI),
PET/CT scans. All concepts are represented by Concept Unique Identifiers (CUIs) following the UMLS
standard. Some examples of images and their ground truth concepts can be found in Figure 1.</p>
        <p>[Figure 1: example images with their ground-truth concepts, e.g., CUI C0041618 (Ultrasonography), C0018827 (Heart Ventricle), C1510420 (Cavitation); image CC BY, Magdás et al. (2021).]</p>
        <p>The distribution of concepts is highly skewed. Some concepts are present in more than 25,000 images, whereas others are associated with only 1 image. Figure 2(a) depicts the long-tail distribution of the entire (training + validation + development) dataset, where the frequencies of the concepts (the number of images each concept is associated with) are plotted in descending order against their respective class indices. After conducting a comprehensive exploratory analysis of this year’s dataset, we found that certain concepts were more prevalent (Table 1); these mostly correspond to kinds of medical examinations, such as X-Ray Computed Tomography or plain X-ray. Most images are associated (in the ground truth) with at least one of these overarching concepts, alongside more specialized ones. The maximum and minimum numbers of concepts assigned to a single image are 27 and 1, occurring in 1 and 8,567 images respectively. The average number of assigned concepts per image is 3.1583. The aforementioned observations are outlined in the histogram in Figure 2(b).</p>
        <p>4PMC Open Access: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, Last accessed: 2024-06-20</p>
        <p>5UMLS: https://www.nlm.nih.gov/research/umls/index.html, Last accessed: 2024-06-20</p>
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Caption Prediction</title>
        <p>In the Caption Prediction data, each image is accompanied by a gold diagnostic caption that describes the
medical conditions present in the image. There are 80,080 gold captions across the whole dataset, one
for each provided image. Similar to last year’s campaign, the vast majority of the captions, specifically
99.47% (79,658 out of 80,080 captions), are unique. The maximum number of words in a single caption
is 848 (occurred once), while the minimum is 1 (encountered 73 times). The average caption length
is 21.01 words. These statistics apply to the dataset as a whole, but we have carefully checked that
they remain consistent in all three subsets (training, validation, development) we formed. The five
most common captions, as well as the ten most popular words, excluding the stopwords, can be found
in Tables 2 and 3, respectively. In Figure 3, we provide a histogram alongside a box plot, utilizing a
logarithmic scale in our visualizations. This helps make smaller counts more visible and reduces the
dominance of larger values, giving a more balanced view of how the data is distributed.</p>
        <p>[Figure 3: (a) number of images vs. number of words in captions (log scale); (b) box plot of caption lengths (log scale).]</p>
        <p>According to the organizers, each caption is pre-processed before being evaluated in the following manner:
• The caption is converted to lower-case.
• Numbers are replaced by words, e.g., the number 10 becomes “ten”.
• Punctuation is removed.</p>
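        <p>The organizers' pre-processing steps above can be sketched as follows. This is our illustration only: the exact number-to-word conversion used in the official evaluation is not specified here, so a minimal lookup for the numbers 0-10 stands in for it.</p>

```python
# Sketch (ours) of the organizers' caption pre-processing: lower-casing,
# replacing numbers by words, and removing punctuation. The tiny 0-10
# lookup is a stand-in for the organizers' actual number-to-word rules.
import re
import string

SMALL = {"0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
         "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
         "10": "ten"}

def preprocess(caption):
    text = caption.lower()
    # replace standalone numbers by words (only 0-10 handled in this sketch)
    text = re.sub(r"\b\d+\b", lambda m: SMALL.get(m.group(), m.group()), text)
    # remove punctuation
    return text.translate(str.maketrans("", "", string.punctuation))
```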
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we present the methods we used in our submissions for both the Concept Detection and
the Caption Prediction sub-tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Concept Detection</title>
        <p>
          Our submissions for this year’s Concept Detection sub-task are built upon two frameworks. Initially, we extensively explored a CNN+FFNN framework, building upon our prior research [
          <xref ref-type="bibr" rid="ref18">18, 19, 20, 21</xref>
          ], experimenting with various image encoders. Additionally, we used a neural image retrieval approach by integrating a k-nearest neighbors (k-NN) algorithm, which selects k neighbors and aggregates tags based on their frequency among the neighbors. Furthermore, we submitted several ensembles of the aforementioned systems. The ensembles employed strategies such as union-based and intersection-based aggregation.
        </p>
        <p>3.1.1. CNN + FFNN</p>
        <p>
          This system employs a CNN encoder as its backbone, followed by an FFNN classification head. We extract image features from the last convolutional layer of the image encoder and we condense these feature maps into a feature vector (an image embedding) using global pooling. More specifically, we used the Generalized-Mean (GeM) pooling [25] mechanism.
        </p>
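        <p>As a quick illustration of how GeM condenses each channel of a feature map into a single number (our sketch, not the system's actual implementation): per channel, GeM computes (mean of the activations raised to p)^(1/p), which reduces to average pooling for p = 1 and approaches max pooling as p grows.</p>

```python
# Generalized-Mean (GeM) pooling sketch (ours): per channel,
# gem(x, p) = (mean(x_i^p))^(1/p). p = 1 gives average pooling;
# large p approaches max pooling over the channel's activations.

def gem_pool(channel_values, p=3.0):
    """Pool a list of non-negative activations into one value."""
    n = len(channel_values)
    return (sum(v ** p for v in channel_values) / n) ** (1.0 / p)

def gem_embed(feature_maps, p=3.0):
    """feature_maps: list of channels, each a list of activations.
    Returns one pooled value per channel (the image embedding)."""
    return [gem_pool(ch, p) for ch in feature_maps]
```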
        <p>
          The FFNN component classifies the image into one or more concepts. Its output layer has |C| neurons, where C represents the set of unique concepts in the dataset. Each neuron uses a sigmoid activation function to transform its value into a probability in [0, 1]. This results in one probability per label, and if this probability exceeds a specific threshold value t, the corresponding concept is assigned to the image. The threshold, which is the same for all concepts, was chosen through a grid search procedure that optimized the primary metric of the competition on our validation set. The model was trained by minimizing binary cross-entropy, treating each concept as a separate binary target and summing the individual losses. We used the Adam optimizer [26], along with a decreasing learning rate strategy and early stopping based on the validation set loss with a patience of 3 epochs. We used an initial learning rate of 10^−3 and a decreasing factor of 10.
        </p>
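        <p>The thresholding and grid-search step can be sketched as below. This is our simplified illustration: per-image F1 averaged over the validation set stands in for the competition's primary metric, and the probabilities are given rather than produced by the FFNN.</p>

```python
# Sketch (ours): assign a concept when its sigmoid probability exceeds a
# shared threshold t, and pick t by grid search on a validation set,
# maximizing mean per-image F1 (a stand-in for the competition metric).

def predict(probs, t):
    """probs: {concept: probability}; return the set of concepts above t."""
    return {c for c, p in probs.items() if p > t}

def f1(pred, gold):
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    prec, rec = tp / len(pred), tp / len(gold)
    return 2 * prec * rec / (prec + rec)

def grid_search_threshold(val_probs, val_gold, grid):
    """val_probs: list of {concept: prob}; val_gold: list of gold sets."""
    def mean_f1(t):
        scores = [f1(predict(p, t), g) for p, g in zip(val_probs, val_gold)]
        return sum(scores) / len(scores)
    return max(grid, key=mean_f1)
```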
        <p>In order to form the ensembles, we trained several instances of the model, using different random initializations, and combined them using the union and the intersection of their predicted concept sets. More details about our submitted ensemble systems can be found in Section 4.1.</p>
        <p>3.1.2. CNN + k-NN</p>
        <p>For our k-nearest neighbors (k-NN) approach, we leveraged the image embeddings obtained from the encoder of the trained CNN+FFNN system (Section 3.1.1). We discarded the dense classification head and used the last GeM pooling layer to extract embeddings (feature vectors) for all the training images. These embeddings served as the basis for the retrieval process in the k-NN algorithm. Given a test image, the goal of the system is to retrieve similar images from the training set and select concepts from the retrieved neighbors. For each test image, we used the same encoder to obtain its embedding and we retrieved the k closest neighbors from the training set, based on cosine similarity computed on the image embeddings. We tuned the value of k in the range from 1 to 100 using our validation set, which led to k = 33.</p>
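        <p>The retrieval step can be sketched as follows (our illustration, with precomputed embeddings standing in for the CNN encoder's output): rank all training images by cosine similarity to the test embedding and keep the top k.</p>

```python
# Sketch (ours) of the k-NN retrieval step: given image embeddings, return
# the ids of the k training images closest to a test embedding under
# cosine similarity.
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def k_nearest(test_emb, train_embs, k):
    """train_embs: {image_id: embedding}; returns the k closest ids."""
    ranked = sorted(train_embs,
                    key=lambda i: cosine(test_emb, train_embs[i]),
                    reverse=True)
    return ranked[:k]
```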
        <p>For each test image, having obtained its k neighbors from the training set, we formed the set of concepts associated with the neighbors. We then ranked the concepts of the set based on the number of retrieved neighbors associated with each concept, ordering them from highest to lowest frequency. The concept with the highest frequency was always included in the predictions of the k-NN method for the test image. We then used two thresholds, t1 and t2, which we tuned using grid search on our validation set, to select which other concepts of the neighborhood to include in the predictions of k-NN. We calculated the difference in frequency (ΔFr) between the first and second most frequent concepts, divided by the frequency of the first concept, and if the result exceeded t1, we included the second concept in the prediction:</p>
        <p>(Fr(concept1) − Fr(concept2)) / Fr(concept1) ≥ t1 . (1)</p>
        <p>Similarly, we determined whether or not to include the third most frequent concept in the prediction, based on a comparison involving the first and third most frequent concepts. We calculated the difference between the frequencies of the first and third concepts, divided it by the frequency of the first concept, and compared this ratio against t2:</p>
        <p>(Fr(concept1) − Fr(concept3)) / Fr(concept1) ≥ t2 . (2)</p>
        <p>The same approach was applied to the difference between the first and fourth most frequent concepts, checking again against t2, to decide if the fourth most frequent concept should be predicted:</p>
        <p>(Fr(concept1) − Fr(concept4)) / Fr(concept1) ≥ t2 . (3)</p>
        <p>We opted to predict at most four concepts due to the fact that the average number of concepts in the training split was 3.08. The rationale was to select concepts that have frequencies close to that of the highest-frequency concept, while excluding concepts that show a significant drop in frequency compared to the preceding ones. We experimented with t1, t2 values ranging from 0.3 to 0.9. Validation results indicated that the best parameters were t1 = 0.58 and t2 = 0.65.</p>
        <p>3.1.3. CNN + weighted k-NN</p>
        <p>We also developed a weighted version of the k-NN algorithm, using the voting scheme that was described in [27]. More specifically, given a test image x, we calculate for each concept c ∈ C a score s(c; w1, . . . , wk) from the k neighbors retrieved for x:</p>
        <p>s(c; w1, . . . , wk) = (∑_{i=1}^{k} wi · m(c, x, i)) / (∑_{i=1}^{k} wi) , (4)</p>
        <p>where m(c, x, i) = 1 if concept c is present in the ground truth of the i-th neighbor of x, otherwise m(c, x, i) = 0, and wi is the weight assigned to the i-th nearest neighbor position; we explain below how the weights wi are learned. Concept c is predicted for the test image x if and only if s(c; w1, . . . , wk) ≥ τ, yielding the predicted label set L(x; w1, . . . , wk) = {c | s(c; w1, . . . , wk) ≥ τ}. The classification threshold τ ∈ [0, 1] and the number of neighbors k ∈ [1, 100] were tuned on our validation set, resulting in τ = 0.35 and k = 50. The weights w1, . . . , wk are the same for all the concepts c and test images x. They are learned using a genetic algorithm (GA) [28] by maximizing the following objective, where V denotes the validation set, G(x) is the ground truth set of concepts of image x, and F1 is the official evaluation measure of the Concept Detection task:</p>
        <p>max_{w1, . . . , wk} ∑_{x ∈ V} F1(G(x), L(x; w1, . . . , wk)) s.t. 1 ≥ w1 ≥ . . . ≥ wk ≥ 0 . (5)</p>
        <p>In detail, we created a population of 500 randomly initialized weight vectors, initial chromosomes in GA terminology. Each chromosome had the form ⟨w1, . . . , wk⟩, with all weights wi ∈ [0, 1]; we ensured that the monotonicity constraint 1 ≥ w1 ≥ . . . ≥ wk ≥ 0 was satisfied by all chromosomes. We then used a crossover mechanism where two chromosomes were combined to form two new ones. At each application of the crossover mechanism, we selected pairs of chromosomes (parents) out of the population and combined their values to form two new chromosomes from each pair of parents. The crossover operator splits the two parent chromosomes at a random point and creates two children chromosomes by combining the values before the crossover point (or after) from one parent, and after (or before) the crossover point from the other parent. Furthermore, we used a mutation mechanism that perturbed the values of the resulting children chromosomes by adding a random value in [−0.1, 0.1] to every gene, with a 0.1 mutation probability per gene (weight wi). Both the crossover and the mutation operators respected the range and monotonicity constraints; we added a clipping and a sorting operation that were applied if any of the constraints were violated in the resulting chromosomes. We used F1(G(x), L(x)) as the fitness function. The fitness function is used to select the chromosomes to be used as parents in the crossover mechanism at each iteration of the algorithm (fitter chromosomes are selected with higher probability as parents). At each generation (new population), we performed the crossover mechanism as many times as necessary to obtain a new generation with as many members as the previous one (and as many as the initial population, i.e., 500 chromosomes). We ran the optimization process for 30 iterations (generations).</p>
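        <p>The weighted vote of the equation above can be sketched as follows (our illustration). The GA that learns the monotone weights is omitted; the weights and threshold below are example values, not the tuned ones.</p>

```python
# Sketch (ours) of the weighted k-NN vote of Section 3.1.3: concept c gets
# score s(c) = sum_i w_i * m_i / sum_i w_i, where m_i = 1 iff the i-th
# nearest neighbor carries c; concepts with s(c) >= tau are predicted.
# The genetic algorithm that learns w_1 >= ... >= w_k is omitted here.

def weighted_knn_predict(neighbor_concepts, weights, tau):
    """neighbor_concepts: list of concept sets, nearest neighbor first;
    weights: non-increasing floats of the same length."""
    total = sum(weights)
    all_concepts = set().union(*neighbor_concepts)
    scores = {
        c: sum(w for w, s in zip(weights, neighbor_concepts) if c in s) / total
        for c in all_concepts
    }
    return {c for c, s in scores.items() if s >= tau}
```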
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Caption Prediction</title>
        <p>
          Our submissions for the Caption Prediction sub-task focused on four primary systems. The first system
employs an InstructBLIP model [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] (Section 3.2.1), while the remaining submissions build on this
model using techniques such as rephrasing [
          <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
          ] (Section 3.2.3) and synthesizing [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (Section 3.2.2).
Finally, we implemented an innovative guided-decoding mechanism, DMMCS [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (Section 3.2.4), which
leverages information from the tags predicted by our CNN + k-NN classifier (Section 3.1.2) in the Concept
Detection task to improve the generated caption.
        </p>
        <sec id="sec-3-2-1">
          <title>3.2.1. InstructBLIP</title>
          <p>
            The InstructBLIP model [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] is a sophisticated neural network designed to generate descriptive text
for scientific images. It employs a technique known as instruction-tuning [29], which refines its
behavior and responses based on user-provided instructions. This approach aims to enhance the
model’s controllability and its adaptability across different domains. The InstructBLIP model comprises
three key components: an image encoder, a Q-Former [30], and an LLM. The frozen image encoder
converts the image into a low-dimensional vector and generates image embeddings. The Q-Former
then extracts instruction-aware visual features from these embeddings and can process the text prompt
(instruction) to enhance this extraction. Through extensive training, the LLM learns to correlate
textual prompts with relevant image features, thereby generating coherent and contextually appropriate
descriptions. The InstructBLIP model played a crucial role in creating the initial captions, which were
subsequently utilized in our other caption prediction methods.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Synthesizer</title>
          <p>
            Our goal was to improve the captions obtained from the InstructBLIP model (Section 3.2.1) by leveraging information from similar training images, based on the intuition that similar images may have similar captions [31, 32]. To achieve this, we computed embeddings for all images in the dataset using the CNN + FFNN model, which was developed for Concept Detection (Section 3.1.1). A cosine similarity threshold was then applied to decide if an image qualified as a neighbor of the test image. Images exceeding this threshold were considered neighbors [33]. For each image in the test set [24], we identified the k most similar images from the entire dataset [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], which includes training, validation, and development images, to retrieve their corresponding captions. We experimented with k ∈ {1, 3, 5}; the best results on our validation set were obtained for k = 5, so we used that value. The Synthesizer, a FLAN-T5 model
[
            <xref ref-type="bibr" rid="ref11">11</xref>
            ], was trained to refine the captions generated by InstructBLIP by considering also the captions of
the neighbors, which are concatenated to the caption of InstructBLIP, similarly in spirit to [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ]. We also experimented with different beam sizes b for the beam search decoding of the Synthesizer during inference; setting b = 5 yielded the best validation scores, so we used that value. Figure 4 illustrates the process (for k = 3), starting with the caption generated by InstructBLIP, merging it with the captions of the neighbors, and using FLAN-T5 to obtain a refined caption.
          </p>
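        <p>Assembling the Synthesizer's input can be sketched as below (our illustration): the InstructBLIP draft is concatenated with the captions of the k retrieved neighbors. The separator string is an assumption, not the exact format used.</p>

```python
# Sketch (ours): build the text fed to the FLAN-T5 Synthesizer by joining
# the InstructBLIP draft caption with the captions of the k nearest
# neighbors. The separator is an assumption, not the exact training format.

def build_synthesizer_input(draft_caption, neighbor_captions, k=5,
                            sep=" </s> "):
    parts = [draft_caption] + neighbor_captions[:k]
    return sep.join(parts)
```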
        </sec>
        <sec id="sec-3-2-3">
          <title>3.2.3. Rephraser</title>
          <p>
            Furthermore, we experimented with a domain-specific variation of T5, namely ClinicalT5. This is an encoder-decoder transformer, which is pre-trained on a series of both supervised and unsupervised tasks [34], including denoising tasks, and then further pre-trained on the union of MIMIC-III and MIMIC-IV clinical notes, to which we were granted access through PhysioNet6. Following our previous work [35], we created a corrective text-to-text training set, consisting of pairs of noisy and ground-truth captions, with the former having been generated by our captioning systems. We thus treated our original system as a noise-insertion function and then further fine-tuned ClinicalT5 to rephrase the noisy captions so as to approximate the gold ones, hoping it would acquire knowledge of the medical domain, use medical terms more accurately, and therefore generate more medically fluent captions.
Specifically, we fine-tuned ClinicalT5 to rephrase the captions of InstructBLIP (Section 3.2.1), of InstructBLIP with the FLAN-T5 Synthesizer (Section 3.2.2) on top, and of InstructBLIP with DMMCS (Section 3.2.4), with its weighting hyperparameter set to 0.10. Performance in terms of the primary metric on our development set improved, but test-time performance (in the official evaluation) deteriorated.</p>
          <p>3.2.4. DMMCS</p>
          <p>In this section, we present “Distance from Median Maximum Concept Similarity” (DMMCS) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ], a novel data-driven guided decoding mechanism designed to incorporate domain-specific information (in the form of keywords) into the text generation process. The intuition behind this guided decoding algorithm lies in the observation that an accurate diagnostic caption should mention the key medical conditions depicted in the given image. For example, if a radiology image is assigned the tag “Pneumonia”, but the generated caption does not refer to this medical condition either explicitly or implicitly, then the caption is potentially inaccurate. Such conditions are typically represented by the medical tags provided in the ImageCLEF2024 dataset, which the Concept Detection task is also trying to predict. Therefore, we use tags predicted by one of our Concept Detection systems (Section 3.1) in order to guide our Caption Prediction models towards captions that express the tags appropriately. We achieve this by imposing a new penalty at each decoding step, aiming to prioritize the generation of words semantically similar to the (predicted) medical tags. This penalty also considers the frequency with which each tag is explicitly or implicitly expressed in the dataset’s gold captions. (6: https://www.physionet.org/content/clinical-t5/1.0.0/, Last accessed: 2024-06-20)
          </p>
          <p>
            In more detail, recent work examining DC datasets [
            <xref ref-type="bibr" rid="ref7">22, 7</xref>
            ] has shown that some tags are more
prominently expressed than others in the corresponding diagnostic captions. More specifically, Kaliosis
et al. [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] performed an exploratory analysis on the ImageCLEF2023 and MIMIC-CXR datasets, where
they investigated the relationship between each tag and the gold captions of the images that are
associated with the tag in the ground truth. This was achieved by calculating the cosine similarity
between the word embeddings of each caption’s tokens and each tag. The results showed that some tags
are always explicitly expressed in the gold captions of the images the tags are associated with, while
other tags are mentioned more implicitly or even not at all. More concretely, the similarity between a
tag t and a caption c is defined as the maximum cosine similarity (MCS) between the centroid h(t) of
the word embeddings of t and the embedding h(w_i) of each token w_i in c, i.e.,
          </p>
          <p>MCS(t, c) = max_{1 ≤ i ≤ |c|} sim(h(t), h(w_i)). (7)</p>
          <p>A high MCS score between a tag t and a caption c implies that t is strongly expressed in the caption,
while a low MCS score indicates that it was rather implicitly (or not at all) mentioned. The MCS similarity
is also calculated for all the gold captions of the images a tag t is associated with in the training data.
Specifically, for each tag t and the set C_t containing its associated captions, the distribution D(t, C_t) is
calculated as:</p>
          <p>D(t, C_t) = {MCS(t, c) | c ∈ C_t}. (8)</p>
          <p>MMCS(t, C_t) = median(D(t, C_t)). (9)
The median value of the distribution D(t, C_t), hereafter called Median Maximum Cosine Similarity
(MMCS), indicates how strongly t is expressed on average in the training captions it is associated with.</p>
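          <p>The MCS and MMCS definitions above can be sketched with toy 2-d embeddings (pure Python; a real system would use pretrained medical word embeddings, so the vectors and values here are purely illustrative):</p>
```python
# Sketch of Eqs. (7)-(9): MCS takes the maximum cosine similarity between the
# tag embedding and each caption token embedding; MMCS is the median MCS over
# the tag's gold training captions.
from math import sqrt
from statistics import median

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def mcs(tag_vec, caption_vecs):
    """Eq. (7): max cosine similarity between the tag and any caption token."""
    return max(cos(tag_vec, w) for w in caption_vecs)

def mmcs(tag_vec, training_captions):
    """Eq. (9): median of the MCS distribution (Eq. 8) over gold captions."""
    return median(mcs(tag_vec, c) for c in training_captions)

# Toy embeddings: a tag and two gold captions' token vectors.
tag = [1.0, 0.0]
captions = [
    [[0.9, 0.1], [0.0, 1.0]],   # caption expressing the tag almost explicitly
    [[0.2, 0.8], [0.5, 0.5]],   # caption expressing it only implicitly
]
print(round(mmcs(tag, captions), 3))   # prints 0.85
```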
          <p>During inference, when generating the caption for an image with a single tag t, the MCS(t, c) of the
tag t and each candidate (possibly still incomplete) caption c of the beam search is calculated (Eq. 7).
The penalty, imposed at each decoding step, is then defined as the squared difference between MCS(t, c)
and MMCS(t, C_t). The former shows how strongly the tag is mentioned in the candidate caption, while
the latter indicates how strongly the tag is expressed on average in the gold training captions associated
with the tag. When more than one tag is assigned to an image, a distinct penalty is calculated for
each tag, and the overall penalty is the average of the individual penalties. Thus, given a candidate
caption c, the sets C of training captions associated with its tags, and a set of tags T, the penalty is calculated as:
DMMCSpen(c, C, T) = (1 / |T|) Σ_{t ∈ T} (MCS(t, c) − MMCS(t, C_t))². (10)</p>
          <p>Intuitively, the objective of the DMMCS algorithm is to guide the model to generate captions that
express each associated tag as explicitly (or implicitly) as it is expressed in the training corpus. Overall,
at each decoding step, each candidate caption c generated through the beam search process is scored by
the following formula:</p>
          <p>DMMCS(c) = α · DMMCSpen(c, C, T) + (1 − α) · (1 − Dscore),
where T is a given set of predicted tags, α is a tunable weighting factor, while Dscore is the score that
the decoder assigns to the candidate caption c.</p>
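          <p>Putting the penalty of Eq. (10) and the final scoring rule together, the rescoring step can be sketched as follows. All names, the toy MCS/MMCS values, and the decoder scores are illustrative; under our reading of the formula, candidates with the lowest combined score are preferred:</p>
```python
# Hedged sketch of DMMCS rescoring during beam search.

def dmmcs_penalty(mcs_per_tag, mmcs_per_tag):
    """Eq. (10): mean squared gap between each tag's MCS in the candidate
    caption and that tag's MMCS over its gold training captions."""
    gaps = [(mcs_per_tag[t] - mmcs_per_tag[t]) ** 2 for t in mcs_per_tag]
    return sum(gaps) / len(gaps)

def dmmcs_score(mcs_per_tag, mmcs_per_tag, decoder_score, alpha=0.1):
    """Combined score: alpha * penalty + (1 - alpha) * (1 - Dscore)."""
    penalty = dmmcs_penalty(mcs_per_tag, mmcs_per_tag)
    return alpha * penalty + (1 - alpha) * (1 - decoder_score)

# Two candidate captions for an image tagged {"pneumonia", "edema"}:
mmcs_ref = {"pneumonia": 0.85, "edema": 0.60}   # from the training data (Eq. 9)
scores = {
    "A": dmmcs_score({"pneumonia": 0.84, "edema": 0.62}, mmcs_ref, decoder_score=0.75),
    "B": dmmcs_score({"pneumonia": 0.30, "edema": 0.10}, mmcs_ref, decoder_score=0.76),
}
best = min(scores, key=scores.get)   # "A": it expresses the tags as they appear in training
```
Candidate A wins despite a marginally lower decoder score, because its tags are expressed about as strongly as in the gold training captions.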
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments, Submissions and Results</title>
      <p>
        In this section, we provide details about our experiments regarding this year’s evaluation campaign [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ].
Moreover, we share details about our submissions and the scores achieved on our held-out development
set, as well as on the official test set of the competition [24], for both sub-tasks.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Concept Detection</title>
        <p>In the Concept Detection sub-task we submitted our ten best performing models, after evaluating
them on our held-out development set. We submitted two instances with different image encoders of
our CNN + FFNN model (Section 3.1.1), one instance of our CNN + k-NN model (Section 3.1.2), and a
single instance of our CNN + weighted k-NN model (Section 3.1.3). In our subsequent submissions, we
employed ensemble systems. These involved exploring the integration of predictions from multiple
instances by computing either the union or the intersection of their predicted concept sets. Our
submitted ensemble systems consisted of various combinations of CNN-based architectures paired with
different classifiers, specifically CNN + FFNN, CNN + k-NN (KNN), and CNN + weighted k-NN (wKNN).
To enhance the diversity and robustness of our ensembles, we incorporated different architectures for
the CNN component.</p>
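          <p>The union / intersection ensembling described above can be sketched as follows (the UMLS concept IDs and per-model outputs are illustrative):</p>
```python
# Sketch of ensembling predicted concept sets for one image.

def ensemble(predictions, mode="union"):
    """Combine per-model predicted concept sets.

    predictions: list of sets of concept IDs, one set per model.
    mode: "union" favours recall; "intersection" favours precision.
    """
    combined = set(predictions[0])
    for p in predictions[1:]:
        combined = combined.union(p) if mode == "union" else combined.intersection(p)
    return combined

cnn_ffnn = {"C0032285", "C0817096"}   # e.g. pneumonia, chest
cnn_knn = {"C0032285", "C0040405"}    # e.g. pneumonia, computed tomography
union = ensemble([cnn_ffnn, cnn_knn], "union")          # all three concepts
inter = ensemble([cnn_ffnn, cnn_knn], "intersection")   # only the shared concept
```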
        <p>The primary evaluation metric for this year’s Concept Detection sub-task was the F1-score, calculated
between the predicted and ground truth concepts. It is calculated as the sum of the F1-scores for each
test image, divided by the total number of test images. Each partial score is derived from the binary
multi-hot candidate vector compared to the corresponding ground truth vector. Specifically, let F1
represent the overall F1-score, and F̂1 denote the individual F1-score for each test image. Additionally,
let P_i and G_i be the predicted and ground truth concepts for an image i, respectively. Finally, let T be
the test set [24].</p>
        <p>F1 = (1 / |T|) Σ_{i ∈ T} F̂1(P_i, G_i). (6)</p>
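          <p>A compact sketch of this example-based (per-image) F1 of Eq. (6), with illustrative predictions:</p>
```python
# Per-image F1 between predicted and gold concept sets, averaged over the test set.

def f1(pred, gold):
    if not pred and not gold:
        return 1.0   # convention: empty prediction matching empty gold is perfect
    tp = len(pred.intersection(gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f1(predictions, golds):
    """Eq. (6): average of per-image F1 scores over the test set."""
    scores = [f1(p, g) for p, g in zip(predictions, golds)]
    return sum(scores) / len(scores)

preds = [{"a", "b"}, {"c"}]
golds = [{"a"}, {"c", "d"}]
# image 1: P=0.5, R=1.0 -> F1=2/3; image 2: P=1.0, R=0.5 -> F1=2/3
print(round(mean_f1(preds, golds), 3))   # prints 0.667
```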
        <p>Moreover, a secondary evaluation metric (again an F1-score) was calculated, which only considered
manually selected concepts, such as anatomy, topography, and modality.</p>
        <p>For our first system (CNN+FFNN), we experimented with a variety of CNN encoders as its backbone
component. Specifically, we trained the networks using state-of-the-art CNN architectures, including
EfficientNet and DenseNet. Furthermore, we extended our experiments by incorporating these CNN
encoders into our k-NN models.</p>
        <p>During testing on our held-out development set, we observed a slightly higher F1 score in models
utilizing the EfficientNet image encoder.</p>
        <p>Our ensembling approaches did not show significant improvement over our individual models, with
minimal differences observed in both the development and test set [24].</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Caption Prediction</title>
        <p>For the Caption Prediction sub-task, we submitted nine systems based on their performance on our
development set. Our submissions included InstructBLIP (Section 3.2.1), a synthesizer variant
combining InstructBLIP with FLAN-T5 (Section 3.2.2), and a rephrasing variant that employs ClinicalT5
(Section 3.2.3). Additionally, we explored combinations of all three approaches, aiming to refine the
captions generated by InstructBLIP and FLAN-T5 (Section 3.2.2) using our ClinicalT5 rephraser on
top. Furthermore, we submitted three variations of InstructBLIP and DMMCS, each with a different α
value (Section 3.2.4). Finally, we provided two instances where we employed ClinicalT5 to rephrase the
results generated by the combination of InstructBLIP and DMMCS, in this case using α = 0.10.</p>
        <p>In this year’s campaign, BERTScore [36] was the primary evaluation metric in the Caption Prediction
task, while ROUGE [37] was the secondary metric. Other metrics utilized include, for example, BLEU-1
[38], BLEURT [39], and METEOR [40]. Table 6 shows captions produced by each of our submissions for
the test image CC BY [Muacevic et al. (2024)], extracted from the test dataset [24].</p>
        <p>Finally, Table 7 provides an overview of our models, detailing their performance across fundamental
campaign metrics in both our development set and the provided test set [24], along with our attained
rankings. Additionally, Table 8 presents a summary of all the metrics utilized in this year’s campaign,
offering a comprehensive view of the experiments.</p>
        <p>Table 6 (excerpt): InstructBLIP + DMMCS (alpha 0.1): “Chest X-ray showing bilateral pulmonary edema.”
InstructBLIP + DMMCS (alpha 0.1) + Rephraser: “Computed tomography scan of the head and neck showing a mass in the right parotid gland.”
InstructBLIP + DMMCS (alpha 0.1) + Rephraser (random restart): “Anteroposterior radiograph of the pelvis showing a large right-sided pleural effusion.”</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusion</title>
      <p>Our participation in the ImageCLEFmedical Caption task provided an opportunity to explore innovative
NLP approaches for medical image captioning. Utilizing state-of-the-art models, we demonstrated
competitive performance in both the Concept Detection and Caption Prediction sub-tasks.</p>
      <p>In the Concept Detection sub-task, we achieved a 2nd place ranking among the participating groups.
Our top-performing system was a CNN+FFNN pipeline (Section 3.1.1), while our remaining submissions
included a CNN+KNN (Section 3.1.2) and a CNN+wKNN (Section 3.1.3), which also produced competitive
results. We also employed ensembles that combined these systems via the union or intersection of their
predicted tags.</p>
      <p>
        In the Caption Prediction sub-task, we were ranked 4th among all participating groups, by both
extending our previous work [
        <xref ref-type="bibr" rid="ref17">22, 21, 17</xref>
        ] and exploiting the state-of-the-art in NLP, such as
instruction-tuned Large Language Models. Our approach involved the initial generation of captions using the
InstructBLIP model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], followed by their enrichment through the synthesis of information from the
captions of similar images [
        <xref ref-type="bibr" rid="ref12 ref13">12, 13</xref>
        ] and the utilization of a model further pre-trained in the medical
domain [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ] to improve the originally generated captions.
      </p>
      <p>In future work, we plan to further investigate and improve biomedical LLMs and further explore
their reasoning capabilities through instruction tuning and, more generally, alignment with medical
professionals’ needs [41]. We also plan to utilize a model capable of processing both image and text inputs
in our Synthesizer approach (Section 3.2.2) to combine information not only from the captions of the
neighbors, but also from the images themselves. Furthermore, we plan to exploit Retrieval-Augmented
Generation [42] algorithms to combine prior knowledge with new medical cases. Finally, the generated
captions need to be evaluated in collaboration with medical experts, to assess their medical accuracy
and usefulness.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>This work has been partially supported by project MIS 5154714 of the National Recovery and Resilience
Plan Greece 2.0 funded by the European Union under the NextGenerationEU Program.
[19] B. Karatzas, J. Pavlopoulos, V. Kougia, I. Androutsopoulos, AUEB NLP Group at ImageCLEFmed
Caption 2020, in: Working Notes of CLEF 2020 - Conference and Labs of the Evaluation Forum,
Thessaloniki, Greece, September 22-25, volume 2696 of CEUR Workshop Proceedings, 2020.
[20] F. Charalampakos, V. Karatzas, V. Kougia, J. Pavlopoulos, I. Androutsopoulos, AUEB NLP Group
at ImageCLEFmed Caption Tasks 2021, in: Proceedings of the Working Notes of CLEF 2021
Conference and Labs of the Evaluation Forum, Bucharest, Romania, September 21-24, volume 2936
of CEUR Workshop Proceedings, 2021, pp. 1184–1200.
[21] F. Charalampakos, G. Zachariadis, J. Pavlopoulos, V. Karatzas, C. Trakas, I. Androutsopoulos,
AUEB NLP Group at ImageCLEFmedical Caption 2022, in: CLEF2022 Working Notes, CEUR
Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022, pp. 1355–1373.
[22] P. Kaliosis, G. Moschovis, F. Charalampakos, J. Pavlopoulos, I. Androutsopoulos, AUEB NLP Group
at ImageCLEFmedical Caption 2023, in: CLEF2023 Working Notes, CEUR Workshop Proceedings,
CEUR-WS.org, Thessaloniki, Greece, 2023.
[23] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical terminology,</p>
      <p>Nucleic acids research 32 (2004) D267–D270. doi:10.1093/nar/gkh061.
[24] J. Rückert, L. Bloch, R. Brüngel, A. Idrissi-Yaghir, H. Schäfer, C. S. Schmidt, S. Koitka, O. Pelka, A. B.</p>
      <p>Abacha, A. G. S. de Herrera, H. Müller, P. A. Horn, F. Nensa, C. M. Friedrich, ROCOv2: Radiology
Objects in COntext Version 2, an Updated Multimodal Image Dataset, Scientific Data (2024). URL:
https://arxiv.org/abs/2405.10004v1. doi:10.1038/s41597-024-03496-6.
[25] F. Radenović, G. Tolias, O. Chum, Fine-Tuning CNN Image Retrieval with No Human Annotation,
IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019) 1655–1668. doi:10.
1109/TPAMI.2018.2846566.
[26] D. P. Kingma, J. L. Ba, Adam: A Method for Stochastic Optimization, in: 3rd International
Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference
Track Proceedings, 2015.
[27] T.-H. Chiang, H.-Y. Lo, S.-D. Lin, A Ranking-based KNN Approach for Multi-Label Classification,
in: Proceedings of the Asian Conference on Machine Learning, volume 25, Singapore Management
University, Singapore, 2012, pp. 81–96.
[28] A. Eiben, J. E. Smith, Introduction to Evolutionary Computing, 2nd ed., Springer Publishing</p>
      <p>Company, Incorporated, 2015. doi:10.1007/978-3-662-44874-8.
[29] J. Wei, M. Bosma, V. Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, Q. V. Le, Finetuned
Language Models Are Zero-Shot Learners, International Conference on Learning Representations
abs/2109.01652 (2021). doi:10.48550/arXiv.2109.01652.
[30] J. Li, D. Li, S. Savarese, S. C. H. Hoi, BLIP-2: Bootstrapping Language-Image Pre-training with
Frozen Image Encoders and Large Language Models, in: International Conference on Machine
Learning, 2023. URL: https://api.semanticscholar.org/CorpusID:256390509. doi:10.48550/arXiv.
2301.12597, Last accessed: 2024-06-20.
[31] Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, H. Wang, Retrieval-Augmented
Generation for Large Language Models: A Survey, 2024. doi:10.48550/arXiv.2312.10997.
arXiv:2312.10997.
[32] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Kuttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks, Neural Information Processing Systems abs/2005.11401 (2020).
[33] Y. Huang, J. Huang, A Survey on Retrieval-Augmented Text Generation for Large Language Models,
2024. doi:10.48550/arXiv.2404.10981. arXiv:2404.10981.
[34] C. Raffel, N. M. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu, Exploring
the Limits of Transfer Learning with a Unified Text-to-Text Transformer, Journal of machine
learning research 21 (2019) 140:1 – 140:67. doi:10.48550/arXiv.1910.10683.
[35] P. Kaliosis, Exploring Uni-modal, Multi-modal and Few-Shot Deep Learning Methods for Diagnostic
Captioning, 2023. M.Sc. thesis, Department of Informatics, Athens University of Economics and
Business.
[36] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, Y. Artzi, BERTScore: Evaluating text generation
with BERT, International Conference on Learning Representations abs/1904.09675 (2019).
[37] C.-Y. Lin, ROUGE: A Package for Automatic Evaluation of Summaries, in: Text Summarization
Branches Out, Association for Computational Linguistics, Barcelona, Spain, 2004, pp. 74–81. URL:
https://aclanthology.org/W04-1013, Last accessed: 2024-06-20.
[38] K. Papineni, S. Roukos, T. Ward, W.-J. Zhu, BLEU: a Method for Automatic Evaluation of Machine
Translation, in: P. Isabelle, E. Charniak, D. Lin (Eds.), Proceedings of the 40th Annual Meeting
of the Association for Computational Linguistics, Association for Computational Linguistics,
Philadelphia, Pennsylvania, USA, 2002, pp. 311–318. URL: https://aclanthology.org/P02-1040.
doi:10.3115/1073083.1073135, Last accessed: 2024-06-20.
[39] T. Sellam, D. Das, A. Parikh, BLEURT: Learning Robust Metrics for Text Generation, in: D. Jurafsky,
J. Chai, N. Schluter, J. Tetreault (Eds.), Proceedings of the 58th Annual Meeting of the Association
for Computational Linguistics, Association for Computational Linguistics, Online, 2020, pp. 7881–
7892. URL: https://aclanthology.org/2020.acl-main.704. doi:10.18653/v1/2020.acl-main.704,
Last accessed: 2024-06-20.
[40] S. Banerjee, A. Lavie, METEOR: An Automatic Metric for MT Evaluation with Improved Correlation
with Human Judgments, in: J. Goldstein, A. Lavie, C.-Y. Lin, C. Voss (Eds.), Proceedings of the
ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or
Summarization, Association for Computational Linguistics, Ann Arbor, Michigan, 2005, pp. 65–
72. URL: https://aclanthology.org/W05-0909. doi:10.3115/1626355.1626389, Last accessed:
2024-06-20.
[41] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama,
A. Ray, J. Schulman, J. Hilton, F. Kelton, L. E. Miller, M. Simens, A. Askell, P. Welinder, P. Christiano,
J. Leike, R. J. Lowe, Training language models to follow instructions with human feedback, Neural
Information Processing Systems abs/2203.02155 (2022). doi:10.48550/arXiv.2203.02155.
[42] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W.-t. Yih,
T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for Knowledge-Intensive NLP
Tasks, in: Advances in Neural Information Processing Systems, volume 33, Curran Associates,
Inc., 2020, pp. 9459–9474. doi:10.48550/arXiv.2005.11401.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcıa Seco de Herrera</surname>
          </string-name>
          , L. Bloch,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Prokopchuk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Karpenka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Macaire</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Schwab</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Lecouteux</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Esperança-Rodier</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Fu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Heinrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Kiesel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Potthast</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Stein</surname>
          </string-name>
          , Overview of ImageCLEF 2024:
          <article-title>Multimedia Retrieval in Medical Applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 15th International Conference of the CLEF Association (CLEF</source>
          <year>2024</year>
          ), Springer Lecture Notes in Computer Science LNCS, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          , L. Bloch,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Bracke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Damm</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. M. G.</given-names>
            <surname>Pakull</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. S.</given-names>
            <surname>Schmidt</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          , Overview of ImageCLEFmedical 2024 -
          <article-title>Caption Prediction and Concept Detection</article-title>
          , in: CLEF2024 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Grenoble, France,
          <year>2024</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papamichail</surname>
          </string-name>
          , Diagnostic Captioning: A Survey,
          <source>Knowledge and Information Systems</source>
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          . doi:10.48550/arXiv.2101.07299.
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H.-C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation</article-title>
          ,
          <source>in: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>2497</fpage>
          -
          <lpage>2506</lpage>
          . doi:10.48550/arXiv.1603.08486.
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <article-title>Medical image captioning based on Deep Architectures, Master's thesis</article-title>
          , KTH Royal Institute of Technology, Stockholm, Sweden,
          <year>2022</year>
          . URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-323528, Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>A Survey on Biomedical Image Captioning</article-title>
          , in: R. Bernardi,
          <string-name>
            <given-names>R.</given-names>
            <surname>Fernandez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gella</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Kafle</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Nabi</surname>
          </string-name>
          (Eds.),
          <source>Proceedings of the Second Workshop on Shortcomings in Vision and Language</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          . URL: https://aclanthology.org/W19-1803. doi:10.18653/v1/W19-1803, Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>P.</given-names>
            <surname>Kaliosis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalampakos</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Moschovis</surname>
</string-name>
,
<string-name>
  <given-names>I.</given-names>
  <surname>Androutsopoulos</surname>
</string-name>
,
<article-title>A data-driven guided decoding mechanism for diagnostic captioning</article-title>
, in:
<source>Findings of the Association for Computational Linguistics: ACL 2024</source>
,
<year>2024</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>W. X.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Min</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Du</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Jiang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Tang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
,
<string-name>
  <given-names>P.</given-names>
  <surname>Liu</surname>
</string-name>
,
          <string-name>
            <given-names>J.-Y.</given-names>
            <surname>Nie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.-R.</given-names>
            <surname>Wen</surname>
          </string-name>
          ,
<article-title>A Survey of Large Language Models</article-title>
,
<year>2023</year>
. doi:10.48550/arXiv.2303.18223. arXiv:2303.18223.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>W.</given-names>
            <surname>Dai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. M. H.</given-names>
            <surname>Tiong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Zhao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P. N.</given-names>
            <surname>Fung</surname>
          </string-name>
,
<string-name>
  <given-names>S.</given-names>
  <surname>Hoi</surname>
</string-name>
,
<article-title>InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning</article-title>
          ,
          <source>Advances in Neural Information Processing Systems</source>
          <volume>36</volume>
          (
          <year>2024</year>
). doi:10.48550/arXiv.2305.06500.
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
<article-title>Radiology Objects in COntext (ROCO): A Multimodal Image Dataset</article-title>
, in:
<source>7th Joint International Workshop, CVII-STENT 2018 and Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018, Granada, Spain, September 16, 2018, Proceedings</source>
,
<year>2018</year>
, pp.
<fpage>180</fpage>
-
<lpage>189</lpage>
. doi:10.1007/978-3-030-01364-6_20.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>H. W.</given-names>
            <surname>Chung</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Longpre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Zoph</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Tay</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Fedus</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Brahma</surname>
          </string-name>
, et al.,
<article-title>Scaling Instruction-Finetuned Language Models</article-title>
,
<source>Journal of Machine Learning Research</source>
<volume>25</volume>
(
<year>2024</year>
)
<fpage>1</fpage>
-
<lpage>53</lpage>
. doi:10.48550/arXiv.2210.11416.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>Y.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Liang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Hu</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Xing</surname>
</string-name>
,
<article-title>Knowledge-Driven Encode, Retrieve, Paraphrase for Medical Image Report Generation</article-title>
, in:
<source>AAAI Conference on Artificial Intelligence</source>
, volume abs/1903.10122,
<year>2019</year>
. doi:10.1609/aaai.v33i01.33016666.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>G.</given-names>
            <surname>Vernikos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Brazinskas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Adamek</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Mallinson</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Severyn</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Malmi</surname>
</string-name>
,
<article-title>Small Language Models Improve Giants by Rewriting Their Outputs</article-title>
, in: Y. Graham, M. Purver (Eds.),
<source>Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)</source>
, Association for Computational Linguistics, St. Julians, Malta,
<year>2024</year>
, pp.
<fpage>2703</fpage>
-
<lpage>2718</lpage>
. URL: https://aclanthology.org/2024.eacl-long.165. doi:10.48550/arXiv.2305.13514. Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Dou</surname>
          </string-name>
,
<string-name>
  <given-names>T.</given-names>
  <surname>Nguyen</surname>
</string-name>
,
<article-title>ClinicalT5: A Generative Language Model for Clinical Text</article-title>
, in:
<source>Findings of the Association for Computational Linguistics: EMNLP 2022</source>
,
<year>2022</year>
, pp.
<fpage>5436</fpage>
-
<lpage>5443</lpage>
. doi:10.18653/v1/2022.findings-emnlp.398.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>A.</given-names>
            <surname>Dosovitskiy</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Beyer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Kolesnikov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Weissenborn</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhai</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Unterthiner</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dehghani</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Minderer</surname>
          </string-name>
,
<string-name>
  <given-names>G.</given-names>
  <surname>Heigold</surname>
</string-name>
,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gelly</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Uszkoreit</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Houlsby</surname>
          </string-name>
          ,
          <article-title>An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale</article-title>
, in:
<source>International Conference on Learning Representations</source>
,
<year>2021</year>
. URL: https://openreview.net/forum?id=YicbFdNTTy. doi:10.48550/arXiv.2010.11929. Last accessed: 2024-06-20.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>I.</given-names>
            <surname>Athanasiadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuoma</surname>
          </string-name>
          ,
          <article-title>Weakly-Supervised Semantic Segmentation via Transformer Explainability</article-title>
          ,
in:
<source>ML Reproducibility Challenge 2021 (Fall Edition)</source>
,
<year>2022</year>
. doi:10.5281/zenodo.6574631.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
,
<string-name>
  <given-names>E.</given-names>
  <surname>Fransén</surname>
</string-name>
,
<article-title>NeuralDynamicsLab at ImageCLEF Medical 2022</article-title>
, in:
<source>CLEF2022 Working Notes</source>
, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy,
<year>2022</year>
.
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
<string-name>
  <given-names>I.</given-names>
  <surname>Androutsopoulos</surname>
</string-name>
,
<article-title>AUEB NLP Group at ImageCLEFmed Caption 2019</article-title>
, in:
<source>Working Notes of CLEF 2019 - Conference and Labs of the Evaluation Forum</source>
, Lugano,
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>