<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AUEB NLP Group at ImageCLEFmedical Caption 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Panagiotis Kaliosis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Georgios Moschovis</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Foivos Charalampakos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>John Pavlopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ion Androutsopoulos</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Informatics, Athens University of Economics and Business</institution>
          ,
          <addr-line>76, Patission Street, GR-104 34 Athens</addr-line>
          ,
          <country country="GR">Greece</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This article describes the methods that the AUEB NLP Group experimented with during its participation in the 7th edition of the ImageCLEFmedical Caption sub-tasks, namely Concept Detection and Caption Prediction. The former intends to automatically classify biomedical images into a set of one or more tags based solely on the visual input, while the latter aims to generate a syntactically and semantically accurate diagnostic caption that addresses the medical conditions depicted on a given image. For the Concept Detection sub-task, extending our previous work, we utilized a wide range of Convolutional Neural Network encoders followed by a Feed-Forward Neural Network, both in a single-task and a multi-task fashion, as well as combined with a contrastive learning approach. Our methods concerning the Caption Prediction sub-task are influenced by both our previous work and recent progress in Natural Language Processing (NLP) methods. Our two base systems use CNN-RNN and Transformer-to-Transformer encoder-decoder architectures, respectively. Additionally, we experimented with a Transformer-based denoising component, which was trained to reformulate the generated captions in a more syntactically coherent and medically accurate way. Our group ranked 1st in Concept Detection and 3rd in Caption Prediction.</p>
      </abstract>
      <kwd-group>
        <kwd>Natural Language Processing</kwd>
        <kwd>Computer Vision</kwd>
        <kwd>Biomedical Images</kwd>
        <kwd>Convolutional Neural Networks</kwd>
        <kwd>Multi-Label Classification</kwd>
        <kwd>Caption Generation</kwd>
        <kwd>Generative Models</kwd>
        <kwd>Transformers</kwd>
        <kwd>Deep Learning</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In Concept Detection, the goal is to link a biomedical image with one or more medical concepts (categories), whereas in Caption
Prediction the goal is to automatically generate a draft diagnostic report that accurately outlines
the medical situation, as well as the topology of the body structures and organs shown in the
image.</p>
      <p>
        Diagnostic Captioning still constitutes a challenging research problem that aims to assist the
diagnostic process for a patient by providing a draft report, rather than replacing the doctors
and any human factor involved in the procedure [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. It may thus be viewed as an assistive tool,
capable of providing an initial draft diagnostic report regarding the patient’s condition. Such a
draft would ideally allow the doctors’ attention to focus on important regions of the image [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]
and aid them in producing medical diagnoses with improved accuracy and speed [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
Experienced clinicians could improve their throughput by analyzing more quickly and efficiently
the large volume of medical examinations that they handle daily. Less experienced clinicians
could ideally consider the automatically generated captions in order to reduce the probability
of clinical errors [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ]. Concept Detection may assist Diagnostic Captioning by detecting key
concepts that need to be mentioned in the draft report. It can also be used to index medical
images by relevant concepts.
      </p>
      <sec id="sec-1-1">
        <title>1.1. AUEB NLP Group contributions</title>
        <p>
          In this work we present the experiments conducted, as well as the systems submitted as part of
AUEB NLP Group’s participation in this year’s Concept Detection and Caption Prediction tasks.
We experimented with several extensions of our previous work [
          <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8 ref9">7, 8, 9, 10, 11</xref>
          ] in the Diagnostic
Captioning task, in addition to a number of new approaches influenced by the expeditious
progress of Transformer-based [12] Deep Learning methods in Sequence-to-Sequence (Seq2Seq)
architectures [13] and Large Language Models (LLMs) [14].
        </p>
        <p>Our submissions to the Concept Detection sub-task revolve around two main directions. In
the first one, we employed a Convolutional Neural Network (CNN) encoder in order to obtain
the images’ visual representations, followed by a Feed-Forward Neural Network (FFNN) that
classifies the images into one or more medical concepts. In the second direction, we employed
contrastive learning [15, 16], aiming at bringing the high-dimensional representations of images
and their assigned concepts closer in the vector space. Finally, we experimented with various
ensembles of our proposed systems, either by performing majority voting based on each system’s
predictions or by calculating the intersection and the union of their predicted concepts.</p>
        <p>
          For the Caption Prediction sub-task, our work can be divided into three major directions. The
first one, following our last year’s submissions [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], is a Show and Tell model [17], which more
specifically adopts an architecture that includes a CNN and a Recurrent Neural Network (RNN).
The CNN-RNN architecture still remains competitive, while it also lays the foundations for
further experiments, such as investigating new variants and modified forms [ 18]. Furthermore,
we implemented an encoder-decoder model, where we employed Transformers for both the
encoder and decoder components. More specifically, we employed a Vision Transformer (ViT)
[19] instance as the image encoder and a GPT-2 [20] decoder in charge of generating the
predicted captions. As our third major direction, we implemented a novel pipeline, where we
used a denoising sequence-to-sequence model on top of the two aforementioned architectures.
We trained our denoising model to rewrite or rephrase the initial draft radiology reports by
providing it with the ground truth captions. Thus, it was able to learn and subsequently correct
the common mistakes of our two base models, resulting in a more fluent and consistent generated
caption.
        </p>
        <p>
          Extending our history of successful entries [
          <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8">7, 8, 10, 11</xref>
          ] in the ImageCLEFmedical campaign,
our submissions ranked 1st among 10 participating groups in the Concept Detection sub-task
and 3rd among 13 participating groups in the Caption Prediction sub-task. In Section 2 below,
we provide insight into this year’s dataset, followed by a discussion of our methods in Section 3.
In Section 4, we present our experimental results for each sub-task. Finally, in Section 5 we
summarize our findings and suggest directions for future research.
        </p>
      </sec>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>In this year’s edition of the ImageCLEFmedical Caption task, a dataset consisting of 71,355
biomedical images along with their respective medical concepts, in the form of UMLS [21]
terms,2 and diagnostic captions was provided. The set was originally split by the organizers into
training and validation subsets. Following the previous years’ campaigns, the dataset constitutes
an updated and extended version of the Radiology Objects in Context (ROCO) dataset [22],
which originates from a range of biomedical articles available in the PubMed Central Open
Access (PMC OA) subset3.</p>
      <p>2UMLS: https://www.nlm.nih.gov/research/umls/index.html, Last accessed: 2023-07-07
3PMC Open Access: https://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/, Last accessed: 2023-07-07</p>
      <p>The dataset, common for both sub-tasks, comprised images of different modalities (e.g., X-Ray,
Computed Tomography), although no further insight was provided regarding the different types
of images included. Concept Detection is a multi-label classification problem covering a broad
range of 2,125 distinct biomedical concepts, originating from the Unified Medical Language
System (UMLS) [21], whereas caption prediction aims at open-ended generation of diagnostic
texts for the medical images. After merging the provided training and validation data, we
split them into three subsets, holding out a development (private test) subset for evaluation
purposes. We followed a 75%-10%-15% split, keeping relatively equal data distributions in all
three subsets. We confirmed it by comparing the concepts distribution between the subsets.
Thus, we considered 53,516 images as our training data, 7,135 images as our validation set,
while the remaining 10,704 images constituted our held-out development set. Moreover, an
oficial test set, consisting of 10,473 images was shared. All of our submissions were evaluated
based on their performance on the oficial test set.</p>
      <sec id="sec-2-1">
        <title>2.1. Concept Detection</title>
        <p>Regarding the Concept Detection sub-task, a set of one or more medical concepts were originally
assigned to each radiology image. The concepts are offered in the form of Concept Unique
Identifiers (CUIs) in accordance with the Unified Medical Language System (UMLS) [21]. For
example, the biomedical concept “Pericardial Effusion” is associated with the CUI term “C0031039”.
Each concept is retrieved from the image’s corresponding diagnostic caption in order to be
employed as the training target. Some examples of images and their corresponding ground
truth concepts can be found in Figure 1.</p>
        <p>Figure 1: Examples of images and their ground truth concepts: (a) C0041618 (Ultrasonography), C0238207 (Ectopic Kidney), C0030797 (Pelvis); (b) C1306645 (Plain X-Ray), C0039985 (Plain Chest X-Ray); (c) C1306645 (Plain X-Ray). Image sources (CC BY): Khougali et al. (2021), Kaler et al. (2018), Uddin et al. (2012).</p>
        <p>The dataset contains 2,125 distinct biomedical concepts. It is highly imbalanced in terms of
concepts, as there are some that appear more than 20,000 times, while others are assigned to
only 4 or 5 images. Figure 2 below illustrates the dataset’s long-tail distribution (left plot) by
plotting the number of each concept’s appearances in descending order against its index (class
index). Furthermore, after performing a thorough exploratory analysis of this year’s dataset, we
observed that some concepts were more common, while also representing a greater category of
medical examinations, such as X-Ray or Ultrasonography. Besides, we observed that the vast
majority of the images is associated with one of these concepts, in addition to the rest, more
specific concepts. Based on this observation, we decided to explore the potential of a multi-task
classification model based on a shared backbone encoder, which will be described in Section 3.</p>
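        <p>For illustration, the concept-frequency statistics discussed above can be reproduced with a minimal sketch such as the one below; the CSV file name, the column name and the “;” separator are assumptions about the distributed concept file rather than its exact format.</p>
        <preformat>
# Sketch of the exploratory concept-frequency analysis (file name, column name
# and the ";" separator are assumptions about the distributed CSV format).
from collections import Counter

import pandas as pd

df = pd.read_csv("train_concepts.csv")            # hypothetical file name
counts = Counter()
for cuis in df["cuis"]:                           # hypothetical column name
    counts.update(c for c in str(cuis).split(";") if c)

frequencies = sorted(counts.values(), reverse=True)
print("distinct concepts:", len(counts))
print("most frequent concepts:", counts.most_common(5))
print("concepts assigned to fewer than 10 images:",
      sum(1 for v in frequencies if v &lt; 10))
        </preformat>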
      </sec>
      <sec id="sec-2-2">
        <title>2.2. Caption Prediction</title>
        <p>
          In the Caption Prediction sub-task, the images are accompanied by a diagnostic caption that
expresses the medical conditions present in the image. There are 71,355 captions across the
whole dataset, one for each provided image. Similarly to last year’s campaign, the vast majority of
the captions, specifically 99.46% (70,974 out of 71,355 captions) are unique. This is an important
differentiation from previous versions of this task, where this percentage was significantly
lower [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ]. Consequently, typical retrieval methods based on nearest neighbours search [23]
are not so efficient this year, including extended variations with weighting mechanisms relying
on the cosine similarities of the retrieved images [24]. Therefore, more elaborate captioning
methods are needed.
        </p>
        <p>We found that the maximum number of words in a single caption is 315 (occurred once),
while the minimum is 1 (encountered 134 times). The average caption length is 16.04 words.
These statistics refer to the dataset as a whole, but we have carefully checked that they remain
consistent in all three subsets. The five most common captions, as well as the ten most popular
words, after excluding the stopwords, can be found in Tables 2 and 3 respectively. In Figure 3,
we provide a histogram, as well as a box-plot, both showing that most of the captions do not
exceed 100 tokens.</p>
        <p>The following pre-processing steps were considered for the captions (a short code sketch follows below):
• The caption is converted to lower-case.
• Numbers are replaced by words, e.g., number 10 becomes “ten”.</p>
        <p>
          • Punctuation is removed.
Unlike last year’s campaign [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], we decided to perform experiments while both adopting
and ignoring the pre-processing procedure during training, taking into consideration that this
year stop-words were not removed during pre-processing by the organizers. Removal of the
stop-words could potentially lead to distortion of important words in either the predicted or
the ground truth captions.
        </p>
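        <p>A minimal sketch of these pre-processing steps is given below; the num2words package is one possible way to spell out numbers, and the exact tokenization details are assumptions.</p>
        <preformat>
# Sketch of the caption pre-processing steps listed above: lower-casing,
# spelling out numbers and removing punctuation (num2words is an assumption;
# any digit-to-word conversion would do).
import re
import string

from num2words import num2words

def preprocess(caption):
    caption = caption.lower()
    caption = re.sub(r"\d+", lambda m: " " + num2words(int(m.group())) + " ", caption)
    caption = caption.translate(str.maketrans("", "", string.punctuation))
    return " ".join(caption.split())

print(preprocess("Chest X-ray of a 10 year old patient."))
# prints: chest xray of a ten year old patient
        </preformat>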
      </sec>
    </sec>
    <sec id="sec-3">
      <title>3. Methods</title>
      <p>In this section, we present the methods we used in our submissions for both the Concept
Detection and the Caption Prediction sub-tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Concept Detection</title>
        <p>
          Our submissions in this year’s Concept Detection sub-task are based on three groundwork
systems. First, following our previous work [
          <xref ref-type="bibr" rid="ref10 ref11 ref7 ref8">7, 8, 10, 11</xref>
          ], we thoroughly experimented with a
CNN+FFNN system, as well as a multitask classifier with a more complex, yet similar
architecture. Moreover, we implemented a contrastive learning retrieval-based classifier, using it
as a standalone system as well as combining it with the best performing CNN+FFNN system.
Additionally, we made several submissions using ensembles (majority, union and
intersection-based) of the three aforementioned systems, as they achieved higher performance on the
primary evaluation metric of the task in our held-out development set.</p>
        <sec id="sec-3-1-cnn-ffnn">
          <title>3.1.1. CNN+FFNN Classifier</title>
          <p>Our first system utilizes a CNN encoder backbone, followed by an FFNN classification head
that employs one or more hidden layers. The image features that represent the visual input are
extracted from the last convolutional layer of the CNN. Then, a global pooling layer is used in
order to acquire the final feature vector. We experimented with three global pooling strategies:
max, average and Generalized-Mean (GeM) global pooling [25], which all resulted in enhanced
performance compared to no pooling scheme. Max pooling retrieves the maximum value of
each feature map, while average pooling computes the respective mean value [26]. In addition,
GeM pooling is a generalized version of both the max and average pooling strategies [25].
        </p>
        <p>Specifically, given an input image, the CNN encoder outputs a three-dimensional tensor
of shape C × H × W, where C denotes the number of channels (feature maps), while H and W
represent the image’s height and width. Let X_c be a feature map of size H × W, for
c ∈ [1, 2, . . . , C], and let f_max, f_avg, f_gem be the max, average and GeM pooling functions,
respectively. The pooling layer’s output for input X_c is a single value v_c that can be computed
based on Equations 1, 2, and 3, hereunder, depending on the pooling strategy employed:
f_max(X_c) = v_c = max_{x ∈ X_c} x (1)
f_avg(X_c) = v_c = (1 / |X_c|) · ∑_{x ∈ X_c} x (2)
f_gem(X_c) = v_c = ( (1 / |X_c|) · ∑_{x ∈ X_c} x^p )^(1/p) (3)
GeM pooling is equivalent to max pooling when p → ∞, and equivalent to average pooling
when p = 1 [25]. The hyperparameter p can either be trained by integrating it in the
network’s training process, or be manually initialized beforehand.</p>
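        <p>To make the three pooling strategies concrete, a minimal PyTorch sketch is given below; the feature-map dimensions and the value of p are illustrative assumptions rather than our exact configuration.</p>
        <preformat>
# Sketch of the pooling strategies of Equations 1-3 over CNN feature maps of
# shape (batch, C, H, W); p is the GeM exponent and could instead be a
# trainable nn.Parameter.
import torch

def max_pool(fmaps):
    return torch.amax(fmaps, dim=(2, 3))                              # Eq. 1

def avg_pool(fmaps):
    return fmaps.mean(dim=(2, 3))                                     # Eq. 2

def gem_pool(fmaps, p=3.0, eps=1e-6):
    return fmaps.clamp(min=eps).pow(p).mean(dim=(2, 3)).pow(1.0 / p)  # Eq. 3

fmaps = torch.rand(8, 1280, 7, 7)   # e.g., EfficientNetB0 feature maps
print(gem_pool(fmaps).shape)        # torch.Size([8, 1280])
        </preformat>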
        <p>The FFNN component, consisting of multiple hidden and dropout [27] layers, classifies the
image into one or more concepts. The network’s output layer consists of |C| neurons, where C
is the set of the unique concepts in the dataset, and is featured with sigmoid activation gates in
order to squash the neurons’ values between 0 and 1, hence transforming them into probabilities.
We therefore end up with one probability per label and, if it exceeds a specific threshold value
t, then the corresponding concept is assigned to the image. The threshold (same value for all
concepts) was selected by performing a grid search in the range (0.1, 0.7) on our validation
set, aiming to optimize the competition’s primary metric, the F1 score. Our model was trained
by minimizing the binary cross-entropy loss, treating each concept as a binary classification
problem. Moreover, we used the Adam [28] optimizer, as well as a linearly decreasing learning
rate strategy and early stopping based on our validation set loss with patience equal to 3. We
do not exploit the validation set for our final model, since there is no guarantee that the same
number of epochs is the best when using all training data, and it has been previously observed
that "the gain of re-training the model after merging all the splits is almost negligible" [29]. We
experimented with various initial learning rates (e.g., 1e-3, 1e-4) and decreasing factors
(e.g., 0.1, 0.05) using random search.</p>
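        <p>A simplified sketch of the classification head and of the threshold search is shown below; the layer sizes, dropout rate and threshold grid step are illustrative assumptions.</p>
        <preformat>
# Sketch of the FFNN head with sigmoid outputs and of the grid search over the
# single decision threshold t that maximizes the development F1 score.
import numpy as np
import torch.nn as nn
from sklearn.metrics import f1_score

num_concepts = 2125
head = nn.Sequential(
    nn.Linear(1280, 1024), nn.ReLU(), nn.Dropout(0.2),   # sizes are assumptions
    nn.Linear(1024, num_concepts), nn.Sigmoid(),
)

def best_threshold(probs, gold):
    """probs, gold: numpy arrays of shape (num_images, num_concepts)."""
    best_t, best_f1 = 0.1, 0.0
    for t in np.arange(0.1, 0.7, 0.05):
        f1 = f1_score(gold, (probs &gt;= t).astype(int),
                      average="samples", zero_division=0)
        if f1 &gt; best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1
        </preformat>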
        </sec>
        <sec id="sec-3-1-1">
          <title>3.1.2. CNN+FFNN-based Multi-task Classifier</title>
          <p>Our second system adopts the CNN+FFNN architecture described in the previous section and
utilizes it in a multi-task fashion. We observed that some of the medical concepts were more
common and represented generic medical terms (see also Section 2). This observation led
us into experimenting with a multi-task classification model composed of a shared encoding
backbone and two task-specific classification heads. The first head corresponds to a single-label
classification task ( Modality prediction), while the second one to a multi-label classification
problem (Modality-specific concepts prediction ). An overview of the system’s architecture can be
found in Figure 4.</p>
          <p>The first head is in charge of classifying the image into one out of five candidate classes: the
four main modalities (namely X-Ray, Computed Tomography, Magnetic Resonance Imaging
and Ultrasonography) or none of them. Concurrently, the second head performs multi-label
classification on the image features excluding the main modality tags, attempting to identify the
rest of the concepts present in the image. The intuition behind this method is that, through the
aggregated loss, the shared backbone will be driven to learn optimized image representations
suitable for both tasks.</p>
          <p>Figure 4: Overview of the multi-task architecture: an input layer and a shared backbone, followed by task-specific layers that yield the modality tag and the modality-specific tags.</p>
          <p>Moreover, we stay consistent with our first system and use multiple hidden and dropout
[27] layers. The Modality Prediction head consists of five neurons on the output layer, one for
each modality (including the “None” option), featured with a Softmax activation function. It
was trained by attempting to minimize the categorical cross-entropy loss. On the other hand,
the Modality-specific classification head’s output layer consists of |C| − 4 neurons (where |C|
denotes the overall number of possible concepts) alongside a sigmoid activation gate, and is
trained by minimizing the binary cross-entropy loss. Both components’ learning rates were
initialized at a relatively low value, swiftly increased to a pre-defined maximum and then slowly
decreased until the end of the optimization process. This strategy has been shown to preserve
training stability and minimize the degree of divergence in the network’s parameters, especially
in the deeper layers [30].</p>
          <p>The entire network, composed of the shared backbone encoder and the two task-specific
classifiers, is trained based on the aggregated loss that derives from the two FFNN components.
Specifically, let ℒ be the network’s loss, loss_s be the single-label classifier’s loss, and finally
loss_m be the multi-label classification component’s loss. The total loss is equal to:
ℒ = w · loss_s(y_single, ŷ_single) + (1 − w) · loss_m(y_multi_rest, ŷ_multi_rest), 0 &lt; w ≤ 1 (4)
where w is initialized to 0.5 and can either stay fixed or be automatically adapted during training.
In the case where w is adaptive, we used the following approach. At the end of each epoch, if
the total loss is increased compared to the previous epoch, we proceed to examine the partial
task-specific losses. If only loss_s increased, then we increase w by a pre-defined factor
(e.g., 10%), aiming to put more emphasis on reducing loss_s throughout the next epoch.
The same procedure is followed, vice versa, regarding loss_m. In case both losses increased,
then we slightly adjust w either upwards or downwards, depending on which loss increased
more. Moreover, even if the total loss decreased, we still attempt to optimize the losses’ weights.
To do so, we modify w’s value, in accordance with which loss decreased more between the two.
We either increase or decrease it aiming to place greater emphasis on the component with the
less decreased loss.</p>
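          <p>The following sketch illustrates one possible implementation of this adaptive weighting; apart from the 10% factor mentioned above, the exact adjustment rules are illustrative assumptions.</p>
          <preformat>
# Sketch of the adaptive update of the weight w of Equation 4 at the end of an
# epoch, given the current and previous values of the two task-specific losses.
def update_weight(w, loss_s, loss_m, prev_loss_s, prev_loss_m, factor=0.10):
    s_up = loss_s &gt; prev_loss_s          # single-label loss increased
    m_up = loss_m &gt; prev_loss_m          # multi-label loss increased
    if s_up and not m_up:
        w = w * (1 + factor)             # emphasize the single-label loss
    elif m_up and not s_up:
        w = w * (1 - factor)             # emphasize the multi-label loss
    else:
        # both increased (or both decreased): nudge w towards the task whose
        # loss got worse (or improved less)
        delta_s, delta_m = loss_s - prev_loss_s, loss_m - prev_loss_m
        w = w * (1 + factor / 2) if delta_s &gt; delta_m else w * (1 - factor / 2)
    return min(max(w, 1e-3), 1.0)        # keep 0 &lt; w &lt;= 1
          </preformat>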
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.3. Contrastive Learning-based Tagger</title>
          <p>This system is based on the idea proposed by CLIP [16], which is a framework based on
multimodal learning. CLIP employs a contrastive learning objective and jointly trains an image
encoder and a text encoder to predict the correct pairings of (image, text) examples. The
(bidirectional) contrastive objective aims at bringing the representations of true pairings closer
in the vector space, while pushing the representations of mismatching pairs far away.</p>
          <p>Based on this approach, we formulate a similar training procedure where we utilize the
available (image, concepts) pairs instead. We again use an image encoder and a text encoder
based on BERT [31]. The encoders are trained to map the image representations and their gold
concepts representations to nearby points in a joint representation space (see Figure 5). We
compute the embeddings of the 2,125 concepts using the text encoder before training and treat
them as trainable variables, which we update instead of updating the text encoder’s parameters.
We use a bidirectional temperature-scaled version of the binary cross-entropy function as the
training objective, with the images-concepts similarity matrix S ∈ R^(B×|C|) (computed via the
dot product of the respective embeddings in a batch of B images) as the logits (see Eq. 5). The
goal is to maximize the similarity of the image embeddings with the embeddings of the gold
truth concepts assigned to each image.</p>
          <p>ℒ_CLIP(y_multi, S) = [ BCE(y_multi, S / τ) + BCE(y_multiᵀ, Sᵀ / τ) ] / 2 (5)
where τ is the temperature hyper-parameter and BCE denotes the binary cross-entropy function.</p>
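          <p>A minimal sketch of this objective is shown below; the temperature value and the tensor shapes are illustrative.</p>
          <preformat>
# Sketch of the bidirectional temperature-scaled BCE objective of Equation 5.
import torch
import torch.nn.functional as F

def clip_style_loss(img_emb, concept_emb, targets, tau=0.07):
    """img_emb: (B, d), concept_emb: (|C|, d), targets: (B, |C|) multi-hot."""
    sims = img_emb @ concept_emb.t()                   # similarity matrix S
    loss_images = F.binary_cross_entropy_with_logits(sims / tau, targets)
    loss_concepts = F.binary_cross_entropy_with_logits(sims.t() / tau, targets.t())
    return (loss_images + loss_concepts) / 2
          </preformat>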
          <p>During inference, given an image, we compute the similarities between its embedding
and the |C| concept embeddings and assign to it the top-k most similar concepts. We select to
learn the k parameter using the following scheme: we create a dataset D′ = {(s_i, k_i)}, i = 1, . . . , N,
where s_i = [s_i1, . . . , s_i|C|] is the vector that contains the similarities of the embedding of the i-th
image with each embedding of the |C| concepts, and k_i is the number of concepts assigned to
this image. Using D′, we train a Multi-Layer Perceptron (MLP) regressor in order to predict the
number of assigned concepts (k_i) of the i-th image based on its similarity vector with the |C|
concepts. Thus, we feed this network image-concepts similarities and it outputs a number k as
the expected number of assigned concepts for this image.</p>
          <p>Figure 5: Overview of our CLIP-based approach. For each batch of B images, we compute the
embeddings of the |C| concepts (and of the B images) and aim to maximize the similarity values
that correspond to the similarities between each image and its gold truth concepts. Figure
adapted from [16].</p>
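          <p>The inference procedure can be sketched as follows; the MLP regressor configuration is an assumption.</p>
          <preformat>
# Sketch of inference with the ContrastiveTagger: an MLP regressor trained on
# the dataset D' predicts how many concepts k to keep for an image, and the k
# most similar concepts are assigned.
import torch
from sklearn.neural_network import MLPRegressor

regressor = MLPRegressor(hidden_layer_sizes=(256,), max_iter=500)
# regressor.fit(similarity_vectors, gold_concept_counts)   # D' from the text

def predict_concepts(img_emb, concept_emb, concept_ids, regressor):
    """img_emb: (1, d), concept_emb: (|C|, d), concept_ids: list of CUIs."""
    sims = (img_emb @ concept_emb.t()).squeeze(0)           # shape (|C|,)
    k = max(1, round(float(regressor.predict(sims.unsqueeze(0).numpy())[0])))
    top = torch.topk(sims, k=k).indices.tolist()
    return [concept_ids[i] for i in top]
          </preformat>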
          <p>We also used this system in an ensemble-like design together with a FFNN. In the system
described above, we added a trainable FFNN classifier which was fed the image embeddings from
the CNN encoder. These same embeddings were also used in the calculation of the similarity
matrix. The final output logits O ∈ R^(B×|C|) were formed by interpolating the classifier’s logits
L and the similarity matrix S: O = a · σ(L) + (1 − a) · σ(S), where a is a trainable parameter
and σ is the sigmoid function. Additionally, the system was trained using both the ℒ_CLIP and
the standard binary cross-entropy loss loss_BCE:
ℒ_ensemble(y_multi, S) = [ ℒ_CLIP(y_multi, S) + loss_BCE(y_multi, O) ] / 2 (6)</p>
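          <p>The interpolation and the combined objective can be sketched as follows; clamping the trainable scalar a to [0, 1] is an assumption.</p>
          <preformat>
# Sketch of the interpolation of the classifier logits L with the similarity
# matrix S, and of the combined training objective of Equation 6.
import torch
import torch.nn.functional as F

def combined_outputs(classifier_logits, similarity_matrix, a):
    a = a.clamp(0.0, 1.0)          # trainable scalar, kept in [0, 1]
    return a * torch.sigmoid(classifier_logits) + (1 - a) * torch.sigmoid(similarity_matrix)

def ensemble_loss(targets, outputs, clip_loss_value):
    bce = F.binary_cross_entropy(outputs, targets)
    return (clip_loss_value + bce) / 2
          </preformat>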
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Caption Prediction</title>
        <p>Our submissions in the Caption Prediction sub-task again revolve around three main systems,
two of which follow an encoder-decoder approach. The first one utilizes a CNN as the encoder
and an RNN as the decoder, while the second one is Transformer-based, employing a ViT [19] as
the encoding unit and OpenAI’s GPT-2 [20] as the decoding unit. Furthermore, we implemented
a sequence-to-sequence [13] denoising component, which when employed on top of the two
aforementioned systems forms a novel captioning pipeline.</p>
        <sec id="sec-3-2-cnn-rnn">
          <title>3.2.1. CNN-RNN</title>
          <p>Our first system is based on the CNN-RNN encoder-decoder [17] method, which employs a
CNN encoder and an RNN decoder that generates the caption.</p>
        <p>In Figure 6, we present a high-level overview of the system’s architecture. The CNN encoder
is responsible for extracting image representations, which are then passed to the decoder.
The RNN decoder has been implemented with Gated Recurrent Units (GRU cells) [32] and
concatenates the encoded visual features with the hidden states of its encoding cells. At each
recurrent step, the previous GRU cell’s state, which contains knowledge about the extracted
visual features and the part of the caption that has been generated so far, is passed alongside the
previously generated word as an input to the current GRU cell. Afterwards, the GRU output is
passed to an MLP component that yields a probability distribution over the model’s vocabulary
words and the one with the highest probability is selected as the sentence’s next token. This
recurrent process terminates once a special token, denoting the end of the generated sequence,
is predicted. The model is trained by attempting to maximize the likelihood of the provided
ground truth caption given a visual instance [17].</p>
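        <p>A compact sketch of the decoding loop (greedy variant) is given below; the dimensions, the layer names and the assumption that the image features have already been projected to the GRU hidden size are illustrative.</p>
        <preformat>
# Sketch of the GRU decoding loop described above (greedy variant).
import torch
import torch.nn as nn

class GRUDecoder(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRUCell(embed_dim, hidden_dim)
        self.mlp = nn.Linear(hidden_dim, vocab_size)

    def generate(self, image_features, start_id, end_id, max_len=40):
        # image_features: (1, hidden_dim), assumed already projected from the CNN
        hidden = image_features
        token = torch.tensor([start_id])
        caption_ids = []
        for _ in range(max_len):
            hidden = self.gru(self.embed(token), hidden)
            token = self.mlp(hidden).argmax(dim=-1)   # next-word distribution
            if token.item() == end_id:
                break
            caption_ids.append(token.item())
        return caption_ids
        </preformat>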
        <p>
          Following the pre-processing steps of Show&amp;Tell [17], we first added two special tokens in
each training caption, a &lt;start of sequence&gt; and an &lt;end of sequence&gt; token. Next, we created
the model’s vocabulary by keeping all words that appeared at least 4 times throughout the
training set, replacing the out-of-vocabulary (OOV) words with the &lt;UNK&gt; special token. We
experimented with multiple maximum length values ranging from 40 to 120 tokens, unlike our
last year’s submission [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], where we had used a fixed maximum length of 40 tokens based on
preliminary experiments.
        </p>
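        <p>The vocabulary construction can be sketched as follows; the exact token strings are assumptions.</p>
        <preformat>
# Sketch of the vocabulary construction: special start/end tokens are added to
# every caption and only words seen at least 4 times in the training captions
# are kept; all other words map to the UNK token.
from collections import Counter

START, END, UNK = "&lt;sos&gt;", "&lt;eos&gt;", "&lt;UNK&gt;"

def build_vocab(train_captions, min_count=4):
    counts = Counter(w for caption in train_captions for w in caption.split())
    words = [START, END, UNK] + sorted(w for w, n in counts.items() if n &gt;= min_count)
    return {w: i for i, w in enumerate(words)}

def encode(caption, vocab):
    tokens = [START] + caption.split() + [END]
    return [vocab.get(t, vocab[UNK]) for t in tokens]
        </preformat>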
        <p>As far as the decoding method is concerned, we ran experiments using both greedy and beam
search decoding [33]. In the former option, we selected the word with the highest probability
yielded by the MLP component at each step, while in the latter case we would search for the
most probable sequences of tokens, by maintaining and updating a set of the b best candidates
at each decoding step. The selection of these candidates is based on the likelihood of each
path being the correct choice, calculated as the sum of the log probabilities of the so far
generated sequence’s tokens. Greedy decoding can be considered as a special case of beam
search decoding, when the beam size is equal to one (b = 1). We experimented with numerous
values for the beam size b, specifically b ∈ {2, 3, 5}. Overall, beam search decoding resulted in
better performance than following the greedy choice at each step.</p>
        </sec>
        <sec id="sec-3-2-vit-gpt2">
          <title>3.2.2. ViT-GPT2</title>
          <p>Our second system for the Caption Prediction sub-task is also based on the encoder-decoder
framework, only that in this case we employed Transformer-based encoders and decoders.
Influenced by the expeditious progress in the domain of Large Language Models (LLMs), as
well as the impressive performance that these systems are able to achieve in NLP and Speech
Recognition tasks, we decided to create a pipeline where Transformers are also utilized for
computer vision [34].</p>
        <p>The encoding component of our model, which is responsible for extracting the feature
representation of a given image, consists of a Vision Transformer (ViT) [19] instance loaded
from a pre-trained checkpoint. Regarding the decoding component, we employed GPT-2 [20],
an open source, autoregressive LLM that achieves notable results in numerous text generation
tasks. We also experimented with its distilled version (distilGPT2) as it is considered to be more
time eficient with little to no decrement in performance [ 35]. However, we preferred to use the
GPT-2 base version, as it performed better in preliminary experiments.</p>
        <p>GPT-2 [20] is an autoregressive decoder-only model that is composed of a stack of 12
Transformer decoder blocks. Each one of these blocks sequentially processes the visual representation
of the image, obtained by the image encoder, and the so far generated tokens. Following the
last decoder block, a dense linear layer followed by a Softmax activation function is in charge
of yielding a probability distribution over the model’s vocabulary, and thus predict the next
generated token. The process described so far forms a single decoding step. A vector
containing the word embedding of each step’s output, concatenated with its positional embedding is
autoregressively fed to the bottom decoder block. This gradual, step-wise generation procedure
is repeated until a special token, which denotes the end of the generated sequence, is predicted.
We experimented with multiple decoding strategies, namely greedy decoding, beam search
decoding (as described in Section 3.2.1), as well as top-k and nucleus sampling [33, 36]. Both
beam search decoding and the two sampling methods achieved equally competitive performance.
In addition, we followed the same pre-processing steps that we have previously described.</p>
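          <p>For illustration, the sketch below shows how such decoding strategies can be invoked with the transformers library; the public checkpoint name and the generation hyper-parameters are assumptions and do not correspond to our exact models.</p>
          <preformat>
# Sketch of beam search and top-k / nucleus sampling with a ViT encoder and a
# GPT-2 decoder wrapped in a VisionEncoderDecoderModel.
from PIL import Image
from transformers import AutoTokenizer, ViTImageProcessor, VisionEncoderDecoderModel

ckpt = "nlpconnect/vit-gpt2-image-captioning"   # illustrative public checkpoint
model = VisionEncoderDecoderModel.from_pretrained(ckpt)
processor = ViTImageProcessor.from_pretrained(ckpt)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

image = Image.open("example.jpg").convert("RGB")
pixels = processor(image, return_tensors="pt").pixel_values

beam_ids = model.generate(pixels, max_length=100, num_beams=5)       # beam search
sampled_ids = model.generate(pixels, max_length=100, do_sample=True,
                             top_k=50, top_p=0.9)                    # top-k / nucleus
print(tokenizer.decode(beam_ids[0], skip_special_tokens=True))
          </preformat>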
        </sec>
        <sec id="sec-3-2-1">
          <title>3.2.3. 2xE-D: Captioning Model + Seq2Seq denoiser</title>
          <p>Our third system is a denoising model, which we employ on top of the two aforementioned
caption prediction systems (see Sections 3.2.1 and 3.2.2) resulting in a novel captioning pipeline.
The model is trained on the captions output by our two basic systems and the corresponding
ground truth captions, in order to improve readability. Both the original generative pipeline and
the denoising component feature an encoder and a decoder. Hence, we call this system 2xE-D,
where E-D denotes the encoder-decoder architecture. For the denoising part, we experimented
with two prominent sequence to sequence architectures, BART [37] and T5 [38].</p>
          <p>BART is a denoising autoencoder which is trained by reconstructing text that has been
distorted by an arbitrary noise function [37]. It consists of a bidirectional encoder and a
left-to-right autoregressive decoder. The denoising autoencoder is pre-trained on input
sequences that have been altered by one or more of the following corruption processes,
applied stochastically:
• Random token masking.
• Random token deletion.
• Text infilling, i.e., masking spans of random tokens (employing a single masking token).
• Sentence permutation.</p>
          <p>• Document rotation.</p>
          <p>In detail, we employed an instance of BART on a task similar to the one that it
was originally pre-trained on. We started off from a pre-trained BART checkpoint and
fine-tuned it by providing the intermediate captions as input and the respective ground truth
captions as the target text. We utilized the large version of the model, which contains 12
bidirectional encoder blocks and an equal number of decoder blocks. Table 4 shows three
captions; the provided ground truth, the CNN-RNN generated one, and its revised version
generated by our Seq2Seq denoising model. The denoiser was able to correct part of the initial
generated caption, as it successfully revised the existing medical condition from “a mass” to “a
lesion” and also accurately re-addressed the point of contention from “a liver lobe” to “a hepatic
lobe”. Moreover, it chose to state “Computed Tomography” as its abbreviation (“CT”), which is
a common tactic in diagnostic reports [39].</p>
          <p>Extending this idea, we decided to fine-tune BART on a larger collection of noisy and denoised
caption pairs. Therefore, we implemented a noise-insertion function, in accordance with the
aforementioned noise transformations that BART is pre-trained on [37], and applied it to our
training ground truth captions. In this way, we created an alternative text-to-text training set,
consisting of (noisy - ground truth) caption pairs. We once again fine-tuned a pre-trained BART
instance on the newly created dataset in order to build a ClinicalBART model, hoping it would
acquire extended knowledge of the biomedical domain, and therefore generate more medically
fluent text sequences.</p>
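          <p>A simplified sketch of such a noise-insertion function is shown below; the corruption probabilities are illustrative assumptions, and sentence permutation is approximated by a word swap, since most captions are single sentences.</p>
          <preformat>
# Sketch of a BART-style noise-insertion function used to create (noisy, gold)
# caption pairs for the further pre-training of ClinicalBART.
import random

def add_noise(caption, mask_token="&lt;mask&gt;", p_delete=0.1, p_mask=0.15, p_swap=0.2):
    words = caption.split()
    noisy = []
    for w in words:
        r = random.random()
        if r &lt; p_delete:
            continue                                       # random token deletion
        noisy.append(mask_token if r &lt; p_delete + p_mask else w)  # token masking
    if len(noisy) &gt; 1 and random.random() &lt; p_swap:        # light permutation
        i, j = random.sample(range(len(noisy)), 2)
        noisy[i], noisy[j] = noisy[j], noisy[i]
    return " ".join(noisy)
          </preformat>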
          <p>Furthermore, we also experimented with T5, another encoder-decoder model pre-trained on
a series of both supervised and unsupervised tasks [38], including denoising tasks. Last but
not least, we were granted access to ClinicalT5 through PhysioNet4. ClinicalT5 is a biomedical
version of T5, pre-trained on the MIMIC-III dataset [40]. We further fine-tuned ClinicalT5
similarly to BART, in order to rephrase the intermediate captions produced by the CNN-RNN
model (see Section 3.2.1) to approximate the gold ones.</p>
          <p>4https://www.physionet.org/content/clinical-t5/1.0.0/, Last accessed: 2023-07-07</p>
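          <p>A condensed sketch of this fine-tuning is shown below; BART-large is used for illustration (ClinicalT5 would be loaded from its own checkpoint), and the training loop is heavily simplified.</p>
          <preformat>
# Sketch of fine-tuning a seq2seq denoiser on (generated caption, gold caption)
# pairs; padding/batching and the optimizer schedule are simplified.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/bart-large")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

def training_step(generated_captions, gold_captions):
    inputs = tokenizer(generated_captions, return_tensors="pt",
                       padding=True, truncation=True)
    labels = tokenizer(gold_captions, return_tensors="pt",
                       padding=True, truncation=True).input_ids
    loss = model(**inputs, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
          </preformat>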
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Experiments, Submissions and Results</title>
      <p>
        In this section, we provide details and insight into our experiments regarding this year’s
campaign [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. Moreover, we share details about our submissions and the scores achieved in our
held-out development set, as well as the official test set of the competition for both sub-tasks.
      </p>
      <sec id="sec-4-1">
        <title>4.1. Concept Detection</title>
        <p>In the Concept Detection sub-task, we submitted our nine best performing models, after
evaluating them on our held-out development set. We submitted a single instance of our CNN+FFNN
model (see Section 3.1.1) and two instances of our Contrastive learning-based tagger
(henceforward ContrastiveTagger, see Section 3.1.3). The rest of our submissions were ensemble systems.
We investigated the combination of the predictions of two or more instances by calculating
the union or the intersection of their predicted concept sets. We also experimented with
a majority voting rule. That is, given an ensemble system consisting of n models, a concept
is assigned to the image if at least ⌊n/2⌋ + 1 models predicted it. All of our submitted ensemble
systems were combinations of our CNN+FFNN and CNN+FFNN-based multi-task classifiers
(henceforward MultiTask-CNN+FFNN ).</p>
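        <p>The three combination rules can be sketched as follows, given the concept sets predicted by the individual models for a single image.</p>
        <preformat>
# Sketch of the union, intersection and majority-voting ensembles over the
# concept sets predicted by n individual models for a single image.
from collections import Counter

def union_ensemble(predictions):
    return set().union(*predictions)

def intersection_ensemble(predictions):
    return set.intersection(*map(set, predictions))

def majority_ensemble(predictions):
    n = len(predictions)
    votes = Counter(c for pred in predictions for c in set(pred))
    return {c for c, v in votes.items() if v &gt;= n // 2 + 1}
        </preformat>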
        <p>
          This year’s primary evaluation metric for the Concept Detection sub-task was the F1-score
between the predicted and the ground truth concepts. It is calculated as the sum of the F1-scores
of the individual test images, divided by the total number of test images. Each partial score is
calculated between the binary multi-hot candidate vector and the corresponding ground truth
vector. Precisely, let F1 be the overall F1-score and F1_i the individual F1-score of test image i.
Moreover, let P_i and G_i be the predicted and ground truth concepts of image i, and let T
denote the test set. Then, F1 is computed as:
F1 = (1 / |T|) · ∑_{i ∈ T} F1_i(P_i, G_i) (7)
Moreover, a secondary evaluation metric was calculated that only included manually validated
concepts, such as anatomy, topography and modality [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ].
        </p>
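        <p>For completeness, the metric of Equation 7 can be computed as in the sketch below; the handling of images with empty predicted and gold concept sets is an assumption.</p>
        <preformat>
# Sketch of the primary Concept Detection metric: per-image F1 between the
# predicted and gold concept sets, averaged over the test images.
def image_f1(predicted, gold):
    predicted, gold = set(predicted), set(gold)
    if not predicted and not gold:
        return 1.0                      # assumption for empty/empty pairs
    tp = len(predicted.intersection(gold))
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def mean_f1(all_predicted, all_gold):
    scores = [image_f1(p, g) for p, g in zip(all_predicted, all_gold)]
    return sum(scores) / len(scores)
        </preformat>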
        <p>
          In the case of our first two systems ( CNN+FFNN, MultiTask-CNN+FFNN ), and specifically
regarding their backbone component, we experimented with a wide range of CNN encoders.
Namely, we trained the two networks using state-of-the-art CNN architectures, like EfficientNet
[41], DenseNet [
          <xref ref-type="bibr" rid="ref12">42</xref>
          ] and ResNet [
          <xref ref-type="bibr" rid="ref13">43</xref>
          ]. In addition, we extended the CNN experimental range
compared to our previous participations, by utilizing MobileNet [
          <xref ref-type="bibr" rid="ref14">44</xref>
          ], InceptionNet [
          <xref ref-type="bibr" rid="ref15">45</xref>
          ] and
CheXNet [
          <xref ref-type="bibr" rid="ref16">46</xref>
          ]. We also experimented with Vision Transformers (ViT) [19], as well as older CNN
encoders like VGG [
          <xref ref-type="bibr" rid="ref17">47</xref>
          ] and AlexNet [
          <xref ref-type="bibr" rid="ref18">48</xref>
          ]. However, they were not included in our submissions
as they did not provide competitive results. These were either pre-trained on ImageNet [
          <xref ref-type="bibr" rid="ref19">49</xref>
          ] or
were trained with uniformly initialized weights.
        </p>
        <p>
          As expected, the model instances pre-trained on ImageNet [
          <xref ref-type="bibr" rid="ref19">49</xref>
          ] performed better than the
randomly initialized ones in terms of the corresponding F1 score. The training loss converged
faster, despite the fact that biomedical images like the ones we deal with come from a different
domain compared to ImageNet’s training set. Moreover, CNN backbones outperformed ViT,
which is in line with previous observations that they typically outperform other architectures
such as ViT and Hybrid-ViT in classification and semantic segmentation for generic images [
          <xref ref-type="bibr" rid="ref20">50</xref>
          ],
as well as classification of biomedical images [
          <xref ref-type="bibr" rid="ref5">29, 5</xref>
          ]. EfficientNetB0 [41] and DenseNet-121 [
          <xref ref-type="bibr" rid="ref12">42</xref>
          ]
were the two best performing ones for both systems in terms of the primary evaluation metric
(Equation 7).
        </p>
        <p>
          We also experimented with freezing some of the encoder’s layers, in order to speed up the
training process and also prevent their weights from being modified, in an effort to preserve the
model’s already acquired knowledge [
          <xref ref-type="bibr" rid="ref21">51</xref>
          ] and potentially prevent catastrophic forgetting [
          <xref ref-type="bibr" rid="ref22">52</xref>
          ].
However, our experiments showed that training the whole network resulted in a higher F1 score,
while the speed-up in terms of training time was not large enough to trade off the
higher performance levels obtained by a fully-trainable network. Moreover, we experimented
with data augmentation techniques [
          <xref ref-type="bibr" rid="ref23">53</xref>
          ] (e.g., random rotation, random cropping) during the
loading of each image, but they did not provide any significant improvement in the system’s
performance.
        </p>
        <p>Furthermore, we observed that despite the relatively high performance in terms of the primary
evaluation metric, our models were not able to achieve satisfactory results in the prediction
of the under-represented concepts. In other words, the high F1-score levels were due to the
system’s good performance in the common concepts (see Table 1), rather than to an overall
classification ability. In an attempt to tackle this behaviour, we experimented with training
a different instance of our CNN+FFNN classifier for each one of the four main modalities, in
hopes that each classifier would be able to excel at some modality-specific characteristics. The
results were mixed; two of the modality classifiers (X-Ray and MRI) were able to achieve almost
30% increase in their performance, while the other two performed even worse compared to
the original version of the model. Overall, this approach did not manage to achieve more
competitive results.</p>
        <p>In Table 5, we list all the methods we experimented with during our participation in this
year’s Concept Detection sub-task, along with the best score achieved in our development set
for each one of the methods, as we experimented with numerous configurations (i.e. learning
rate scheduler, number of hidden layers). To facilitate easier referencing in the rest of this
section, we assign a unique ID to each method in the first column of Table 5. Moreover, in Table
6, we present an overview of our nine valid submissions regarding the Concept Detection task.
We include each method’s performance on the primary F1-score in both the development and
test subset, as well as the official results regarding the secondary evaluation metric. The last
column contains the rank of our systems across all the task’s submitted runs.</p>
        <p>
          Our team officially ranked 1st among 10 participating research groups in terms of the primary
evaluation metric. Our best performing model was a union ensemble consisting of three
instances of our CNN+FFNN system, where three different encoding backbones were used:
EfficientNetB0 [41], EfficientNetB0v2 and DenseNet121 [
          <xref ref-type="bibr" rid="ref12">42</xref>
          ]. Furthermore, we ranked 2nd in the
secondary evaluation metric by employing a single CNN+FFNN instance, using EfficientNetB0
[41] as the image encoder.
        </p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Caption Prediction</title>
        <p>In the Caption Prediction sub-task, we also submitted nine systems, which were selected after
evaluating them on our development set. We submitted two instances of our CNN-RNN
encoder-decoder (see Section 3.2.1) and two instances of our Transformer-based ViT-GPT2 model (see
Section 3.2.2). The difference between each model’s submissions lies in the number of beams
used during the beam search decoding. We submitted instances where the beam size was equal
to three and five, and therefore denote them as CNN-RNN-BS3, CNN-RNN-BS5, ViT-GPT2-BS3
and ViT-GPT2-BS5. In addition, we submitted five instances of our Seq2Seq denoising system
employed on top of the four aforementioned submissions. The denoising models utilized were
T5 [38], ClinicalT5, BART [37], as well as ClinicalBART (BART-large further pre-trained on
ImageCLEF captions, see Section 3.2.3).</p>
        <p>
          In this year’s campaign, BERTscore [
          <xref ref-type="bibr" rid="ref24">54</xref>
          ] was used as the primary evaluation metric, in contrast
to last year, which used BLEU [
          <xref ref-type="bibr" rid="ref25">55</xref>
          ]. ROUGE-1 [
          <xref ref-type="bibr" rid="ref26">56</xref>
          ] constitutes the secondary evaluation metric.
Unlike BLEU and ROUGE-1, BERTscore [
          <xref ref-type="bibr" rid="ref24">54</xref>
          ] offers a more contextual evaluation, as it
leverages BERT’s [31] word embeddings and attempts to compute the semantic affinity between
the words of the predicted and ground truth captions based on their cosine similarity.
        </p>
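        <p>For illustration, the primary metric can be computed with the bert-score package as sketched below; the underlying checkpoint selected by lang="en" is an assumption, as the organizers may use a different configuration.</p>
        <preformat>
# Sketch of computing BERTScore between generated and gold captions.
from bert_score import score

candidates = ["ct scan shows a lesion in the right hepatic lobe"]
references = ["computed tomography showing a lesion of the right hepatic lobe"]

precision, recall, f1 = score(candidates, references, lang="en")
print(f"BERTScore F1: {f1.mean().item():.4f}")
        </preformat>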
        <p>
          Regarding our CNN-RNN model, we relied on our last year’s experiments and only adopted
the encoding architectures that performed best: EfficientNetB0 [41] and DenseNet121 [
          <xref ref-type="bibr" rid="ref12">42</xref>
          ]. The
encoder extracted the image representations, which we stored, before feeding them to the RNN
decoding unit. We experimented with retrieving the image features from either a pre-trained
CNN instance or the encoding unit of our best performing CNN+FFNN classification model,
in hopes that it has learned to generate quality biomedical image representations through the
training procedure. An interesting research point would be to try to train the CNN and the RNN
encoder concurrently. Overall, the CNN-RNN encoder-decoder achieved decent performance in
the BERTscore [
          <xref ref-type="bibr" rid="ref24">54</xref>
          ] metric and, as in our last year’s participation [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ] noteworthy scores in the
ROUGE-1 [
          <xref ref-type="bibr" rid="ref26">56</xref>
          ] evaluation metric.
        </p>
        <p>
          Our ViT-GPT2 model did not yield the expected results. We experimented with numerous
configurations, like higher or lower learning rate along with scheduling techniques, increased
generation penalty, as well as data augmentation. Specifically, we transformed each image
on the fly, during the loading process. We first rotated it by an angle of 30 degrees in a
random direction and then resized it to 224 × 224 × 3 pixels, which is the size that we selected
to employ for every image. In this way, a slightly different view of the same image was passed
to the encoding component in each epoch aiming to increase the data variety, improve the
model’s robustness, as well as prevent it from quickly overfitting [
          <xref ref-type="bibr" rid="ref23">53</xref>
          ].
        </p>
        <p>Our best submission, which managed to rank 7th out of 70 submitted systems, was the 2xE-D
model, which is composed of one of the aforementioned captioning models and a subsequent
denoising component. Specifically, the instance that used BART outperformed the three other
denoising models: T5 [38], ClinicalT5 and ClinicalBART (see Section 3.2.3). We also experimented
with multiple configurations, as well as decoding schemes. In this case, beam search decoding
outperformed both nucleus and top-k sampling [33, 36] in multiple preliminary experiments.</p>
        <p>In Table 7, we present a summary of our nine submissions, including the method’s identifiers,
their performance on the primary and secondary metric for both the development and test
set, as well as their official rank across 70 submitted systems. Our group ranked 3rd among 13
teams in the Caption Prediction sub-task based on the primary evaluation metric. Our best
model was BART@CNN-RNN-BS3, followed closely by ClinicalBART@CNN-RNN-BS3, the
biomedically fine-tuned instance of the same system. In Table 8 we present our
submissions’ performance on all the official metrics, as reported by the organizers, in order to
provide a more thorough evaluation of their capabilities.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions</title>
      <p>
        Regarding Concept Detection, our best-performing system was a CNN+FFNN pipeline (Section
3.1.1), while our remaining submissions included a CNN+FFNN-based multi-task classifier
(Section 3.1.2), a contrastive learning-based system with a CLIP-like objective (Section 3.1.3) and
ensembles employing the aforementioned approaches based on majority voting, union,
intersection, as well as scaling by a factor a in the case of our contrastive system. Our ensembles based
on the CNN+FFNN pipeline, including its multi-task version, were ranked at positions 1, 2, 3, 4,
5, 6 and 7 among approximately 60 systems in the respective sub-task, which is consistent with
their successful performance in previous years [
        <xref ref-type="bibr" rid="ref10 ref11">10, 11</xref>
        ], while our best-performing individual
CNN+FFNN system was ranked at position 8 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>
        In the Caption Prediction sub-task, we ranked 3rd among the participating groups, by both
extending our previous work [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ] and exploiting the state-of-the-art methods in NLP. Our
systems included a typical Show and Tell model [17] with a CNN backbone encoder and a
recurrent decoder with GRU cells [32], a Transformer-based pipeline using a ViT encoder
[19] and GPT-2 decoder [20], as well as a sequence-to-sequence [13] denoising autoencoder
employed on top of the two other systems, in order to rephrase and correct the initial draft
radiology reports.
      </p>
      <p>
        In future work, we plan to expand our research in biomedical LLMs and their reasoning
abilities, towards the goal of exploiting the generative capabilities of models like BioGPT [
        <xref ref-type="bibr" rid="ref27">57</xref>
        ]
or BioMedLM [58] to produce high-quality captions; possibly via instruction tuning and, more
generally, alignment with user needs [59]. Furthermore, apart from making use of the knowledge
encoded in the weights of the LLMs, we aim to shed light on the use of dense retrieval [60] in
biomedical image captioning [
        <xref ref-type="bibr" rid="ref5">5, 29</xref>
        ], based on architectures similar to Retrieval Augmented
Generation [61]. Such pipelines will allow us to increase the LLMs’ capacity by an additional,
non-parametric memory, in the form of a FAISS index [62], towards the goal of improving their
reasoning abilities. We would also be interested to discover potential associations between the
two sub-tasks. Last but not least, the qualitative differences in the captions generated by the
different methods are to be considered, since they highlight their practical usefulness in real-life
scenarios [
        <xref ref-type="bibr" rid="ref5">5, 29</xref>
        ].
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I.
Polosukhin, Attention is All you Need, in: Advances in Neural Information Processing Systems,
volume 30, 2017.
[13] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to Sequence Learning with Neural Networks,
      </p>
      <p>NIPS’14, MIT Press, Cambridge, MA, USA, 2014, p. 3104–3112.
[14] W. X. Zhao, K. Zhou, J. Li, T. Tang, X. Wang, Y. Hou, Y. Min, B. Zhang, J. Zhang, Z. Dong,
Y. Du, C. Yang, Y. Chen, Z. Chen, J. Jiang, R. Ren, Y. Li, X. Tang, Z. Liu, P. Liu, J. Nie, J. Wen,
A survey of large language models, ArXiv abs/2303.18223 (2023).
[15] T. Chen, S. Kornblith, M. Norouzi, G. Hinton, A Simple Framework for Contrastive
Learning of Visual Representations, in: Proceedings of the 37th International Conference
on Machine Learning, ICML’20, 2020.
[16] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell,
P. Mishkin, J. Clark, G. Krueger, I. Sutskever, Learning Transferable Visual Models from
Natural Language Supervision, in: Proceedings of the 38th International Conference on
Machine Learning, volume 139, PMLR, 2021, pp. 8748–8763.
[17] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and Tell: A neural image caption
generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
(2014) 3156–3164.
[18] M. Stefanini, M. Cornia, L. Baraldi, S. Cascianelli, G. Fiameni, R. Cucchiara, From Show to
Tell: A survey on Deep Learning-Based Image Captioning, IEEE Transactions on Pattern
Analysis and Machine Intelligence (2022).
[19] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M.
Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, N. Houlsby, An Image is Worth
16x16 Words: Transformers for Image Recognition at Scale, in: International Conference
on Learning Representations, 2021.
[20] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, et al., Language models are
unsupervised multitask learners, OpenAI blog 1 (2019).
[21] O. Bodenreider, The Unified Medical Language System (UMLS): integrating biomedical
terminology, Nucleic Acids Research 32 (2004).
[22] O. Pelka, S. Koitka, J. Rückert, F. Nensa, C. Friedrich, Radiology Objects in COntext (ROCO):
A Multimodal Image Dataset: 7th Joint International Workshop, CVII-STENT 2018 and
Third International Workshop, LABELS 2018, Held in Conjunction with MICCAI 2018,
Granada, Spain, September 16, 2018, Proceedings, 2018, pp. 180–189.
[23] G. Liu, T. H. Hsu, M. B. A. McDermott, W. Boag, W. Weng, P. Szolovits, M. Ghassemi,
Clinically Accurate Chest X-Ray Report Generation, in: Proceedings of the Machine
Learning for Healthcare Conference, MLHC 2019, 9-10 August 2019, Ann Arbor, Michigan,
USA, volume 106 of Proceedings of Machine Learning Research, PMLR, 2019, pp. 249–269.
[24] F. Charalampakos, Exploring Deep Learning Methods for Medical Image Tagging, Master’s
thesis, Athens University of Economics and Business, Athens, Greece, 2022.
[25] F. Radenović, G. Tolias, O. Chum, Fine-tuning CNN image retrieval with no human
annotation, IEEE Transactions on Pattern Analysis and Machine Intelligence 41 (2019)
1655–1668.
[26] A. Zafar, M. Aamir, N. Nawi, A. Arshad, S. Riaz, A. Alruban, A. Dutta, S. Alaybani, A
Comparison of Pooling Methods for Convolutional Neural Networks, Applied Sciences 12
(2022) 8643.
[27] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: A simple
way to prevent Neural Networks from overfitting, Journal of Machine Learning Research
15 (2014) 1929–1958.
[28] D. Kingma, J. Ba, Adam: A method for stochastic optimization, International Conference
on Learning Representations (2014).
[29] G. Moschovis, E. Fransén, Neuraldynamicslab at imageclef medical 2022, in: CLEF2022</p>
      <p>Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy, 2022.
[30] A. Gotmare, N. S. Keskar, C. Xiong, R. Socher, A Closer Look at Deep Learning Heuristics:
Learning rate restarts, Warmup and Distillation, in: 7th International Conference on
Learning Representations, ICLR 2019, New Orleans, LA, USA, 2019.
[31] J. Devlin, M. W. Chang, K. Lee, K. Toutanova, BERT: Pre-training of Deep Bidirectional
Transformers for Language Understanding, in: Proceedings of the 2019 Conference of
the North American Chapter of the Association for Computational Linguistics: Human
Language Technologies, Volume 1 (Long and Short Papers), Association for Computational
Linguistics, Minneapolis, Minnesota, USA, 2019, pp. 4171–4186.
[32] J. Chung, Ç. Gülçehre, K. Cho, Y. Bengio, Empirical Evaluation of Gated Recurrent Neural
Networks on Sequence Modeling, CoRR (2014).
[33] S. Zarrieß, H. Voigt, S. Schüz, Decoding Methods in Neural Language Generation: A
Survey, Information 12 (2021).
[34] M. Naseer, M. Hayat, S. W. Zamir, F. Khan, M. Shah, Transformers in Vision: A Survey,
ACM Computing Surveys 54 (2022).
[35] T. Li, Y. E. Mesbahi, I. Kobyzev, A. Rashid, A. Mahmud, N. Anchuri, H. Hajimolahoseini,
Y. Liu, M. Rezagholizadeh, A short study on compressing decoder-based Language Models,
ArXiv abs/2110.08460 (2021).
[36] G. Wiher, C. Meister, R. Cotterell, On Decoding Strategies for Neural Text Generators,
Transactions of the Association for Computational Linguistics 10 (2022) 997–1012.
[37] M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, L.
Zettlemoyer, BART: Denoising Sequence-to-Sequence Pre-training for Natural Language
Generation, Translation, and Comprehension, in: Proceedings of the 58th Annual Meeting of
the Association for Computational Linguistics, Association for Computational Linguistics,
Online, 2020, pp. 7871–7880.
[38] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P. J. Liu,
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, J.
Mach. Learn. Res. 21 (2020) 140:1–140:67.
[39] W. Qi, P. Stetson, A study of abbreviations in clinical notes, AMIA Annual Symposium
Proceedings (2007) 821–825.
[40] A. Johnson, T. Pollard, L. Shen, L. W. Lehman, M. Feng, M. Ghassemi, B. Moody, P. Szolovits,
L. Celi, R. Mark, MIMIC-III, a freely accessible critical care database, Scientific Data 3
(2016) 160035.
[41] M. Tan, Q. V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks,
in: Proceedings of the 36th International Conference on Machine Learning, ICML 2019,
9-15 June 2019, Long Beach, California, USA, volume 97 of Proceedings of Machine Learning
Research, PMLR, 2019, pp. 6105–6114.
[58] E. Bolton, D. Hall, M. Yasunaga, T. Lee, C. Manning, P. Liang, BioMedLM, 2022.
[59] L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal,
K. Slama, A. Ray, J. Schulman, J. Hilton, F. Kelton, L. Miller, M. Simens, A. Askell, P.
Welinder, P. Christiano, J. Leike, R. Lowe, Training language models to follow instructions with
human feedback, in: Advances in Neural Information Processing Systems, 2022.
[60] V. Karpukhin, B. Oguz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, W. Yih, Dense
Passage Retrieval for Open-Domain Question Answering, in: Proceedings of the 2020
Conference on Empirical Methods in Natural Language Processing (EMNLP), Association
for Computational Linguistics, Online, 2020, pp. 6769–6781.
[61] P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis,
W. Yih, T. Rocktäschel, S. Riedel, D. Kiela, Retrieval-Augmented Generation for
Knowledge-Intensive NLP Tasks, in: Advances in Neural Information Processing Systems, volume 33,
Curran Associates, Inc., 2020, pp. 9459–9474.
[62] J. Johnson, M. Douze, H. Jégou, Billion-scale similarity search with GPUs, IEEE
Transactions on Big Data 7 (2019) 535–547.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Snider</surname>
          </string-name>
          , G. Adams,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Garcia Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papachrysos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schöler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Coman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stan</surname>
          </string-name>
          , G. Ioannidis,
          <string-name>
            <given-names>H.</given-names>
            <surname>Manguinhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deshayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2023: Multimedia retrieval in medical, social media and recommender systems applications</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and Interaction,
          <source>Proceedings of the 14th International Conference of the CLEF Association (CLEF 2023)</source>
          , Springer Lecture Notes in Computer Science LNCS, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          , L. Bloch,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          , Overview of ImageCLEFmedical 2023 -
          <article-title>Caption Prediction and Concept Detection</article-title>
          , in: CLEF2023 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Androutsopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Papamichail</surname>
          </string-name>
          ,
          <article-title>Diagnostic captioning: a survey</article-title>
          ,
          <source>Knowledge and Information Systems</source>
          <volume>64</volume>
          (
          <year>2022</year>
          )
          <fpage>1</fpage>
          -
          <lpage>32</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>H. C.</given-names>
            <surname>Shin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Roberts</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Lu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Demner-Fushman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Yao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <article-title>Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation</article-title>
          ,
          <source>2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          (
          <year>2016</year>
          )
          <fpage>2497</fpage>
          -
          <lpage>2506</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <article-title>Medical image captioning based on Deep Architectures, Master's thesis</article-title>
          , KTH Royal Institute of Technology, Stockholm, Sweden,
          <year>2022</year>
          . URL: http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-323528, Last accessed: 2023-07-07.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>A Survey on Biomedical Image Captioning</article-title>
          ,
          <source>in: Proceedings of the Second Workshop on Shortcomings in Vision and Language</source>
          , Association for Computational Linguistics, Minneapolis, Minnesota,
          <year>2019</year>
          , pp.
          <fpage>26</fpage>
          -
          <lpage>36</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , AUEB NLP group at ImageCLEFmed
          <source>Caption</source>
          <year>2019</year>
          , in: Working Notes of CLEF 2019 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , Lugano, Switzerland, September 9-
          <issue>12</issue>
          , volume
          <volume>2380</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>B.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , AUEB NLP group at ImageCLEFmed
          <source>Caption</source>
          <year>2020</year>
          , in: Working Notes of CLEF 2020 -
          <article-title>Conference and Labs of the Evaluation Forum</article-title>
          , Thessaloniki, Greece,
          <source>September 22-25</source>
          , volume
          <volume>2696</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          ,
          <article-title>Medical Image Tagging by Deep Learning and Retrieval</article-title>
          , in: Experimental IR Meets Multilinguality, Multimodality, and
          <source>Interaction: 11th International Conference of the CLEF Association, CLEF</source>
          <year>2020</year>
          , Thessaloniki, Greece,
          <source>September 22-25</source>
          ,
          <year>2020</year>
          , Proceedings,
          <year>2020</year>
          , pp.
          <fpage>154</fpage>
          -
          <lpage>166</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalampakos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kougia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , AUEB NLP group at ImageCLEFmed Caption tasks
          <year>2021</year>
          ,
          <source>in: Proceedings of the Working Notes of CLEF 2021 - Conference and Labs of the Evaluation Forum</source>
          , Bucharest, Romania,
          <source>September 21-24</source>
          , volume
          <volume>2936</volume>
          <source>of CEUR Workshop Proceedings</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>1184</fpage>
          -
          <lpage>1200</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>F.</given-names>
            <surname>Charalampakos</surname>
          </string-name>
          , G. Zachariadis,
          <string-name>
            <given-names>J.</given-names>
            <surname>Pavlopoulos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Karatzas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Trakas</surname>
          </string-name>
          ,
          <string-name>
            <surname>I. Androutsopoulos</surname>
          </string-name>
          , AUEB NLP Group at ImageCLEFmedical
          <source>Caption</source>
          <year>2022</year>
          , in: CLEF2022 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bologna, Italy,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [42]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. Van Der</given-names>
            <surname>Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          , Densely Connected Convolutional Networks,
          <source>in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2017</year>
          , pp.
          <fpage>2261</fpage>
          -
          <lpage>2269</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [43]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , S. Ren,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep Residual Learning for Image Recognition</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [44]
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Howard</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Chen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Kalenichenko</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Weyand</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Andreetto</surname>
          </string-name>
          , H. Adam,
          <article-title>MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications</article-title>
          , CoRR (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [45]
          <string-name>
            <given-names>C.</given-names>
            <surname>Szegedy</surname>
          </string-name>
          , W. Liu,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Sermanet</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Reed</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Anguelov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Erhan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Vanhoucke</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rabinovich</surname>
          </string-name>
          ,
          <article-title>Going deeper with convolutions</article-title>
          ,
          <source>in: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          ,
          <year>2015</year>
          , pp.
          <fpage>1</fpage>
          -
          <lpage>9</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [46]
          <string-name>
            <given-names>P.</given-names>
            <surname>Rajpurkar</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Irvin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Yang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Mehta</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Duan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D. Y.</given-names>
            <surname>Ding</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Bagul</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. P.</given-names>
            <surname>Langlotz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. S.</given-names>
            <surname>Shpanskaya</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Y.</given-names>
            <surname>Ng</surname>
          </string-name>
          ,
          <article-title>CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning</article-title>
          ,
          <source>CoRR abs/1711.05225</source>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [47]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          ,
          <source>in: 3rd International Conference on Learning Representations, ICLR</source>
          <year>2015</year>
          , San Diego, CA, USA, May 7-
          <issue>9</issue>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [48]
          <string-name>
            <given-names>A.</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          ,
          <article-title>One weird trick for parallelizing convolutional neural networks</article-title>
          ,
          <source>CoRR</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [49]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition</source>
          ,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [50]
          <string-name>
            <given-names>I.</given-names>
            <surname>Athanasiadis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Moschovis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Tuoma</surname>
          </string-name>
          ,
          <article-title>Weakly-Supervised Semantic Segmentation via Transformer Explainability</article-title>
          ,
          <source>in: ML Reproducibility Challenge 2021 (Fall Edition)</source>
          ,
          <year>2022</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [51]
          <string-name>
            <given-names>I.</given-names>
            <surname>Kandel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Castelli</surname>
          </string-name>
          ,
          <article-title>How Deeply to Fine-Tune a Convolutional Neural Network: A Case Study Using a Histopathology Dataset</article-title>
          ,
          <source>Applied Sciences</source>
          <volume>10</volume>
          (
          <year>2020</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [52]
          <string-name>
            <surname>R. M. French</surname>
          </string-name>
          ,
          <article-title>Catastrophic forgetting in connectionist networks</article-title>
          ,
          <source>Trends in Cognitive Sciences</source>
          <volume>3</volume>
          (
          <year>1999</year>
          )
          <fpage>128</fpage>
          -
          <lpage>135</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          [53]
          <string-name>
            <given-names>A.</given-names>
            <surname>Mikołajczyk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Grochowski</surname>
          </string-name>
          ,
          <article-title>Data augmentation for improving deep learning in image classification problem</article-title>
          , in: 2018
          <source>International Interdisciplinary PhD Workshop</source>
          (IIPhDW),
          <year>2018</year>
          , pp.
          <fpage>117</fpage>
          -
          <lpage>122</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          [54]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y. Artzi,</surname>
          </string-name>
          <article-title>BERTScore: Evaluating Text Generation with BERT</article-title>
          ,
          <source>in: 8th International Conference on Learning Representations, ICLR</source>
          <year>2020</year>
          ,
          Addis Ababa, Ethiopia,
          <source>April 26-30</source>
          ,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          [55]
          <string-name>
            <given-names>K.</given-names>
            <surname>Papineni</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Roukos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Ward</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W. J.</given-names>
            <surname>Zhu</surname>
          </string-name>
          ,
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          , in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics
          , Philadelphia, Pennsylvania, USA,
          <year>2002</year>
          , pp.
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          [56]
          <string-name>
            <given-names>C. Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          , in: Text Summarization Branches Out, Association for Computational Linguistics
          , Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          [57]
          <string-name>
            <given-names>R.</given-names>
            <surname>Luo</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Xia</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Qin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Zhang</surname>
          </string-name>
          , H. Poon, T. Y. Liu,
          <article-title>BioGPT: generative pre-trained transformer for biomedical text generation and mining</article-title>
          , Briefings in Bioinformatics (
          <year>2022</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>