<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>Conference and Labs of the Evaluation Forum, September</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Detecting Concepts and Generating Captions from Medical Images: Contributions of the VCMI Team to ImageCLEFmedical Caption 2023</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Isabel Rio-Torto</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cristiano Patrício</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Helena Montenegro</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tiago Gonçalves</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jaime S. Cardoso</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
          <xref ref-type="aff" rid="aff3">3</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Departamento de Ciência de Computadores, Faculdade de Ciências, Universidade do Porto</institution>
          ,
          <addr-line>Rua do Campo Alegre s/n, 4169-007 Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Departamento de Informática, Universidade da Beira Interior, Rua Marquês de Ávila e Bolama</institution>
          ,
          <addr-line>6201-001 Covilhã</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Faculdade de Engenharia, Universidade do Porto</institution>
          ,
          <addr-line>Rua Dr. Roberto Frias s/n, 4200-465 Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
        <aff id="aff3">
          <label>3</label>
          <institution>INESC TEC</institution>
          ,
          <addr-line>Campus da FEUP Rua Dr. Roberto Frias s/n, 4200-465 Porto</addr-line>
          ,
          <country country="PT">Portugal</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2023</year>
      </pub-date>
      <volume>1</volume>
      <fpage>8</fpage>
      <lpage>21</lpage>
      <abstract>
        <p>This paper presents the main contributions of the VCMI Team to the ImageCLEFmedical Caption 2023 task. We addressed both the concept detection and caption prediction tasks. Regarding concept detection, our team employed different approaches to assign concepts to medical images: multi-label classification, adversarial training, autoregressive modelling, image retrieval, and concept retrieval. We also developed three model ensembles merging the results of some of the proposed methods. Our best submission obtained an F1-score of 0.4998, ranking 3rd among nine teams. Regarding the caption prediction task, our team explored two main approaches based on image retrieval and language generation. The language generation approaches, based on a vision model as the encoder and a language model as the decoder, yielded the best results, allowing us to rank 5th among thirteen teams, with a BERTScore of 0.6147.</p>
      </abstract>
      <kwd-group>
        <kwd>Concept Retrieval</kwd>
        <kwd>Image Captioning</kwd>
        <kwd>Medical Concept Detection</kwd>
        <kwd>Multi-label Classification</kwd>
        <kwd>Natural Language Generation</kwd>
        <kwd>Vision Transformers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>
        ImageCLEF 2023 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] is a multi-modal challenge organised as part of the CLEF Initiative Labs<sup>1</sup>
(Conference and Labs of the Evaluation Forum, formerly known as Cross-Language Evaluation
Forum) set to promote the evaluation of technologies for annotation, indexing, classification
and retrieval of multi-modal data. The 2023 edition included four challenges from diverse
applications (i.e. medical, social media and Internet, and content recommendation).
      </p>
      <p>
        Similarly to last year [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], our team, composed of members of the Visual Computing and
Machine Intelligence (VCMI) Research Group of the Institute for Systems and Computer
Engineering, Technology and Science (INESC TEC) from Porto, Portugal, participated in the
ImageCLEFmedical Caption 2023 task [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] where the goal is to challenge the scientific
community to design and train automatic algorithms capable of interpreting and summarising the
insights gained from medical images. Once again, this challenge consisted of two independent,
but complementary, tasks: concept detection, which aims to identify the presence of relevant
concepts in a large corpus of medical images; and caption prediction, which aims to generate
coherent textual descriptions of a medical image. We addressed both the concept
detection and caption prediction tasks.
      </p>
      <p>For the concept detection task, we developed five different approaches: (i) baseline multi-label
classification, in which a convolutional neural network (CNN) simultaneously predicts all the
concepts from an image; (ii) an adversarial approach, in which a multi-label classifier and a concept
discriminator are trained in an adversarial manner to promote the learning of admissible concept
combinations by the multi-label classifier; (iii) an autoregressive approach, which aims to model
dependencies between concepts using autoregressive learning; (iv) image retrieval, in which a
model assigns concepts to an image based on its most similar images from the training data; and
(v) concept retrieval, in which a model learns to map concepts and images into a common latent
space where images are closer to the concepts they contain. We also developed three model
ensembles using the aforementioned approaches: (i) multi-label classification and concept
retrieval, (ii) autoregressive model and image retrieval using the autoregressive model, and (iii) adversarial
model and image retrieval using the autoregressive model. Our best submission (i.e. the ensemble of the
autoregressive model and image retrieval using the autoregressive model) obtained an F1-score of
0.4998, ranking 3rd among nine teams.</p>
      <p>
        For the caption prediction task, we relied on Vision Encoder-Decoder Transformer-based
architectures, since they worked well on last year’s competition [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. We explored two different
categories of image feature extractors for the Encoder, namely a Vision Transformer and a
CNN. Furthermore, we introduced a caption-to-concepts classification branch as an additional
supervisory signal for the model, since the caption needs to contain enough information to
allow, to some extent, for the prediction of the concepts. Our best submission (i.e. the Vision
Transformer encoder model trained on both training and validation sets) achieved a BERTScore
of 0.6147, ranking 5th among thirteen participating teams.
      </p>
      <p>The remainder of this paper is organised as follows: Section 2 provides an overview of the
data provided by the organisation to address the tasks and describes our exploratory data
analysis; Section 3 details the different proposals developed to solve the aforementioned tasks;
Section 4 presents the results and their discussion; and Section 5 concludes this paper and
recommends future work directions. The code related to this paper is publicly available in a
GitHub repository<sup>2</sup>.</p>
      <p><sup>1</sup> http://www.clef-initiative.eu (accessed on: 02-06-2023)</p>
    </sec>
    <sec id="sec-2">
      <title>2. Data</title>
      <p>
        The dataset provided in this competition is an extended version of the Radiology Objects
in COntext (ROCO) dataset [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ]. The data originates from biomedical articles of the PMC
OpenAccess subset [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]. The images provided to the participants are divided into training (60,918
images), validation (10,437 images) and test (10,473 images) sets.
      </p>
      <p>
        The concepts provided in the training and validation data were annotated according to the
Unified Medical Language System (UMLS) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] 2022 AB release, wherein each concept is uniquely
identified through a Concept Unique Identifier (CUI). For additional details, please refer to [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ].
      </p>
      <p>Table 1 presents an analysis of the number of concepts contained in each training and
validation image. On average, each image has 3.7 concepts. While there are 4,716 images
with only 1 concept, there are also 233 images with more than 10 concepts, the maximum
number of concepts per image being 24 in the training set.</p>
      <p><sup>2</sup> https://github.com/TiagoFilipeSousaGoncalves/ImageCLEFmedical2023VCMI</p>
    </sec>
    <sec id="sec-3">
      <title>3. Methodology</title>
      <p>The following sections describe the methods developed to fulfill the concept detection and
caption prediction tasks.</p>
      <sec id="sec-3-1">
        <title>3.1. Concept Detection</title>
        <p>The concept detection task was solved using two main approaches: modelling the concept
detection task as a multi-label classification problem and as an information retrieval problem.
We developed three models based on multi-label classification: a baseline model, a model trained
in an adversarial manner and a model trained using autoregressive learning. Furthermore, we
developed models to perform concept retrieval and image retrieval. The following subsections
describe, in detail, each of the proposed methods.</p>
        <sec id="sec-3-1-1">
          <title>3.1.1. Baseline Multi-Label Approach</title>
          <p>A conventional approach to address the concept detection task involves employing a
multi-label classification model, since a single image can contain multiple
non-mutually exclusive concepts.</p>
          <p>
            Specifically, we adapted the DenseNet-121 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] architecture by modifying the classification
layer to have one output per concept, i.e. 2125 outputs.
          </p>
          <p>
            In the training phase, the model was trained using the binary cross-entropy loss function
and the adaptive moment estimation (Adam) optimiser [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] with its default hyperparameters.
The model was trained for 100 epochs with a learning rate of 1e-4. Concretely, we trained only
the classification layer of the model while keeping the remaining layers frozen. Subsequently,
the model with the lowest validation loss was selected for the testing phase.
          </p>
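          <p>As an illustration of the training objective described above, the following minimal sketch computes the binary cross-entropy over independent sigmoid outputs for one image. It is not the implementation used in our experiments: the function names are hypothetical, and the 0.5 decision threshold is an assumption, since the text does not state the threshold used at inference time.</p>
          <preformat>
```python
import math


def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(logits, targets, eps=1e-7):
    """Mean binary cross-entropy over all concept outputs of one image.

    logits:  one raw score per concept.
    targets: 1 if the concept is present in the image, 0 otherwise.
    """
    total = 0.0
    for z, y in zip(logits, targets):
        # Clamp the probability to avoid log(0).
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(logits)


def predict_concepts(logits, threshold=0.5):
    """Indices of the concepts whose probability reaches the threshold.

    The threshold value is an assumption made for this sketch.
    """
    return [i for i, z in enumerate(logits) if sigmoid(z) >= threshold]
```
          </preformat>
          <p>Because every concept has its own sigmoid output, an image may be assigned several concepts, or none at all if no probability reaches the threshold.</p>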
        </sec>
        <sec id="sec-3-1-2">
          <title>3.1.2. Adversarial Approach</title>
          <p>
            Ensuring that the multi-label baseline approach learns the correct combination of concepts is
not trivial (e.g. concepts related to different body parts should not be combined). Hence, we
propose adversarial training to learn a realistic combination of concepts per image, according
to the distribution of the training data. This model is composed of two blocks (see Figure 1):
• A multi-label classifier trained to predict the top-<italic>K</italic> most frequent concepts (<italic>K</italic> = 100) in
the database. This block uses ResNet50 [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] as a feature extractor along with a multi-layer
perceptron (MLP) with a sigmoid activation.
• A concept discriminator trained to distinguish between real (i.e. admissible) and fake (i.e.
inadmissible) combinations of concepts. This block is an MLP with two fully-connected
layers followed by a ReLU activation and a fully-connected layer with a sigmoid activation.
          </p>
          <p>
            We trained this model for 20 epochs using binary cross-entropy as the loss function for both
the multi-label classifier and the concept discriminator, and Adam [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] as the optimiser. The best
model is saved according to the lowest validation loss.
          </p>
          <p>[Figure 1: Overview of the adversarial approach. Diagram labels: Ground-Truth Concepts, Predicted Concepts, Multi-Label Classifier, Concept Discriminator, Real / Fake.]</p>
        </sec>
        <sec id="sec-3-1-3">
          <title>3.1.3. Autoregressive Approach</title>
          <p>The main limitation of the baseline multi-label classification approach is that it assumes that
the concepts are independent of each other. However, there may be dependencies between
concepts, as there are concepts that never appear together in the training data, or concepts
that only exist in the presence of other concepts. To overcome this limitation, we devised an
approach to model dependencies between concepts based on autoregressive learning.</p>
          <p>
            The proposed model is a multi-label classification network that, instead of having a final
classification layer with 2125 units to predict all the concepts, contains several classification
layers, each predicting a subset of concepts. To model dependencies, each layer is conditioned
on the output of the previous layers. An overview of the autoregressive model is depicted in
Figure 2. As the feature extractor of the network, we used a VGG16 [
            <xref ref-type="bibr" rid="ref9">9</xref>
            ] network pre-trained
on ImageNet [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], followed by two fully-connected layers with LeakyReLU activations and
Dropout. All of the classification layers are fully-connected layers with sigmoid activation.
          </p>
          <p>Since it is easier for a network to predict concepts that exist in more images, we organised
the concepts in the layers according to how frequent they are among the training images. The
most common concepts are predicted by the first classification layers, while the rarest ones are
predicted by the last classification layers. Since there is a total of 2125 concepts, we used 17
classification layers, each responsible for predicting 125 concepts.</p>
          <p>[Figure 2: Overview of the autoregressive model. Diagram labels: Classification Layer 1, Classification Layer 2, …, Classification Layer n.]</p>
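          <p>The frequency-based organisation of concepts into classification layers can be sketched as follows. This is an illustrative sketch rather than our released code: the function names are hypothetical, ties in frequency are broken alphabetically for determinism, and the second helper assumes that conditioning is implemented by concatenating the outputs of all previous layers to the image features, which is only one possible way of conditioning a layer on earlier predictions.</p>
          <preformat>
```python
from collections import Counter


def group_concepts_by_frequency(image_concepts, group_size):
    """Split concepts into groups, most frequent first.

    image_concepts: one list of concept identifiers per training image.
    Returns a list of groups; group k would be predicted by layer k.
    """
    counts = Counter(c for concepts in image_concepts for c in concepts)
    # Sort by decreasing frequency; break ties by identifier (assumption).
    ordered = sorted(counts, key=lambda c: (-counts[c], c))
    return [ordered[i:i + group_size] for i in range(0, len(ordered), group_size)]


def layer_input_sizes(feature_dim, n_layers, group_size):
    """Input width of each classification layer under the assumption that
    layer k receives the image features concatenated with the outputs of
    the k previous layers."""
    return [feature_dim + k * group_size for k in range(n_layers)]
```
          </preformat>
          <p>With 2125 concepts and a group size of 125, this grouping yields exactly 17 groups, matching the 17 classification layers described above.</p>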
          <p>In the training phase, the model was trained using binary cross-entropy as the loss function
and the Adam optimiser with a learning rate of 1e-5. We trained the model in two phases.
First, we trained the classification layers of the model for 50 epochs, with the feature extractor
frozen. Then, we fine-tuned the entire network by training it for 20 epochs. We selected the
best instance of the model by monitoring its loss on the validation data.</p>
        </sec>
        <sec id="sec-3-1-4">
          <title>3.1.4. Retrieval Approaches</title>
          <p>We implemented two main approaches to predict the concepts of an image based on information
retrieval techniques: concept retrieval and image retrieval. In concept retrieval, the method
maps images and concepts into a common latent space, retrieving the closest concepts to an
image. In image retrieval, the method assigns concepts to an image based on its most similar
images from the training data. Both these methods will be described in detail below.</p>
          <p>In the concept retrieval approach, we use an image encoder and a concept encoder to map
images and concepts into a common latent space. Then, we compute the Euclidean distance
between the latent representations of the images and concepts, as depicted in Figure 3. During
training, we minimise the Euclidean distance between an image and the concepts it contains,
and we maximise the distance between the image and the concepts it does not contain.</p>
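          <p>One way to express this objective is a contrastive loss with a margin, sketched below. The exact loss formulation is not specified in the text, so the squared-distance attraction term, the hinge-style repulsion term, the margin value and the retrieval radius are all assumptions made for illustration, as are the function names.</p>
          <preformat>
```python
import math


def euclidean(u, v):
    """Euclidean distance between two latent vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def itc_loss(image_vec, concept_vecs, present, margin=1.0):
    """Image-to-Concept loss for one image (contrastive form, assumed).

    present: set of indices of the concepts the image contains.
    Concepts in the image are pulled towards it; all other concepts are
    pushed away until they are at least `margin` apart.
    """
    loss = 0.0
    for idx, vec in enumerate(concept_vecs):
        d = euclidean(image_vec, vec)
        if idx in present:
            loss += d ** 2
        else:
            loss += max(0.0, margin - d) ** 2
    return loss / len(concept_vecs)


def retrieve_concepts(image_vec, concept_vecs, radius):
    """Indices of the concepts lying within `radius` of the image."""
    return [i for i, v in enumerate(concept_vecs)
            if radius >= euclidean(image_vec, v)]
```
          </preformat>
          <p>At inference time, the concepts closest to the image in the latent space are retrieved. The fixed radius used here is only one possible selection rule.</p>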
          <p>In our implementation, the image encoder is a CNN with four blocks of convolutional layers
with Batch Normalisation and Max Pooling, followed by a layer that performs Global Average
Pooling and a fully-connected layer. The concept encoder is a Multi-Layer Perceptron (MLP)
with one fully-connected layer with Dropout and LeakyReLU as the activation function, followed
by a second fully-connected layer.</p>
          <p>In addition to the Image-to-Concept (ITC) loss, we also performed some experiments where
we added the following loss functions to the training of the networks:
• Concept-to-Concept (CTC) loss: Minimises the distance between two different concepts
that exist in the same images, and maximises the distance between concepts that do not
appear together in any image. We apply a weight to the loss function by multiplying it
by the percentage of images that two concepts share (intersection over union).
• Image-to-Image (ITI) loss: Minimises the distance between images that have some
concepts in common, and maximises the distance between images that do not share any
concepts. We apply a weight to the loss function by multiplying it by the percentage of
concepts that two images have in common (intersection over union).</p>
          <p>[Figure 3: Overview of the concept retrieval approach. Diagram labels: Concept Encoder, Distance.]</p>
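          <p>The intersection-over-union weights used by the CTC and ITI losses can be computed as in the sketch below. The function names are hypothetical, and returning a weight of 0 when both sets are empty is an assumption. How the weight enters each loss term is not reproduced here.</p>
          <preformat>
```python
def intersection_over_union(a, b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(a), set(b)
    union = a.union(b)
    if not union:
        return 0.0  # assumption: two empty sets share nothing
    return len(a.intersection(b)) / len(union)


def ctc_weight(concept_a, concept_b, image_concepts):
    """IoU of the sets of images in which each concept appears."""
    images_a = {i for i, cs in enumerate(image_concepts) if concept_a in cs}
    images_b = {i for i, cs in enumerate(image_concepts) if concept_b in cs}
    return intersection_over_union(images_a, images_b)


def iti_weight(concepts_a, concepts_b):
    """IoU of the concept sets of two images."""
    return intersection_over_union(concepts_a, concepts_b)
```
          </preformat>
          <p>For example, two images annotated with {a, b} and {b, c} share one of three distinct concepts, giving a weight of 1/3.</p>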
          <p>We performed three experiments: (i) training the concept retrieval networks only with the
ITC loss for 2600 epochs, (ii) fine-tuning the network trained with the ITC loss using the CTC
loss for 100 epochs, and (iii) fine-tuning the network trained with the ITC loss simultaneously
using the CTC and ITI losses for 100 epochs. The networks were trained using the Adam
optimiser with a learning rate of 1e-5. We monitored the loss on the validation data to obtain
the best model.</p>
          <p>In the image retrieval approach, we use pre-trained models to obtain latent representations
of the images, which are then used to measure the distance between the target image whose
concepts we want to predict and the images of the training data. We devised three strategies to
assign concepts to the target image, based on its most similar images:
• Strategy 1 (S1): Retrieve the closest image and assign its concepts to the target image.
• Strategy 2 (S2): Retrieve the Top-N closest images and assign to the target image the
concepts of the closest image that also exist in at least one other image from the Top-N
retrieved images. If no concept of the closest image appears in another image of the
Top-N, then all the concepts of the closest image are assigned to the target image.
• Strategy 3 (S3): Retrieve the Top-N closest images and assign the concepts that exist in
at least two of the Top-N retrieved images to the target image. Similarly to Strategy 2,
if no concept appears in at least two images of the Top-N, then all the concepts of the
closest image are assigned to the target image.</p>
          <p>We empirically chose to retrieve the Top-4 closest images in strategies 2 and 3.</p>
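          <p>The three assignment strategies can be summarised by the following sketch, which takes the concept lists of the retrieved images ordered from closest to farthest. The function name and the sorted output order of Strategy 3 are choices made for this illustration only.</p>
          <preformat>
```python
from collections import Counter


def assign_concepts(neighbour_concepts, strategy):
    """Assign concepts to a target image from its retrieved neighbours.

    neighbour_concepts: concept lists of the Top-N closest training
    images, closest image first.
    strategy: 1, 2 or 3, following S1, S2 and S3 in the text.
    """
    closest = list(neighbour_concepts[0])
    if strategy == 1:
        # S1: copy the concepts of the closest image.
        return closest
    if strategy == 2:
        # S2: keep the concepts of the closest image that also appear in
        # at least one other retrieved image; otherwise fall back to S1.
        others = set().union(*neighbour_concepts[1:])
        shared = [c for c in closest if c in others]
        return shared or closest
    if strategy == 3:
        # S3: keep the concepts present in at least two retrieved images;
        # otherwise fall back to S1.
        counts = Counter(c for cs in neighbour_concepts for c in set(cs))
        shared = sorted(c for c, n in counts.items() if n >= 2)
        return shared or closest
    raise ValueError("strategy must be 1, 2 or 3")
```
          </preformat>
          <p>Strategy 3 differs from Strategy 2 in that it can assign concepts that are absent from the closest image, provided that they are shared by at least two other retrieved images.</p>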
          <p>
            As the pre-trained models to obtain a latent representation of the images we used a ResNet50
[
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] trained on ImageNet [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ], and the image encoders of the previously described concept
retrieval and autoregressive models.
          </p>
        </sec>
        <sec id="sec-3-1-5">
          <title>3.1.5. Ensemble</title>
          <p>The multi-label classification-based approaches (baseline, adversarial and autoregressive) often
fail to predict any concepts for a given test image, leading to many images in the test dataset
with no predicted concepts. As such, we devise an ensemble strategy where, for each image
where the multi-label approaches fail to predict any concepts, we assign the concepts predicted
by one of the retrieval approaches.</p>
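          <p>A minimal sketch of this fallback rule is given below, assuming that the predictions of each method are stored as a dictionary from image identifier to list of concepts. The function name and the data layout are hypothetical.</p>
          <preformat>
```python
def ensemble(classifier_preds, retrieval_preds):
    """Combine classifier and retrieval predictions per image.

    The classifier prediction is kept whenever it is non-empty; otherwise
    the retrieval prediction for the same image is used.
    """
    merged = {}
    for image_id, concepts in classifier_preds.items():
        if concepts:
            merged[image_id] = list(concepts)
        else:
            merged[image_id] = list(retrieval_preds.get(image_id, []))
    return merged
```
          </preformat>
          <p>The retrieval predictions are used only where the classifier predicts no concepts, so images that already have classifier predictions are left unchanged.</p>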
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>3.2. Caption Prediction</title>
        <p>The caption prediction task involves generating text that describes an image. To tackle this
task we considered two categories of approaches, retrieval and language generation, which we
describe in more detail below.</p>
        <sec id="sec-3-2-1">
          <title>3.2.1. Retrieval Approach</title>
          <p>
            We applied the image retrieval approach developed for the concept detection task to obtain
captions for the test images. We used the pre-trained ResNet [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ], the concept retrieval network
trained using the ITC loss and the autoregressive network to obtain latent representations of
the images. These representations were then used to obtain the closest images from the training
and validation data whose captions were assigned to the test samples.
          </p>
        </sec>
        <sec id="sec-3-2-2">
          <title>3.2.2. Language Generation Approaches</title>
          <p>
            The language generation-based strategies used to tackle this task employ an Encoder-Decoder
framework, since it was our best performing approach in last year’s competition [
            <xref ref-type="bibr" rid="ref2">2</xref>
            ]. The
Encoder, typically a CNN or a Vision Transformer, is responsible for analysing the image
and extracting relevant features. The Decoder then receives the encoded image features and
generates the caption. Thus, it is usually an autoregressive model, such as GPT-2 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ].
          </p>
          <p>
            We experimented with two different encoders: the small distilled version of the Data-efficient
image Transformer (DeiT) [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ] from the Hugging Face Transformers library [
            <xref ref-type="bibr" rid="ref13">13</xref>
            ], and, inspired
by the work of Hou et al. [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ], DenseNet121 [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] from TorchXRayVision [
            <xref ref-type="bibr" rid="ref15 ref16">15, 16</xref>
            ] pre-trained
on all available datasets (densenet121-res224-all). The decoder consisted of the distilled
version of GPT-2 [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ]. Both models were trained with an initial learning rate of 1e-4 using the
AdamW optimiser [
            <xref ref-type="bibr" rid="ref17">17</xref>
            ] for 25 epochs. We monitored the BERTScore on the validation data to
obtain the best model.
          </p>
          <p>Since the UMLS concepts of the concept detection task are tightly related to the captions in
the caption prediction task, we hypothesise that it should be possible to predict the concepts
from the captions to some extent. Furthermore, predicting the concepts from the captions might
prove a good additional supervisory signal for training the caption prediction model. Therefore,
we explored the inclusion of a text classifier that takes the caption of a given image and predicts
its concepts (see Figure 4).</p>
          <p>
            To accomplish this we originally trained a DistilBERT [
            <xref ref-type="bibr" rid="ref18">18</xref>
            ] model for caption-to-concept
multi-label classification. The model was trained with the binary cross-entropy loss on the CLS
token for 20 epochs with an initial learning rate of 2e-5 and the AdamW optimiser. This
caption-to-concept classifier was then used (but kept frozen) on top of the DenseNet-DistilGPT2 model
to provide an extra loss function for training. However, since the output of the Encoder-Decoder
module and the input of the caption-to-concept classifier (i.e. the generated text) is discrete,
Reinforcement Learning (RL) is needed, similarly to what is done in Self-Critical Sequence
Training [
            <xref ref-type="bibr" rid="ref19">19</xref>
            ]; thus, the whole sentence needs to be generated before classification can occur,
making this approach much slower compared to teacher forcing-only training.
          </p>
          <p>We also experimented with simply adding a fully connected layer directly on top of the
latent representation of the Decoder’s last token and training the whole Encoder-Decoder plus
classification layer together. This approach has the advantage of not needing RL, thus making
it faster and easier to train.</p>
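          <p>[Figure 4: Overview of the language generation approach with the caption-to-concept classifier. Diagram labels: Encoder, Decoder, Text Classifier, &lt;SOS Token&gt;, Predicted Concepts. Example caption: Chest X-ray showing large… Image: CC BY-NC [Al Mulhim et al. (2022)].]</p>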
        </sec>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>4. Results and Discussion</title>
      <p>This section details the results obtained by the methods developed for the concept detection
and caption prediction tasks.</p>
      <sec id="sec-4-1">
        <title>4.1. Concept Detection</title>
        <p>The concept detection task is evaluated using the example-based F1-score between the predicted
and ground-truth concepts. Table 4 presents the results in terms of F1-Score obtained by each
proposed method on the validation and test data. Furthermore, it presents a secondary F1-score
metric (F1-Score Manual) that compares the concepts predicted on the test data with a subset of
manually validated concepts.</p>
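        <p>For reference, the example-based F1-score can be sketched as follows: the F1-score is computed separately for each image from its predicted and ground-truth concept sets, and then averaged over all images. This sketch is not the official evaluation script. In particular, scoring an image with no predicted and no ground-truth concepts as 1.0 is an assumption, and the function names are hypothetical.</p>
        <preformat>
```python
def example_f1(predicted, truth):
    """F1-score between the predicted and true concept sets of one image."""
    predicted, truth = set(predicted), set(truth)
    if not predicted and not truth:
        return 1.0  # assumption: nothing to predict and nothing predicted
    true_positives = len(predicted.intersection(truth))
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


def mean_example_f1(predictions, ground_truth):
    """Average the per-image F1-score over all images in ground_truth.

    Both arguments map an image identifier to a list of concepts; an
    image without a prediction is scored against an empty set.
    """
    scores = [example_f1(predictions.get(image_id, []), concepts)
              for image_id, concepts in ground_truth.items()]
    return sum(scores) / len(scores)
```
        </preformat>
        <p>Under this metric, an image for which no concepts are predicted contributes a score of 0 whenever it has ground-truth concepts, which is why empty predictions are costly.</p>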
        <p>The baseline multi-label classification approach obtained an F1-score of 0.4469 on the test
set. Contrary to our expectations, the adversarial approach did not improve upon the baseline.
This might be explained by the fact that this adversarial model was only trained on the top-100
concepts. Thus, we leave as future work a more in-depth exploration of this approach. The
autoregressive approach achieved the highest performance among the multi-label-based models.</p>
        <p>In the concept retrieval approach, we observe that adding the CTC and ITI loss functions to
the network trained only with the ITC loss leads to a lower F1-score.</p>
        <p>Regarding the image retrieval method, we empirically found that Strategy 3 (S3) produced
the best results. This ablation study can be found in Table 5, which compares the different image
retrieval strategies on the validation data, using the concept retrieval model trained with ITC
loss as the base. We observe that assigning concepts that exist in at least two of the Top-4 most
similar images (Strategy 3) leads to the highest F1-Score. Among the three different base models
used (ResNet, autoregressive and concept retrieval), the best results were obtained by using
the autoregressive model. Nevertheless, these results do not surpass the values obtained by the
multi-label classification-based autoregressive model.</p>
        <p>However, the retrieval-based approaches proved very useful as complements to the
classification-based methods. As expected, the ensemble methods, which combine both
techniques, improved the results of all three multi-label classification networks (baseline, adversarial
and autoregressive). We obtained the best results by merging our two best models from each
category, the autoregressive multi-label classification network and the image retrieval approach
using the autoregressive model, achieving an F1-Score of 0.4998 and a Manual F1-Score of 0.9162.
This allowed us to rank 3rd in the competition among nine teams.</p>
      </sec>
      <sec id="sec-4-2">
        <title>4.2. Caption Prediction</title>
        <p>
          The caption prediction task is evaluated in terms of BERTScore [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ] and ROUGE [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ]. We present
the obtained results in Table 6 for both retrieval and language generation-based approaches.
        </p>
        <p>All retrieval approaches ranked below the language generation-based approaches, which
suggests that simply reusing the captions from similar images is not enough to accurately describe
a different image.</p>
        <p>Regarding the generation-based approaches, using the DeiT encoder yielded slightly improved
results when compared to using DenseNet-121. As expected, adding the classification loss
improved the corresponding base architecture, but it was not enough to surpass the DeiT +
DistilGPT2 model. This suggests that, had time permitted, adding the classification loss to
the DeiT instead of the DenseNet-based model would have further improved our results. We
would like to point out that we do not report the results obtained by our model with the RL
caption-to-concept classifier because we were not able to train it in a reasonable amount of
time given the computational resources available.</p>
        <p>Thus, our best results were obtained by the DeiT + DistilGPT2 model trained on both training
and validation sets. This also suggests that our other methods could achieve better
results if trained on both sets, something we leave as future work. In the end, these results
earned us 5th place in the competition among thirteen participating teams.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>5. Conclusions and Future Work</title>
      <p>This work described the methods developed by the VCMI team in the ImageCLEFmedical
Caption 2023 task. We developed approaches based on multi-label classification and retrieval to
assign concepts to medical images, obtaining an F1-Score of 0.4998 that granted us 3rd place
among the nine teams that participated in the challenge. For caption generation, we focused on
encoder-decoder approaches with Transformers, obtaining a 5th place among thirteen teams,
with a BERTScore of 0.6147.</p>
      <p>
        In the concept detection task, the experiments show that training an autoregressive
multi-label classification network to model dependencies between concepts is a promising approach
capable of achieving high performance. As such, future work includes the further development
of autoregressive models, potentially with the integration of more advanced autoregressive
networks from the literature, such as Transformers [
        <xref ref-type="bibr" rid="ref22">22</xref>
        ]. We also intend to continue developing
the concept retrieval approach by pre-training the concept encoder using the concept-to-concept
loss before training the whole model. Finally, we consider the application of the adversarial
approach to predict all concepts, rather than only the Top-100 most frequent concepts, and
the potential integration between the adversarial and the autoregressive approaches into one
model.
      </p>
      <p>In the caption prediction task, future work involves exploring different and more powerful
image encoders, as well as more recent language models. We also intend to explore in more
depth the inclusion of the concept classification loss into our base encoder-decoder approach,
not only by applying it to all our model configurations, but also by investigating the best way
of integrating it during training, e.g. only after the captioning module is sufficiently trained.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgments</title>
      <p>We would like to thank our colleague Pedro Neto for his valuable feedback and suggestions.</p>
      <p>This work was partially funded by the Project TAMI - Transparent Artificial Medical
Intelligence (NORTE-01-0247-FEDER-045905), financed by ERDF - European Regional Development Fund through
the North Portugal Regional Operational Program - NORTE 2020 and by the Portuguese
Foundation for Science and Technology - FCT under the CMU - Portugal International Partnership,
and also by the Portuguese Foundation for Science and Technology (FCT) within PhD grants
2022.14516.BD, 2022.11566.BD, 2020.06434.BD and 2020.07034.BD.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>B.</given-names>
            <surname>Ionescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Drăgulinescu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Yim</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Snider</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Adams</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Yetisgen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>García Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S. A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Thambawita</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Storås</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>N.</given-names>
            <surname>Papachrysos</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Schöler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Andrei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Radzhabov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Coman</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kovalev</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Stan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Ioannidis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Manguinhas</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Ştefan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. G.</given-names>
            <surname>Constantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Dogariu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Deshayes</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Popescu</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEF 2023: Multimedia retrieval in medical, social media and recommender systems applications</article-title>
          ,
          <source>in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 14th International Conference of the CLEF Association (CLEF 2023)</source>
          , Springer Lecture Notes in Computer Science LNCS, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>I.</given-names>
            <surname>Rio-Torto</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Patrício</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Montenegro</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Gonçalves</surname>
          </string-name>
          ,
          <article-title>Detecting Concepts and Generating Captions from Medical Images: Contributions of the VCMI Team to ImageCLEFmedical 2022 Caption</article-title>
          ,
          <source>in: Proceedings of the Working Notes of CLEF 2022 - Conference and Labs of the Evaluation Forum, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Bologna, Italy,
          <year>2022</year>
          , pp.
          <fpage>1535</fpage>
          -
          <lpage>1553</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. Ben</given-names>
            <surname>Abacha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A. G.</given-names>
            <surname>Seco de Herrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Bloch</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brüngel</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Idrissi-Yaghir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Schäfer</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Müller</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Overview of ImageCLEFmedical 2023 - Caption Prediction and Concept Detection</article-title>
          ,
          <source>in: CLEF2023 Working Notes, CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki, Greece,
          <year>2023</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>O.</given-names>
            <surname>Pelka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Koitka</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Rückert</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Nensa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C. M.</given-names>
            <surname>Friedrich</surname>
          </string-name>
          ,
          <article-title>Radiology Objects in COntext (ROCO): A Multimodal Image Dataset</article-title>
          ,
          <source>in: Intravascular Imaging and Computer Assisted Stenting and Large-Scale Annotation of Biomedical Data and Expert Label Synthesis</source>
          , Springer International Publishing, Cham,
          <year>2018</year>
          , pp.
          <fpage>180</fpage>
          -
          <lpage>189</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>O.</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          ,
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          ,
          <source>Nucleic Acids Research</source>
          <volume>32</volume>
          (
          <year>2004</year>
          )
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          . doi:10.1093/nar/gkh061.
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>G.</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Z.</given-names>
            <surname>Liu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L. V. D.</given-names>
            <surname>Maaten</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <article-title>Densely connected convolutional networks</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , IEEE Computer Society, Los Alamitos, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>2261</fpage>
          -
          <lpage>2269</lpage>
          . doi:10.1109/CVPR.2017.243.
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>D. P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ba</surname>
          </string-name>
          ,
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          ,
          <source>in: 3rd International Conference on Learning Representations (ICLR)</source>
          , San Diego, CA, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>K.</given-names>
            <surname>He</surname>
          </string-name>
          ,
          <string-name>
            <given-names>X.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Ren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Sun</surname>
          </string-name>
          ,
          <article-title>Deep residual learning for image recognition</article-title>
          ,
          <source>in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , IEEE, Las Vegas, NV, USA,
          <year>2016</year>
          , pp.
          <fpage>770</fpage>
          -
          <lpage>778</lpage>
          . doi:10.1109/CVPR.2016.90.
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>K.</given-names>
            <surname>Simonyan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Zisserman</surname>
          </string-name>
          ,
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          ,
          <source>in: 3rd International Conference on Learning Representations (ICLR)</source>
          , San Diego, CA, USA,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          [10]
          <string-name>
            <given-names>J.</given-names>
            <surname>Deng</surname>
          </string-name>
          ,
          <string-name>
            <given-names>W.</given-names>
            <surname>Dong</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Socher</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.-J.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K.</given-names>
            <surname>Li</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Fei-Fei</surname>
          </string-name>
          ,
          <article-title>ImageNet: A large-scale hierarchical image database</article-title>
          ,
          <source>in: 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , IEEE, Miami, FL, USA,
          <year>2009</year>
          , pp.
          <fpage>248</fpage>
          -
          <lpage>255</lpage>
          . doi:10.1109/CVPR.2009.5206848.
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          [11]
          <string-name>
            <given-names>A.</given-names>
            <surname>Radford</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Child</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Luan</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Amodei</surname>
          </string-name>
          ,
          <string-name>
            <given-names>I.</given-names>
            <surname>Sutskever</surname>
          </string-name>
          ,
          <article-title>Language Models are Unsupervised Multitask Learners</article-title>
          ,
          <source>OpenAI Blog</source>
          <volume>1</volume>
          (
          <year>2019</year>
          )
          9.
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          [12]
          <string-name>
            <given-names>H.</given-names>
            <surname>Touvron</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Cord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Douze</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Massa</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Sablayrolles</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Jegou</surname>
          </string-name>
          ,
          <article-title>Training data-efficient image transformers &amp; distillation through attention</article-title>
          ,
          <source>in: Proceedings of the 38th International Conference on Machine Learning (ICML)</source>
          , volume
          <volume>139</volume>
          <source>of Proceedings of Machine Learning Research, PMLR, Online</source>
          ,
          <year>2021</year>
          , pp.
          <fpage>10347</fpage>
          -
          <lpage>10357</lpage>
          . URL: https://proceedings.mlr.press/v139/touvron21a.html.
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          [13]
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Delangue</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Moi</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Cistac</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Rault</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Louf</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Funtowicz</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Davison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Shleifer</surname>
          </string-name>
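          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>von Platen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Ma</surname>
          </string-name>
          ,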
          <string-name>
            <given-names>Y.</given-names>
            <surname>Jernite</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Plu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            <surname>Xu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T. Le</given-names>
            <surname>Scao</surname>
          </string-name>
          ,
          <string-name>
            <given-names>S.</given-names>
            <surname>Gugger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Drame</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Q.</given-names>
            <surname>Lhoest</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Rush</surname>
          </string-name>
          ,
          <article-title>Transformers: State-of-the-Art Natural Language Processing</article-title>
          ,
          <source>in: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP): System Demonstrations</source>
          , Association for Computational Linguistics, Online,
          <year>2020</year>
          , pp.
          <fpage>38</fpage>
          -
          <lpage>45</lpage>
          . doi:10.18653/v1/2020.emnlp-demos.6.
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          [14]
          <string-name>
            <given-names>B.</given-names>
            <surname>Hou</surname>
          </string-name>
          ,
          <string-name>
            <given-names>G.</given-names>
            <surname>Kaissis</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R. M.</given-names>
            <surname>Summers</surname>
          </string-name>
          ,
          <string-name>
            <given-names>B.</given-names>
            <surname>Kainz</surname>
          </string-name>
          ,
          <article-title>RATCHET: Medical Transformer for Chest X-ray Diagnosis and Reporting</article-title>
          ,
          <source>in: Medical Image Computing and Computer Assisted Intervention (MICCAI)</source>
          , Springer International Publishing, Cham,
          <year>2021</year>
          , pp.
          <fpage>293</fpage>
          -
          <lpage>303</lpage>
          . doi:10.1007/978-3-030-87234-2_28.
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          [15]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hashir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bertrand</surname>
          </string-name>
          ,
          <article-title>On the limits of cross-domain generalization in automated X-ray prediction</article-title>
          ,
          <source>in: Proceedings of the Third Conference on Medical Imaging with Deep Learning (MIDL)</source>
          , volume
          <volume>121</volume>
          <source>of Proceedings of Machine Learning Research, PMLR, Online</source>
          ,
          <year>2020</year>
          , pp.
          <fpage>136</fpage>
          -
          <lpage>155</lpage>
          . URL: https://proceedings.mlr.press/v121/cohen20a.html.
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          [16]
          <string-name>
            <given-names>J. P.</given-names>
            <surname>Cohen</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J. D.</given-names>
            <surname>Viviano</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Bertin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Morrison</surname>
          </string-name>
          ,
          <string-name>
            <given-names>P.</given-names>
            <surname>Torabian</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Guarrera</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M. P.</given-names>
            <surname>Lungren</surname>
          </string-name>
          ,
          <string-name>
            <given-names>A.</given-names>
            <surname>Chaudhari</surname>
          </string-name>
          ,
          <string-name>
            <given-names>R.</given-names>
            <surname>Brooks</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Hashir</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            <surname>Bertrand</surname>
          </string-name>
          ,
          <article-title>TorchXRayVision: A library of chest X-ray datasets and models</article-title>
          ,
          <source>in: Proceedings of The 5th International Conference on Medical Imaging with Deep Learning (MIDL)</source>
          , volume
          <volume>172</volume>
          <source>of Proceedings of Machine Learning Research, PMLR, Online</source>
          ,
          <year>2022</year>
          , pp.
          <fpage>231</fpage>
          -
          <lpage>249</lpage>
          . URL: https://proceedings.mlr.press/v172/cohen22a.html.
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          [17]
          <string-name>
            <given-names>I.</given-names>
            <surname>Loshchilov</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Hutter</surname>
          </string-name>
          ,
          <article-title>Decoupled weight decay regularization</article-title>
          ,
          <source>in: 7th International Conference on Learning Representations (ICLR)</source>
          , New Orleans, LA, USA,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          [18]
          <string-name>
            <given-names>V.</given-names>
            <surname>Sanh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>L.</given-names>
            <surname>Debut</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Chaumond</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wolf</surname>
          </string-name>
          ,
          <article-title>DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter</article-title>
          ,
          <source>in: 5th Edition of EMC2: Energy Efficient Machine Learning and Cognitive Computing Workshop at Neural Information Processing Systems (NeurIPS)</source>
          ,
          <year>2019</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          [19]
          <string-name>
            <given-names>S. J.</given-names>
            <surname>Rennie</surname>
          </string-name>
          ,
          <string-name>
            <given-names>E.</given-names>
            <surname>Marcheret</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Mroueh</surname>
          </string-name>
          ,
          <string-name>
            <given-names>J.</given-names>
            <surname>Ross</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Goel</surname>
          </string-name>
          ,
          <article-title>Self-Critical Sequence Training for Image Captioning</article-title>
          ,
          <source>in: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , IEEE Computer Society, Los Alamitos, CA, USA,
          <year>2017</year>
          , pp.
          <fpage>1179</fpage>
          -
          <lpage>1195</lpage>
          . doi:10.1109/CVPR.2017.131.
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          [20]
          <string-name>
            <given-names>T.</given-names>
            <surname>Zhang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Kishore</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            <surname>Wu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>K. Q.</given-names>
            <surname>Weinberger</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Artzi</surname>
          </string-name>
          ,
          <article-title>BERTScore: Evaluating Text Generation with BERT</article-title>
          ,
          <source>in: International Conference on Learning Representations (ICLR)</source>
          , Addis Ababa, Ethiopia,
          <year>2020</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          [21]
          <string-name>
            <given-names>C.-Y.</given-names>
            <surname>Lin</surname>
          </string-name>
          ,
          <article-title>ROUGE: A package for automatic evaluation of summaries</article-title>
          ,
          <source>in: Text Summarization Branches Out</source>
          , Association for Computational Linguistics (ACL), Barcelona, Spain,
          <year>2004</year>
          , pp.
          <fpage>74</fpage>
          -
          <lpage>81</lpage>
          . URL: https://aclanthology.org/W04-1013.
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          [22]
          <string-name>
            <given-names>J.</given-names>
            <surname>Lanchantin</surname>
          </string-name>
          ,
          <string-name>
            <given-names>T.</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>V.</given-names>
            <surname>Ordonez</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Y.</given-names>
            <surname>Qi</surname>
          </string-name>
          ,
          <article-title>General Multi-label Image Classification with Transformers</article-title>
          ,
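          <source>in: 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)</source>
          , IEEE Computer Society, Los Alamitos, CA, USA,
          <year>2021</year>
          , pp.
          <fpage>16473</fpage>
          -
          <lpage>16483</lpage>
          . doi:10.1109/CVPR46437.2021.01621.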
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>