<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta>
      <journal-title-group>
        <journal-title>CLEF 2021 Working Notes, CEUR Workshop Proceedings, Bucharest, Romania</journal-title>
      </journal-title-group>
    </journal-meta>
    <article-meta>
      <title-group>
        <article-title>Kdelab at ImageCLEF 2021: Medical Caption Prediction with Effective Data Pre-processing and Deep Learning</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Riku Tsuneda</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tetsuya Asakawa</string-name>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Masaki Aono</string-name>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Computer Science and Engineering, Toyohashi University of Technology</institution>
          ,
          <addr-line>Aichi</addr-line>
          ,
          <country country="JP">Japan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2021</year>
      </pub-date>
      <abstract>
        <p>The ImageCLEF 2021 Caption Prediction Task is an example of a challenging research problem in the field of image captioning. The goal of this research is to automatically generate accurate captions describing a given medical image. We describe our approach to captioning medical images and illustrate the text and image pre-processing that is effective for the task dataset. In this paper, we apply sentence-ending period removal as text pre-processing and histogram normalization of luminance as image pre-processing. Furthermore, we present the effectiveness of our text data augmentation approach. The submission of our kdelab team on the task test dataset achieved a BLEU score of 0.362.</p>
      </abstract>
      <kwd-group>
        <kwd>Image Captioning</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Medical Images</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1. Introduction</title>
      <p>In recent years, multimodal processing of images and natural language has attracted much
attention in the field of machine learning. Image Captioning is one of these representative tasks,
which aims at generating proper captions for input images. As captioning accuracy improves, computers
are expected not only to detect objects in images, but also to understand the
relationships and behaviors between those objects.</p>
      <p>Image captioning is also effective in the medical field. For example, interpreting and
summarizing possible disease symptoms from a large number of radiology images (e.g., X-ray images
and CT images) is a time-consuming task that only highly knowledgeable
specialists can perform. If computers could understand medical images and generate accurate captions, it
would help alleviate the world’s growing shortage of medical doctors. However, there is still the
bottleneck problem that few physicians are able to give accurate annotations.</p>
      <p>In this paper, we describe our approach to the general Image Captioning task in the medical domain,
as illustrated in Fig. 1 (right).</p>
      <p>
        The nature of medical images is quite different from that of general images such as MS-COCO [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]
in many aspects.
      </p>
      <p>
        In the following, we first describe related work on Image Captioning task and Medical
Image Captioning in Section 2, followed by the description of the dataset provided for
ImageCLEF2021 [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] Medical Image Captioning [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] task in Section 3. In Section 4, we describe the details
of the method we have applied, and then the experiments we have conducted in Section 5.
We finally conclude this paper in Section 6.
      </p>
    </sec>
    <sec id="sec-2">
      <title>2. Related Work</title>
      <p>
        In the field of image recognition, convolutional neural networks (CNN), including VGG [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and
ResNet [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], have been widely used. In the field of natural language processing for text
understanding, encoder-decoder models (seq2seq) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] have been the mainstream, but in recent years
Transformers [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] such as BERT [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] have become common. The Image Captioning task is a
fusion of image recognition and sentence generation, and lies at the intersection of these two fields.
      </p>
      <p>
        For example, Oriol Vinyals et al. proposed caption generation using an encoder-decoder
model [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], and Kelvin Xu et al. proposed Show, Attend and Tell, which adds visual attention
to the encoder-decoder model [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ]. Recently, P. Anderson et al. presented a model using
Bottom-Up Attention obtained by pre-training a Faster R-CNN for object detection [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ].
      </p>
      <p>
        In addition, this is the first time that a Caption Prediction Task of this kind has been held at an
ImageCLEF conference. However, a similar task, the VQA-Med task [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], has been contested at
ImageCLEF 2018, 2019, and 2020.
      </p>
      <p>For the ImageCLEF 2021 Medical Caption Prediction task, the organizers have provided a
training set of 2,756 radiology images with the same number of captions, a validation set of 500
radiology images with the same number of captions, and a test set of 444 radiology images with
the same number of captions. These serve as our datasets. Most of the images
in the dataset are non-colored, and they potentially include non-essential logos and text. The
task participants have to generate captions automatically from the radiology image data.</p>
      <p>According to our analysis, the top word frequencies were dominated by prepositions and
by words such as right and left that indicate position. The word cloud of case-insensitive words
and the 14 most frequent words are summarized in Figure 2 and Table 1, respectively.</p>
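      <p>A minimal sketch of how such a case-insensitive word-frequency count can be produced is shown below; the caption file name and its tab-separated layout are illustrative assumptions, not the actual distribution format.</p>
      <preformat>
# Minimal sketch: case-insensitive word frequencies over the training captions.
# The file name "train_captions.tsv" and its "image_id TAB caption" layout are
# assumptions for illustration only.
import csv
import re
from collections import Counter

counter = Counter()
with open("train_captions.tsv", newline="", encoding="utf-8") as f:
    for image_id, caption in csv.reader(f, delimiter="\t"):
        # Lowercase the caption and split it into alphabetic words.
        counter.update(re.findall(r"[a-z]+", caption.lower()))

# Print the 14 most frequent words, as in Table 1.
for word, freq in counter.most_common(14):
    print(word, freq)
      </preformat>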
    </sec>
    <sec id="sec-3">
      <title>4. Methodology</title>
      <p>The overview of our Medical Image Captioning methodology is divided into three main parts,
as shown in Figure 3.</p>
      <p>
        The first is the image and text pre-processing. As preliminaries, we propose a method for
pre-processing the images and text in the dataset. The second is the encoder part. In the encoder
part, the features of the image are extracted. The third is the decoder part. In the decoder part,
words are predicted recursively using LSTM [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] and an attention mechanism.
      </p>
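      <p>As an illustration, a single decoding step of such an attention-based LSTM decoder can be sketched in PyTorch as follows; the dimensions and module layout are assumptions for this sketch, not our exact implementation.</p>
      <preformat>
# Sketch of one decoding step: additive attention over the CNN feature map,
# followed by one LSTM step and a word prediction. Dimensions are illustrative.
import torch
import torch.nn as nn


class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim=2048, embed_dim=512, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.att_feat = nn.Linear(feat_dim, hidden_dim)
        self.att_hid = nn.Linear(hidden_dim, hidden_dim)
        self.att_out = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.fc = nn.Linear(hidden_dim, vocab_size)

    def forward(self, prev_word, features, h, c):
        # features: (batch, num_pixels, feat_dim) spatial feature map from the encoder.
        scores = self.att_out(torch.tanh(
            self.att_feat(features) + self.att_hid(h).unsqueeze(1))).squeeze(2)
        alpha = torch.softmax(scores, dim=1)                   # attention weights over locations
        context = (features * alpha.unsqueeze(2)).sum(dim=1)   # attended image context
        # One recursive LSTM step conditioned on the previous word and the context.
        h, c = self.lstm(torch.cat([self.embed(prev_word), context], dim=1), (h, c))
        return self.fc(h), alpha, h, c                         # scores for the next word
      </preformat>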
      <p>
        We have adopted Show, Attend and Tell as the base model. This model is known to have
high accuracy among Image Captioning models that do not use object detection, such as Faster
R-CNN [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ].
      </p>
      <sec id="sec-3-1">
        <title>4.1. Input Data Pre-processing</title>
        <sec id="sec-3-1-1">
          <title>4.1.1. Image Pre-processing</title>
          <p>Image pre-processing includes image normalization.</p>
          <p>The image processing consists of two steps. In the first step, we normalize images using
histogram equalization based on the luminance of the image. In the second step, we resize all
images to a size of 256 × 256.</p>
          <p>We have tried two ways to normalize the luminance distribution of an image. The first is
global histogram equalization, which smooths the luminance distribution
of the entire image. After equalization, the contrast of the image is enhanced and the image
becomes clearer. The second is adaptive histogram equalization. This method performs the
equalization described above on small regions of the image. In general, this
technique can reduce the occurrence of tone jumps. A comparison of the raw image and the
pre-processed image is shown in Figure 4.</p>
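          <p>A minimal sketch of these two normalization variants, assuming OpenCV and an illustrative file name, is given below.</p>
          <preformat>
# Sketch of the image pre-processing: global and adaptive histogram equalization
# of the luminance, followed by resizing to 256 x 256. The file name and the
# CLAHE parameters are illustrative assumptions.
import cv2

img = cv2.imread("example_radiology_image.png", cv2.IMREAD_GRAYSCALE)

# Variant 1: global histogram equalization over the whole image.
global_eq = cv2.equalizeHist(img)

# Variant 2: adaptive histogram equalization applied to small tiles of the image.
clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
adaptive_eq = clahe.apply(img)

# Both variants are resized to 256 x 256 before being fed to the encoder.
resized = cv2.resize(global_eq, (256, 256))
          </preformat>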
        </sec>
        <sec id="sec-3-1-2">
          <title>4.1.2. Text Pre-processing</title>
          <p>We preprocess the text by lowercasing the captions of the training
data and removing their sentence-ending periods. In general, the MS-COCO captioning task is not case-sensitive, and it is well known that
symbols such as periods are better removed. If a single image has multiple caption sentences,
only the period at the end of the last sentence is removed. As a result, any remaining period is
recognized as one of the words in the sentence, since it then appears only inside the caption.</p>
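          <p>A minimal sketch of this caption pre-processing is shown below; the example caption is illustrative.</p>
          <preformat>
# Sketch of the text pre-processing: lowercase the caption and remove only the
# period at the end of the sentence.
def preprocess_caption(caption):
    caption = caption.strip().lower()
    if caption.endswith("."):
        caption = caption[:-1]  # drop the sentence-ending period, keep internal ones
    return caption

print(preprocess_caption("Chest X-ray shows no acute abnormality."))
# chest x-ray shows no acute abnormality
          </preformat>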
        </sec>
      </sec>
      <sec id="sec-3-2">
        <title>4.2. Caption Data Expanding using EDA</title>
        <p>
          We tried EDA (Easy Data Augmentation) [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] to expand our text dataset. EDA is a
data augmentation technique for text classification tasks in natural language processing, and it is an effective method that works
well when the dataset is small. In a typical captioning task using MS-COCO, five captions are
provided for one image. However, in the ImageCLEF2021 dataset, only one caption per image
is provided. We have tested the effectiveness of this approach using various data expansion
methods in EDA.
        </p>
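        <p>A minimal sketch of EDA-style synonym replacement, assuming NLTK WordNet and an illustrative caption, is shown below; it is not the exact augmentation pipeline used in our experiments.</p>
        <preformat>
# Sketch of EDA synonym replacement: replace up to n words in a caption with
# WordNet synonyms to create an additional caption.
import random
from nltk.corpus import wordnet  # requires nltk.download("wordnet") beforehand


def synonym_replacement(caption, n=1):
    words = caption.split()
    candidates = [i for i, w in enumerate(words) if wordnet.synsets(w)]
    random.shuffle(candidates)
    for i in candidates[:n]:
        lemmas = {l.name().replace("_", " ")
                  for s in wordnet.synsets(words[i]) for l in s.lemmas()}
        lemmas.discard(words[i])
        if lemmas:
            words[i] = random.choice(sorted(lemmas))
    return " ".join(words)


# Each original caption can be expanded into one, two, or four augmented copies.
augmented = [synonym_replacement("axial ct image of the abdomen") for _ in range(2)]
        </preformat>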
      </sec>
      <sec id="sec-3-3">
        <title>4.3. Neural network model</title>
        <p>
          As a base neural network model for caption generation, we have adopted ”Show, Attend and
Tell” model [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ]. This model is capable of highly accurate captioning without using object
detection. The architecture of the two models is almost the same, but ours differs in that we
employ ResNet-101 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] instead of VGG16 [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] as the CNN encoder.
        </p>
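        <p>A minimal sketch of such an encoder, assuming torchvision and 256 × 256 input images, is given below.</p>
        <preformat>
# Sketch of the CNN encoder: a ResNet-101 with its classification head removed,
# so that it returns a spatial feature map for the attention-based decoder.
import torch.nn as nn
import torchvision


class ResNetEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet101(pretrained=True)
        # Drop the final average-pooling and fully connected layers.
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, images):
        # images: (batch, 3, 256, 256); the backbone yields (batch, 2048, 8, 8).
        feats = self.backbone(images)
        # Flatten to (batch, 64, 2048): one 2048-d vector per spatial location.
        return feats.flatten(2).permute(0, 2, 1)
        </preformat>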
      </sec>
    </sec>
    <sec id="sec-4">
      <title>5. Experiments and results</title>
      <sec id="sec-4-1">
        <title>5.1. Setting up hyper-parameters and performing pre-processing with validation data</title>
        <p>We experimented with hyper-parameter adjustment and image pre-processing using the training
and validation data. As noted in Section 4.1, all characters in the training caption data are lowercased.</p>
        <p>
          We set up the hyper-parameters as follows: a batch size of 32, the “Adam” optimizer
with a decoder learning rate of 0.001, and 200 epochs. For
the implementation, we employ PyTorch 1.7.1 [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ] as our deep learning framework. For the
evaluation of captioning, we utilize BLEU-4 [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Table 2 shows the results, comparing the
data pre-processing variants in terms of BLEU.
        </p>
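        <p>A minimal sketch of these settings, with a placeholder decoder module and illustrative caption lists, is shown below.</p>
        <preformat>
# Sketch of the training and evaluation settings: batch size 32, Adam with a
# decoder learning rate of 0.001, 200 epochs, and BLEU-4 for evaluation.
# The decoder and the caption lists here are stand-ins, not the actual data.
import torch
import torch.nn as nn
from nltk.translate.bleu_score import corpus_bleu

BATCH_SIZE = 32
EPOCHS = 200

decoder = nn.LSTMCell(512, 512)  # placeholder for the attention-based LSTM decoder
optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-3)

# BLEU-4: each image in this dataset has a single reference caption.
reference_captions = ["axial ct image of the abdomen"]
generated_captions = ["axial ct of the abdomen"]
references = [[ref.split()] for ref in reference_captions]
hypotheses = [hyp.split() for hyp in generated_captions]
bleu4 = corpus_bleu(references, hypotheses, weights=(0.25, 0.25, 0.25, 0.25))
print(bleu4)
        </preformat>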
      </sec>
      <sec id="sec-4-2">
        <title>5.2. The results with test data</title>
        <p>The test dataset consists of the test images distributed as described in Section 4.1. The test set
consists of 444 medical images without the correct-answer captions. In contrast to the text
pre-processing in Section 5.1, the captions used for training were all lowercased and the periods
at the end of sentences were deleted.</p>
        <p>Table 3 shows the BLEU results for the test data. In the experiments on the test data, the
BLEU score was highest when Histogram Normalization was used. Examples of our
seemingly successful caption generation results are shown in Fig. 5.</p>
        <p>Table 4 shows the BLEU scores for the EDA attempts. The pre-processing of the dataset uses
the method that achieved the highest BLEU score in Table 3. Using EDA’s synonym replacement
and other methods, we compare the cases of adding one, two, and four captions.
In all cases where data expansion was performed using EDA, the BLEU score dropped.</p>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>6. Conclusions</title>
      <p>We have described the system we submitted to the ImageCLEF 2021 Caption
Prediction task. In our system, we have performed our own data pre-processing and have attempted to
add data augmentation with EDA. In addition, two types of luminance histogram equalization and period
removal were applied as image and text pre-processing, respectively. The results demonstrate that these
processes have improved the caption prediction accuracy of the neural network model. EDA
turned out to be ineffective in this task. Finally, according to the organizers’ evaluation, we achieved a
BLEU score of 0.362 in the ImageCLEF 2021 Caption Prediction task, placing us 4th.</p>
    </sec>
    <sec id="sec-6">
      <title>Acknowledgment</title>
      <p>A part of this research was carried out with the support of a Grant for Education and Research at
Toyohashi University of Technology.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] T. Lin, M. Maire, S. J. Belongie, L. D. Bourdev, R. B. Girshick, J. Hays, P. Perona, D. Ramanan, P. Dollár, C. L. Zitnick, Microsoft COCO: Common objects in context, CoRR abs/1405.0312 (2014). URL: http://arxiv.org/abs/1405.0312. arXiv:1405.0312.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] B. Ionescu, H. Müller, R. Péteri, A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, S. Kozlovski, V. Liauchuk, Y. Dicente, V. Kovalev, O. Pelka, A. G. S. de Herrera, J. Jacutprakart, C. M. Friedrich, R. Berari, A. Tauteanu, D. Fichou, P. Brie, M. Dogariu, L. D. Ştefan, M. G. Constantin, J. Chamberlain, A. Campello, A. Clark, T. A. Oliver, H. Moustahfid, A. Popescu, J. Deshayes-Chossart, Overview of the ImageCLEF 2021: Multimedia retrieval in medical, nature, internet and social media applications, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] O. Pelka, A. Ben Abacha, A. García Seco de Herrera, J. Jacutprakart, C. M. Friedrich, H. Müller, Overview of the ImageCLEFmed 2021 concept &amp; caption prediction task, in: Experimental IR Meets Multilinguality, Multimodality, and Interaction, Proceedings of the 12th International Conference of the CLEF Association (CLEF 2021), LNCS Lecture Notes in Computer Science, Springer, Bucharest, Romania, 2021.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, CoRR abs/1409.1556 (2015).</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770-778. doi:10.1109/CVPR.2016.90.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] I. Sutskever, O. Vinyals, Q. V. Le, Sequence to sequence learning with neural networks, in: Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, NIPS'14, MIT Press, Cambridge, MA, USA, 2014, pp. 3104-3112.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, I. Polosukhin, Attention is all you need, CoRR abs/1706.03762 (2017). URL: http://arxiv.org/abs/1706.03762. arXiv:1706.03762.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] J. Devlin, M. Chang, K. Lee, K. Toutanova, BERT: Pre-training of deep bidirectional transformers for language understanding, CoRR abs/1810.04805 (2018). URL: http://arxiv.org/abs/1810.04805. arXiv:1810.04805.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] O. Vinyals, A. Toshev, S. Bengio, D. Erhan, Show and tell: A neural image caption generator, in: 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156-3164. doi:10.1109/CVPR.2015.7298935.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] K. Xu, J. Ba, R. Kiros, K. Cho, A. C. Courville, R. Salakhutdinov, R. S. Zemel, Y. Bengio, Show, attend and tell: Neural image caption generation with visual attention, CoRR abs/1502.03044 (2015). URL: http://arxiv.org/abs/1502.03044. arXiv:1502.03044.</mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>[11] P. Anderson, X. He, C. Buehler, D. Teney, M. Johnson, S. Gould, L. Zhang, Bottom-up and top-down attention for image captioning and VQA, CoRR abs/1707.07998 (2017). URL: http://arxiv.org/abs/1707.07998. arXiv:1707.07998.</mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>[12] A. Ben Abacha, M. Sarrouti, D. Demner-Fushman, S. A. Hasan, H. Müller, Overview of the VQA-Med task at ImageCLEF 2021: Visual question answering and generation in the medical domain, in: CLEF 2021 Working Notes, CEUR Workshop Proceedings, CEUR-WS.org, Bucharest, Romania, 2021.</mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>[13] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural Computation 9 (1997) 1735-1780.</mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>[14] S. Ren, K. He, R. B. Girshick, J. Sun, Faster R-CNN: Towards real-time object detection with region proposal networks, CoRR abs/1506.01497 (2015). URL: http://arxiv.org/abs/1506.01497. arXiv:1506.01497.</mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>[15] J. W. Wei, K. Zou, EDA: Easy data augmentation techniques for boosting performance on text classification tasks, CoRR abs/1901.11196 (2019). URL: http://arxiv.org/abs/1901.11196. arXiv:1901.11196.</mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>[16] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Köpf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, S. Chintala, PyTorch: An imperative style, high-performance deep learning library, CoRR abs/1912.01703 (2019). URL: http://arxiv.org/abs/1912.01703. arXiv:1912.01703.</mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>[17] K. Papineni, S. Roukos, T. Ward, W. Zhu, Bleu: A method for automatic evaluation of machine translation, in: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Philadelphia, Pennsylvania, USA, 2002, pp. 311-318. URL: https://www.aclweb.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>