<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Ensemble of Streamlined Bilinear Visual Question Answering Models for the ImageCLEF 2019 Challenge in the Medical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Minh H. Vu</string-name>
          <email>minh.vu@umu.se</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Raphael Sznitman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tufve Nyholm</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tommy Lofstedt</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>ARTORG Center, University of Bern</institution>
          ,
          <country country="CH">Switzerland</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Umeå University</institution>
          ,
          <addr-line>901 87 Umeå</addr-line>
          ,
          <country country="SE">Sweden</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the contribution by participants from Umeå University, Sweden, in collaboration with the University of Bern, Switzerland, for the Medical Domain Visual Question Answering challenge hosted by ImageCLEF 2019. We propose a novel Visual Question Answering approach that leverages a bilinear model to aggregate and synthesize extracted image and question features. While we did not make use of any additional training data, our model used an attention scheme to focus on the relevant input context and was further boosted by using an ensemble of trained models. We show here that the proposed approach performs at state-of-the-art levels, and provides an improvement over several existing methods. The proposed method was ranked 3rd in the Medical Domain Visual Question Answering challenge of ImageCLEF 2019.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>Deep learning (DL) has dramatically reshaped the state-of-the-art in computer
vision, natural language processing (NLP), and many other domains. This is the
case within medical image analysis as well. With exceptional outcomes for various
diagnostic and prognostic tasks, DL has attracted the attention of the medical
community. The hope is that DL will improve results or provide automated tools
that can support clinical decision making, for example in the Visual Question
Answering (VQA) task.</p>
      <p>
        VQA is a complex multimodal task that aims at answering a question about
an image. Here, a system needs to fathom both the image and the question in order
to correctly answer the question. Most recent VQA methods consist of artificial
neural networks trained to answer a question regarding a given image [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. Such
models incorporate: (1) a question model encoding the question input, (2) an
image model extracting visual features from the input image, (3) a fusion scheme
that combines the image and question features, and (4) a classifier that uses the
combined features to select the most likely answer.
      </p>
      <p>Figure 1: Examples of questions (Q) and answers (A) from the ImageCLEF-VQA-Med dataset.
Q: "Is this a contrast or noncontrast CT?" A: "Noncontrast".
Q: "Is this an MRI image?" A: "Yes".
Q: "Is this image modality T1, T2, or FLAIR?" A: "T2".
Q: "What is most alarming about this MRI?" A: "Dermoid cyst".</p>
    </sec>
    <sec id="sec-7">
      <p>
        ImageCLEF [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] aims to meet the global community's need for reusable
resources for benchmarking the cross-language annotation and retrieval of
images. In 2019, ImageCLEF had four main tasks: lifelogging, medicine, nature,
and security. With the purpose of providing a "second opinion" for clinicians
on complex medical images and offering patients an economical way to monitor
their disease status, ImageCLEF organizes a medical domain VQA challenge,
called ImageCLEF-VQA-Med [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] (see examples in Figure 1).
      </p>
      <p>
        In the present work, we describe the model that we developed for the
ImageCLEF-VQA-Med 2019 challenge. First, we present a novel fusion scheme for questions
and images. Second, we introduce an image preprocessing step that suppresses
unwanted distortions to enhance the quality of the ImageCLEF-VQA-Med
images before they are fed into a Convolutional Neural Network (CNN) for image
feature extraction. Third, we propose to utilize a pre-trained Bidirectional
Encoder Representations from Transformers (BERT) model [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to extract the
question features. Last, we present an ensemble of VQA models that gave a large
boost in the evaluation metrics on both the validation and test sets.
      </p>
      <sec id="sec-7-1">
        <title>2 Related Work</title>
        <p>
          Since most existing VQA methods use standard embedding models for text [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]
and standard CNNs to extract image features [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], the research focus has largely
been on fusion strategies that combine information from both input sources [
          <xref ref-type="bibr" rid="ref10 ref3 ref6">6,10,3</xref>
          ].
Recently, attention schemes have also been introduced in VQA models in order
to focus the trained models towards question-guided evaluations. The review
paper of Kafle et al. [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ] offers a comprehensive overview of recent VQA models.
        </p>
        <p>
          An image model is used to extract visual features from the input images. Most
recent VQA models use CNNs, often ones that are pre-trained on e.g. the
ImageNet dataset [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ]. Popular choices for the image model include VGGNet [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ],
GoogLeNet [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], and ResNet [
          <xref ref-type="bibr" rid="ref10 ref3 ref6">6,10,3</xref>
          ]. Multimodal Compact Bilinear (MCB) [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ],
Multimodal Low-rank Bilinear (MLB) [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ], and Multimodal Tucker Fusion for
Visual Question Answering (MUTAN) [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] are current VQA methods that employ
bilinear transformations to encode the image and the question. Like these methods, we used
a ResNet-152 model pre-trained on the ImageNet dataset to extract
visual features.
        </p>
        <p>
          Common models employed to extract question features include Long
Short-term Memory (LSTM) [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], Gated Recurrent Units (GRU) [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ], and Skip-thought
vectors [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]. Skip-thought vectors are produced by a powerful unsupervised encoder-decoder
approach that has been used in many recent VQA models [
          <xref ref-type="bibr" rid="ref10 ref3 ref6">6,10,3</xref>
          ]. In the present
work, we not only used Skip-thought vectors but also evaluated the use of a
pre-trained BERT model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to extract question features. The BERT model has recently
obtained state-of-the-art results on a wide variety of NLP tasks.
        </p>
        <p>
          Attention mechanisms have led to breakthroughs in many NLP applications,
for example, in neural machine translation [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ], and in computer vision, such as
in image classification [
          <xref ref-type="bibr" rid="ref17">17</xref>
          ]. Propelled by the remarkable success accomplished
by attention mechanisms in computer vision and NLP, numerous VQA models
have employed attention schemes to improve predictions.
        </p>
      </sec>
      <sec id="sec-7-2">
        <title>3 Proposed Approach</title>
        <p>
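          For the task of VQA [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ], we are interested in predicting the most likely answer,
<inline-formula><tex-math>\hat{a}</tex-math></inline-formula>, given a question, <inline-formula><tex-math>q</tex-math></inline-formula>, about an image, <inline-formula><tex-math>v</tex-math></inline-formula>. The problem can be stated as
<disp-formula><label>(1)</label><tex-math>\hat{a} = \arg\max_{a \in \mathcal{A}} P(a \mid q, v; \Theta),</tex-math></disp-formula>
where <inline-formula><tex-math>\mathcal{A}</tex-math></inline-formula> is the set of possible answers and <inline-formula><tex-math>\Theta</tex-math></inline-formula> denotes all model parameters.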
        </p>
        <p>
          Figure 2 illustrates the proposed method. It uses pre-trained networks to
extract image and question features (in red and green, respectively), and feeds
them to a fusion scheme. These features are combined using an attention
mechanism [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ] (orange) to compute global image features, <inline-formula><tex-math>\tilde{v}</tex-math></inline-formula>. We propose an efficient
bilinear transformation that takes two inputs, the global image features and the global
question features, <inline-formula><tex-math>\tilde{q}</tex-math></inline-formula>, and yields a single latent feature vector, <inline-formula><tex-math>\tilde{f}</tex-math></inline-formula>, that is then
linearly mapped to the answer vector (white) to generate the output. The proposed
bilinear fusion scheme is further described in the following section.
        </p>
        <sec id="sec-7-2-1">
          <title>3.1 Proposed Method</title>
          <p>
            To encode questions and images, we first make use of a multi-glimpse attention
mechanism [
            <xref ref-type="bibr" rid="ref6">6</xref>
            ] to compute global image features, <inline-formula><tex-math>\tilde{v} = [\omega_1^{T}, \ldots, \omega_G^{T}]^{T} \in \mathbb{R}^{KG}</tex-math></inline-formula>,
where <inline-formula><tex-math>K</tex-math></inline-formula> denotes the dimension of the identity core tensor, which is decomposed
using Tucker Decomposition in the attention scheme (see [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] for more details),
and <inline-formula><tex-math>G</tex-math></inline-formula> is the number of glimpses.
          </p>
          <p>https://github.com/facebook/fb.resnet.torch</p>
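          <p>Figure 2: The proposed method. Figure labels: ResNet-152; question model; example question "What is the modality?"; 14 × 14 × 2048; attention scheme; <inline-formula><tex-math>\tilde{v}</tex-math></inline-formula>; <inline-formula><tex-math>J</tex-math></inline-formula>; Dropout-Linear-ReLU; <inline-formula><tex-math>\tilde{q}</tex-math></inline-formula>; Bilinear-ReLU; <inline-formula><tex-math>\tilde{f}</tex-math></inline-formula>; Dropout-Linear; 1700; example answer "Nuclear medicine".</p>
          <p>The global question features are computed as
<disp-formula><label>(2)</label><tex-math>\tilde{q} = \mathrm{ReLU}\left(W^{q} q + b^{q}\right),</tex-math></disp-formula>
where <inline-formula><tex-math>\tilde{q} \in \mathbb{R}^{KG}</tex-math></inline-formula> and <inline-formula><tex-math>q \in \mathbb{R}^{J}</tex-math></inline-formula> are the question features, and <inline-formula><tex-math>W^{q} \in \mathbb{R}^{KG \times J}</tex-math></inline-formula> and
<inline-formula><tex-math>b^{q} \in \mathbb{R}^{KG}</tex-math></inline-formula> denote the weight and bias terms, respectively. ReLU is the rectified
linear unit activation function.</p>
          <p>Given these, the output features of the proposed model are encoded as
<disp-formula><label>(3)</label><tex-math>\tilde{f}_i = \mathrm{ReLU}\left(\sum_{j=1}^{KG} \sum_{k=1}^{KG} \tilde{q}_j w^{f}_{ijk} \tilde{v}_k + b^{f}_i\right) = \mathrm{ReLU}\left(\tilde{q}^{T} W^{f}_i \tilde{v} + b^{f}_i\right),</tex-math></disp-formula>
where <inline-formula><tex-math>\tilde{f} \in \mathbb{R}^{K}</tex-math></inline-formula>, and <inline-formula><tex-math>W^{f}_i \in \mathbb{R}^{KG \times KG}</tex-math></inline-formula> and <inline-formula><tex-math>b^{f}_i \in \mathbb{R}</tex-math></inline-formula> denote the weight and bias terms
in the bilinear scheme, respectively.</p>
          <p>The probabilities of each target answer over all possible target answers are
then written as
<disp-formula><label>(4)</label><tex-math>f = \mathrm{SoftMax}\left(W^{a} \tilde{f} + b^{a}\right),</tex-math></disp-formula>
where <inline-formula><tex-math>f \in \mathbb{R}^{N}</tex-math></inline-formula>, and <inline-formula><tex-math>W^{a} \in \mathbb{R}^{N \times K}</tex-math></inline-formula> and <inline-formula><tex-math>b^{a} \in \mathbb{R}^{N}</tex-math></inline-formula> denote the weight and bias
terms, respectively.</p>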
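          <p>The fusion step described above (question projection, bilinear interaction, and softmax over the answers) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the random weights, and the toy dimensions are hypothetical, and dropout is omitted.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def softmax(x):
    # Subtract the maximum for numerical stability.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def bilinear_fusion(q, v_tilde, Wq, bq, Wf, bf, Wa, ba):
    """Forward pass of the fusion scheme for a single example.

    Shapes: q (J,), v_tilde (KG,), Wq (KG, J), bq (KG,),
    Wf (K, KG, KG), bf (K,), Wa (N, K), ba (N,).
    """
    # Project the question features to the same size as v_tilde.
    q_tilde = relu(Wq @ q + bq)
    # One bilinear form q_tilde^T Wf[i] v_tilde per output unit i.
    f_tilde = relu(np.einsum("j,ijk,k->i", q_tilde, Wf, v_tilde) + bf)
    # Map the latent features to answer probabilities.
    return softmax(Wa @ f_tilde + ba)


# Hypothetical toy sizes; the real model is much larger.
rng = np.random.default_rng(0)
J, K, G, N = 6, 4, 2, 5
KG = K * G
probs = bilinear_fusion(
    q=rng.normal(size=J),
    v_tilde=rng.normal(size=KG),
    Wq=rng.normal(size=(KG, J)),
    bq=np.zeros(KG),
    Wf=rng.normal(size=(K, KG, KG)),
    bf=np.zeros(K),
    Wa=rng.normal(size=(N, K)),
    ba=np.zeros(N),
)
predicted_answer = int(np.argmax(probs))
```
]]></preformat>
          <p>Taking the index of the largest probability corresponds to selecting the most likely answer from the set of possible answers.</p>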
        </sec>
        <sec id="sec-7-2-2">
          <title>3.2 Implementation Details and Training</title>
          <p>The proposed method, illustrated in Figure 2, contains three different
components: an image model (see Section 4.1), a question model (see Section 4.2), and
the proposed fusion model with an attention mechanism. The implementation
and training details of the latter are discussed below.</p>
          <p>
            To implement the attention mechanism, we followed the description in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ].
We used the Adam optimizer [
            <xref ref-type="bibr" rid="ref11">11</xref>
            ] with a learning rate of 0.0001 and a batch size of
128, and used a dropout rate of 0.5 for all linear and bilinear layers. We trained
the proposed model for 100 epochs on an Nvidia GTX 1080 Ti GPU; the training
time for the whole network with the attention scheme was around 1.5 hours.
          </p>
        </sec>
      </sec>
      <sec id="sec-7-3">
        <title>4 Ensemble of Multiple Models</title>
        <p>We employed ensemble learning to build a committee from a collection of trained
VQA models, each of which casts a weighted vote for the predicted answer, in order to
use the wisdom of the crowd to produce better predictions.</p>
        <sec id="sec-7-3-1">
          <title>4.1 Image Model</title>
          <p>We preprocessed and augmented the images before passing them through the
pre-trained ResNet-152 model to extract image features.</p>
          <p>To remove unwanted outer areas (text and/or background) from an image,
we applied the following sequence of image processing techniques:
(1) Normalize the intensities of the input image to 0-255.
(2) Apply Otsu's method to binarize the normalized image using a threshold of
5.
(3) Apply an open operation on the thresholded image with a rectangular
structuring element of size 40 × 40.
(4) Fill the holes of the binary image.
(5) Remove all connected components, except the two largest ones.
(6) Compute a bounding-box of the foreground.
(7) Crop the image to the bounding box.
(8) Apply an open operation with a rectangular structuring element of size 50 × 50.
(9) Crop the normalized image to the enlarged bounding box.
(10) Multiply the results from steps (8) and (9) to obtain a cropped image.
(11) Resize the cropped image to 448 × 448.
(12) Z-normalize the resized image.</p>
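          <p>A reduced version of this pipeline can be sketched in NumPy. The sketch below covers only steps (1), (2), (6), (7) and (12); it replaces Otsu's method with a fixed threshold and leaves out the morphological operations, hole filling, connected-component filtering and resizing. The function name and the toy image are hypothetical.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def crop_to_foreground(image, threshold=5.0):
    """Simplified sketch of steps (1), (2), (6), (7) and (12)."""
    image = np.asarray(image, dtype=np.float64)

    # Step (1): rescale the intensities to the range 0-255.
    lo, hi = image.min(), image.max()
    scale = hi - lo if hi > lo else 1.0
    normalized = (image - lo) / scale * 255.0

    # Step (2), simplified: a fixed threshold instead of Otsu's method.
    mask = normalized > threshold
    if not mask.any():
        raise ValueError("no foreground found")

    # Steps (6)-(7): bounding box of the foreground, then crop.
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    cropped = normalized[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

    # Step (12): z-normalize the cropped image.
    std = cropped.std()
    return (cropped - cropped.mean()) / (std if std > 0 else 1.0)


# Toy image: dark background with a bright 3 x 4 block.
toy = np.zeros((10, 12))
toy[2:5, 3:7] = np.arange(12).reshape(3, 4) + 100.0
result = crop_to_foreground(toy)
```
]]></preformat>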
          <p>To improve generalization, data augmentation was applied to the pre-processed
images before they were sent to the network. We used two types
of data augmentation: (i) rotating the image by a randomly selected number of
degrees from the range [−20, 20], and (ii) randomly scaling the image size using a
scaling factor in the range [0.9, 1.1].</p>
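          <p>Drawing the augmentation parameters could look like the sketch below. It assumes that both the angle and the scaling factor are sampled uniformly, which the text does not state, and the helper names are hypothetical. Applying the rotation and rescaling to the pixels is left to an image library.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import random


def sample_augmentation(rng):
    """Draw one rotation angle (in degrees) and one scaling factor."""
    angle = rng.uniform(-20.0, 20.0)   # assumed uniform over [-20, 20]
    scale = rng.uniform(0.9, 1.1)      # assumed uniform over [0.9, 1.1]
    return angle, scale


def scaled_size(height, width, scale):
    """Image size after scaling, rounded to whole pixels."""
    return max(1, round(height * scale)), max(1, round(width * scale))


rng = random.Random(0)  # fixed seed only to make the example repeatable
params = [sample_augmentation(rng) for _ in range(1000)]
```
]]></preformat>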
        </sec>
        <sec id="sec-7-3-2">
          <title>4.2 Question Model</title>
          <p>We evaluated the use of Skip-thought vectors and a pre-trained BERT model
for extracting the question features. These features were then used in the VQA
models (see Table 2).</p>
          <p>
            We used the same preprocessing techniques for the questions as were used
in [
            <xref ref-type="bibr" rid="ref3 ref6">3,6</xref>
            ]. These were: (i) removing the punctuation marks, and (ii) converting to
lower-case.
          </p>
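          <p>The two question preprocessing steps can be sketched with the Python standard library. The function name is hypothetical, and collapsing repeated whitespace is an extra convenience that the text does not mention.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import string


def preprocess_question(question):
    """Remove punctuation marks and convert to lower case."""
    no_punct = question.translate(str.maketrans("", "", string.punctuation))
    # Collapsing whitespace is an assumption, not part of the description.
    return " ".join(no_punct.lower().split())


cleaned = preprocess_question("Is this image modality T1, T2, or FLAIR?")
```
]]></preformat>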
          <p>
            To overcome the challenge of seeing new words in the medical domain, we
employed a Word2Vec model trained on the Google News dataset that includes
three million word vectors. We then used a linear regression model without
regularization to map the Word2Vec embeddings to the Skip-thought embedding space [
            <xref ref-type="bibr" rid="ref12">12</xref>
            ]. This
enabled our Skip-thought vectors to generate 2,400-dimensional word features.
          </p>
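          <p>The vocabulary expansion can be illustrated with an ordinary least squares fit. The sketch below uses synthetic vectors with small, hypothetical dimensions and fits no intercept term, which is an assumption; it only shows how a mapping learned on words shared by both vocabularies can be applied to a word that the Skip-thought vocabulary lacks.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def fit_linear_map(word2vec_vectors, skipthought_vectors):
    """Unregularized least squares: find W minimizing ||X W - Y||^2."""
    W, *_ = np.linalg.lstsq(word2vec_vectors, skipthought_vectors, rcond=None)
    return W


# Synthetic stand-ins for the embeddings of words found in both vocabularies.
rng = np.random.default_rng(0)
X_shared = rng.normal(size=(50, 8))   # Word2Vec side (toy dimension 8)
W_true = rng.normal(size=(8, 5))
Y_shared = X_shared @ W_true          # Skip-thought side (toy dimension 5)

W = fit_linear_map(X_shared, Y_shared)

# A word known only to Word2Vec is mapped into the Skip-thought space.
new_word_w2v = rng.normal(size=8)
new_word_embedding = new_word_w2v @ W
```
]]></preformat>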
          <p>
            BERT is a deep bidirectional transformer encoder that has obtained new
state-of-the-art results on multiple NLP tasks [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. The BERT model was
pre-trained for general-purpose language understanding on a large text corpus, called
WikiText-103, on two unique tasks: Masked Language Model (MLM) and Next
Sentence Prediction (NSP) [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ]. We employed two pre-trained BERT models: (i)
bert-base-multilingual-uncased, and (ii) bert-base-multilingual-cased, to extract
question features. For each pre-trained model, we used a feature-based approach
by generating ELMo-like [
            <xref ref-type="bibr" rid="ref14">14</xref>
            ] pre-trained contextual representations using two
methods: (i) Second-to-Last Hidden (768-dimensional), and (ii) Concat Last
Four Hidden (3,072-dimensional) (see Table 7 in [
            <xref ref-type="bibr" rid="ref5">5</xref>
            ] for details).
          </p>
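          <p>The two feature-extraction variants differ only in which encoder layers are read out. The sketch below uses random arrays as a stand-in for the per-layer hidden states of a 12-layer, 768-unit encoder; the function names and the concatenation order are assumptions, and the way token features are pooled into a single question vector is not covered.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def second_to_last_hidden(hidden_states):
    """Token features taken from the second-to-last layer."""
    return hidden_states[-2]


def concat_last_four_hidden(hidden_states):
    """Token features from the last four layers, concatenated."""
    return np.concatenate(hidden_states[-4:], axis=-1)


# Stand-in for encoder outputs: 12 layers, 7 tokens, 768 hidden units.
rng = np.random.default_rng(0)
hidden_states = [rng.normal(size=(7, 768)) for _ in range(12)]

features_a = second_to_last_hidden(hidden_states)    # 768 per token
features_b = concat_last_four_hidden(hidden_states)  # 4 * 768 = 3072 per token
```
]]></preformat>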
        </sec>
        <sec id="sec-7-3-3">
          <title>4.3 Fusion Model with Attention Mechanism</title>
          <p>In addition to the proposed model, the ensemble contained MLB and MUTAN
models, for which we used freely available PyTorch code.</p>
          <p>
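            We integrated ten different MLB models [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] in the ensemble (see Table 2).
To prevent overfitting, we reduced the dimension of the identity core tensor, K,
to 64, 100 and 200 (the original value was K = 1,200, see [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ]). Furthermore, we
replaced all hyperbolic tangent (tanh) activations by ReLU activation functions.
          </p>
          <p>
            Software and data sources:
Skip-thought vectors, <ext-link ext-link-type="uri" xlink:href="https://github.com/Cadene/skip-thought.torch/tree/master/pytorch">https://github.com/Cadene/skip-thought.torch/tree/master/pytorch</ext-link>;
WikiText-103, <ext-link ext-link-type="uri" xlink:href="https://blog.einstein.ai/the-wikitext-long-term-dependency-languagemodeling-dataset/">https://blog.einstein.ai/the-wikitext-long-term-dependency-languagemodeling-dataset/</ext-link>;
pre-trained BERT models, <ext-link ext-link-type="uri" xlink:href="https://github.com/huggingface/pytorch-pretrained-BERT">https://github.com/huggingface/pytorch-pretrained-BERT</ext-link>;
MLB and MUTAN code, <ext-link ext-link-type="uri" xlink:href="https://github.com/Cadene/vqa.pytorch">https://github.com/Cadene/vqa.pytorch</ext-link>.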
          </p>
          <p>
            We employed five different versions of the MUTAN architecture [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ] in the
ensemble model (see Table 2). All hyper-parameters were set as in [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ]. As with
the MLB model, all hyperbolic tangent activations were replaced by ReLU
activation functions.
          </p>
          <p>Both MLB and MUTAN were trained to minimize the categorical
cross-entropy loss using the Adam optimizer with a learning rate of 0.0001 and
exponential decay rates of <inline-formula><tex-math>\beta_1 = 0.9</tex-math></inline-formula> and <inline-formula><tex-math>\beta_2 = 0.999</tex-math></inline-formula>. As in the proposed model, the batch
size was 128 and the models were trained for 100 epochs. As for the proposed
method, the training time for both the MLB and the MUTAN models on an
Nvidia GTX 1080 Ti GPU was about 1.5 hours.</p>
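          <p>To show how the learning rate and the two exponential decay rates enter the optimizer, a single Adam update for one scalar parameter is sketched below. The function name and the value of the small constant eps are assumptions (1e-8 is a common default); in practice the optimizer of a deep learning framework would be used.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math


def adam_step(theta, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter at step t (starting at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1.0 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1.0 - beta1 ** t)                # bias corrections
    v_hat = v / (1.0 - beta2 ** t)
    theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)
    return theta, m, v


theta, m, v = 1.0, 0.0, 0.0
theta, m, v = adam_step(theta, grad=2.0, m=m, v=v, t=1)
```
]]></preformat>
          <p>After the first step the parameter has moved by almost exactly the learning rate, because the bias-corrected ratio of the two moment estimates is close to one in magnitude.</p>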
        </sec>
        <sec id="sec-7-3-4">
          <title>4.4 Ensemble Model</title>
          <p>By varying the pre-trained question models and a few hyper-parameters of the
fusion schemes, we trained more than 40 base models separately on the training
set. We then evaluated their performance on the validation set to select the top
26 performing models (see Table 2), and built ensemble models using those 26
models.</p>
          <p>To generate the outputs for the test set, we trained the 26 aforementioned
models on the concatenation of the training and validation sets with the aim of
making the networks learn a wider range of answers.</p>
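          <p>We then used two ensemble techniques: the average,
<disp-formula><label>(5)</label><tex-math>\tilde{a} = \frac{1}{M} \sum_{m=1}^{M} f_m,</tex-math></disp-formula>
and the weighted average,
<disp-formula><label>(6)</label><tex-math>\tilde{a}_{\mathrm{weighted}} = \frac{\sum_{m=1}^{M} w_m f_m}{\sum_{m=1}^{M} w_m},</tex-math></disp-formula>
where <inline-formula><tex-math>\tilde{a}, \tilde{a}_{\mathrm{weighted}} \in \mathbb{R}^{N}</tex-math></inline-formula> are the output probability vectors over the answers, and
<inline-formula><tex-math>f_m \in \mathbb{R}^{N}</tex-math></inline-formula> is the answer vector corresponding to model <inline-formula><tex-math>m</tex-math></inline-formula> that was computed
by Equation 4. Here, <inline-formula><tex-math>M</tex-math></inline-formula> is the number of models, and <inline-formula><tex-math>w_m \in \mathbb{R}</tex-math></inline-formula> is the weight
corresponding to the performance of the <inline-formula><tex-math>m</tex-math></inline-formula>th model (computed as the mean
accuracy over the last 21 epochs on the validation set, as seen in Table 2).</p>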
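          <p>The two ensemble techniques can be sketched as follows. The model outputs and weights are made-up numbers, the function names are hypothetical, and dividing by the sum of the weights is an assumption about how the weighted average is normalized.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import numpy as np


def ensemble_average(model_outputs):
    """Unweighted mean of the per-model probability vectors."""
    return np.mean(model_outputs, axis=0)


def ensemble_weighted_average(model_outputs, weights):
    """Mean of the probability vectors, weighted by model performance."""
    weights = np.asarray(weights, dtype=np.float64)
    return weights @ np.asarray(model_outputs) / weights.sum()


# Three hypothetical models, four candidate answers.
outputs = np.array([
    [0.6, 0.2, 0.1, 0.1],
    [0.3, 0.5, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
])
weights = [0.60, 0.55, 0.50]  # e.g. mean validation accuracies

avg = ensemble_average(outputs)
weighted = ensemble_weighted_average(outputs, weights)
answer_avg = int(np.argmax(avg))
answer_weighted = int(np.argmax(weighted))
```
]]></preformat>
          <p>With equal weights the weighted average reduces to the plain average, which is a quick sanity check for an implementation.</p>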
        </sec>
      </sec>
      <sec id="sec-7-4">
        <title>5 Experiments</title>
        <p>
          In this section, we detail the ImageCLEF-VQA-Med dataset, and compare the
proposed method to MLB [
          <xref ref-type="bibr" rid="ref10">10</xref>
          ] and MUTAN [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ] when applied on the validation
set. In addition, we discuss the results of the ensemble model on the test set.
        </p>
        <p>Table 1: Performance of the proposed method compared to MLB and MUTAN on the validation set. Column headers: K, G, Mean, SE. Surviving values: K: n.a., 100, 200, 1200, 64; G: 2, 8, 4, 4, 8; Mean: 58.35.</p>
        <p>The ImageCLEF-VQA-Med data were partitioned into three sets: (i) a training
set of 3,200 images with 12,792 Question &amp; Answer (QA) pairs, (ii) a
validation set of 500 images with 2,000 QA pairs, and (iii) a test set of 500 images
with 500 questions. Unlike in previous challenges, the organizers of
ImageCLEF-VQA-Med 2019 categorized the questions into four groups: Modality,
Plane, Organ System, and Abnormality. The task was to answer the questions
about the medical images in the test set as correctly as possible.</p>
        <p>The evaluation metrics were: (i) strict accuracy, defined as the
percentage of correctly classified predictions, and (ii) the Bilingual Evaluation Understudy
(BLEU) score, which measures the similarity between n-grams of the ground truth
answers and the corresponding predictions.</p>
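        <p>Strict accuracy can be sketched in a few lines. The function name and the example answers are hypothetical, and the sketch assumes exact string matching without any further normalization, which may differ from the official evaluation script.</p>
        <preformat preformat-type="code"><![CDATA[
```python
def strict_accuracy(predictions, references):
    """Percentage of predictions that exactly match the reference answer."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no answers to evaluate")
    correct = sum(p == r for p, r in zip(predictions, references))
    return 100.0 * correct / len(references)


accuracy = strict_accuracy(
    ["t2", "yes", "noncontrast", "axial"],
    ["t2", "yes", "contrast", "axial"],
)
```
]]></preformat>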
        <sec id="sec-7-4-1">
          <title>5.2 Results and Discussion</title>
          <p>
            Table 1 compares the performance of the proposed method to MLB [
            <xref ref-type="bibr" rid="ref10">10</xref>
            ] and
MUTAN [
            <xref ref-type="bibr" rid="ref3">3</xref>
            ], while Table 2 shows the mean and standard error of the
accuracies of the last 21 epochs of the 26 best-performing methods on the validation
set. From Table 1 and Table 2 we see that: (1) the proposed method performs
better than state-of-the-art methods on the ImageCLEF-VQA-Med dataset, (2)
bert-base-multilingual-uncased gives better question representations than
bert-base-multilingual-cased does, and (3) the question features extracted by the
pre-trained BERT models are as good as those produced by the Skip-thought vectors.
          </p>
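          <p>The per-model summary statistics can be computed as in the sketch below. The accuracy history is invented for illustration, the function name is hypothetical, and the use of the sample standard deviation (dividing by n − 1) is an assumption.</p>
          <preformat preformat-type="code"><![CDATA[
```python
import math
import statistics


def mean_and_standard_error(accuracies, last_n=21):
    """Mean and standard error over the final `last_n` epochs."""
    tail = list(accuracies)[-last_n:]
    if len(tail) < 2:
        raise ValueError("need at least two epochs")
    mean = statistics.mean(tail)
    # Sample standard deviation divided by sqrt(n); an assumed convention.
    standard_error = statistics.stdev(tail) / math.sqrt(len(tail))
    return mean, standard_error


# Hypothetical validation accuracies for 100 epochs.
history = [50.0] * 79 + [58.0, 60.0] * 10 + [59.0]
mean, se = mean_and_standard_error(history)
```
]]></preformat>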
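          <p>There are two possible explanations for why the proposed model outperforms
MLB and MUTAN. First, the ReLU overcomes the vanishing gradient
problem that hyperbolic tangent activation functions suffer from. It thus allows the
proposed model to learn faster, and therefore it may perform better. Second,
by using a bilinear transformation to combine the global question and image
features, instead of the inner product operation that is used in MLB and
MUTAN, the proposed method considers every possible combination of elements
from the two aforementioned features, and thus becomes more capable of learning a
larger range of answers.</p>
          <table-wrap>
            <label>Table 2</label>
            <caption>
              <p>The 26 best-performing models on the validation set, with their fusion scheme, question model, dimension of the identity core tensor (K), and number of glimpses (G). The Mean and SE values (mean and standard error of the accuracy over the last 21 epochs) are missing.</p>
            </caption>
            <table>
              <thead>
                <tr><th>Fusion</th><th>Question</th><th>K</th><th>G</th></tr>
              </thead>
              <tbody>
                <tr><td>MUTAN [3]</td><td>bert-cased</td><td>n.a.</td><td>2</td></tr>
                <tr><td>MUTAN [3]</td><td>bert-uncased</td><td>n.a.</td><td>2</td></tr>
                <tr><td>MUTAN [3]</td><td>bert-cased</td><td>n.a.</td><td>2</td></tr>
                <tr><td>MUTAN [3]</td><td>bert-uncased</td><td>n.a.</td><td>2</td></tr>
                <tr><td>MUTAN [3]</td><td>skip-thought</td><td>n.a.</td><td>2</td></tr>
                <tr><td>MLB [10]</td><td>bert-cased</td><td>200</td><td>4</td></tr>
                <tr><td>MLB [10]</td><td>bert-cased</td><td>200</td><td>4</td></tr>
                <tr><td>MLB [10]</td><td>bert-cased</td><td>100</td><td>8</td></tr>
                <tr><td>MLB [10]</td><td>bert-cased</td><td>100</td><td>8</td></tr>
                <tr><td>MLB [10]</td><td>bert-uncased</td><td>100</td><td>8</td></tr>
                <tr><td>MLB [10]</td><td>bert-uncased</td><td>200</td><td>4</td></tr>
                <tr><td>MLB [10]</td><td>bert-uncased</td><td>200</td><td>4</td></tr>
                <tr><td>MLB [10]</td><td>skip-thought</td><td>200</td><td>4</td></tr>
                <tr><td>MLB [10]</td><td>skip-thought</td><td>100</td><td>8</td></tr>
                <tr><td>MLB [10]</td><td>bert-uncased</td><td>100</td><td>8</td></tr>
                <tr><td>Proposed</td><td>bert-uncased</td><td>200</td><td>4</td></tr>
                <tr><td>Proposed</td><td>bert-cased</td><td>200</td><td>4</td></tr>
                <tr><td>Proposed</td><td>bert-uncased</td><td>100</td><td>8</td></tr>
                <tr><td>Proposed</td><td>bert-cased</td><td>200</td><td>4</td></tr>
                <tr><td>Proposed</td><td>bert-cased</td><td>100</td><td>8</td></tr>
                <tr><td>Proposed</td><td>skip-thought</td><td>200</td><td>4</td></tr>
                <tr><td>Proposed</td><td>bert-cased</td><td>100</td><td>8</td></tr>
                <tr><td>Proposed</td><td>bert-uncased</td><td>200</td><td>4</td></tr>
                <tr><td>Proposed</td><td>skip-thought</td><td>100</td><td>8</td></tr>
                <tr><td>Proposed</td><td>bert-uncased</td><td>100</td><td>8</td></tr>
                <tr><td>Proposed</td><td>skip-thought</td><td>64</td><td>8</td></tr>
              </tbody>
            </table>
          </table-wrap>
        </sec>
      </sec>
    </sec>
    <sec id="sec-9">
      <p>[...] in a 1 % improvement on strict accuracy, which is consistent with the literature
results of using ensembles.</p>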
      <p>Our best performing model, which achieved a strict accuracy of 61.60 and
a BLEU score of 63.89 on the validation set, was the ensemble of 11 proposed
models (see Table 2). This ensemble model also performed the best on the test set
(61.60 accuracy and 63.89 BLEU score), and won 3rd place in the
ImageCLEF-VQA-Med 2019 challenge without using additional training data.</p>
      <sec id="sec-9-1">
        <title>6 Conclusion</title>
        <p>
          We have presented a novel fusion scheme for the VQA task. The proposed
approach was shown to perform better than current methods in the
ImageCLEF-VQA-Med 2019 challenge. In addition, we introduced an image preprocessing
pipeline and utilized a pre-trained BERT model [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] to extract question features
for further processing. Last, we presented an ensemble method that boosted the
performance.
        </p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019</article-title>
          .
          <source>In: CLEF2019 Working Notes. CEUR Workshop Proceedings (CEURWS.org)</source>
          ,
          <source>ISSN 1613-0073</source>
          , http://ceur-ws.
          <source>org/</source>
          Vol-
          <volume>2380</volume>
          /, Lugano,
          <source>Switzerland (September 9-12</source>
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.:</given-names>
          </string-name>
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>arXiv preprint arXiv:1409.0473</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ben-younes</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cadene</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cord</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thome</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>MUTAN: Multimodal Tucker fusion for visual question answering</article-title>
          . In:
          <source>ICCV</source>
          . p. 3
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Merriënboer</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulcehre</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bahdanau</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bougares</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schwenk</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Learning phrase representations using RNN encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>BERT: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fukui</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal compact bilinear pooling for visual question answering and visual grounding</article-title>
          .
          <source>arXiv preprint arXiv:1606.01847</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Peteri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klimuk</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tarasau</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Garcia</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kavallieratou</surname>
            ,
            <given-names>E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>del Blanco</surname>
            ,
            <given-names>C.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rodríguez</surname>
            ,
            <given-names>C.C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vasillopoulos</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Karampidis</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>ImageCLEF 2019: Multimedia retrieval in medicine, lifelogging, security and nature</article-title>
          . In:
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 10th International Conference of the CLEF Association (CLEF 2019)</source>
          .
          <series>LNCS Lecture Notes in Computer Science</series>
          ,
          <publisher-name>Springer</publisher-name>
          ,
          <publisher-loc>Lugano, Switzerland</publisher-loc>
          (September 9-12,
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Kafle</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kanan</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          :
          <article-title>Visual question answering: Datasets, algorithms, and future challenges</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          <volume>163</volume>
          ,
          <fpage>3</fpage>
          –
          <lpage>20</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>On</surname>
            ,
            <given-names>K.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ha</surname>
            ,
            <given-names>J.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>B.T.</given-names>
          </string-name>
          :
          <article-title>Hadamard product for low-rank bilinear pooling</article-title>
          .
          <source>arXiv preprint arXiv:1610.04325</source>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Kingma</surname>
            ,
            <given-names>D.P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Adam: A method for stochastic optimization</article-title>
          .
          <source>arXiv preprint arXiv:1412.6980</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhutdinov</surname>
            ,
            <given-names>R.R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Urtasun</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Torralba</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fidler</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Skip-thought vectors</article-title>
          .
          In:
          <source>NIPS</source>
          . pp.
          <fpage>3294</fpage>
          –
          <lpage>3302</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.E.</given-names>
          </string-name>
          :
          <article-title>ImageNet classification with deep convolutional neural networks</article-title>
          .
          In:
          <source>Advances in Neural Information Processing Systems</source>
          . pp.
          <fpage>1097</fpage>
          –
          <lpage>1105</lpage>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Peters</surname>
            ,
            <given-names>M.E.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Neumann</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gardner</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zettlemoyer</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Deep contextualized word representations</article-title>
          .
          <source>arXiv preprint arXiv:1802.05365</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jia</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sermanet</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Reed</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Anguelov</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Erhan</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rabinovich</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          , et al.:
          <article-title>Going deeper with convolutions</article-title>
          .
          In:
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          :
          <article-title>The application of two-level attention models in deep convolutional neural network for fine-grained image classification</article-title>
          .
          In:
          <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>842</fpage>
          –
          <lpage>850</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>