<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>JUST at VQA-Med: A VGG-Seq2Seq Model</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Bashar Talafha</string-name>
          <email>talafha@live.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Mahmoud Al-Ayyoub</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Jordan University of Science and Technology</institution>
          ,
          <addr-line>Irbid</addr-line>
          ,
          <country country="JO">Jordan</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the VGG-Seq2Seq system for the Medical Domain Visual Question Answering (VQA-Med) Task of ImageCLEF 2018. The proposed system follows the encoder-decoder architecture, where the encoder fuses a pretrained VGG network with an LSTM network that has a pretrained word embedding layer to encode the input. To generate the output, another LSTM network is used for decoding. When used with a pretrained VGG network, the VGG-Seq2Seq model managed to achieve reasonable results, with BLEU, WBSS and CBSS scores of 0.06, 0.12 and 0.03, respectively. Moreover, the VGG-Seq2Seq model is not expensive to train.</p>
      </abstract>
      <kwd-group>
        <kwd>Sequence to sequence</kwd>
        <kwd>VGG Network</kwd>
        <kwd>Global Vectors for Word Representation</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Visual Question Answering (VQA) is a recent and exciting problem at the
intersection between Computer Vision (CV) and Natural Language Processing
(NLP), where the input is an image and a question related to it written in a
natural language and the output is the correct answer to the question. The
answer can be a simple yes/no, choosing one of several options, a single word, or a
complete phrase or sentence [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ].
      </p>
      <p>
        At first glance, the VQA problem seems like a very challenging one. The
traditional CV techniques used for extracting useful information from images
and the NLP techniques typically used for Question Answering (QA) are very far
from each other, and the interplay between them seems to be complex. Moreover,
the ability to construct a useful answer based on such multi-modal input adds
to the complexity of the problem. Luckily, the recent advances in Deep Learning
(DL) have paved the way to building more robust VQA techniques [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ].
      </p>
      <p>
        In this paper, we are interested in a variation of VQA where both
the image and the question are from the medical domain. It is known as the Medical
Domain Visual Question Answering (VQA-Med) Task [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] of ImageCLEF 2018
[
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This task requires building a model that provides an answer to a question about
the content of a medical image. In order to address this task, we propose a DL
model we call the VGG-Seq2Seq model. The model takes an image and a question as
input and outputs the answer to this question by fusing features extracted
from the image content with those extracted from the question itself.
      </p>
      <p>The rest of this paper is organized as follows. The following section presents
a very brief coverage of the related work. Sections 3 and 4 discuss the problem
at hand and the model we propose to handle it. The experimental evaluation
of our model and its discussion are presented in Section 5. Finally, the paper is
concluded in Section 6.
</p>
    </sec>
    <sec id="sec-2">
      <title>Related Works</title>
      <p>
        According to a recent survey on the VQA problem [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], most of the existing
approaches are based on DL techniques. The only interesting exceptions are the
Answer Type Prediction (ATP) technique of [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] and the Multi-World QA of
[
        <xref ref-type="bibr" rid="ref12">12</xref>
        ]. Of course, there are other non-DL approaches that are used as baselines for
various datasets and approaches. Discussing them is outside the scope of this
paper.
      </p>
      <p>
        Regarding the DL-based approaches for VQA, most of them employ one of the
word embedding techniques (typically, Word2Vec [
        <xref ref-type="bibr" rid="ref14">14</xref>
        ]) sometimes coupled with
a Recurrent Neural Network (RNN) to embed the question. Moreover, most of
them use Convolutional Neural Networks (CNN) to extract features from the
images. Examples of such approaches include iBOWIMG [25], Full-CNN [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ],
Ask Your Neurons (AYN) [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ], Vis+LSTM [18], Dynamic Parameter Prediction
(DPPnet) [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ], etc. Another type of DL-based technique employs some sort of
attention mechanism such as Where to Look (WTL) [19], Recurrent Spatial
Attention (R-SA) [26], Stacked Attention Networks (SAN) [23], Hierarchical
Coattention (CoAtt) [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], Neural Module Networks (NMNs) [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], etc.
      </p>
      <p>
        Most of the work discussed in this section is not directly applicable to
VQA-Med for two reasons. The first one is obvious: the focus on
the medical domain, which gives this problem its unique set of challenges. As for
the other one, it is related to how the sentences of the answers are constructed
in VQA-Med, which is different from existing VQA datasets such as the DAtaset
for QUestion Answering on Real-world images (DAQUAR) [
        <xref ref-type="bibr" rid="ref12">12</xref>
        ], Visual7W [26],
Visual Madlibs [24], COCO-QA [18], Freestyle Multilingual Image Question
Answering dataset (FM-IQA) [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ], Visual Question Answering (VQA) [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ], etc.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Task Description and Dataset</title>
      <p>
        Nowadays, patients can access and review the medical reports related to their
healthcare due to the availability and accessibility of electronic medical records,
which helps them better understand their conditions. This increases the need
for an automated system capable of taking a question related to some medical
problem, along with an accompanying image supporting this question, and providing
a correct answer for it. This is exactly the task we are addressing in this work.
Given an image in the medical domain associated with a set of clinically relevant
questions, the goal of the task is to answer the questions based on the visual
image content [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
      <p>
        The dataset consists of images related to the medical domain. It was extracted
from PubMed Central articles (essentially a subset of the ImageCLEF 2017
caption prediction task). The dataset is divided into a training set of about 5k
medical images associated with question-answer pairs, a validation set of about
0.5k such images, and a testing set of about 0.5k medical images associated
with questions only. Figure 1
shows some examples from the training set [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
      </p>
    </sec>
    <sec id="sec-3a">
      <title>The VGG-Seq2Seq Model</title>
      <p>
        In this section, we discuss our VGG-Seq2Seq model, which follows the
encoder-decoder architecture. The model is shown in Figure 2. In the following
paragraphs, we discuss its different parts in detail.
      </p>
      <p>The encoder consists of two main components. The first component is a
Long Short-Term Memory (LSTM) network with a pretrained word embedding
layer which encodes the question into a vector representation, while the second
component is a pretrained VGG network that takes the image as an input and
extracts a vector representation for that image. At the final state of the encoding,
the outputs of the two components are concatenated together into one vector
called the thought vector.</p>
      <p>The decoder consists of an LSTM network that takes the thought vector as its
initial state and the 〈start〉 token as input in the first time step, and tries to predict
the answer using a softmax layer.
The encoder is built from two main components: the first component
is an LSTM network with a pretrained word embedding layer, and the second
component is the VGG network.</p>
      <p>
        In the first component, the semantic meaning of the question is extracted.
A 300-dimensional pretrained word embedding layer is used to encode each word
into a dense semantic space using GloVe [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ]. This word representation is then
fed to an LSTM network with 1024 hidden nodes.
      </p>
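      <p>For illustration, the following sketch shows how such a GloVe-initialized embedding layer could be built in Keras (the framework used in our implementation). The GloVe file name and the toy vocabulary are hypothetical placeholders, not part of the actual system.</p>
      <preformat>
# Sketch: building a 300-d GloVe-initialized embedding layer in Keras.
import numpy as np
from tensorflow.keras.initializers import Constant
from tensorflow.keras.layers import Embedding

EMBED_DIM = 300
# Toy vocabulary for illustration only; the real word index comes from the dataset.
word_index = {"what": 1, "does": 2, "the": 3, "image": 4, "show": 5}

def load_glove_embeddings(path, word_index):
    # Read GloVe vectors into a dictionary: word to 300-d vector.
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype="float32")
    # Row i of the matrix holds the vector of the word with id i.
    matrix = np.zeros((len(word_index) + 1, EMBED_DIM), dtype="float32")
    for word, idx in word_index.items():
        if word in vectors:
            matrix[idx] = vectors[word]
    return matrix

embedding_matrix = load_glove_embeddings("glove.6B.300d.txt", word_index)
embedding_layer = Embedding(input_dim=embedding_matrix.shape[0],
                            output_dim=EMBED_DIM,
                            embeddings_initializer=Constant(embedding_matrix),
                            trainable=False)  # keep the pretrained vectors fixed
</preformat>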
      <p>
        LSTM [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] is a special type of Recurrent Neural Network (RNN) that has
been designed to solve the vanishing gradient problem. The LSTM layer uses
its memory cells to store the context information. LSTM has three gates (i.e.,
the input gate, forget gate and output gate) which decide how the input will be
handled.
      </p>
      <p>At any time step t, the inputs to the LSTM cell are the current word (x<sub>t</sub>), the previous
hidden state (h<sub>t-1</sub>) and the previous memory state (c<sub>t-1</sub>), and the LSTM cell outputs
are the current hidden state (h<sub>t</sub>) and the current memory state (c<sub>t</sub>). These states have
1024 hidden nodes. At the last time step in the sequence, we call the output hidden
state of the last LSTM cell the final hidden state, and its output memory state the
final memory state.</p>
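      <p>In Keras terms, obtaining the final hidden and memory states of the question encoder could look roughly like the following sketch; the maximum question length is an assumed placeholder, and the embedding layer is the one from the previous sketch.</p>
      <preformat>
# Sketch: question encoder returning its final hidden (h) and memory (c) states.
from tensorflow.keras.layers import Input, LSTM

MAX_QUESTION_LEN = 30  # assumed maximum question length (placeholder)

question_input = Input(shape=(MAX_QUESTION_LEN,), name="question_ids")
question_vectors = embedding_layer(question_input)  # GloVe layer from the sketch above
_, final_h, final_c = LSTM(1024, return_state=True)(question_vectors)
# final_h and final_c are the 1024-d final hidden and memory states of the question.
</preformat>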
      <p>In the second component, we use the concept of transfer learning, where a
pretrained model is used with some modification to serve a wholly new task.
We use a pretrained VGG network [20] with the last softmax layer removed. This
network outputs a vector of size 4096 representing the features of
the input image. This vector is then passed to two fully-connected layers with
2500 and 1024 hidden nodes, respectively. The main purpose of these two layers
is to reduce the feature vector dimension so that it is close to the LSTM output
vectors.</p>
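      <p>A minimal sketch of this component, assuming the standard ImageNet-pretrained VGG16 from keras.applications and ReLU activations for the two added fully-connected layers (details not fixed above), is shown below.</p>
      <preformat>
# Sketch: pretrained VGG16 as a frozen 4096-d feature extractor,
# followed by two fully-connected layers of 2500 and 1024 nodes.
from tensorflow.keras.applications import VGG16
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.models import Model

vgg = VGG16(weights="imagenet", include_top=True)
# Drop the final softmax layer; the 'fc2' layer outputs the 4096-d feature vector.
feature_extractor = Model(inputs=vgg.input, outputs=vgg.get_layer("fc2").output)
feature_extractor.trainable = False

image_input = Input(shape=(224, 224, 3), name="image")
image_features = feature_extractor(image_input)                 # 4096-d
image_features = Dense(2500, activation="relu")(image_features) # assumed activation
image_vector = Dense(1024, activation="relu")(image_features)   # matches the LSTM size
</preformat>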
      <p>The 1024-dimensional image feature vector is then concatenated with both the LSTM
final hidden state and final memory state, as shown in Figure 2. We call these
two vectors the thought vectors. We believe that the thought vectors represent
the semantic meaning of the input question and the features of the input image.</p>
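      <p>Building on the sketches above, forming the two thought vectors amounts to a simple concatenation, roughly as follows.</p>
      <preformat>
# Sketch: forming the two 2048-d thought vectors by concatenating the image
# vector with the encoder's final hidden and memory states.
from tensorflow.keras.layers import Concatenate

thought_h = Concatenate(name="thought_hidden")([final_h, image_vector])  # 1024 + 1024
thought_c = Concatenate(name="thought_memory")([final_c, image_vector])  # 1024 + 1024
# These serve as the initial hidden and memory states of the decoder LSTM.
</preformat>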
      <p>We now turn to the decoder, where the answer to the input image and question is generated. The
decoder consists of an LSTM layer that takes three inputs: the first input is the 〈start〉
token that indicates the start of decoding, while the second and third inputs are
the decoder initial states, which are the previous hidden state and previous memory
state. The decoder takes the encoder final states (i.e., the encoder final hidden state
and encoder final memory state) as initial states. Thus, the decoder initial states
will be the thought vectors.</p>
      <p>At the first time step, the LSTM cell takes the 〈start〉 token as input, given the
initial state, and calculates the probability distribution of the target word using a
softmax layer. The word with the highest probability will be the first word of
the answer. This word is then passed to the second LSTM cell as input to
predict the second word of the answer. The full answer is generated by
repeating this process until the model predicts the 〈end〉 token.</p>
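      <p>Putting the pieces together, the following is a hedged sketch of the decoder wiring and of greedy answer generation. The token strings ("startseq" and "endseq", standing in for the 〈start〉 and 〈end〉 tokens), the maximum answer length, and the single-step decoder function are illustrative assumptions.</p>
      <preformat>
# Sketch: decoder LSTM initialized with the thought vectors, plus a greedy
# decoding loop that starts from a start token and stops at an end token.
import numpy as np
from tensorflow.keras.layers import Dense, Input, LSTM
from tensorflow.keras.models import Model

VOCAB_SIZE = embedding_matrix.shape[0]
MAX_ANSWER_LEN = 30                      # assumed maximum answer length

# Training-time wiring (teacher forcing over the shifted answer sequence).
answer_input = Input(shape=(None,), name="answer_ids")
answer_vectors = embedding_layer(answer_input)
decoder_lstm = LSTM(2048, return_sequences=True, return_state=True)
decoder_out, _, _ = decoder_lstm(answer_vectors, initial_state=[thought_h, thought_c])
word_probs = Dense(VOCAB_SIZE, activation="softmax")(decoder_out)

model = Model([image_input, question_input, answer_input], word_probs)

def greedy_decode(decoder_step, word_to_id, id_to_word, h, c):
    """decoder_step(token_id, h, c) returns (probs, new_h, new_c) for one step."""
    token = word_to_id["startseq"]       # assumed start-token string
    answer = []
    for _ in range(MAX_ANSWER_LEN):
        probs, h, c = decoder_step(token, h, c)
        token = int(np.argmax(probs))    # pick the most probable word
        word = id_to_word[token]
        if word == "endseq":             # assumed end-token string
            break
        answer.append(word)
    return " ".join(answer)
</preformat>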
    </sec>
    <sec id="sec-4">
      <title>Evaluation and Results</title>
      <p>This section discusses the experiments used to evaluate our model and the
obtained results. However, we first need to discuss the evaluation process.</p>
      <p>
        As described in VQA-Med task description [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], three pre-processing steps are
conducted on each answer before running the evaluation metrics: (a) converting
each answer to lower-case, (b) removing all punctuation and tokenizing
the answer into a list of words, and (c) removing stopwords using NLTK's English
stopwords list.
      </p>
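      <p>A sketch of these three steps using NLTK might look as follows; the example answer is made up for illustration, and the official evaluation uses the organizers' own script.</p>
      <preformat>
# Sketch: answer pre-processing - lower-casing, punctuation removal,
# tokenization, and NLTK English stopword removal.
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt")
nltk.download("stopwords")
STOPWORDS = set(stopwords.words("english"))

def preprocess_answer(answer):
    answer = answer.lower()                                        # (a) lower-case
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(answer)                                 # (b) tokenize
    return [tok for tok in tokens if tok not in STOPWORDS]         # (c) drop stopwords

print(preprocess_answer("The CT scan shows a mass in the right lung."))
# ['ct', 'scan', 'shows', 'mass', 'right', 'lung']
</preformat>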
      <p>
        In order to evaluate our models, three metrics are used as discussed in
the VQA-Med task description [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]: BLEU, WBSS and CBSS. The BLEU [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ] metric is
used to calculate the similarity between the predicted answer and the actual
answer. The second metric is WBSS (Word-based Semantic Similarity) [21],
which calculates semantic similarity in the biomedical domain. Finally, CBSS
(Concept-based Semantic Similarity) [22] is similar to WBSS, except
that it extracts biomedical concepts from the answers using MetaMap via the
pymetamap wrapper and builds a dictionary using these extracted concepts.
      </p>
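      <p>For intuition, the BLEU component can be approximated with NLTK as in the sketch below (the reference and candidate answers are made-up examples); the official WBSS and CBSS scores come from the organizers' scripts, which we do not reproduce here.</p>
      <preformat>
# Sketch: BLEU between a predicted answer and the reference answer (after the
# pre-processing above). Smoothing avoids zero scores on very short answers.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = preprocess_answer("magnetic resonance imaging of the brain")
candidate = preprocess_answer("mri scan of the brain")
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
</preformat>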
      <p>Three experiments are conducted to evaluate our model. They are described as
follows.</p>
      <p>- In the first experiment, instead of using the pretrained VGG-net, we built a
Convolutional Neural Network (CNN) that consists of three convolutional and
max-pooling layers which behave as the feature extractor, followed by a fully-connected
layer. This network outputs a vector of size 4096 representing the
input image features; this vector is then fed to the 2500-node fully-connected
layer, and the rest of the architecture stayed as is (see the sketch after this list).
- In the second experiment, we implemented the VGG-Seq2Seq model but, instead
of using the pretrained network, we built and trained the VGG network with
its convolutional layers on the dataset; the rest of the architecture stayed as
is.
- In the last experiment, we ran our proposed model (VGG-Seq2Seq) with the
pretrained VGG-net on the dataset.</p>
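      <p>The following is a rough sketch of the from-scratch CNN of the first experiment; the filter counts, kernel sizes and activations are assumptions, since only the depth (three convolution and max-pooling blocks) and the 4096-dimensional output are fixed above.</p>
      <preformat>
# Sketch: a small CNN feature extractor (experiment 1) replacing the pretrained
# VGG-net: three conv + max-pooling blocks, then a 4096-d fully-connected layer
# feeding the same 2500/1024 layers as before.
from tensorflow.keras.layers import Conv2D, Dense, Flatten, Input, MaxPooling2D
from tensorflow.keras.models import Model

image_input = Input(shape=(224, 224, 3), name="image")
x = image_input
for filters in (32, 64, 128):            # assumed filter counts
    x = Conv2D(filters, (3, 3), activation="relu", padding="same")(x)
    x = MaxPooling2D((2, 2))(x)
x = Flatten()(x)
image_features = Dense(4096, activation="relu")(x)   # 4096-d image features
scratch_cnn = Model(image_input, image_features, name="scratch_cnn")
</preformat>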
      <p>
        The three experiments above are trained using a single layer of LSTM
network in the encoder with a dimension of 1024 and a single layer of LSTM
network in the decoder with a dimension of 2048. All models were trained using
the RMSprop optimizer [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] with a 0.001 learning rate for 500 epochs, with a batch
size of 512 and a word embedding size of 300. As shown in Table 1, the
results show that VGG-Seq2Seq (pre-trained VGG) achieves reasonable results
with BLEU, WBSS and CBSS scores of 0.06, 0.12 and 0.03, respectively.
      </p>
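      <p>In Keras terms, this training configuration corresponds roughly to the sketch below; the loss choice and the train_* data tensors are assumed placeholders, since only the optimizer, learning rate, number of epochs, batch size and embedding size are stated above.</p>
      <preformat>
# Sketch: training setup - RMSprop with a 0.001 learning rate,
# 500 epochs and a batch size of 512.
from tensorflow.keras.optimizers import RMSprop

model.compile(optimizer=RMSprop(learning_rate=0.001),
              loss="categorical_crossentropy")      # assumed loss
# The train_* arrays are hypothetical placeholders for the prepared dataset tensors.
model.fit([train_images, train_question_ids, train_answer_ids],
          train_answer_targets,                     # one-hot shifted answers (assumed)
          batch_size=512,
          epochs=500)
</preformat>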
      <p>It is worth mentioning that the best performing VGG-Seq2Seq is not very
expensive to train. It took an average of 252.8 seconds per epoch on a Virtual
Machine (VM) equipped with a Tesla K80 GPU card with 24GB of RAM. The
VM ran Ubuntu with CUDA 9.0. For the implementation, we used Keras with
a TensorFlow 1.8 backend.</p>
    </sec>
    <sec id="sec-5">
      <title>Conclusion</title>
      <p>In this paper, we addressed the very interesting yet challenging VQA-Med Task
of ImageCLEF 2018. We introduced our VGG-Seq2Seq model, which employs
an encoder-decoder architecture, where the encoder fuses a pretrained VGG
network with an LSTM network that has a pretrained word embedding layer to
encode the input. As for the answer generation, another LSTM network is used
as a decoder. When used with a pretrained VGG network, the VGG-Seq2Seq
model managed to achieve reasonable results, with BLEU, WBSS and CBSS
scores of 0.06, 0.12 and 0.03, respectively. Moreover, the VGG-Seq2Seq model is not expensive to train.
Obviously, the VGG-Seq2Seq model is far from perfect. We intend to work on
it to increase its accuracy and enhance its run-time and space requirements.
</p>
      <p>18. Ren, M., Kiros, R., Zemel, R.: Exploring models and data for image question
answering. In: Advances in Neural Information Processing Systems. pp. 2953-2961 (2015)
19. Shih, K.J., Singh, S., Hoiem, D.: Where to look: Focus regions for visual question
answering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition. pp. 4613-4621 (2016)
20. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image
recognition. arXiv preprint arXiv:1409.1556 (2014)
21. Sogancioglu, G., Ozturk, H., Ozgur, A.: BIOSSES: a semantic sentence similarity
estimation system for the biomedical domain. Bioinformatics 33(14), i49-i58 (2017)
22. Wu, Z., Palmer, M.: Verbs semantics and lexical selection. In: Proceedings of the
32nd Annual Meeting on Association for Computational Linguistics. pp. 133-138.
Association for Computational Linguistics (1994)
23. Yang, Z., He, X., Gao, J., Deng, L., Smola, A.: Stacked attention networks for
image question answering. In: Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition. pp. 21-29 (2016)
24. Yu, L., Park, E., Berg, A.C., Berg, T.L.: Visual madlibs: Fill in the blank description
generation and question answering. In: Computer Vision (ICCV), 2015 IEEE
International Conference on. pp. 2461-2469. IEEE (2015)
25. Zhou, B., Tian, Y., Sukhbaatar, S., Szlam, A., Fergus, R.: Simple baseline for visual
question answering. arXiv preprint arXiv:1512.02167 (2015)
26. Zhu, Y., Groth, O., Bernstein, M., Fei-Fei, L.: Visual7w: Grounded question
answering in images. In: Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition. pp. 4995-5004 (2016)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andreas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Deep compositional question answering with neural module networks</article-title>
          .
          <source>CoRR abs/1511.02799</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Antol</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            , J., Mitchell,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Lawrence Zitnick,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Parikh</surname>
          </string-name>
          ,
          <string-name>
            <surname>D.</surname>
          </string-name>
          : Vqa:
          <article-title>Visual question answering</article-title>
          .
          <source>In: Proceedings of the IEEE International Conference on Computer Vision</source>
          . pp.
          <volume>2425</volume>
          -
          <issue>2433</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Are you talking to a machine? dataset and methods for multilingual image question</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>2296</volume>
          -
          <issue>2304</issue>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gupta</surname>
            ,
            <given-names>A.K.</given-names>
          </string-name>
          :
          <article-title>Survey of visual question answering: Datasets and techniques</article-title>
          .
          <source>arXiv preprint arXiv:1705.03865</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of the ImageCLEF 2018 medical domain visual question answering task</article-title>
          .
          <source>In: CLEF2018 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceurws.org&gt;, Avignon,
          <source>France (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hinton</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Srivastava</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Swersky</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Rmsprop: Divide the gradient by a running average of its recent magnitude. Neural networks for machine learning</article-title>
          ,
          <source>Coursera lecture 6e</source>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <volume>1735</volume>
          -
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Villegas</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrearczyk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEF 2018:
          <article-title>Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ),
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Avignon,
          <source>France (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9. Kafle,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Answer-type prediction for visual question answering</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>4976</volume>
          -
          <issue>4984</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Hierarchical question-image co-attention for visual question answering</article-title>
          .
          <source>In: Advances In Neural Information Processing Systems</source>
          . pp.
          <volume>289</volume>
          -
          <issue>297</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Ma</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Learning to answer questions from image using convolutional neural network</article-title>
          .
          <source>In: AAAI</source>
          . vol.
          <volume>3</volume>
          , p.
          <volume>16</volume>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fritz</surname>
            ,
            <given-names>M.:</given-names>
          </string-name>
          <article-title>A multi-world approach to question answering about real-world scenes based on uncertain input</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>1682</volume>
          -
          <issue>1690</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fritz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ask your neurons: A deep learning approach to visual question answering</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>125</volume>
          (
          <issue>1-3</issue>
          ),
          <volume>110</volume>
          -
          <fpage>135</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>3111</volume>
          -
          <issue>3119</issue>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Noh</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hongsuck</surname>
            <given-names>Seo</given-names>
          </string-name>
          , P., Han,
          <string-name>
            <surname>B.</surname>
          </string-name>
          :
          <article-title>Image question answering using convolutional neural network with dynamic parameter prediction</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>30</volume>
          -
          <issue>38</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , W.J.:
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In: Proceedings of the 40th annual meeting on association for computational linguistics</source>
          . pp.
          <volume>311</volume>
          -
          <fpage>318</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Pennington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Manning</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Glove:
          <article-title>Global vectors for word representation</article-title>
          .
          <source>In: Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP)</source>
          . pp.
          <volume>1532</volume>
          -
          <issue>1543</issue>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>