<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Employing Inception-Resnet-v2 and Bi-LSTM for Medical Domain Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yangyang Zhou</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Xin Kang</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Fuji Ren</string-name>
          <email>reng@is.tokushima-u.ac.jp</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Tokushima University</institution>
          ,
          <addr-line>Tokushima 770-8506, JP</addr-line>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our method for generating answers to questions about medical images in the ImageCLEF VQA-Med 2018 task [7][5]. Firstly, we use image enhancement methods such as clipping and question preprocessing methods such as lemmatization. Secondly, we use the Inception-Resnet-v2 model (a CNN) to extract image features and a Bi-LSTM model (an RNN) to encode the questions. Finally, we concatenate the encoded questions with the image features to generate the answers. Our result ranked second on the BLEU, WBSS and CBSS metrics for evaluating semantic similarity, which suggests that our method is effective for generating answers from medical images and related questions.</p>
      </abstract>
      <kwd-group>
        <kwd>VQA-Med</kwd>
        <kwd>Attention mechanism</kwd>
        <kwd>Inception-Resnet-v2</kwd>
        <kwd>Bi-LSTM</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>-</title>
      <p>Inception-Resnet-v2
Bi-LSTM
Attention</p>
    </sec>
    <sec id="sec-2">
      <title>Introduction</title>
      <p>
        Visual question answering (VQA) is the task of generating textual answers for
questions based on the contents of images. The VQA system takes images and
questions as input, and combines the information of the input to generate
readable answers as output. To answer specific questions, the VQA
system needs to understand the content of the images and to acquire related
background knowledge, which involves both natural language processing and computer
vision techniques. At the same time, with the growing attention paid to
the medical domain, the combination of VQA and the medical domain has
become an extremely interesting challenge. It can not only provide a reference
diagnosis to the doctor, but also allow the patient to obtain health information
directly, thereby improving the efficiency of diagnosis and treatment. Existing
systems like MYCIN [
        <xref ref-type="bibr" rid="ref13">13</xref>
        ] have been able to simulate the diagnostic process and
generate treatment plans based on relevant medical knowledge and a series of
rules.
      </p>
      <p>This paper aims to generate readable answers in the ImageCLEF VQA-Med
2018 task. The dataset comprises a variety of medical images with related questions
and answers. We divide the data into two parts as input. We apply image
enhancement methods and generate the image features with a pre-trained CNN
model. For the questions, we apply several text preprocessing
methods such as lemmatization, and then encode the questions with an RNN model.
We then add an attention mechanism to the model. Finally, we apply simple
rules to the output and generate reliable answers.</p>
      <p>The rest of this paper is organized as follows. Section 2 briefly reviews the
related work on the VQA-Med task. Section 3 describes the analysis of the dataset and
the methods used for generation during the experiment. We report our
experiment results and evaluation in Section 4, and conclude this paper in Section
5.</p>
    </sec>
    <sec id="sec-3">
      <title>Related work</title>
      <p>A very close study to the VQA-Med task is the VQA challenge1. The VQA
challenge has been held every year since 2016. Its dataset is based on the open domain
and includes more than 260 thousand images, with 5.4 questions per image on
average.</p>
      <p>
        Kafle et al. [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ] and other researchers summarized quite a few methods for
VQA. The majority of them used recurrent neural networks such as LSTM to
encode questions, and used deep convolutional neural networks such as
VGG16 to handle image recognition in advance. On this basis, there were
variant models such as attention mechanisms [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], neural modules [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], dynamic
memory [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], and even the addition of external knowledge bases [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ], to improve
the accuracy of the answers.
      </p>
      <p>
        Deep convolutional neural networks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] (CNN) can be used to extract the
features of an image and identify the objects in it. The Inception-Resnet-v2
model [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] is an advanced convolutional neural network that combines
the Inception module with ResNet. The residual connections provide shortcuts
in the model that make the network more efficient.
      </p>
      <p>
        Elman [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ] first used a recurrent neural network (RNN) to handle
sequence problems. Nevertheless, context information is easily lost when an
RNN processes long sequences. The proposal of LSTM [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] alleviated the problem
of long-distance dependence. Furthermore, researchers also found that if
the input sequence is reversed, the corresponding path from the decoder to the
encoder is shortened, which helps the network memory. The Bi-LSTM
model [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ] combines these two ideas and achieves better results.
      </p>
      <p>
        On the other hand, there have been many computer-aided diagnosis systems
in medical imaging [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ]. However, the majority of them deal with
single-disease problems and mainly concentrate on easily determined regions such as
the lungs and skin. Progress on the more complex parts is slow. Compared with
detection technology, global lesions and structural lesions are still intensely
difficult for machines to learn.
      </p>
      <p>The VQA-Med task differs from the VQA challenge in that it requires the
understanding of different kinds of medical images with different body parts.</p>
      <sec id="sec-3-1">
        <title>1 http://visualqa.org/index.html</title>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Methodology</title>
      <sec id="sec-4-1">
        <title>Dataset analysis</title>
        <p>Table 1. Statistics of the dataset: images, questions, and answers.</p>
        <p>The dataset of the VQA-Med task consists of more than two thousand images,
containing several kinds of medical images, such as computed tomography,
magnetic resonance imaging, positron emission tomography, etc. However, compared
to the open-domain VQA dataset, the number of training examples in the VQA-Med
task is very small. For the deep learning models of VQA, which usually contain
millions of parameters, the learning process would converge quickly and overfit
the training data. Table 1 shows the statistics of the data. In the
training set, there are an average of 2.4 questions per image, and a maximum of 7
questions per image. This ratio is even smaller in the validation set and test set.
Additionally, there is only one reference answer for each question, which greatly
limits answer generation.</p>
        <p>Fig.1 shows statistics about different types of questions. The number of
questions starting with the word "what" is large in all three datasets, while the questions
asking about positions and other questions, including yes-no questions, occupy
relatively small proportions. Moreover, the proportions of question types in the three datasets
are quite different. Therefore, it is difficult for computers to learn the
characteristics of the questions with small proportions in the course of training, and
the performance on validation and test may not be as good as we expect.</p>
        <p>We count the sequence lengths of the questions and answers. As shown in Fig.2, the
sequence length of the questions is obviously longer than that of the answers. In
fact, many of the answers are just phrases and do not form complete sentences.
Through horizontal comparison, we can find that the sentences in the
training set are longer than those in the validation set. To prevent training from being
too slow due to long sequences, and to prevent loss of information due to short
sequences, we fix the length of the training sentences to 9 words. This length
allows us to preserve the contents of most of the questions and answers.
Specifically, "empty" tokens are filled in at the end of the short sentences,
and only the first 9 words are kept for the long sentences.</p>
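        <p>A minimal sketch of this padding and truncation step is shown below; it assumes the questions and answers are already tokenized into word lists and that "empty" is the padding token.</p>
        <preformat>
# Sketch of fixing every sentence to 9 tokens: truncate long sentences and pad
# short ones with the "empty" token at the end.
SEQ_LEN = 9
PAD = "empty"

def fix_length(tokens, seq_len=SEQ_LEN, pad=PAD):
    tokens = tokens[:seq_len]                        # keep only the first 9 words
    return tokens + [pad] * (seq_len - len(tokens))  # fill "empty" at the end

print(fix_length("what does the mri show in the left lower lobe".split()))
print(fix_length("right region".split()))
        </preformat>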
        <p>After merging all the questions and answers separately, we calculate the word
frequency in Fig.3. In order to ensure the effectiveness of training, we plan to
remove low-frequency words. Considering that it is appropriate to keep the
dictionary size of the questions or answers within one thousand, we eventually treat
words whose frequency is less than 5 as low-frequency words.</p>
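        <p>The dictionary construction can be sketched as follows; the frequency threshold of 5 comes from the text above, while the reserved padding index is an assumption.</p>
        <preformat>
# Sketch of building a word dictionary that drops words with frequency below 5,
# keeping the vocabulary within roughly one thousand entries.
from collections import Counter

MIN_FREQ = 5

def build_vocab(token_lists, min_freq=MIN_FREQ):
    counts = Counter(tok for tokens in token_lists for tok in tokens)
    words = [w for w, c in counts.most_common() if c >= min_freq]
    return {w: i + 1 for i, w in enumerate(words)}  # index 0 reserved for padding

questions = [["what", "does", "the", "ct", "show"], ["what", "is", "abnormal"]]
print(build_vocab(questions, min_freq=1))  # min_freq=1 only for this toy example
        </preformat>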
      </sec>
      <sec id="sec-4-2">
        <title>Preprocessing</title>
        <p>For images, we use the Inception-Resnet-v2 model to generate their features. In
order to reduce overfitting, we adopt some image enhancement methods.
Considering that the task includes position judgments, we reconstruct each picture
with exceedingly small random rotations, offsets, scaling, and clipping, expanding it
to 20 images per image (Fig.4).</p>
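        <p>A minimal sketch of this augmentation with Keras' ImageDataGenerator is shown below; the exact transform ranges are assumed, only the idea of very small random perturbations and 20 copies per image comes from the text.</p>
        <preformat>
# Sketch of expanding each medical image into 20 slightly perturbed copies.
# The transform ranges are illustrative guesses, not the paper's exact values.
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    rotation_range=5,          # exceedingly small random rotations
    width_shift_range=0.05,    # small offsets
    height_shift_range=0.05,
    zoom_range=0.05,           # small scaling
)

def augment(image, n_copies=20):
    batch = np.expand_dims(image, axis=0)        # (1, H, W, C)
    flow = augmenter.flow(batch, batch_size=1)
    return [next(flow)[0] for _ in range(n_copies)]

copies = augment(np.random.rand(299, 299, 3))    # 299x299 is Inception-Resnet-v2's input size
print(len(copies))
        </preformat>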
        <p>For questions, we adopt methods such as stemming and lemmatization to
restore verbs, nouns, and other words to their original forms, to prevent overfitting.
Furthermore, full names and abbreviations often coexist,
such as "inferior vena cava" and "IVC". We change all such medical terms
into their abbreviations. There are also many pure numbers and combinations of
numbers and letters. Therefore, the combinations of letters and numbers used
to represent positions are mapped to a "pos" token, and the pure numbers are
mapped to a "num" token, so as to reduce information complexity.
In addition, we remove useless information such as stop words.
According to the word frequency distribution in the data analysis, we remove the
low-frequency words to ensure training efficiency. Meanwhile, we build the
dictionaries separately and make sure that the sizes of both dictionaries are
within one thousand.</p>
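        <p>A sketch of these normalization steps is given below; the abbreviation map and the regular expressions are illustrative, only the lemmatization, stop-word removal, and the "pos" and "num" tokens come from the text.</p>
        <preformat>
# Sketch of question preprocessing: lemmatization, abbreviation mapping, and
# replacing position codes / pure numbers with "pos" / "num" tokens.
# Requires the NLTK wordnet and stopwords corpora to be downloaded.
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
ABBREVIATIONS = {"inferior vena cava": "ivc"}    # illustrative entry

def preprocess_question(text):
    text = text.lower()
    for full, abbr in ABBREVIATIONS.items():
        text = text.replace(full, abbr)
    text = re.sub(r"[^\w\s]", " ", text)          # strip punctuation
    tokens = []
    for tok in text.split():
        if re.fullmatch(r"\d+", tok):
            tokens.append("num")                  # pure numbers
        elif re.fullmatch(r"[a-z]+\d+|\d+[a-z]+", tok):
            tokens.append("pos")                  # letter-number position codes
        elif tok not in stop_words:
            tokens.append(lemmatizer.lemmatize(tok))
    return tokens

print(preprocess_question("What does the MRI show at T12 near the inferior vena cava?"))
        </preformat>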
        <p>There are some high-frequency verbs like "show" that appear in almost
every question. Several less useful adjectives like "large" also appear in questions
from time to time. To match the image enhancement, these verbs
and adjectives are removed from the questions each time, so that each question is
also expanded to 20 questions while the answer remains unchanged.</p>
        <p>The preprocessing of the answers is simpler than that of the questions. We
use lemmatization and remove stop words. Besides, we build the dictionaries
separately and make sure that their sizes are within one thousand, just as for the
questions. The difference is that the low-frequency words in the
answers are replaced by "abnormality" instead of simply being removed.
Words containing numbers are not replaced, and the output sequences are fixed
to the same length as the input sequences.</p>
      </sec>
      <sec id="sec-4-3">
        <title>VQA-Med model</title>
        <p>The basic model we build combines Inception-Resnet-v2 with Bi-LSTM.
Firstly, as shown in Fig.5, the medical images are transformed into
features by the Inception-Resnet-v2 network. The pre-trained weights of
Inception-Resnet-v2 are released under the Apache License2. Secondly, the questions
are fed to the embedding layer and the Bi-LSTM layer. The last time step
output of the Bi-LSTM layer is kept as the question encoding. Thirdly,
after concatenating the features of the images and questions, we use another
Bi-LSTM layer, which this time returns the decoded sequences. Finally, the
fully connected layer outputs the predicted sequence with "softmax" activation.</p>
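        <p>A minimal Keras sketch of this basic architecture is shown below; the 1536-dimensional image feature matches Inception-Resnet-v2's pooled output, while the RepeatVector used to turn the fused vector into a decodable sequence is an assumption about an unstated detail.</p>
        <preformat>
# Sketch of the basic Inception-Resnet-v2 + Bi-LSTM model described above.
from tensorflow.keras.layers import (Input, Embedding, Bidirectional, LSTM,
                                     Concatenate, RepeatVector, Dense)
from tensorflow.keras.models import Model

SEQ_LEN, Q_VOCAB, A_VOCAB, HIDDEN = 9, 1000, 1000, 128

img_feat = Input(shape=(1536,), name="image_features")
question = Input(shape=(SEQ_LEN,), name="question_tokens")

q = Embedding(Q_VOCAB, HIDDEN)(question)
q = Bidirectional(LSTM(HIDDEN))(q)                  # keep the last time step only

fused = Concatenate()([img_feat, q])                # join image and question codes
fused = RepeatVector(SEQ_LEN)(fused)                # assumed: expand into a sequence to decode
decoded = Bidirectional(LSTM(HIDDEN, return_sequences=True))(fused)
answer = Dense(A_VOCAB, activation="softmax")(decoded)

model = Model([img_feat, question], answer)
model.summary()
        </preformat>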
        <sec id="sec-4-3-1">
          <title>2 https://github.com/tensorflow/models/blob/master/LICENSE</title>
          <p>The loss function of the model we selected is categorical cross entropy, using
the following formula:</p>
          <disp-formula id="eq1">
            <label>(1)</label>
            <tex-math>H(T, q) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 q(x_i)</tex-math>
          </disp-formula>
          <p>where N is the size of the validation set, and q(x) is the probability of event x
estimated from the training set.</p>
          <p>Considering that the overfitting is severe with a small amount of training data,
we adopt a dropout rate of 50%, L2 regularization in the Bi-LSTM layers, and
batch normalization after the Bi-LSTM layers. However, we find that there are
some problems with the syntax and semantics of the generated answers, which
is not satisfactory. In particular, the overfitting problem still exists.</p>
          <p>To solve this problem, we add an attention mechanism and modify the model. As
shown in Fig.6, the image features are produced by the Inception-Resnet-v2 network,
as in the previous model. Then, we use a dense layer and a repeat vector layer
to process the image features. The questions are passed through a Bi-LSTM layer
after an embedding layer and return the sequences directly, with a 50% dropout
rate and batch normalization. We adopt the attention module to integrate
the features of the images and the questions. After that, we concatenate the
outcomes of the attention module with the question features. Eventually, the
fully connected layer outputs the predicted sequences with "softmax" activation.</p>
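          <p>A sketch of the modified model is shown below; Keras' built-in dot-product Attention layer stands in for the attention module described above, and the L2 strength is an assumed value.</p>
          <preformat>
# Sketch of the attention variant: image features are projected and repeated,
# questions keep their full Bi-LSTM sequence, and a dot-product attention layer
# (standing in for the paper's attention module) fuses the two.
from tensorflow.keras.layers import (Input, Dense, RepeatVector, Embedding,
                                     Bidirectional, LSTM, BatchNormalization,
                                     Attention, Concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.regularizers import l2

SEQ_LEN, Q_VOCAB, A_VOCAB, HIDDEN = 9, 1000, 1000, 128

img_feat = Input(shape=(1536,), name="image_features")
question = Input(shape=(SEQ_LEN,), name="question_tokens")

v = Dense(2 * HIDDEN)(img_feat)                     # project image features
v = RepeatVector(SEQ_LEN)(v)                        # one copy per time step

q = Embedding(Q_VOCAB, HIDDEN)(question)
q = Bidirectional(LSTM(HIDDEN, return_sequences=True, dropout=0.5,
                       kernel_regularizer=l2(1e-4)))(q)  # 1e-4 is an assumed L2 value
q = BatchNormalization()(q)

context = Attention()([q, v])                       # questions attend over image features
merged = Concatenate()([context, q])
answer = Dense(A_VOCAB, activation="softmax")(merged)

model = Model([img_feat, question], answer)
          </preformat>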
          <p>We also add several simple rules to the output to make the generated
answers more reasonable. Possibly because the word frequency
of prepositions is relatively high, some of the generated answers contain
successive, repeated prepositions. Thus, we choose to delete these extra
prepositions. In addition, for the answers to yes-no questions, there are cases in
which "yes" or "no" is output together with other, unrelated words. We
choose to delete these extra words as well.</p>
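          <p>These post-processing rules can be sketched as follows; the preposition list is illustrative.</p>
          <preformat>
# Sketch of the output rules: collapse runs of successive prepositions and strip
# unrelated words from yes/no answers. The preposition list is illustrative.
PREPOSITIONS = {"of", "in", "on", "at", "to", "with"}

def clean_answer(tokens, is_yes_no_question=False):
    if is_yes_no_question and ("yes" in tokens or "no" in tokens):
        return ["yes"] if "yes" in tokens else ["no"]  # keep only the decision
    cleaned = []
    for tok in tokens:
        # Drop a preposition that directly follows another preposition.
        if tok in PREPOSITIONS and cleaned and cleaned[-1] in PREPOSITIONS:
            continue
        cleaned.append(tok)
    return cleaned

print(clean_answer("anterior part of of of bladder".split()))
print(clean_answer("yes lung region".split(), is_yes_no_question=True))
          </preformat>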
        </sec>
      </sec>
    </sec>
    <sec id="sec-5">
      <title>Experiment</title>
      <sec id="sec-5-1">
        <title>Model selection</title>
        <p>Based on the performance of VQA-Med on the validation set, the parameters
are set as follows. The dictionary size is 1000, and the sequence length
is 9. The hidden size of the Bi-LSTM is 128, and the training batch size is 256.
The training metric is categorical accuracy. We use the ADAM optimizer with
&#x3B2;<sub>1</sub> = 0.9, &#x3B2;<sub>2</sub> = 0.999, &#x3B5; = 10<sup>-8</sup>.</p>
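        <p>The corresponding training configuration might look like the sketch below; the learning rate is Keras' default for Adam and is not stated in the paper, and the dummy arrays only stand in for the real features and one-hot answers so that the sketch runs with the model defined earlier.</p>
        <preformat>
# Sketch of the training configuration; `model` is the attention model built above.
import numpy as np
from tensorflow.keras.optimizers import Adam

n = 32  # dummy sample count, just to make the sketch runnable
image_features = np.random.rand(n, 1536)
question_tokens = np.random.randint(0, 1000, size=(n, 9))
answer_onehot = np.eye(1000)[np.random.randint(0, 1000, size=(n, 9))]

optimizer = Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8)
model.compile(optimizer=optimizer,
              loss="categorical_crossentropy",
              metrics=["categorical_accuracy"])
model.fit([image_features, question_tokens], answer_onehot,
          batch_size=256, epochs=300)
        </preformat>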
        <p>We set the number of epochs to 300, and the training process is shown in Fig.7. The
accuracy on the validation set and the degree of overfitting are both better than
those of the model without the attention mechanism. The final loss of the no-attention model is over
8, while that of the attention model is about 4.5, which shows that adding the
attention mechanism is effective.</p>
        <p>The final result we submitted uses both the training set and the validation set
for training.</p>
      </sec>
      <sec id="sec-5-2">
        <title>Evaluation</title>
        <p>The following evaluation methods are employed for evaluating the VQA-Med
results.</p>
        <p>
          Bilingual evaluation understudy [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ] (BLEU) is an auxiliary tool for assessing the
quality of bilingual translations. It is used to determine the degree of
similarity between sentences translated by machines and by humans. BLEU uses the
matching rule of N-grams to calculate the proportion of similarity between two
sentences. In effect, it calculates the frequency of words co-occurring in the two
sentences. This tool is fast, and its results are also close to human evaluation scores.
Nevertheless, there are also deficiencies. For instance, it is easily affected by
frequent words, cannot account for synonyms, and does not consider
grammatical accuracy. In this task, the method is used to compare the similarity
between the generated answers and the reference answers.
        </p>
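        <p>A word-level BLEU comparison of this kind can be sketched with NLTK as below; the official task evaluation script may differ in tokenization and smoothing.</p>
        <preformat>
# Sketch of comparing a generated answer against the reference answer via BLEU.
# Smoothing is needed because the answers are very short phrases.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "anterior part of the bladder".split()
candidate = "anterior part bladder".split()

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
        </preformat>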
        <p>
          The Word-based Semantic Similarity (WBSS) method is used to measure the semantic
similarity between the generated answers and the factual answers at the word
level, by tokenizing the predictions and the real answers into words. This algorithm has
recently been used to calculate semantic similarity in the biomedical domain [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ].
        </p>
        <p>Concept-based Semantic Similarity (CBSS) is similar to WBSS as described above.
The difference is that this metric extracts the biomedical concepts in the
predictions and the real answers respectively, and then constructs a dictionary. After
vectorizing the words and calculating the cosine between the vectors, the similarity
can be expressed.</p>
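        <p>The concept-level comparison can be illustrated roughly as below; a real CBSS computation would first extract biomedical concepts with a dedicated tool, which is omitted here.</p>
        <preformat>
# Rough illustration of the CBSS idea: build a concept dictionary from both
# answers, vectorize each answer by concept counts, and take the cosine.
import numpy as np

def concept_cosine(pred_concepts, gold_concepts):
    dictionary = sorted(set(pred_concepts) | set(gold_concepts))
    p = np.array([pred_concepts.count(c) for c in dictionary], dtype=float)
    g = np.array([gold_concepts.count(c) for c in dictionary], dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(g)
    return float(p @ g / denom) if denom else 0.0

print(concept_cosine(["bladder", "anterior part"], ["bladder", "right region"]))
        </preformat>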
        <table-wrap id="tab2">
          <label>Table 2</label>
          <caption>
            <p>Scores of our task submissions.</p>
          </caption>
          <table>
            <thead>
              <tr><th>Model</th><th>BLEU</th><th>WBSS</th><th>CBSS</th></tr>
            </thead>
            <tbody>
              <tr><td>Improved model without output rules</td><td>0.103070853</td><td>0.147733901</td><td>0.3236155</td></tr>
              <tr><td>Basic model with output rules</td><td>0.106454315</td><td>0.159756011</td><td>0.334431201</td></tr>
              <tr><td>Improved model with output rules</td><td>0.134830654</td><td>0.173731936</td><td>0.329503441</td></tr>
            </tbody>
          </table>
        </table-wrap>
        <p>The scores of our task submissions are shown in Table 2. As can be seen
from the table, the use of output rules is crucial: the scores of all three
evaluation methods drop if it is removed. The attention mechanism is also a significant
component that improves the BLEU and WBSS scores.</p>
        <p>Most of the generated results are phrases, such as "right region" and "anterior
part bladder". However, since no medical imaging professionals were available to
provide suggestions for improving our process, the results may
differ from the actual situation.</p>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>Conclusion</title>
      <p>In this paper, we described our participation in the ImageCLEF VQA-Med 2018
task, which addresses question answering in the medical domain. We
use image and question enhancement and preprocessing. We adopt the VQA-Med
model introduced above during training. Our result has a BLEU score of
0.135, a WBSS score of 0.174 and a CBSS score of 0.330. As can be seen, due to
the small size of the dataset, it is difficult to generate highly accurate answers
without using external data.</p>
      <p>
        Our future work will focus on making the answers more accurate. In the
preprocessing stage, we can classify the medical images and train them separately.
External data and relevant medical knowledge can be used in data enhancement.
As for the model, we consider using other new methods, such as the Hierarchical
Co-Attention model [
        <xref ref-type="bibr" rid="ref11">11</xref>
        ], to improve the accuracy of the answers.
      </p>
    </sec>
    <sec id="sec-7">
      <title>Acknowledgements</title>
      <p>This research has been partially supported by the Ministry of Education, Science,
Sports and Culture of Japan, Grant-in-Aid for Scientific Research (A), 15H01712.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Andreas</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Klein</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Neural module networks</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <volume>39</volume>
          &#x2013;
          <issue>48</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Doi</surname>
          </string-name>
          , K.:
          <article-title>Computer-aided diagnosis in medical imaging: historical review, current status and future potential</article-title>
          .
          <source>Computerized medical imaging and graphics 31(4-5)</source>
          ,
          <volume>198</volume>
          &#x2013;
          <fpage>211</fpage>
          (
          <year>2007</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Elman</surname>
            ,
            <given-names>J.L.</given-names>
          </string-name>
          :
          <article-title>Finding structure in time</article-title>
          .
          <source>Cognitive science 14</source>
          (
          <issue>2</issue>
          ),
          <volume>179</volume>
          &#x2013;
          <fpage>211</fpage>
          (
          <year>1990</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Graves</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Framewise phoneme classification with bidirectional lstm and other neural network architectures</article-title>
          .
          <source>Neural Networks</source>
          <volume>18</volume>
          (
          <issue>5-6</issue>
          ),
          <volume>602</volume>
          &#x2013;
          <fpage>610</fpage>
          (
          <year>2005</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of the ImageCLEF 2018 medical domain visual question answering task</article-title>
          .
          <source>In: CLEF2018 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org &lt;http://ceur-ws.org&gt;, Avignon,
          <source>France (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.:</given-names>
          </string-name>
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation 9(8)</source>
          ,
          <volume>1735</volume>
          &#x2013;
          <fpage>1780</fpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Villegas</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrearczyk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEF 2018:
          <article-title>Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ),
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer, Avignon,
          <source>France (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8. Kafle,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Visual question answering: Datasets, algorithms, and future challenges</article-title>
          .
          <source>Computer Vision and Image Understanding</source>
          <volume>163</volume>
          ,
          <issue>3</issue>
          &#x2013;
          <fpage>20</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Krizhevsky</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hinton</surname>
          </string-name>
          , G.E.:
          <article-title>Imagenet classification with deep convolutional neural networks</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <volume>1097</volume>
          &#x2013;
          <issue>1105</issue>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Kumar</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Irsoy</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ondruska</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Iyyer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bradbury</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gulrajani</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhong</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Paulus</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Socher</surname>
          </string-name>
          , R.:
          <article-title>Ask me anything: Dynamic memory networks for natural language processing</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          . pp.
          <volume>1378</volume>
          &#x2013;
          <issue>1387</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Hierarchical question-image co-attention for visual question answering</article-title>
          .
          <source>In: Advances In Neural Information Processing Systems</source>
          . pp.
          <volume>289</volume>
          &#x2013;
          <issue>297</issue>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
          </string-name>
          , W.J.:
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In: Proceedings of the 40th annual meeting on association for computational linguistics</source>
          . pp.
          <volume>311</volume>
          &#x2013;
          <fpage>318</fpage>
          .
          <article-title>Association for Computational Linguistics (</article-title>
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13. Shortliffe, E.:
          <article-title>Computer-based medical consultations: MYCIN</article-title>
          , vol.
          <volume>2</volume>
          .
          <string-name>
            <surname>Elsevier</surname>
          </string-name>
          (
          <year>2012</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14. Soganc oglu, G., Ozturk, H., Ozgur, A.:
          <article-title>Biosses: a semantic sentence similarity estimation system for the biomedical domain</article-title>
          .
          <source>Bioinformatics</source>
          <volume>33</volume>
          (
          <issue>14</issue>
          ),
          <year>i49</year>
          &#x2013;
          <fpage>i58</fpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          , Ioffe, S.,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alemi</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          .
          <source>In: AAAI</source>
          . vol.
          <volume>4</volume>
          , p.
          <volume>12</volume>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          v.d.,
          <string-name>
            <surname>Dick</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Explicit knowledge-based reasoning for visual question answering</article-title>
          .
          <source>arXiv preprint arXiv:1511.02570</source>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ba</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Courville</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Salakhudinov</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bengio</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          :
          <article-title>Show, attend and tell: Neural image caption generation with visual attention</article-title>
          .
          <source>In: International Conference on Machine Learning</source>
          . pp.
          <year>2048</year>
          &#x2013;
          <year>2057</year>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>