<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>NLM at ImageCLEF 2018 Visual Question Answering in the Medical Domain</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Asma Ben Abacha</string-name>
          <email>asma.benabacha@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Soumya Gayen</string-name>
          <email>soumya.gayen@nih.gov</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Jason J Lau</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sivaramakrishnan Rajaraman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Dina Demner-Fushman</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Lister Hill National Center for Biomedical Communications, National Library of Medicine</institution>
          ,
          <addr-line>Bethesda, MD</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the U.S. National Library of Medicine (NLM) in the Visual Question Answering task (VQA-Med) of ImageCLEF 2018. We studied deep learning networks with state-of-the-art performance in open-domain VQA. We selected the Stacked Attention Network (SAN) and Multimodal Compact Bilinear pooling (MCB) for our official runs. SAN performed better on VQA-Med test data, achieving the second best WBSS score of 0.174 and the third best BLEU score of 0.121. We discuss the current limitations of, and future improvements to, VQA in the medical domain. We analyze the use of automatically generated questions and images selected from the literature based on ImageCLEF data. We describe four areas of improvement dedicated to medical VQA: (i) designing goal-oriented VQA systems and datasets (e.g. clinical decision support, education), (ii) generating and categorizing medical/clinical questions, (iii) selecting (clinically) relevant images, and (iv) capturing the context and the medical knowledge.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Question Answering</kwd>
        <kwd>Deep Learning</kwd>
        <kwd>Medical Images</kwd>
        <kwd>Medical Questions and Answers</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 Introduction</title>
      <p>
        This paper describes the participation of the U.S. National Library of Medicine (NLM) in the Visual Question Answering task (VQA-Med) of ImageCLEF 2018 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. ImageCLEF is an evaluation campaign that has been organized as part of the CLEF initiative labs since 2003.
      </p>
      <p>
        This year, the VQA-Med task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] was introduced for the first time, inspired by the open-domain VQA challenges that started in 2015. Given a medical image and a natural language question about the image, participating systems are tasked with answering the question based on the visual image content. Three datasets were provided for training, validation, and testing.
      </p>
    </sec>
    <sec id="sec-2">
      <title>1 http://www.nlm.nih.gov</title>
    </sec>
    <sec id="sec-3">
      <title>2 http://www.imageclef.org/2018/VQA-Med</title>
    </sec>
    <sec id="sec-4">
      <title>3 http://www.clef-initiative.eu</title>
    </sec>
    <sec id="sec-5">
      <title>4 http://visualqa.org</title>
      <p>In our experiments, we used two deep learning VQA models: the Stacked Attention Network (SAN) and Multimodal Compact Bilinear pooling (MCB). For image processing, both VQA models use Convolutional Neural Networks (CNNs): SAN uses VGG-16, and MCB uses ResNet-152 and ResNet-50, pre-trained on the ImageNet database. Image features are extracted from the last pooling layer of the CNNs. For question processing, both VQA models use LSTMs without pre-trained embeddings. Question vectors are extracted from the final hidden layer of the LSTMs. We submitted five runs using these two VQA models.</p>
      <p>The rest of the paper is organized as follows: Section 2 describes the datasets provided in the scope of the VQA-Med challenge. Section 3 presents the deep learning networks that we selected and used for VQA in the medical domain. Section 4 describes our method to fine-tune pre-trained CNNs on modality classification. Section 5 provides a description of the submitted runs. Section 6 presents the official results. Finally, we discuss the VQA task, the data, and future improvements in Section 7.</p>
      <sec id="sec-5-1">
        <title>2 Data Description</title>
        <p>Given an image and a natural language question, the VQA task consists of providing an accurate natural language answer based on the content of the image. Figure 1 shows an example from VQA-Med data.</p>
        <p>In the scope of the VQA-Med challenge, three datasets were provided:
– The training set contains 5,413 question-answer pairs associated with 2,278 training images.
– The validation set contains 500 question-answer pairs associated with 324 validation images.
– The test set contains 500 questions associated with 264 test images.</p>
        <p>By analyzing the questions manually, four main types of questions could be identified:
1. Location, e.g. where is the lesion located? where is the abnormality found? where is the lesion seen?
2. Finding, e.g. what does the mri show? what does the ct scan confirm? what does the image demonstrate? what does the ct chest reveal? what is the lesion suggestive of?
3. Yes/No questions, e.g. is the brain magnetic resonance imaging normal? are the opacities in both lungs? does the image demonstrate cerebral infarction? could the recess be mistaken for subcarinal lymphadenopathy?
4. Other questions, e.g. what kind of image is this? what is the mass invading? what other abnormality can be seen in the image?</p>
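        <p>These four types lend themselves to a simple surface-pattern check. The sketch below is a hypothetical heuristic for illustration, not the categorization used in the challenge:</p>

```python
import re

def question_type(q: str) -> str:
    """Heuristic mapping to the four question types above.
    An illustrative sketch, not the organizers' scheme."""
    q = q.lower().strip()
    if q.startswith("where"):
        return "location"
    # yes/no questions typically open with an auxiliary verb
    if re.match(r"^(is|are|was|were|does|do|did|can|could)\b", q):
        return "yes/no"
    if re.match(r"^what (does|do|did)\b", q):
        return "finding"
    return "other"

for q in ("where is the lesion located?",
          "what does the mri show?",
          "is the brain magnetic resonance imaging normal?",
          "what kind of image is this?"):
    print(q, "->", question_type(q))
```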
      </sec>
    </sec>
    <sec id="sec-6">
      <title>5 http://www.image-net.org</title>
    </sec>
    <sec id="sec-7">
      <title>6 For the SAN model, we utilized Torch and the computational resources of the NIH</title>
      <p>HPC Biowulf cluster (http://hpc.nih.gov). We thank Wolfgang Resch for his support.
Question: what does transverse ct image demonstrate?
Answer: focal defect in in amed appendiceal wall and periappendiceal in ammatory
stranding.</p>
      <sec id="sec-7-1">
        <title>3 Visual Question Answering Networks</title>
        <p>
          Recently, VQA has been widely addressed with the introduction of new datasets
such as the open-domain VQA 1.0 dataset of 250K images, over 760K questions,
and around 10M answers [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ], and new applications such as answering visual
questions for blind people [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
        </p>
        <p>
          Different deep networks and attention mechanisms have been applied to open-domain VQA [
          <xref ref-type="bibr" rid="ref5 ref6">5, 6</xref>
          ]. We studied several VQA networks and selected the following models for our participation in VQA-Med 2018.
        </p>
        <sec id="sec-7-1-1">
          <title>3.1 Stacked Attention Network</title>
          <p>
            The Stacked Attention Network (SAN) [
            <xref ref-type="bibr" rid="ref7">7</xref>
            ] was proposed to allow multi-step reasoning for answer prediction. SAN includes three components: (i) an image model based on a CNN to extract high-level image representations, (ii) a question model using an LSTM to extract a semantic vector of the question, and (iii) a stacked attention model that locates the image regions relevant to answering the question. The SAN model achieves an accuracy of 57.3% on VQA 1.0 test data.
          </p>
          <p>For the image model, we used the last pooling layer of VGG-16 pre-trained on ImageNet as image features. For the question model, we used the last LSTM layer as question features. The image features and the question vector were used to generate the attention distribution over the regions of the image.</p>
          <p>The first attention layer of the SAN is then computed to capture the correlations between the tokens of the question and the regions of the image. Multimodal pooling is performed to generate a combined question-and-image vector that is then used as the query for the image in the next layer. We used two attention layers, as this showed better results in open-domain VQA. The last step is answer prediction: for a set of N answer candidates, the answer prediction task is modeled as an N-class classification problem and performed using a one-layer neural network. Answers are predicted using softmax probabilities.</p>
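          <p>One attention step of this scheme can be sketched as follows (a PyTorch illustration following Yang et al.'s formulation; the layer sizes are illustrative assumptions, not the exact values of our runs):</p>

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionLayer(nn.Module):
    """One stacked-attention step, following Yang et al.'s SAN.
    The dimensions below are illustrative, not the values of our runs."""

    def __init__(self, d: int = 1024, k: int = 512):
        super().__init__()
        self.w_img = nn.Linear(d, k, bias=False)  # image-side projection
        self.w_qst = nn.Linear(d, k)              # query-side projection
        self.w_att = nn.Linear(k, 1)              # attention scores

    def forward(self, v_img: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
        # v_img: (batch, regions, d) image features; u: (batch, d) query vector
        h = torch.tanh(self.w_img(v_img) + self.w_qst(u).unsqueeze(1))
        p = F.softmax(self.w_att(h).squeeze(-1), dim=1)  # weights over regions
        v_tilde = (p.unsqueeze(-1) * v_img).sum(dim=1)   # attended image vector
        return v_tilde + u  # refined query for the next attention layer

# Two stacked layers, as in our runs; the final vector feeds the
# one-layer answer classifier with a softmax over N answer candidates.
layer1, layer2 = AttentionLayer(), AttentionLayer()
v = torch.randn(2, 49, 1024)  # e.g. 7 x 7 = 49 image regions
u = layer2(v, layer1(v, torch.randn(2, 1024)))
print(tuple(u.shape))  # (2, 1024)
```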
        </sec>
        <sec id="sec-7-1-2">
          <title>3.2 Multimodal Compact Bilinear pooling</title>
          <p>
            Multimodal Compact Bilinear pooling (MCB) [
            <xref ref-type="bibr" rid="ref8">8</xref>
            ] is the winner of the CVPR 2016 VQA Workshop challenge, using an attention mechanism that implicitly computes the outer product of visual and textual vectors. The MCB architecture contains: (i) a CNN image model, (ii) an LSTM question model, and (iii) MCB pooling, which first predicts the spatial attention and then combines the attention representation with the textual representation to predict the answers.
          </p>
          <p>For the image model, we used ResNet-152 and ResNet-50 pre-trained on ImageNet. For the question model, a 2-layer LSTM (1024 units in each layer) is used. The concatenated output from both layers (2048 units) forms the input to the next pooling layer. MCB pooling is then used to combine the image and textual vectors to produce a multimodal representation. To incorporate attention, MCB pooling is used again to merge the multimodal representation with the textual representation for each spatial grid location. We also fine-tuned ResNet-50 on modality classification. Our method is described in the following section.</p>
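          <p>The pooling operation itself can be sketched with NumPy (an illustration of Fukui et al.'s count-sketch/FFT formulation; the 8192-dimensional output and the fixed seed are our assumptions for the example):</p>

```python
import numpy as np

def count_sketch(x: np.ndarray, h: np.ndarray, s: np.ndarray, d_out: int) -> np.ndarray:
    """Count-sketch projection: scatter s[i] * x[i] into bucket h[i]."""
    y = np.zeros(d_out)
    np.add.at(y, h, s * x)
    return y

def mcb_pool(v: np.ndarray, q: np.ndarray, d_out: int = 8192, seed: int = 0) -> np.ndarray:
    """Compact bilinear pooling of an image vector and a question vector.
    The convolution of the two count sketches equals the count sketch of
    their outer product, computed as an elementwise product of FFTs."""
    rng = np.random.default_rng(seed)
    sketches = []
    for x in (v, q):
        h = rng.integers(0, d_out, size=x.shape[0])   # random bucket per input dim
        s = rng.choice([-1.0, 1.0], size=x.shape[0])  # random sign per input dim
        sketches.append(count_sketch(x, h, s, d_out))
    return np.real(np.fft.ifft(np.fft.fft(sketches[0]) * np.fft.fft(sketches[1])))

pooled = mcb_pool(np.random.randn(2048), np.random.randn(2048))
print(pooled.shape)  # (8192,)
```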
        </sec>
      </sec>
      <sec id="sec-7-2">
        <title>4 Fine-tuning pre-trained CNNs on modality classification</title>
        <p>CNNs have been shown to deliver promising results as the availability of annotated data and computational resources increases. However, when annotated data are scarce, as is often the case for medical imagery, transfer learning is preferred: the CNNs are pre-trained on a huge collection of stock photographic images such as ImageNet, which contains more than 15 million annotated images belonging to 20K categories. These pre-trained models learn generic features from such large-scale collections, and the knowledge can be transferred to the current task. This transfer of knowledge is generic rather than unique to the task under study. Pre-trained CNNs are fine-tuned [9] and/or used as feature extractors [10] for a variety of visual recognition tasks to improve performance.</p>
        <p>For ImageCLEF, we selected five categories of modalities that are the most relevant to VQA-Med data. We used a set of 54,200 medical images associated with the categories CT (17k images), MRI (12.7k images), XRAY (20k images), COMPOUND (1.1k images), and Other (3.4k images). We evaluated the performance of a pre-trained ResNet-50 (winner of ILSVRC 2015) in classifying these modalities. The hyper-parameters were optimized by a method based on randomized grid search. The search ranges were initialized to [1e-7, 1e2], [0.8, 0.95], and [1e-10, 1e-1] for the learning rate, momentum, and L2 weight decay, respectively. The convolutional part of the pre-trained CNN was instantiated, the pre-trained weights were loaded, a fully connected model was added, the pre-trained layers were frozen up to the deepest convolutional layer, and the fully connected model was trained on the extracted features. We then fine-tuned the model from the layer (res5c-branch2c) rather than the entire model, to prevent overfitting, since the entire CNN has an extremely high entropic capacity and a tendency to overfit. We empirically found that fine-tuning the model from the aforementioned layer improved the classification performance. The model achieved an accuracy of 99.19% in classifying the different imaging modalities.</p>
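        <p>The randomized search over these ranges can be sketched as follows (only the ranges come from the text; the log-uniform sampling for the rate-like parameters and the trial budget are our assumptions):</p>

```python
import math
import random

def sample_hyperparameters(rng: random.Random) -> dict:
    """Draw one configuration from the search ranges given above."""
    def log_uniform(lo: float, hi: float) -> float:
        # sample uniformly in log space, suited to scale parameters
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))
    return {
        "learning_rate": log_uniform(1e-7, 1e2),
        "momentum": rng.uniform(0.8, 0.95),
        "weight_decay": log_uniform(1e-10, 1e-1),
    }

# Draw a small budget of trials; in practice each trial is scored by the
# validation accuracy of the fine-tuned ResNet-50 and the best is kept.
rng = random.Random(0)
trials = [sample_hyperparameters(rng) for _ in range(20)]
print(trials[0])
```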
      </sec>
      <sec id="sec-7-3">
        <title>5 Submitted Runs</title>
        <p>We submitted five automatic runs to ImageCLEF 2018 VQA-Med:
– Run1-SAN: This run used the stacked attention network (SAN) with 2 attention layers. The VQA model was trained on the ImageCLEF VQA-Med training and validation datasets (10k iterations). VGG-16 pre-trained on ImageNet was used for image features. No additional resources were used.
– Run2-SAN: Same as run number 1, with 12k iterations.
– Run3-SAN: Same as run number 1, with 4k iterations.
– Run4-MCB: This run used the MCB model trained on the training and validation datasets (50k iterations). ResNet-152 was used for image features. No additional resources were used.
– Run5-MCB: Similar to run number 4 (20k iterations). For image features, ResNet-50 was fine-tuned on modality classification.</p>
      </sec>
      <sec id="sec-7-4">
        <title>6 Official Results</title>
        <p>In VQA-Med 2018, two metrics were used to evaluate the submitted runs: (i) a new metric called Word-based Semantic Similarity (WBSS), inspired by previous efforts [11, 12], and (ii) the BLEU score [13].</p>
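        <p>As a minimal sketch of metric (ii), plain sentence-level BLEU can be computed over whitespace tokens as below (the challenge's exact scorer may differ, e.g. in tokenization and smoothing):</p>

```python
import math
from collections import Counter

def bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Sentence-level BLEU (Papineni et al.): geometric mean of modified
    n-gram precisions times a brevity penalty. No smoothing is applied."""
    cand, ref = candidate.split(), reference.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand_ngrams = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        overlap = sum(min(c, ref_ngrams[g]) for g, c in cand_ngrams.items())
        total = max(sum(cand_ngrams.values()), 1)
        if overlap == 0:
            return 0.0  # a zero n-gram precision zeroes the geometric mean
        log_prec += math.log(overlap / total) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(cand)))  # brevity penalty
    return bp * math.exp(log_prec)

print(bleu("the lesion is in the left lung",
           "the lesion is in the left lung"))  # 1.0
```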
        <p>Table 1 shows our official results in the VQA-Med task. The SAN model provided the best results, with a WBSS score of 0.174 and a BLEU score of 0.121. Changing the number of iterations had a small impact on the results (Run1, Run2, and Run3). The fourth run used the MCB model with ResNet-152 pre-trained on ImageNet for image features. Fine-tuning ResNet-50 on modality classification (Run5) slightly improved the results of the MCB model.</p>
        <p>Figure 2 presents the results of the five participating teams. Our best overall result was obtained by Run1-SAN, achieving the second best WBSS score of 0.1744 and the third best BLEU score of 0.121 in the VQA-Med challenge.</p>
        <p>[Table 1: WBSS and BLEU scores of our five runs: Run1-SAN, Run2-SAN, Run3-SAN, Run4-MCB, and Run5-MCB.]</p>
      </sec>
      <sec id="sec-7-5">
        <title>7 Discussion</title>
        <p>Systems capable of understanding clinical images to the extent of answering questions about them could support clinical education, clinical decision making, and patient education. ImageCLEF VQA-Med is an essential step towards achieving these goals. Our goal for this first challenge was to evaluate the currently existing VQA systems.</p>
        <p>While ImageCLEF VQA-Med represents one of the first attempts for VQA to enter the medical domain, and is an excellent starting point, the dataset has several limitations, illustrated by the following example from the VQA-Med training set:
– Question: who does ct chest demonstrate interval resolution of without cardiophrenic sparing?
– Answer: pneumothoraces and persistent
– Image: Figure 3b.</p>
        <p>[Fig. 3. (a) First image of the article. (b) Second image of the article, used in ImageCLEF 2018 VQA-Med.]</p>
        <p>This example illustrates some of the issues facing medical VQA:</p>
        <p>(1) Trade-offs between the time-consuming manual construction of datasets and the automatic construction of datasets from readily available resources. Artificial questions do not always make sense, to the point where we cannot reason about what the question was trying to ask. Searching PubMed, we found that the original caption for this image is:
– "CT chest (axial slices in lung window) demonstrating interval resolution of pneumothoraces and persistent, diffuse numerous thin-walled pulmonary cysts without cardiophrenic sparing."</p>
        <p>Perhaps the question was supposed to be 'what does the ct chest demonstrate interval resolution of?' A more natural way to ask the question might be 'the ct chest demonstrates resolution of what?'.</p>
        <p>(2) The context and domain knowledge needed to answer the question. A
radiologist may not have been able to answer the above question given only
Figure 3b because so much context was lost. The article is a case report of
a patient experiencing acute respiratory distress syndrome and Figure 3b is a
follow-up image taken 8 months after the patient was discharged. Figure 3a is
needed to correctly answer the question with pneumothoraces.</p>
        <p>(3) The clinical task that VQA aims to support. The purpose of radiology images and captions in PubMed is to summarize and present an interesting case to the scientific community. This differs from clinical radiology reports in hospitals, which assist clinical decision making and direct patient care. If VQA tools are to assist clinical processes, the design of the dataset needs to have aligned goals.</p>
      </sec>
    </sec>
    <sec id="sec-8">
      <title>7 https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4638149</title>
      <p>The VQA community has greatly benefited from large datasets such as ImageNet [14] and MS-COCO [15] used to construct public-domain VQA datasets. Similarly, specialized medical image resources can be leveraged for VQA; however, there are some fundamental challenges that need to be addressed to apply VQA to a more specialized domain such as medicine.</p>
      <p>If we are to develop VQA tools to assist clinicians, we will need to better understand what clinicians might ask and how clinical questions are naturally phrased. We also need insight into which questions are relevant in different contexts. Just as a question about the weather would not make sense for an image of an indoor space, a question about the brain would not make sense for a CT of the chest.</p>
      <p>Categorization of question and answer types will help both to direct clinical usefulness and to support the evaluation of different algorithm designs. Question categorization can help reveal whether certain algorithms are good at characterizing the intensity of a tumor while others are better at counting the number of ribs.</p>
      <p>Clinically relevant images are also needed. If the goal of medical VQA is to assist clinical processes, then we need to have images used in clinical settings. Many of the PubMed images are designed for the scientific community. Composite figures combining multiple images, or 3D reconstructions, may rarely be relevant for clinical decision-making and supporting direct patient care.</p>
      <p>Finally, context and background knowledge might play a more important role in medicine than in the open domain. For example, radiologists will know to look for internal bleeding in a head CT scan with contrast even if they are not given the patient's history or symptoms. This context is implicit, yet it will influence the types of clinically relevant questions and answers. Understanding how we can capture these types of context, and the necessary minimum amount, are important future goals for benefiting clinical processes.</p>
      <sec id="sec-8-1">
        <title>8 Conclusions</title>
        <p>This paper describes our participation in the ImageCLEF 2018 VQA-Med task. We tested two publicly available VQA networks and achieved the second best WBSS score in the challenge using the SAN model. Our results also showed the positive impact of (i) performing visual attention to learn the image regions relevant to answering the clinical questions, and (ii) using a multiple-layer SAN to query an image multiple times and infer the answer progressively. This first VQA challenge in the medical domain allowed testing the combination of language models and vision models in a restricted domain.</p>
        <p>Several tracks can be investigated for future improvements, including adding questions asked by health care professionals and using medical terminologies in the linguistic analysis of the questions in deep networks. Based on a new clinical VQA dataset that was recently created [16], we plan to further study the automatic construction of clinical images and related question-answer pairs.</p>
      </sec>
      <sec id="sec-8-2">
        <title>Acknowledgments</title>
        <p>This research was supported in part by the Intramural Research Program of the
National Institutes of Health (NIH), National Library of Medicine (NLM), and
Lister Hill National Center for Biomedical Communications (LHNCBC).
This research was partially made possible through the NIH Medical Research
Scholars Program, a public-private partnership supported jointly by the NIH
and generous contributions to the Foundation for the NIH from the Doris Duke
Charitable Foundation, Genentech, the American Association for Dental
Research, the Colgate-Palmolive Company, Elsevier, alumni of student research
programs, and other individual supporters via contributions to the Foundation
for the NIH. For a complete list, please visit the Foundation website.</p>
      </sec>
    </sec>
    <sec id="sec-9">
      <title>8 http://fnih.org/what-we-do/current-education-and-training-programs/mrsp</title>
      <p>In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, EMNLP 2016, Austin, Texas, USA, November 1-4, 2016. (2016) 457–468
9. Razavian, A.S., Azizpour, H., Sullivan, J., Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR Workshops 2014, Columbus, OH, USA, June 23-28, 2014. (2014) 512–519
10. Rajaraman, S., Antani, S., Poostchi, M., Silamut, K., Hossain, M., Maude, R., Jaeger, S., Thoma, G.: Pre-trained convolutional neural networks as feature extractors toward improved malaria parasite detection in thin blood smear images. PeerJ 6:e4568 (2018)
11. Sogancioglu, G., Ozturk, H., Ozgur, A.: BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 33(14) (2017) i49–i58
12. Wu, Z., Palmer, M.S.: Verb semantics and lexical selection. In: 32nd Annual Meeting of the Association for Computational Linguistics, 27-30 June 1994, New Mexico State University, Las Cruces, New Mexico, USA, Proceedings. (1994) 133–138
13. Papineni, K., Roukos, S., Ward, T., Zhu, W.: Bleu: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, July 6-12, 2002, Philadelphia, PA, USA. (2002) 311–318
14. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. (2009) 248–255
15. Lin, T., Maire, M., Belongie, S.J., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: common objects in context. In: Computer Vision - ECCV 2014 - 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. (2014) 740–755
16. Lau, J.J., Gayen, S., Ben Abacha, A., Demner-Fushman, D.: A dataset of clinically generated visual questions and answers about radiology images. Submitted to Scientific Data (2018)</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Villegas</surname>
          </string-name>
          , M.,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Eickhoff</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andrearczyk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          : Overview of ImageCLEF 2018:
          <article-title>Challenges, datasets and evaluation. In: Experimental IR Meets Multilinguality, Multimodality, and Interaction</article-title>
          .
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ), Avignon, France,
          <source>LNCS Lecture Notes in Computer Science</source>
          , Springer (September
          <volume>10</volume>
          -14
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Overview of the ImageCLEF 2018 medical domain visual question answering task</article-title>
          .
          <source>In: CLEF2018 Working Notes. CEUR Workshop Proceedings</source>
          , Avignon, France, CEUR-WS.org &lt;http://ceur-ws.
          <source>org&gt; (September 10-14</source>
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Antol</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            , J., Mitchell,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zitnick</surname>
            ,
            <given-names>C.L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          : VQA:
          <article-title>Visual Question Answering</article-title>
          . In: International Conference on Computer Vision (ICCV).
          <article-title>(</article-title>
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Gurari</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stangl</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Guo</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Grauman</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Luo</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Bigham</surname>
            ,
            <given-names>J.P.</given-names>
          </string-name>
          :
          <article-title>Vizwiz grand challenge: Answering visual questions from blind people</article-title>
          . CoRR abs/
          <year>1802</year>
          .08218 (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5. Kafle,
          <string-name>
            <given-names>K.</given-names>
            ,
            <surname>Kanan</surname>
          </string-name>
          ,
          <string-name>
            <surname>C.</surname>
          </string-name>
          :
          <article-title>Visual question answering: Datasets, algorithms, and future challenges</article-title>
          .
          <source>CoRR abs/1610</source>
          .01465 (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Hierarchical question-image co-attention for visual question answering</article-title>
          .
          <source>In: Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5-10, 2016, Barcelona, Spain</source>
          (
          <year>2016</year>
          )
          <fpage>289</fpage>
          -
          <lpage>297</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.J.</given-names>
          </string-name>
          :
          <article-title>Stacked attention networks for image question answering</article-title>
          .
          <source>In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016</source>
          (
          <year>2016</year>
          )
          <fpage>21</fpage>
          -
          <lpage>29</lpage>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Fukui</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal compact bilinear pooling for visual question answering and visual grounding</article-title>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>