<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>AIML at VQA-Med 2020: Knowledge Inference via a Skeleton-based Sentence Mapping Approach for Medical Domain Visual Question Answering</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Zhibin Liao</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Qi Wu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chunhua Shen</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Anton van den Hengel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Johan Verjans</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Australian Institute for Machine Learning, University of Adelaide</institution>
          ,
          <country country="AU">Australia</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>South Australian Health and Medical Research Institute</institution>
          ,
          <addr-line>Adelaide</addr-line>
          ,
          <country country="AU">Australia</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our contribution to the 2020 ImageCLEF Medical Domain Visual Question Answering (VQA-Med) challenge. Our submissions scored first place on the VQA challenge leaderboard and first place on the associated Visual Question Generation (VQG) challenge leaderboard. Our VQA approach was developed using a knowledge inference methodology called Skeleton-based Sentence Mapping (SSM). Using all the questions and answers, we derived a set of classifiable tasks and inferred the corresponding labels. As a result, we were able to transform the VQA task into a multi-task image classification problem, which allowed us to focus on the image modelling aspect. We further propose a class-wise and task-wise normalization that facilitates the optimization of multiple tasks in a single network. This enabled us to apply a multi-scale and multi-architecture ensemble strategy for robust prediction. Lastly, we positioned the VQG task as a transfer learning problem using the models trained on the VQA task. The VQG task was also solved using classification.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Question Answering</kwd>
        <kwd>Visual Question Generation</kwd>
        <kwd>Knowledge Inference</kwd>
        <kwd>Deep Neural Networks</kwd>
        <kwd>Skeleton-based Sentence Mapping</kwd>
        <kwd>Class-wise and Task-wise Normalization</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Visual question answering (VQA) [
        <xref ref-type="bibr" rid="ref20 ref4">4,20</xref>
        ] is a challenging new task which requires
a broad knowledge of image processing, natural language processing (NLP),
and multi-modal learning. In the medical domain, VQA is an attractive topic
showing great potential in automated medical image interpretation and
machine-supported diagnoses, with the potential to benefit both medical practitioners and
patients. Nevertheless, medical VQA remains an unsolved problem. The
ImageCLEF association [
        <xref ref-type="bibr" rid="ref15">15</xref>
        ] has been hosting the Medical Domain VQA (VQA-Med)
challenges for three consecutive years since 2018 [
        <xref ref-type="bibr" rid="ref10 ref2 ref5">2, 5, 10</xref>
        ]. In the 2018 challenge,
the images were extracted from PubMed Central articles, with the questions and
answers automatically generated from image captions before being checked manually
by human annotators. In addition to the clarity issues of the machine-generated
questions reported by [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ], it is also noticeable that both the questions and
ground-truth answers are variable-length and free-form, both of which add
difficulty to the answer generation task. The 2019 challenge [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] advanced from the
previous challenge by narrowing the task scope: 1) using only radiology images;
and 2) asking questions on four topics (i.e., image modality, imaging plane,
visualized organ systems, and abnormality detectable from an image). As noticed
by many participating teams, the 2019 challenge is solvable as a classification
problem: there are 36 unique answers for the modality questions, 16 for the
plane questions, and 10 for the organ questions, with the exception of over a
thousand possible answers for the abnormality category. A post-challenge
category-wise accuracy analysis [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] suggests that the modality, plane, and organ
categories possess much better accuracy compared to the abnormality category.
      </p>
      <p>
        In the 2020 VQA challenge, in which our AIML team participated, the dataset [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]
was curated with only questions in the abnormality category. While analyzing the
questions, we found that they come in two major forms: 1) yes/no
questions, e.g., “is this image normal/abnormal?”, and 2) wh-questions, e.g., “what is
abnormal in the image?”. In comparison to last year's challenge, we noticed that
the unique question phrasings were reduced from 253 to 52 and the unique
answer phrasings from 1,749 to 332, while the number of images increased by 25% (from
3,200 to 4,000 in the training set; the validation and test sets are the same size), resulting
in much richer data support for the VQA task.
      </p>
      <p>
        Our initial attempt at the 2020 VQA-Med challenge was to fine-tune the
Pythia [
        <xref ref-type="bibr" rid="ref27">27</xref>
        ] model. However, this did not yield a desirable performance, so
we conducted an analysis of the predicted answers. The analysis led
to the development of a novel knowledge inference method, namely
Skeleton-based Sentence Mapping (SSM), that helped reverse engineer a set of question
backbones. SSM helped us to determine the question categories and infer the
corresponding labels, reducing the VQA problem to a pure multi-task image
classification problem. As a result, we were able to focus on the imaging modality. In
particular, we developed a class-wise and task-wise normalization method to give
balanced weighting to the classes and tasks present in a mini-batch. This helps to
jointly optimize multiple tasks in a single network. Finally, we applied multi-scale
and multi-architecture ensemble learning. Our best submission scored 0.496 in
accuracy and 0.542 in BLEU score, which won first place at the 2020 VQA
challenge.
      </p>
      <p>For the associated Medical Domain Visual Question Generation (VQG-Med)
challenge, we considered the task as a transfer learning problem, where we
applied the models trained on the VQA-Med data as non-trainable feature extractors. The
question generation is also framed as a classification task. Our best submission
scored 0.348 in BLEU score, which won first place at the VQG challenge.</p>
      <p>In the rest of the paper, we explain our VQA and VQG
approaches. Each approach is presented in a self-contained section to avoid clutter.</p>
    </sec>
    <sec id="sec-2">
      <title>VQA-Med Challenge Participation</title>
      <sec id="sec-2-1">
        <title>Literature Review</title>
        <p>We will first introduce the general domain VQA methods, followed by an
introduction to the methods that have been applied specifically in the medical domain
VQA.</p>
        <p>
          General domain VQA: the goal of a VQA method is to produce an answer
from a given image-question pair. Early VQA works [
          <xref ref-type="bibr" rid="ref20 ref24 ref4 ref9">4, 9, 20, 24</xref>
          ] used a general
CNN-RNN framework. In brief, the CNN-RNN approach is carried out using a
Convolutional Neural Network (CNN) model (e.g., VGG-Net [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ]) to process the
input image and a Recurrent Neural Network (RNN) Encoder-Decoder [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ] (more
specifically, LSTM [
          <xref ref-type="bibr" rid="ref12">12</xref>
          ]) to handle the language modelling. While the vision and
language information fusion can be handled by the RNN
language model itself, or simply by concatenation, there are also more advanced
options such as Multi-modal Factorized Bilinear (MFB) pooling and
High-order pooling (MFH) [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ] and MUTAN [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ]. Attention is also a frequently visited
topic in VQA, e.g., question-guided visual attention methods [
          <xref ref-type="bibr" rid="ref35 ref37">35, 37</xref>
          ] and
vision-language co-attention methods [
          <xref ref-type="bibr" rid="ref19 ref38">19, 38</xref>
          ]. Finally, semantic image representation
(e.g., attribute-based image representation [
          <xref ref-type="bibr" rid="ref31">31</xref>
          ]), pretrained language
representation (e.g., BERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ]), external knowledge and common sense knowledge [
          <xref ref-type="bibr" rid="ref32">32</xref>
          ]
could all be beneficial towards solving VQA.
        </p>
        <p>Medical domain VQA: a noticeable difference between the medical domain
and general domain VQA is the size of the dataset. The general domain VQA can
accumulate a sizable dataset because common-sense knowledge is
sufficient for generating questions and answers. On the other hand, the necessity
of clinical expertise imposes a huge difficulty on medical domain VQA data
collection.</p>
        <p>
          In the 2018 VQA-Med challenge, the leading3 three participating teams [
          <xref ref-type="bibr" rid="ref1 ref23 ref39">1,
23, 39</xref>
          ] di erentiate in image modelling (i.e., ResNet-152 [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ],
Inception-ResNetv2 [
          <xref ref-type="bibr" rid="ref28">28</xref>
          ], VGG-16), language modelling (i.e., LSTM, Bi-LSTM), vision-language
fusion (i.e., MFB/MFH [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ], SAN [
          <xref ref-type="bibr" rid="ref37">37</xref>
          ]), attention models (i.e., question guided
attention [
          <xref ref-type="bibr" rid="ref35">35</xref>
          ], co-attention [
          <xref ref-type="bibr" rid="ref38">38</xref>
          ]), and word embeddings (i.e., word2vec [
          <xref ref-type="bibr" rid="ref21">21</xref>
          ] or
3 The 2018 VQA-Med challenge employed three measurements: BLEU [
          <xref ref-type="bibr" rid="ref22">22</xref>
          ],
Wordbased Semantic Similarity (WBSS) [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ], and Concept-based Semantic Similarity
(CBSS). The leading teams are referred to the BLEU and WBSS [
          <xref ref-type="bibr" rid="ref33">33</xref>
          ] score rankings.
        </p>
        <p>
          The CBSS can result a di erent ranking.
medical article pretrained embedding [
          <xref ref-type="bibr" rid="ref23">23</xref>
          ]). Considering the component-wise
diversity and minor performance gaps, it is di cult to nd out which component
is favourable. However, we notice that all three teams treated the VQA task as
a classi cation problem whereas the rest two teams treated the problem as a
generation task [
          <xref ref-type="bibr" rid="ref29">29</xref>
          ] or still a classi cation task but not ne-tuning the image
model [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ].
        </p>
        <p>
          In the 2019 VQA-Med challenge, the top three teams [
          <xref ref-type="bibr" rid="ref30 ref36 ref40">30, 36, 40</xref>
          ] (with a
working notes paper) all used BERT [
          <xref ref-type="bibr" rid="ref8">8</xref>
          ] for language processing. Apart from
that, we point to some of the unique techniques from the top three teams. The
winning team Hanlin [
          <xref ref-type="bibr" rid="ref36">36</xref>
          ] has adopted Global Average Pooling (GAP) [
          <xref ref-type="bibr" rid="ref18">18</xref>
          ]
shortcuts. This differs from the conventional position of GAP, which connects the last
convolution layer and the classification layer. The Hanlin team placed multiple
GAPs that each link to a low-level convolution layer and forward the pooled
low-level features to be concatenated with the final image representation. The
second-place team minhvu [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] adopted an ensemble learning approach with a
variation of VQA components. The third-place team TUA1 [
          <xref ref-type="bibr" rid="ref40">40</xref>
          ] used a question
classifier to determine the question category and then chose answers from a set
of modality, plane, and organ classifiers and a generative model for
abnormality answers. Note that the question classification strategy was also employed by
several other participating teams; therefore we speculate that the use of BERT could
have been the deciding factor that caused a noticeable gap of 0.04 (in both
accuracy and BLEU) between the third place [
          <xref ref-type="bibr" rid="ref30">30</xref>
          ] and fourth place [
          <xref ref-type="bibr" rid="ref25">25</xref>
          ] (who
also used question classification and sub-answer models).
        </p>
      </sec>
      <sec id="sec-2-2">
        <title>Dataset</title>
        <p>The VQA-Med 2020 dataset has a composition of 4,000 radiology images for
training, 500 for validation, and 500 for testing. Each image has exactly one
Question-Answer (QA) pair from the abnormality question category.</p>
        <p>We followed the official suggestion to use the VQA-Med 2019 dataset<sup>4</sup> as
additional training data. The VQA-Med 2019 dataset has 3,200 medical images
for training, 500 for validation, and another 500 for testing. For the training and
validation sets, there are 12,792 and 2,000 QA pairs, giving most images exactly
one QA pair in each question category (i.e., imaging modality, imaging plane,
organ systems, and abnormality). For the test set, each question category has 125
images. In addition, the yes-no questions appear only in the imaging modality
and abnormality question categories.</p>
      </sec>
      <sec id="sec-2-3">
        <title>Skeleton-based Sentence Mapping</title>
        <p>
          As mentioned in Sec. 1, Pythia [
          <xref ref-type="bibr" rid="ref27">27</xref>
          ] was our initial attempt, from which we
observed a proportion of the yes-no questions answered by categorical
abnormality answers and vice versa. This could be a sign of insu cient question
variations. To address this issue, we tried to develop a question generator to
4 https://github.com/abachaa/VQA-Med-2019
populate training questions while keeping the meaning unchanged. The
Skeletonbased Sentence Mapping (SSM) method was developed to summarize questions
with similar sentence structures into a uni ed backbone. An example of the
derived sentence backbones are shown in Table 1. Taking the question backbone
\is $fthis pronoun altsg $fct altsg $fnormal altsg? " as an example, we call
the swap-able parts the skeleton variables and write in the Shell variable style
\$f. . . g". An example can be found in Table 2.5
        </p>
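        <table-wrap>
          <label>Table 2</label>
          <table>
            <thead>
              <tr><th>Skeleton variables</th><th>Candidates</th></tr>
            </thead>
            <tbody>
              <tr><td>this_alts</td><td>this, the</td></tr>
              <tr><td>ct_alts</td><td>ct, ct scan, mri, pet, x-ray, image, ...</td></tr>
              <tr><td>normal_alts</td><td>normal, abnormal</td></tr>
              <tr><td>imaged_alts</td><td>imaged, displayed, seen, shown, ...</td></tr>
              <tr><td>is_being_alts</td><td>is, is being</td></tr>
            </tbody>
          </table>
        </table-wrap>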
        <p>Before applying SSM, we first removed the duplicated questions in the dataset,
resulting in 266 unique questions. Afterwards, we applied word-level edit distance
(i.e., Levenshtein distance) to pairs of questions, finding groups of questions with
1-distance and 2-distance. For example, in Table 1, the corresponding questions
of each question backbone mostly have either 1-distance or 2-distance within the
group, and the highest distance, 4, is between “what is shown in the x-ray?” and
“what is seen in this ct scan?”. The grouped questions were manually checked
to see if the dissimilar parts could be described by a unified skeleton variable. If
so, the generated backbone would replace the group of questions and enter the
next iteration of edit distance computation. The first iteration was able to detect
most of the easy question groups, leaving the later iterations with a small number
of questions.</p>
        <p><sup>5</sup> The naming was determined by choosing a representative candidate from the
candidates for each skeleton variable; by ignoring the “alts” suffix, a question backbone
becomes readable.</p>
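        <p>The grouping step above can be illustrated with a minimal sketch of word-level edit distance. It assumes whitespace tokenization and a simple all-pairs scan; the helper names <monospace>word_edit_distance</monospace> and <monospace>close_pairs</monospace> are hypothetical and are not taken from our code base.</p>
        <preformat>
```python
def word_edit_distance(a, b):
    """Levenshtein distance computed over whitespace-separated tokens."""
    s, t = a.split(), b.split()
    # prev[j] is the distance between the tokens of s seen so far and t[:j].
    prev = list(range(len(t) + 1))
    for i, tok_s in enumerate(s, start=1):
        curr = [i]
        for j, tok_t in enumerate(t, start=1):
            cost = 0 if tok_s == tok_t else 1
            curr.append(min(prev[j] + 1,          # delete tok_s
                            curr[j - 1] + 1,      # insert tok_t
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def close_pairs(questions, max_distance=2):
    """Return (q1, q2, distance) for every pair within max_distance edits."""
    pairs = []
    for i, q1 in enumerate(questions):
        for q2 in questions[i + 1:]:
            d = word_edit_distance(q1, q2)
            if max_distance >= d:
                pairs.append((q1, q2, d))
    return pairs


# The two questions quoted in the text are four word edits apart.
distance = word_edit_distance("what is shown in the x-ray?",
                              "what is seen in this ct scan?")
```
        </preformat>
        <p>Pairs returned by <monospace>close_pairs</monospace> would still need the manual check described above before being merged into a backbone.</p>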
        <p>The process was run until all questions were skeletonized, resulting in 68
question backbones. We labeled the question backbones with the four aforementioned
question categories, partially based on the corresponding answers. In addition,
we also determined two sub-categories under the imaging modality category,
namely the MR modality category and the contrast imaging type category. Next,
we compared our own question category annotation with the official question
category annotation for the VQA-Med 2019 test set (only available for this set)
and found the two to be equivalent. SSM was able to populate dynamic question variations
(with some rule-based restrictions, e.g., changing “ct scan” in “is the ct scan
normal?” to candidates other than “ct” and “image” results in a fallacious
judgement of the image modality, and hence is not allowed), and the same Pythia
model trained with the augmented questions was able to rectify the yes-no and
wh-question cross-answering errors. Nevertheless, we found that the SSM method
rendered language modelling trivial. With its help, we can solve the VQA task
as an image classification task.</p>
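        <p>As an illustration of how a backbone can populate question variations, the following sketch expands every skeleton variable with its candidates. The candidate lists are copied from the entries listed in Table 2, which elides further candidates, so they are illustrative rather than complete; the function name <monospace>expand_backbone</monospace> is hypothetical, and the rule-based restrictions mentioned above are deliberately not modelled.</p>
        <preformat>
```python
import itertools
import re

# Illustrative subset of Table 2; the real candidate lists are longer.
SKELETON_VARIABLES = {
    "this_alts": ["this", "the"],
    "ct_alts": ["ct", "ct scan", "mri", "pet", "x-ray", "image"],
    "normal_alts": ["normal", "abnormal"],
}

VARIABLE_PATTERN = re.compile(r"\$\{(\w+)\}")


def expand_backbone(backbone, variables=SKELETON_VARIABLES):
    """Yield every question obtained by filling each skeleton variable."""
    names = VARIABLE_PATTERN.findall(backbone)
    for combo in itertools.product(*(variables[name] for name in names)):
        fills = iter(combo)
        # Substitute the variables left to right with the current combination.
        yield VARIABLE_PATTERN.sub(lambda _match: next(fills), backbone)


questions = list(expand_backbone("is ${this_alts} ${ct_alts} ${normal_alts}?"))
```
        </preformat>
        <p>With these lists the backbone yields 2 x 6 x 2 = 24 surface forms. A filter implementing the modality restriction would be applied on top of this expansion whenever the image modality is known.</p>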
        <p>Label inference from question backbones: based on the question category
annotation, we were able to record the paired answer annotation as the label for
each mapped task. In addition, we could also extract labels from the skeleton
variables. For example, for the first question “is the ct scan normal?” in Table 1,
“ct” is captured by ${ct_alts} and “normal” is captured by ${normal_alts},
hence producing a coarse modality label “ct” and also a binary
abnormality label “normal” if the answer is “yes”. We found that the same can also be
generalized to infer task labels from the wh-questions.</p>
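        <p>The label inference described above can be sketched by compiling a backbone into a regular expression and reading the labels off the captured skeleton variables. This is a sketch under stated assumptions: the candidate lists are the illustrative subset from Table 2, the mapping <monospace>COARSE_MODALITY</monospace> and the handling of a “no” answer (flipping the captured word) are our assumptions for illustration, and all names are hypothetical.</p>
        <preformat>
```python
import re

SKELETON_VARIABLES = {
    "this_alts": ["this", "the"],
    "ct_alts": ["ct", "ct scan", "mri", "pet", "x-ray", "image"],
    "normal_alts": ["normal", "abnormal"],
}
# Assumed mapping from a captured candidate to a coarse modality label;
# "image" is omitted because it says nothing about the modality.
COARSE_MODALITY = {"ct": "ct", "ct scan": "ct", "mri": "mri",
                   "pet": "pet", "x-ray": "x-ray"}
BACKBONE = "is ${this_alts} ${ct_alts} ${normal_alts}?"


def compile_backbone(backbone, variables):
    """Return a compiled regex and the ordered skeleton variable names."""
    names, pattern = [], ""
    # Splitting on a capturing group alternates literal text and names.
    for i, part in enumerate(re.split(r"\$\{(\w+)\}", backbone)):
        if i % 2 == 0:
            pattern += re.escape(part)
        else:
            names.append(part)
            # Longest candidates first so "ct scan" is tried before "ct".
            options = sorted(variables[part], key=len, reverse=True)
            pattern += "(" + "|".join(re.escape(o) for o in options) + ")"
    return re.compile(pattern), names


def infer_labels(question, answer, backbone=BACKBONE,
                 variables=SKELETON_VARIABLES):
    """Infer task labels from a question-answer pair, or None if no match."""
    regex, names = compile_backbone(backbone, variables)
    match = regex.fullmatch(question)
    if match is None:
        return None
    captured = dict(zip(names, match.groups()))
    labels = {}
    modality = COARSE_MODALITY.get(captured.get("ct_alts"))
    if modality is not None:
        labels["coarse_modality"] = modality
    word = captured.get("normal_alts")
    if word is not None and answer in ("yes", "no"):
        if answer == "yes":
            labels["binary_abnormality"] = word
        else:  # assumption: a "no" answer flips the captured word
            labels["binary_abnormality"] = (
                "abnormal" if word == "normal" else "normal")
    return labels


labels = infer_labels("is the ct scan normal?", "yes")
```
        </preformat>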
        <p>
          An issue with the question backbone derived modality labels is that the
detailed modality (e.g., ct with contrast or not) is unknown. To address this issue,
we treat the coarse modality labels as an independent task. The answer derived
modality labels were mapped back to the coarse labels following the information
provided in [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ]. Next, we treated all abnormality wh-questions as having an
“abnormal” label, adding to the binary abnormality labels derived from the yes-no questions.
        </p>
        <p>At the end of the process, we were able to produce six classification tasks:
1) fine imaging modalities; 2) coarse imaging modalities; 3) imaging plane; 4)
organ systems; 5) binary abnormality; and 6) categorical abnormality.</p>
      </sec>
      <sec id="sec-2-4">
        <title>Multi-task Image Classification</title>
        <p>The schematic of an exemplar image classification network we used is illustrated
in Fig. 1, sketched together with the knowledge inference process. The two important
tasks are the binary and categorical abnormality classification tasks, while the
remaining four can be thought of as regularization tasks. We believe that all the tasks
should be strongly correlated with each other, i.e., the correct imaging modality
and organ judgements should be strong prior knowledge for correct recognition
of abnormality.</p>
        <p>[Fig. 1. Schematic of the multi-task image classification network with the knowledge inference process. Example: the question “what is most alarming about this ct scan?” with the answer “pancreatic carcinoma” matches the backbone “what is most alarming about ${this_alts} ${ct_alts}?”, giving the labels categorical abnormality = pancreatic carcinoma, binary abnormality = abnormal, and coarse imaging modality = ct. The input image passes through a backbone network (e.g., ResNet, DenseNet, VGG) into a shared feature space that feeds six task outputs: fine imaging modality, coarse imaging modality, imaging plane, organ system, binary abnormality, and categorical abnormality.]</p>
        <p>Class-wise and task-wise normalization: since only the 2019 challenge
images have (almost) complete four QA pairs per image, a large number of images
in the joint 2019 and 2020 dataset do not have a complete label set (mainly the
2020 images). Hence when all six tasks are jointly optimized via a mini-batch
gradient method, a conventional normalization by the batch size effectively
assigns a lower weight to a less populated task, e.g., for a batch with 12 images,
a task that has 3 labeled images effectively has a 0.25 weighting. In addition to
the incomplete label problem, we also observed imbalanced class distributions
within the tasks. For example in the categorical abnormality question category,
the number of samples per abnormality class ranges from 4 to 104. We propose
to solve both issues together by a class-wise and task-wise normalization in order
to jointly optimize all six tasks together. Assume that t ∈ {coarse modality, ...}
represents a task. For a set of images X and the label set Y<sub>t</sub>, the mini-batch
training loss L is computed as:
        </p>
        <disp-formula>
          <tex-math>L = \sum_{t} \frac{1}{\sum_{c_t} \mathbb{1}(c_t \in Y_t)} \sum_{c_t} \frac{1}{\sum_{y_t \in Y_t} \mathbb{1}(y_t = c_t)} \sum_{x \in X,\, y_t \in Y_t} \mathbb{1}(y_t = c_t)\, \ell_t(x, y_t),</tex-math>
        </disp-formula>
        <p>where x ∈ X and y<sub>t</sub> ∈ Y<sub>t</sub> represent an individual image and its label, 1(·) denotes an
indicator function, and c<sub>t</sub> denotes a candidate class of t (e.g., c<sub>t</sub> ∈ {ct, ..., x-ray}
if t = coarse modality).</p>
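        <p>To make the normalization concrete, the following framework-free sketch computes the loss for one mini-batch. It assumes the per-sample losses have already been evaluated and that only the classes present in the batch contribute to the class sum; the function name <monospace>normalized_multitask_loss</monospace> and the input layout are hypothetical.</p>
        <preformat>
```python
from collections import defaultdict


def normalized_multitask_loss(task_samples):
    """Sum over tasks of the mean over present classes of per-class mean loss.

    task_samples maps a task name to a list of (label, loss) pairs, one pair
    per image in the mini-batch that carries a label for that task.
    """
    total = 0.0
    for samples in task_samples.values():
        per_class = defaultdict(list)
        for label, loss in samples:
            per_class[label].append(loss)
        if not per_class:
            continue  # no labelled image for this task in the batch
        # Inner normalization: average the losses within each class.
        class_means = [sum(v) / len(v) for v in per_class.values()]
        # Outer normalization: average over the classes that are present.
        total += sum(class_means) / len(class_means)
    return total


batch = {
    "coarse_modality": [("ct", 1.0), ("ct", 3.0), ("mri", 4.0)],
    "binary_abnormality": [("abnormal", 0.5)],
    "organ_system": [],  # no labels for this task in the batch
}
loss = normalized_multitask_loss(batch)
```
        </preformat>
        <p>In this toy batch the coarse modality task contributes (2.0 + 4.0) / 2 = 3.0 and the binary abnormality task contributes 0.5, regardless of how many images carry each label, so a sparsely labelled task is not down-weighted by the batch size.</p>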
      </sec>
      <sec id="sec-2-5">
        <title>Multi-scale and multi-architecture ensemble</title>
        <p>We adopted a multi-scale learning technique, using 128, 256, 384, and 512 as
candidate image resize options. After applying the resize operation, we randomly
crop the network input image at a ratio of 87.5% along both dimensions from the
resized image. Random affine transformations and horizontal flips were used. The
initial learning rate is set to 1e-3 and linearly reduced to 1e-6 over 100 epochs using
the Adam optimizer.</p>
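        <p>The crop sizes and learning-rate schedule implied by these settings can be worked out with a short sketch. It assumes square resized images and a schedule that is updated once per epoch, neither of which is specified above; the helper names are hypothetical.</p>
        <preformat>
```python
RESIZE_OPTIONS = (128, 256, 384, 512)
CROP_RATIO = 0.875  # 87.5% along both dimensions


def crop_size(resize):
    """Side length of the random crop taken from a resized square image."""
    return int(resize * CROP_RATIO)


def linear_lr(epoch, total_epochs=100, start=1e-3, end=1e-6):
    """Learning rate interpolated linearly from start to end."""
    progress = min(epoch, total_epochs) / total_epochs
    return start + (end - start) * progress


crop_sizes = [crop_size(r) for r in RESIZE_OPTIONS]
halfway_lr = linear_lr(50)
```
        </preformat>
        <p>Under these assumptions the four scales give network inputs of 112, 224, 336, and 448 pixels per side.</p>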
        <p>
          On the other hand, ResNets [
          <xref ref-type="bibr" rid="ref11">11</xref>
          ], DenseNets [
          <xref ref-type="bibr" rid="ref14">14</xref>
          ], ResNeXts [
          <xref ref-type="bibr" rid="ref34">34</xref>
          ], MobileNet [
          <xref ref-type="bibr" rid="ref13">13</xref>
          ],
and VGG nets [
          <xref ref-type="bibr" rid="ref26">26</xref>
          ] were selected as the image backbone candidates. We exposed the
backbone and input scale options as training script hyper-parameters, which
helped us to distribute the training over several GPU stations and gradually
expand the number of ensemble members.
        </p>
      </sec>
      <sec id="sec-2-6">
        <title>Experiment Results</title>
        <p>We show the validation results from all trained models in Table 3; the
corresponding training volume includes 2019-{train, val, test} and 2020-train. Based
on these results, we decided which models to train for test
evaluation. Note that the training volume was changed to all of the 2019-{train,
val, test} and 2020-{train, val, test} sets for training the testing-use models.
We included the 2020-test set because a number of partial coarse imaging
modality labels (i.e., from ${ct_alts}) and binary abnormality labels (i.e., only
the abnormal ones from wh-question abnormality) were extractable by SSM from
the questions alone, which served as a form of weak regularization for the test
images. Finally, for the categorical abnormality type questions, we only select the
top prediction from the VQA-Med 2020 subset of the abnormality classes as the
prediction.</p>
        <p>Our submissions on the 2020 validation set are shown in Table 4. Our
second submission was intended to determine the exact category type of the last
question backbone in Table 1, as the five instances all appear in the 2020 test
set. Although all other 2020 questions were in the abnormality question
category (aligned with the official statement), we found that the five questions could also
be interpreted as asking which organ is present. We treated the five questions as
categorical abnormality questions in the first submission and as organ questions
in the second submission. Given that the accuracy dropped, the ground truth should
be the abnormality category.</p>
        <p>From a post-challenge point of view, our third submission secured the leading
position on the leaderboard. Our fourth submission was intended to include
more DenseNet-121 instances in the ensemble, as the DenseNet-121-only
multi-scale ensemble showed the highest accuracy (0.6) in Table 3. Our fifth submission
added the two VGG multi-scale groups, presenting the final ensemble result
of all trained models. Nevertheless, these final attempts only pushed up the
performance marginally, suggesting a performance saturation in our approach.</p>
      </sec>
    </sec>
    <sec id="sec-3">
      <title>VQG-Med Challenge Participation</title>
      <sec id="sec-3-1">
        <title>Challenge Overview</title>
        <p>The VQG-Med challenge dataset is much smaller than the
VQA-Med datasets. The training set contains 780 radiology images with 2,156
associated QA pairs. The validation set has 141 images with 164 QA pairs. The
test set has only 80 images. The goal of the VQG challenge is to generate between
1 and 7 questions for each test image.</p>
      </sec>
      <sec id="sec-3-2">
        <title>Methodology</title>
        <p>The VQG challenge describes a question generation task which in concept is
close to image captioning, but our proposed solution continued as a classification
approach. The main reason is that we found there was more than one ground-truth
question tied to each image. In a VQA task, by contrast, the question can be
considered as prior knowledge on which the corresponding answer is conditionally
dependent. Generating multiple questions while lacking such prior knowledge
could be resolved by sampling approaches, but it can be difficult to associate
a random state with a specific ground-truth question. Hence, we instead treated
all observed questions for an image as its attributes and again modelled the question
generation task as an image attribute classification task. A downside of
the classification approach is that it cannot produce novel questions.</p>
        <p>Our VQG approach was built upon our VQA-Med solution with the following
settings.</p>
        <list list-type="bullet">
          <list-item><p>Solving the question generation task as a classification task leads to a total
of 2,073 classes, each a unique observed question from the joint training
and validation sets.</p></list-item>
          <list-item><p>We were concerned that fine-tuning the entire image model with the
limited amount of data and the large number of classes might lead to
over-fitting at a much faster rate, hence we chose not to fine-tune the
backbones. However, to compensate for the lost non-linear capacity, we added a
2-layer batch-normalized and fully-connected (FC) (512 units each, ReLU
activation) multi-layer perceptron (MLP) model before the softmax layer.
The MLP model also avoided a direct mapping from the image features
(e.g., 2048-dimensional features) to the 2,073 classes, which would result in a
computationally expensive matrix multiplication and a large memory usage.</p></list-item>
          <list-item><p>At the training hyper-parameter level, we kept the initial learning rate at
1e-3 but adjusted the final learning rate to 1e-5. Finally, we shortened the
number of epochs to 40.</p></list-item>
          <list-item><p>Each training image could be associated with more than one question,
resulting in a multi-label problem. We used the Stochastic Ground Truth method
in [
          <xref ref-type="bibr" rid="ref16">16</xref>
          ], which treats each image with multiple observed questions as multiple
one-question-for-one-image samples, converting the multi-label problem to a
single-label problem.</p></list-item>
          <list-item><p>The multi-scale and multi-architecture ensemble was continued in the VQG
approach.</p></list-item>
        </list>
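        <p>The conversion from a multi-label to a single-label problem described in the settings above can be sketched as follows. This reflects only the description given here, not necessarily every detail of the Stochastic Ground Truth method, and the function names and toy data are hypothetical.</p>
        <preformat>
```python
def build_question_classes(image_to_questions):
    """Assign a class index to every unique observed question."""
    unique = sorted({q for qs in image_to_questions.values() for q in qs})
    return {q: i for i, q in enumerate(unique)}


def expand_to_single_label(image_to_questions, class_index):
    """Emit one (image, class) training sample per observed question."""
    return [(image, class_index[q])
            for image, qs in image_to_questions.items()
            for q in qs]


# Toy data: img1 has two observed questions, img2 has one.
observed = {
    "img1": ["what is abnormal in the image?", "is this image normal?"],
    "img2": ["is this image normal?"],
}
classes = build_question_classes(observed)
samples = expand_to_single_label(observed, classes)
```
        </preformat>
        <p>Each expanded sample can then be trained with an ordinary single-label softmax loss, which is what allowed the VQA-Med training code to be reused.</p>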
        <p>These settings helped us to reuse most of the VQA-Med code base and models
to develop a tangible solution within a very short time frame.</p>
        <p>Similar to the VQA-Med 2020 result presentation, we show the VQG-Med 2020
validation and test results separately in Tables 5 and 6, respectively. While the
official evaluation only has a BLEU score, in our local evaluation we used top-7
accuracy to evaluate the validation performance. For official testing, each of our
submissions generates the seven questions with the highest probabilities for
each image.</p>
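        <p>Selecting the seven most probable questions for an image can be sketched as below. The hit criterion in <monospace>top_k_hit</monospace> (a prediction counts if any ground-truth question appears among the top k) is our assumption for illustration, since the exact definition of the local top-7 accuracy is not spelled out above; both function names are hypothetical.</p>
        <preformat>
```python
def top_k_questions(probabilities, questions, k=7):
    """Return the k questions with the highest predicted probability."""
    ranked = sorted(range(len(questions)),
                    key=lambda i: probabilities[i], reverse=True)
    return [questions[i] for i in ranked[:k]]


def top_k_hit(predicted, ground_truth):
    """Assumed criterion: any ground-truth question is among the predictions."""
    return any(q in predicted for q in ground_truth)


questions = ["q_a", "q_b", "q_c", "q_d"]
probabilities = [0.1, 0.4, 0.2, 0.3]
selected = top_k_questions(probabilities, questions, k=2)
```
        </preformat>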
        <p>The first two submissions tested whether the large input size models should
be continued. Given the lower top-7 accuracy on the validation set and the
same BLEU value on the test set, we decided not to continue the 512 input size
training. In the third submission, we tried to utilize the ground-truth answer
annotations by introducing answer classification as an additional
regularization task, but the result dropped by 0.009. In addition, the results from the
first three submissions suggested a low correlation between the validation top-7
accuracy and the test BLEU scores. Hence, in our fourth submission we made
two decisions in order to push for a much larger margin on the local evaluation:
1) dropping the low-accuracy models from the ensemble (validation accuracy &lt;
0.079); 2) including the DenseNet-121 architecture given its good performance
in the VQA-Med challenge. The fourth submission scored 0.11 for the
validation accuracy and 0.348 for the test BLEU score, securing our leading position
in the VQG-Med challenge. Finally, in the fifth submission, we further added
the DenseNet-161 multi-scale models as a last-minute attempt. Given that the local
evaluation dropped by 0.012, the test performance drop was expected as well.</p>
      </sec>
    </sec>
    <sec id="sec-4">
      <title>Discussion and Conclusion</title>
      <p>In this paper, we described our participation in the 2020 VQA-Med challenge and
the associated VQG-Med challenge. The center of our approach is a knowledge
inference method which we named Skeleton-based Sentence Mapping (SSM). In
the VQA-Med challenge, the SSM method was useful on multiple fronts: 1) it
mapped questions to a set of backbones which were useful to populate dynamic
question instances; 2) it removed the need for language modelling and
directly selected the corresponding answer predictor; and
3) it was used to infer six image classification tasks and the corresponding training
labels. Bypassing the development of language modelling allowed us to focus
on tuning the image classification model, so that we could devote more time and
resources to the multi-scale and multi-architecture ensemble learning. Finally,
we developed a class-wise and task-wise normalization technique for balancing
the class and task populations, allowing the tasks with incomplete labels to be
jointly optimized in one network.</p>
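      <p>One plausible reading of class-wise and task-wise normalization with incomplete labels is sketched below. It is an illustrative assumption rather than the exact formulation used in our networks: the cross-entropy is averaged within each class, then across the classes present in a task, then across the tasks that have at least one labelled sample. The function name and the nested-list data layout are hypothetical.</p>
      <preformat preformat-type="code">
```python
import math


def normalized_multitask_loss(log_probs, labels):
    """Cross-entropy balanced over classes and tasks.

    log_probs[t][i][c]: log-probability of class c for sample i on task t.
    labels[t][i]: class index, or None when sample i has no label for task t.
    """
    task_losses = []
    for task_log_probs, task_labels in zip(log_probs, labels):
        per_class = {}
        for sample_log_probs, label in zip(task_log_probs, task_labels):
            if label is None:
                continue  # incomplete label: skip this sample for this task
            per_class.setdefault(label, []).append(-sample_log_probs[label])
        if not per_class:
            continue  # no labelled sample for this task
        # Class-wise: each class present contributes equally to the task loss.
        class_means = [sum(v) / len(v) for v in per_class.values()]
        task_losses.append(sum(class_means) / len(class_means))
    # Task-wise: each task with labels contributes equally to the total.
    return sum(task_losses) / len(task_losses) if task_losses else 0.0


# Toy batch: two tasks, three samples; task 1 is labelled for one sample only.
half = [math.log(0.5), math.log(0.5)]
toy_log_probs = [
    [half, half, [math.log(0.75), math.log(0.25)]],
    [half, half, half],
]
toy_labels = [[0, 0, 1], [None, None, 1]]
print(normalized_multitask_loss(toy_log_probs, toy_labels))
```
      </preformat>
      <p>In this toy batch the sparsely labelled second task carries the same weight as the first, and the single sample of the minority class in the first task carries the same weight as the two majority-class samples together.</p>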
      <p>
        The main inspiration of SSM came from [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ], where we back-translated the
questions via a number of foreign languages for augmentation purposes, resulting
in a group of sentences with small wording variations, from which a sentence
backbone could be inferred. Nevertheless, whether the augmented questions carry the
same meaning needs to be checked manually. The idea of reverse-engineering the
sentence backbone was extended during our participation in the VQA-Med
challenge and led to the proposal of SSM.
      </p>
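      <p>The idea of inferring a backbone from a group of near-identical sentences can be sketched as follows. This is not the SSM procedure itself, which is only partly automated; it is a minimal illustration that swaps in token alignment from Python's standard difflib module. The placeholder token, the function names and the example questions are hypothetical.</p>
      <preformat preformat-type="code">
```python
from difflib import SequenceMatcher

SLOT = "[SLOT]"  # hypothetical placeholder for a position where wording varies


def merge_pair(skeleton, tokens):
    """Keep tokens shared by both sequences; collapse each differing run into SLOT."""
    merged = []
    matcher = SequenceMatcher(a=skeleton, b=tokens, autojunk=False)
    for tag, a_start, a_end, _b_start, _b_end in matcher.get_opcodes():
        if tag == "equal":
            merged.extend(skeleton[a_start:a_end])
        elif not merged or merged[-1] != SLOT:
            merged.append(SLOT)
    return merged


def infer_skeleton(sentences):
    """Fold a group of near-paraphrases into one skeleton string."""
    token_lists = [s.lower().rstrip("?").split() for s in sentences]
    skeleton = token_lists[0]
    for tokens in token_lists[1:]:
        skeleton = merge_pair(skeleton, tokens)
    return " ".join(skeleton)


# Made-up paraphrase group, as back-translation might produce.
group = [
    "What abnormality is seen in the image?",
    "What abnormality is shown in the image?",
    "Which abnormality is seen in the image?",
]
print(infer_skeleton(group))
```
      </preformat>
      <p>For this group the sketch prints "[SLOT] abnormality is [SLOT] in the image". Whether every sentence that matches such a skeleton carries the same meaning still has to be checked manually.</p>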
      <p>We are aware that SSM is not fully automated, which requires
further development. In addition, we understand that SSM is a form of explicit
reasoning model whose efficiency depends heavily on question regularity and
dataset size, so it may not generalize well to VQA datasets containing free-form
questions.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gayen</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lau</surname>
            ,
            <given-names>J.J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rajaraman</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Nlm at imageclef 2018 visual question answering in the medical domain</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          , Muller, H.:
          <article-title>Vqa-med: Overview of the medical visual question answering task at imageclef 2019</article-title>
          . In: CLEF (Working Notes) (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Allaouzi</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ahmed</surname>
            ,
            <given-names>M.B.</given-names>
          </string-name>
          :
          <article-title>Deep neural networks and decision tree classifier for visual question answering in the medical domain</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Antol</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Agrawal</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mitchell</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lawrence Zitnick</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Vqa: Visual question answering</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . pp.
          <fpage>2425</fpage>
          –
          <lpage>2433</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Ben Abacha</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Overview of the vqa-med task at imageclef 2020: Visual question answering and generation in the medical domain</article-title>
          .
          <source>In: CLEF 2020 Working Notes. CEUR Workshop Proceedings</source>
          , CEUR-WS.org, Thessaloniki, Greece (September 22-25,
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Ben-Younes</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cadene</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cord</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Thome</surname>
            ,
            <given-names>N.</given-names>
          </string-name>
          :
          <article-title>Mutan: Multimodal tucker fusion for visual question answering</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . pp.
          <fpage>2612</fpage>
          –
          <lpage>2620</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Cho</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          , Van Merrienboer,
          <string-name>
            <given-names>B.</given-names>
            ,
            <surname>Gulcehre</surname>
          </string-name>
          ,
          <string-name>
            <given-names>C.</given-names>
            ,
            <surname>Bahdanau</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            ,
            <surname>Bougares</surname>
          </string-name>
          ,
          <string-name>
            <given-names>F.</given-names>
            ,
            <surname>Schwenk</surname>
          </string-name>
          ,
          <string-name>
            <given-names>H.</given-names>
            ,
            <surname>Bengio</surname>
          </string-name>
          ,
          <string-name>
            <surname>Y.</surname>
          </string-name>
          :
          <article-title>Learning phrase representations using rnn encoder-decoder for statistical machine translation</article-title>
          .
          <source>arXiv preprint arXiv:1406.1078</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <surname>Devlin</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chang</surname>
            ,
            <given-names>M.W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Toutanova</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Bert: Pre-training of deep bidirectional transformers for language understanding</article-title>
          .
          <source>arXiv preprint arXiv:1810.04805</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Mao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          :
          <article-title>Are you talking to a machine? dataset and methods for multilingual image question</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>2296</fpage>
          –
          <lpage>2304</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          :
          <article-title>Overview of imageclef 2018 medical domain visual question answering task</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sun</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Deep residual learning for image recognition</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>770</fpage>
          –
          <lpage>778</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <surname>Hochreiter</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Schmidhuber</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Long short-term memory</article-title>
          .
          <source>Neural computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ),
          <fpage>1735</fpage>
          –
          <lpage>1780</lpage>
          (
          <year>1997</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <surname>Howard</surname>
            ,
            <given-names>A.G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kalenichenko</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weyand</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Andreetto</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Adam</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          :
          <article-title>Mobilenets: Efficient convolutional neural networks for mobile vision applications</article-title>
          .
          <source>arXiv preprint arXiv:1704.04861</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <surname>Huang</surname>
            ,
            <given-names>G.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Der Maaten</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Weinberger</surname>
            ,
            <given-names>K.Q.</given-names>
          </string-name>
          :
          <article-title>Densely connected convolutional networks</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>4700</fpage>
          –
          <lpage>4708</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <surname>Ionescu</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          , Muller, H.,
          <string-name>
            <surname>Peteri</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abacha</surname>
            ,
            <given-names>A.B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Datla</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Demner-Fushman</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kozlovski</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liauchuk</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Cid</surname>
            ,
            <given-names>Y.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kovalev</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Pelka</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Friedrich</surname>
            ,
            <given-names>C.M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>de Herrera</surname>
            ,
            <given-names>A.G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ninh</surname>
            ,
            <given-names>V.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Le</surname>
            ,
            <given-names>T.K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Piras</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Riegler</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Halvorsen</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tran</surname>
            ,
            <given-names>M.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lux</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gurrin</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dang-Nguyen</surname>
            ,
            <given-names>D.T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chamberlain</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Clark</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Campello</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fichou</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Berari</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Brie</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dogariu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Stefan</surname>
            ,
            <given-names>L.D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Constantin</surname>
            ,
            <given-names>M.G.</given-names>
          </string-name>
          :
          <article-title>Overview of the ImageCLEF 2020: Multimedia retrieval in medical, lifelogging, nature, and internet applications</article-title>
          .
          <source>In: Experimental IR Meets Multilinguality, Multimodality, and Interaction. Proceedings of the 11th International Conference of the CLEF Association (CLEF 2020)</source>
          , vol.
          <volume>12260</volume>
          . LNCS Lecture Notes in Computer Science, Springer, Thessaloniki, Greece (September 22-25,
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girgis</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abdi</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vaseli</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Hetherington</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohling</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gin</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tsang</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Abolmaesumi</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          :
          <article-title>On modelling label uncertainty in deep neural networks: Automatic estimation of intra-observer variability in 2d echocardiography quality assessment</article-title>
          .
          <source>IEEE Transactions on Medical Imaging</source>
          <volume>39</volume>
          (
          <issue>6</issue>
          ),
          <fpage>1868</fpage>
          –
          <lpage>1883</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <surname>Liao</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Teney</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
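          ,
          <string-name>
            <surname>van den Hengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Verjans</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :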
          <article-title>Medical data inquiry using a question answering model</article-title>
          .
          <source>In: 2020 IEEE 17th International Symposium on Biomedical Imaging (ISBI)</source>
          . pp.
          <fpage>1490</fpage>
          –
          <lpage>1493</lpage>
          .
          <publisher-name>IEEE</publisher-name>
          (
          <year>2020</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <surname>Lin</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          :
          <article-title>Network in network</article-title>
          .
          <source>International Conference on Learning Representations</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <surname>Lu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Hierarchical question-image co-attention for visual question answering</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>289</fpage>
          –
          <lpage>297</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <surname>Malinowski</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fritz</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Ask your neurons: A neural-based approach to answering questions about images</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . pp.
          <fpage>1</fpage>
          –
          <lpage>9</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref21">
        <mixed-citation>
          21.
          <string-name>
            <surname>Mikolov</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sutskever</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Corrado</surname>
            ,
            <given-names>G.S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dean</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>Distributed representations of words and phrases and their compositionality</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>3111</fpage>
          –
          <lpage>3119</lpage>
          (
          <year>2013</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref22">
        <mixed-citation>
          22.
          <string-name>
            <surname>Papineni</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Roukos</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ward</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zhu</surname>
            ,
            <given-names>W.J.</given-names>
          </string-name>
          :
          <article-title>Bleu: a method for automatic evaluation of machine translation</article-title>
          .
          <source>In: Proceedings of the 40th annual meeting of the Association for Computational Linguistics</source>
          . pp.
          <fpage>311</fpage>
          –
          <lpage>318</lpage>
          (
          <year>2002</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref23">
        <mixed-citation>
          23.
          <string-name>
            <surname>Peng</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          :
          <article-title>Umass at imageclef medical visual question answering (med-vqa) 2018 task</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref24">
        <mixed-citation>
          24.
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kiros</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zemel</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          :
          <article-title>Exploring models and data for image question answering</article-title>
          .
          <source>In: Advances in neural information processing systems</source>
          . pp.
          <fpage>2953</fpage>
          –
          <lpage>2961</lpage>
          (
          <year>2015</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref25">
        <mixed-citation>
          25.
          <string-name>
            <surname>Shi</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rosen</surname>
            ,
            <given-names>M.P.</given-names>
          </string-name>
          :
          <article-title>Deep multimodal learning for medical visual question answering</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref26">
        <mixed-citation>
          26.
          <string-name>
            <surname>Simonyan</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Zisserman</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Very deep convolutional networks for large-scale image recognition</article-title>
          .
          <source>arXiv preprint arXiv:1409.1556</source>
          (
          <year>2014</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref27">
        <mixed-citation>
          27.
          <string-name>
            <surname>Singh</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Natarajan</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shah</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Jiang</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Chen</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Batra</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Parikh</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Towards vqa models that can read</article-title>
          .
          <source>In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          . pp.
          <fpage>8317</fpage>
          –
          <lpage>8326</lpage>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref28">
        <mixed-citation>
          28.
          <string-name>
            <surname>Szegedy</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ioffe</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Vanhoucke</surname>
            ,
            <given-names>V.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Alemi</surname>
            ,
            <given-names>A.A.</given-names>
          </string-name>
          :
          <article-title>Inception-v4, inception-resnet and the impact of residual connections on learning</article-title>
          .
          <source>In: Thirty-first AAAI conference on artificial intelligence</source>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref29">
        <mixed-citation>
          29.
          <string-name>
            <surname>Talafha</surname>
            ,
            <given-names>B.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Al-Ayyoub</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Just at vqa-med: A vgg-seq2seq model</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref30">
        <mixed-citation>
          30.
          <string-name>
            <surname>Vu</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Sznitman</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Nyholm</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          , Lofstedt, T.:
          <article-title>Ensemble of streamlined bilinear visual question answering models for the imageclef 2019 challenge in the medical domain</article-title>
          .
          <source>In: CLEF 2019</source>
          . vol.
          <volume>2380</volume>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref31">
        <mixed-citation>
          31.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dick</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Den Hengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>What value do explicit high level concepts have in vision to language problems?</article-title>
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>203</fpage>
          –
          <lpage>212</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref32">
        <mixed-citation>
          32.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Q.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Wang</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Shen</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dick</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Van Den Hengel</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Ask me anything: Free-form visual question answering based on knowledge from external sources</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>4622</fpage>
          –
          <lpage>4630</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref33">
        <mixed-citation>
          33.
          <string-name>
            <surname>Wu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Palmer</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Verb semantics and lexical selection</article-title>
          .
          <source>arXiv preprint cmp-lg/9406033</source>
          (
          <year>1994</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref34">
        <mixed-citation>
          34.
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Girshick</surname>
            ,
            <given-names>R.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Dollár</surname>
            ,
            <given-names>P.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Aggregated residual transformations for deep neural networks</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>1492</fpage>
          –
          <lpage>1500</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref35">
        <mixed-citation>
          35.
          <string-name>
            <surname>Xu</surname>
            ,
            <given-names>H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Saenko</surname>
            ,
            <given-names>K.</given-names>
          </string-name>
          :
          <article-title>Ask, attend and answer: Exploring question-guided spatial attention for visual question answering</article-title>
          .
          <source>In: European Conference on Computer Vision</source>
          . pp.
          <fpage>451</fpage>
          –
          <lpage>466</lpage>
          . Springer (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref36">
        <mixed-citation>
          36.
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Li</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xie</surname>
            ,
            <given-names>C.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Xiao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gu</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          :
          <article-title>Zhejiang university at imageclef 2019 visual question answering in the medical domain</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref37">
        <mixed-citation>
          37.
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>He</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Gao</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Deng</surname>
            ,
            <given-names>L.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Smola</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          :
          <article-title>Stacked attention networks for image question answering</article-title>
          .
          <source>In: Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          . pp.
          <fpage>21</fpage>
          –
          <lpage>29</lpage>
          (
          <year>2016</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref38">
        <mixed-citation>
          38.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          :
          <article-title>Multi-modal factorized bilinear pooling with co-attention learning for visual question answering</article-title>
          .
          <source>In: Proceedings of the IEEE international conference on computer vision</source>
          . pp.
          <fpage>1821</fpage>
          –
          <lpage>1830</lpage>
          (
          <year>2017</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref39">
        <mixed-citation>
          39.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Employing inception-resnet-v2 and bi-lstm for medical domain visual question answering</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2018</year>
          )
        </mixed-citation>
      </ref>
      <ref id="ref40">
        <mixed-citation>
          40.
          <string-name>
            <surname>Zhou</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kang</surname>
            ,
            <given-names>X.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ren</surname>
            ,
            <given-names>F.</given-names>
          </string-name>
          :
          <article-title>Tua1 at imageclef 2019 vqa-med: a classification and generation model based on transfer learning</article-title>
          .
          <source>In: CLEF (Working Notes)</source>
          (
          <year>2019</year>
          )
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>