<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>UMass at ImageCLEF Medical Visual Question Answering (Med-VQA) 2018 Task</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Yalei Peng</string-name>
          <email>ypeng5@wpi.edu</email>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Feifan Liu</string-name>
          <email>feifan.liu@umassmed.edu</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Max P. Rosen</string-name>
          <email>max.rosen@umassmemorial.org</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>University of Massachusetts Medical School</institution>
          ,
          <addr-line>Worcester MA 01655</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Worcester Polytechnic Institute</institution>
          ,
          <addr-line>Worcester MA 01609</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>This paper describes the participation of the University of Massachusetts Medical School in the ImageCLEF 2018 Med-VQA task. The goal is to build a system that is able to reason over medical images and questions and generate the corresponding answers. We explored and implemented a co-attention based deep learning framework in which residual networks are used to extract visual features from the image; these features interact with the long short-term memory (LSTM) based question representation, providing fine-grained contextual information for answer derivation. To efficiently integrate visual features from the image and textual features from the question, we employed Multi-modal Factorized Bilinear (MFB) pooling as well as Multi-modal Factorized High-order (MFH) pooling. In addition, we exploited transfer learning on a pre-trained ImageNet model: an embedding-based topic model (ETM) is applied to the question texts of the training data, and the resulting topic labels are attached to each image for transfer learning. We submitted 3 valid runs for this task and found that the ETM based transfer learning outperformed the other models, achieving the best WBSS score of 0.186, which ranked first among participating groups.</p>
      </abstract>
      <kwd-group>
        <kwd>Visual Question Answering</kwd>
        <kwd>Residual Nets</kwd>
        <kwd>Multi-modal Fusion</kwd>
        <kwd>Attention Mechanism</kwd>
        <kwd>LSTM</kwd>
        <kwd>Topic Analysis</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>Given an image and a question in natural language, a visual question answering
(VQA) system is expected to reason over both visual and textual information to
infer the correct answer. It is a challenging task that combines computer vision
with natural language processing (NLP) and has received increasing attention.
Various kinds of methods, such as joint embedding approaches, attention
mechanisms and compositional models, have been designed and applied to this task.
Meanwhile, datasets for learning VQA have also been evolving from simple
image-QA datasets like COCO-QA to knowledge base-enhanced datasets like
Visual Genome.</p>
      <p>
        However, the study of VQA so far has been mainly in the general domain, with
little work on VQA in other domains. With the increasing adoption of artificial
intelligence (AI) in the medical domain to support clinical decision making and
improve patient engagement, the automation of medical image interpretation is
becoming more and more desirable. Such a system is expected to help patients
better understand their conditions from their available data, which can be
structured or unstructured, graphical or textual. The system, as an
opinion machine, may also enhance clinicians' confidence in interpreting complex medical
images. Motivated by this important need for automated image
understanding in an advanced question answering manner for the clinical domain, ImageCLEF
2018 [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ] organized the inaugural edition of the Medical Domain Visual Question
Answering (Med-VQA) Task [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ].
      </p>
      <p>Applying VQA to the medical domain is challenging not only
because texts and images in the medical domain are distinct from those in the general
domain, but also because the data resources in the medical domain are limited
compared with those in the general domain. Thus, transfer learning from the general domain
is more promising than training directly from scratch.</p>
      <p>Our main contributions in participating in this challenge are as follows. First, we
explored transfer learning on the image channel to extract meaningful features from
the medical images, where we present a novel approach that uses
embedding-based topic modeling for transfer learning. Second, we applied a co-attention
mechanism integrated with Multi-modal Factorized Bilinear Pooling (MFB) and
Multi-modal Factorized High-order Pooling (MFH) to medical VQA.</p>
    </sec>
    <sec id="sec-2">
      <title>Related Work</title>
      <p>
        Research on VQA has attracted increasing interest owing to
methodological advances in both computer vision and NLP, and the availability of relevant
large-scale datasets. The straightforward solution to VQA is the joint embedding
method (e.g. [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ]), where the image and the question are represented as global features
that are merged to predict the answer. The limitation of this approach is
that an image can contain more information than is needed to answer a
question, which may add noise to the classification model and make it difficult to
answer questions pertaining to a specific part of the image. Therefore, recent
work on VQA has explored attention mechanisms (e.g. [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) to improve
performance by steering the model to specific sections of the input (image and/or
question). The main idea is to replace the global image features with fine-grained
spatial feature maps so that the feature maps can interact with the given question
to derive salient features for answer prediction.
      </p>
      <p>
        Another line of work in VQA focuses on efficient ways to perform multi-modal
feature fusion. A simple approach that has been widely used is the linear fusion model,
where visual features from the image and textual features from the question are
concatenated or added element-wise. Because the two feature sets follow largely
different distributions, the expressive power of the resulting fused representation is limited
in terms of facilitating the final answer prediction. To address this issue, several
approaches have been proposed, such as Multi-modal Compact Bilinear (MCB) [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ],
Multi-modal Low-rank Bilinear (MLB) [
        <xref ref-type="bibr" rid="ref4">4</xref>
        ], and Multi-modal Factorized
Bilinear pooling (MFB) [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. In our medical VQA system, we integrated the MFB
approach for multi-modal feature fusion, which was shown to outperform both
MCB and MLB on general-domain VQA datasets.
      </p>
    </sec>
    <sec id="sec-3">
      <title>Methods</title>
      <p>Our system consists of four main components: feature fusion, co-attention
mechanism, transfer learning, and answer prediction, which are shown in Fig. 1.
Specifically, visual context is extracted from the image, facilitated by transfer
learning, and then fused with textual context from the question using the co-attention
mechanism and feature fusion techniques. Finally, the answer is predicted based on
the fused multi-modal contextual information.</p>
      <p>We used the MFB pooling method to merge the visual features from the image and
the textual features from the question, as it was shown to combine the compact
output features of MLB with the robust expressive capacity of MCB. For
comparison, we also integrated multi-modal factorized high-order (MFH) pooling, which
consists of N MFB modules (N is a hyper-parameter).</p>
      <p>Each MFB block contains two stages: expand and squeeze. In the expand stage,
the textual context and the visual context are each projected to the same
dimension by a separate fully-connected layer so that they can be multiplied
element-wise. A dropout layer follows the element-wise multiplication
unit. The fused context is then further transformed in the squeeze stage, which
consists of sum pooling, power normalization and L2-normalization.</p>
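      <p>The expand and squeeze stages described above can be illustrated with a short NumPy sketch of a single MFB block. This is a minimal sketch under stated assumptions rather than the system's actual implementation: the function name mfb_block, the random projection matrices, the factor size k, the output size o and the dropout handling are all hypothetical choices made for illustration.</p>
      <preformat>
```python
import numpy as np


def mfb_block(q, v, W_q, W_v, k, drop_rate=0.0, rng=None):
    """One MFB block: expand (project, multiply, dropout), then squeeze.

    q: question feature vector, shape (d_q,)
    v: image feature vector, shape (d_v,)
    W_q, W_v: projection matrices, shapes (d_q, k * o) and (d_v, k * o)
    k: number of factors summed per output unit (hypothetical name)
    """
    # Expand stage: project both modalities to the same dimension k * o,
    # then multiply them element-wise.
    joint = (q @ W_q) * (v @ W_v)
    if drop_rate > 0.0:
        # Inverted dropout applied right after the multiplication.
        if rng is None:
            rng = np.random.default_rng()
        keep = rng.random(joint.shape) >= drop_rate
        joint = joint * keep / (1.0 - drop_rate)
    # Squeeze stage: sum pooling over non-overlapping windows of size k.
    pooled = joint.reshape(-1, k).sum(axis=1)
    # Power normalization (signed square root), then L2 normalization.
    pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))
    norm = np.linalg.norm(pooled)
    if norm > 0:
        pooled = pooled / norm
    return pooled


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
d_q, d_v, k, o = 8, 12, 5, 4
W_q = rng.standard_normal((d_q, k * o))
W_v = rng.standard_normal((d_v, k * o))
z = mfb_block(rng.standard_normal(d_q), rng.standard_normal(d_v), W_q, W_v, k)
print(z.shape)
```
      </preformat>
      <p>In this sketch the fused vector has o entries and unit L2 norm; in a trained model the two projection matrices would be learned parameters rather than random values.</p>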
      <p>
        In the MFH module, the output from the dropout layer of the previous MFB
block is fed into the next MFB block as additional input, as shown in Fig. 2,
and the outputs from the multiple MFB blocks are merged into a final fused
feature representation.
      </p>
      <p>
        Similar to [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ], we also implemented a co-attention mechanism for Med-VQA.
The ResNet152 model pre-trained on ImageNet (excluding the last 3 layers) is
used as an image feature extractor, and an LSTM layer is used to encode the
question into textual feature vectors. A word embedding (dimension
of 200) pre-trained on Wikipedia, PubMed articles and Pittsburgh clinical notes is used as
the embedding input layer. MFB is used to fuse the multi-modal features,
followed by feature transformations (e.g., 1 × 1 convolution and ReLU
activation) and softmax normalization to predict the attention weight for each
grid location. Based on the attention map, the attentional image features are
obtained as the weighted sum of the spatial grid vectors. Multiple attention
maps are generated to enhance the learned attention, and the features attended by these
maps are concatenated to form the attentional image features. Next, the final
attentional image features are merged with the question features using MFB for
downstream answer prediction.
      </p>
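      <p>The attention pooling step described above (softmax over grid locations, weighted sum of grid vectors, concatenation of several attention maps) can be sketched as follows. This is an illustrative sketch only: the function name attend, the toy grid size and feature dimension, and the random attention scores are assumptions; in the system the scores come from the MFB-fused features after the convolution and ReLU transformations.</p>
      <preformat>
```python
import numpy as np


def attend(grid_feats, logits):
    """Pool grid features with softmax attention, one map per glimpse.

    grid_feats: image features, shape (n_locations, d)
    logits: unnormalized attention scores, shape (n_locations, n_glimpses)
    Returns the attention maps and the concatenated attended features.
    """
    # Softmax over grid locations (axis 0), computed in a stable way.
    shifted = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)
    # Weighted sum of grid vectors for each glimpse: (n_glimpses, d).
    attended = weights.T @ grid_feats
    # Concatenate the glimpses into one attentional image feature.
    return weights, attended.reshape(-1)


rng = np.random.default_rng(0)
grid = rng.standard_normal((49, 16))    # toy 7 x 7 grid of 16-dim vectors
scores = rng.standard_normal((49, 2))   # placeholder scores for 2 glimpses
maps, feats = attend(grid, scores)
print(maps.shape, feats.shape)
```
      </preformat>
      <p>Each column of the returned attention maps sums to one over the grid locations, so every glimpse is a convex combination of the spatial feature vectors.</p>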
      <p>Transfer Learning to Tune Pretrained ResNet with ETM Labels</p>
      <p>ImageNet data are very different from the medical images in the Med-VQA task, which
motivates us to employ transfer learning to adapt the pre-trained model to this
task. Instead of fine-tuning the pre-trained model on the fly, we used an off-line transfer
learning method, which can efficiently reduce the training time.</p>
      <p>We explored topic analysis to derive a semantic label for each image in order
to enable transfer learning. The assumption is that the semantics of the
question text should match the corresponding image. However, the question text
is typically short, which makes it challenging for traditional topic analysis approaches,
such as probabilistic latent semantic analysis (PLSA) and latent Dirichlet
allocation (LDA), to infer reliable topics, as only very limited word co-occurrence
information is available in short texts. The embedding-based topic model (ETM) [8] not
only addresses the problem of very limited word co-occurrence information by
aggregating short texts into long pseudo-texts, but also uses a Markov Random
Field regularized model that gives correlated words a better chance of being put
into the same topic, as shown in Fig. 3. First, short texts are merged into long
pseudo-texts by clustering, using a word embedding pre-trained
on a large relevant corpus. Then, the embedding-based topic model is applied to the
long pseudo-texts to generate latent topics.</p>
      <p>Specifically, we applied ETM to the question texts of the Med-VQA data,
assigning a topic label to each question, which can in turn be used as a semantic label
for its corresponding image. We then performed transfer learning in the context
of image classification, where the parameters of the pre-trained residual network were
tuned with the goal of correctly classifying all the images into their corresponding
topic labels. The fine-tuned network (with the last convolution block,
fully-connected layer and softmax layer removed) was used as the static feature extractor in
our system architecture.</p>
      <p>The input to answer prediction is the attentional image features from co-attention,
fused with the LSTM based question representation features through MFB.
Here we employed a simple multi-label classification method where each unique
word in the answer sentence is considered an answer label for the
corresponding image-question pair. Based on the distribution of all the answer labels, the final
answer is generated using a sampling method.</p>
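      <p>The multi-label formulation described above, in which every unique answer word becomes a label, can be sketched in plain Python. The function names build_label_vocab and multi_hot and the toy answers are hypothetical and serve only as an illustration; the sampling step that produces the final answer is not specified in enough detail to sketch and is therefore omitted.</p>
      <preformat>
```python
def build_label_vocab(answers):
    """Map every unique answer word to an integer label index."""
    words = sorted({w for a in answers for w in a.lower().split()})
    return {w: i for i, w in enumerate(words)}


def multi_hot(answer, vocab):
    """Encode one answer as a multi-hot target over the label vocabulary."""
    target = [0] * len(vocab)
    for w in set(answer.lower().split()):
        if w in vocab:  # words unseen when building the vocabulary are skipped
            target[vocab[w]] = 1
    return target


# Toy answers, invented for illustration only.
train_answers = ["left lung mass", "mass in the left kidney"]
vocab = build_label_vocab(train_answers)
print(vocab)
print(multi_hot("left kidney mass", vocab))
```
      </preformat>
      <p>With such targets, the classifier can be trained with one sigmoid output per answer word, so that several words can be active for the same image-question pair.</p>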
    </sec>
    <sec id="sec-4">
      <title>Experiments</title>
      <sec id="sec-4-1">
        <title>Data</title>
        <p>Statistics of the Med-VQA dataset are shown in Table 1. The training, validation and
test data splits have 5413, 500 and 500 instances, respectively. Both questions and
answers are on average longer than those in general-domain VQA datasets.
The word embedding (dimension of 200), which was pre-trained on Wikipedia,
PubMed articles and Pittsburgh clinical notes, has good coverage (roughly
over 95%) of both question and answer words in each data split. Also, note that
the number of images is smaller than the number of question-answer pairs, which
means that several question-answer pairs may share a common image. In particular, in the
training set shown in Table 1, there are 2278 images, which is less than half
of the number of question-answer pairs (5413).</p>
        <p>Question-Answer Pair. Preprocessing of question-answer pairs includes
tokenization and lowercasing, so that each word can be mapped to its dense
representation by looking up the pre-trained word embeddings.</p>
        <p>Image. Although the original preprocessing procedure is recommended to better
facilitate transfer learning, we noticed that many images in the medical VQA
dataset are elongated, consisting of 2 to 5 sub-images. As a result, large areas
would be cut off, and features would be shrunk and blurred, if the
original preprocessing were applied directly. We therefore reshaped the long images
into approximate squares by re-arranging the order of the sub-images. Then, the
original preprocessing used when pre-training the ImageNet ResNet was applied.</p>
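        <p>The re-arrangement of elongated images into approximate squares can be sketched as below. This is a hedged illustration, not the procedure actually used: it assumes that the sub-images are laid out side by side with equal widths and that their number is known, and the function name to_near_square and the zero padding of unused grid cells are hypothetical choices.</p>
        <preformat>
```python
import math

import numpy as np


def to_near_square(image, n_sub):
    """Re-arrange n_sub side-by-side panels of a wide image into a grid.

    image: array of shape (H, W, C) whose panels have equal width
    n_sub: number of panels (assumed to be known in advance)
    """
    h, w_total, c = image.shape
    w = w_total // n_sub
    panels = [image[:, i * w:(i + 1) * w] for i in range(n_sub)]

    # Pick the column count whose grid is closest to a square.
    def squareness(cols):
        rows = math.ceil(n_sub / cols)
        return abs(math.log((cols * w) / (rows * h)))

    cols = min(range(1, n_sub + 1), key=squareness)
    rows = math.ceil(n_sub / cols)
    # Fill the grid row by row; unused cells stay black (zeros).
    grid = np.zeros((rows * h, cols * w, c), dtype=image.dtype)
    for idx, panel in enumerate(panels):
        r, col = divmod(idx, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = panel
    return grid


wide = np.arange(2 * 8 * 1).reshape(2, 8, 1)  # toy 2 x 8 image with 4 panels
square = to_near_square(wide, n_sub=4)
print(square.shape)
```
        </preformat>
        <p>After this step the standard resize and crop used for the pre-trained ResNet would discard far less of the image than it would on the original elongated layout.</p>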
      </sec>
      <sec id="sec-4-2">
        <title>Validation Runs</title>
        <p>We experimented with three co-attention systems with different settings for
feature fusion and transfer learning: (1) ResNet152+MFB, which uses MFB for
feature fusion and uses the pre-trained ResNet152 directly; (2) ResNet152+MFH,
which uses MFH for feature fusion and uses the pre-trained ResNet152 directly;
and (3) ResNet152+ETM+MFH, which uses MFH for feature fusion and
tunes the pre-trained ResNet152 through transfer learning based
on ETM topic modeling. In Fig. 4, we show the performance curves of the 3
systems on the validation dataset. We can see that the MFH based feature fusion consistently
outperforms the MFB based method.</p>
        <p>We submitted 3 valid runs based on the aforementioned system architectures.
The run from "ResNet152+ETM+MFH" achieved the best WBSS score of
0.186, and the run from "ResNet152+MFH" obtained the best BLEU score of
0.162, as shown in Table 2. Note that the run (ID6091) is not a valid run due to
a code error.</p>
        <p>We experimented with 3 different deep learning architectures for the Med-VQA 2018 task
and proposed a novel method for transfer learning using embedding-based
topic analysis. We found that transfer learning and MFH based feature
fusion are helpful in improving the system's performance. Due to time limitations,
we did not model the sequential information in the answer sequence, which will
be explored in the future to make the answers more natural and readable.</p>
        <p>8. Qiang, J., Chen, P., Wang, T., Wu, X.: Topic Modeling over Short Texts by
Incorporating Word Embeddings. arXiv:1609.08496 [cs]. (2016).</p>
      </sec>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Henning Müller, Mauricio Villegas, Alba García Seco de Herrera, Carsten Eickhoff, Vincent Andrearczyk, Yashin Dicente Cid, Vitali Liauchuk, Vassili Kovalev, Sadid A.
          <string-name>
            <surname>Hasan</surname>
          </string-name>
          , Yuan Ling, Oladimeji Farri, Joey Liu, Matthew Lungren,
          <string-name>
            <surname>Duc-Tien</surname>
            Dang-Nguyen,
            <given-names>Luca</given-names>
          </string-name>
          <string-name>
            <surname>Piras</surname>
            , Michael Riegler, Liting Zhou, Mathias Lux and
            <given-names>Cathal</given-names>
          </string-name>
          <string-name>
            <surname>Gurrin</surname>
          </string-name>
          .
          <source>Overview of ImageCLEF</source>
          <year>2018</year>
          :
          <article-title>Challenges, Datasets and Evaluation</article-title>
          ,
          <source>Proceedings of the Ninth International Conference of the CLEF Association (CLEF</source>
          <year>2018</year>
          ),
          <year>2018</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <surname>Hasan</surname>
            ,
            <given-names>S.A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ling</surname>
            ,
            <given-names>Y.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Farri</surname>
            ,
            <given-names>O.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Liu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lungren</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Müller</surname>
          </string-name>
          , H.:
          <article-title>Overview of the ImageCLEF 2018 Medical Domain Visual Question Answering Task</article-title>
          . In: CLEF2018 Working Notes. http://ceur-ws.org/, Avignon, France (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <surname>Ilievski</surname>
            ,
            <given-names>I.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yan</surname>
            ,
            <given-names>S.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Feng</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          :
          <article-title>A Focused Dynamic Attention Model for Visual Question Answering</article-title>
          . arXiv:
          <volume>1604</volume>
          .01485 [cs]. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>On</surname>
          </string-name>
          , K.-W.,
          <string-name>
            <surname>Lim</surname>
            ,
            <given-names>W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ha</surname>
            ,
            <given-names>J.-W.</given-names>
          </string-name>
          , Zhang, B.-T.:
          <article-title>Hadamard Product for Low-rank Bilinear Pooling</article-title>
          . arXiv:
          <volume>1610</volume>
          .04325 [cs]. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>Z.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yu</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Fan</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Tao</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          <article-title>: Multi-modal Factorized Bilinear Pooling with CoAttention Learning for Visual Question Answering</article-title>
          . arXiv:
          <volume>1708</volume>
          .01471 [cs]. (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <surname>Fukui</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Park</surname>
            ,
            <given-names>D.H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Yang</surname>
            ,
            <given-names>D.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>A.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Darrell</surname>
            ,
            <given-names>T.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Rohrbach</surname>
            ,
            <given-names>M.</given-names>
          </string-name>
          :
          <article-title>Multimodal Compact Bilinear Pooling for Visual Question Answering and Visual Grounding</article-title>
          .
          <source>In: Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing</source>
          . pp.
          <fpage>457</fpage>–<lpage>468</lpage>
          .
          <publisher-name>Association for Computational Linguistics</publisher-name>
          , Austin, Texas (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.-H.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Lee</surname>
            ,
            <given-names>S.-W.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Kwak</surname>
          </string-name>
          , D.-H.,
          <string-name>
            <surname>Heo</surname>
          </string-name>
          , M.-O.,
          <string-name>
            <surname>Kim</surname>
            ,
            <given-names>J.</given-names>
          </string-name>
          ,
          <string-name>
            <surname>Ha</surname>
            ,
            <given-names>J.-W.</given-names>
          </string-name>
          , Zhang, B.-T.:
          <article-title>Multimodal Residual Learning for Visual QA</article-title>
          . arXiv:
          <volume>1606</volume>
          .01455 [cs]. (
          <year>2016</year>
          ).
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>