<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>PRNA at ImageCLEF 2017 Caption Prediction and Concept Detection Tasks</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sadid A. Hasan</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yuan Ling</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Joey Liu</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Rithesh Sreenivasan</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Shreya Anand</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Tilak Raj Arora</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Vivek Datla</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kathy Lee</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Ashequl Qadir</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Christine Swisher</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Oladimeji Farri</string-name>
          <email>dimeji.farrig@philips.com</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Artificial Intelligence Laboratory, Philips Research North America</institution>
          ,
          <addr-line>Cambridge, MA</addr-line>
          ,
          <country country="US">USA</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Philips Innovation Campus</institution>
          ,
          <addr-line>Bengaluru</addr-line>
          ,
          <country country="IN">India</country>
        </aff>
      </contrib-group>
      <abstract>
        <p>In this paper, we describe our caption prediction and concept detection systems submitted for the ImageCLEF 2017 challenge. We submitted four runs for the caption prediction task and three runs for the concept detection task using an attention-based image caption generation framework. The attention mechanism automatically learns to emphasize salient parts of the medical image while generating the corresponding words in the output for the caption prediction task and the corresponding clinical concepts for the concept detection task. Our system was ranked first in the caption prediction task and showed decent performance in the concept detection task. We present the evaluation results with a detailed comparison and analysis of our different runs.</p>
      </abstract>
      <kwd-group>
        <kwd>Caption Prediction</kwd>
        <kwd>Concept Detection</kwd>
        <kwd>Encoder-Decoder Framework</kwd>
        <kwd>Attention Mechanism</kwd>
      </kwd-group>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>Introduction</title>
      <p>
        Automatically understanding the content of an image and describing it in natural
language is a challenging task that has gained considerable attention from
computer vision and natural language processing researchers in recent years through
various challenges for visual recognition and caption generation [
        <xref ref-type="bibr" rid="ref1 ref2">1, 2</xref>
        ]. Due to
the ever-increasing number of images in the medical domain that are generated
across the clinical diagnostic pipeline, automated understanding of image
content could be especially beneficial for clinicians, providing useful insights
and reducing the significant burden on the overall workflow across the care
continuum. Motivated by this need for automated image understanding methods in the
healthcare domain, ImageCLEF (http://www.imageclef.org/2017/caption) organized its first caption prediction and
concept detection tasks in 2017 [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. The main objective of the concept detection
task was to retrieve the relevant clinical concepts that are reflected in a
medical image, whereas in the caption prediction task, participants were expected to
leverage the clinical concept vocabulary created in the concept detection task
to generate a coherent caption for each medical image.
      </p>
      <p>Recent advances in deep neural networks have been shown to work well
for large-scale image processing, classification, and captioning tasks. Specifically,
the combined use of deep convolutional neural networks (CNNs) with recurrent
neural networks (RNNs) has demonstrated superior performance on these tasks
[<xref ref-type="bibr" rid="ref5 ref6 ref7 ref8 ref9 ref10 ref11">5-11</xref>], building on sequence-to-sequence learning and
encoder-decoder-based frameworks for neural machine translation [<xref ref-type="bibr" rid="ref12 ref13 ref14">12-14</xref>].</p>
      <p>
        Motivated by the success of such prior work, we use an encoder-decoder-based
deep neural network architecture for the caption prediction task [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ], where
the encoder uses a deep CNN [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ] to encode a raw medical image into a feature
representation, which is in turn decoded using an attention-based RNN to generate
the most relevant caption for the given image. We follow a similar approach to
address the concept detection task by treating it as a text generation problem.
Our system was ranked first in the caption prediction task and showed decent
performance in the concept detection task. In the following sections, we describe the
experimental setup, present the evaluation results with analysis, and conclude
the paper.
      </p>
    </sec>
    <sec id="sec-2">
      <title>Experimental Setup</title>
      <sec id="sec-2-1">
        <title>Data</title>
        <p>The training data contains 164,614 biomedical images with associated clinical
concepts or captions extracted from PubMed Central (https://www.ncbi.nlm.nih.gov/pmc/). Furthermore, 10K images
per task are provided as the validation set, and 10K additional images are
provided as the test set for both tasks.</p>
      </sec>
      <sec id="sec-2-2">
        <title>Training</title>
        <p>
          We use an encoder-decoder-based framework in which a CNN-based architecture
extracts the image feature representation and an RNN-based architecture with
an attention mechanism translates that representation into
relevant captions [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]. We use the VGGnet-19 [
          <xref ref-type="bibr" rid="ref5">5</xref>
          ] deep CNN model pre-trained
on the ImageNet dataset [
          <xref ref-type="bibr" rid="ref6">6</xref>
          ], fine-tuned on the given ImageCLEF training
dataset, to extract the image feature representation from a lower convolutional
layer so that the decoder can focus on the salient aspects of the image via an
attention mechanism.
        </p>
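        <p>As an illustration, the following is a minimal PyTorch sketch of such an encoder; the use of torchvision's VGG-19 and the choice of the last convolutional block (a 14x14 grid of 512-dimensional annotation vectors) are our assumptions, since the exact layer is not specified above.</p>
        <preformat>
import torch
import torchvision.models as models

# VGG-19 pre-trained on ImageNet; fine-tuning on the ImageCLEF data
# would follow and is omitted here.
vgg = models.vgg19(pretrained=True)

# Drop the final max-pool so the output of the last conv block is kept:
# a 14x14 spatial grid of 512-dim feature vectors ("annotation vectors")
# that the decoder's attention mechanism can weight per location.
encoder = torch.nn.Sequential(*list(vgg.features.children())[:-1])

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)              # dummy pre-processed image
    feats = encoder(image)                           # (1, 512, 14, 14)
    annotations = feats.flatten(2).permute(0, 2, 1)  # (1, 196, 512)
        </preformat>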
        <p>
          The decoder uses a long short-term memory (LSTM) network [
          <xref ref-type="bibr" rid="ref15">15</xref>
          ] with a
soft attention mechanism [
          <xref ref-type="bibr" rid="ref12 ref9">12, 9</xref>
          ] that generates a caption by predicting one word
at every time step based on a context vector (which represents the important
parts of the image to focus on), the previous hidden state, and the previously
generated words.
        </p>
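        <p>A minimal sketch of one such soft-attention step is shown below; the layer sizes and names are our assumptions, and only the scoring and weighted-sum logic follows the description above.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.nn.functional as F

class SoftAttention(nn.Module):
    """Scores each image location against the previous decoder state and
    returns the softmax-weighted sum of annotation vectors (the context)."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.state_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, annotations, h_prev):
        # annotations: (B, L, feat_dim); h_prev: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(annotations)
            + self.state_proj(h_prev).unsqueeze(1))).squeeze(-1)    # (B, L)
        alpha = F.softmax(e, dim=1)           # attention weights per location
        context = (alpha.unsqueeze(-1) * annotations).sum(dim=1)    # (B, feat_dim)
        return context, alpha
        </preformat>
        <p>At each time step, the context vector is fed to the LSTM together with the embedding of the previously generated word to predict the next word.</p>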
        <p>
        Our models are trained with stochastic gradient descent using Adam [
        <xref ref-type="bibr" rid="ref16">16</xref>
        ]
as the adaptive learning rate algorithm and dropout [
        <xref ref-type="bibr" rid="ref17">17</xref>
        ] as the regularization
mechanism. Our models were trained on two NVIDIA Tesla M40 GPUs.
        </p>
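        <p>A minimal, self-contained sketch of this training setup is shown below; the stand-in decoder, dropout rate, learning rate, and vocabulary size are illustrative assumptions, as the exact hyperparameters are not reported here.</p>
        <preformat>
import torch
import torch.nn as nn
import torch.optim as optim

# A stand-in decoder with dropout regularization (architecture and rate
# are illustrative assumptions, not the actual model).
decoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(),
                        nn.Dropout(p=0.5), nn.Linear(256, 10000))

# Adam as the adaptive learning-rate algorithm (lr is an assumption).
optimizer = optim.Adam(decoder.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

feats = torch.randn(32, 512)                 # dummy image features
targets = torch.randint(0, 10000, (32,))     # dummy next-word targets

optimizer.zero_grad()
loss = loss_fn(decoder(feats), targets)      # word-level cross-entropy
loss.backward()                              # backprop through the decoder
optimizer.step()                             # Adam parameter update
        </preformat>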
      <sec id="sec-3-1">
        <title>Run Description</title>
        <p>For the caption prediction task, we submitted four runs as follows
(a minimal sketch of the CUI-to-term replacement used in Run3 and Run4 follows this list):</p>
        <list list-type="bullet">
          <list-item>
            <p>Run1: This run does not apply any semantic pre-processing to the
captions; the entire training and validation sets are used to train the model as
described in Section 2.2.</p>
          </list-item>
          <list-item>
            <p>Run2: This run applies semantic pre-processing to the captions using MetaMap
[<xref ref-type="bibr" rid="ref18">18</xref>] and the Unified Medical Language System (UMLS) Metathesaurus
[<xref ref-type="bibr" rid="ref19">19</xref>]. The modified VGG19 model is initially trained with a randomly selected
subset of 20K ImageCLEF training images to automatically generate image
features and classify the imaging modality, and the full model is then trained as
described in Section 2.2 with a random subset of 24K training images and 2K
validation images to reduce time and computational complexity.</p>
          </list-item>
          <list-item>
            <p>Run3: This run is similar to Run1, except that we automatically generate UMLS
concept unique identifiers (CUIs) using the training dataset of the concept
detection task, instead of the captions from the caption prediction task, and
then replace each CUI (generated for the test set) with the longest relevant
clinical term from the UMLS Metathesaurus to form the caption.</p>
          </list-item>
          <list-item>
            <p>Run4: This run is similar to Run3, except that we replace the CUIs (generated
for the test set) with all relevant clinical terms (including synonyms) from
the UMLS Metathesaurus to form the caption.</p>
          </list-item>
        </list>
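        <p>A minimal sketch of the Run3/Run4-style CUI-to-term replacement is shown below; the lookup table is purely illustrative, and a real implementation would be built from the UMLS Metathesaurus.</p>
        <preformat>
# Hypothetical CUI-to-terms lookup; real entries would come from the
# UMLS Metathesaurus (e.g., MRCONSO).
CUI_TERMS = {
    "C0024109": ["lung", "pulmonary structure"],
    "C0040405": ["x-ray computed tomography", "CT scan"],
}

def caption_from_cuis(cuis, mode="longest"):
    """Run3-style: longest term per CUI; Run4-style ("all"): every term."""
    parts = []
    for cui in cuis:
        terms = CUI_TERMS.get(cui, [])
        if not terms:
            continue                     # skip CUIs with no known term
        if mode == "longest":
            parts.append(max(terms, key=len))
        else:
            parts.extend(terms)
    return " ".join(parts)

print(caption_from_cuis(["C0040405", "C0024109"]))
        </preformat>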
        <p>For the concept detection task, we submitted three runs as follows
(a sketch of the term-to-CUI transformation used in Run2 and Run3 follows this list):</p>
        <list list-type="bullet">
          <list-item>
            <p>Run1: In this run, we treat the task as a sequence-to-sequence generation
problem similar to caption generation, where the CUIs associated with an
image are simply treated as a sequence of concepts; the entire training and
validation sets are used to train the model as described in Section 2.2.</p>
          </list-item>
          <list-item>
            <p>Run2: This run is created by transforming the captions generated
for the test set by Run1 of the caption prediction task, replacing
clinical terms with the best matching CUIs from the UMLS Metathesaurus.</p>
          </list-item>
          <list-item>
            <p>Run3: This run is created by transforming the captions generated
for the test set by Run2 of the caption prediction task, replacing
clinical terms with the best matching CUIs from the UMLS Metathesaurus.</p>
          </list-item>
        </list>
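        <p>Below is a minimal sketch of the Run2/Run3-style transformation from generated captions to CUIs; the mapping is again illustrative, whereas the actual system relies on the UMLS Metathesaurus.</p>
        <preformat>
# Hypothetical term-to-CUI lookup; a real mapping would come from
# MetaMap / the UMLS Metathesaurus.
TERM_TO_CUI = {
    "x-ray computed tomography": "C0040405",
    "ct scan": "C0040405",
    "lung": "C0024109",
}

def cuis_from_caption(caption):
    """Greedily replace known clinical terms (longest first) with CUIs."""
    found, text = [], caption.lower()
    for term in sorted(TERM_TO_CUI, key=len, reverse=True):
        if term in text:
            cui = TERM_TO_CUI[term]
            if cui not in found:
                found.append(cui)
            text = text.replace(term, " ")   # avoid re-matching subterms
    return found

print(cuis_from_caption("Axial CT scan of the lung"))  # ['C0040405', 'C0024109']
        </preformat>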
      </sec>
      <sec id="sec-3-2">
        <title>Evaluation and Analysis</title>
        <p>
          The caption prediction task is evaluated using the well-known
machine translation metric BLEU [
          <xref ref-type="bibr" rid="ref20">20</xref>
          ], whereas the F1 score is used to evaluate
the concept detection systems. Table 1 and Table 2 show the evaluation results.
        </p>
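        <p>For reference, a sentence-level BLEU score can be sanity-checked with NLTK as sketched below; the tokens are made up, and the official challenge scorer may differ in tokenization and smoothing.</p>
        <preformat>
# Minimal BLEU check with NLTK (pip install nltk).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["axial", "ct", "scan", "of", "the", "chest"]   # ground truth
candidate = ["ct", "scan", "of", "the", "chest"]            # generated caption

score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(round(score, 3))
        </preformat>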
        <p>We can see that for the caption prediction task, Run4 and Run1 achieved
high scores, demonstrating the effectiveness of our approach. Overall, our system was
ranked first in the caption prediction task. Run4 scores better because it includes all
possible terms from the ontologies in the generated caption, but it trades off the
coherence of the caption. This approach increases the BLEU score, which
essentially computes exact word overlaps between the generated caption and the
ground-truth caption. Run2 likely suffered from the limited training data, whereas
Run3 scores lower because it accepts only the longest possible clinical term as a
replacement for each CUI in the caption.</p>
        <p>For the concept detection task, Run1 performed reasonably well, but shows
that there is still room for improvement. We may consider treating the task as
a multi-label classification problem to achieve possible improvements; a minimal
sketch of this alternative follows this paragraph. Run2 and
Run3 were limited by the two-step translation of clinical terms to CUIs from
the generated captions of the other task, which potentially indicates propagation
of errors from caption learning to the downstream task.</p>
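        <p>A minimal sketch of such a multi-label head is shown below; the feature dimensionality and number of concepts are illustrative assumptions.</p>
        <preformat>
import torch
import torch.nn as nn

class ConceptClassifier(nn.Module):
    """Multi-label alternative: one sigmoid output per CUI, rather than
    generating CUIs as a sequence."""
    def __init__(self, feat_dim=512, num_concepts=20000):
        super().__init__()
        self.head = nn.Linear(feat_dim, num_concepts)

    def forward(self, pooled_image_feats):
        # Raw logits; train with nn.BCEWithLogitsLoss against a 0/1
        # vector marking the CUIs present in the image.
        return self.head(pooled_image_feats)

model = ConceptClassifier()
logits = model(torch.randn(4, 512))          # (4, 20000) concept logits
        </preformat>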
      </sec>
    </sec>
    <sec id="sec-3">
      <title>Conclusion</title>
      <p>We presented the details of our participation in the caption prediction and
concept detection tasks of the ImageCLEF 2017 challenge. Our system was ranked
first in the caption prediction task and showed decent performance in the
concept detection task. Overall, the evaluation results demonstrated the effectiveness of our
approach. We highlighted potential reasons for errors in our submissions and
discussed future work to consider for improved results.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          1.
          <string-name>
            <given-names>Olga</given-names>
            <surname>Russakovsky</surname>
          </string-name>
          , Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael S. Bernstein,
          <string-name>
            <surname>Alexander C. Berg</surname>
          </string-name>
          , and
          <string-name>
            <surname>Fei-Fei Li</surname>
          </string-name>
          .
          <article-title>ImageNet Large Scale Visual Recognition Challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>115</volume>
          (
          <issue>3</issue>
          ):
          <fpage>211</fpage>
          -
          <lpage>252</lpage>
          (
          <year>2015</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          2.
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Alexander Toshev, Samy Bengio, and
          <string-name>
            <given-names>Dumitru</given-names>
            <surname>Erhan</surname>
          </string-name>
          .
          <article-title>Show and Tell: Lessons Learned from the 2015 MSCOCO Image Captioning Challenge</article-title>
          .
          <source>IEEE Trans. Pattern Anal. Mach. Intell</source>
          .
          <volume>39</volume>
          (
          <issue>4</issue>
          ):
          <fpage>652</fpage>
          -
          <lpage>663</lpage>
          (
          <year>2017</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          3.
          <string-name>
            <given-names>Bogdan</given-names>
            <surname>Ionescu</surname>
          </string-name>
          , Henning Müller, Mauricio Villegas, Helbert Arenas, Giulia Boato, Duc-Tien Dang-Nguyen, Yashin Dicente Cid, Carsten Eickhoff, Alba García Seco de Herrera, Cathal Gurrin, Bayzidul Islam, Vassili Kovalev, Vitali Liauchuk, Josiane Mothe, Luca Piras, Michael Riegler, and
          <string-name>
            <given-names>Immanuel</given-names>
            <surname>Schwall</surname>
          </string-name>
          .
          <article-title>Overview of ImageCLEF 2017: Information extraction from images</article-title>
          .
          <source>Experimental IR Meets Multilinguality, Multimodality, and Interaction 8th International Conference of the CLEF Association, CLEF 2017</source>
          , Springer LNCS 10456,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          4.
          <string-name>
            <given-names>Carsten</given-names>
            <surname>Eickhoff</surname>
          </string-name>
          , Immanuel Schwall, Alba García Seco de Herrera, and
          <string-name>
            <given-names>Henning</given-names>
            <surname>Müller</surname>
          </string-name>
          .
          <article-title>Overview of ImageCLEFcaption 2017 - Image Caption Prediction and Concept Detection for Biomedical Images</article-title>
          ,
          <source>CLEF 2017 Labs Working Notes, CEUR Workshop Proceedings</source>
          ,
          <year>2017</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          5.
          <string-name>
            <given-names>Karen</given-names>
            <surname>Simonyan</surname>
          </string-name>
          and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <article-title>Very Deep Convolutional Networks for Large-Scale Image Recognition</article-title>
          ,
          <source>arXiv:1409.1556</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          6.
          <string-name>
            <given-names>Alex</given-names>
            <surname>Krizhevsky</surname>
          </string-name>
          , Ilya Sutskever, and
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          .
          <article-title>ImageNet Classification with Deep Convolutional Neural Networks</article-title>
          .
          <source>NIPS</source>
          <year>2012</year>
          :
          <fpage>1106</fpage>
          -
          <lpage>1114</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          7.
          <string-name>
            <given-names>Oriol</given-names>
            <surname>Vinyals</surname>
          </string-name>
          , Alexander Toshev, Samy Bengio, and
          <string-name>
            <given-names>Dumitru</given-names>
            <surname>Erhan</surname>
          </string-name>
          .
          <article-title>Show and tell: A neural image caption generator</article-title>
          .
          <source>CVPR</source>
          <year>2015</year>
          :
          <fpage>3156</fpage>
          -
          <lpage>3164</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          8.
          <string-name>
            <given-names>Je</given-names>
            <surname>Donahue</surname>
          </string-name>
          , Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Trevor Darrell, and
          <string-name>
            <given-names>Kate</given-names>
            <surname>Saenko</surname>
          </string-name>
          .
          <article-title>Long-term recurrent convolutional networks for visual recognition and description</article-title>
          .
          <source>CVPR</source>
          <year>2015</year>
          :
          <fpage>2625</fpage>
          -
          <lpage>2634</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          9.
          <string-name>
            <given-names>Kelvin</given-names>
            <surname>Xu</surname>
          </string-name>
          , Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Show, Attend and Tell: Neural Image Caption Generation with Visual Attention</article-title>
          .
          <source>ICML</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>
          10.
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          , Volodymyr Mnih, and
          <string-name>
            <given-names>Koray</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          .
          <article-title>Multiple Object Recognition with Visual Attention</article-title>
          .
          <source>ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref11">
        <mixed-citation>
          11.
          <string-name>
            <given-names>Volodymyr</given-names>
            <surname>Mnih</surname>
          </string-name>
          , Nicolas Heess, Alex Graves, and
          <string-name>
            <given-names>Koray</given-names>
            <surname>Kavukcuoglu</surname>
          </string-name>
          .
          <article-title>Recurrent Models of Visual Attention</article-title>
          .
          <source>NIPS</source>
          ,
          <year>2014</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref12">
        <mixed-citation>
          12.
          <string-name>
            <given-names>Dzmitry</given-names>
            <surname>Bahdanau</surname>
          </string-name>
          , Kyunghyun Cho, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Neural machine translation by jointly learning to align and translate</article-title>
          .
          <source>ICLR</source>
          ,
          <year>2015</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref13">
        <mixed-citation>
          13.
          <string-name>
            <given-names>Ilya</given-names>
            <surname>Sutskever</surname>
          </string-name>
          , Oriol Vinyals, and
          <string-name>
            <given-names>Quoc V.</given-names>
            <surname>Le</surname>
          </string-name>
          .
          <article-title>Sequence to Sequence Learning with Neural Networks</article-title>
          .
          <source>NIPS</source>
          <year>2014</year>
          :
          <fpage>3104</fpage>
          -
          <lpage>3112</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref14">
        <mixed-citation>
          14.
          <string-name>
            <given-names>Kyunghyun</given-names>
            <surname>Cho</surname>
          </string-name>
          , Bart van Merrienboer, Çağlar Gülçehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and
          <string-name>
            <given-names>Yoshua</given-names>
            <surname>Bengio</surname>
          </string-name>
          .
          <article-title>Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation</article-title>
          .
          <source>EMNLP</source>
          <year>2014</year>
          :
          <fpage>1724</fpage>
          -
          <lpage>1734</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref15">
        <mixed-citation>
          15.
          <string-name>
            <given-names>Sepp</given-names>
            <surname>Hochreiter</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jürgen</given-names>
            <surname>Schmidhuber</surname>
          </string-name>
          .
          <article-title>Long Short-Term Memory</article-title>
          .
          <source>Neural Computation</source>
          <volume>9</volume>
          (
          <issue>8</issue>
          ):
          <fpage>1735</fpage>
          -
          <lpage>1780</lpage>
          (
          <year>1997</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref16">
        <mixed-citation>
          16.
          <string-name>
            <given-names>Diederik P.</given-names>
            <surname>Kingma</surname>
          </string-name>
          and
          <string-name>
            <given-names>Jimmy</given-names>
            <surname>Ba</surname>
          </string-name>
          .
          <article-title>Adam: A Method for Stochastic Optimization</article-title>
          .
          <source>CoRR abs/1412.6980</source>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref17">
        <mixed-citation>
          17.
          <string-name>
            <given-names>Nitish</given-names>
            <surname>Srivastava</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Geoffrey E.</given-names>
            <surname>Hinton</surname>
          </string-name>
          , Alex Krizhevsky, Ilya Sutskever, and
          <string-name>
            <given-names>Ruslan</given-names>
            <surname>Salakhutdinov</surname>
          </string-name>
          .
          <article-title>Dropout: a simple way to prevent neural networks from overfitting</article-title>
          .
          <source>Journal of Machine Learning Research</source>
          <volume>15</volume>
          (
          <issue>1</issue>
          ):
          <fpage>1929</fpage>
          -
          <lpage>1958</lpage>
          (
          <year>2014</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref18">
        <mixed-citation>
          18.
          <string-name>
            <given-names>Alan R.</given-names>
            <surname>Aronson</surname>
          </string-name>
          .
          <article-title>Effective mapping of biomedical text to the UMLS Metathesaurus: the MetaMap program</article-title>
          .
          <source>AMIA</source>
          <year>2001</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref19">
        <mixed-citation>
          19.
          <string-name>
            <given-names>Olivier</given-names>
            <surname>Bodenreider</surname>
          </string-name>
          .
          <article-title>The Unified Medical Language System (UMLS): integrating biomedical terminology</article-title>
          .
          <source>Nucleic acids research</source>
          ,
          <volume>32</volume>
          (
          <issue>suppl 1</issue>
          ):
          <fpage>D267</fpage>
          -
          <lpage>D270</lpage>
          ,
          <year>2004</year>
          .
        </mixed-citation>
      </ref>
      <ref id="ref20">
        <mixed-citation>
          20.
          <string-name>
            <given-names>Kishore</given-names>
            <surname>Papineni</surname>
          </string-name>
          , Salim Roukos, Todd Ward, and
          <string-name>
            <surname>Wei-Jing Zhu</surname>
          </string-name>
          .
          <article-title>BLEU: A Method for Automatic Evaluation of Machine Translation</article-title>
          .
          <source>ACL</source>
          <year>2002</year>
          :
          <fpage>311</fpage>
          -
          <lpage>318</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>