<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>A Temporal-Spatial Attention Model for Medical Image Detection</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Maxwell Hwang</string-name>
          <email>hwang@g-mail.nsysu.edu.tw</email>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Cai-Wu</string-name>
          <xref ref-type="aff" rid="aff2">2</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Kao-Shing Hwang</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Yong Si Xu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Chien-Hsing Wu</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of Colorectal Surgery, the Second Affiliated Hospital of Zhejiang University School of Medicine</institution>
          ,
          <addr-line>Zhejiang</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Electrical Engineering, National Sun Yat-sen University</institution>
          ,
          <addr-line>Kaohsiung 80424</addr-line>
          ,
          <country country="TW">Taiwan</country>
        </aff>
        <aff id="aff2">
          <label>2</label>
          <institution>Department of Hematology, the Fourth Affiliated Hospital of Zhejiang University School of Medicine</institution>
          ,
          <addr-line>Zhejiang</addr-line>
          ,
          <country country="CN">China</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>A local region model with attentive temporal-spatial pathways is proposed for automatically learning various target structures. The attentive spatial pathway highlights the salient region to generate bounding boxes and ignores irrelevant regions in an input image. The proposed attention mechanism allows efficient object localization, and the overall predictive performance is increased because there are fewer false positives in the object detection task for medical images with manual annotations. The experimental results show that the proposed models consistently increase the base architecture's predictive performance on the Medico dataset with satisfactory computational efficiency.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>1 INTRODUCTION</title>
      <p>
        This study proposes a simple and efective solution that interfaces
an attention mechanism in a standard CNN model. The feature
maps are utilized more eficiently, and localization does not require
processing the entire image. The proposed attentive model, which
consists of tempo-spatial pathways, automatically learns to focus on
target structures without additional supervision. The spatial
pathway generates local region proposals on-the-fly using the salient
features for a specific task. The temporal attention model proposes
a sequence of locations for the local region search and not the
entire image, so the computational overhead is significantly reduced,
and many model parameters are omitted, similarly to multi-model
frameworks. CNN models that use the proposed attentive model
can be trained from scratch using standard methods or transfer
learning. Similar attention mechanisms have been proposed for
natural image classification and captioning [
        <xref ref-type="bibr" rid="ref2 ref4">2, 4</xref>
        ] for adaptive
feature pooling, where model predictions are conditioned only using
a subset of selected image regions. The proposed process assigns
attention coeficients to specific local regions.
      </p>
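      <p>As a minimal sketch of this kind of adaptive feature pooling (hypothetical names, NumPy only; not the authors' implementation), attention coefficients can be obtained by a softmax over per-region scores and used to weight the region features:

```python
import numpy as np

def attention_pool(region_features, scores):
    """Pool local-region features with softmax attention coefficients.

    region_features: (R, D) array, one feature vector per local region.
    scores: (R,) raw compatibility scores for the current task.
    """
    # Softmax turns raw scores into coefficients that sum to 1, so the
    # prediction is conditioned on a weighted subset of image regions.
    exp = np.exp(scores - scores.max())
    coeffs = exp / exp.sum()
    pooled = coeffs @ region_features   # (D,) attended feature
    return pooled, coeffs

feats = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
pooled, coeffs = attention_pool(feats, np.array([2.0, 0.1, 0.1]))
```

Regions with higher scores dominate the pooled feature, which is what lets irrelevant regions be effectively ignored.</p>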
      <p>
        This study uses a novel hybrid attention model (HAM) as an
interface between any feature extractors, such as a CNN, and a
decision-making module for end-to-end tasks, such as RL,
classification, regression. The proposed module determines spatial pinpoints
in feature space using a hard attestation pathway. The model also
synthesizes the context vector using a soft attention mechanism
and a GRU for decision-making downstream. Real images are used
to determine the eficacy of the proposed model and are used as
a pre-training data set for detection and classification for
colonoscopic images [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] that are the motif of this work. The contributions
of this work are summarized as follows:
      </p>
      <p>A hybrid attention approach allows an attention mechanism
specific to local regions and the subsequent strategy or
decision-making process. This improved model performs better than
state-of-the-art methods that use global or local search schemes.</p>
      <p>
        An attention interface is used for region proposals and sequential
search of glimpses on local regions simultaneously for medical
images. The proposed attention interface, which can be trained
from end to end, replaces the hard-attention approaches currently
used only for image classification. It eliminates the need for the
global generation of bounding boxes for a Faster R-CNN [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] and
provides better accuracy and greater computational eficiency than
a local search scheme method. The study demonstrates that the
proposed attention mechanism produces fine-scale attention maps
that can be visualized with minimal computational overhead.
      </p>
      <p>A masking scheme is applied to the distribution of attention
scores to increase computational efficiency, instead of being imposed
directly on the feature map and influencing downstream operations.
It ensures better classification performance than the baseline
approach. It is shown that attention maps and an observation pinpoint
allow fewer glimpses and more useful observations. A modification
to the standard FPN is used for feature extraction, so the process is
both sensitive and specific.</p>
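      <p>A minimal sketch of such a masking scheme (the exact rule is an assumption, since the paper does not specify it): keep only the top-k attention probabilities and renormalize the distribution, leaving the feature map itself untouched:

```python
import numpy as np

def mask_attention_scores(probs, keep=2):
    """Keep the top-`keep` attention probabilities, zero the rest,
    and renormalize. Masking the score distribution (rather than the
    feature map) leaves downstream feature operations unaffected."""
    masked = np.zeros_like(probs)
    top = np.argsort(probs)[-keep:]     # indices of the largest scores
    masked[top] = probs[top]
    return masked / masked.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
masked = mask_attention_scores(probs, keep=2)
```

Because only the score vector is edited, the masking adds a negligible number of operations compared with zeroing entries of the full feature map.</p>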
    </sec>
    <sec id="sec-2">
      <title>2 APPROACH</title>
    </sec>
    <sec id="sec-3">
      <title>2.1 Method</title>
      <p>
        The process for the proposed local search method for polyps
detection involves two stages [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. During the first stage, the local region
proposal network (RPN) proposes candidate ROIs from glimpsed
regions located in sequence by the HAM. The weighted feature’s
attention scores are used to determine a glimpsed region in which
target objects may reside. Bounding boxes are generated, and the
process and the process then involves classification and position
regression for preliminary screening. The confidence index for the
classification is used to determine bounding boxes with higher
values. Local non-maximum suppression is used to filter out some
bounding boxes as regions of interest (ROIs), and these are used as
inputs for the second stage network, which involves binding box
regression and classification. When the RoIs are generated and
accumulated in all the sequences for classification and bounding box
regression, an exhaustive search is initiated. This process involves
considerable computing resources, so a method that uses a hybrid
attention mechanism with RL to the RPN reduces calculation.
      </p>
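      <p>The local non-maximum suppression step can be sketched as follows (a generic greedy NMS in NumPy, not the authors' exact code):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression.

    boxes: (N, 4) array of (x1, y1, x2, y2); scores: (N,) confidences.
    Returns indices of the boxes kept as regions of interest.
    """
    order = np.argsort(scores)[::-1]        # highest confidence first
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection-over-union of the top box with the remainder.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area = lambda b: (b[:, 2] - b[:, 0]) * (b[:, 3] - b[:, 1])
        iou = inter / (area(boxes[i:i + 1])[0] + area(boxes[rest]) - inter)
        order = rest[iou_thresh >= iou]     # drop heavily overlapping boxes
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
kept = nms(boxes, np.array([0.9, 0.8, 0.7]))
```

Here the second box overlaps the first at IoU above 0.5 and is suppressed, while the distant third box survives.</p>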
      <p>Instead of an exhaustive search over the entire image, the
proposed method uses a Faster R-CNN with a sequential search directed
by a hybrid attention module (HAM) to determine glimpse regions
that are likely to contain an object. ROIs are generated in a
restricted area where target objects are likely to be located. This local
search reduces the amount of calculation for insignificant ROIs. The
proposed model has four modules: a CNN-based feature extractor,
the proposed HAM, a local RPN, and a detector for bounding box
regression and object classification. Glimpse regions are pinpointed
sequentially, and the length of the sequence of glimpses is determined
on the fly. The local RPN generates bounding boxes of different sizes
and aspect ratios within a glimpsed region. The detector regresses
bounding boxes and classifies objects. The architecture of the HAM
is shown in Figure 1.</p>
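      <p>A minimal sketch of the pinpoint-and-glimpse step under stated assumptions (the HAM's actual scoring network is learned; here the pinpoint is simply taken as the peak of an attention score map):

```python
import numpy as np

def glimpse(feature_map, att_scores, size=3):
    """Pick a pinpoint at the peak of the attention map, then crop a
    square glimpse window around it, clamped to the map bounds.
    The local RPN proposes boxes only inside this window."""
    h, w = feature_map.shape[:2]
    r, c = np.unravel_index(np.argmax(att_scores), att_scores.shape)
    r0 = int(np.clip(r - size // 2, 0, max(h - size, 0)))
    c0 = int(np.clip(c - size // 2, 0, max(w - size, 0)))
    return feature_map[r0:r0 + size, c0:c0 + size], (r, c)

fmap = np.arange(64.0).reshape(8, 8)
att = np.zeros((8, 8))
att[4, 5] = 1.0                          # attention peak at row 4, col 5
window, pinpoint = glimpse(fmap, att, size=3)
```

Restricting the RPN to the cropped window is what turns the exhaustive global search into a local one.</p>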
    </sec>
    <sec id="sec-4">
      <title>2.2 Preparation and Data Set</title>
      <p>
        The experiments were executed using the Ubuntu 18.04 operating
system, Python 3.7, Tensorflow. The data sets for the experiments
are provided in Medico Challenge [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. A public data set of real scenes
(PASCAL VOC [
        <xref ref-type="bibr" rid="ref3">3</xref>
        ]) is used to pre-train the Faster R-CNN framework.
The data set contains only images, so data augmentation operations,
such as rotation, reflection, and resizing, increase the number of
images. Five-fold cross-validation is used for the experiments.
      </p>
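      <p>The five-fold protocol can be sketched as follows (NumPy only; the actual pipeline presumably used a library utility, which the paper does not name):

```python
import numpy as np

def five_fold_splits(n_samples, seed=0):
    """Yield (train_idx, val_idx) pairs for 5-fold cross-validation:
    shuffle once, split into 5 folds, hold each fold out in turn."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n_samples), 5)
    for k in range(5):
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, folds[k]

splits = list(five_fold_splits(50))
```

Each image appears in exactly one validation fold, so the reported metrics average over five disjoint held-out sets.</p>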
    </sec>
    <sec id="sec-5">
      <title>3 RESULTS OF COMPARISONS WITH PEER METHODS</title>
      <p>The results for the colonoscopy dataset in Figure 2 show that
HAM-beta and HAM-beta-mask are similar to drl-RPN in terms
of AP50. There are fewer average glimpses and a smaller average
glimpsed area than for drl-RPN, and the AP density and glimpse
contribution are better than those of peer methods.</p>
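      <p>Detection quality at an IoU threshold of 0.5 is scored by matching predictions to ground-truth boxes; a minimal sketch of that matching (generic evaluation logic, not the challenge's official scorer):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    iy = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def true_positives(preds, gts, thresh=0.5):
    """Count predictions (sorted by confidence) that each match a
    not-yet-used ground-truth box at IoU at or above `thresh`."""
    used, tp = set(), 0
    for p in preds:
        for j, g in enumerate(gts):
            if j not in used and iou(p, g) >= thresh:
                used.add(j)
                tp += 1
                break
    return tp

tp = true_positives([[0, 0, 10, 10], [20, 20, 30, 30]], [[1, 1, 11, 11]])
```

Precision and recall follow from the TP count, and sweeping the confidence threshold yields the average-precision curves summarized in the figure.</p>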
      <p>The drl-RPN must search three times for important areas before
terminating the glimpsing process, requiring more computation
time. HAM-beta and HAM-beta-mask accurately locate the
correct region in the first search.</p>
    </sec>
    <sec id="sec-7">
      <title>4 CONCLUSION AND FUTURE WORK</title>
      <p>This study proposes an innovative attention module that uses both soft
and hard attention. This module can interface with any architecture
that involves simultaneous spatial and temporal tasks, such as polyp
detection. A global search scans the entire image in an object
detection task, but it requires much time and many resources. The proposed
approach obviates the need for an extra model by learning to
highlight salient local regions in images. The proposed
temporal-spatial attention module leverages the salient information in the
state space for a policy learner, such as reinforcement learning, in
addition to object detection in image tasks.</p>
    </sec>
    <sec id="sec-8">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work is supported by a grant from the Key Project of the Yiwu
Science and Technology Plan, China (No. 20-3-067).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Sharib</given-names>
            <surname>Ali</surname>
          </string-name>
          , Felix Zhou, Barbara Braden, Adam Bailey, Suhui Yang, Guanju Cheng, Pengyi Zhang, Xiaoqiong Li,
          <string-name>
            <given-names>Maxime</given-names>
            <surname>Kayser</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Roger D.</given-names>
            <surname>Soberanis-Mukul</surname>
          </string-name>
          , Shadi Albarqouni, Xiaokang Wang,
          <string-name>
            <given-names>Chunqing</given-names>
            <surname>Wang</surname>
          </string-name>
          , Seiryo Watanabe, Ilkay Oksuz, Qingtian Ning, Shufan Yang, Mohammad Azam Khan, Xiaohong W Gao, Stefano Realdon, Maxim Loshchenov, Julia A Schnabel, James E East, Georges Wagnieres, Victor B Loschenov, Enrico Grisan, Christian Daul, Walter Blondel, and
          <string-name>
            <given-names>Jens</given-names>
            <surname>Rittscher</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>An objective comparison of detection and segmentation algorithms for artefacts in clinical endoscopy</article-title>
          .
          <source>Scientific Reports</source>
          <volume>10</volume>
          ,
          <issue>1</issue>
          (
          <year>2020</year>
          ),
          <elocation-id>2748</elocation-id>
          . https://doi.org/10.1038/s41598-020-59413-5
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Peter</given-names>
            <surname>Anderson</surname>
          </string-name>
          , Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and
          <string-name>
            <given-names>Lei</given-names>
            <surname>Zhang</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Bottom-up and top-down attention for image captioning and visual question answering</article-title>
          .
          <source>In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>
          .
          <fpage>6077</fpage>
          -
          <lpage>6086</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>Mark</given-names>
            <surname>Everingham</surname>
          </string-name>
          , Luc Van Gool,
          <string-name>
            <given-names>Christopher K. I.</given-names>
            <surname>Williams</surname>
          </string-name>
          , John Winn, and
          <string-name>
            <given-names>Andrew</given-names>
            <surname>Zisserman</surname>
          </string-name>
          .
          <year>2010</year>
          .
          <article-title>The pascal visual object classes (voc) challenge</article-title>
          .
          <source>International Journal of Computer Vision</source>
          <volume>88</volume>
          , 2 (
          <year>2010</year>
          ),
          <fpage>303</fpage>
          -
          <lpage>338</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Saumya</given-names>
            <surname>Jetley</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Nicholas A.</given-names>
            <surname>Lord</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Namhoon</given-names>
            <surname>Lee</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Philip H. S.</given-names>
            <surname>Torr</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Learn To Pay Attention</article-title>
          .
          <source>CoRR</source>
          abs/1804.02391 (
          <year>2018</year>
          ).
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Steven A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          , Krister Emanuelsen, Håvard Johansen, Dag Johansen, Thomas de Lange,
          <string-name>
            <given-names>Michael A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop</source>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pia H.</given-names>
            <surname>Smedsrud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          , Thomas de Lange, Dag Johansen, and
          <string-name>
            <given-names>Håvard D.</given-names>
            <surname>Johansen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Kvasir-SEG: A segmented polyp dataset</article-title>
          .
          <source>In International Conference on Multimedia Modeling</source>
          . Springer,
          <fpage>451</fpage>
          -
          <lpage>462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Shaoqing</given-names>
            <surname>Ren</surname>
          </string-name>
          , Kaiming He,
          <string-name>
            <given-names>Ross</given-names>
            <surname>Girshick</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Jian</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2015</year>
          .
          <article-title>Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks</article-title>
          .
          <source>In Advances in Neural Information Processing Systems</source>
          , C. Cortes,
          <string-name>
            <given-names>N.</given-names>
            <surname>Lawrence</surname>
          </string-name>
          ,
          <string-name>
            <given-names>D.</given-names>
            <surname>Lee</surname>
          </string-name>
          ,
          <string-name>
            <given-names>M.</given-names>
            <surname>Sugiyama</surname>
          </string-name>
          , and
          <string-name>
            <given-names>R.</given-names>
            <surname>Garnett</surname>
          </string-name>
          (Eds.)
          , Vol.
          <volume>28</volume>
          . Curran Associates, Inc.
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>