<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Automatic Polyp Segmentation Using Channel-Spatial Attention with Deep Supervision</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <string-name>Sahadev Poudel</string-name>
          <xref ref-type="aff" rid="aff0">0</xref>
        </contrib>
        <contrib contrib-type="author">
          <string-name>Sang-Woong Lee</string-name>
          <xref ref-type="aff" rid="aff1">1</xref>
        </contrib>
        <aff id="aff0">
          <label>0</label>
          <institution>Department of IT Convergence Engineering, Gachon University</institution>
          ,
          <addr-line>Seongnam 13120</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
        <aff id="aff1">
          <label>1</label>
          <institution>Department of Software, Gachon University</institution>
          ,
          <addr-line>Seongnam 13120</addr-line>
          ,
          <country country="KR">South Korea</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
<p>This paper introduces our approach for the automatic segmentation of polyp images in the gastrointestinal (GI) tract. We employ EfficientNet as an encoder backbone with a UNet decoder and leverage the UNet++ concept of redesigned skip connections to use multi-scale semantic details. Further, the addition of deep supervision and a channel-spatial attention module to the network results in good segmentation performance. The experimental results show the efficiency of the proposed method, which achieves an accuracy of 95.46%, a recall of 90.31%, a precision of 86.07%, and an F2-score of 87.4%.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>
        The aim of the Medico automatic polyp segmentation challenge is to
segment irregular, small, or flat polyps automatically by applying different
robust and efficient algorithms [
        <xref ref-type="bibr" rid="ref1">1</xref>
        ]. A training set of 1000 polyp
images with their corresponding masks labeled by medical experts
is provided to the participating teams. Each team is expected to
develop a powerful architecture that can predict the region of interest
(ROI) on the testing set. The organizers compare and evaluate all the
submitted approaches based on two primary measures: (1) the polyp
segmentation task and (2) the efficiency task. In this paper, an
encoder-decoder based convolutional neural network (CNN) is
introduced to facilitate good segmentation results and efficient polyp
detection using the provided data.
      </p>
    </sec>
    <sec id="sec-2">
      <title>METHODS</title>
      <p>
        In our method, we utilize the pre-trained weight of variants
EficientNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] for the encoder path. Even though medical images are
diferent from the natural images, it is often beneficial to use the
pretrained weights of state-of-the-art CNN architectures for a small
datasets [
        <xref ref-type="bibr" rid="ref3 ref4">3, 4</xref>
        ]. Considering the presence of polyps of varying scales,
we utilize the redesigned skip connections from the UNet++ [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ].
The densely connected skip connections to the decoder side enable
lfexible multi-scale feature fusion both horizontally and vertically at
the same resolution. Besides, the proposed method powered by deep
supervision and channel-spatial attention [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ] enables significantly
better performance and fast convergence. Integrating channel and
spatial attention modules restrain irrelevant features and allow only
useful spatial details.
      </p>
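      <p>A minimal Keras sketch of a channel-spatial attention block in the spirit of SCA-CNN [<xref ref-type="bibr" rid="ref6">6</xref>] is given below. The reduction ratio, layer ordering, and gating choices are illustrative assumptions rather than the exact module used in our network.</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers

def channel_spatial_attention(x, reduction=8):
    """Illustrative channel attention followed by spatial attention.

    Channel attention: a global-average-pooled descriptor is passed through a
    small bottleneck MLP and used to rescale the feature channels.
    Spatial attention: a 1x1 convolution produces a per-pixel gate.
    """
    channels = int(x.shape[-1])

    # --- channel attention ---
    ca = layers.GlobalAveragePooling2D()(x)                   # (B, C)
    ca = layers.Dense(channels // reduction, activation="relu")(ca)
    ca = layers.Dense(channels, activation="sigmoid")(ca)     # channel weights
    ca = layers.Reshape((1, 1, channels))(ca)
    x = layers.Multiply()([x, ca])                            # rescale channels

    # --- spatial attention ---
    sa = layers.Conv2D(1, kernel_size=1, activation="sigmoid")(x)  # (B, H, W, 1)
    x = layers.Multiply()([x, sa])                            # gate spatial locations
    return x
      </preformat>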
      <p>
        Figure 1 shows a broad overview of our proposed method.
EfficientNet uses the MobileNet inverted bottleneck block (MB) [
        <xref ref-type="bibr" rid="ref7">7</xref>
        ] with a squeeze-and-excitation
network [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ], and the combination of these components
works as a good feature extractor. We define network levels s=1 to s=5
according to the size of the feature maps, and the feature maps are
halved at each level. The size of the spatial feature map at the last
level (s=5) is 7×7, which indicates that the feature maps are
down-sampled five times, each level halving the resolution of the
previous one. At different levels, each node
concatenates the feature maps from its previous node of the same level
and the upsampled feature maps of the next level, enabling
aggregation of multi-scale features. Next, the concatenated features are
passed through the channel-spatial attention network at each node. On the
decoder side, a transposed convolution is used for upsampling the
feature maps. Similarly, we upscale the outputs of the decoder blocks
at levels s=2 to s=5 and apply a 1×1 convolution with a single kernel and a
sigmoid function. Then, all the outputs (after deep supervision) are
averaged and the final result is generated. We performed experiments
on five different settings, as explained below:
      </p>
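      <p>The following Keras sketch illustrates the deep supervision and output fusion described above: decoder outputs are upsampled to the input resolution, passed through a 1×1 convolution with a sigmoid, and averaged into the final prediction. The upsampling factors and layer choices are illustrative assumptions rather than the exact implementation.</p>
      <preformat>
from tensorflow.keras import layers

def supervised_head(decoder_output, upsample_factor):
    """Upsample one decoder output to the input resolution, then apply a
    1x1 convolution with a single kernel and a sigmoid (deep supervision)."""
    x = layers.UpSampling2D(size=upsample_factor,
                            interpolation="bilinear")(decoder_output)
    return layers.Conv2D(1, kernel_size=1, activation="sigmoid")(x)

def fuse_outputs(decoder_outputs, upsample_factors):
    """Average the deeply supervised outputs into the final segmentation map.
    The factors depend on which decoder levels (s=2..s=5) are tapped; the
    values are assumptions, e.g. (2, 4, 8, 16) for a 224x224 input."""
    heads = [supervised_head(f, s)
             for f, s in zip(decoder_outputs, upsample_factors)]
    return layers.Average()(heads)
      </preformat>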
      <p>
        For task 2, we use the same architectural design as in Method 5.
However, we utilize the compound scaling method proposed for
EfficientNet [
        <xref ref-type="bibr" rid="ref2">2</xref>
        ] in decreasing order to find the optimum scaling
dimensions of the network. We decrease the network's depth and
width and keep a fixed image size of 224×224 to prevent loss of
spatial details.
      </p>
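      <p>EfficientNet's compound scaling [<xref ref-type="bibr" rid="ref2">2</xref>] ties depth, width, and resolution to a single coefficient φ (depth = α^φ, width = β^φ, resolution = γ^φ). A small sketch of applying it in decreasing order, with the resolution pinned at 224×224 as described above, is given below; α and β are the base constants reported for EfficientNet, while the chosen φ is an illustrative assumption.</p>
      <preformat>
# Compound scaling (Tan and Le [2]): depth = alpha**phi, width = beta**phi,
# resolution = gamma**phi. A negative phi shrinks the network; here the
# resolution is kept fixed at 224x224 to preserve spatial detail.
ALPHA, BETA = 1.2, 1.1           # EfficientNet base scaling constants

def shrink_network(phi):
    depth_mult = ALPHA ** phi    # fewer layers per stage for negative phi
    width_mult = BETA ** phi     # fewer channels per layer for negative phi
    resolution = 224             # fixed input size (not scaled down)
    return depth_mult, width_mult, resolution

# phi = -1 is an illustrative choice, not the value used in our experiments.
print(shrink_network(phi=-1))    # (0.833..., 0.909..., 224)
      </preformat>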
    </sec>
    <sec id="sec-3">
      <title>DATASET</title>
      <p>
        The dataset includes a total of 1000 polyp images with their
corresponding ground truth masks [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The images have resolutions ranging from 332×487
to 1920×1072 pixels. The images are split into training and validation
sets at a ratio of 80:20. Both training and validation were conducted
using images with a resolution of 224×224 pixels. We perform
heavy augmentation using the Albumentations library [
        <xref ref-type="bibr" rid="ref10">10</xref>
        ], which
includes rotation, vertical and horizontal flipping, cutout, shearing,
scaling, and zooming.
      </p>
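      <p>A minimal sketch of such an augmentation pipeline with Albumentations [<xref ref-type="bibr" rid="ref10">10</xref>] is shown below; the transforms, probabilities, and limits are illustrative assumptions rather than our exact configuration (shearing, for instance, would require an additional affine transform).</p>
      <preformat>
import albumentations as A

# Illustrative training-time augmentation; probabilities and limits are assumptions.
train_aug = A.Compose([
    A.Rotate(limit=90, p=0.5),
    A.HorizontalFlip(p=0.5),
    A.VerticalFlip(p=0.5),
    A.ShiftScaleRotate(shift_limit=0.05, scale_limit=0.2,
                       rotate_limit=0, p=0.5),              # scaling / zooming
    A.CoarseDropout(max_holes=8, max_height=16,
                    max_width=16, p=0.3),                   # cutout-style dropout
    A.Resize(224, 224),
])

# image is an HxWx3 array and mask an HxW array loaded beforehand;
# the same transform is applied jointly to both.
augmented = train_aug(image=image, mask=mask)
image_aug, mask_aug = augmented["image"], augmented["mask"]
      </preformat>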
    </sec>
    <sec id="sec-4">
      <title>IMPLEMENTATION DETAILS</title>
      <p>The implementation is based on Keras with a TensorFlow backend.
We use stochastic gradient descent (SGD) with a batch size of 16,
a weight decay of 0.0001, and a momentum of 0.9, without
Nesterov acceleration. The experiments were conducted using
an Intel® Core™ i7-7700 CPU @ 3.60GHz × 8 with a GeForce GTX
1080 Ti GPU and 36 GB of RAM.</p>
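      <p>A minimal Keras sketch matching these optimizer settings is shown below; the learning rate is not reported in the text and is an assumption, and the weight decay is approximated by L2 kernel regularization since Keras SGD does not take a weight-decay argument directly.</p>
      <preformat>
import tensorflow as tf
from tensorflow.keras import layers, regularizers

# SGD with momentum 0.9 and no Nesterov acceleration (batch size 16 at fit time).
# The learning rate is not reported in the text and is an assumption here.
optimizer = tf.keras.optimizers.SGD(learning_rate=1e-2, momentum=0.9, nesterov=False)

# Weight decay of 1e-4 expressed as L2 regularization, attached per layer, e.g.:
conv = layers.Conv2D(64, 3, padding="same",
                     kernel_regularizer=regularizers.l2(1e-4))
      </preformat>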
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND DISCUSSION</title>
      <p>We submitted the predictions of five methods for the testing set
(160 images) to the organizers for evaluation. Table 1 and Table
2 report the experimental results achieved by the different models
on the segmentation dataset for task 1 and task 2. Table 1 shows
that the addition of deep supervision in Model 2 enables better
segmentation performance. The model achieves a 2% improvement
in performance in terms of the Dice coefficient. However,
under the same settings, applying EfficientNetB1 and EfficientNetB2 on
the encoder path gives similar performance and only a small marginal
gain in F2-score. With the addition of the channel-spatial attention
module, Model 5 turns out to be the best model, achieving a Dice
coefficient of 86.07 and a Jaccard index of 78.97. This suggests that the
attention module contributes more than the other modules.
Similarly, for task 2, the efficient model achieved an accuracy of
91.49% with an F2-score of 0.60, using just 2.5 million parameters.
Further, the frame rate while testing on the CPU is 2.25142 frames
per second (FPS).</p>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION</title>
      <p>This paper presented five different methods for the accurate
segmentation of polyps in GI-tract diseases. The proposed methods use
an encoder-decoder based architecture in which variants of
EfficientNet are applied as the encoder backbone with a UNet decoder.
Further, the combination of deep supervision and the channel-spatial
attention module with the redesigned skip connections
achieved the best performance on the test set. We plan to continue
working on the efficiency task and further improve the results.</p>
    </sec>
    <sec id="sec-7">
      <title>ACKNOWLEDGMENTS</title>
      <p>This work was supported by the Institute of Information &amp;
communications Technology Planning &amp; Evaluation (IITP) grant funded by
the Korea government (MSIT) (No. 2020-0-01907, Development of
Smart Signage Technology for Automatic Classification of Untact
Examination and Patient Status Based on AI).</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>[1] Debesh Jha, Steven A. Hicks, Krister Emanuelsen, Håvard D. Johansen, Dag Johansen, Thomas de Lange, Michael A. Riegler, and Pål Halvorsen. <article-title>Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation</article-title>. <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>[2] Mingxing Tan and Quoc V. Le. <article-title>EfficientNet: Rethinking model scaling for convolutional neural networks</article-title>. <source>arXiv preprint arXiv:1905.11946</source>, <year>2019</year>.</mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>[3] Nima Tajbakhsh, Jae Y. Shin, Suryakanth R. Gurudu, R. Todd Hurst, Christopher B. Kendall, Michael B. Gotway, and Jianming Liang. <article-title>Convolutional neural networks for medical image analysis: Full training or fine tuning?</article-title> <source>IEEE Transactions on Medical Imaging</source>, <volume>35</volume>(<issue>5</issue>):<fpage>1299</fpage>-<lpage>1312</lpage>, <year>2016</year>.</mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>[4] Sahadev Poudel, Yoon Kim, Duc Vo, and Sang-Woong Lee. <article-title>Colorectal disease classification using efficiently scaled dilation in convolutional neural network</article-title>. <source>IEEE Access</source>, PP:<fpage>1</fpage>-<lpage>1</lpage>, <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>[5] Zongwei Zhou, Md Mahfuzur Rahman Siddiquee, Nima Tajbakhsh, and Jianming Liang. <article-title>UNet++: A nested U-Net architecture for medical image segmentation</article-title>. In <source>Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support</source>, pages <fpage>3</fpage>-<lpage>11</lpage>. Springer, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>[6] Long Chen, Hanwang Zhang, Jun Xiao, Liqiang Nie, Jian Shao, Wei Liu, and Tat-Seng Chua. <article-title>SCA-CNN: Spatial and channel-wise attention in convolutional networks for image captioning</article-title>. In <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, pages <fpage>5659</fpage>-<lpage>5667</lpage>, <year>2017</year>.</mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>[7] Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. <article-title>MobileNetV2: Inverted residuals and linear bottlenecks</article-title>. In <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, pages <fpage>4510</fpage>-<lpage>4520</lpage>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>[8] Jie Hu, Li Shen, and Gang Sun. <article-title>Squeeze-and-excitation networks</article-title>. In <source>Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition</source>, pages <fpage>7132</fpage>-<lpage>7141</lpage>, <year>2018</year>.</mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>[9] Debesh Jha, Pia H. Smedsrud, Michael A. Riegler, Pål Halvorsen, Thomas de Lange, Dag Johansen, and Håvard D. Johansen. <article-title>Kvasir-SEG: A segmented polyp dataset</article-title>. In <source>Proc. of International Conference on Multimedia Modeling</source>, pages <fpage>451</fpage>-<lpage>462</lpage>, <year>2020</year>.</mixed-citation>
      </ref>
      <ref id="ref10">
        <mixed-citation>[10] A. Buslaev, V. I. Iglovikov, E. Khvedchenya, A. Parinov, and A. A. Kalinin. <article-title>Albumentations: fast and flexible image augmentations</article-title>. <source>ArXiv e-prints</source>, <year>2018</year>.</mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>