<!DOCTYPE article PUBLIC "-//NLM//DTD JATS (Z39.96) Journal Archiving and Interchange DTD v1.0 20120330//EN" "JATS-archivearticle1.dtd">
<article xmlns:xlink="http://www.w3.org/1999/xlink">
  <front>
    <journal-meta />
    <article-meta>
      <title-group>
        <article-title>Depth-Wise Separable Atrous Convolution for Polyps Segmentation in Gastro-Intestinal Tract</article-title>
      </title-group>
      <contrib-group>
        <aff id="aff0">
          <label>0</label>
          <institution>Syed Muhammad Faraz Ali, Muhammad Taha Khan, Syed Unaiz Haider, Talha Ahmed, Zeshan Khan, and Muhammad Atif Tahir; National University of Computer and Emerging Sciences, Karachi Campus</institution>
          ,
          <country country="PK">Pakistan</country>
        </aff>
      </contrib-group>
      <pub-date>
        <year>2020</year>
      </pub-date>
      <fpage>14</fpage>
      <lpage>15</lpage>
      <abstract>
        <p>Identification of polyps in endoscopic images is critical for the diagnosis of colon cancer. Finding the exact shape and size of polyps requires the segmentation of endoscopic images. This research explores the advantage of using depth-wise separable convolution in the atrous convolution of the ResUNet++ architecture. Deep Atrous Spatial Pyramid Pooling was also implemented on the ResUNet++ architecture. The results show that the architecture with separable convolution has a smaller size and fewer Giga-Floating Point Operations (GFLOPs) with only a small loss in performance.</p>
      </abstract>
    </article-meta>
  </front>
  <body>
    <sec id="sec-1">
      <title>INTRODUCTION</title>
      <p>Wireless capsule endoscopy (WCE) has been used for diagnosis for
nearly 10 years now. WCE images enable the diagnosis of
many diseases, such as colon cancer, ulcers, and polyps.
With the advent of deep learning in computer vision, this diagnosis
task can be automated.</p>
    </sec>
    <sec id="sec-2">
      <title>RELATED WORK</title>
      <p>
        The gastrointestinal tract has been an active area of research. The
benefit that can be achieved through computer-aided diagnosis
is significant. Jha et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ] studied the semantic segmentation of
polyps in the GI tract. That work utilizes the well-established U-Net
architecture and a modified U-Net, called ResUNet, for segmentation.
Further research introduced a novel architecture named ResUNet++.
      </p>
    </sec>
    <sec id="sec-3">
      <title>APPROACH</title>
      <p>
        The approach follows the method used by Jha et al. [
        <xref ref-type="bibr" rid="ref9">9</xref>
        ]. The
ResUNet++ architecture was employed, which uses an encoder-decoder
structure for semantic segmentation. Pyramid pooling was
used as a bridge between the encoder and decoder blocks. The
encoder block contains residual units that take advantage of skip
connections in a neural network. Skip connections allow
training a deep neural network without degrading performance.
Squeeze-and-excitation blocks were used, which adaptively re-weight
the channel-wise output features [
        <xref ref-type="bibr" rid="ref5">5</xref>
        ]. An attention
mechanism is used in the decoder block. Attention
is useful for making pixel-wise predictions. This approach is
popular in natural language processing (NLP), where attention is given
to each word of a sentence. In semantic segmentation, an attention
mechanism assigns a weight to each pixel of an image, which
can then be used to make a prediction at the pixel level [
        <xref ref-type="bibr" rid="ref6">6</xref>
        ].
      </p>
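      <p>The channel recalibration performed by a squeeze-and-excitation block can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation; the weights, shapes, and reduction ratio are hypothetical. It squeezes each channel to its global average, passes the result through a small two-layer bottleneck, and scales each channel by the resulting sigmoid gate.</p>
      <preformat>
```python
import numpy as np

def squeeze_excite(x, w1, b1, w2, b2):
    """Squeeze-and-excitation on an (H, W, C) feature map."""
    z = x.mean(axis=(0, 1))                    # squeeze: per-channel average, shape (C,)
    h = np.maximum(z @ w1 + b1, 0.0)           # excitation bottleneck with ReLU
    s = 1.0 / (1.0 + np.exp(-(h @ w2 + b2)))   # sigmoid gate in (0, 1), shape (C,)
    return x * s                               # re-weight each channel by its gate

rng = np.random.default_rng(0)
C, r = 8, 2                                    # channels and reduction ratio (hypothetical)
x = rng.standard_normal((16, 16, C))
w1 = rng.standard_normal((C, C // r)); b1 = np.zeros(C // r)
w2 = rng.standard_normal((C // r, C)); b2 = np.zeros(C)
y = squeeze_excite(x, w1, b1, w2, b2)
print(y.shape)                                 # (16, 16, 8)
```
      </preformat>
      <p>Because the gate lies strictly between 0 and 1, each channel is attenuated in proportion to its learned importance rather than uniformly.</p>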
      <sec id="sec-3-4">
        <title>Bridge</title>
        <p>
          A bridge of pyramid pooling is used between encoder and
decoder block [
          <xref ref-type="bibr" rid="ref2">2</xref>
          ] [
          <xref ref-type="bibr" rid="ref1">1</xref>
          ]. Atrous convolution is used in this bridge,
through which the output of the encoder is viewed at various
receptive fields. This block convolves the features with kernels
of different dilation rates, and the final output is the concatenation
of all the convolutions. In this way, the contextual information in the
features is captured at various scales.
        </p>
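      <p>The bridge described above can be sketched directly. Below is a minimal single-channel NumPy illustration of atrous spatial pyramid pooling; the kernel values and dilation rates are illustrative, not the paper's exact configuration. A dilated 3x3 convolution samples its nine taps at offsets spaced by the dilation rate, and the pyramid stacks the outputs at several rates.</p>
      <preformat>
```python
import numpy as np

def dilated_conv3x3(x, k, rate):
    """'Same'-padded 3x3 atrous convolution of a 2-D single-channel map."""
    H, W = x.shape
    pad = rate
    xp = np.pad(x, pad)
    out = np.zeros_like(x)
    for di in (-1, 0, 1):                      # sample the 3x3 kernel at
        for dj in (-1, 0, 1):                  # offsets spaced by the dilation rate
            out += k[di + 1, dj + 1] * xp[pad + di * rate : pad + di * rate + H,
                                          pad + dj * rate : pad + dj * rate + W]
    return out

def aspp(x, kernels, rates):
    """Convolve at several dilation rates and concatenate along a channel axis."""
    return np.stack([dilated_conv3x3(x, k, r) for k, r in zip(kernels, rates)],
                    axis=-1)

rng = np.random.default_rng(1)
x = rng.standard_normal((32, 32))
rates = (1, 2, 4)                              # illustrative dilation rates
kernels = [rng.standard_normal((3, 3)) for _ in rates]
y = aspp(x, kernels, rates)
print(y.shape)                                 # (32, 32, 3)
```
      </preformat>
      <p>Each output channel sees the same input at a different effective receptive field, which is how the bridge captures context at multiple scales without downsampling.</p>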
        <p>
          This Atrous Spatial Pyramid Pooling (ASPP) block in ResUNet++
was implemented using depth-wise separable convolution as well
as replaced with Deep Atrous Spatial Pyramid Pooling (DASPP)
module from [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ] in a separate experiment. Depth-wise separable
convolution is implemented by applying a kernel to the
input at the channel level. The output is then passed through a
pointwise convolution with a 1x1 kernel [
          <xref ref-type="bibr" rid="ref3">3</xref>
          ]. Applying
depth-wise convolution results in fewer GFLOPs and parameters. DASPP
was implemented to see whether going deeper in the network improves
performance on polyp segmentation. The three modified architectures
are:
(1) sepv_conv_resunet++: the ASPP module from ResUNet++ [
          <xref ref-type="bibr" rid="ref9">9</xref>
          ]
replaced with depth-wise separable convolution.
(2) dsapp_resunet++: the ASPP module replaced with the DASPP
module from [
          <xref ref-type="bibr" rid="ref4">4</xref>
          ].
(3) dsapp_relu_resunet++: (2) implemented with ReLU
activation.
        </p>
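      <p>The parameter saving from the depth-wise factorization can be checked with simple arithmetic: a standard k x k convolution costs k*k*C_in*C_out weights, while a depth-wise k x k convolution followed by a 1x1 pointwise convolution costs k*k*C_in + C_in*C_out. The layer sizes below are illustrative, not taken from the paper's architecture.</p>
      <preformat>
```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution layer (biases ignored)."""
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    """Depth-wise k x k (one filter per input channel) plus pointwise 1x1."""
    return k * k * c_in + c_in * c_out

# Illustrative 3x3 layer mapping 256 channels to 256 channels:
std = conv_params(3, 256, 256)        # 589824 weights
sep = separable_params(3, 256, 256)   # 67840 weights
print(std, sep, round(std / sep, 1))  # roughly an 8.7x reduction
```
      </preformat>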
        <p>
          Semantic segmentation, unlike object detection, can be treated
as a pixel-wise classification problem. The output of semantic
segmentation for a pixel is a mask identifying the class to which the
pixel belongs. For the polyp segmentation problem [
          <xref ref-type="bibr" rid="ref7">7</xref>
          ], this mask
is either 0 or 1. The evaluation metrics used in semantic
segmentation are accuracy, precision, recall, mean Intersection over Union
(mIoU), and Dice coefficient. All of these except accuracy were
used to evaluate model performance. A custom loss function based on
mIoU was implemented, and all model architectures were trained
with this custom loss.
        </p>
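        <p>A soft IoU loss of the kind described above follows directly from the metric's definition. The NumPy sketch below is a minimal illustration for binary masks; the epsilon smoothing term is a common convention, not necessarily the paper's exact formulation.</p>
        <preformat>
```python
import numpy as np

def iou_loss(y_true, y_pred, eps=1e-7):
    """1 minus soft intersection-over-union for binary masks in [0, 1]."""
    inter = np.sum(y_true * y_pred)
    union = np.sum(y_true) + np.sum(y_pred) - inter
    return 1.0 - (inter + eps) / (union + eps)

mask = np.array([[0.0, 1.0], [1.0, 1.0]])
print(iou_loss(mask, mask))        # 0.0 for a perfect prediction
print(iou_loss(mask, 1.0 - mask))  # close to 1.0 for a disjoint one
```
        </preformat>
        <p>Because the loss is differentiable in the soft predictions, it can be minimized directly by gradient descent, aligning the training objective with the mIoU evaluation metric.</p>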
      </sec>
    </sec>
    <sec id="sec-4">
      <title>DATASET</title>
      <p>
        The experiments were performed on Kvasir-SEG dataset [
        <xref ref-type="bibr" rid="ref8">8</xref>
        ]. This
dataset consists of one thousand polyp images. The ground-truth
masks for each of these images were provided in a
separate folder.
      </p>
    </sec>
    <sec id="sec-5">
      <title>RESULTS AND ANALYSIS</title>
      <p>All the experiments were performed on Google Colab, which
provides sessions of up to 12 hours. A 12-hour session is not
enough to fully train a deep learning model, so to make a fair
comparison, the number of epochs was kept the same for all
experiments. The data was split into training, validation, and test sets
in a ratio of 80, 10, and 10 percent respectively. With this split,
800 images were selected for model training. Since 800 images are
not enough to train a deep learning model, data augmentation
was applied to enlarge the training set.
The validation and test sets were not modified, so each
contained 100 images. 30 different
augmentations were applied to the training set, after which its size
grew to 24,800 images. The augmentations were also applied
to the provided masks so that the target variable is transformed in
the same way as the input image.</p>
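      <p>The dataset sizes quoted above follow from the split ratio and the augmentation factor; as a quick sanity check (30 augmentations per image means each original contributes 31 training images):</p>
      <preformat>
```python
total = 1000                          # Kvasir-SEG images
train = total * 80 // 100             # 800
val = total * 10 // 100               # 100
test = total * 10 // 100              # 100
n_augmentations = 30
augmented_train = train * (1 + n_augmentations)
print(train, val, test, augmented_train)   # 800 100 100 24800
```
      </preformat>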
      <p>The optimizer used for training was the Nadam optimizer with
a learning rate of 0.0001 and a batch size of 8. The
training and validation loss were recorded for each epoch; the resulting
learning curve provides insight into model convergence.</p>
      <p>Figure 2 shows the learning curve for each architecture. The
architecture with the DASPP bridge appears to have
converged within 10 epochs, as its validation error started increasing.
However, ResUNet++ and the separable-convolution ResUNet++ show
that the model could be trained for a few more epochs, as both training
and validation error are still decreasing. For U-Net, the learning
curve is also still decreasing at the 10th epoch; however, its
loss is higher than that of the ResUNet++ architecture.</p>
      <sec id="sec-5-1">
        <title>Test Results</title>
        <table-wrap id="tbl1">
          <label>Table 1</label>
          <caption><p>Test Data Results</p></caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Recall</th><th>Precision</th><th>Dice</th><th>mIoU</th></tr>
            </thead>
            <tbody>
              <tr><td>Unet</td><td>75.23%</td><td>84.52%</td><td>71.91%</td><td>59.53%</td></tr>
              <tr><td>resunet++</td><td>64.97%</td><td>89.81%</td><td>78.35%</td><td>69.48%</td></tr>
              <tr><td>sepv_conv_resunet++</td><td>60.55%</td><td>93.31%</td><td>77.25%</td><td>67.56%</td></tr>
              <tr><td>dsapp_resunet++</td><td>69.72%</td><td>82.62%</td><td>76.66%</td><td>66.71%</td></tr>
              <tr><td>dsapp_relu_resunet++</td><td>61.54%</td><td>92.33%</td><td>74.63%</td><td>66.03%</td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
      <sec id="sec-5-3">
        <title>Analysis</title>
        <p>Table 1 gives the performance of each model on the test data.
ResUNet++ performs better than the other models on the Dice
coefficient and mIoU. The model with separable convolution has
comparable results on the Dice and mIoU metrics.
However, the model with the DASPP bridge did not perform well,
which shows that increasing the depth further did not improve
performance. The size of the model, measured by the number
of parameters and Giga-Floating Point Operations (GFLOPs), is smallest
for the model with separable convolution. These results are compiled
in Table 2. Fewer parameters mean that the model
is smaller and may be easier to move into a production
environment.</p>
      </sec>
      <sec id="sec-5-4">
        <title>Model Size</title>
        <table-wrap id="tbl2">
          <label>Table 2</label>
          <caption><p>Model Size</p></caption>
          <table>
            <thead>
              <tr><th>Model</th><th>Params</th><th>GFLOPs</th></tr>
            </thead>
            <tbody>
              <tr><td>Unet</td><td>3,588,997</td><td></td></tr>
              <tr><td>resunet++</td><td>4,371,265</td><td></td></tr>
              <tr><td>sepv_conv_resunet++</td><td>3,047,265</td><td></td></tr>
              <tr><td>dsapp_resunet++</td><td>5,024,705</td><td></td></tr>
              <tr><td>dsapp_relu_resunet++</td><td>5,024,705</td><td></td></tr>
            </tbody>
          </table>
        </table-wrap>
      </sec>
    </sec>
    <sec id="sec-6">
      <title>CONCLUSION AND FUTURE WORK</title>
      <p>This research gives empirical evidence of the advantage of using
depth-wise separable convolution, which resulted in a smaller model
without significantly affecting performance. It has also been
shown that increasing the depth further may not improve
performance and can result in overfitting of the model.
Hyper-parameter tuning and a larger number
of epochs would give a better understanding of the performance.</p>
    </sec>
  </body>
  <back>
    <ref-list>
      <ref id="ref1">
        <mixed-citation>
          [1]
          <string-name>
            <given-names>Liang-Chieh</given-names>
            <surname>Chen</surname>
          </string-name>
          , George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille.
          <year>2017</year>
          .
          <article-title>Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs</article-title>
          .
          <source>IEEE transactions on pattern analysis and machine intelligence 40</source>
          ,
          <issue>4</issue>
          (
          <year>2017</year>
          ),
          <fpage>834</fpage>
          -
          <lpage>848</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref2">
        <mixed-citation>
          [2]
          <string-name>
            <given-names>Liang-Chieh</given-names>
            <surname>Chen</surname>
          </string-name>
          , Yukun Zhu, George Papandreou, Florian Schroff, and
          <string-name>
            <given-names>Hartwig</given-names>
            <surname>Adam</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Encoder-decoder with atrous separable convolution for semantic image segmentation</article-title>
          .
          <source>In Proceedings of the European conference on computer vision (ECCV)</source>
          .
          <volume>801</volume>
          -
          <fpage>818</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref3">
        <mixed-citation>
          [3]
          <string-name>
            <given-names>François</given-names>
            <surname>Chollet</surname>
          </string-name>
          .
          <year>2017</year>
          .
          <article-title>Xception: Deep learning with depthwise separable convolutions</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>1251</volume>
          -
          <fpage>1258</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref4">
        <mixed-citation>
          [4]
          <string-name>
            <given-names>Taha</given-names>
            <surname>Emara</surname>
          </string-name>
          ,
          <string-name>
            <surname>Hossam E Abd El Munim</surname>
          </string-name>
          , and
          <string-name>
            <surname>Hazem M Abbas</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>LiteSeg: A Novel Lightweight ConvNet for Semantic Segmentation</article-title>
          .
          <article-title>In 2019 Digital Image Computing: Techniques and Applications (DICTA)</article-title>
          .
          <source>IEEE</source>
          , 1-
          <fpage>7</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref5">
        <mixed-citation>
          [5]
          <string-name>
            <given-names>Jie</given-names>
            <surname>Hu</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Li</given-names>
            <surname>Shen</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Gang</given-names>
            <surname>Sun</surname>
          </string-name>
          .
          <year>2018</year>
          .
          <article-title>Squeeze-and-excitation networks</article-title>
          .
          <source>In Proceedings of the IEEE conference on computer vision and pattern recognition</source>
          .
          <volume>7132</volume>
          -
          <fpage>7141</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref6">
        <mixed-citation>
          [6]
          <string-name>
            <given-names>Zilong</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Xinggang</given-names>
            <surname>Wang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Lichao</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Chang</given-names>
            <surname>Huang</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Yunchao</given-names>
            <surname>Wei</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Wenyu</given-names>
            <surname>Liu</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Ccnet: Criss-cross attention for semantic segmentation</article-title>
          .
          <source>In Proceedings of the IEEE International Conference on Computer Vision</source>
          . 603-
          <fpage>612</fpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref7">
        <mixed-citation>
          [7]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Steven A.</given-names>
            <surname>Hicks</surname>
          </string-name>
          , Krister Emanuelsen, Håvard Johansen, Dag Johansen, Thomas de Lange,
          <string-name>
            <given-names>Michael A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          , and
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Medico Multimedia Task at MediaEval 2020: Automatic Polyp Segmentation</article-title>
          .
          <source>In Proc. of the MediaEval 2020 Workshop.</source>
        </mixed-citation>
      </ref>
      <ref id="ref8">
        <mixed-citation>
          [8]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pia H Smedsrud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Pål</given-names>
            <surname>Halvorsen</surname>
          </string-name>
          , Thomas de Lange, Dag Johansen, and
          <string-name>
            <given-names>Håvard D</given-names>
            <surname>Johansen</surname>
          </string-name>
          .
          <year>2020</year>
          .
          <article-title>Kvasir-seg: A segmented polyp dataset</article-title>
          .
          <source>In International Conference on Multimedia Modeling</source>
          . Springer,
          <fpage>451</fpage>
          -
          <lpage>462</lpage>
          .
        </mixed-citation>
      </ref>
      <ref id="ref9">
        <mixed-citation>
          [9]
          <string-name>
            <given-names>Debesh</given-names>
            <surname>Jha</surname>
          </string-name>
          ,
          <string-name>
            <surname>Pia H Smedsrud</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Michael A.</given-names>
            <surname>Riegler</surname>
          </string-name>
          ,
          <string-name>
            <given-names>Dag</given-names>
            <surname>Johansen</surname>
          </string-name>
          , Thomas De Lange, Pål Halvorsen, and
          <string-name>
            <given-names>Håvard D</given-names>
            <surname>Johansen</surname>
          </string-name>
          .
          <year>2019</year>
          .
          <article-title>Resunet++: An advanced architecture for medical image segmentation</article-title>
          .
          <source>In 2019 IEEE International Symposium on Multimedia (ISM)</source>
          . IEEE,
          <fpage>225</fpage>
          -
          <lpage>2255</lpage>
          .
        </mixed-citation>
      </ref>
    </ref-list>
  </back>
</article>